SORT BY
- Sorts data per reducer
- Guarantees ordering of rows within a reducer only
- Each reducer can receive overlapping ranges of data
- If there is more than one reducer, the final result is only partially ordered
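
The points above can be sketched with a toy Python simulation. This is not Hive code: the table t, the column v, and the round-robin routing of rows to reducers are all assumptions made for illustration.

```python
# Toy model of: SELECT * FROM t SORT BY v;   (t and v are hypothetical names)
rows = [7, 2, 9, 4, 1, 8, 3, 6]
num_reducers = 2

# Without DISTRIBUTE BY there is no key-based routing of rows to reducers;
# round-robin stands in for that here (an assumption of this sketch).
reducers = [rows[i::num_reducers] for i in range(num_reducers)]

# SORT BY: each reducer sorts only the rows it received.
sorted_per_reducer = [sorted(part) for part in reducers]

# The final result is the reducer outputs placed one after another.
final = [v for part in sorted_per_reducer for v in part]

print(sorted_per_reducer)     # [[1, 3, 7, 9], [2, 4, 6, 8]]
print(final == sorted(rows))  # False: ordered per reducer, not overall
```

Both reducers cover overlapping value ranges (1 to 9 and 2 to 8), which is why the concatenated result is only partially ordered.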
ORDER BY
- Guarantees total ordering of data
- To achieve this, all data is passed to a single reducer
- Performance intensive: takes longer on large data sets
- In Hive strict mode, a LIMIT clause is compulsory with ORDER BY
If hive.mapred.mode=strict, then the LIMIT clause is compulsory
If hive.mapred.mode=nonstrict, then the LIMIT clause is not required
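
As a rough illustration, the single-reducer behaviour can be modelled in a few lines of Python. This is a toy model rather than Hive code, and the table t and column v are hypothetical names.

```python
# Toy model of: SELECT * FROM t ORDER BY v LIMIT 3;   (hypothetical names)
rows = [7, 2, 9, 4, 1, 8, 3, 6]

# ORDER BY: every row goes to the same single reducer ...
single_reducer = list(rows)

# ... which sorts the full data set, so the ordering is total.
ordered = sorted(single_reducer)

# LIMIT caps the number of rows that are finally returned.
limit = 3
result = ordered[:limit]

print(ordered)  # [1, 2, 3, 4, 6, 7, 8, 9]
print(result)   # [1, 2, 3]
```

Because one reducer has to see every row, the sort cannot be spread across the cluster, which is where the extra cost comes from.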
DISTRIBUTE BY
- Ensures each of the N reducers gets non-overlapping sets of values of the distribution column(s), so all rows with the same value go to the same reducer
- Does not sort the output of each reducer
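
A minimal Python sketch of this behaviour follows. It is not Hive code: the table t, the column k, and the reducer_for helper (a stand-in for Hive's own hash partitioning) are hypothetical.

```python
# Toy model of: SELECT * FROM t DISTRIBUTE BY k;   (t and k are hypothetical)
rows = [("b", 5), ("a", 3), ("c", 9), ("a", 1), ("b", 2), ("c", 4)]
num_reducers = 2

def reducer_for(key):
    # Deterministic toy hash, standing in for Hive's hash partitioning.
    return sum(key.encode()) % num_reducers

# Route every row to the reducer chosen by the hash of its key.
reducers = [[] for _ in range(num_reducers)]
for key, value in rows:
    reducers[reducer_for(key)].append((key, value))

for i, part in enumerate(reducers):
    print(i, part)
# 0 [('b', 5), ('b', 2)]
# 1 [('a', 3), ('c', 9), ('a', 1), ('c', 4)]
```

Each key lands on exactly one reducer, but rows inside a reducer stay in arrival order.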
CLUSTER BY
- Ensures each of the N reducers gets non-overlapping sets of values of the clustering column(s)
- Then sorts the rows by those column(s) within each reducer
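
The two steps can be sketched in Python as a toy simulation. This is not Hive code: the table t, the column k, and the reducer_for helper are assumptions made for illustration.

```python
# Toy model of: SELECT * FROM t CLUSTER BY k;   (t and k are hypothetical)
rows = [("b", 5), ("a", 3), ("c", 9), ("a", 1), ("b", 2), ("c", 4)]
num_reducers = 2

def reducer_for(key):
    # Deterministic toy hash, standing in for Hive's hash partitioning.
    return sum(key.encode()) % num_reducers

# Step 1: distribute rows by key, so each key lands on one reducer only.
reducers = [[] for _ in range(num_reducers)]
for key, value in rows:
    reducers[reducer_for(key)].append((key, value))

# Step 2: sort each reducer's rows by that same key.
clustered = [sorted(part, key=lambda r: r[0]) for part in reducers]

for i, part in enumerate(clustered):
    print(i, part)
# 0 [('b', 5), ('b', 2)]
# 1 [('a', 3), ('a', 1), ('c', 9), ('c', 4)]
```

Rows with the same key end up adjacent within a reducer, while the second column is left in arrival order because it is not part of the sort.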
DISTRIBUTE BY + SORT BY
- DISTRIBUTE BY + SORT BY is equivalent to CLUSTER BY when the partition column and the sort column are the same.
- DISTRIBUTE BY + SORT BY is best used when the partition column and the sort column need to be different.
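
The case with different partition and sort columns can be sketched in Python. This is a toy simulation, not Hive code: the table t, the columns k and v, and the reducer_for helper are hypothetical.

```python
# Toy model of: SELECT * FROM t DISTRIBUTE BY k SORT BY v;   (hypothetical names)
rows = [("b", 5), ("a", 3), ("c", 9), ("a", 1), ("b", 2), ("c", 4)]
num_reducers = 2

def reducer_for(key):
    # Deterministic toy hash, standing in for Hive's hash partitioning.
    return sum(key.encode()) % num_reducers

# DISTRIBUTE BY k: partition the rows on the first column.
reducers = [[] for _ in range(num_reducers)]
for key, value in rows:
    reducers[reducer_for(key)].append((key, value))

# SORT BY v: inside each reducer, order rows by the second column instead.
result = [sorted(part, key=lambda r: r[1]) for part in reducers]

for i, part in enumerate(result):
    print(i, part)
# 0 [('b', 2), ('b', 5)]
# 1 [('a', 1), ('a', 3), ('c', 4), ('c', 9)]
```

Changing the sort key to r[0] would reproduce the CLUSTER BY behaviour on k, which is the equivalence described in the first bullet.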
REFERENCE - 1 : https://saurzcode.in/2015/01/hive-sort-order-distribute-cluster/
REFERENCE - 2 : https://stackoverflow.com/questions/13715044/hive-cluster-by-vs-order-by-vs-sort-by