SORT BY
- Sorts data per reducer
 - Guarantees ordering of rows within a reducer only
 - Each reducer can receive overlapping ranges of data
 - With more than one reducer, the final result is only partially ordered
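
A minimal Python sketch of the SORT BY behaviour described above. It is only an in-memory model, not Hive code: the function name `sort_by` is hypothetical, and dealing rows round-robin to reducers is an assumption made for illustration, since SORT BY alone does not decide which reducer a row goes to.

```python
def sort_by(rows, num_reducers):
    """Model of SORT BY: each reducer sorts only the rows it received."""
    # Assumption: rows are dealt round-robin to reducers. SORT BY alone
    # does not control which reducer a row goes to.
    buckets = [rows[i::num_reducers] for i in range(num_reducers)]
    return [sorted(bucket) for bucket in buckets]


rows = [5, 2, 1, 6, 4, 3]
per_reducer = sort_by(rows, num_reducers=2)
print(per_reducer)   # [[1, 4, 5], [2, 3, 6]]

# Reading the reducer outputs one after another is not a total order.
combined = [x for out in per_reducer for x in out]
print(combined)      # [1, 4, 5, 2, 3, 6]
```

Each reducer's output is sorted, but the two outputs cover overlapping ranges, so the combined result is only partially ordered. With `num_reducers=1` the same function gives a fully sorted list.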
 
ORDER BY
- Guarantees total ordering of data
 - For this, all data is passed to a single reducer
 - Performance intensive: takes a long time on large datasets
 - LIMIT clause is compulsory in Hive strict mode
 
If hive.mapred.mode=strict, then ORDER BY must be used with a LIMIT clause
If hive.mapred.mode=nonstrict, then the LIMIT clause is not required

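The ORDER BY rules above can be sketched in Python as follows. This is a rough model under stated assumptions, not Hive's implementation: `order_by` and its `strict` flag are hypothetical names standing in for a query run with hive.mapred.mode set to strict or nonstrict, and the `ValueError` only stands in for the error Hive reports.

```python
def order_by(rows, limit=None, strict=False):
    """Model of ORDER BY: one reducer sorts every row."""
    if strict and limit is None:
        # Stand-in for the error Hive raises in strict mode.
        raise ValueError("strict mode: ORDER BY needs a LIMIT clause")
    ordered = sorted(rows)  # single reducer, so the order is total
    return ordered if limit is None else ordered[:limit]


rows = [5, 2, 1, 6, 4, 3]
print(order_by(rows))                        # [1, 2, 3, 4, 5, 6]
print(order_by(rows, limit=3, strict=True))  # [1, 2, 3]

try:
    order_by(rows, strict=True)
except ValueError as err:
    print(err)  # strict mode: ORDER BY needs a LIMIT clause
```

The LIMIT requirement exists because the single reducer has to handle every row; capping the result keeps an accidental full sort of a huge table from running for a very long time.
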
DISTRIBUTE BY
- Ensures each of the N reducers gets a non-overlapping set of values of the distribution column(s): all rows with the same value go to the same reducer
 - But doesn't sort the output of each reducer
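
A small Python model of DISTRIBUTE BY, assuming hypothetical rows of (customer_id, amount). The `distribute_by` function is illustrative only, and `customer_id % num_reducers` is a stand-in for whatever hash partitioning Hive actually applies.

```python
def distribute_by(rows, num_reducers, key):
    """Model of DISTRIBUTE BY: route rows by key, do not sort them."""
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        # Assumption: key % N stands in for Hive's hash partitioning.
        reducers[key(row) % num_reducers].append(row)
    return reducers


def customer_id(row):
    return row[0]


# Hypothetical (customer_id, amount) rows.
rows = [(3, 50), (1, 20), (2, 70), (1, 10), (3, 5), (2, 30)]
reducers = distribute_by(rows, num_reducers=2, key=customer_id)

print(reducers[0])  # [(2, 70), (2, 30)]
print(reducers[1])  # [(3, 50), (1, 20), (1, 10), (3, 5)]
```

Every customer_id lands on exactly one reducer, but the rows inside each reducer stay in arrival order.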
 
CLUSTER BY
- Ensures each of the N reducers gets a non-overlapping set of values of the clustering column(s)
 - Then sorts the rows by those columns within each reducer
 
DISTRIBUTE BY + SORT BY
- DISTRIBUTE BY + SORT BY is equivalent to CLUSTER BY when the partition column and the sort column are the same.
 - DISTRIBUTE BY + SORT BY is best used when the partition column and the sort column need to be different.
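
The relationship between the two forms can be sketched in Python like this. It is a model under assumptions, not Hive code: the function names and the (customer_id, amount) rows are hypothetical, and `key % num_reducers` stands in for Hive's hash partitioning.

```python
from operator import itemgetter


def distribute_and_sort(rows, num_reducers, dist_key, sort_key):
    """Model of DISTRIBUTE BY dist_key SORT BY sort_key."""
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        # Assumption: key % N stands in for Hive's hash partitioning.
        reducers[dist_key(row) % num_reducers].append(row)
    # Each reducer sorts only its own rows.
    return [sorted(bucket, key=sort_key) for bucket in reducers]


def cluster_by(rows, num_reducers, key):
    """Model of CLUSTER BY: distribute and sort on the same key."""
    return distribute_and_sort(rows, num_reducers, key, key)


customer_id = itemgetter(0)
amount = itemgetter(1)

# Hypothetical (customer_id, amount) rows.
rows = [(3, 50), (1, 20), (2, 70), (1, 10), (3, 5), (2, 30)]

clustered = cluster_by(rows, 2, customer_id)
print([[customer_id(r) for r in out] for out in clustered])
# [[2, 2], [1, 1, 3, 3]]

# Different columns: partition by customer_id, sort by amount.
by_amount = distribute_and_sort(rows, 2, customer_id, amount)
print(by_amount)
# [[(2, 30), (2, 70)], [(3, 5), (1, 10), (1, 20), (3, 50)]]
```

In the second call each customer still lands on a single reducer, but the rows inside a reducer are ordered by amount, which CLUSTER BY alone cannot express.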
 
REFERENCE -1 : https://saurzcode.in/2015/01/hive-sort-order-distribute-cluster/
REFERENCE -2 : https://stackoverflow.com/questions/13715044/hive-cluster-by-vs-order-by-vs-sort-by