SORT BY vs ORDER BY vs DISTRIBUTE BY vs CLUSTER BY in Hive

SORT BY

  • Sorts data per reducer
  • Guarantees ordering of rows within a reducer only
  • Each reducer can receive overlapping ranges of data
  • With more than one reducer, the final result is only partially ordered
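
To make the per-reducer guarantee concrete, here is a small Python sketch of what a query such as SELECT * FROM t SORT BY val might do with two reducers. It is a simulation, not Hive code: the table t, the column val, the sample values and the round-robin assignment of rows to reducers are all assumptions made for illustration.

```python
# Simulation only: rows are dealt to reducers round-robin (an assumption;
# with plain SORT BY, Hive decides which reducer each row goes to).
rows = [5, 2, 1, 6, 3, 4]
num_reducers = 2

# Assign rows to reducers without looking at their values.
reducers = [rows[i::num_reducers] for i in range(num_reducers)]

# SORT BY: each reducer sorts only its own rows.
sorted_per_reducer = [sorted(part) for part in reducers]

# The final result is the reducer outputs placed one after another.
final = [value for part in sorted_per_reducer for value in part]

print(sorted_per_reducer)  # [[1, 3, 5], [2, 4, 6]]
print(final)               # [1, 3, 5, 2, 4, 6]
```

Each inner list is sorted, but the two reducers cover overlapping ranges, so the combined output is not globally sorted.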

ORDER BY

  • Guarantees total ordering of data
  • To achieve this, all data is passed through a single reducer
  • Expensive on large data sets: the single reducer is a bottleneck, so the query takes longer
  • In Hive strict mode, a LIMIT clause is compulsory:
      If hive.mapred.mode=strict, ORDER BY must be followed by a LIMIT clause
      If hive.mapred.mode=nonstrict, the LIMIT clause is not required
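
The single-reducer behaviour and the strict-mode rule can be modelled with a toy Python function. This is a sketch only: order_by, its parameters and the error text are hypothetical names invented here, and the real check happens inside Hive.

```python
def order_by(rows, limit=None, strict=False):
    """Toy model of ORDER BY: one reducer sorts every row."""
    if strict and limit is None:
        # Models the strict-mode rule described above; the real check and
        # its error message live inside Hive.
        raise ValueError("strict mode: ORDER BY requires LIMIT")
    single_reducer = sorted(rows)  # all rows funnel through one reducer
    return single_reducer if limit is None else single_reducer[:limit]

rows = [5, 2, 1, 6, 3, 4]
total = order_by(rows)                       # strict mode off: LIMIT optional
top3 = order_by(rows, limit=3, strict=True)  # strict mode on: LIMIT supplied

print(total)  # [1, 2, 3, 4, 5, 6]
print(top3)   # [1, 2, 3]
```

Calling order_by(rows, strict=True) without a limit raises the error, mirroring a query that Hive would reject in strict mode.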

DISTRIBUTE BY

  • Ensures that all rows with the same value of the DISTRIBUTE BY column(s) go to the same reducer, so the N reducers receive non-overlapping sets of key values
  • But doesn't sort the output of each reducer
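
A rough Python simulation of SELECT * FROM t DISTRIBUTE BY user_id with two reducers is shown here. The table, the column names and the sample rows are hypothetical, and the modulo operation only stands in for whatever partitioning function Hive actually applies.

```python
# Simulation only: rows are (user_id, label) pairs, both hypothetical.
rows = [(3, "c1"), (1, "a1"), (2, "b1"), (1, "a2"), (3, "c2"), (2, "b2")]
num_reducers = 2

# DISTRIBUTE BY user_id: the reducer is chosen from the key alone, so equal
# keys always land together. Modulo stands in for Hive's partitioning.
reducers = [[] for _ in range(num_reducers)]
for user_id, label in rows:
    reducers[user_id % num_reducers].append((user_id, label))

# No sorting happens: each reducer keeps its rows in arrival order.
for index, part in enumerate(reducers):
    print(index, part)
```

In this run reducer 0 holds only user 2, and reducer 1 holds users 1 and 3 in the unsorted order 3, 1, 1, 3.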

CLUSTER BY

  • Ensures that each of the N reducers gets a non-overlapping set of key values, as with DISTRIBUTE BY
  • Then sorts the rows by those same columns within each reducer
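
As an illustration, SELECT * FROM t CLUSTER BY user_id could be simulated in Python as a distribute step followed by a per-reducer sort on the same column. The table, columns and data are hypothetical, and modulo again stands in for Hive's own partitioning.

```python
# Simulation only: rows are (user_id, label) pairs, both hypothetical.
rows = [(3, "c1"), (1, "a1"), (2, "b1"), (1, "a2"), (3, "c2"), (2, "b2")]
num_reducers = 2

# Step 1: distribute by user_id (modulo stands in for Hive's partitioning).
reducers = [[] for _ in range(num_reducers)]
for user_id, label in rows:
    reducers[user_id % num_reducers].append((user_id, label))

# Step 2: each reducer sorts its rows by the same column, user_id.
clustered = [sorted(part, key=lambda row: row[0]) for part in reducers]

for index, part in enumerate(clustered):
    print(index, part)
```

Reducer 1 now emits user 1's rows before user 3's, but there is still no ordering across reducers.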

DISTRIBUTE BY + SORT BY

  • DISTRIBUTE BY + SORT BY is equivalent to CLUSTER BY when the partition column and the sort column are the same.
  • DISTRIBUTE BY + SORT BY is most useful when the partition column and the sort column need to be different.
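
The case with different columns, for example SELECT * FROM t DISTRIBUTE BY user_id SORT BY amount, might be simulated like this. The table, the column names and the values are hypothetical, and modulo stands in for Hive's own partitioning.

```python
# Simulation only: rows are (user_id, amount) pairs, both hypothetical.
rows = [(1, 30), (2, 10), (1, 20), (2, 40), (3, 5)]
num_reducers = 2

# DISTRIBUTE BY user_id: equal user_ids always reach the same reducer.
reducers = [[] for _ in range(num_reducers)]
for user_id, amount in rows:
    reducers[user_id % num_reducers].append((user_id, amount))

# SORT BY amount: each reducer orders its rows by a different column.
result = [sorted(part, key=lambda row: row[1]) for part in reducers]

for index, part in enumerate(result):
    print(index, part)
```

Reducer 1 receives users 1 and 3 and emits their rows ordered by amount (5, 20, 30), which CLUSTER BY user_id alone could not express.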

REFERENCE 1: https://saurzcode.in/2015/01/hive-sort-order-distribute-cluster/
REFERENCE 2: https://stackoverflow.com/questions/13715044/hive-cluster-by-vs-order-by-vs-sort-by
