SORT BY vs ORDER BY vs DISTRIBUTE BY vs CLUSTER BY in HIVE

SORT BY

Sorts data within each reducer
Guarantees ordering of rows within a single reducer only
Different reducers can receive overlapping ranges of data
With more than one reducer, the final result is only partially ordered
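
The per-reducer behaviour above can be sketched with a toy Python simulation. This is not how Hive is implemented: the round-robin assignment of rows to reducers is an assumption made for illustration, and the table `t` and column `key` in the comment are hypothetical.

```python
# Toy model of:  SELECT key FROM t SORT BY key;
# Assumption: rows are dealt to reducers round-robin (illustrative only).

def sort_by(rows, num_reducers):
    """Return one independently sorted list per simulated reducer."""
    buckets = [rows[i::num_reducers] for i in range(num_reducers)]
    return [sorted(bucket) for bucket in buckets]

rows = [9, 1, 2, 8, 7, 3]
per_reducer = sort_by(rows, num_reducers=2)

# Each reducer's output is sorted, but the reducers hold overlapping ranges,
# so concatenating their outputs is only partially ordered.
combined = [value for bucket in per_reducer for value in bucket]

print(per_reducer)  # [[2, 7, 9], [1, 3, 8]]
print(combined)     # [2, 7, 9, 1, 3, 8]
```

With `num_reducers=1` the same function returns a single fully sorted list, which is why SORT BY and ORDER BY give the same result when only one reducer runs.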

ORDER BY

Guarantees total ordering of the data
To achieve this, all data is passed to a single reducer
Performance intensive: takes much longer on large datasets
In Hive strict mode, ORDER BY must be used with a LIMIT clause

If hive.mapred.mode=strict , then the LIMIT clause is compulsory

If hive.mapred.mode=nonstrict , then the LIMIT clause is not required
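
As a rough illustration of the points above, the following Python sketch models ORDER BY as a single sort over all rows and mimics the strict-mode LIMIT rule. The function name `order_by`, its `strict` flag, and the error message are hypothetical stand-ins, not Hive APIs or Hive's actual error text.

```python
# Toy model of:  SELECT key FROM t ORDER BY key LIMIT 3;
# `strict` stands in for hive.mapred.mode=strict (illustrative only).

def order_by(rows, limit=None, strict=False):
    """Sort all rows in one place, as a single reducer would."""
    if strict and limit is None:
        raise ValueError("strict mode: ORDER BY needs a LIMIT clause")
    ordered = sorted(rows)  # one reducer sees every row
    return ordered if limit is None else ordered[:limit]

rows = [9, 1, 2, 8, 7, 3]
print(order_by(rows))                        # [1, 2, 3, 7, 8, 9]
print(order_by(rows, limit=3, strict=True))  # [1, 2, 3]
```

Calling `order_by(rows, strict=True)` without a limit raises the error, which mirrors the idea that strict mode refuses an unbounded sort through one reducer.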

DISTRIBUTE BY

Ensures each of the N reducers receives a non-overlapping set of values of the DISTRIBUTE BY column(s): all rows with the same value go to the same reducer
But does not sort the output of each reducer
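
A minimal Python sketch of this distribution step is shown below. It assumes integer keys and uses `key % num_reducers` as a stand-in for the hash that assigns rows to reducers; the table `t` and the columns `id` and `val` are hypothetical.

```python
# Toy model of:  SELECT id, val FROM t DISTRIBUTE BY id;
# Assumption: `id % num_reducers` stands in for the real hash partitioning.

def distribute_by(rows, key, num_reducers):
    """Send every row with the same key to the same simulated reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for row in rows:
        buckets[key(row) % num_reducers].append(row)
    return buckets

rows = [(3, "c"), (1, "a"), (2, "b"), (1, "z"), (3, "x"), (2, "y")]
buckets = distribute_by(rows, key=lambda r: r[0], num_reducers=2)

# No key appears in more than one reducer, but rows keep their arrival order.
print(buckets[0])  # [(2, 'b'), (2, 'y')]
print(buckets[1])  # [(3, 'c'), (1, 'a'), (1, 'z'), (3, 'x')]
```

Note that the second reducer's rows are grouped by reducer but not sorted, which is the gap that SORT BY or CLUSTER BY fills.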

CLUSTER BY

Ensures each of the N reducers receives a non-overlapping set of values of the CLUSTER BY column(s)
Then sorts the rows by those column(s) within each reducer
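
The two steps can be sketched in Python as distribute first, then sort on the same key inside each reducer. As before, this is a toy model: `key % num_reducers` is an assumed stand-in for hash partitioning, and the table and column names in the comment are hypothetical.

```python
# Toy model of:  SELECT id, val FROM t CLUSTER BY id;
# Assumption: `id % num_reducers` stands in for the real hash partitioning.

def cluster_by(rows, key, num_reducers):
    """Distribute rows by key, then sort each reducer's rows by that key."""
    buckets = [[] for _ in range(num_reducers)]
    for row in rows:
        buckets[key(row) % num_reducers].append(row)
    return [sorted(bucket, key=key) for bucket in buckets]

rows = [(3, "c"), (1, "a"), (2, "b"), (1, "z"), (3, "x"), (2, "y")]
clusters = cluster_by(rows, key=lambda r: r[0], num_reducers=2)

print(clusters[0])  # [(2, 'b'), (2, 'y')]
print(clusters[1])  # [(1, 'a'), (1, 'z'), (3, 'c'), (3, 'x')]
```

Each reducer's output is now ordered by `id`, but there is still no ordering across reducers, so the overall result is not a total order.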

DISTRIBUTE BY + SORT BY

DISTRIBUTE BY + SORT BY is equivalent to CLUSTER BY when the partition column and the sort column are the same.
DISTRIBUTE BY + SORT BY is most useful when the partition column and the sort column need to be different.
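
The case where the two columns differ can be sketched in Python as follows. This is a toy model under assumptions: the table `orders` and its columns `customer_id` and `amount` are hypothetical, and `customer_id % num_reducers` stands in for hash partitioning.

```python
# Toy model of:
#   SELECT customer_id, amount FROM orders
#   DISTRIBUTE BY customer_id SORT BY amount DESC;

def distribute_then_sort(rows, dist_key, sort_key, num_reducers, reverse=False):
    """Partition rows on one column, then sort each partition on another."""
    buckets = [[] for _ in range(num_reducers)]
    for row in rows:
        buckets[dist_key(row) % num_reducers].append(row)
    return [sorted(b, key=sort_key, reverse=reverse) for b in buckets]

rows = [(1, 50), (2, 10), (1, 20), (2, 70), (3, 30)]
result = distribute_then_sort(
    rows,
    dist_key=lambda r: r[0],   # customer_id decides the reducer
    sort_key=lambda r: r[1],   # amount decides the order inside it
    num_reducers=2,
    reverse=True,
)

print(result[0])  # [(2, 70), (2, 10)]
print(result[1])  # [(1, 50), (3, 30), (1, 20)]
```

Passing the same function for `dist_key` and `sort_key` with `reverse=False` reproduces the CLUSTER BY behaviour described above.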

REFERENCE 1 : https://saurzcode.in/2015/01/hive-sort-order-distribute-cluster/
REFERENCE 2 : https://stackoverflow.com/questions/13715044/hive-cluster-by-vs-order-by-vs-sort-by
