Apache Spark - Joins in Spark SQL or Dataframe or Dataset

Joins are used to combine rows from two or more DATAFRAMEs / DATASETs, based on a related column between them.

Different Types of JOINs in Spark SQL

  • INNER JOIN: It returns only the rows that have matching values in both the LEFT DATAFRAME and the RIGHT DATAFRAME.
  • LEFT OUTER JOIN: It returns all rows from the LEFT DATAFRAME, and the matched rows from the RIGHT DATAFRAME. Where a LEFT row has no match, the RIGHT DATAFRAME's columns are null.
  • RIGHT OUTER JOIN: It returns all rows from the RIGHT DATAFRAME, and the matched rows from the LEFT DATAFRAME. Where a RIGHT row has no match, the LEFT DATAFRAME's columns are null.
  • FULL OUTER JOIN: It returns all rows from both the LEFT and RIGHT DATAFRAMEs. Matching rows are combined; for a row with no match, the columns from the other DATAFRAME are null.
  • LEFT SEMI JOIN: It returns the rows of the LEFT DATAFRAME that have a match in the RIGHT DATAFRAME. The result contains only columns from the LEFT DATAFRAME.
  • LEFT ANTI JOIN: It returns the rows of the LEFT DATAFRAME that have no match in the RIGHT DATAFRAME. The result contains only columns from the LEFT DATAFRAME.
  • CROSS JOIN: It pairs every row of the LEFT DATAFRAME with every row of the RIGHT DATAFRAME, so the number of rows in the result is the number of rows in the LEFT DATAFRAME multiplied by the number of rows in the RIGHT DATAFRAME. This result is called the Cartesian product.
Take the employee and department DATAFRAMEs as an example:
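
As a minimal sketch, assume two small hypothetical tables of employees and departments. The rows, the column names, and the `join` and `cross_join` helpers below are illustrative assumptions, not Spark code. This plain-Python model mirrors the join types listed above so they can be tried without a Spark session. In Spark itself the same choice is made through the `how` argument of `DataFrame.join`, for example `"inner"`, `"left_outer"`, `"right_outer"`, `"full_outer"`, `"left_semi"` and `"left_anti"`.

```python
# Hypothetical sample data, chosen so that each side has one unmatched row.
employees = [
    {"emp_id": 1, "name": "Asha", "dept_id": 10},
    {"emp_id": 2, "name": "Ravi", "dept_id": 20},
    {"emp_id": 3, "name": "Meena", "dept_id": 40},  # no such department
]
departments = [
    {"dept_id": 10, "dept_name": "Sales"},
    {"dept_id": 20, "dept_name": "HR"},
    {"dept_id": 30, "dept_name": "Finance"},  # no employees
]

JOIN_TYPES = {"inner", "left_outer", "right_outer",
              "full_outer", "left_semi", "left_anti"}


def join(left, right, key, how="inner"):
    """Plain-Python model of key-based joins over lists of dicts."""
    if how not in JOIN_TYPES:
        raise ValueError(f"unsupported join type: {how}")

    left_keys = {row[key] for row in left}
    right_keys = {row[key] for row in right}

    # Semi and anti joins only filter the left side; no right columns appear.
    if how == "left_semi":
        return [dict(row) for row in left if row[key] in right_keys]
    if how == "left_anti":
        return [dict(row) for row in left if row[key] not in right_keys]

    left_cols = [c for c in left[0] if c != key]
    right_cols = [c for c in right[0] if c != key]
    result = []

    for l_row in left:
        matches = [r_row for r_row in right if r_row[key] == l_row[key]]
        for r_row in matches:
            result.append({**l_row, **r_row})
        # Keep unmatched left rows, padding the right columns with None.
        if not matches and how in ("left_outer", "full_outer"):
            result.append({**l_row, **{c: None for c in right_cols}})

    # Keep unmatched right rows, padding the left columns with None.
    if how in ("right_outer", "full_outer"):
        for r_row in right:
            if r_row[key] not in left_keys:
                result.append({**{c: None for c in left_cols}, **r_row})

    return result


def cross_join(left, right):
    """Cartesian product: every left row paired with every right row."""
    return [(l_row, r_row) for l_row in left for r_row in right]


if __name__ == "__main__":
    for how in sorted(JOIN_TYPES):
        print(how, len(join(employees, departments, "dept_id", how)))
    print("cross", len(cross_join(employees, departments)))
```

With this sample data the inner join keeps the two employees whose `dept_id` exists in both tables, the full outer join adds the unmatched employee and the unmatched department, and the cross join yields 3 × 3 = 9 pairs.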



