
Inner, left, right, and full outer joins in Apache Spark - Mini Project: Build & Apply

📖 Scenario: You work at a small online store. You have two datasets: one listing customer IDs and names, and another listing orders, each tagged with the ID of the customer who placed it. You want to combine them in different ways to see every customer, only the customers who ordered, or every order whether or not it matches a known customer.
🎯 Goal: Learn how to use inner join, left join, right join, and full outer join in Apache Spark to combine two datasets and understand the differences between these joins.
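Before touching Spark, it can help to see which rows each join type is expected to keep. The following is a minimal plain-Python sketch, not Spark code: it models the two datasets from Step 1 as dictionaries keyed by customer_id, and the helper join_rows is a hypothetical name introduced only for this illustration. It relies on each customer_id appearing at most once per dataset, which holds for the sample data but not for joins in general.

```python
# The tutorial's sample data as plain dictionaries keyed by customer_id.
customers = {1: "Alice", 2: "Bob", 3: "Charlie"}
orders = {1: "Book", 2: "Pen", 4: "Notebook"}


def join_rows(keys):
    """Build (customer_id, name, product) rows; None marks a missing side."""
    return [(k, customers.get(k), orders.get(k)) for k in sorted(keys)]


inner = join_rows(customers.keys() & orders.keys())       # IDs present in both
left = join_rows(customers.keys())                        # every customer
right = join_rows(orders.keys())                          # every order
full_outer = join_rows(customers.keys() | orders.keys())  # IDs from either side

for label, rows in [("inner", inner), ("left", left),
                    ("right", right), ("full outer", full_outer)]:
    print(label, rows)
```

The inner join keeps only IDs 1 and 2, the left join adds Charlie with no product, the right join adds the Notebook order with no name, and the full outer join contains all four IDs.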
📋 What You'll Learn
Create two Spark DataFrames with the exact data given
Create a variable for the join column name
Use inner join, left join, right join, and full outer join on the DataFrames
Print the results of each join
💡 Why This Matters
🌍 Real World
Combining customer and order data is common in business to analyze sales and customer behavior.
💼 Career
Data scientists and analysts often use joins to merge datasets from different sources for reporting and insights.
1
Create the initial DataFrames
Create a Spark DataFrame called customers with these exact rows: (1, 'Alice'), (2, 'Bob'), (3, 'Charlie'). Also create a Spark DataFrame called orders with these exact rows: (1, 'Book'), (2, 'Pen'), (4, 'Notebook'). Use column names 'customer_id' and 'name' for customers, and 'customer_id' and 'product' for orders. Assume SparkSession is available as spark.
Need a hint?

Use spark.createDataFrame with a list of tuples and a list of column names.

2
Set the join column
Create a variable called join_column and set it to the string 'customer_id'. This will be used to join the DataFrames.
Need a hint?

Just assign the string 'customer_id' to the variable join_column.

3
Perform the joins
Use the join method on customers and orders with join_column to create four DataFrames: inner_join with how='inner', left_join with how='left', right_join with how='right', and full_outer_join with how='outer'.
Need a hint?

Use customers.join(orders, join_column, how='inner') and similarly for other join types.

4
Show the join results
Use the show() method to print the results of inner_join, left_join, right_join, and full_outer_join DataFrames in this order.
Need a hint?

Call show() on each join DataFrame in order: inner_join, left_join, right_join, full_outer_join.