Apache Spark · ~30 mins

Multi-column joins in Apache Spark - Mini Project: Build & Apply

📖 Scenario: You work at a retail company. You have two tables: one with customer orders and another with customer details. You want to combine these tables to see the orders along with customer information. Both tables share customer_id and region columns. Joining on both columns ensures you match the right customer in the right region.
🎯 Goal: Build a Spark DataFrame join using two columns (customer_id and region) to combine orders and customer details.
📋 What You'll Learn
How to create a Spark DataFrame called orders with columns order_id, customer_id, region, and amount.
How to create a Spark DataFrame called customers with columns customer_id, region, and customer_name.
How to define a variable called join_columns holding the list ["customer_id", "region"].
How to join the orders and customers DataFrames on join_columns.
How to print the resulting joined DataFrame.
💡 Why This Matters
🌍 Real World
Multi-column joins are common when combining data from different sources that share multiple keys, like customer ID and region, to ensure accurate matching.
💼 Career
Data scientists and data engineers often join datasets on multiple columns to prepare clean, combined data for analysis or machine learning.
Step 1: Create the orders DataFrame
Create a Spark DataFrame called orders with these exact rows and columns: order_id, customer_id, region, and amount. Use these rows: (1, 101, 'East', 250), (2, 102, 'West', 450), (3, 103, 'East', 300), (4, 104, 'North', 150).
💡 Hint: Use spark.createDataFrame with a list of tuples and specify the column names as a list.

Step 2: Create the customers DataFrame
Create a Spark DataFrame called customers with these exact rows and columns: customer_id, region, and customer_name. Use these rows: (101, 'East', 'Alice'), (102, 'West', 'Bob'), (103, 'East', 'Charlie'), (105, 'South', 'Diana').
💡 Hint: Use spark.createDataFrame with a list of tuples and specify the column names as a list.

Step 3: Define the join columns and join the DataFrames
Create a variable called join_columns that holds the list ["customer_id", "region"]. Then join the orders DataFrame with the customers DataFrame using join_columns as the join keys. Store the result in a variable called joined_df.
💡 Hint: Define join_columns as a list of column names, then use orders.join(customers, on=join_columns) to join.

Step 4: Show the joined DataFrame
Print the joined_df DataFrame using the show() method to display the joined data.
💡 Hint: Use joined_df.show() to print the DataFrame.