Apache Spark · ~30 mins

Multi-column joins in Apache Spark - Mini Project: Build & Apply

📖 Scenario: You work at a retail company. You have two tables: one with customer orders and another with customer details. You want to combine these tables to see the orders along with customer information. Both tables share customer_id and region columns. Joining on both columns ensures you match the right customer in the right region.
🎯 Goal: Build a Spark DataFrame join using two columns (customer_id and region) to combine orders and customer details.
📋 What You'll Learn
How to create a Spark DataFrame called orders with columns order_id, customer_id, region, and amount.
How to create a Spark DataFrame called customers with columns customer_id, region, and customer_name.
How to define a variable called join_columns holding the list ["customer_id", "region"].
How to join the orders and customers DataFrames on join_columns.
How to print the resulting joined DataFrame.
💡 Why This Matters
🌍 Real World
Multi-column joins are common when combining data from different sources that share multiple keys, like customer ID and region, to ensure accurate matching.
💼 Career
Data scientists and data engineers often join datasets on multiple columns to prepare clean, combined data for analysis or machine learning.
Step 1: Create the orders DataFrame
Create a Spark DataFrame called orders with these exact rows and columns: order_id, customer_id, region, and amount. Use these rows: (1, 101, 'East', 250), (2, 102, 'West', 450), (3, 103, 'East', 300), (4, 104, 'North', 150).
💡 Hint: Use spark.createDataFrame with a list of tuples and specify the column names as a list.

Step 2: Create the customers DataFrame
Create a Spark DataFrame called customers with these exact rows and columns: customer_id, region, and customer_name. Use these rows: (101, 'East', 'Alice'), (102, 'West', 'Bob'), (103, 'East', 'Charlie'), (105, 'South', 'Diana').
💡 Hint: Use spark.createDataFrame with a list of tuples and specify the column names as a list.

Step 3: Define the join columns and join the DataFrames
Create a variable called join_columns that holds the list ["customer_id", "region"]. Then join the orders DataFrame with the customers DataFrame using join_columns as the join keys. Store the result in a variable called joined_df.
💡 Hint: Define join_columns as a list of column names, then use orders.join(customers, on=join_columns) to join.

Step 4: Show the joined DataFrame
Print the joined_df DataFrame using the show() method to display the joined data.
💡 Hint: Use joined_df.show() to print the DataFrame.