Broadcast joins for small tables in Apache Spark - Mini Project: Build & Apply

Broadcast Joins for Small Tables in Apache Spark

📖 Scenario: You work at a retail company. You have a large sales dataset and a small product details dataset, and you want to join them so that each sale carries its product name. A broadcast join lets Spark run this join faster by sending a copy of the small product data to every worker node, so the large sales data does not have to be shuffled.

🎯 Goal: Learn how to perform a broadcast join in Apache Spark to efficiently join a large sales DataFrame with a small product DataFrame.

📋 What You'll Learn

Create a Spark DataFrame for sales data with columns sale_id, product_id, and quantity.

Create a Spark DataFrame for product data with columns product_id and product_name.

Use a broadcast join to join sales with product data on product_id.

Show the joined DataFrame output.

💡 Why This Matters

🌍 Real World

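Broadcast joins are used in big data processing when one table is small enough to fit in the memory of each executor. Spark sends a full copy of the small table to every worker node, so each partition of the large table can be joined where it already sits, avoiding an expensive shuffle of the large table across the network.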
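To make the mechanism concrete, here is a minimal plain-Python sketch of the idea behind a broadcast hash join. It is not Spark code: the function name broadcast_hash_join and the two-partition layout are made up for illustration, and the rows reuse the sample data from this project. Each "partition" of the large table probes its own copy of a lookup table built from the small one, so no sales row has to move.

```python
# Conceptual sketch only: plain Python, not Spark internals.
# broadcast_hash_join and the partition layout are hypothetical.

def broadcast_hash_join(large_partitions, small_rows):
    """Join each partition of the large table against a copy of the small table."""
    # Build a hash table from the small table once: product_id -> product_name.
    lookup = {product_id: name for product_id, name in small_rows}

    joined_partitions = []
    for partition in large_partitions:
        # In a cluster, every worker would hold its own copy of the lookup.
        local_lookup = dict(lookup)
        joined = [
            (sale_id, product_id, quantity, local_lookup[product_id])
            for sale_id, product_id, quantity in partition
            if product_id in local_lookup  # inner join: drop unmatched rows
        ]
        joined_partitions.append(joined)
    return joined_partitions


# Sales rows split across two pretend workers; they never leave their partition.
sales_partitions = [
    [(1, 101, 2), (2, 102, 1)],
    [(3, 103, 5)],
]
products = [(101, "Pen"), (102, "Notebook"), (103, "Eraser")]

result = broadcast_hash_join(sales_partitions, products)
for worker_id, rows in enumerate(result):
    print(worker_id, rows)
```

Only the small product list is copied; the sales partitions stay where they are, which is the property that lets Spark skip the shuffle.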

💼 Career

Data engineers and data scientists use broadcast joins to optimize Spark jobs for faster data processing and analysis, especially when joining a large dataset with a much smaller one.

Create the sales DataFrame

Create a Spark DataFrame called sales_df with these exact rows: (1, 101, 2), (2, 102, 1), (3, 103, 5). The columns must be sale_id, product_id, and quantity.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("BroadcastJoinExample").getOrCreate()
# Create sales_df DataFrame with specified rows and columns
# Your code here

Hint: Use spark.createDataFrame with a list of tuples and a list of column names.

Create the product DataFrame

Create a Spark DataFrame called product_df with these exact rows: (101, "Pen"), (102, "Notebook"), (103, "Eraser"). The columns must be product_id and product_name.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("BroadcastJoinExample").getOrCreate()
sales_data = [(1, 101, 2), (2, 102, 1), (3, 103, 5)]
sales_columns = ["sale_id", "product_id", "quantity"]
sales_df = spark.createDataFrame(sales_data, sales_columns)
# Create product_df DataFrame with specified rows and columns
# Your code here

Hint: Use spark.createDataFrame with a list of tuples and a list of column names for products.

Perform the broadcast join

Import broadcast from pyspark.sql.functions. Use sales_df.join() to join sales_df with product_df on product_id. Use broadcast(product_df) inside the join to broadcast the small product table. Save the result as joined_df.

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("BroadcastJoinExample").getOrCreate()
sales_data = [(1, 101, 2), (2, 102, 1), (3, 103, 5)]
sales_columns = ["sale_id", "product_id", "quantity"]
sales_df = spark.createDataFrame(sales_data, sales_columns)
product_data = [(101, "Pen"), (102, "Notebook"), (103, "Eraser")]
product_columns = ["product_id", "product_name"]
product_df = spark.createDataFrame(product_data, product_columns)
# Join sales_df with broadcast(product_df) on product_id and save as joined_df
# Your code here

Hint: Use broadcast(product_df) inside the join() method.

Show the joined DataFrame

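Call joined_df.show() to display the joined DataFrame; an optional print() can add a label line before it. show() prints the table itself and returns None, so do not wrap it in print().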

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("BroadcastJoinExample").getOrCreate()
sales_data = [(1, 101, 2), (2, 102, 1), (3, 103, 5)]
sales_columns = ["sale_id", "product_id", "quantity"]
sales_df = spark.createDataFrame(sales_data, sales_columns)
product_data = [(101, "Pen"), (102, "Notebook"), (103, "Eraser")]
product_columns = ["product_id", "product_name"]
product_df = spark.createDataFrame(product_data, product_columns)
joined_df = sales_df.join(broadcast(product_df), on="product_id")
# Print the joined DataFrame
# Your code here

Hint: Use joined_df.show() to display the DataFrame rows.