Apache Spark · data · ~30 mins

Understanding the Catalyst optimizer in Apache Spark - Hands-On Activity

📖 Scenario: You are working with a small dataset of sales records. You want to understand how Spark's Catalyst optimizer improves query performance by analyzing the execution plan.
🎯 Goal: Learn to create a Spark DataFrame, configure a simple filter condition, apply a query, and display the optimized execution plan using the Catalyst optimizer.
📋 What You'll Learn
Create a Spark DataFrame with sales data
Set a filter threshold for sales amount
Apply a filter query using the threshold
Display the optimized execution plan
💡 Why This Matters
🌍 Real World
Data scientists and engineers use Spark to process large datasets efficiently. Understanding the Catalyst optimizer helps them write faster queries.
💼 Career
Knowing how Spark optimizes queries is valuable for roles like data engineer, data analyst, and big data developer.
Step 1: Create the sales DataFrame
Create a Spark DataFrame called sales_df with these exact rows: ("Alice", 300), ("Bob", 150), ("Charlie", 200). Use columns named name and amount.
Hint: Use spark.createDataFrame with a list of tuples and pass the column names as a list.

Step 2: Set the sales amount threshold
Create a variable called threshold and set it to 200 to filter sales amounts greater than this value.
Hint: Just assign the number 200 to a variable named threshold.

Step 3: Filter the DataFrame using the threshold
Create a new DataFrame called filtered_df by filtering sales_df where the amount column is greater than threshold.
Hint: Use the filter method on sales_df with the condition sales_df.amount > threshold.

Step 4: Show the optimized execution plan
Use filtered_df.explain() to print the execution plan that the Catalyst optimizer generates. By default explain() shows the physical plan; passing True also prints the parsed, analyzed, and optimized logical plans.
Hint: Call explain() on filtered_df to see the plan.