Apache Spark · ~30 mins

Lazy Evaluation in Apache Spark - Mini Project: Build & Apply

📖 Scenario: You work as a data analyst at a retail company and want to analyze sales data with Apache Spark. Spark uses lazy evaluation: it delays running your transformations until a result is actually needed, which saves time and computing resources.
🎯 Goal: Learn how to create a Spark DataFrame, apply transformations, and see when Spark actually runs the work using lazy evaluation.
📋 What You'll Learn
Create a Spark DataFrame with sales data
Define a filter condition as a configuration variable
Apply a filter transformation using lazy evaluation
Trigger the execution by showing the filtered data
💡 Why This Matters
🌍 Real World
Data analysts use lazy evaluation in Spark to write efficient data processing code that only runs when needed, saving time and computing resources.
💼 Career
Understanding lazy evaluation is key for roles like data engineer, data analyst, and data scientist working with big data tools like Apache Spark.
1
Create the sales DataFrame
Create a Spark DataFrame called sales_df with these exact rows: ("Store1", 100), ("Store2", 200), ("Store3", 150). Use columns named store and sales.
Need a hint?

Use spark.createDataFrame() with a list of tuples and a list of column names.

2
Set the sales threshold
Create a variable called threshold and set it to 150. This will be used to filter stores with sales above this number.
Need a hint?

Just assign the number 150 to a variable named threshold.
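This step is a single assignment; keeping the cutoff in a named variable makes the filter easy to reconfigure later:

```python
# Sales cutoff used by the filter step; stores must exceed this value.
threshold = 150
```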

3
Filter the DataFrame using lazy evaluation
Create a new DataFrame called filtered_df by filtering sales_df to keep only rows where sales is greater than threshold. Use the DataFrame filter() method.
Need a hint?

Use filtered_df = sales_df.filter(sales_df.sales > threshold).

4
Show the filtered results
Call filtered_df.show(). Because show() is an action, it triggers Spark to actually run the filter and print the rows with sales above the threshold. Note that show() prints the table itself and returns None, so it does not need to be wrapped in print().
Need a hint?

Call filtered_df.show() to display the filtered rows; it prints the table directly.