Apache Spark · data · ~30 mins

Transformations vs actions in Apache Spark - Hands-On Comparison

Understanding Transformations vs Actions in Apache Spark
📖 Scenario: Imagine you work at a company that collects sales data from different stores. You want to analyze this data using Apache Spark to find out how many sales were made in total and which stores had sales above a certain number.
🎯 Goal: Build a simple Spark program that creates a dataset of sales, sets a threshold for high sales, applies a transformation to filter stores with sales above the threshold, and then uses an action to count how many such stores there are.
📋 What You'll Learn
Create an RDD with store sales data
Define a sales threshold variable
Use a transformation to filter stores with sales above the threshold
Use an action to count the filtered stores
Print the count result
💡 Why This Matters
🌍 Real World
Companies use Apache Spark to process large datasets efficiently. Understanding transformations and actions helps in writing optimized data processing pipelines.
💼 Career
Data engineers and data scientists use Spark transformations and actions daily to clean, filter, and analyze big data.
1
Create the sales data RDD
Create an RDD called sales_rdd from the list [('StoreA', 150), ('StoreB', 80), ('StoreC', 200), ('StoreD', 50)] using sc.parallelize(), where sc is the active SparkContext.
Hint: Use sc.parallelize() to create an RDD from a Python list.

2
Set the sales threshold
Create a variable called threshold and set it to 100.
Hint: Assign the number 100 to a variable named threshold.

3
Filter stores with sales above threshold
Use a transformation to create a new RDD called high_sales_rdd by filtering sales_rdd for stores where the sales value is greater than threshold. Use filter() with a lambda function that checks if the second item in the tuple is greater than threshold.
Hint: Use filter() with a lambda that checks if store[1] is greater than threshold.

4
Count and print the number of high sales stores
Use the count() action to count the number of elements in high_sales_rdd, store the result in a variable called count_high_sales, and print it using print().
Hint: Use count() on high_sales_rdd and print the result.