Apache Spark · data · ~30 mins

Why transformations build processing pipelines in Apache Spark - See It in Action

📖 Scenario: Imagine you work at a company that collects sales data every day. You want to analyze this data step-by-step to find total sales per product category. Apache Spark helps you do this efficiently by building a pipeline of transformations.
🎯 Goal: You will create a Spark DataFrame with sales data, define a filter condition, apply transformations to select and group data, and finally show the result. This will demonstrate how transformations build a processing pipeline in Spark.
📋 What You'll Learn
Create a Spark DataFrame with given sales data
Define a filter condition variable
Apply transformations: filter, select, groupBy, and sum
Show the final aggregated sales per category
💡 Why This Matters
🌍 Real World
Data engineers and data scientists use Spark pipelines to process large datasets efficiently by chaining transformations before running actions.
💼 Career
Understanding how transformations build pipelines is essential for optimizing Spark jobs and writing scalable data processing code.
1
Create the initial Spark DataFrame
Create a Spark DataFrame called sales_df with these exact rows: ("apple", "fruit", 10), ("banana", "fruit", 15), ("carrot", "vegetable", 7), ("broccoli", "vegetable", 5). The columns should be product, category, and quantity.
Apache Spark
Need a hint?

Use spark.createDataFrame with a list of tuples and specify the column names as a list.

2
Define a filter condition
Create a variable called filter_category and set it to the string "fruit". This will be used to filter the DataFrame.
Need a hint?

Just assign the string "fruit" to the variable filter_category.
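In code, this step is a single assignment. The value lives in an ordinary Python variable; Spark only sees it once it is used inside a transformation in the next step:

```python
# Plain Python string -- it will be embedded in the filter condition later.
filter_category = "fruit"
```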

3
Apply transformations to build the pipeline
Use the filter_category variable to filter sales_df where category equals filter_category. Then select category and quantity columns. Group by category and sum the quantity column. Save the result in a new DataFrame called result_df.
Need a hint?

Use filter, then select, then groupBy with agg(sum(...)). Remember to import sum from pyspark.sql.functions.

4
Show the final result
Display the contents of result_df by calling its show() method. Note that show() prints directly to standard output, so no print() wrapper is needed.
Need a hint?

Call result_df.show() to display the DataFrame contents.