Apache Spark · ~30 mins

Integration testing pipelines in Apache Spark - Mini Project: Build & Apply

Integration Testing Pipelines
📖 Scenario: You work as a data engineer building data pipelines with Apache Spark. To ensure your pipelines work correctly, you want to write integration tests that check the data flow and transformations. Imagine you have a small dataset of sales records and want to test a pipeline that filters and aggregates this data.
🎯 Goal: Build a simple Spark pipeline that filters sales data for a specific product and sums the sales amounts, then write integration tests to verify the pipeline's correctness.
📋 What You'll Learn
Create a Spark DataFrame with sales data
Set a filter condition for the product name
Write a pipeline that filters and sums sales
Print the final aggregated sales amount
💡 Why This Matters
🌍 Real World
Data engineers build pipelines that process and transform data. Integration testing ensures the entire pipeline works correctly end-to-end.
💼 Career
Knowing how to write and test Spark pipelines is essential for roles like data engineer, data analyst, and data scientist working with big data.
1
Create the sales DataFrame
Create a Spark DataFrame called sales_df with these exact rows: {'product': 'apple', 'amount': 10}, {'product': 'banana', 'amount': 5}, {'product': 'apple', 'amount': 15}.
Apache Spark
Need a hint?

Use spark.createDataFrame() with a list of dictionaries to create sales_df.

2
Set the product filter
Create a variable called filter_product and set it to the string 'apple'.
Need a hint?

Just assign the string 'apple' to filter_product.
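The assignment itself is one line; keeping the product name in a variable (rather than hard-coding it in the filter) makes the pipeline easy to re-test with other products:

```python
# The product the pipeline will filter for.
filter_product = 'apple'
```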

3
Filter and aggregate the sales
Create a new DataFrame called filtered_df by filtering sales_df where the product column equals filter_product. Then create a variable called total_sales that sums the amount column of filtered_df.
Need a hint?

Use filter() on sales_df and sum() from pyspark.sql.functions to sum the amounts.

4
Print the total sales
Write a print statement to display the text 'Total sales for apple:' followed by the value of total_sales.
Need a hint?

Use an f-string in the print statement to show the message and the total_sales value.