0
0
Apache Sparkdata~30 mins

Unit testing Spark transformations in Apache Spark - Mini Project: Build & Apply

Choose your learning style9 modes available
Unit Testing Spark Transformations
📖 Scenario: You work as a data engineer. You write Spark code to transform data. Your manager asks you to write tests to check your transformations work correctly.Testing helps catch mistakes early and keeps data reliable.
🎯 Goal: You will create a small Spark DataFrame, write a transformation function, and then write a unit test to check the transformation output.
📋 What You'll Learn
Create a Spark DataFrame with exact data
Write a configuration variable for filtering
Write a transformation function using Spark DataFrame API
Write a unit test that checks the transformation output
Print the test result as True or False
💡 Why This Matters
🌍 Real World
Data engineers and data scientists often write Spark code to process big data. Unit testing ensures their code works correctly before running expensive jobs.
💼 Career
Knowing how to write and test Spark transformations is a key skill for data engineers and data scientists working with big data platforms.
Progress0 / 4 steps
1
Create initial Spark DataFrame
Create a Spark DataFrame called df with these exact rows: (1, 'Alice', 29), (2, 'Bob', 31), (3, 'Charlie', 25). Use columns named 'id', 'name', and 'age'. Use SparkSession spark already available.
Apache Spark
Need a hint?

Use spark.createDataFrame(data, columns) to create the DataFrame.

2
Add filter age threshold variable
Create a variable called age_threshold and set it to 30. This will be used to filter people older than this age.
Apache Spark
Need a hint?

Just create a variable age_threshold and assign 30.

3
Write transformation function to filter DataFrame
Write a function called filter_by_age that takes a DataFrame df and an integer threshold. It returns a new DataFrame filtered to rows where age is greater than threshold. Use DataFrame API filter method.
Apache Spark
Need a hint?

Use df.filter(df.age > threshold) inside the function.

4
Write unit test and print result
Write code to test filter_by_age with df and age_threshold. Collect the result to a list of rows. Create a list expected with one tuple (2, 'Bob', 31). Print True if the result matches expected, else False.
Apache Spark
Need a hint?

Use collect() to get rows as a list. Compare with expected list and print the boolean result.