Apache Spark · Data · ~10 mins

Unit Testing Apache Spark Transformations - Step-by-Step Execution

Concept Flow - Unit testing Spark transformations
Write transformation function → Create test input DataFrame → Apply transformation function → Collect or compare output DataFrame → Assert output matches expected → Test Pass/Fail
Unit testing Spark transformations means writing a function, applying it to small test data, and checking whether the output matches what we expect.
Execution Sample
PySpark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def add_one(df):
    # Add a new column holding value + 1; existing columns are kept
    return df.withColumn('value_plus_one', df['value'] + 1)

input_df = spark.createDataFrame([(1,), (2,), (3,)], ['value'])
output_df = add_one(input_df)
output_df.show()
This code defines add_one, which adds a 'value_plus_one' column (value + 1) to a Spark DataFrame, then shows the result.
Execution Table
Step | Action | Input DataFrame | Transformation Applied | Output DataFrame
1 | Create input DataFrame | [{'value':1}, {'value':2}, {'value':3}] | None | [{'value':1}, {'value':2}, {'value':3}]
2 | Call add_one function | [{'value':1}, {'value':2}, {'value':3}] | Add column 'value_plus_one' = value + 1 | [{'value':1, 'value_plus_one':2}, {'value':2, 'value_plus_one':3}, {'value':3, 'value_plus_one':4}]
3 | Show output DataFrame | [{'value':1, 'value_plus_one':2}, {'value':2, 'value_plus_one':3}, {'value':3, 'value_plus_one':4}] | Display rows | Output displayed as table
4 | Compare output with expected | Output DataFrame | Check equality with expected DataFrame | Test passes if equal
5 | Test ends | N/A | N/A | Test result: Pass
💡 The test ends once the output matches the expected DataFrame
Variable Tracker
Variable | Start | After Step 1 | After Step 2 | After Step 3 | Final
input_df | None | [{'value':1}, {'value':2}, {'value':3}] | [{'value':1}, {'value':2}, {'value':3}] | [{'value':1}, {'value':2}, {'value':3}] | [{'value':1}, {'value':2}, {'value':3}]
output_df | None | None | [{'value':1, 'value_plus_one':2}, {'value':2, 'value_plus_one':3}, {'value':3, 'value_plus_one':4}] | [{'value':1, 'value_plus_one':2}, {'value':2, 'value_plus_one':3}, {'value':3, 'value_plus_one':4}] | [{'value':1, 'value_plus_one':2}, {'value':2, 'value_plus_one':3}, {'value':3, 'value_plus_one':4}]
Key Moments - 3 Insights
Why do we create a small input DataFrame instead of using the full dataset?
We use a small input DataFrame to keep tests fast and focused on the transformation logic, as shown in the Execution Table, step 1.
How do we check if the transformation worked correctly?
We compare the output DataFrame to an expected DataFrame to see if they match, as in the Execution Table, step 4.
Why do we not run the full Spark job in unit tests?
Unit tests focus on small parts (transformations) to catch errors early and run quickly, avoiding full job overhead.
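One practical detail the answers above gloss over: Spark does not guarantee row order, so "check equality" should usually mean an order-insensitive comparison. A minimal, Spark-free sketch of that comparison logic, operating on rows already brought to the driver with collect() and converted to dicts (the helper name rows_match is hypothetical):

```python
def rows_match(actual, expected):
    """Compare two lists of row-dicts, ignoring row order."""
    # Sort each row's items so the sort key is deterministic,
    # then sort the row lists themselves before comparing.
    key = lambda row: tuple(sorted(row.items()))
    return sorted(actual, key=key) == sorted(expected, key=key)

# Same rows, different order: still a match
actual = [{'value': 2, 'value_plus_one': 3}, {'value': 1, 'value_plus_one': 2}]
expected = [{'value': 1, 'value_plus_one': 2}, {'value': 2, 'value_plus_one': 3}]
assert rows_match(actual, expected)
```

In a real test you would feed it `[row.asDict() for row in df.collect()]`; keeping the comparison pure Python also makes this helper trivially testable on its own.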
Visual Quiz - 3 Questions
Test your understanding
Looking at the execution table, what is the output DataFrame after step 2?
A[{'value':1}, {'value':2}, {'value':3}]
B[{'value':1, 'value_plus_one':2}, {'value':2, 'value_plus_one':3}, {'value':3, 'value_plus_one':4}]
CEmpty DataFrame
DDataFrame with only 'value_plus_one' column
💡 Hint
Check the 'Output DataFrame' column of the Execution Table row for step 2
At which step do we compare the output DataFrame to the expected DataFrame?
AStep 1
BStep 3
CStep 4
DStep 5
💡 Hint
Look for the step mentioning 'Compare output with expected' in the Execution Table
If the input DataFrame had an extra column, how would the output DataFrame change after step 2?
AIt would have the extra column plus 'value_plus_one'
BIt would only have 'value_plus_one' column
CIt would drop all columns
DIt would be empty
💡 Hint
The transformation adds a new column but keeps existing ones; see the Execution Table, step 2
Concept Snapshot
Unit testing Spark transformations:
- Write a function for your transformation
- Create a small input DataFrame
- Apply the function to input
- Compare output DataFrame to expected
- Assert equality to pass test
Full Transcript
Unit testing Spark transformations means writing a small function that transforms a DataFrame, then testing it with a small example DataFrame. We apply the function, capture the output, and check whether it matches what we expect. This catches mistakes early and keeps tests fast. We track variables like input_df and output_df step by step. Key moments include why we use small data, how we compare outputs, and why we avoid running full jobs in unit tests. The execution table shows each step clearly, from creating the input to the test passing.