Apache Spark · Data · ~10 mins

Unit Testing Apache Spark Transformations - Step-by-Step Execution

Concept Flow - Unit testing Spark transformations
Write transformation function → Create test input DataFrame → Apply transformation function → Collect or compare output DataFrame → Assert output matches expected → Test Pass/Fail
Unit testing Spark transformations means writing a function, applying it to small test data, and checking whether the output matches what we expect.
Execution Sample
PySpark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def add_one(df):
    # Add a new column holding value + 1; existing columns are kept
    return df.withColumn('value_plus_one', df['value'] + 1)

input_df = spark.createDataFrame([(1,), (2,), (3,)], ['value'])
output_df = add_one(input_df)
output_df.show()
This code defines add_one, which adds a 'value_plus_one' column (value + 1) to a Spark DataFrame, then shows the result.
Execution Table
Step | Action | Input DataFrame | Transformation Applied | Output DataFrame
1 | Create input DataFrame | [{'value':1}, {'value':2}, {'value':3}] | None | [{'value':1}, {'value':2}, {'value':3}]
2 | Call add_one function | [{'value':1}, {'value':2}, {'value':3}] | Add column 'value_plus_one' = value + 1 | [{'value':1, 'value_plus_one':2}, {'value':2, 'value_plus_one':3}, {'value':3, 'value_plus_one':4}]
3 | Show output DataFrame | [{'value':1, 'value_plus_one':2}, {'value':2, 'value_plus_one':3}, {'value':3, 'value_plus_one':4}] | Display rows | Output displayed as table
4 | Compare output with expected | Output DataFrame | Check equality with expected DataFrame | Test passes if equal
5 | Test ends | N/A | N/A | Test result: Pass
💡 The test ends once the output matches the expected DataFrame
Variable Tracker
Variable | Start | After Step 1 | After Step 2 | After Step 3 | Final
input_df | None | [{'value':1}, {'value':2}, {'value':3}] | [{'value':1}, {'value':2}, {'value':3}] | [{'value':1}, {'value':2}, {'value':3}] | [{'value':1}, {'value':2}, {'value':3}]
output_df | None | None | [{'value':1, 'value_plus_one':2}, {'value':2, 'value_plus_one':3}, {'value':3, 'value_plus_one':4}] | [{'value':1, 'value_plus_one':2}, {'value':2, 'value_plus_one':3}, {'value':3, 'value_plus_one':4}] | [{'value':1, 'value_plus_one':2}, {'value':2, 'value_plus_one':3}, {'value':3, 'value_plus_one':4}]
Key Moments - 3 Insights
Why do we create a small input DataFrame instead of using the full dataset?
We use a small input DataFrame to keep tests fast and focused on the transformation logic, as shown in the Execution Table, step 1.
How do we check if the transformation worked correctly?
We compare the output DataFrame to an expected DataFrame to see if they match, as in the Execution Table, step 4.
Why do we not run the full Spark job in unit tests?
Unit tests focus on small parts (transformations) to catch errors early and run quickly, avoiding full job overhead.
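One practical detail the answers above gloss over: Spark does not guarantee row order, so "check equality" should usually mean an order-insensitive comparison. A minimal, Spark-free sketch of that comparison logic, operating on rows already brought to the driver with collect() and converted to dicts (the helper name rows_match is hypothetical):

```python
def rows_match(actual, expected):
    """Compare two lists of row-dicts, ignoring row order."""
    # Sort each row's items so the sort key is deterministic,
    # then sort the row lists themselves before comparing.
    key = lambda row: tuple(sorted(row.items()))
    return sorted(actual, key=key) == sorted(expected, key=key)

# Same rows, different order: still a match
actual = [{'value': 2, 'value_plus_one': 3}, {'value': 1, 'value_plus_one': 2}]
expected = [{'value': 1, 'value_plus_one': 2}, {'value': 2, 'value_plus_one': 3}]
assert rows_match(actual, expected)
```

In a real test you would feed it `[row.asDict() for row in df.collect()]`; keeping the comparison pure Python also makes this helper trivially testable on its own.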
Visual Quiz - 3 Questions
Test your understanding
Looking at the execution table, what is the output DataFrame after step 2?
A[{'value':1}, {'value':2}, {'value':3}]
B[{'value':1, 'value_plus_one':2}, {'value':2, 'value_plus_one':3}, {'value':3, 'value_plus_one':4}]
CEmpty DataFrame
DDataFrame with only 'value_plus_one' column
💡 Hint
Check the 'Output DataFrame' column of the Execution Table row for step 2
At which step do we compare the output DataFrame to the expected DataFrame?
AStep 1
BStep 3
CStep 4
DStep 5
💡 Hint
Look for the step mentioning 'Compare output with expected' in the Execution Table
If the input DataFrame had an extra column, how would the output DataFrame change after step 2?
AIt would have the extra column plus 'value_plus_one'
BIt would only have 'value_plus_one' column
CIt would drop all columns
DIt would be empty
💡 Hint
The transformation adds a new column but keeps existing ones; see the Execution Table, step 2
Concept Snapshot
Unit testing Spark transformations:
- Write a function for your transformation
- Create a small input DataFrame
- Apply the function to input
- Compare output DataFrame to expected
- Assert equality to pass test
Full Transcript
Unit testing Spark transformations means writing a small function that transforms a DataFrame, then testing it with a small example DataFrame. We apply the function, capture the output, and check whether it matches what we expect. This catches mistakes early and keeps tests fast. We track variables like input_df and output_df step by step. Key moments include why we use small data, how we compare outputs, and why we avoid running full jobs in unit tests. The execution table shows each step clearly, from creating the input to the test passing.