Unit Testing Apache Spark Transformations - Time Complexity
When we unit test Spark transformations, we want to know how test execution time grows as the input data grows. The question: how does test execution time change as the input size increases?
Analyze the time complexity of the following Spark transformation test code.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A tiny two-row DataFrame
input_df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# filter and select are lazy transformations; nothing executes yet
result_df = input_df.filter(input_df.id > 1).select("value")

# count() is an action: it triggers execution over every input row
assert result_df.count() == 1
```
This code creates a small DataFrame, applies filter and select transformations, then calls `count()` to check the result. The transformations themselves are lazy; the actual per-row work happens only when the `count()` action runs.
Look at what repeats when the input grows.
- Primary operation: The filter and select run on each row of the input DataFrame.
- How many times: Once per row, so n times for n input rows.
As the number of input rows increases, the number of operations grows in direct proportion.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 10 filter checks and selects |
| 100 | About 100 filter checks and selects |
| 1000 | About 1000 filter checks and selects |
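The table above can be made concrete without a cluster. A minimal plain-Python sketch, where the per-row loop stands in for Spark's scan (names are illustrative):

```python
def count_matching(rows):
    """Per-row filter check, analogous to filter(id > 1).count() in Spark."""
    ops = 0      # number of filter checks performed
    matches = 0  # rows that pass the filter
    for value in rows:
        ops += 1
        if value > 1:
            matches += 1
    return matches, ops

# ops grows in direct proportion to n: 10 -> 10 checks, 1000 -> 1000 checks
for n in (10, 100, 1000):
    matches, ops = count_matching(range(n))
```

The operation count equals the row count exactly, which is the O(n) pattern the table describes.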
Pattern observation: The work grows directly with the number of rows.
Time Complexity: O(n)
This means the test time grows linearly as the input data size grows.
[X] Wrong: "Unit tests run instantly no matter how big the data is."
[OK] Correct: Tests still execute transformations over every input row, so more data means more work and longer test times.
Understanding how test time grows helps you write tests that stay fast and reliable as data grows.
"What if we cache the DataFrame before filtering? How would the time complexity change?"
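One way to reason about that question: caching does not change the O(n) cost of the first action, because Spark must still scan every row to materialize the cache; what it saves is recomputation when the same DataFrame feeds multiple actions. A plain-Python analogue of recompute-vs-cache (the counter and function names are illustrative):

```python
rows = list(range(1000))
filter_runs = 0  # counts how many full scans the "filter" performs

def filtered(data):
    # Stands in for an uncached Spark transformation: each call rescans all rows
    global filter_runs
    filter_runs += 1
    return [r for r in data if r > 1]

# Uncached: two actions -> two full O(n) scans
uncached_a = len(filtered(rows))
uncached_b = len(filtered(rows))
scans_without_cache = filter_runs

# "Cached": materialize once (still O(n)), then both actions reuse the result
filter_runs = 0
cache = filtered(rows)
cached_a = len(cache)
cached_b = len(cache)
scans_with_cache = filter_runs
```

So a single test that runs one action stays O(n) either way; caching pays off when several assertions hit the same transformed DataFrame.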