Select, filter, and where operations in Apache Spark - Time & Space Complexity
We want to understand how the time to run select, filter, and where operations changes as the data grows.
How does the work increase when we have more rows in our data?
Analyze the time complexity of the following code snippet.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv('data.csv', header=True, inferSchema=True)

filtered_df = df.filter(df['age'] > 30)              # keep rows with age over 30
selected_df = filtered_df.select('name', 'age')      # keep only two columns
result = selected_df.where(selected_df['age'] < 50)  # where() is an alias for filter()
```
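Because Spark is declarative, the cost model is easier to see in a plain-Python sketch. The following is an illustrative model, not Spark's internals: it assumes an in-memory list of row dictionaries and runs the same filter, select, where pipeline while counting one predicate check per row per scan.

```python
# Plain-Python model of the pipeline above: each filter stage scans its input once.
rows = [{"name": f"person{i}", "age": 20 + i % 60} for i in range(1000)]

checks = 0  # total predicate evaluations across all scans

def scan_filter(data, predicate):
    """One full pass over `data`; counts one check per row."""
    global checks
    out = []
    for row in data:
        checks += 1
        if predicate(row):
            out.append(row)
    return out

filtered = scan_filter(rows, lambda r: r["age"] > 30)       # first scan: n checks
selected = [{"name": r["name"], "age": r["age"]} for r in filtered]  # projection
result = scan_filter(selected, lambda r: r["age"] < 50)     # second scan: <= n checks

print(checks)  # never more than 2 * len(rows)
```

The second scan runs over the already-filtered rows, so the total is at most 2n checks, exactly the linear pattern analyzed below.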
This code reads the data, filters rows where age is over 30, selects only the name and age columns, then filters again to keep ages under 50. Note that where() is simply an alias for filter(), and these transformations are lazy: Spark records them in a query plan and only scans the data when an action (such as count() or collect()) runs.
Identify the loops, recursion, or array traversals that do repeated work.
- Primary operation: Scanning each row to apply filter conditions.
- How many times: Each row is checked once per filter operation, so twice in total here.
As the number of rows grows, the time to check each row grows too.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 20 checks (2 filters x 10 rows) |
| 100 | About 200 checks |
| 1000 | About 2000 checks |
Pattern observation: The number of operations grows roughly in direct proportion to the number of rows.
Time Complexity: O(n)
This means the time grows linearly with the number of rows. Even if Spark's Catalyst optimizer fuses the two filters into a single scan, the cost is still proportional to n; optimization changes the constant factor, not the growth rate.
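The table's pattern can be reproduced with a one-line calculation, under the simplifying assumption that each of the two filters checks every row once (the function name is illustrative):

```python
# Reproduce the table: two full scans means roughly 2 * n predicate checks.
def approx_checks(n_rows, n_filters=2):
    return n_filters * n_rows

for n in (10, 100, 1000):
    print(n, approx_checks(n))  # prints 10 20, 100 200, 1000 2000
```

Doubling n doubles the check count, the signature of O(n) growth.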
[X] Wrong: "Filtering multiple times multiplies the time complexity exponentially."
[OK] Correct: Each filter scans the data once, so the cost adds up linearly (about 2n checks for two filters), not exponentially.
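To see the additive behavior directly, here is a sketch that chains k filters over the same data and counts predicate checks; the total grows like k * n, never n^k. The helper name is made up for illustration.

```python
# Chaining k filters: checks grow additively (about k * n), never exponentially.
def chained_filter_checks(n_rows, predicates):
    rows = list(range(n_rows))
    checks = 0
    for pred in predicates:
        kept = []
        for row in rows:       # one full pass per filter
            checks += 1
            if pred(row):
                kept.append(row)
        rows = kept            # next filter scans only the survivors
    return checks

preds = [lambda x: x >= 0] * 4  # 4 filters that keep every row (worst case)
print(chained_filter_checks(1000, preds))  # 4000 = 4 * 1000
```

Four filters over 1000 rows cost 4000 checks, not 1000^4; each extra filter adds at most one more linear pass.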
Understanding how filtering and selecting scale helps you explain data processing speed clearly and confidently.
"What if we added a join operation after the filters? How would the time complexity change?"
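One way to start reasoning about that question: a broadcast hash join stays roughly linear, O(n + m), because it builds a hash map on the smaller side and probes it once per row of the larger side, while a sort-merge join adds an O(n log n) sorting step. A toy hash join in plain Python (illustrative data and names, not Spark's actual implementation):

```python
# Toy hash join: build a map on one side (O(m)), probe with the other (O(n)).
left = [("alice", 34), ("bob", 42), ("carol", 28)]   # (name, age)
right = [("alice", "NYC"), ("bob", "LA")]            # (name, city)

index = {}
for name, city in right:          # build phase: one pass over the smaller side
    index.setdefault(name, []).append(city)

joined = []
for name, age in left:            # probe phase: one pass over the larger side
    for city in index.get(name, []):
        joined.append((name, age, city))

print(joined)  # [('alice', 34, 'NYC'), ('bob', 42, 'LA')]
```

Rows with no match ("carol") simply drop out, as in an inner join; the two passes keep the whole operation linear in the combined input size.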