Select, filter, and where operations in Apache Spark - Time & Space Complexity
We want to understand how the time to run select, filter, and where operations changes as the data grows.
How does the work increase when we have more rows in our data?
Analyze the time complexity of the following code snippet.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv('data.csv', header=True, inferSchema=True)

filtered_df = df.filter(df['age'] > 30)              # keep rows with age over 30
selected_df = filtered_df.select('name', 'age')      # keep only two columns
result = selected_df.where(selected_df['age'] < 50)  # where() is an alias for filter()
```
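Because Spark is declarative, the cost model is easier to see in a plain-Python sketch. The following is an illustrative model, not Spark's internals: it assumes an in-memory list of row dictionaries and runs the same filter, select, where pipeline while counting one predicate check per row per scan.

```python
# Plain-Python model of the pipeline above: each filter stage scans its input once.
rows = [{"name": f"person{i}", "age": 20 + i % 60} for i in range(1000)]

checks = 0  # total predicate evaluations across all scans

def scan_filter(data, predicate):
    """One full pass over `data`; counts one check per row."""
    global checks
    out = []
    for row in data:
        checks += 1
        if predicate(row):
            out.append(row)
    return out

filtered = scan_filter(rows, lambda r: r["age"] > 30)       # first scan: n checks
selected = [{"name": r["name"], "age": r["age"]} for r in filtered]  # projection
result = scan_filter(selected, lambda r: r["age"] < 50)     # second scan: <= n checks

print(checks)  # never more than 2 * len(rows)
```

The second scan runs over the already-filtered rows, so the total is at most 2n checks, exactly the linear pattern analyzed below.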
This code reads the data, filters rows where age is over 30, selects only the name and age columns, then filters again to keep ages under 50. Note that where() is simply an alias for filter(), and these transformations are lazy: Spark records them in a query plan and only scans the data when an action (such as count() or collect()) runs.
Identify the loops, recursion, or array traversals that do repeated work.
- Primary operation: Scanning each row to apply filter conditions.
- How many times: Each row is checked once per filter operation, so twice in total here.
As the number of rows grows, the time to check each row grows too.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 20 checks (2 filters x 10 rows) |
| 100 | About 200 checks |
| 1000 | About 2000 checks |
Pattern observation: The number of operations grows roughly in direct proportion to the number of rows.
Time Complexity: O(n)
This means the time grows linearly with the number of rows. Even if Spark's Catalyst optimizer fuses the two filters into a single scan, the cost is still proportional to n; optimization changes the constant factor, not the growth rate.
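The table's pattern can be reproduced with a one-line calculation, under the simplifying assumption that each of the two filters checks every row once (the function name is illustrative):

```python
# Reproduce the table: two full scans means roughly 2 * n predicate checks.
def approx_checks(n_rows, n_filters=2):
    return n_filters * n_rows

for n in (10, 100, 1000):
    print(n, approx_checks(n))  # prints 10 20, 100 200, 1000 2000
```

Doubling n doubles the check count, the signature of O(n) growth.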
[X] Wrong: "Filtering multiple times multiplies the time complexity exponentially."
[OK] Correct: Each filter scans the data once, so the cost adds up linearly (about 2n checks for two filters), not exponentially.
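To see the additive behavior directly, here is a sketch that chains k filters over the same data and counts predicate checks; the total grows like k * n, never n^k. The helper name is made up for illustration.

```python
# Chaining k filters: checks grow additively (about k * n), never exponentially.
def chained_filter_checks(n_rows, predicates):
    rows = list(range(n_rows))
    checks = 0
    for pred in predicates:
        kept = []
        for row in rows:       # one full pass per filter
            checks += 1
            if pred(row):
                kept.append(row)
        rows = kept            # next filter scans only the survivors
    return checks

preds = [lambda x: x >= 0] * 4  # 4 filters that keep every row (worst case)
print(chained_filter_checks(1000, preds))  # 4000 = 4 * 1000
```

Four filters over 1000 rows cost 4000 checks, not 1000^4; each extra filter adds at most one more linear pass.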
Understanding how filtering and selecting scale helps you explain data processing speed clearly and confidently.
"What if we added a join operation after the filters? How would the time complexity change?"
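One way to start reasoning about that question: a broadcast hash join stays roughly linear, O(n + m), because it builds a hash map on the smaller side and probes it once per row of the larger side, while a sort-merge join adds an O(n log n) sorting step. A toy hash join in plain Python (illustrative data and names, not Spark's actual implementation):

```python
# Toy hash join: build a map on one side (O(m)), probe with the other (O(n)).
left = [("alice", 34), ("bob", 42), ("carol", 28)]   # (name, age)
right = [("alice", "NYC"), ("bob", "LA")]            # (name, city)

index = {}
for name, city in right:          # build phase: one pass over the smaller side
    index.setdefault(name, []).append(city)

joined = []
for name, age in left:            # probe phase: one pass over the larger side
    for city in index.get(name, []):
        joined.append((name, age, city))

print(joined)  # [('alice', 34, 'NYC'), ('bob', 42, 'LA')]
```

Rows with no match ("carol") simply drop out, as in an inner join; the two passes keep the whole operation linear in the combined input size.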