Apache Spark: Complexity Analysis
We want to understand how the time it takes to run Apache Spark tasks changes as the data size grows.
How does Spark handle bigger data and what costs come with it?
Analyze the time complexity of the following Spark code snippet.
```python
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session
spark = SparkSession.builder.appName('example').getOrCreate()

# Read the CSV with a header row, letting Spark infer column types
data = spark.read.csv('data.csv', header=True, inferSchema=True)

# Transformation: keep rows where age > 30 (lazy, nothing runs yet)
filtered = data.filter(data['age'] > 30)

# Transformation: group the surviving rows by city and count them
result = filtered.groupBy('city').count()

# Action: triggers the actual computation and prints the result
result.show()
```
This code reads a CSV file, filters rows where age is over 30, groups by city, counts entries, and shows the result.
Look at what repeats as data grows.
- Primary operation: Filtering and grouping over all rows in the dataset.
- How many times: Each row is checked once by the filter, and each surviving row is grouped once.
As the number of rows increases, Spark must check each row and group them, so work grows with data size.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 10 checks and groupings |
| 100 | About 100 checks and groupings |
| 1000 | About 1000 checks and groupings |
Pattern observation: The work grows roughly in direct proportion to the number of rows.
Time Complexity: O(n)
The running time grows linearly with the number of rows, because each row is touched a constant number of times. In practice Spark spreads this work across executors, so with p parallel partitions the wall-clock time grows more like n/p, and the groupBy adds a shuffle step, but the total work is still O(n).
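The linear pattern in the table can be reproduced with a small plain-Python simulation (no Spark needed). The column names `age` and `city` mirror the Spark example, but the data and the operation-counting helper are made up for illustration:

```python
import random

def filter_group_count(rows):
    """Filter rows with age > 30, then count survivors per city.
    Returns (counts_per_city, number_of_row_level_operations)."""
    ops = 0
    counts = {}
    for row in rows:
        ops += 1                      # one filter check per row
        if row['age'] > 30:
            ops += 1                  # one grouping step per surviving row
            counts[row['city']] = counts.get(row['city'], 0) + 1
    return counts, ops

random.seed(0)
for n in (10, 100, 1000):
    rows = [{'age': random.randint(18, 60),
             'city': random.choice(['A', 'B', 'C'])} for _ in range(n)]
    _, ops = filter_group_count(rows)
    print(n, ops)   # ops never exceeds 2n, i.e. it grows as O(n)
```

Doubling the input roughly doubles the operation count, which is exactly what O(n) predicts.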
[X] Wrong: "Spark runs instantly no matter how big the data is."
[OK] Correct: Spark processes each row, so bigger data means more work and more time.
Understanding how Spark scales with data size shows you know how big data tools work in real projects.
"What if we added a join with another large dataset? How would the time complexity change?"