SQL queries on DataFrames in Apache Spark - Time & Space Complexity
When we run SQL queries on DataFrames, we want to know how the time to get results changes as the data grows.
We ask: how does the work increase as the number of rows grows?
Analyze the time complexity of the following code snippet.
```python
df.createOrReplaceTempView("people")
spark.sql("SELECT age, COUNT(*) FROM people GROUP BY age").show()
```
This code runs a SQL query on a DataFrame to count how many people are in each age group.
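To make the per-row work concrete, here is a minimal pure-Python sketch of what `GROUP BY age` with `COUNT(*)` computes, using a hypothetical list of rows in place of the `people` DataFrame (the names and ages are made up for illustration):

```python
from collections import Counter

# Hypothetical sample rows standing in for the "people" DataFrame.
people = [
    {"name": "Ana", "age": 30},
    {"name": "Ben", "age": 25},
    {"name": "Cara", "age": 30},
    {"name": "Dev", "age": 25},
    {"name": "Eli", "age": 40},
]

# GROUP BY age + COUNT(*): one pass over the rows,
# incrementing one counter per age value seen.
counts = Counter(row["age"] for row in people)
print(dict(counts))  # {30: 2, 25: 2, 40: 1}
```

Each row contributes exactly one counter update, which is the single-pass behavior the complexity analysis below relies on.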
Identify the loops, recursion, or traversals that repeat:
- Primary operation: Scanning all rows in the DataFrame to group by age.
- How many times: Each row is processed once during grouping and counting.
As the number of rows grows, the query needs to look at each row to group and count.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 10 row checks and group updates |
| 100 | About 100 row checks and group updates |
| 1000 | About 1000 row checks and group updates |
Pattern observation: The work grows roughly in direct proportion to the number of rows.
Time Complexity: O(n)
This means the time to run the query grows linearly as the number of rows increases. (In Spark, the GROUP BY also triggers a shuffle, but thanks to partial aggregation the data moved across the network is proportional to the number of distinct ages per partition, so the per-row work remains linear.)
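The table above can be reproduced with a small counting sketch: the function below runs the grouping logic on `n` synthetic rows and tallies the per-row updates, confirming that operations grow in lockstep with input size (the age range and row generation are assumptions for illustration):

```python
import random
from collections import Counter

def group_count_ops(n):
    """Group n synthetic rows by age and return the number of
    per-row hash-table updates performed."""
    ops = 0
    counts = Counter()
    for _ in range(n):
        age = random.randint(18, 65)  # synthetic age column
        counts[age] += 1              # one group update per row
        ops += 1
    return ops

for n in (10, 100, 1000):
    print(n, group_count_ops(n))  # ops == n at every size: linear growth
```

One update per row scanned is exactly the O(n) pattern the table illustrates.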
[X] Wrong: "Grouping by a column makes the query run in constant time no matter the data size."
[OK] Correct: The query still needs to look at every row to group and count, so time grows with data size.
Understanding how SQL queries scale on DataFrames helps you explain performance in real projects and demonstrates that you understand how data size affects the work done.
"What if we added a filter before grouping? How would the time complexity change?"
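As a starting point for that question, here is a sketch (again with made-up rows and a hypothetical `age < 40` predicate) showing that a filter applied before grouping happens in the same single scan, so every row is still examined once and the complexity stays O(n), even though fewer groups get updated:

```python
from collections import Counter

# Hypothetical rows; only the age column matters here.
rows = [{"age": a} for a in (25, 30, 30, 40, 25, 52)]

# Equivalent of:
#   SELECT age, COUNT(*) FROM people WHERE age < 40 GROUP BY age
# The filter is evaluated during the same pass over the rows,
# so all n rows are still scanned: time complexity remains O(n).
counts = Counter(r["age"] for r in rows if r["age"] < 40)
print(dict(counts))  # {25: 2, 30: 2}
```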