Parquet format and columnar storage in Apache Spark - Time & Space Complexity
We want to understand how reading data stored in Parquet format grows as the data size increases.
How does the columnar storage affect the time it takes to read specific columns?
Analyze the time complexity of the following Apache Spark code reading Parquet data.
```python
df = spark.read.parquet("/data/events.parquet")
selected_df = df.select("user_id", "event_type")
result = selected_df.filter(selected_df.event_type == "click").collect()
```
This code reads a Parquet file, selects two columns, filters the rows where event_type equals "click", and collects the matching rows back to the driver.
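To see where the per-row work comes from, here is a minimal plain-Python model of the same select-then-filter logic. The in-memory rows are hypothetical stand-ins for the Parquet data; this is a sketch of the work pattern, not how Spark is implemented internally.

```python
# Hypothetical rows standing in for the Parquet file's contents.
rows = [
    {"user_id": 1, "event_type": "click", "ts": 100},
    {"user_id": 2, "event_type": "view",  "ts": 101},
    {"user_id": 3, "event_type": "click", "ts": 102},
]

# select("user_id", "event_type") then filter(event_type == "click"):
# every row is examined exactly once, so the work is proportional to len(rows).
result = [
    {"user_id": r["user_id"], "event_type": r["event_type"]}
    for r in rows
    if r["event_type"] == "click"
]
```

Each row contributes one comparison, which is the repeated operation the complexity analysis below counts.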
Identify the loops, recursion, or array traversals that repeat as the input grows.
- Primary operation: Scanning rows in the Parquet file to filter and select columns.
- How many times: Once for each row in the dataset.
Reading and filtering grows roughly in proportion to the number of rows because each row is checked once.
| Input Size (n rows) | Approx. Operations |
|---|---|
| 10 | About 10 row checks |
| 100 | About 100 row checks |
| 1000 | About 1000 row checks |
Pattern observation: The work grows linearly as the number of rows increases.
Time Complexity: O(n)
This means the time to read and filter grows directly with the number of rows in the data.
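The table above can be reproduced with a small counting sketch: a hypothetical helper that tallies one check per row, confirming the linear pattern.

```python
def count_row_checks(n):
    """Simulate filtering n rows, counting one check per row."""
    checks = 0
    for _ in range(n):
        checks += 1  # each row's event_type is compared exactly once
    return checks

# Matches the table: 10 -> 10 checks, 100 -> 100, 1000 -> 1000.
counts = [count_row_checks(n) for n in (10, 100, 1000)]
```

Doubling the row count doubles the checks, which is exactly what O(n) predicts.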
[X] Wrong: "Because Parquet is columnar, reading any column is always constant time regardless of data size."
[OK] Correct: Even though Parquet stores columns separately, reading a column still requires scanning every value in that column, so the time grows with the number of rows.
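A quick sketch makes the correct intuition concrete. In a columnar layout each column is effectively an array of n values, and finding the matching rows still means touching each value once (the column data here is hypothetical):

```python
# Columnar layout: the event_type column is one array of n values.
event_type_col = ["click", "view", "click", "view", "view"]

# Finding matching rows still visits every value in the column: O(n).
matching_indices = [i for i, v in enumerate(event_type_col) if v == "click"]
```

Columnar storage saves you from reading the *other* columns, but it does not make scanning the selected column constant time.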
Understanding how columnar storage affects data reading helps you explain performance in real data tasks clearly and confidently.
"What if we only read a small subset of columns from a very wide Parquet file? How would the time complexity change?"