Writing output with partitioning in Apache Spark - Time & Space Complexity
When saving data in Apache Spark, partitioning splits the dataset into chunks that executor tasks can write in parallel. The number and size of those partitions directly affect how long a write takes.
We want to know how write time grows as the data size and the partition count change.
Analyze the time complexity of the following code snippet.
df.repartition(10).write.mode("overwrite").parquet("output/path")
This code shuffles the data into 10 partitions (a full repartition moves rows across the cluster) and writes each partition out as a Parquet file.
Identify the repeated operations: loops, recursion, or traversals over the data.
- Primary operation: Writing each partition's data to disk.
- How many times: Once per partition (here, 10 times), with each write's cost proportional to the rows that partition holds.
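The two bullets above can be sketched as a minimal cost model in plain Python (this is an illustration of the counting argument, not Spark internals; the even row split mirrors what `repartition()` roughly achieves):

```python
# Hypothetical model: distribute n rows across a fixed number of
# partitions, then perform one "write" per partition.
def simulate_write(n_rows, n_partitions=10):
    # Rows per partition, split as evenly as possible.
    base, extra = divmod(n_rows, n_partitions)
    partitions = [base + (1 if i < extra else 0) for i in range(n_partitions)]

    writes = 0
    rows_written = 0
    for rows in partitions:   # one write task per partition
        writes += 1
        rows_written += rows  # per-partition cost scales with its row count
    return writes, rows_written

print(simulate_write(10_000))  # (10, 10000): 10 writes, every row written once
```

The write loop runs a fixed 10 times, but the total rows written always equals n, which is why the row count, not the partition count, drives the asymptotic cost.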
As data size grows, each partition holds more data, so writing takes longer per partition.
| Input Size (n) | Work per Write |
|---|---|
| 10,000 rows | Writing 10 partitions, each with ~1,000 rows |
| 100,000 rows | Writing 10 partitions, each with ~10,000 rows |
| 1,000,000 rows | Writing 10 partitions, each with ~100,000 rows |
Pattern observation: Total work grows roughly with data size, split across fixed partitions.
Time Complexity: O(n)
This means the time to write grows linearly with the total data size, divided among partitions.
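That linear relationship can be checked with a simple hypothetical cost function (per-row cost and the uniform split are assumptions of this sketch, not measured Spark behavior):

```python
# Hypothetical model: total write cost = partitions * (rows per partition * per-row cost),
# which algebraically simplifies to n_rows * per_row -- i.e., O(n).
def write_cost(n_rows, n_partitions=10, per_row=1.0):
    per_partition_rows = n_rows / n_partitions
    return n_partitions * (per_partition_rows * per_row)

# Doubling the data doubles the total cost; the partition count cancels out.
print(write_cost(200_000) == 2 * write_cost(100_000))  # True
print(write_cost(100_000, 10) == write_cost(100_000, 100))  # True
```

Under this model the partition count only changes how the work is divided, not how much total work there is, matching the O(n) conclusion above.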
[X] Wrong: "More partitions always make writing faster regardless of data size."
[OK] Correct: Too many partitions add overhead and small files, which can slow writing and later reading.
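The overhead argument can be made concrete by extending the cost model with a fixed per-partition term (the constants below are illustrative assumptions, standing in for task scheduling and file open/commit costs):

```python
# Hypothetical model: linear per-row cost plus a fixed cost per partition
# (task scheduling, file creation, commit). Not measured Spark numbers.
def write_cost_with_overhead(n_rows, n_partitions, per_row=1.0, per_partition=500.0):
    return n_rows * per_row + n_partitions * per_partition

few  = write_cost_with_overhead(10_000, 10)     # 10 reasonably sized files
many = write_cost_with_overhead(10_000, 1_000)  # 1,000 tiny files

print(many > few)  # True: fixed overhead dominates when partitions are tiny
```

For a small dataset, 1,000 partitions means 1,000 tiny files whose fixed costs swamp the actual row-writing work, which is exactly the "small files" problem the correction describes.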
Understanding how partitioning affects write time helps you explain data scaling in Spark jobs clearly and confidently.
"What if we change repartition(10) to repartition(100)? How would the time complexity change?"