Data pipeline patterns in GCP - Time & Space Complexity
When building data pipelines in the cloud, it's important to know how processing time grows as data size grows. In other words: how does the number of steps or operations change when we add more data to the pipeline?
Analyze the time complexity of the following data pipeline pattern using GCP services.
```
// Pseudocode for a batch data pipeline
1. Read data from Cloud Storage (multiple files)
2. Process data with a Dataflow job
3. Write results to BigQuery
4. Repeat for each batch
```
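The steps above can be sketched as a minimal Python simulation. This is illustrative only: the function names are stand-ins, not real GCP client APIs, and no actual Cloud Storage, Dataflow, or BigQuery calls are made.

```python
# Sketch of the batch pipeline: each batch triggers one read,
# one processing job, and one write.

def read_batch(batch_id):
    # Stand-in for reading one batch of files from Cloud Storage.
    return [f"record-{batch_id}-{i}" for i in range(3)]

def process_batch(records):
    # Stand-in for a Dataflow job transforming the records.
    return [r.upper() for r in records]

def write_batch(results):
    # Stand-in for writing results to BigQuery.
    pass

def run_pipeline(num_batches):
    # Counts how many of each operation the pipeline performs.
    stats = {"reads": 0, "jobs": 0, "writes": 0}
    for batch_id in range(num_batches):
        records = read_batch(batch_id)
        stats["reads"] += 1
        results = process_batch(records)
        stats["jobs"] += 1
        write_batch(results)
        stats["writes"] += 1
    return stats

print(run_pipeline(10))  # each count equals the number of batches
```

Running the simulation with 10 batches yields 10 reads, 10 jobs, and 10 writes, matching the loop structure of the pseudocode.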
This sequence reads batches of files, processes them, and stores the results in a data warehouse (BigQuery).
Look at what happens repeatedly as data grows.
- Primary operation: Processing each batch of data files with Dataflow.
- How many times: Once per batch, so the operation count scales with the number of batches.
As the number of data batches increases, the pipeline runs more processing jobs.
| Number of Batches (n) | Approx. API Calls/Operations |
|---|---|
| 10 | 10 Dataflow jobs, 10 reads, 10 writes |
| 100 | 100 Dataflow jobs, 100 reads, 100 writes |
| 1000 | 1000 Dataflow jobs, 1000 reads, 1000 writes |
Each new batch adds a similar amount of work, so the total work grows directly with the number of batches.
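The table above can be checked with a few lines of Python, assuming one read, one Dataflow job, and one write per batch (three operations per batch):

```python
def operations(num_batches):
    # Total operations grow linearly: 3 operations for every batch.
    return 3 * num_batches

for n in (10, 100, 1000):
    print(f"{n} batches -> {operations(n)} operations")

# Doubling the input doubles the work -- the signature of O(n).
assert operations(200) == 2 * operations(100)
```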
Time Complexity: O(n)
This means the time to complete the pipeline grows in direct proportion to the number of data batches processed.
[X] Wrong: "Processing more data files will only take a little more time, almost constant."
[OK] Correct: Each batch requires a full processing job, so time grows with the number of batches, not stays the same.
Understanding how pipeline steps scale with data size shows you can design systems that handle growth smoothly and predictably.
"What if we combined all data files into one big batch before processing? How would the time complexity change?"
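One way to reason about this question is with a toy cost model. The assumptions here are illustrative: each job launch carries a fixed overhead, and processing each record costs one unit. Combining all files into one batch makes the number of job launches O(1), but the work inside that single job still grows with the total number of records.

```python
JOB_OVERHEAD = 5  # illustrative fixed cost per Dataflow job launch

def per_batch_cost(num_batches, records_per_batch):
    # One job per batch: pay the launch overhead num_batches times.
    return num_batches * (JOB_OVERHEAD + records_per_batch)

def single_batch_cost(num_batches, records_per_batch):
    # One combined job: pay the launch overhead once,
    # but still process every record.
    total_records = num_batches * records_per_batch
    return JOB_OVERHEAD + total_records

print(per_batch_cost(100, 10))     # 100 * (5 + 10) = 1500
print(single_batch_cost(100, 10))  # 5 + 1000 = 1005
```

Under this model, combining batches saves the repeated per-job overhead, yet the end-to-end time remains O(n) in the amount of data, because every record still has to be read, processed, and written.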