Lambda architecture (batch + streaming) in Hadoop - Time & Space Complexity
We want to understand how processing time grows in a Lambda architecture built on Hadoop.
Specifically, we look at how the batch and streaming layers each contribute to the total work as data size increases.
Analyze the time complexity of the following Hadoop code snippet for batch and streaming layers.
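// Batch layer: process the full dataset with Hadoop MapReduce
// (imports, input/output paths, and exception handling omitted for brevity)
Job job = Job.getInstance(new Configuration(), "batch-layer");
job.setMapperClass(BatchMapper.class);
job.setReducerClass(BatchReducer.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
job.waitForCompletion(true);  // submit the batch job and block until it finishes
// Speed layer: process streaming data in small batches
// StreamingJob is an illustrative placeholder here, not a class from the Hadoop MapReduce API
StreamingJob streamingJob = new StreamingJob();
streamingJob.setMapperClass(StreamMapper.class);
streamingJob.setReducerClass(StreamReducer.class);
streamingJob.run();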
This code runs two parts: batch jobs on big data sets and streaming jobs on small, fast data chunks.
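To make the two parts concrete without a cluster, here is a minimal plain-Java sketch. It is not Hadoop API code: the class `TwoLayerSketch` and the methods `countWords`, `batchView`, and `speedView` are hypothetical names invented for illustration, and the word count stands in for whatever the mapper and reducer classes in the snippet actually compute, which the snippet does not show.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TwoLayerSketch {
    // Stand-in for one map + reduce pass: visit every line once and count words.
    static Map<String, Integer> countWords(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {                        // "map": one visit per record
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);   // "reduce": aggregate by key
                }
            }
        }
        return counts;
    }

    // Batch layer (assumed behavior): recompute the view from the full dataset.
    static Map<String, Integer> batchView(List<String> allLines) {
        return countWords(allLines);
    }

    // Speed layer (assumed behavior): compute a view from only the newest chunk.
    static Map<String, Integer> speedView(List<String> newestChunk) {
        return countWords(newestChunk);
    }

    public static void main(String[] args) {
        List<String> history = Arrays.asList("a b", "b c", "a a");
        List<String> chunk = Arrays.asList("c a");
        System.out.println("batch view: " + batchView(history));
        System.out.println("speed view: " + speedView(chunk));
    }
}
```

Both layers reuse the same per-record logic; what differs is the input each one is handed. The batch call walks all of `history`, while the speed call walks only `chunk`, which is why their costs scale differently.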
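Identify the loops and repeated processing steps.
- Primary operation: The batch layer runs MapReduce over the entire dataset on each run; the speed layer runs its map and reduce steps over one small chunk at a time.
- How many times: The batch job reruns periodically on the full data; the streaming job runs continuously, once per incoming chunk.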
Batch layer time grows with total data size; streaming layer time grows with the incoming data rate.
| Total Data Size (n) | Approx. Batch Operations (per run) | Approx. Streaming Operations |
|---|---|---|
| 10 GB | 10 units | 1 unit per small chunk |
| 100 GB | 100 units | 1 unit per small chunk |
| 1000 GB | 1000 units | 1 unit per small chunk |
Batch work grows linearly with total data size; streaming work depends on how fast data arrives, not total size.
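The pattern in the table can be reproduced with a small cost model. This is a sketch under stated assumptions: one unit of work per GB scanned (matching the table), and a speed-layer chunk of 1 GB, which is a hypothetical value chosen only for illustration. The names `ScalingTable`, `batchUnits`, and `streamingUnitsPerChunk` are invented for this example.

```java
public class ScalingTable {
    static final long UNITS_PER_GB = 1;  // assumption: 1 unit of work per GB, as in the table
    static final long CHUNK_GB = 1;      // assumption: hypothetical fixed chunk size

    // One batch run scans the whole dataset, so work is proportional to total size.
    static long batchUnits(long totalGb) {
        return totalGb * UNITS_PER_GB;
    }

    // One streaming step scans a single chunk; totalGb is deliberately unused
    // to show that total size does not enter the per-chunk cost.
    static long streamingUnitsPerChunk(long totalGb) {
        return CHUNK_GB * UNITS_PER_GB;
    }

    public static void main(String[] args) {
        for (long gb : new long[] {10, 100, 1000}) {
            System.out.println(gb + " GB -> batch " + batchUnits(gb)
                    + " units, streaming " + streamingUnitsPerChunk(gb) + " unit(s) per chunk");
        }
    }
}
```

Multiplying the dataset by 10 multiplies the batch figure by 10 and leaves the per-chunk streaming figure unchanged, which is the linear versus constant behavior the table describes.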
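Time Complexity: O(n)
This means the time for one batch run grows linearly with n, the total size of the data processed. The cost of one streaming step depends only on the size of the chunk it handles, so it stays roughly constant as n grows.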
[X] Wrong: "Streaming layer time grows the same as batch because it processes all data."
[OK] Correct: Streaming processes small, recent data chunks continuously, not the entire dataset, so its time depends on data arrival rate, not total size.
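The correction above can be checked with a small model of arrival rate. This is a hedged sketch, not a measurement: it assumes each record costs one unit of work, and the names `SpeedLayerRate`, `streamingWork`, and `batchWork` are hypothetical.

```java
public class SpeedLayerRate {
    // Work the speed layer does in a time window: records arriving in that window.
    // storedRecords is accepted only to show that it does not affect the result.
    static long streamingWork(long recordsPerSecond, long windowSeconds, long storedRecords) {
        return recordsPerSecond * windowSeconds;
    }

    // Work for one batch run: every stored record is visited once.
    static long batchWork(long storedRecords) {
        return storedRecords;
    }

    public static void main(String[] args) {
        long small = 1_000_000L;
        long large = 1_000_000_000L;
        // Same arrival rate, very different history sizes.
        System.out.println("streaming, small history: " + streamingWork(100, 60, small));
        System.out.println("streaming, large history: " + streamingWork(100, 60, large));
        // Doubling the arrival rate doubles the streaming work.
        System.out.println("streaming, doubled rate:  " + streamingWork(200, 60, small));
        // Batch work tracks the history size instead.
        System.out.println("batch, small history:     " + batchWork(small));
        System.out.println("batch, large history:     " + batchWork(large));
    }
}
```

In this model the first two streaming figures are identical even though the stored history differs by a factor of 1000, while the batch figures differ by exactly that factor.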
Understanding how batch and streaming parts scale helps you explain real-world data processing systems clearly and confidently.
"What if the streaming layer started processing larger batches instead of small chunks? How would the time complexity change?"