Batch vs. Real-Time Ingestion in Hadoop: A Performance Comparison
We want to understand how processing time changes when data is ingested into Hadoop in batch mode versus in real time. In particular: how does the ingestion approach affect the amount of work done as the data size grows?
Analyze the time complexity of the following Hadoop batch ingestion job.
```java
// Batch ingestion example: a classic MapReduce job that reads the
// entire input once, processes it, and writes the result.
Job job = Job.getInstance(conf, "BatchIngestion");
job.setMapperClass(MyMapper.class);
job.setReducerClass(MyReducer.class);
// Declare the job's output types (assumed here to be Text/IntWritable;
// adjust to match MyReducer's actual signature).
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setInputFormatClass(TextInputFormat.class);
TextInputFormat.addInputPath(job, new Path(inputPath));
job.setOutputFormatClass(TextOutputFormat.class);
TextOutputFormat.setOutputPath(job, new Path(outputPath));
// Block until the job finishes; waitForCompletion returns true on success.
System.exit(job.waitForCompletion(true) ? 0 : 1);
```
This code runs a batch job that reads all of the input data at once, processes it through the mapper and reducer, and writes the output.
Look at what repeats during the batch job.
- Primary operation: Reading and processing each data record once in the mapper and reducer.
- How many times: Once for every record in the input data set.
As the input data grows, the job processes more records one by one.
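The per-record pattern can be sketched outside Hadoop with a simple counter. This is a minimal simulation, not actual Hadoop API: the class name, method name, and record values below are made up for illustration.

```java
import java.util.List;

public class BatchCount {
    // Simulate a batch job: every record is visited exactly once.
    static long processBatch(List<String> records) {
        long steps = 0;
        for (String record : records) {
            steps++; // one unit of map/reduce work per record (simulated)
        }
        return steps;
    }

    public static void main(String[] args) {
        List<String> input = List.of("r1", "r2", "r3", "r4");
        System.out.println(processBatch(input)); // 4 steps: one per record
    }
}
```

The step count always equals the record count, which is exactly the linear pattern the table below illustrates.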
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | 10 processing steps |
| 100 | 100 processing steps |
| 1000 | 1000 processing steps |
Pattern observation: The work grows directly with the number of records; doubling data doubles work.
Time Complexity: O(n)
This means the time to finish the batch job grows linearly with the amount of data.
[X] Wrong: "Real-time ingestion always takes less time than batch ingestion."
[OK] Correct: Real-time ingestion processes data continuously in small pieces, but the total work over time can be similar, or even greater, depending on data volume and per-record overhead.
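A rough way to see why the total work can be similar: whether records arrive all at once or one at a time, each record still has to be processed. The sketch below (hypothetical names, not Hadoop code) counts operations both ways, modeling real-time ingestion's per-event cost as a fixed overhead per record.

```java
import java.util.List;

public class IngestionCompare {
    // Batch: process everything that has accumulated, in one pass.
    static long batchOps(List<String> records) {
        long ops = 0;
        for (String r : records) ops++; // one unit of work per record
        return ops;
    }

    // Real-time: process each record as it arrives, paying an extra
    // fixed overhead per record (e.g. per-event scheduling).
    static long realTimeOps(List<String> records, long overheadPerRecord) {
        long ops = 0;
        for (String r : records) ops += 1 + overheadPerRecord;
        return ops;
    }

    public static void main(String[] args) {
        List<String> data = java.util.Collections.nCopies(1000, "event");
        System.out.println(batchOps(data));       // 1000
        System.out.println(realTimeOps(data, 1)); // 2000: same O(n), larger constant
    }
}
```

Both grow linearly with n; real-time ingestion changes the constant factor and the latency profile, not the asymptotic class.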
Understanding how batch and real-time ingestion scale helps you explain trade-offs clearly and shows you know how data size affects processing time in Hadoop jobs.
"What if we changed the batch job to process data in smaller chunks repeatedly instead of all at once? How would the time complexity change?"
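One way to reason about that question: splitting n records into chunks of size k gives n/k chunks of k records each, so the total remains about n record-level steps, i.e. still O(n). The sketch below (illustrative only, with made-up names) confirms that chunking changes when the work happens, not how much there is.

```java
import java.util.List;

public class ChunkedBatch {
    // Process records in fixed-size chunks instead of one big pass.
    static long processInChunks(List<String> records, int k) {
        long steps = 0;
        for (int start = 0; start < records.size(); start += k) {
            int end = Math.min(start + k, records.size());
            for (String r : records.subList(start, end)) {
                steps++; // each record is still touched exactly once
            }
        }
        return steps;
    }

    public static void main(String[] args) {
        List<String> input = java.util.Collections.nCopies(100, "rec");
        System.out.println(processInChunks(input, 10)); // 100: same total as one pass
    }
}
```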