
Batch vs real-time ingestion in Hadoop - Performance Comparison

Time Complexity of Batch vs Real-Time Ingestion: O(n)
Understanding Time Complexity

We want to understand how the time to process data changes when using batch or real-time ingestion in Hadoop.

How does the way data is handled affect the work done as data size grows?

Scenario Under Consideration

Analyze the time complexity of the following Hadoop batch ingestion job.

// Batch ingestion example: one MapReduce job over the whole input
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "BatchIngestion");
job.setMapperClass(MyMapper.class);
job.setReducerClass(MyReducer.class);
// Output types are assumed here (e.g. a word-count style MyReducer)
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(inputPath));
FileOutputFormat.setOutputPath(job, new Path(outputPath));
job.waitForCompletion(true); // blocks until the whole batch finishes

This code runs a batch job that reads all input data at once, processes it, and writes output.
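The MyMapper and MyReducer classes are not shown. Assuming a word-count-style job, the per-record work they do can be sketched with a plain-Java, in-memory stand-in (a simplified simulation of the two phases, not actual Hadoop API code):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BatchSimulation {
    // Map phase: visit every input record exactly once.
    static List<Map.Entry<String, Integer>> map(List<String> records) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : records) {
            for (String word : line.split("\\s+")) {
                pairs.add(Map.entry(word, 1)); // one (word, 1) pair per token
            }
        }
        return pairs;
    }

    // Reduce phase: sum the counts for each key.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new HashMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> input = List.of("hadoop batch job", "batch ingestion");
        Map<String, Integer> result = reduce(map(input));
        System.out.println(result.get("batch")); // prints 2
    }
}
```

Both phases make a single pass over their input, which is the root of the linear behavior analyzed below.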

Identify Repeating Operations

Look at what repeats during the batch job.

  • Primary operation: Reading and processing each data record once in the mapper and reducer.
  • How many times: Once for every record in the input data set.

How Execution Grows With Input

As the input data grows, the job processes more records one by one.

Input Size (n)    Approx. Operations
10                10 processing steps
100               100 processing steps
1000              1000 processing steps

Pattern observation: The work grows directly with the number of records; doubling data doubles work.
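The pattern in the table can be checked with a toy counter that charges one processing step per record (an illustrative sketch, not a Hadoop measurement):

```java
public class LinearGrowth {
    // Count one "processing step" per record, mirroring a single map pass.
    static long processingSteps(int recordCount) {
        long steps = 0;
        for (int i = 0; i < recordCount; i++) {
            steps++; // each record is read and processed exactly once
        }
        return steps;
    }

    public static void main(String[] args) {
        for (int n : new int[] {10, 100, 1000}) {
            System.out.println(n + " records -> " + processingSteps(n) + " steps");
        }
        // Doubling the input doubles the work:
        System.out.println(processingSteps(2000) == 2 * processingSteps(1000)); // prints true
    }
}
```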

Final Time Complexity

Time Complexity: O(n)

This means the time to finish the batch job grows linearly with the amount of input data.

Common Mistake

[X] Wrong: "Real-time ingestion always takes less time than batch ingestion."

[OK] Correct: Real-time ingestion handles data in small pieces continuously, which lowers per-record latency, but the total work is still proportional to the data volume and can even exceed batch processing once per-record overhead is counted.
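The point about total work can be illustrated with a sketch that processes records one at a time as they "arrive" (a hypothetical simulation, not a real streaming framework):

```java
import java.util.List;

public class StreamingSimulation {
    // Process each record as soon as it "arrives"; return total units of work.
    static int processStream(List<String> stream) {
        int totalSteps = 0;
        for (String event : stream) {
            // Low latency for each individual record, but still one unit of work each...
            totalSteps++;
        }
        return totalSteps;
    }

    public static void main(String[] args) {
        // ...so total work remains proportional to record count: O(n),
        // the same asymptotic cost as the batch job.
        System.out.println(processStream(List.of("r1", "r2", "r3", "r4"))); // prints 4
    }
}
```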

Interview Connect

Understanding how batch and real-time ingestion scale helps you explain trade-offs clearly and shows you know how data size affects processing time in Hadoop jobs.

Self-Check

"What if we changed the batch job to process data in smaller chunks repeatedly instead of all at once? How would the time complexity change?"
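One way to explore this question: splitting the same n records into chunks of size k gives roughly n/k chunks of k records each, so the total is still n per-record steps, i.e. still O(n). A quick illustrative sketch:

```java
public class ChunkedProcessing {
    // Process n records in chunks of size k; count the total per-record steps.
    static long processInChunks(int n, int k) {
        long steps = 0;
        for (int start = 0; start < n; start += k) {
            int end = Math.min(start + k, n);
            for (int i = start; i < end; i++) {
                steps++; // each record is still processed exactly once
            }
        }
        return steps;
    }

    public static void main(String[] args) {
        // Same 1000 records, whether processed in one batch or in chunks of 50:
        System.out.println(processInChunks(1000, 1000)); // prints 1000
        System.out.println(processInChunks(1000, 50));   // prints 1000
    }
}
```

Chunking changes latency and memory pressure, not the asymptotic amount of work.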