How Ingestion Pipelines Feed the Data Lake in Hadoop: A Performance Analysis
We want to understand how the time to move data into a data lake grows as the data size increases.
How does the ingestion pipeline handle larger amounts of data efficiently?
Analyze the time complexity of this Hadoop ingestion pipeline snippet.
```java
// Hadoop MapReduce mapper that ingests data into the data lake
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DataIngestionMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Pass each input line through unchanged; the framework writes it to the output path
        context.write(value, NullWritable.get());
    }
}
```
No reducer is needed: the mapper output is stored directly in the data lake.
This code reads input data line by line and writes it directly to the data lake storage.
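For the mapper to write straight to storage, the job must be configured as map-only. A minimal driver sketch follows; the class name `DataIngestionDriver`, the job name, and the path arguments are illustrative assumptions, not part of the original snippet:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DataIngestionDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "data-lake-ingestion");
        job.setJarByClass(DataIngestionDriver.class);
        job.setMapperClass(DataIngestionMapper.class);
        job.setNumReduceTasks(0);          // map-only job: no shuffle, sort, or reduce phase
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // source data
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // data lake target
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

With `setNumReduceTasks(0)`, mapper output bypasses the shuffle entirely, which is what makes the "no reducer needed" comment in the snippet hold.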
Look at what repeats as data size grows.
- Primary operation: Processing each line of input data once in the mapper.
- How many times: Once per input line, so as many times as there are lines.
As the number of input lines grows, the total work grows proportionally.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | 10 processing steps |
| 100 | 100 processing steps |
| 1000 | 1000 processing steps |
Pattern observation: The work grows directly with the number of input lines.
Time Complexity: O(n)
This means the time to ingest data grows linearly with the amount of data.
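The linear pattern in the table can be checked with a plain-Java sketch, outside Hadoop. The class `IngestionCost` and its `countProcessingSteps` helper are hypothetical names introduced here to mirror one `map()` call per input line:

```java
import java.util.Collections;
import java.util.List;

public class IngestionCost {
    // Count one "processing step" per input line, mirroring one map() call per line
    static long countProcessingSteps(List<String> lines) {
        long steps = 0;
        for (String line : lines) {
            steps++; // one constant-time write per line
        }
        return steps;
    }

    public static void main(String[] args) {
        // Same input sizes as the table above: steps grow in lockstep with lines
        for (int n : new int[] {10, 100, 1000}) {
            List<String> input = Collections.nCopies(n, "record");
            System.out.println(n + " lines -> " + countProcessingSteps(input) + " steps");
        }
    }
}
```

Doubling the input doubles the step count, which is exactly what O(n) predicts.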
[X] Wrong: "The ingestion time stays the same no matter how much data we have."
[OK] Correct: Each line must be processed, so more data means more work and more time.
Understanding how data ingestion scales helps you explain real-world data pipeline performance clearly and confidently.
"What if the ingestion pipeline added a reducer step that aggregates data? How would the time complexity change?"