How Centralizing Data in a Hadoop Data Lake Affects Processing Time - Performance Analysis
We want to understand how processing time grows under a data lake architecture. Specifically, how does centralizing data affect the work a Hadoop job does as the data set grows? Let's analyze the time complexity of the following Hadoop job, which reads from a centralized data lake.
```java
// Hadoop MapReduce job reading from a data lake
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "DataLakeRead");
job.setInputFormatClass(TextInputFormat.class);
FileInputFormat.addInputPath(job, new Path("/data-lake/centralized/"));
// Mapper and Reducer setup goes here (setMapperClass, setReducerClass,
// output key/value classes, and an output path)
job.waitForCompletion(true);
```
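The `// Mapper and Reducer setup` comment above elides the per-record logic. As a rough stand-in (plain Java, not the real Hadoop `Mapper` API, and the word-count-style output format is an assumption for illustration), the map step does a constant amount of work for each record it receives:

```java
import java.util.ArrayList;
import java.util.List;

public class SimpleMapper {
    // Simplified stand-in for a Hadoop mapper: given one input record
    // (a line of text), emit one "word\t1" pair per token.
    // The work done here is roughly constant per record, which is why
    // total job time scales with the number of records.
    static List<String> map(String record) {
        List<String> emitted = new ArrayList<>();
        for (String token : record.split("\\s+")) {
            emitted.add(token + "\t1");
        }
        return emitted;
    }
}
```

Because each record triggers a bounded amount of map work, the total cost of the job is driven by how many records the centralized lake holds.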
This job reads all data stored centrally in the data lake, then processes it with map and reduce steps.
Look at what repeats as data size grows.
- Primary operation: Reading and processing each data record from the centralized storage.
- How many times: Once for each record stored in the data lake.
As the data lake grows, the job reads and processes more records.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | 10 reads and map operations |
| 100 | 100 reads and map operations |
| 1000 | 1000 reads and map operations |
Pattern observation: The work grows directly with the number of records; doubling data doubles work.
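The table above can be reproduced with a tiny simulation. This is a sketch of the counting argument, not actual Hadoop code; it just tallies one operation per record scanned:

```java
public class DataLakeScan {
    // Simulates a full scan of the data lake: one read-and-map
    // operation per stored record.
    static long processRecords(long n) {
        long ops = 0;
        for (long i = 0; i < n; i++) {
            ops++; // one read + map per record
        }
        return ops;
    }

    public static void main(String[] args) {
        for (long n : new long[] {10, 100, 1000}) {
            System.out.println("n=" + n + " -> " + processRecords(n) + " operations");
        }
    }
}
```

Running it prints exactly the counts from the table: doubling `n` doubles the operation count, which is the signature of linear growth.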
Time Complexity: O(n)
This means the time to process the data grows linearly with the amount of data centralized in the lake.
[X] Wrong: "Centralizing data means processing time stays the same no matter how much data there is."
[OK] Correct: Even though data is centralized, the job still reads and processes every record, so more data means more work.
Understanding how data size affects processing time in a data lake helps you explain real-world data workflows clearly and confidently.
"What if the data lake was split into multiple smaller lakes instead of centralized? How would the time complexity change?"
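One way to start reasoning about this question: splitting the lake into k partitions does not reduce the total work, which stays O(n), but each partition only scans about n/k records, so jobs running in parallel can finish sooner. A minimal sketch of that arithmetic (the partition count k = 4 is an assumption for illustration):

```java
public class PartitionSketch {
    // Records scanned by each of k equally sized partitions of n records.
    static long perPartitionOps(long n, long k) {
        return n / k;
    }

    public static void main(String[] args) {
        long n = 1000; // total records in the lake
        long k = 4;    // number of smaller lakes (partitions)
        System.out.println("Per-partition ops: " + perPartitionOps(n, k));     // 250
        System.out.println("Total ops across partitions: " + perPartitionOps(n, k) * k); // 1000
    }
}
```

So the asymptotic complexity is unchanged; partitioning trades total work for parallelism, which is essentially what MapReduce itself does with input splits.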