
Why data lake architecture centralizes data in Hadoop - Performance Analysis

Understanding Time Complexity

We want to understand how the time to process data grows when using a data lake architecture.

Specifically, how centralizing data affects the work Hadoop does as data size grows.

Scenario Under Consideration

Analyze the time complexity of this Hadoop job reading from a centralized data lake.


// Hadoop MapReduce job reading from a centralized data lake
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "DataLakeRead");
job.setJarByClass(DataLakeJob.class); // hypothetical driver class
job.setInputFormatClass(TextInputFormat.class);
FileInputFormat.addInputPath(job, new Path("/data-lake/centralized/"));
FileOutputFormat.setOutputPath(job, new Path("/data-lake/output/")); // illustrative output path
// Mapper and Reducer setup would go here, e.g.:
// job.setMapperClass(MyMapper.class);
// job.setReducerClass(MyReducer.class);
System.exit(job.waitForCompletion(true) ? 0 : 1);

This job reads all data stored centrally in the data lake, then processes it with map and reduce steps.
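The map and reduce steps can be simulated in plain Java with no Hadoop dependency. This is a minimal sketch with hypothetical names (MapReduceSketch, map, reduce), not Hadoop's actual API: the map step visits every record exactly once, which is where the linear cost comes from.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MapReduceSketch {
    // Map step: emits one (word, 1) pair per token in each record.
    // The outer loop runs once per record — the O(n) work.
    static List<Map.Entry<String, Integer>> map(List<String> records) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String record : records) {
            for (String word : record.split("\\s+")) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Reduce step: sums the counts for each key.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> lake = List.of("a b", "b c", "a a");
        System.out.println(reduce(map(lake))); // prints {a=3, b=2, c=1}
    }
}
```

The real Hadoop job distributes this same pattern across machines, but each record is still read and mapped once.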

Identify Repeating Operations

Look at what repeats as data size grows.

  • Primary operation: Reading and processing each data record from the centralized storage.
  • How many times: Once for each record stored in the data lake.
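The "once per record" count can be made concrete with a small plain-Java sketch (hypothetical class and method names, no Hadoop dependency): one read-and-map operation runs per record, so the operation count equals the record count.

```java
import java.util.List;

public class RepeatCount {
    // Processes each record once and returns how many operations ran.
    static int processAll(List<String> records) {
        int operations = 0;
        for (String record : records) {
            operations++; // one read-and-map operation per record
        }
        return operations;
    }

    public static void main(String[] args) {
        List<String> lake = List.of("rec1", "rec2", "rec3");
        System.out.println(processAll(lake)); // prints 3 — once per record
    }
}
```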
How Execution Grows With Input

As the data lake grows, the job reads and processes more records.

Input Size (n)    Approx. Operations
10                10 reads and map operations
100               100 reads and map operations
1000              1000 reads and map operations

Pattern observation: The work grows directly with the number of records; doubling data doubles work.
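The table's pattern can be reproduced programmatically. This sketch (hypothetical names, assuming one simulated operation per record) shows the operation count tracking n exactly, and doubling the input doubling the work:

```java
public class LinearGrowth {
    // Counts simulated read-and-map operations for n records.
    static long opsFor(long n) {
        long ops = 0;
        for (long i = 0; i < n; i++) {
            ops++; // one operation per record
        }
        return ops;
    }

    public static void main(String[] args) {
        for (long n : new long[] {10, 100, 1000}) {
            System.out.println(n + " records -> " + opsFor(n) + " operations");
        }
        // Doubling the data doubles the work:
        System.out.println(opsFor(2000) / opsFor(1000)); // prints 2
    }
}
```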

Final Time Complexity

Time Complexity: O(n)

This means the time to process the data grows linearly with the amount of data centralized in the lake: doubling the number of records roughly doubles the runtime.

Common Mistake

[X] Wrong: "Centralizing data means processing time stays the same no matter how much data there is."

[OK] Correct: Even though data is centralized, the job still reads and processes every record, so more data means more work.

Interview Connect

Understanding how data size affects processing time in a data lake helps you explain real-world data workflows clearly and confidently.

Self-Check

"What if the data lake was split into multiple smaller lakes instead of centralized? How would the time complexity change?"