Small files problem and solutions in Hadoop - Time & Space Complexity
When Hadoop processes many small files, it spends extra time managing each file separately.
We want to know how this extra work grows as the number of small files increases.
Analyze the time complexity of reading many small files in Hadoop.
// Hadoop (Java) code to list and read each small file individually
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
FileStatus[] files = fs.listStatus(new Path("/input/smallfiles"));
for (FileStatus file : files) {
    // each open() is a separate NameNode lookup -- this is the per-file overhead
    FSDataInputStream in = fs.open(file.getPath());
    // ... read file content ...
    in.close();
}
This code lists all small files in a directory and reads each one separately.
Identify the repeated operations: loops, recursion, or array traversals.
- Primary operation: Loop over each small file to open and read it.
- How many times: Once for each small file in the directory.
As the number of small files grows, the time to open and read each file adds up.
| Input Size (number of files) | Approx. Operations (file opens) |
|---|---|
| 10 | 10 |
| 100 | 100 |
| 1000 | 1000 |
Pattern observation: The total work grows directly with the number of files.
Time Complexity: O(n)
This means the time to process all files grows linearly as the number of small files increases.
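The linear growth can be sketched with a small, self-contained cost model in plain Java (no Hadoop needed). The overhead and read costs below are illustrative assumptions, not measured values; the point is only that total cost is n times a per-file constant, i.e. O(n):

```java
public class SmallFilesCost {
    // Hypothetical per-file open overhead (NameNode lookup + stream setup), in ms
    static final long OPEN_OVERHEAD_MS = 5;
    // Hypothetical cost to read the bytes of one small file, in ms
    static final long READ_MS = 1;

    // Total cost of reading n small files: n * (overhead + read) -> O(n)
    static long totalCostMs(int n) {
        long total = 0;
        for (int i = 0; i < n; i++) {
            total += OPEN_OVERHEAD_MS + READ_MS;
        }
        return total;
    }

    public static void main(String[] args) {
        for (int n : new int[] {10, 100, 1000}) {
            System.out.println(n + " files -> " + totalCostMs(n) + " ms");
        }
    }
}
```

Doubling the number of files doubles the total cost, matching the pattern in the table above.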
[X] Wrong: "Reading many small files is as fast as reading one big file of the same total size."
[OK] Correct: Each file open incurs fixed overhead (a NameNode lookup and stream setup), so reading many small files does far more work than reading one large file of the same total size.
Understanding how small files affect Hadoop helps you explain real-world data challenges clearly and confidently.
"What if we combined small files into fewer larger files before processing? How would the time complexity change?"
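One way to reason about this question is a hedged cost sketch (plain Java, with the same illustrative overhead constants as before, not measured values): if n records are packed into k container files, the open overhead is paid only k times instead of n times. Total work is still linear in the data, but the per-file constant shrinks dramatically:

```java
public class CombinedFilesCost {
    // Hypothetical per-file open overhead, in ms
    static final long OPEN_OVERHEAD_MS = 5;
    // Hypothetical cost to read one record's worth of data, in ms
    static final long READ_MS = 1;

    // n records spread across k files: pay open overhead k times, read n records
    static long costMs(int n, int k) {
        return k * OPEN_OVERHEAD_MS + n * READ_MS;
    }

    public static void main(String[] args) {
        System.out.println("1000 separate files:     " + costMs(1000, 1000) + " ms");
        System.out.println("1000 records in 1 file:  " + costMs(1000, 1) + " ms");
    }
}
```

The asymptotic class stays O(n) in the total data, but the dominant term changes from n file opens to n record reads, which is why Hadoop solutions such as SequenceFiles or HAR files combine small files before processing.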