In Hadoop, why is having many small files a problem?
Think about how Hadoop manages file information in its master node.
Hadoop's NameNode keeps metadata for every file and block in memory. Many small files mean many metadata entries, which can exhaust the NameNode's heap.
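A rough back-of-envelope calculation makes the memory pressure concrete. The sketch below assumes the commonly cited rule of thumb of roughly 150 bytes of NameNode heap per metadata object (file or block); the exact figure varies by Hadoop version and configuration.

```python
# Rough estimate of NameNode heap consumed by file metadata.
# The ~150 bytes per object is a commonly cited rule of thumb,
# not an exact figure.
BYTES_PER_OBJECT = 150  # assumed heap cost per file or block entry

def namenode_heap_mb(num_files: int, blocks_per_file: int = 1) -> float:
    """Approximate NameNode heap (MB) used by file + block metadata."""
    objects = num_files * (1 + blocks_per_file)  # one inode + its blocks
    return objects * BYTES_PER_OBJECT / 1024 / 1024

# 10 million 1 MB files: each file costs a full metadata entry even
# though it is far smaller than a typical 128 MB block.
print(f"{namenode_heap_mb(10_000_000):.0f} MB")  # ~2861 MB of heap
```

The point of the estimate: heap usage scales with the *number* of files, not their total size, so ten million tiny files cost as much NameNode memory as ten million huge ones.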
What is the output of the following Hadoop command?
hadoop archive -archiveName smallfiles.har -p /user/hadoop/smallfiles /user/hadoop/harfiles
Assuming /user/hadoop/smallfiles contains 1000 small files.
HAR archives combine files logically without deleting originals.
The command launches a MapReduce job and produces a single archive, /user/hadoop/harfiles/smallfiles.har, that logically groups the 1000 small files. This reduces NameNode metadata load; the original files are not deleted.
Given a Hadoop job configured with CombineFileInputFormat to process 1000 small files of 1MB each, what is the expected number of input splits?
CombineFileInputFormat groups small files into splits based on size limits.
CombineFileInputFormat merges small files into fewer splits to reduce overhead. With a maximum split size of 128MB, 1000 files of 1MB each (1000MB total) form about ⌈1000/128⌉ = 8 splits, possibly a few more depending on node and rack locality grouping.
A MapReduce job processing many small files runs very slowly. Which of the following is the most likely cause?
Think about how many tasks are created for many small files.
With the default FileInputFormat, each small file gets its own map task, so many small files mean many map tasks. Per-task scheduling and JVM startup overhead then dominates the job's runtime.
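To see why per-task overhead dominates, here is an illustrative model. Both constants below are assumptions chosen for the sketch; real values depend on cluster configuration (JVM reuse, YARN container startup, and so on):

```python
# Illustrative only: the overhead and throughput figures are
# assumptions, not measured Hadoop values.
TASK_OVERHEAD_S = 2.0   # assumed scheduling + JVM startup per map task
PROCESS_RATE_MB_S = 50  # assumed per-task processing throughput

def job_task_seconds(num_files: int, file_mb: float, tasks: int) -> float:
    """Total task-seconds: startup overhead plus actual data processing."""
    data_s = num_files * file_mb / PROCESS_RATE_MB_S
    return tasks * TASK_OVERHEAD_S + data_s

small = job_task_seconds(1000, 1, tasks=1000)  # one task per 1 MB file
combined = job_task_seconds(1000, 1, tasks=8)  # combined splits
print(f"1000 tasks: {small:.0f}s, 8 tasks: {combined:.0f}s")
```

Under these assumptions, 1000 tasks spend 2000 task-seconds on startup to do 20 task-seconds of real work; with 8 combined splits the overhead nearly disappears.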
You have millions of small log files generated daily. You want to optimize Hadoop processing and reduce NameNode memory usage. Which approach is best?
Think about combining files logically before processing.
Merging the small files into larger SequenceFiles (typically keyed by filename) reduces the number of files and metadata entries, improving NameNode efficiency and processing speed.
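In practice this merging is done with Hadoop's SequenceFile Java API. The Python sketch below illustrates only the *idea*: pack many small files into one container of <filename, contents> pairs. The JSON-lines format used here is hypothetical and is not the actual SequenceFile binary format.

```python
import json
import os
import tempfile

def pack_files(paths, out_path):
    """Pack many small files into one container file, keyed by
    filename (the same idea as a SequenceFile of <name, contents>
    records; the JSON-lines layout is just for illustration)."""
    with open(out_path, "w") as out:
        for p in paths:
            with open(p) as f:
                record = {"key": os.path.basename(p), "value": f.read()}
            out.write(json.dumps(record) + "\n")

def unpack(out_path):
    """Read the container back as (filename, contents) pairs."""
    with open(out_path) as f:
        return [(r["key"], r["value"])
                for r in (json.loads(line) for line in f)]

# Demo: three tiny "log files" become one file, i.e. one NameNode
# metadata entry instead of three.
tmp = tempfile.mkdtemp()
paths = []
for i in range(3):
    p = os.path.join(tmp, f"log{i}.txt")
    with open(p, "w") as f:
        f.write(f"event {i}\n")
    paths.append(p)

packed = os.path.join(tmp, "packed.seqlike")
pack_files(paths, packed)
print(unpack(packed)[0])  # ('log0.txt', 'event 0\n')
```

The key design point carries over to the real SequenceFile approach: downstream map tasks stream one large file sequentially, and the original filenames survive as record keys.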