
Small files problem and solutions in Hadoop - Practice Problems & Coding Challenges

Challenge - 5 Problems
🧠 Conceptual
intermediate
Why does the small files problem occur in Hadoop?

In Hadoop, why is having many small files a problem?

A. Because each file requires metadata storage in the NameNode, causing memory overload.
B. Because small files are automatically merged by Hadoop, causing data loss.
C. Because small files increase the block size, reducing storage efficiency.
D. Because small files are compressed by default, slowing down processing.
💡 Hint

Think about how Hadoop manages file information in its master node.
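The hint can be made concrete with back-of-the-envelope arithmetic. The ~150 bytes per namespace object used below is a widely quoted rule of thumb, not an exact constant; the real cost varies by Hadoop version and JVM settings:

```python
# Rough NameNode heap model: each file inode and each block object
# costs on the order of 150 bytes of NameNode memory (rule of thumb,
# not an exact figure).
BYTES_PER_OBJECT = 150

def namenode_bytes(num_files: int, blocks_per_file: int = 1) -> int:
    """Approximate NameNode memory consumed by num_files files."""
    # One inode object per file plus one object per block.
    return num_files * (1 + blocks_per_file) * BYTES_PER_OBJECT

# 10 million 1 MB files (one block each) vs. the same 10 TB-ish of
# data repacked into ~78,125 files of 128 MB (one block each).
small = namenode_bytes(10_000_000)
large = namenode_bytes(78_125)
print(f"{small / 2**30:.1f} GiB vs {large / 2**20:.1f} MiB")
```

The data volume is identical in both cases; only the file count changes, and with it the NameNode's heap footprint, by two orders of magnitude.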

Predict Output
intermediate
Output of merging small files using Hadoop Archive (HAR)

What is the output of the following Hadoop command?

hadoop archive -archiveName smallfiles.har -p /user/hadoop/smallfiles /user/hadoop/harfiles

Assuming /user/hadoop/smallfiles contains 1000 small files.

A. The small files remain unchanged; no archive is created.
B. All small files are deleted and replaced by a single large file in /user/hadoop/harfiles.
C. A HAR file named smallfiles.har is created in /user/hadoop/harfiles containing all small files.
D. The command fails because HAR does not support more than 500 files.
💡 Hint

HAR archives combine files logically without deleting originals.
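The core idea behind HAR, an index of offsets over concatenated part data, can be illustrated with a toy Python sketch. This is not the real HAR on-disk layout (which uses `_index`/`_masterindex` files and lives on HDFS); it only shows that archiving reads the originals without deleting them:

```python
import io

def build_har_like_archive(files: dict) -> tuple:
    """Pack small files into one 'part' blob plus an offset index.
    Toy illustration of the HAR idea: sources are read, never deleted."""
    part = io.BytesIO()
    index = {}                          # name -> (offset, length)
    for name, data in files.items():
        index[name] = (part.tell(), len(data))
        part.write(data)
    return index, part.getvalue()

def read_from_archive(index: dict, part: bytes, name: str) -> bytes:
    """Random access into the packed blob via the index."""
    off, length = index[name]
    return part[off:off + length]

# 1000 tiny 'files' become one blob plus a small index.
files = {f"log{i}.txt": f"record {i}\n".encode() for i in range(1000)}
index, part = build_har_like_archive(files)
print(read_from_archive(index, part, "log42.txt"))  # b'record 42\n'
```

In real Hadoop, the archived files are then addressed through the `har://` filesystem scheme; the 1000 namespace entries collapse into a handful of archive objects on the NameNode.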

📊 Data Output
advanced
Result of using CombineFileInputFormat on small files

Given a Hadoop job configured with CombineFileInputFormat to process 1000 small files of 1MB each, what is the expected number of input splits?

A. No input splits are created; job fails to start.
B. Exactly 1000 input splits, one per file.
C. One input split containing all 1000 files combined.
D. Approximately 10 input splits, each combining about 100MB of data.
💡 Hint

CombineFileInputFormat groups small files into splits based on size limits.
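The hint's grouping rule can be sketched numerically. The 100 MB cap below stands in for `mapreduce.input.fileinputformat.split.maxsize`, which is a job-level setting you choose, not a fixed default, and the model ignores the node/rack locality grouping that can add a few extra splits:

```python
import math

def combine_split_count(num_files: int, file_mb: int, max_split_mb: int) -> int:
    """Files are packed into a split until the size cap is reached,
    then a new split starts (locality grouping ignored for simplicity)."""
    files_per_split = max_split_mb // file_mb
    return math.ceil(num_files / files_per_split)

# 1000 files x 1 MB with a 100 MB split cap -> 10 splits,
# versus 1000 splits (one per file) under plain FileInputFormat.
print(combine_split_count(1000, 1, 100))  # -> 10
```

Ten map tasks over ~100 MB each is far cheaper to schedule than a thousand tasks over 1 MB each, which is the whole point of the format.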

🔧 Debug
advanced
Identify the cause of slow MapReduce job with many small files

A MapReduce job processing many small files runs very slowly. Which of the following is the most likely cause?

A. The small files are compressed, causing decompression delays.
B. Too many map tasks are created, causing overhead in task scheduling and startup.
C. The job uses CombineFileInputFormat, which increases the number of splits.
D. The NameNode is running out of disk space due to large files.
💡 Hint

Think about how many tasks are created for many small files.
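A rough model of the cost the hint points to: each map task pays a fixed scheduling and startup price regardless of how little data it processes. The one-second figure below is illustrative; real container/JVM startup costs vary by cluster:

```python
def job_overhead_seconds(num_tasks: int, per_task_overhead_s: float = 1.0) -> float:
    """Fixed startup/scheduling cost paid once per map task,
    independent of how much data each task actually reads."""
    return num_tasks * per_task_overhead_s

# 10,000 small files -> 10,000 map tasks under the default
# one-split-per-file behaviour, vs ~80 tasks after combining
# the same data into 128 MB splits.
print(job_overhead_seconds(10_000))  # 10000.0 seconds of pure overhead
print(job_overhead_seconds(80))      # 80.0 seconds
```

When per-task overhead rivals or exceeds per-task useful work, the job's wall-clock time is dominated by scheduling, not by I/O or computation.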

🚀 Application
expert
Best approach to handle millions of small log files in Hadoop

You have millions of small log files generated daily. You want to optimize Hadoop processing and reduce NameNode memory usage. Which approach is best?

A. Use a daily batch job to merge small files into larger sequence files before processing.
B. Increase the NameNode heap size to handle more metadata entries.
C. Process files as-is using default TextInputFormat to avoid data loss.
D. Delete small files older than one day to reduce file count.
💡 Hint

Think about combining files logically before processing.
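The usual implementation of a daily merge job writes (filename, contents) pairs into a Hadoop SequenceFile via `SequenceFile.Writer`. The plain-Python sketch below mimics only that filename-as-key merge pattern on the local filesystem, so the idea can be seen without a cluster; the container format (`pickle`) is a stand-in, not the SequenceFile format:

```python
import pathlib
import pickle
import tempfile

def merge_logs(log_dir: str, out_file: str) -> int:
    """Merge many small .log files into one container of
    (filename, bytes) records: the SequenceFile key/value pattern."""
    records = []
    for p in sorted(pathlib.Path(log_dir).glob("*.log")):
        records.append((p.name, p.read_bytes()))
    with open(out_file, "wb") as f:
        pickle.dump(records, f)
    return len(records)

# Demo: merge five tiny log files from a scratch directory.
scratch = pathlib.Path(tempfile.mkdtemp())
for i in range(5):
    (scratch / f"day1-{i}.log").write_bytes(b"GET /index\n")
print(merge_logs(str(scratch), str(scratch / "merged.bin")))  # -> 5
```

Run daily, this turns millions of NameNode entries into a handful of large, splittable files, attacking both the metadata pressure and the map-task explosion at once, which is why consolidation beats simply growing the NameNode heap.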