In Hadoop, why is having many small files a problem?
Think about how Hadoop manages file information in its master node.
Hadoop's NameNode keeps metadata for every file and block in memory. Many small files mean many metadata entries, which can exhaust the NameNode's heap.
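A rough back-of-envelope calculation makes the memory pressure concrete. The sketch below assumes the commonly cited rule of thumb of roughly 150 bytes of NameNode heap per metadata object (file or block); the exact figure varies by Hadoop version and configuration.

```python
# Rough estimate of NameNode heap consumed by file metadata.
# The ~150 bytes per object is a commonly cited rule of thumb,
# not an exact figure.
BYTES_PER_OBJECT = 150  # assumed heap cost per file or block entry

def namenode_heap_mb(num_files: int, blocks_per_file: int = 1) -> float:
    """Approximate NameNode heap (MB) used by file + block metadata."""
    objects = num_files * (1 + blocks_per_file)  # one inode + its blocks
    return objects * BYTES_PER_OBJECT / 1024 / 1024

# 10 million 1 MB files: each file costs a full metadata entry even
# though it is far smaller than a typical 128 MB block.
print(f"{namenode_heap_mb(10_000_000):.0f} MB")  # ~2861 MB of heap
```

The point of the estimate: heap usage scales with the *number* of files, not their total size, so ten million tiny files cost as much NameNode memory as ten million huge ones.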
What is the output of the following Hadoop command?
hadoop archive -archiveName smallfiles.har -p /user/hadoop/smallfiles /user/hadoop/harfiles
Assuming /user/hadoop/smallfiles contains 1000 small files.
HAR archives combine files logically without deleting originals.
The command launches a MapReduce job and produces a single archive, /user/hadoop/harfiles/smallfiles.har, that logically groups the 1000 small files. This reduces NameNode metadata load; the original files are not deleted.
Given a Hadoop job configured with CombineFileInputFormat to process 1000 small files of 1MB each, what is the expected number of input splits?
CombineFileInputFormat groups small files into splits based on size limits.
CombineFileInputFormat merges small files into fewer splits to reduce overhead. With a maximum split size of 128MB, 1000 files of 1MB each (1000MB total) form about ⌈1000/128⌉ = 8 splits, possibly a few more depending on node and rack locality grouping.
A MapReduce job processing many small files runs very slowly. Which of the following is the most likely cause?
Think about how many tasks are created for many small files.
With the default FileInputFormat, each small file gets its own map task, so many small files mean many map tasks. Per-task scheduling and JVM startup overhead then dominates the job's runtime.
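To see why per-task overhead dominates, here is an illustrative model. Both constants below are assumptions chosen for the sketch; real values depend on cluster configuration (JVM reuse, YARN container startup, and so on):

```python
# Illustrative only: the overhead and throughput figures are
# assumptions, not measured Hadoop values.
TASK_OVERHEAD_S = 2.0   # assumed scheduling + JVM startup per map task
PROCESS_RATE_MB_S = 50  # assumed per-task processing throughput

def job_task_seconds(num_files: int, file_mb: float, tasks: int) -> float:
    """Total task-seconds: startup overhead plus actual data processing."""
    data_s = num_files * file_mb / PROCESS_RATE_MB_S
    return tasks * TASK_OVERHEAD_S + data_s

small = job_task_seconds(1000, 1, tasks=1000)  # one task per 1 MB file
combined = job_task_seconds(1000, 1, tasks=8)  # combined splits
print(f"1000 tasks: {small:.0f}s, 8 tasks: {combined:.0f}s")
```

Under these assumptions, 1000 tasks spend 2000 task-seconds on startup to do 20 task-seconds of real work; with 8 combined splits the overhead nearly disappears.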
You have millions of small log files generated daily. You want to optimize Hadoop processing and reduce NameNode memory usage. Which approach is best?
Think about combining files logically before processing.
Merging the small files into larger SequenceFiles (typically keyed by filename) reduces the number of files and metadata entries, improving NameNode efficiency and processing speed.
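In practice this merging is done with Hadoop's SequenceFile Java API. The Python sketch below illustrates only the *idea*: pack many small files into one container of <filename, contents> pairs. The JSON-lines format used here is hypothetical and is not the actual SequenceFile binary format.

```python
import json
import os
import tempfile

def pack_files(paths, out_path):
    """Pack many small files into one container file, keyed by
    filename (the same idea as a SequenceFile of <name, contents>
    records; the JSON-lines layout is just for illustration)."""
    with open(out_path, "w") as out:
        for p in paths:
            with open(p) as f:
                record = {"key": os.path.basename(p), "value": f.read()}
            out.write(json.dumps(record) + "\n")

def unpack(out_path):
    """Read the container back as (filename, contents) pairs."""
    with open(out_path) as f:
        return [(r["key"], r["value"])
                for r in (json.loads(line) for line in f)]

# Demo: three tiny "log files" become one file, i.e. one NameNode
# metadata entry instead of three.
tmp = tempfile.mkdtemp()
paths = []
for i in range(3):
    p = os.path.join(tmp, f"log{i}.txt")
    with open(p, "w") as f:
        f.write(f"event {i}\n")
    paths.append(p)

packed = os.path.join(tmp, "packed.seqlike")
pack_files(paths, packed)
print(unpack(packed)[0])  # ('log0.txt', 'event 0\n')
```

The key design point carries over to the real SequenceFile approach: downstream map tasks stream one large file sequentially, and the original filenames survive as record keys.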