
Small files problem and solutions in Hadoop - Cheat Sheet & Quick Revision

Recall & Review
beginner
What is the 'small files problem' in Hadoop?
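It arises when HDFS holds many files that are much smaller than the block size (128 MB by default in recent versions) instead of fewer large files. Every file carries its own metadata and per-file processing cost, so the overhead grows with the number of files rather than the volume of data, which slows down processing.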
beginner
Why do many small files cause performance issues in Hadoop?
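Because the NameNode keeps the metadata for every file and block in memory, many small files increase its memory use and slow down file access. Job execution suffers as well, since MapReduce by default schedules at least one map task per file.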
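
To make the metadata cost concrete, here is a rough back-of-the-envelope sketch. It assumes about 150 bytes of NameNode heap per file or block object (a commonly quoted rule of thumb, not an exact figure) and a 128 MB block size; the function name `namenode_bytes` is hypothetical and only serves this illustration.

```python
# Rough NameNode heap estimate. The 150 bytes-per-object figure is an assumed
# rule of thumb, and the block size is the assumed 128 MB default.
BYTES_PER_OBJECT = 150
BLOCK_SIZE = 128 * 1024 * 1024


def namenode_bytes(num_files: int, file_size: int) -> int:
    """Estimate metadata bytes: one object per file plus one per block."""
    blocks_per_file = max(1, -(-file_size // BLOCK_SIZE))  # ceiling division
    return num_files * (1 + blocks_per_file) * BYTES_PER_OBJECT


total_data = 1024 ** 4          # 1 TiB of data in both scenarios
one_mib = 1024 ** 2
one_gib = 1024 ** 3

small = namenode_bytes(total_data // one_mib, one_mib)  # about a million 1 MiB files
large = namenode_bytes(total_data // one_gib, one_gib)  # 1024 files of 1 GiB each

print(f"1 MiB files: {small / one_mib:.1f} MiB of metadata")
print(f"1 GiB files: {large / one_mib:.1f} MiB of metadata")
```

Under these assumptions the same terabyte of data needs a few hundred times more NameNode memory when it is stored as 1 MiB files, which is the core of the problem.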
beginner
Name one common solution to the small files problem.
Combine small files into larger ones, for example with a Hadoop Archive (HAR) or the SequenceFile format, to reduce the number of files the NameNode has to track.
intermediate
What is Hadoop Archive (HAR) and how does it help?
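HAR packs many small files into a single archive made up of a few large part files plus an index. This cuts the NameNode's metadata overhead while keeping each file accessible through the har:// scheme, although reads go through the index and can be slightly slower than reading a plain HDFS file.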
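
One way to picture how an archive can shrink the file count while keeping each file individually readable is an index of offsets into one shared blob. The sketch below is a simplified illustration in plain Python, not the real HAR on-disk layout; `build_archive` and `read_member` are hypothetical helpers.

```python
from typing import Dict, Tuple

# Simplified illustration only: a real HAR has its own index and part-file
# layout on HDFS. Both helpers below are hypothetical.
Index = Dict[str, Tuple[int, int]]


def build_archive(files: Dict[str, bytes]) -> Tuple[bytes, Index]:
    """Concatenate contents into one blob and record (offset, length) per name."""
    part = bytearray()
    index: Index = {}
    for name, content in files.items():
        index[name] = (len(part), len(content))
        part.extend(content)
    return bytes(part), index


def read_member(part: bytes, index: Index, name: str) -> bytes:
    """Fetch one original file by name using the index, without a full scan."""
    offset, length = index[name]
    return part[offset:offset + length]


files = {"logs/a.txt": b"alpha", "logs/b.txt": b"bravo!", "logs/c.txt": b"c"}
part, index = build_archive(files)
print(read_member(part, index, "logs/b.txt"))
```

Three logical files now live in one physical blob, so the storage layer tracks a single large object plus a small index, at the cost of one extra lookup per read.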
intermediate
How does using SequenceFile format solve the small files problem?
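A SequenceFile stores many small files as key-value pairs in one large file, typically with the file name as the key and the file content as the value. This reduces the number of files and lets jobs read the data sequentially and efficiently.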
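
The key-value idea can be sketched with length-prefixed records. This is a minimal illustration of the concept, assuming file names as keys and raw bytes as values; it does not reproduce the actual SequenceFile binary format, and `pack` and `unpack` are hypothetical names.

```python
import io
import struct
from typing import Dict, Iterator, Tuple

# Concept sketch only: the real SequenceFile format has headers, sync markers
# and optional compression that are not modelled here.


def pack(files: Dict[str, bytes]) -> bytes:
    """Write each (name, content) pair as one length-prefixed record."""
    out = io.BytesIO()
    for name, content in files.items():
        key = name.encode("utf-8")
        out.write(struct.pack(">II", len(key), len(content)))  # two 4-byte lengths
        out.write(key)
        out.write(content)
    return out.getvalue()


def unpack(blob: bytes) -> Iterator[Tuple[str, bytes]]:
    """Read the records back sequentially as (name, content) pairs."""
    view = io.BytesIO(blob)
    while True:
        header = view.read(8)
        if not header:
            break
        key_len, value_len = struct.unpack(">II", header)
        key = view.read(key_len).decode("utf-8")
        yield key, view.read(value_len)


small_files = {"a.log": b"first", "b.log": b"second", "c.log": b""}
container = pack(small_files)
restored = dict(unpack(container))
print(sorted(restored))
```

A job that reads `container` makes one sequential pass over one file instead of opening three, which is the effect the SequenceFile approach aims for.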
What causes the small files problem in Hadoop?
A. Using too much memory for large files
B. Not enough disk space on DataNodes
C. Running too many MapReduce jobs simultaneously
D. Storing many tiny files instead of fewer large files
Which Hadoop component struggles with many small files?
A. DataNode
B. NameNode
C. ResourceManager
D. JobTracker
Which tool can combine small files into a single archive in Hadoop?
A. Sqoop
B. Pig
C. Hadoop Archive (HAR)
D. Hive
SequenceFile format stores data as:
A. Key-value pairs in one large file
B. Separate small files
C. Compressed text files
D. Binary blobs without structure
What is a main benefit of solving the small files problem?
A. Improved Hadoop job performance
B. More DataNodes required
C. Increased network traffic
D. Slower file access
Explain the small files problem in Hadoop and why it affects performance.
Think about how Hadoop handles file metadata and what happens when there are many tiny files.
Describe two solutions to the small files problem and how they help.
Focus on how combining files reduces the number of files Hadoop manages.