Hard · Application · Q15 of 15
Hadoop - Performance Tuning
You have 10,000 small log files in HDFS causing slow MapReduce jobs. Which combined approach best solves the small files problem while keeping data query-friendly?
A. Convert small files into a single Parquet file using a Spark job
B. Merge files into one large text file using hadoop fs -getmerge
C. Delete half the files to reduce count
D. Keep files as-is and increase cluster size
Step-by-Step Solution
Solution:
  1. Step 1: Evaluate merging methods

    Merging into one large text file (Option B) reduces the file count and NameNode metadata load, but plain text has no schema, no compression, and no columnar layout, so queries stay inefficient.
  2. Step 2: Consider Parquet format benefits

    Parquet is a columnar, compressed format. A Spark job can read all 10,000 small files and write them back as one (or a few) Parquet files, cutting NameNode metadata and task-scheduling overhead while keeping the data schema-aware and fast to query.
  3. Step 3: Assess other options

    Deleting files (Option C) destroys data, and adding nodes (Option D) does nothing about the per-file NameNode memory cost and per-split task-launch overhead that make small files slow.
  4. Final Answer:

    Convert small files into a single Parquet file using a Spark job -> Option A
  5. Quick Check:

    Parquet consolidates files and boosts query performance ✓
Quick Trick: Use Parquet format with Spark to merge and optimize small files ✓
Common Mistakes:
  • Merging into plain text loses query benefits
  • Deleting files causes data loss
  • Scaling cluster ignores file overhead
