Hard · Application · Q15 of 15
Hadoop - Performance Tuning
You have 10,000 small log files in HDFS causing slow MapReduce jobs. Which combined approach best solves the small files problem while keeping data query-friendly?
A. Convert small files into a single Parquet file using a Spark job
B. Merge files into one large text file using hadoop fs -getmerge
C. Delete half the files to reduce count
D. Keep files as-is and increase cluster size
Step-by-Step Solution
Solution:
  1. Step 1: Evaluate merging methods

    Merging into one large text file (Option B) reduces the file count and NameNode metadata load, but plain text has no schema, no compression, and no columnar layout, so queries stay inefficient.
  2. Step 2: Consider Parquet format benefits

    Parquet is a columnar, compressed format. A Spark job can read all 10,000 small files and write them back as one (or a few) Parquet files, cutting NameNode metadata and task-scheduling overhead while keeping the data schema-aware and fast to query.
  3. Step 3: Assess other options

    Deleting files (Option C) destroys data, and adding nodes (Option D) does nothing about the per-file NameNode memory cost and per-split task-launch overhead that make small files slow.
  4. Final Answer:

    Convert small files into a single Parquet file using a Spark job -> Option A
  5. Quick Check:

    Parquet consolidates files and boosts query performance ✓
Quick Trick: Use Parquet format with Spark to merge and optimize small files ✓
Common Mistakes:
  • Merging into plain text loses query benefits
  • Deleting files causes data loss
  • Scaling cluster ignores file overhead
