
Small files problem and solutions in Hadoop - Mini Project: Build & Apply

Small Files Problem and Solutions in Hadoop
📖 Scenario: Imagine you work at a company that collects many small log files every day. These files are stored in Hadoop's HDFS. But having too many small files slows down processing and wastes NameNode memory, since every file consumes metadata regardless of its size. We want to learn how to handle this small files problem by combining small files into bigger ones for better performance.
🎯 Goal: You will build a simple simulation of what a Hadoop merge job does: take multiple small files, combine them into larger chunks, and report the merged outputs. This will help you understand the small files problem and one common solution: file merging.
📋 What You'll Learn
Create a dictionary of small file names with sample sizes
Set a configuration variable for minimum combined file size
Write a function to merge small files into bigger chunks
Print the names and sizes of merged files
💡 Why This Matters
🌍 Real World
Companies often collect many small data files from sensors, logs, or user uploads. Storing and processing so many small files in Hadoop slows down jobs and wastes storage and NameNode memory.
💼 Career
Data engineers and Hadoop administrators must solve the small files problem to optimize big data pipelines and improve cluster performance.
1
Create a dictionary of small files with their sizes
Create a dictionary called small_files with these exact entries: 'file1.txt': 100, 'file2.txt': 150, 'file3.txt': 80, 'file4.txt': 120, 'file5.txt': 90. The numbers represent file sizes in KB.
Hint: Use curly braces {} to create a dictionary with file names as keys and sizes as values.
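A minimal sketch of this step in Python (the sizes are the exact sample values given above; they stand in for real file sizes in KB):

```python
# Dictionary of small files and their sizes in KB, as specified in Step 1.
small_files = {
    'file1.txt': 100,
    'file2.txt': 150,
    'file3.txt': 80,
    'file4.txt': 120,
    'file5.txt': 90,
}
```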

2
Set a minimum combined file size for merging
Create a variable called min_combined_size and set it to 200 (KB). This will be the target size to merge small files.
Hint: Just assign the number 200 to the variable min_combined_size.
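This step is a single assignment. The threshold acts like a simplified version of a real merge target (in production you would aim for something near the HDFS block size, e.g. 128 MB, but 200 KB keeps the exercise small):

```python
# Target minimum size (in KB) for each merged chunk.
min_combined_size = 200
```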

3
Write code to merge small files into bigger chunks
Write code that creates a list called merged_files. Use a loop to combine files from small_files so that each merged file has a total size at least min_combined_size. Store each merged file as a tuple of (list_of_file_names, total_size).
Hint: Use a loop over small_files.items(), keep adding files until the size reaches min_combined_size, then start a new group.
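One possible implementation of the hint above, assuming the small_files dictionary and min_combined_size variable from Steps 1 and 2 (dictionaries preserve insertion order in Python 3.7+, so files are grouped in the order they were listed):

```python
small_files = {'file1.txt': 100, 'file2.txt': 150, 'file3.txt': 80,
               'file4.txt': 120, 'file5.txt': 90}
min_combined_size = 200

merged_files = []                      # list of (list_of_file_names, total_size) tuples
current_group, current_size = [], 0

for name, size in small_files.items():
    current_group.append(name)
    current_size += size
    if current_size >= min_combined_size:
        # Group has reached the target size: close it and start a new one.
        merged_files.append((current_group, current_size))
        current_group, current_size = [], 0

# Leftover files that never reached the threshold still form a final group.
if current_group:
    merged_files.append((current_group, current_size))
```

With the sample sizes this produces three groups: file1+file2 (250 KB), file3+file4 (200 KB), and a leftover group with file5 (90 KB). How to handle that undersized tail is a design choice; here it is simply kept as its own group.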

4
Print the merged files and their total sizes
Write a for loop to print each merged file group from merged_files. For each group, print the list of file names and the total combined size in KB.
Hint: Use a for loop over merged_files and print the file list and size using an f-string.
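A sketch of the final reporting step, assuming merged_files holds the tuples built in Step 3 (the exact message wording is just one possible format):

```python
merged_files = [(['file1.txt', 'file2.txt'], 250),
                (['file3.txt', 'file4.txt'], 200),
                (['file5.txt'], 90)]

# Build one report line per merged group, then print them.
report = [f"Merged group: {files} -> total size: {total} KB"
          for files, total in merged_files]
for line in report:
    print(line)
```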