Hadoop · Data · ~30 mins

Why MapReduce parallelizes data processing in Hadoop - See It in Action

Why MapReduce Parallelizes Data Processing
📖 Scenario: Imagine you have a huge list of sales records from many stores. You want to find out how many sales happened in each city. Doing this alone on one computer would take a long time. MapReduce helps by splitting the work across many computers to finish faster.
🎯 Goal: You will create a simple MapReduce-style program that counts sales per city by splitting the data, processing the parts in parallel, and combining the results.
📋 What You'll Learn
Create a list of sales records with city names
Set a variable for the number of parts to split the data
Write a map function to count sales per city in each part
Combine the counts from all parts to get total sales per city
Print the final sales count per city
💡 Why This Matters
🌍 Real World
Companies use MapReduce to quickly analyze huge data sets like sales, web logs, or social media posts by splitting work across many computers.
💼 Career
Understanding MapReduce helps in data engineering and big data jobs where processing speed and handling large data are important.
1
Create the sales data list
Create a list called sales_data with these exact city names as strings: 'New York', 'Los Angeles', 'New York', 'Chicago', 'Los Angeles', 'Chicago', 'New York'.
Hadoop
Need a hint?

Use a Python list with the exact city names as strings inside square brackets.
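A minimal sketch of this step, using the exact city names from the task:

```python
# Sales records: one entry per sale, each entry is the city name.
sales_data = ['New York', 'Los Angeles', 'New York', 'Chicago',
              'Los Angeles', 'Chicago', 'New York']
```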

2
Set the number of data splits
Create a variable called num_splits and set it to 3 to split the sales data into three parts.
Hadoop
Need a hint?

Just assign the number 3 to the variable num_splits.
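This step is a single assignment; as a sketch:

```python
# Number of parts to split the sales data into (simulates 3 parallel workers).
num_splits = 3
```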

3
Map: Count sales per city in each split
Write code to split sales_data into num_splits parts. Then use a for loop with variable part to count sales per city within each part, using a dictionary for the counts. Store each part's dictionary in a list called mapped_counts.
Hadoop
Need a hint?

Divide the list into parts, then count cities in each part using a dictionary inside a loop.
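One way to sketch the map phase, assuming sales_data and num_splits from the earlier steps. The ceiling-division slicing used here is one reasonable way to split the list into roughly equal parts, not the only one:

```python
# From the earlier steps.
sales_data = ['New York', 'Los Angeles', 'New York', 'Chicago',
              'Los Angeles', 'Chicago', 'New York']
num_splits = 3

# Split the list into num_splits roughly equal parts (ceiling division
# so no records are dropped when the length is not evenly divisible).
part_size = -(-len(sales_data) // num_splits)
parts = [sales_data[i:i + part_size]
         for i in range(0, len(sales_data), part_size)]

# Map phase: count sales per city independently within each part.
mapped_counts = []
for part in parts:
    counts = {}
    for city in part:
        counts[city] = counts.get(city, 0) + 1
    mapped_counts.append(counts)
```

Each dictionary in mapped_counts is the output of one simulated worker; in real Hadoop, each map task would run on a different node.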

4
Reduce: Combine counts and print result
Create an empty dictionary called final_counts. Use a for loop with variable counts to go through each dictionary in mapped_counts. For each city in counts, add its count to final_counts. Finally, print final_counts.
Hadoop
Need a hint?

Use nested loops to add counts from each part into final_counts, then print it.
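A sketch of the reduce phase, assuming mapped_counts from step 3 (reproduced here so the snippet runs on its own):

```python
# Output of the map phase from step 3.
mapped_counts = [{'New York': 2, 'Los Angeles': 1},
                 {'Chicago': 2, 'Los Angeles': 1},
                 {'New York': 1}]

# Reduce phase: merge the per-part counts into one overall total.
final_counts = {}
for counts in mapped_counts:
    for city, n in counts.items():
        final_counts[city] = final_counts.get(city, 0) + n

print(final_counts)  # prints {'New York': 3, 'Los Angeles': 2, 'Chicago': 2}
```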