Apache Spark · ~30 mins

Spark vs Hadoop MapReduce: A Hands-On Comparison

Comparing Spark and Hadoop MapReduce Performance
📖 Scenario: You work as a data analyst at a company that processes large amounts of sales data. Your manager wants to understand how Apache Spark and Hadoop MapReduce handle data processing differently by comparing their performance on a simple task.
🎯 Goal: You will create a small dataset of sales records, configure a threshold for filtering, apply both Spark and Hadoop MapReduce-style filtering, and then output the filtered results to see how Spark simplifies the process.
📋 What You'll Learn
Create a list of sales records with product names and sales amounts
Set a sales threshold to filter products with sales above this value
Use Apache Spark to filter the sales data based on the threshold
Print the filtered sales records
💡 Why This Matters
🌍 Real World
Filtering large sales datasets quickly to find products with high sales is common in retail analytics.
💼 Career
Data scientists and analysts use Spark to process big data efficiently, making this skill valuable for roles in data engineering and analytics.
1. Create the sales data list

Create a list called sales_data with these exact tuples: ("apple", 50), ("banana", 30), ("orange", 70), ("grape", 20), ("mango", 90).

Hint: Use a list of tuples with product names as strings and sales amounts as integers.
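A minimal sketch of this step in plain Python:

```python
# Step 1: a list of (product, sales) tuples serves as our raw sales records
sales_data = [
    ("apple", 50),
    ("banana", 30),
    ("orange", 70),
    ("grape", 20),
    ("mango", 90),
]
```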

2. Set the sales threshold

Create a variable called threshold and set it to 40.

Hint: Just assign the number 40 to the variable named threshold.
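Alongside the threshold, here is an illustrative MapReduce-style filter written in plain Python. This is a stand-in for how a Hadoop job would express the same logic (a map phase emitting key-value pairs, then a phase that keeps qualifying pairs), not actual Hadoop API code:

```python
sales_data = [("apple", 50), ("banana", 30), ("orange", 70), ("grape", 20), ("mango", 90)]

# Step 2: only products with sales above this value will be kept
threshold = 40

# "Map" phase: emit (key, value) pairs from each record
mapped = map(lambda record: (record[0], record[1]), sales_data)

# Filtering phase: keep only pairs whose value exceeds the threshold
filtered = [pair for pair in mapped if pair[1] > threshold]

print(filtered)  # [('apple', 50), ('orange', 70), ('mango', 90)]
```

Notice how much ceremony even this simplified version needs compared to a single Spark filter call; a real Hadoop job would additionally require mapper and reducer classes, job configuration, and cluster I/O.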

3. Filter sales data using Spark

Use Apache Spark to create a SparkSession called spark. Then create a DataFrame called df from sales_data with columns product and sales. Filter df to keep rows where sales is greater than threshold and save the result as filtered_df.

Hint: Use SparkSession.builder.master("local").appName("SalesFilter").getOrCreate() to create spark. Create the DataFrame with spark.createDataFrame(sales_data, ["product", "sales"]). Filter with df.filter(df.sales > threshold).

4. Show the filtered sales results

Use filtered_df.show() to display the filtered sales records.

Hint: Call show() on filtered_df to print the filtered rows as a formatted table.