Apache Spark · ~30 mins

Spark vs Hadoop MapReduce: A Hands-On Comparison

Comparing Spark and Hadoop MapReduce Performance
📖 Scenario: You work as a data analyst at a company that processes large amounts of sales data. Your manager wants to understand how Apache Spark and Hadoop MapReduce handle data processing differently by comparing their performance on a simple task.
🎯 Goal: You will create a small dataset of sales records, configure a threshold for filtering, apply both Spark and Hadoop MapReduce-style filtering, and then output the filtered results to see how Spark simplifies the process.
📋 What You'll Learn
Create a list of sales records with product names and sales amounts
Set a sales threshold to filter products with sales above this value
Use Apache Spark to filter the sales data based on the threshold
Print the filtered sales records
💡 Why This Matters
🌍 Real World
Filtering large sales datasets quickly to find products with high sales is common in retail analytics.
💼 Career
Data scientists and analysts use Spark to process big data efficiently, making this skill valuable for roles in data engineering and analytics.
1. Create the sales data list

Create a list called sales_data with these exact tuples: ("apple", 50), ("banana", 30), ("orange", 70), ("grape", 20), ("mango", 90).

Hint: Use a list of tuples with product names as strings and sales amounts as integers.
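A minimal sketch of this step in plain Python:

```python
# Step 1: a list of (product, sales) tuples serves as our raw sales records
sales_data = [
    ("apple", 50),
    ("banana", 30),
    ("orange", 70),
    ("grape", 20),
    ("mango", 90),
]
```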

2. Set the sales threshold

Create a variable called threshold and set it to 40.

Hint: Just assign the number 40 to the variable named threshold.
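Alongside the threshold, here is an illustrative MapReduce-style filter written in plain Python. This is a stand-in for how a Hadoop job would express the same logic (a map phase emitting key-value pairs, then a phase that keeps qualifying pairs), not actual Hadoop API code:

```python
sales_data = [("apple", 50), ("banana", 30), ("orange", 70), ("grape", 20), ("mango", 90)]

# Step 2: only products with sales above this value will be kept
threshold = 40

# "Map" phase: emit (key, value) pairs from each record
mapped = map(lambda record: (record[0], record[1]), sales_data)

# Filtering phase: keep only pairs whose value exceeds the threshold
filtered = [pair for pair in mapped if pair[1] > threshold]

print(filtered)  # [('apple', 50), ('orange', 70), ('mango', 90)]
```

Notice how much ceremony even this simplified version needs compared to a single Spark filter call; a real Hadoop job would additionally require mapper and reducer classes, job configuration, and cluster I/O.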

3. Filter sales data using Spark

Use Apache Spark to create a SparkSession called spark. Then create a DataFrame called df from sales_data with columns product and sales. Filter df to keep rows where sales is greater than threshold and save the result as filtered_df.

Hint: Use SparkSession.builder.master("local").appName("SalesFilter").getOrCreate() to create spark. Create the DataFrame with spark.createDataFrame(sales_data, ["product", "sales"]). Filter with df.filter(df.sales > threshold).

4. Show the filtered sales results

Use filtered_df.show() to display the filtered sales records.

Hint: Call show() on filtered_df to print the filtered rows as a formatted table.