
Why data quality prevents downstream failures in Apache Spark

📖 Scenario: You work as a data analyst at a company that collects sales data from different stores. Sometimes the data has missing or wrong values, which causes errors in reports and decisions. Your job is to check the quality of the sales data before it is used for analysis. You will identify and filter out bad data to prevent problems later.
🎯 Goal: Build a Spark program that loads sales data, sets a quality threshold, filters out bad data based on that threshold, and shows the clean data. This will help avoid failures in later steps.
📋 What You'll Learn
Create a Spark DataFrame with sales data including some bad entries
Set a threshold for minimum valid sales amount
Filter the DataFrame to keep only rows with sales amount above the threshold
Print the filtered clean data
💡 Why This Matters
🌍 Real World
In real companies, data often has errors or missing values. Checking data quality early stops bad data from causing wrong reports or system crashes.
💼 Career
Data analysts and engineers must clean and validate data before analysis or machine learning to ensure reliable results and avoid costly mistakes.
1
Create sales data DataFrame
Create a Spark DataFrame called sales_df with these exact rows: ("Store1", 100), ("Store2", -50), ("Store3", 200), ("Store4", 0), ("Store5", 150). The columns should be store and sales.
Need a hint?

Use spark.createDataFrame() with a list of tuples and specify the column names.

2
Set quality threshold
Create a variable called min_sales and set it to 1. This will be the minimum valid sales amount to keep.
Need a hint?

Just create a variable named min_sales and assign the value 1.
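The threshold is a plain Python variable; with a value of 1, any zero or negative sales amount will be treated as bad data:

```python
# Minimum valid sales amount; rows below this are considered bad data
min_sales = 1
```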

3
Filter bad data using threshold
Create a new DataFrame called clean_sales_df by filtering sales_df to keep only rows where the sales column is greater than or equal to min_sales.
Need a hint?

Use the filter() method on sales_df with the condition sales_df.sales >= min_sales.

4
Show clean sales data
Use print() to display the rows of clean_sales_df by collecting them and printing the list.
Need a hint?

Use clean_sales_df.collect() inside print() to show the filtered rows.