
Why data quality prevents downstream failures in Apache Spark

📖 Scenario: You work as a data analyst at a company that collects sales data from different stores. Sometimes the data has missing or wrong values, which causes errors in reports and decisions. Your job is to check the quality of the sales data before it is used for analysis. You will identify and filter out bad data to prevent problems later.
🎯 Goal: Build a Spark program that loads sales data, sets a quality threshold, filters out bad data based on that threshold, and shows the clean data. This will help avoid failures in later steps.
📋 What You'll Learn
Create a Spark DataFrame with sales data including some bad entries
Set a threshold for minimum valid sales amount
Filter the DataFrame to keep only rows with sales amount above the threshold
Print the filtered clean data
💡 Why This Matters
🌍 Real World
In real companies, data often has errors or missing values. Checking data quality early stops bad data from causing wrong reports or system crashes.
💼 Career
Data analysts and engineers must clean and validate data before analysis or machine learning to ensure reliable results and avoid costly mistakes.
1
Create sales data DataFrame
Create a Spark DataFrame called sales_df with these exact rows: ("Store1", 100), ("Store2", -50), ("Store3", 200), ("Store4", 0), ("Store5", 150). The columns should be store and sales.
Need a hint?

Use spark.createDataFrame() with a list of tuples and specify the column names.

2
Set quality threshold
Create a variable called min_sales and set it to 1. This will be the minimum valid sales amount to keep.
Need a hint?

Just create a variable named min_sales and assign the value 1.
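The threshold is a plain Python variable; with a value of 1, any zero or negative sales amount will be treated as bad data:

```python
# Minimum valid sales amount; rows below this are considered bad data
min_sales = 1
```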

3
Filter bad data using threshold
Create a new DataFrame called clean_sales_df by filtering sales_df to keep only rows where the sales column is greater than or equal to min_sales.
Need a hint?

Use the filter() method on sales_df with the condition sales_df.sales >= min_sales.

4
Show clean sales data
Use print() to display the rows of clean_sales_df by collecting them and printing the list.
Need a hint?

Use clean_sales_df.collect() inside print() to show the filtered rows.