Apache Spark | data | ~30 mins

Data quality assertions in Apache Spark - Mini Project: Build & Apply

Data Quality Assertions with Apache Spark
📖 Scenario: You work as a data analyst at a retail company. You receive daily sales data and want to make sure the data is clean and reliable before using it for reports.
🎯 Goal: Build a simple Apache Spark program that loads sales data, sets up quality checks (assertions) on the data, and prints whether the data passes these checks.
📋 What You'll Learn
Create a Spark DataFrame with given sales data
Define a threshold for minimum sales amount
Write assertions to check data quality conditions
Print the results of the data quality checks
💡 Why This Matters
🌍 Real World
Data quality checks are essential in real-world data pipelines to ensure reports and decisions are based on accurate data.
💼 Career
Data analysts and data engineers often write data quality assertions to catch errors early and maintain trust in data systems.
Step 1: Create the sales data DataFrame
Create a Spark DataFrame called sales_df with these exact rows: ("2024-06-01", "StoreA", 150), ("2024-06-01", "StoreB", 80), ("2024-06-02", "StoreA", 200), ("2024-06-02", "StoreB", 50). Use columns named date, store, and sales.
Need a hint?

Use spark.createDataFrame() with a list of tuples and a list of column names.
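One way this step could look in PySpark is sketched below. The rows and column names come from the step itself; the helper names build_sales_df and main, the app name, and the local[1] master are assumptions made for illustration. The PySpark import sits inside main() so the file still loads on a machine without Spark installed.

```python
# Rows and column names copied from the step description.
SALES_ROWS = [
    ("2024-06-01", "StoreA", 150),
    ("2024-06-01", "StoreB", 80),
    ("2024-06-02", "StoreA", 200),
    ("2024-06-02", "StoreB", 50),
]
SALES_COLUMNS = ["date", "store", "sales"]


def build_sales_df(spark):
    """Build the sales DataFrame on the given SparkSession."""
    return spark.createDataFrame(SALES_ROWS, SALES_COLUMNS)


def main():
    # Imported here so the module loads even where PySpark is missing.
    from pyspark.sql import SparkSession

    # App name and local master are illustrative assumptions.
    spark = (
        SparkSession.builder
        .master("local[1]")
        .appName("data-quality-assertions")
        .getOrCreate()
    )
    sales_df = build_sales_df(spark)
    sales_df.show()  # display the four rows as a table
    spark.stop()
```

Calling main() in an environment with PySpark available creates sales_df and displays it. In a notebook that already provides a spark session, build_sales_df(spark) alone is enough.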

Step 2: Set the minimum sales threshold
Create a variable called min_sales and set it to 100. This will be the minimum acceptable sales amount for quality checks.
Need a hint?

Just create a variable named min_sales and assign the value 100.

Step 3: Write data quality assertions
Create a variable called low_sales_count that counts rows in sales_df where sales is less than min_sales. Use the filter method and count().
Need a hint?

Use sales_df.filter(sales_df.sales < min_sales).count() to count low sales rows.
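The filter-and-count check might be sketched as follows. The function name count_low_sales and the main() wrapper are hypothetical, as are the app name and local[1] master; the filter expression is the one from the hint. With the four sample rows and a threshold of 100, two rows (sales of 80 and 50) fall below the threshold.

```python
def count_low_sales(sales_df, min_sales):
    """Count rows whose sales value is strictly below min_sales."""
    return sales_df.filter(sales_df.sales < min_sales).count()


def main():
    # Imported here so the module loads even where PySpark is missing.
    from pyspark.sql import SparkSession

    # App name and local master are illustrative assumptions.
    spark = (
        SparkSession.builder
        .master("local[1]")
        .appName("low-sales-count")
        .getOrCreate()
    )
    sales_df = spark.createDataFrame(
        [
            ("2024-06-01", "StoreA", 150),
            ("2024-06-01", "StoreB", 80),
            ("2024-06-02", "StoreA", 200),
            ("2024-06-02", "StoreB", 50),
        ],
        ["date", "store", "sales"],
    )
    min_sales = 100  # minimum acceptable sales amount
    low_sales_count = count_low_sales(sales_df, min_sales)
    spark.stop()
    return low_sales_count
```

filter() is lazy and only describes the rows to keep; count() is the action that makes Spark run the job and return a plain Python integer.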

Step 4: Print the data quality check result
Print the message "Number of low sales records:" followed by the value of low_sales_count.
Need a hint?

Use print("Number of low sales records:", low_sales_count) to show the result.
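A small sketch of the reporting step follows. The print call is the one from the hint; the helper name report_low_sales and the pass rule (the check passes only when no low-sales rows exist) are assumptions added to illustrate how the count could be turned into a pass/fail result. The value 2 is hard-coded here as the count for the sample rows, so the snippet runs without Spark.

```python
def report_low_sales(low_sales_count):
    """Print the count and return True when no low-sales rows were found."""
    print("Number of low sales records:", low_sales_count)
    # Assumed rule: any row below the threshold fails the check.
    passed = low_sales_count == 0
    print("Data quality check:", "PASSED" if passed else "FAILED")
    return passed


# For the sample data, sales of 80 and 50 are below 100, so the count is 2.
low_sales_count = 2
passed = report_low_sales(low_sales_count)
```

In a real pipeline the returned flag could decide whether to continue, or the function could raise an exception instead of returning False, depending on how strict the check should be.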