
Why data quality prevents downstream failures in Apache Spark

Introduction

Good data quality prevents mistakes later in analysis or reports and keeps results correct and trustworthy. Typical situations where it matters:

When cleaning data before analysis to avoid wrong conclusions.
When preparing data for machine learning models to improve accuracy.
When combining data from different sources to ensure consistency.
When generating reports that business decisions depend on.
When automating data pipelines to prevent crashes or errors.
Syntax
Apache Spark
from pyspark.sql.functions import col, isnan, when

# Example: Check for missing or invalid data
clean_data = raw_data.filter(
    (col('column_name').isNotNull()) & (~isnan(col('column_name')))
)

Use filter() to drop rows that fail a quality check.

isNotNull() catches missing (null) values, while isnan() catches NaN values in floating-point columns. The two checks are different, so thorough cleaning often combines both.
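The logic of that filter can be sketched in plain Python: a value survives only if it is neither missing (None, Spark's null) nor NaN. The helper name is_valid below is illustrative, not part of Spark's API.

```python
import math

def is_valid(value):
    """Mimic isNotNull() & ~isnan(): reject None and NaN."""
    if value is None:  # missing value (null in Spark)
        return False
    if isinstance(value, float) and math.isnan(value):  # invalid number
        return False
    return True

values = [25, None, float('nan'), 30]
clean = [v for v in values if is_valid(v)]
print(clean)  # [25, 30]
```

Spark applies the same per-row test, but distributed across the cluster instead of in a Python loop.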

Examples
Keep only rows where 'age' is not missing.
Apache Spark
clean_data = raw_data.filter(col('age').isNotNull())
Remove rows where 'salary' is NaN (not a number).
Apache Spark
clean_data = raw_data.filter(~isnan(col('salary')))
Keep rows where 'score' is between 0 and 100.
Apache Spark
clean_data = raw_data.filter((col('score') >= 0) & (col('score') <= 100))
Sample Program

This program creates a small dataset with missing values. It then removes rows where 'age' or 'salary' is missing or invalid. This prevents errors in later steps that use these columns.

Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, isnan

# Start Spark session
spark = SparkSession.builder.appName('DataQualityExample').getOrCreate()

# Sample data with some bad rows
data = [
    (1, 'Alice', 25, 50000),
    (2, 'Bob', None, 60000),
    (3, 'Charlie', 30, None),
    (4, None, 22, 45000),
    (5, 'Eve', 28, 70000)
]

columns = ['id', 'name', 'age', 'salary']

raw_data = spark.createDataFrame(data, columns)

# Show raw data
print('Raw data:')
raw_data.show()

# Clean data: remove rows with missing age or salary
clean_data = raw_data.filter(
    (col('age').isNotNull()) & (~isnan(col('age'))) &
    (col('salary').isNotNull()) & (~isnan(col('salary')))
)

print('Clean data:')
clean_data.show()

spark.stop()
Important Notes

Always check for missing or invalid data before analysis.

Cleaning data early saves time and prevents errors later.

Use Spark functions like isNotNull() and isnan() to find bad data.
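Two separate functions are needed because null and NaN behave differently: None marks a value that is absent, while NaN is a real floating-point value that compares unequal even to itself, so an ordinary equality test cannot detect it. A plain-Python illustration:

```python
import math

nan = float('nan')

print(nan == nan)       # False: NaN never equals anything, not even itself
print(nan is None)      # False: NaN is a real float, not a missing value
print(math.isnan(nan))  # True: the reliable way to detect NaN
```

This is why the examples above pair isNotNull() with ~isnan() rather than relying on a single check.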

Summary

Good data quality stops errors in later steps.

Remove or fix missing and invalid data early.

Spark has tools to help find and clean bad data.