
Why data quality prevents downstream failures in Apache Spark - Challenge Your Understanding

Challenge - 5 Problems
🎖️ Data Quality Mastery: get all five challenges correct to earn this badge!
Problem 1: Predict Output (intermediate)
Output of Spark DataFrame after filtering nulls
Given the Spark DataFrame code below, what will be the output after filtering out rows with null values in the 'age' column?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName('DataQuality').getOrCreate()
data = [(1, 'Alice', 25), (2, 'Bob', None), (3, 'Charlie', 30), (4, 'David', None)]
df = spark.createDataFrame(data, ['id', 'name', 'age'])
df_filtered = df.filter(col('age').isNotNull())
df_filtered.show()
A
+---+-------+----+
| id|   name| age|
+---+-------+----+
|  2|    Bob|null|
|  4|  David|null|
+---+-------+----+
B
Empty DataFrame
Columns: [id, name, age]
[]
C
+---+-------+----+
| id|   name| age|
+---+-------+----+
|  1|  Alice|  25|
|  2|    Bob|null|
|  3|Charlie|  30|
|  4|  David|null|
+---+-------+----+
D
+---+-------+---+
| id|   name|age|
+---+-------+---+
|  1|  Alice| 25|
|  3|Charlie| 30|
+---+-------+---+
💡 Hint: Filtering removes rows where 'age' is null.
Problem 2: Data Output (intermediate)
Count of distinct values after cleaning data
After removing duplicate rows and rows with null 'email' values from a Spark DataFrame, what is the count of distinct emails?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName('DataQuality').getOrCreate()
data = [(1, 'alice@example.com'), (2, 'bob@example.com'), (3, None), (4, 'alice@example.com'), (5, None)]
df = spark.createDataFrame(data, ['id', 'email'])
df_clean = df.dropDuplicates(['email']).filter(col('email').isNotNull())
count = df_clean.select('email').distinct().count()
print(count)
A. 2
B. 5
C. 4
D. 3
💡 Hint: Duplicates and nulls are removed before counting distinct emails.
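To check your reasoning, the same cleaning steps can be traced in plain Python, no Spark session needed, using the sample rows from the problem:

```python
# Sample data mirroring the problem's (id, email) rows.
rows = [(1, 'alice@example.com'), (2, 'bob@example.com'), (3, None),
        (4, 'alice@example.com'), (5, None)]

# dropDuplicates(['email']) keeps one row per distinct email value,
# including a single row whose email is None.
seen = set()
deduped = [r for r in rows if not (r[1] in seen or seen.add(r[1]))]

# filter(col('email').isNotNull()) then removes the remaining null row.
non_null = [r for r in deduped if r[1] is not None]

# distinct().count() on the cleaned column.
distinct_emails = len({r[1] for r in non_null})
print(distinct_emails)
```

Note the order of operations: deduplication runs first, so at most one null-email row survives into the null filter, which then removes it.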
Problem 3: 🔧 Debug (advanced)
Debug Spark data validation code
What happens when this Spark code checks for negative values in the 'salary' column?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName('DataQuality').getOrCreate()
data = [(1, 5000), (2, -1000), (3, 7000)]
df = spark.createDataFrame(data, ['id', 'salary'])
invalid_rows = df.filter(col('salary') < 0).collect()
print(invalid_rows[0]['salary'])
A. No error, prints -1000
B. TypeError
C. IndexError
D. KeyError
💡 Hint: Check whether the filter returns any rows and how those rows are accessed.
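The hint turns on what collect() returns: a plain Python list of Row objects, each supporting dict-style access by column name. The indexing behaviour can be sketched without Spark, using a list standing in for the collected rows:

```python
# Stand-in for collect() on the filtered DataFrame: one row matches salary < 0.
collected = [{'id': 2, 'salary': -1000}]

# Indexing is safe only because the list is non-empty.
print(collected[0]['salary'])

# If no rows had matched, collect() would return an empty list,
# and collected[0] would raise IndexError.
empty = []
try:
    empty[0]['salary']
except IndexError:
    print('IndexError on empty result')
```

So whether this pattern succeeds or fails depends entirely on whether the filter matched any rows.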
Problem 4: Visualization (advanced)
Interpreting Spark DataFrame summary statistics
Given the summary statistics output below from a Spark DataFrame's 'score' column, what is the median score?
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('DataQuality').getOrCreate()
data = [(1, 50), (2, 80), (3, 90), (4, 70), (5, 60)]
df = spark.createDataFrame(data, ['id', 'score'])
summary = df.describe('score')
summary.show()
A. 75
B. Cannot determine from describe() output
C. 80
D. 70
💡 Hint: Spark's describe() does not provide the median.
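As the hint notes, describe() reports only count, mean, stddev, min, and max; to get a median in PySpark you would instead reach for DataFrame.approxQuantile('score', [0.5], 0.0). The underlying arithmetic can be checked in plain Python:

```python
import statistics

# The 'score' values from the problem's DataFrame.
scores = [50, 80, 90, 70, 60]

# The median of an odd-length sample is the middle value after sorting.
print(sorted(scores))            # [50, 60, 70, 80, 90]
print(statistics.median(scores))
```

The point of the question is that none of this middle-value information appears in describe()'s output, so the median must be computed separately.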
Problem 5: 🚀 Application (expert)
Choosing the best approach to prevent downstream failures
You have a Spark pipeline that fails downstream due to unexpected nulls in a critical column. Which approach best prevents these failures?
A. Replace nulls with zeros without checking whether zeros are valid
B. Ignore nulls and handle errors only when failures occur downstream
C. Add a filter step early to remove rows with nulls in the critical column
D. Skip data quality checks to improve pipeline speed
💡 Hint: Prevent problems early by cleaning data before processing.
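The pattern behind the hint is validate-then-process: drop or quarantine bad records at the start of the pipeline rather than letting nulls surface as failures several stages later. A minimal plain-Python sketch of that early filter step (the field name 'amount' is illustrative, not from the problem):

```python
def drop_null_critical(records, critical_field='amount'):
    """Remove records whose critical field is null before any processing.

    The PySpark equivalent of this early step would be
    df.filter(col(critical_field).isNotNull()).
    """
    return [r for r in records if r.get(critical_field) is not None]

records = [{'id': 1, 'amount': 10.0},
           {'id': 2, 'amount': None},
           {'id': 3, 'amount': 7.5}]

cleaned = drop_null_critical(records)

# Downstream aggregation can no longer hit a null.
total = sum(r['amount'] for r in cleaned)
print(len(cleaned), total)
```

Placing the filter first means every downstream step can assume the invariant "critical column is non-null", which is exactly what the failing pipeline was missing.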