Challenge - 5 Problems
Data Quality Mastery
Get all challenges correct to earn this badge!
Test your skills under time pressure!
❓ Predict Output
Intermediate · 2:00 remaining
Output of Spark DataFrame after filtering nulls
Given the Spark DataFrame code below, what will be the output after filtering out rows with null values in the 'age' column?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName('DataQuality').getOrCreate()
data = [(1, 'Alice', 25), (2, 'Bob', None), (3, 'Charlie', 30), (4, 'David', None)]
df = spark.createDataFrame(data, ['id', 'name', 'age'])
df_filtered = df.filter(col('age').isNotNull())
df_filtered.show()
Attempts: 2 left
💡 Hint
Filtering removes rows where 'age' is null.
✗ Incorrect
The filter keeps only rows where 'age' is not null, so rows with Bob and David are removed.
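The row-level logic can be checked with a minimal plain-Python sketch (this mirrors the filter, it is not the Spark API):

```python
# Same data as the challenge; filter out rows whose 'age' (index 2) is null.
data = [(1, 'Alice', 25), (2, 'Bob', None), (3, 'Charlie', 30), (4, 'David', None)]
filtered = [row for row in data if row[2] is not None]
print(filtered)  # [(1, 'Alice', 25), (3, 'Charlie', 30)]
```

Only the Alice and Charlie rows survive, matching what `df_filtered.show()` would display.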
❓ Data Output
Intermediate · 1:30 remaining
Count of distinct values after cleaning data
After removing duplicate rows and rows with null 'email' values from a Spark DataFrame, what is the count of distinct emails?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName('DataQuality').getOrCreate()
data = [(1, 'alice@example.com'), (2, 'bob@example.com'), (3, None),
        (4, 'alice@example.com'), (5, None)]
df = spark.createDataFrame(data, ['id', 'email'])
df_clean = df.dropDuplicates(['email']).filter(col('email').isNotNull())
count = df_clean.select('email').distinct().count()
print(count)
Attempts: 2 left
💡 Hint
Duplicates and nulls are removed before counting distinct emails.
✗ Incorrect
Only 'alice@example.com' and 'bob@example.com' remain after cleaning, so count is 2.
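The deduplicate-then-drop-nulls step reduces to set semantics, which a short plain-Python sketch makes concrete (this is the equivalent logic, not the Spark API):

```python
# Same data as the challenge: two duplicates of alice, two nulls.
data = [(1, 'alice@example.com'), (2, 'bob@example.com'), (3, None),
        (4, 'alice@example.com'), (5, None)]
# A set comprehension deduplicates; the condition drops the nulls.
emails = {email for _, email in data if email is not None}
print(len(emails))  # 2
```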
🔧 Debug
Advanced · 2:00 remaining
Identify the error in Spark data validation code
What error will this Spark code raise when checking for negative values in the 'salary' column?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName('DataQuality').getOrCreate()
data = [(1, 5000), (2, -1000), (3, 7000)]
df = spark.createDataFrame(data, ['id', 'salary'])
invalid_rows = df.filter(col('salary') < 0).collect()
print(invalid_rows[0]['salary'])
Attempts: 2 left
💡 Hint
Check if filter returns rows and how to access them.
✗ Incorrect
This code raises no error: the filter returns the single row with salary < 0, collect() returns a list of Row objects, and accessing the first element and its 'salary' field both work fine.
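A plain-Python sketch of the same access pattern shows why nothing fails (dicts stand in for Spark Row objects, which also support key access):

```python
# Rows modeled as dicts; Spark Row objects support the same ['salary'] lookup.
rows = [{'id': 1, 'salary': 5000}, {'id': 2, 'salary': -1000}, {'id': 3, 'salary': 7000}]
invalid = [r for r in rows if r['salary'] < 0]   # like filter(...).collect()
print(invalid[0]['salary'])  # -1000
```

The lookup only breaks if the filter matched nothing, in which case `invalid[0]` would raise an IndexError; here one row matches.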
❓ Visualization
Advanced · 1:30 remaining
Interpreting Spark DataFrame summary statistics
Given the code below, which prints summary statistics for a Spark DataFrame's 'score' column, what is the median score?
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('DataQuality').getOrCreate()
data = [(1, 50), (2, 80), (3, 90), (4, 70), (5, 60)]
df = spark.createDataFrame(data, ['id', 'score'])
summary = df.describe('score')
summary.show()
Attempts: 2 left
💡 Hint
Spark's describe() does not provide median.
✗ Incorrect
The describe() method shows count, mean, stddev, min, max but not median.
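To actually obtain a median you need a separate computation; in Spark this is typically `df.approxQuantile('score', [0.5], 0.0)`, and the exact value can be checked with the standard-library `statistics` module:

```python
import statistics

# Same scores as the challenge data.
scores = [50, 80, 90, 70, 60]
# Median is the middle value of the sorted list: 50, 60, 70, 80, 90.
print(statistics.median(scores))  # 70
```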
🚀 Application
Expert · 2:30 remaining
Choosing the best approach to prevent downstream failures
You have a Spark pipeline that fails downstream due to unexpected nulls in a critical column. Which approach best prevents these failures?
Attempts: 2 left
💡 Hint
Prevent problems early by cleaning data before processing.
✗ Incorrect
Validating and filtering out nulls early avoids errors later; ignoring the problem or skipping checks lets failures propagate downstream, and blindly replacing nulls can introduce incorrect data.
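One way to fail fast rather than downstream is an explicit guard that rejects nulls in the critical column before processing begins. This is a hypothetical plain-Python sketch of the pattern (the function name and row layout are assumptions, not a Spark API):

```python
def validate_no_nulls(rows, column_index):
    """Raise early if any row has a null in the critical column."""
    bad = [row for row in rows if row[column_index] is None]
    if bad:
        raise ValueError(f"{len(bad)} rows have null in column {column_index}")
    return rows

# Clean data passes through unchanged; dirty data fails at the boundary.
clean = validate_no_nulls([(1, 'a'), (2, 'b')], 1)
```

In a real Spark pipeline the same idea is a null-count check (e.g. `df.filter(col('email').isNull()).count()`) run before the expensive stages, so bad input is caught at the boundary instead of mid-pipeline.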