Challenge - 5 Problems
Data Quality Master
Get all challenges correct to earn this badge!
Test your skills under time pressure!
❓ Predict Output
Intermediate · 2:00 remaining
Output of Data Quality Assertion with Null Check
What will be the output of the following Apache Spark code snippet that asserts no null values in the 'age' column?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName('Test').getOrCreate()
data = [(1, 'Alice', 25), (2, 'Bob', None), (3, 'Charlie', 30)]
df = spark.createDataFrame(data, ['id', 'name', 'age'])

# Assert no nulls in 'age'
df.select('age').filter(col('age').isNull()).count() == 0
Attempts: 2 left
💡 Hint
Check if any rows have null in the 'age' column.
✗ Incorrect
The dataframe contains one row where 'age' is None (null). The filter selects that row, so count() returns 1, making the expression False.
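The same assertion logic can be sketched in plain Python, without a Spark session, to make the result easy to verify by hand:

```python
# Plain-Python sketch of the null-check assertion (no Spark needed).
data = [(1, 'Alice', 25), (2, 'Bob', None), (3, 'Charlie', 30)]

# Count rows where 'age' is null, mirroring filter(col('age').isNull()).count()
null_count = sum(1 for _id, _name, age in data if age is None)

# The assertion passes only when no nulls are present.
assertion = null_count == 0
print(null_count, assertion)  # 1 False
```

Because Bob's row has a null age, the count is 1 and the assertion evaluates to False.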
❓ data_output
Intermediate · 2:00 remaining
Number of Rows Failing a Data Quality Assertion
Given the following Spark DataFrame, how many rows fail the assertion that 'salary' must be greater than 3000?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName('Test').getOrCreate()
data = [(1, 'John', 4000), (2, 'Jane', 2500), (3, 'Doe', 3500), (4, 'Smith', 3000)]
df = spark.createDataFrame(data, ['id', 'name', 'salary'])

failing_rows = df.filter(col('salary') <= 3000).count()
Attempts: 2 left
💡 Hint
Count rows where salary is less than or equal to 3000.
✗ Incorrect
Rows with salary 2500 and 3000 fail the assertion, so count is 2.
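As a quick check, the same count can be reproduced in plain Python. Note the assertion requires salary strictly greater than 3000, so a salary of exactly 3000 fails:

```python
# Plain-Python sketch of the failing-row count (no Spark needed).
data = [(1, 'John', 4000), (2, 'Jane', 2500), (3, 'Doe', 3500), (4, 'Smith', 3000)]

# Mirror filter(col('salary') <= 3000).count(): rows violating salary > 3000.
failing_rows = sum(1 for _id, _name, salary in data if salary <= 3000)
print(failing_rows)  # 2
```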
🔧 Debug
Advanced · 2:00 remaining
Identify the Error in Data Quality Assertion Code
What happens when the following Spark code checks whether all 'email' values contain '@'?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName('Test').getOrCreate()
data = [(1, 'alice@example.com'), (2, 'bobexample.com')]
df = spark.createDataFrame(data, ['id', 'email'])

# Check if all emails contain '@'
assertion = df.filter(~col('email').contains('@')).count() == 0
print(assertion)
Attempts: 2 left
💡 Hint
Check if the 'contains' method is valid on a Column object.
✗ Incorrect
The 'contains' method is a valid PySpark Column method, so the code raises no error. The negated filter `~col('email').contains('@')` matches the one row lacking '@' ('bobexample.com'), so count() returns 1, the comparison to 0 is False, and the code prints False.
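The behavior of the negated containment filter can be mirrored in plain Python using the `in` operator:

```python
# Plain-Python sketch of the '@'-containment check (no Spark needed).
data = [(1, 'alice@example.com'), (2, 'bobexample.com')]

# Mirror filter(~col('email').contains('@')).count(): rows missing '@'.
missing_at = sum(1 for _id, email in data if '@' not in email)

assertion = missing_at == 0
print(assertion)  # False, because 'bobexample.com' has no '@'
```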
❓ visualization
Advanced · 2:00 remaining
Visualizing Data Quality Assertion Failures
Which plot best shows the count of rows failing a data quality assertion that 'score' must be between 0 and 100?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
import matplotlib.pyplot as plt

spark = SparkSession.builder.appName('Test').getOrCreate()
data = [(1, 95), (2, 105), (3, -5), (4, 50)]
df = spark.createDataFrame(data, ['id', 'score'])

fail_df = df.filter((col('score') < 0) | (col('score') > 100))
fail_count = fail_df.count()

# Plot code here
Attempts: 2 left
💡 Hint
Count of failing rows is 2; a bar chart can show this count clearly.
✗ Incorrect
Two rows fail the assertion (scores 105 and -5). A bar chart with one bar labeled 'Failing Rows' showing 2 is the clearest visualization of failure count.
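One way to fill in the "Plot code here" step is a minimal matplotlib bar chart; the sketch below computes the failing-row count in plain Python (no Spark session) and the output filename `failing_rows.png` is an arbitrary choice, not from the original code:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt

data = [(1, 95), (2, 105), (3, -5), (4, 50)]

# Rows violating 0 <= score <= 100 (mirrors the Spark filter).
fail_count = sum(1 for _id, score in data if score < 0 or score > 100)

# Single bar showing the failure count.
fig, ax = plt.subplots()
ax.bar(['Failing Rows'], [fail_count])
ax.set_ylabel('Row count')
ax.set_title('Rows failing 0 <= score <= 100')
fig.savefig('failing_rows.png')
```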
🧠 Conceptual
Expert · 2:00 remaining
Best Practice for Data Quality Assertions in Spark Pipelines
Which option describes the best practice for implementing data quality assertions in Apache Spark ETL pipelines?
Attempts: 2 left
💡 Hint
Consider when errors should be caught to avoid propagating bad data.
✗ Incorrect
Running assertions after transformations and before writing verifies data quality once all changes have been applied, catching errors early and preventing bad data from being written to storage.
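This placement can be sketched as a minimal pipeline where a validation step sits between the transform and the write. The function names (`transform`, `validate`, `write`) and the salary rule are hypothetical stand-ins for a real Spark job:

```python
# Hypothetical ETL sketch: assertions run after transforms, before the write.
def transform(rows):
    # Apply a 10% raise to every salary.
    return [(rid, name, salary * 1.1) for rid, name, salary in rows]

def validate(rows):
    # Quality gate: every salary must be positive after transformation.
    failures = [row for row in rows if row[2] <= 0]
    if failures:
        raise ValueError(f"{len(failures)} rows failed the salary check")
    return rows

def write(rows):
    # Stand-in for a real sink (e.g. a table or file write); returns rows written.
    return len(rows)

raw = [(1, 'John', 4000), (2, 'Jane', 2500)]
written = write(validate(transform(raw)))
print(written)  # 2
```

Because `validate` raises before `write` is reached, bad rows never hit storage; the same pattern applies whether the steps are plain functions or Spark DataFrame stages.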