
Data quality assertions in Apache Spark - Practice Problems & Coding Challenges

Challenge - 5 Problems
Predict Output
intermediate
Output of Data Quality Assertion with Null Check
What does the final expression evaluate to in the following Apache Spark snippet, which checks that the 'age' column contains no null values?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName('Test').getOrCreate()
data = [(1, 'Alice', 25), (2, 'Bob', None), (3, 'Charlie', 30)]
df = spark.createDataFrame(data, ['id', 'name', 'age'])

# Assert no nulls in 'age': the count of null rows must be zero
assertion = df.select('age').filter(col('age').isNull()).count() == 0
print(assertion)
A) Throws an exception
B) True
C) False
D) None
💡 Hint
Check if any rows have null in the 'age' column.
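The hint can be traced without a Spark session; below is a plain-Python stand-in for the same null check, using a `rows` list that mirrors the DataFrame in the snippet (the variable names are illustrative, not part of the original code):

```python
# Rows mirroring the DataFrame in the snippet: (id, name, age)
rows = [(1, 'Alice', 25), (2, 'Bob', None), (3, 'Charlie', 30)]

# Equivalent of df.select('age').filter(col('age').isNull()).count()
null_count = sum(1 for _, _, age in rows if age is None)

# One row (Bob) has a null age, so the "no nulls" assertion is False
assertion = null_count == 0
print(assertion)  # False
```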
Data Output
intermediate
Number of Rows Failing a Data Quality Assertion
Given the following Spark DataFrame, how many rows fail the assertion that 'salary' must be greater than 3000?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName('Test').getOrCreate()
data = [(1, 'John', 4000), (2, 'Jane', 2500), (3, 'Doe', 3500), (4, 'Smith', 3000)]
df = spark.createDataFrame(data, ['id', 'name', 'salary'])

failing_rows = df.filter(col('salary') <= 3000).count()
A) 2
B) 3
C) 1
D) 0
💡 Hint
Count rows where salary is less than or equal to 3000.
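The same filter can be mimicked in plain Python over the literal data from the snippet, which makes the count easy to verify by hand (a sketch, not the original Spark code):

```python
# Rows mirroring the DataFrame: (id, name, salary)
rows = [(1, 'John', 4000), (2, 'Jane', 2500), (3, 'Doe', 3500), (4, 'Smith', 3000)]

# Equivalent of df.filter(col('salary') <= 3000).count(): a row fails the
# "salary must be greater than 3000" assertion when salary <= 3000
failing_rows = sum(1 for _, _, salary in rows if salary <= 3000)
print(failing_rows)  # 2  (Jane at 2500 and Smith at exactly 3000)
```

Note that Smith's salary of exactly 3000 fails the strict "greater than 3000" rule, which is the easy case to miss.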
🔧 Debug
advanced
Identify the Error in Data Quality Assertion Code
What happens when the following Spark code runs its check that all 'email' values contain '@'?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName('Test').getOrCreate()
data = [(1, 'alice@example.com'), (2, 'bobexample.com')]
df = spark.createDataFrame(data, ['id', 'email'])

# Check if all emails contain '@'
assertion = df.filter(~col('email').contains('@')).count() == 0
print(assertion)
A) SyntaxError
B) True
C) False
D) AttributeError
💡 Hint
Check if the 'contains' method is valid on a Column object.
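`contains` is a real method on PySpark `Column` objects, so the snippet raises no error; the `~` operator negates the condition. The logic can be traced in plain Python (an illustrative sketch, not the original Spark code):

```python
# Rows mirroring the DataFrame: (id, email)
rows = [(1, 'alice@example.com'), (2, 'bobexample.com')]

# Equivalent of df.filter(~col('email').contains('@')).count():
# count the emails that do NOT contain '@'
bad_emails = sum(1 for _, email in rows if '@' not in email)

# One email lacks '@', so the "all emails contain '@'" assertion is False
assertion = bad_emails == 0
print(assertion)  # False
```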
Visualization
advanced
Visualizing Data Quality Assertion Failures
Which plot best shows the count of rows failing a data quality assertion that 'score' must be between 0 and 100?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
import matplotlib.pyplot as plt

spark = SparkSession.builder.appName('Test').getOrCreate()
data = [(1, 95), (2, 105), (3, -5), (4, 50)]
df = spark.createDataFrame(data, ['id', 'score'])

fail_df = df.filter((col('score') < 0) | (col('score') > 100))
fail_count = fail_df.count()

# Plot code here
A) A line chart showing scores over ids
B) A bar chart with one bar labeled 'Failing Rows' showing value 2
C) A pie chart with two slices: 'Failing' 2 and 'Passing' 2
D) A scatter plot of id vs score
💡 Hint
Count of failing rows is 2; a bar chart can show this count clearly.
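The failing and passing counts can be verified in plain Python over the snippet's data (a sketch; the plotting call in the comment is one way to render the single bar, not the original code):

```python
# Rows mirroring the DataFrame: (id, score)
rows = [(1, 95), (2, 105), (3, -5), (4, 50)]

# Equivalent of df.filter((col('score') < 0) | (col('score') > 100)).count()
failing = sum(1 for _, score in rows if score < 0 or score > 100)
passing = len(rows) - failing

# A single-bar chart, e.g. plt.bar(['Failing Rows'], [failing]), shows the
# failure count directly; ids 2 (score 105) and 3 (score -5) are out of range
print(failing, passing)  # 2 2
```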
🧠 Conceptual
expert
Best Practice for Data Quality Assertions in Spark Pipelines
Which option describes the best practice for implementing data quality assertions in Apache Spark ETL pipelines?
A) Run assertions after all transformations and before writing data to storage to catch errors early
B) Run assertions only once at the very end of the pipeline to minimize performance impact
C) Skip assertions and rely on schema validation only to improve speed
D) Run assertions before any transformations to validate raw data only
💡 Hint
Consider when errors should be caught to avoid propagating bad data.
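The fail-fast pattern behind this question can be sketched in a few lines. The helper name `assert_no_violations` and the sample data are hypothetical, plain Python standing in for a Spark job:

```python
def assert_no_violations(rows, predicate, name):
    """Raise before bad data is written onward if any row violates the rule."""
    violations = sum(1 for row in rows if not predicate(row))
    if violations:
        raise ValueError(f"{name}: {violations} row(s) failed the assertion")
    return rows

# Run the check after transformations, before the write step
cleaned = [{'id': 1, 'salary': 4000}, {'id': 2, 'salary': 3200}]
assert_no_violations(cleaned, lambda r: r['salary'] > 3000, 'salary_check')
# write(cleaned)  # only reached when every row passes the assertion
```

Placing the check between the last transformation and the write means errors surface before corrupt data reaches storage, while still validating the fully transformed output rather than only the raw input.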