Challenge - 5 Problems
Data Quality Master
Get all challenges correct to earn this badge!
Test your skills under time pressure!
❓ Predict Output
Intermediate · 2:00 remaining
Output of Data Quality Assertion with Null Check
What will be the output of the following Apache Spark code snippet that asserts no null values in the 'age' column?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName('Test').getOrCreate()
data = [(1, 'Alice', 25), (2, 'Bob', None), (3, 'Charlie', 30)]
df = spark.createDataFrame(data, ['id', 'name', 'age'])

# Assert no nulls in 'age'
df.select('age').filter(col('age').isNull()).count() == 0
Attempts: 2 left
💡 Hint
Check if any rows have null in the 'age' column.
✗ Incorrect
The dataframe contains one row where 'age' is None (null). The filter selects that row, so count() returns 1, making the expression False.
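The same assertion logic can be sketched in plain Python, without a Spark session, to make the result easy to verify by hand:

```python
# Plain-Python sketch of the null-check assertion (no Spark needed).
data = [(1, 'Alice', 25), (2, 'Bob', None), (3, 'Charlie', 30)]

# Count rows where 'age' is null, mirroring filter(col('age').isNull()).count()
null_count = sum(1 for _id, _name, age in data if age is None)

# The assertion passes only when no nulls are present.
assertion = null_count == 0
print(null_count, assertion)  # 1 False
```

Because Bob's row has a null age, the count is 1 and the assertion evaluates to False.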
❓ data_output
Intermediate · 2:00 remaining
Number of Rows Failing a Data Quality Assertion
Given the following Spark DataFrame, how many rows fail the assertion that 'salary' must be greater than 3000?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName('Test').getOrCreate()
data = [(1, 'John', 4000), (2, 'Jane', 2500), (3, 'Doe', 3500), (4, 'Smith', 3000)]
df = spark.createDataFrame(data, ['id', 'name', 'salary'])

failing_rows = df.filter(col('salary') <= 3000).count()
Attempts: 2 left
💡 Hint
Count rows where salary is less than or equal to 3000.
✗ Incorrect
Rows with salary 2500 and 3000 fail the assertion, so count is 2.
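As a quick check, the same count can be reproduced in plain Python. Note the assertion requires salary strictly greater than 3000, so a salary of exactly 3000 fails:

```python
# Plain-Python sketch of the failing-row count (no Spark needed).
data = [(1, 'John', 4000), (2, 'Jane', 2500), (3, 'Doe', 3500), (4, 'Smith', 3000)]

# Mirror filter(col('salary') <= 3000).count(): rows violating salary > 3000.
failing_rows = sum(1 for _id, _name, salary in data if salary <= 3000)
print(failing_rows)  # 2
```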
🔧 Debug
Advanced · 2:00 remaining
Identify the Error in Data Quality Assertion Code
What happens when the following Spark code checks whether all 'email' values contain '@'?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName('Test').getOrCreate()
data = [(1, 'alice@example.com'), (2, 'bobexample.com')]
df = spark.createDataFrame(data, ['id', 'email'])

# Check if all emails contain '@'
assertion = df.filter(~col('email').contains('@')).count() == 0
print(assertion)
Attempts: 2 left
💡 Hint
Check if the 'contains' method is valid on a Column object.
✗ Incorrect
The 'contains' method is a valid PySpark Column method, so the code raises no error. The negated filter `~col('email').contains('@')` matches the one row lacking '@' ('bobexample.com'), so count() returns 1, the comparison to 0 is False, and the code prints False.
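The behavior of the negated containment filter can be mirrored in plain Python using the `in` operator:

```python
# Plain-Python sketch of the '@'-containment check (no Spark needed).
data = [(1, 'alice@example.com'), (2, 'bobexample.com')]

# Mirror filter(~col('email').contains('@')).count(): rows missing '@'.
missing_at = sum(1 for _id, email in data if '@' not in email)

assertion = missing_at == 0
print(assertion)  # False, because 'bobexample.com' has no '@'
```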
❓ visualization
Advanced · 2:00 remaining
Visualizing Data Quality Assertion Failures
Which plot best shows the count of rows failing a data quality assertion that 'score' must be between 0 and 100?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
import matplotlib.pyplot as plt

spark = SparkSession.builder.appName('Test').getOrCreate()
data = [(1, 95), (2, 105), (3, -5), (4, 50)]
df = spark.createDataFrame(data, ['id', 'score'])

fail_df = df.filter((col('score') < 0) | (col('score') > 100))
fail_count = fail_df.count()

# Plot code here
Attempts: 2 left
💡 Hint
Count of failing rows is 2; a bar chart can show this count clearly.
✗ Incorrect
Two rows fail the assertion (scores 105 and -5). A bar chart with one bar labeled 'Failing Rows' showing 2 is the clearest visualization of failure count.
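One way to fill in the "Plot code here" step is a minimal matplotlib bar chart; the sketch below computes the failing-row count in plain Python (no Spark session) and the output filename `failing_rows.png` is an arbitrary choice, not from the original code:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt

data = [(1, 95), (2, 105), (3, -5), (4, 50)]

# Rows violating 0 <= score <= 100 (mirrors the Spark filter).
fail_count = sum(1 for _id, score in data if score < 0 or score > 100)

# Single bar showing the failure count.
fig, ax = plt.subplots()
ax.bar(['Failing Rows'], [fail_count])
ax.set_ylabel('Row count')
ax.set_title('Rows failing 0 <= score <= 100')
fig.savefig('failing_rows.png')
```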
🧠 Conceptual
Expert · 2:00 remaining
Best Practice for Data Quality Assertions in Spark Pipelines
Which option describes the best practice for implementing data quality assertions in Apache Spark ETL pipelines?
Attempts: 2 left
💡 Hint
Consider when errors should be caught to avoid propagating bad data.
✗ Incorrect
Running assertions after transformations and before writing verifies data quality once all changes have been applied, catching errors early and preventing bad data from being written to storage.
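This placement can be sketched as a minimal pipeline where a validation step sits between the transform and the write. The function names (`transform`, `validate`, `write`) and the salary rule are hypothetical stand-ins for a real Spark job:

```python
# Hypothetical ETL sketch: assertions run after transforms, before the write.
def transform(rows):
    # Apply a 10% raise to every salary.
    return [(rid, name, salary * 1.1) for rid, name, salary in rows]

def validate(rows):
    # Quality gate: every salary must be positive after transformation.
    failures = [row for row in rows if row[2] <= 0]
    if failures:
        raise ValueError(f"{len(failures)} rows failed the salary check")
    return rows

def write(rows):
    # Stand-in for a real sink (e.g. a table or file write); returns rows written.
    return len(rows)

raw = [(1, 'John', 4000), (2, 'Jane', 2500)]
written = write(validate(transform(raw)))
print(written)  # 2
```

Because `validate` raises before `write` is reached, bad rows never hit storage; the same pattern applies whether the steps are plain functions or Spark DataFrame stages.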