Challenge - 5 Problems
Data Quality Mastery
Get all challenges correct to earn this badge!
Test your skills under time pressure!
❓ Predict Output
Intermediate · 2:00 remaining
Output of Spark DataFrame after filtering nulls
Given the Spark DataFrame code below, what will be the output after filtering out rows with null values in the 'age' column?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName('DataQuality').getOrCreate()
data = [(1, 'Alice', 25), (2, 'Bob', None), (3, 'Charlie', 30), (4, 'David', None)]
df = spark.createDataFrame(data, ['id', 'name', 'age'])
df_filtered = df.filter(col('age').isNotNull())
df_filtered.show()
Attempts: 2 left
💡 Hint
Filtering removes rows where 'age' is null.
✗ Incorrect
The filter keeps only rows where 'age' is not null, so rows with Bob and David are removed.
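The row-level logic can be checked with a minimal plain-Python sketch (this mirrors the filter, it is not the Spark API):

```python
# Same data as the challenge; filter out rows whose 'age' (index 2) is null.
data = [(1, 'Alice', 25), (2, 'Bob', None), (3, 'Charlie', 30), (4, 'David', None)]
filtered = [row for row in data if row[2] is not None]
print(filtered)  # [(1, 'Alice', 25), (3, 'Charlie', 30)]
```

Only the Alice and Charlie rows survive, matching what `df_filtered.show()` would display.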
❓ Data Output
Intermediate · 1:30 remaining
Count of distinct values after cleaning data
After removing duplicate rows and rows with null 'email' values from a Spark DataFrame, what is the count of distinct emails?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName('DataQuality').getOrCreate()
data = [(1, 'alice@example.com'), (2, 'bob@example.com'), (3, None),
        (4, 'alice@example.com'), (5, None)]
df = spark.createDataFrame(data, ['id', 'email'])
df_clean = df.dropDuplicates(['email']).filter(col('email').isNotNull())
count = df_clean.select('email').distinct().count()
print(count)
Attempts: 2 left
💡 Hint
Duplicates and nulls are removed before counting distinct emails.
✗ Incorrect
Only 'alice@example.com' and 'bob@example.com' remain after cleaning, so count is 2.
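The deduplicate-then-drop-nulls step reduces to set semantics, which a short plain-Python sketch makes concrete (this is the equivalent logic, not the Spark API):

```python
# Same data as the challenge: two duplicates of alice, two nulls.
data = [(1, 'alice@example.com'), (2, 'bob@example.com'), (3, None),
        (4, 'alice@example.com'), (5, None)]
# A set comprehension deduplicates; the condition drops the nulls.
emails = {email for _, email in data if email is not None}
print(len(emails))  # 2
```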
🔧 Debug
Advanced · 2:00 remaining
Identify the error in Spark data validation code
What error will this Spark code raise when checking for negative values in the 'salary' column?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName('DataQuality').getOrCreate()
data = [(1, 5000), (2, -1000), (3, 7000)]
df = spark.createDataFrame(data, ['id', 'salary'])
invalid_rows = df.filter(col('salary') < 0).collect()
print(invalid_rows[0]['salary'])
Attempts: 2 left
💡 Hint
Check if filter returns rows and how to access them.
✗ Incorrect
This code raises no error: the filter returns the single row with salary < 0, collect() returns a list of Row objects, and accessing the first element and its 'salary' field both work fine.
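A plain-Python sketch of the same access pattern shows why nothing fails (dicts stand in for Spark Row objects, which also support key access):

```python
# Rows modeled as dicts; Spark Row objects support the same ['salary'] lookup.
rows = [{'id': 1, 'salary': 5000}, {'id': 2, 'salary': -1000}, {'id': 3, 'salary': 7000}]
invalid = [r for r in rows if r['salary'] < 0]   # like filter(...).collect()
print(invalid[0]['salary'])  # -1000
```

The lookup only breaks if the filter matched nothing, in which case `invalid[0]` would raise an IndexError; here one row matches.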
❓ Visualization
Advanced · 1:30 remaining
Interpreting Spark DataFrame summary statistics
Given the code below, which prints summary statistics for a Spark DataFrame's 'score' column, what is the median score?
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('DataQuality').getOrCreate()
data = [(1, 50), (2, 80), (3, 90), (4, 70), (5, 60)]
df = spark.createDataFrame(data, ['id', 'score'])
summary = df.describe('score')
summary.show()
Attempts: 2 left
💡 Hint
Spark's describe() does not provide median.
✗ Incorrect
The describe() method shows count, mean, stddev, min, max but not median.
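To actually obtain a median you need a separate computation; in Spark this is typically `df.approxQuantile('score', [0.5], 0.0)`, and the exact value can be checked with the standard-library `statistics` module:

```python
import statistics

# Same scores as the challenge data.
scores = [50, 80, 90, 70, 60]
# Median is the middle value of the sorted list: 50, 60, 70, 80, 90.
print(statistics.median(scores))  # 70
```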
🚀 Application
Expert · 2:30 remaining
Choosing the best approach to prevent downstream failures
You have a Spark pipeline that fails downstream due to unexpected nulls in a critical column. Which approach best prevents these failures?
Attempts: 2 left
💡 Hint
Prevent problems early by cleaning data before processing.
✗ Incorrect
Validating and filtering out nulls early avoids errors later; ignoring the problem or skipping checks lets failures propagate downstream, and blindly replacing nulls can introduce incorrect data.
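One way to fail fast rather than downstream is an explicit guard that rejects nulls in the critical column before processing begins. This is a hypothetical plain-Python sketch of the pattern (the function name and row layout are assumptions, not a Spark API):

```python
def validate_no_nulls(rows, column_index):
    """Raise early if any row has a null in the critical column."""
    bad = [row for row in rows if row[column_index] is None]
    if bad:
        raise ValueError(f"{len(bad)} rows have null in column {column_index}")
    return rows

# Clean data passes through unchanged; dirty data fails at the boundary.
clean = validate_no_nulls([(1, 'a'), (2, 'b')], 1)
```

In a real Spark pipeline the same idea is a null-count check (e.g. `df.filter(col('email').isNull()).count()`) run before the expensive stages, so bad input is caught at the boundary instead of mid-pipeline.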