Challenge - 5 Problems
Null and Duplicate Detection Master
Get all challenges correct to earn this badge!
Test your skills under time pressure!
❓ Predict Output
Intermediate · 2:00
Detecting Null Values Count in Spark DataFrame
What is the output of this Spark code snippet that counts null values in each column?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum

spark = SparkSession.builder.getOrCreate()
data = [(1, None), (2, 'a'), (None, 'b'), (4, None)]
df = spark.createDataFrame(data, ['id', 'value'])
null_counts = df.select([sum(col(c).isNull().cast('int')).alias(c) for c in df.columns])
null_counts.show()
💡 Hint
Remember that isNull() returns a boolean, which we cast to int to sum nulls.
📝 Explanation
The code counts nulls in each column by checking isNull(), casting True to 1 and False to 0, then summing. The 'id' column has one null and 'value' has two, so the output is a single-row DataFrame with id = 1 and value = 2.
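The same per-column null count can be sketched in plain Python, without a Spark session, using the rows from the challenge:

```python
# Rows and column names from the challenge, as plain tuples.
rows = [(1, None), (2, 'a'), (None, 'b'), (4, None)]
columns = ['id', 'value']

# Mirror sum(col(c).isNull().cast('int')): a null check yields True/False,
# which casts to 1/0, and summing those gives the null count per column.
null_counts = {
    col: sum(int(row[i] is None) for row in rows)
    for i, col in enumerate(columns)
}
print(null_counts)  # {'id': 1, 'value': 2}
```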
❓ Data Output
Intermediate · 1:30
Number of Duplicate Rows in Spark DataFrame
Given this Spark DataFrame, how many duplicate rows does it contain?
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [(1, 'a'), (2, 'b'), (1, 'a'), (3, 'c'), (2, 'b'), (4, 'd')]
df = spark.createDataFrame(data, ['id', 'value'])
duplicates_count = df.count() - df.dropDuplicates().count()
print(duplicates_count)
💡 Hint
Duplicates are rows that appear more than once.
📝 Explanation
The DataFrame has 6 rows. After dropping duplicates, 4 unique rows remain. So duplicates count is 6 - 4 = 2.
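The duplicate count, total rows minus distinct rows, can be reproduced in plain Python with the same data:

```python
rows = [(1, 'a'), (2, 'b'), (1, 'a'), (3, 'c'), (2, 'b'), (4, 'd')]

# Plain-Python analogue of df.count() - df.dropDuplicates().count():
# a set keeps only distinct tuples, so the difference is the duplicate count.
duplicates_count = len(rows) - len(set(rows))
print(duplicates_count)  # 2
```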
🔧 Debug
Advanced · 2:00
Identify the Error in Null Filtering Code
What happens when this Spark code tries to filter rows with nulls in the 'age' column?
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [(1, 25), (2, None), (3, 30)]
df = spark.createDataFrame(data, ['id', 'age'])
filtered = df.filter(df['age'] == None)
filtered.show()
💡 Hint
In Spark, '==' does not work to check null values.
📝 Explanation
Comparing with None using '==' raises no error but returns no rows: in Spark SQL, any comparison with null evaluates to null (not true), so every row is filtered out. Use the isNull() method instead.
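A minimal sketch of why the filter matches nothing, emulating SQL's three-valued logic in plain Python (the fix in Spark itself is df.filter(df['age'].isNull())):

```python
rows = [(1, 25), (2, None), (3, 30)]

# In Spark SQL, `age == None` compares against a null literal, and any
# comparison involving null evaluates to null (unknown), never True.
def eq_null(value):
    return None  # three-valued logic: x == null -> null for every x

# filter() keeps a row only when the predicate is exactly True,
# so the `== None` predicate keeps nothing.
broken = [r for r in rows if eq_null(r[1]) is True]
print(broken)  # []

# An is-null check returns a real boolean, so the null row is found.
fixed = [r for r in rows if r[1] is None]
print(fixed)  # [(2, None)]
```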
🚀 Application
Advanced · 2:00
Remove Duplicate Rows Based on Subset of Columns
Which code snippet correctly removes duplicates based only on the 'email' column in a Spark DataFrame?
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [(1, 'a@example.com'), (2, 'b@example.com'), (3, 'a@example.com'), (4, 'c@example.com')]
df = spark.createDataFrame(data, ['id', 'email'])
💡 Hint
dropDuplicates expects a list of column names to consider for duplicates.
📝 Explanation
Option B is correct: df.dropDuplicates(['email']) considers only the 'email' column when comparing rows, so rows sharing an email are collapsed to one.
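The keep-one-row-per-key behavior of dropDuplicates(['email']) can be sketched in plain Python; note that Spark does not guarantee *which* duplicate survives unless you impose an ordering first:

```python
rows = [(1, 'a@example.com'), (2, 'b@example.com'),
        (3, 'a@example.com'), (4, 'c@example.com')]

# Keep the first row seen for each email, mirroring
# df.dropDuplicates(['email']) on the subset column.
seen = set()
deduped = []
for row in rows:
    email = row[1]
    if email not in seen:
        seen.add(email)
        deduped.append(row)

print(deduped)  # [(1, 'a@example.com'), (2, 'b@example.com'), (4, 'c@example.com')]
```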
🧠 Conceptual
Expert · 1:30
Understanding Null Handling in Spark Aggregations
When using Spark's aggregation functions like sum() on a column with null values, what is the behavior?
💡 Hint
Think about how SQL aggregates handle nulls.
📝 Explanation
Spark aggregation functions ignore nulls by default: sum() adds only the non-null values, and returns null (not 0) if every value in the column is null.
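The null-skipping behavior of sum() can be illustrated in plain Python, assuming a hypothetical column [10, None, 20, None]:

```python
values = [10, None, 20, None]

# Spark's sum(), like SQL SUM, skips nulls: only non-null values are added.
total = sum(v for v in values if v is not None)
print(total)  # 30

# If every value is null, Spark's sum() returns null (None here), not 0.
all_null = [None, None]
non_null = [v for v in all_null if v is not None]
total_all_null = sum(non_null) if non_null else None
print(total_all_null)  # None
```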