Apache Spark · ~20 mins

Null and duplicate detection in Apache Spark - Practice Problems & Coding Challenges

Challenge - 5 Problems
Predict Output · intermediate
Detecting Null Values Count in Spark DataFrame
What is the output of this Spark code snippet that counts null values in each column?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum

spark = SparkSession.builder.getOrCreate()
data = [(1, None), (2, 'a'), (None, 'b'), (4, None)]
df = spark.createDataFrame(data, ['id', 'value'])

null_counts = df.select([sum(col(c).isNull().cast('int')).alias(c) for c in df.columns])
null_counts.show()
A.
+---+-----+
| id|value|
+---+-----+
|  1|    2|
+---+-----+
B.
+---+-----+
| id|value|
+---+-----+
|  2|    1|
+---+-----+
C.
+---+-----+
| id|value|
+---+-----+
|  1|    1|
+---+-----+
D.
+---+-----+
| id|value|
+---+-----+
|  0|    2|
+---+-----+
💡 Hint
Remember that isNull() returns a boolean, which we cast to int to sum nulls.
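To sanity-check the counting logic, here is a minimal plain-Python sketch of the same per-column null count; it mirrors the `sum(col(c).isNull().cast('int'))` expression without needing a Spark session:

```python
# Plain-Python sketch of the per-column null count from the snippet above.
data = [(1, None), (2, 'a'), (None, 'b'), (4, None)]
columns = ['id', 'value']

# For each column, count the rows whose value in that position is None.
null_counts = {
    c: sum(1 for row in data if row[i] is None)
    for i, c in enumerate(columns)
}
print(null_counts)  # {'id': 1, 'value': 2}
```

One null in `id` and two in `value`, matching the table in option A.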
Data Output · intermediate
Number of Duplicate Rows in Spark DataFrame
Given this Spark DataFrame, how many duplicate rows does it contain?
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [(1, 'a'), (2, 'b'), (1, 'a'), (3, 'c'), (2, 'b'), (4, 'd')]
df = spark.createDataFrame(data, ['id', 'value'])
duplicates_count = df.count() - df.dropDuplicates().count()
print(duplicates_count)
A. 2
B. 3
C. 1
D. 4
💡 Hint
Duplicates are rows that appear more than once.
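The expression `df.count() - df.dropDuplicates().count()` is just "total rows minus distinct rows"; a plain-Python sketch of the same arithmetic, using the data above:

```python
# Sketch of df.count() - df.dropDuplicates().count():
# duplicate count = total rows minus distinct rows.
data = [(1, 'a'), (2, 'b'), (1, 'a'), (3, 'c'), (2, 'b'), (4, 'd')]
duplicates_count = len(data) - len(set(data))
print(duplicates_count)  # 6 rows, 4 distinct -> 2
```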
🔧 Debug · advanced
Identify the Error in Null Filtering Code
What happens when this Spark code tries to filter rows with nulls in the 'age' column?
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [(1, 25), (2, None), (3, 30)]
df = spark.createDataFrame(data, ['id', 'age'])

filtered = df.filter(df['age'] == None)
filtered.show()
A. No error, but the filtered DataFrame is empty
B. SyntaxError
C. TypeError
D. The filter returns no rows because '==' does not detect nulls
💡 Hint
In Spark, '==' comparisons against null follow SQL semantics and never evaluate to true; use isNull() to check for nulls.
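The behaviour comes from SQL's three-valued logic: comparing anything to NULL yields NULL (unknown), and a filter keeps only rows where the predicate is true. A plain-Python sketch of that semantics (not Spark itself):

```python
# Sketch of SQL three-valued equality: a comparison involving NULL
# yields NULL (unknown), never True, so filter() matches no rows.
def sql_eq(a, b):
    if a is None or b is None:
        return None          # unknown, not False
    return a == b

rows = [(1, 25), (2, None), (3, 30)]
filtered = [r for r in rows if sql_eq(r[1], None) is True]
print(filtered)  # [] -- even the row with a null age does not match
```

The working PySpark idiom for the intent above is `df.filter(df['age'].isNull())`.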
🚀 Application · advanced
Remove Duplicate Rows Based on Subset of Columns
Which code snippet correctly removes duplicates based only on the 'email' column in a Spark DataFrame?
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [(1, 'a@example.com'), (2, 'b@example.com'), (3, 'a@example.com'), (4, 'c@example.com')]
df = spark.createDataFrame(data, ['id', 'email'])
A. df.dropDuplicates(['id']).show()
B. df.dropDuplicates(['email']).show()
C. df.dropDuplicates().show()
D. df.dropDuplicates('email').show()
💡 Hint
dropDuplicates expects a list of column names to consider for duplicates.
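Conceptually, `dropDuplicates(['email'])` keeps one row per distinct email. A plain-Python sketch of that effect (note: Spark does not guarantee *which* row survives per key; this sketch keeps the first one seen):

```python
# Sketch of dropDuplicates(['email']): keep one row per distinct email.
data = [(1, 'a@example.com'), (2, 'b@example.com'),
        (3, 'a@example.com'), (4, 'c@example.com')]

seen = set()
deduped = []
for row in data:
    email = row[1]
    if email not in seen:
        seen.add(email)
        deduped.append(row)

print(deduped)  # three rows, one per distinct email
```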
🧠 Conceptual · expert
Understanding Null Handling in Spark Aggregations
When using Spark's aggregation functions like sum() on a column with null values, what is the behavior?
A. Null values are treated as zero in the aggregation.
B. Null values cause the aggregation to return null.
C. Null values are ignored; sum() returns the total of the non-null values.
D. The aggregation raises an error if null values exist.
💡 Hint
Think about how SQL aggregates handle nulls.
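Like SQL aggregates, Spark's sum() skips nulls and totals only the non-null values. A plain-Python sketch of that behaviour:

```python
# Sketch of SQL/Spark-style sum() over a column containing nulls:
# null entries are skipped, not treated as zero or propagated.
values = [10, None, 5, None, 7]
total = sum(v for v in values if v is not None)
print(total)  # 22
```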