Apache Spark · ~20 mins

Null and duplicate detection in Apache Spark - Practice Problems & Coding Challenges

Challenge - 5 Problems
Predict Output · intermediate
Detecting Null Values Count in Spark DataFrame
What is the output of this Spark code snippet that counts null values in each column?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum

spark = SparkSession.builder.getOrCreate()
data = [(1, None), (2, 'a'), (None, 'b'), (4, None)]
df = spark.createDataFrame(data, ['id', 'value'])

null_counts = df.select([sum(col(c).isNull().cast('int')).alias(c) for c in df.columns])
null_counts.show()
A.
+---+-----+
| id|value|
+---+-----+
|  1|    2|
+---+-----+
B.
+---+-----+
| id|value|
+---+-----+
|  2|    1|
+---+-----+
C.
+---+-----+
| id|value|
+---+-----+
|  1|    1|
+---+-----+
D.
+---+-----+
| id|value|
+---+-----+
|  0|    2|
+---+-----+
💡 Hint
Remember that isNull() returns a boolean, which we cast to int to sum nulls.
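To sanity-check the counting logic, here is a minimal plain-Python sketch of the same per-column null count; it mirrors the `sum(col(c).isNull().cast('int'))` expression without needing a Spark session:

```python
# Plain-Python sketch of the per-column null count from the snippet above.
data = [(1, None), (2, 'a'), (None, 'b'), (4, None)]
columns = ['id', 'value']

# For each column, count the rows whose value in that position is None.
null_counts = {
    c: sum(1 for row in data if row[i] is None)
    for i, c in enumerate(columns)
}
print(null_counts)  # {'id': 1, 'value': 2}
```

One null in `id` and two in `value`, matching the table in option A.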
Data Output · intermediate
Number of Duplicate Rows in Spark DataFrame
Given this Spark DataFrame, how many duplicate rows does it contain?
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [(1, 'a'), (2, 'b'), (1, 'a'), (3, 'c'), (2, 'b'), (4, 'd')]
df = spark.createDataFrame(data, ['id', 'value'])
duplicates_count = df.count() - df.dropDuplicates().count()
print(duplicates_count)
A. 2
B. 3
C. 1
D. 4
💡 Hint
Duplicates are rows that appear more than once.
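The expression `df.count() - df.dropDuplicates().count()` is just "total rows minus distinct rows"; a plain-Python sketch of the same arithmetic, using the data above:

```python
# Sketch of df.count() - df.dropDuplicates().count():
# duplicate count = total rows minus distinct rows.
data = [(1, 'a'), (2, 'b'), (1, 'a'), (3, 'c'), (2, 'b'), (4, 'd')]
duplicates_count = len(data) - len(set(data))
print(duplicates_count)  # 6 rows, 4 distinct -> 2
```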
🔧 Debug · advanced
Identify the Error in Null Filtering Code
What happens when this Spark code tries to filter rows with nulls in the 'age' column?
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [(1, 25), (2, None), (3, 30)]
df = spark.createDataFrame(data, ['id', 'age'])

filtered = df.filter(df['age'] == None)
filtered.show()
A. No error, but the filtered DataFrame is empty
B. SyntaxError
C. TypeError
D. The filter returns no rows because '==' does not detect nulls
💡 Hint
In Spark, '==' comparisons against null follow SQL semantics and never evaluate to true; use isNull() to check for nulls.
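The behaviour comes from SQL's three-valued logic: comparing anything to NULL yields NULL (unknown), and a filter keeps only rows where the predicate is true. A plain-Python sketch of that semantics (not Spark itself):

```python
# Sketch of SQL three-valued equality: a comparison involving NULL
# yields NULL (unknown), never True, so filter() matches no rows.
def sql_eq(a, b):
    if a is None or b is None:
        return None          # unknown, not False
    return a == b

rows = [(1, 25), (2, None), (3, 30)]
filtered = [r for r in rows if sql_eq(r[1], None) is True]
print(filtered)  # [] -- even the row with a null age does not match
```

The working PySpark idiom for the intent above is `df.filter(df['age'].isNull())`.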
🚀 Application · advanced
Remove Duplicate Rows Based on Subset of Columns
Which code snippet correctly removes duplicates based only on the 'email' column in a Spark DataFrame?
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [(1, 'a@example.com'), (2, 'b@example.com'), (3, 'a@example.com'), (4, 'c@example.com')]
df = spark.createDataFrame(data, ['id', 'email'])
A. df.dropDuplicates(['id']).show()
B. df.dropDuplicates(['email']).show()
C. df.dropDuplicates().show()
D. df.dropDuplicates('email').show()
💡 Hint
dropDuplicates expects a list of column names to consider for duplicates.
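Conceptually, `dropDuplicates(['email'])` keeps one row per distinct email. A plain-Python sketch of that effect (note: Spark does not guarantee *which* row survives per key; this sketch keeps the first one seen):

```python
# Sketch of dropDuplicates(['email']): keep one row per distinct email.
data = [(1, 'a@example.com'), (2, 'b@example.com'),
        (3, 'a@example.com'), (4, 'c@example.com')]

seen = set()
deduped = []
for row in data:
    email = row[1]
    if email not in seen:
        seen.add(email)
        deduped.append(row)

print(deduped)  # three rows, one per distinct email
```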
🧠 Conceptual · expert
Understanding Null Handling in Spark Aggregations
When using Spark's aggregation functions like sum() on a column with null values, what is the behavior?
A. Null values are treated as zero in the aggregation.
B. Null values cause the aggregation to return null.
C. Null values are ignored; sum() returns the total of the non-null values.
D. The aggregation raises an error if null values exist.
💡 Hint
Think about how SQL aggregates handle nulls.
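Like SQL aggregates, Spark's sum() skips nulls and totals only the non-null values. A plain-Python sketch of that behaviour:

```python
# Sketch of SQL/Spark-style sum() over a column containing nulls:
# null entries are skipped, not treated as zero or propagated.
values = [10, None, 5, None, 7]
total = sum(v for v in values if v is not None)
print(total)  # 22
```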