Recall & Review
beginner
What is the purpose of detecting null values in a dataset?
Detecting null values helps identify missing or incomplete data, which can affect analysis and model accuracy. Handling nulls properly ensures cleaner, more reliable results.
beginner
How do you check for null values in a Spark DataFrame?
You can use the filter or where method with isNull() on a column, for example: df.filter(df['column'].isNull()).show().
intermediate
What is the difference between dropDuplicates() and distinct() in Spark?
dropDuplicates() removes duplicate rows based on the specified columns, while distinct() removes duplicate rows considering all columns.
beginner
Why is it important to detect duplicate rows in data?
Duplicate rows can bias analysis and models by over-representing some data points. Removing duplicates ensures fair and accurate insights.
intermediate
How can you count the number of nulls in each column of a Spark DataFrame?
Use aggregation with the sum and when functions: df.select([F.sum(F.when(F.col(c).isNull(), 1).otherwise(0)).alias(c) for c in df.columns]).show().
Which Spark function helps to find rows with null values in a column?
The isNull() function checks if a column value is null.
What does dropDuplicates(['col1', 'col2']) do in Spark?
It removes rows that have the same values in both col1 and col2, keeping the first occurrence of each combination.
Which method removes duplicate rows considering all columns in a Spark DataFrame?
distinct() removes duplicate rows based on all columns.
Why should you handle null values before analysis?
Null values can cause errors or bias if not handled properly.
How can you count nulls in each column of a Spark DataFrame?
This aggregation counts nulls by summing 1 for nulls and 0 otherwise.
Explain how to detect and handle null values in a Spark DataFrame.
Think about checking, counting, and then cleaning nulls.
Describe the difference between removing duplicates with dropDuplicates() and distinct() in Spark.
Focus on which columns each method considers.