
Null and duplicate detection in Apache Spark - Cheat Sheet & Quick Revision

Recall & Review
beginner
What is the purpose of detecting null values in a dataset?
Detecting null values helps identify missing or incomplete data, which can affect analysis and model accuracy. Handling nulls properly ensures cleaner, more reliable results.
beginner
How do you check for null values in a Spark DataFrame?
You can use the filter or where method with isNull() on a column, for example: df.filter(df['column'].isNull()).show().
intermediate
What is the difference between dropDuplicates() and distinct() in Spark?
dropDuplicates() removes duplicate rows based on a specified subset of columns (all columns if none are given), while distinct() always considers all columns.
beginner
Why is it important to detect duplicate rows in data?
Duplicate rows can bias analysis and models by over-representing some data points. Removing duplicates ensures fair and accurate insights.
intermediate
How can you count the number of nulls in each column of a Spark DataFrame?
Use aggregation with the sum and when functions from pyspark.sql.functions (conventionally imported as F): df.select([F.sum(F.when(F.col(c).isNull(), 1).otherwise(0)).alias(c) for c in df.columns]).show().
Which Spark function helps to find rows with null values in a column?
A. isNull()
B. isEmpty()
C. isNaN()
D. isDuplicate()
What does dropDuplicates(['col1', 'col2']) do in Spark?
A. Removes duplicate rows based on col1 and col2 values
B. Removes all duplicates in the DataFrame
C. Removes rows with nulls in col1 or col2
D. Removes rows where col1 equals col2
Which method removes duplicate rows considering all columns in a Spark DataFrame?
A. dropDuplicates()
B. distinct()
C. dropNulls()
D. filterDuplicates()
Why should you handle null values before analysis?
A. Nulls are always replaced with zeros automatically
B. Nulls improve model accuracy
C. Nulls increase dataset size
D. Nulls can cause errors or bias in results
How can you count nulls in each column of a Spark DataFrame?
A. Using count()
B. Using dropDuplicates()
C. Using sum(when(col.isNull(),1).otherwise(0))
D. Using filter(col.isNotNull())
Explain how to detect and handle null values in a Spark DataFrame.
Think about checking, counting, and then cleaning nulls.
Describe the difference between removing duplicates with dropDuplicates() and distinct() in Spark.
Focus on which columns each method considers.