
Null and duplicate detection in Apache Spark - Cheat Sheet & Quick Revision

Recall & Review
beginner
What is the purpose of detecting null values in a dataset?
Detecting null values helps identify missing or incomplete data, which can affect analysis and model accuracy. Handling nulls properly ensures cleaner, more reliable results.
beginner
How do you check for null values in a Spark DataFrame?
You can use the filter or where method with isNull() on a column, for example: df.filter(df['column'].isNull()).show().
intermediate
What is the difference between dropDuplicates() and distinct() in Spark?
dropDuplicates() removes duplicate rows based on a specified subset of columns (all columns if none are given), while distinct() always considers all columns.
beginner
Why is it important to detect duplicate rows in data?
Duplicate rows can bias analysis and models by over-representing some data points. Removing duplicates ensures fair and accurate insights.
intermediate
How can you count the number of nulls in each column of a Spark DataFrame?
Use aggregation with the sum and when functions from pyspark.sql.functions (conventionally imported as F): df.select([F.sum(F.when(F.col(c).isNull(), 1).otherwise(0)).alias(c) for c in df.columns]).show().
Which Spark function helps to find rows with null values in a column?
A. isNull()
B. isEmpty()
C. isNaN()
D. isDuplicate()
What does dropDuplicates(['col1', 'col2']) do in Spark?
A. Removes duplicate rows based on col1 and col2 values
B. Removes all duplicates in the DataFrame
C. Removes rows with nulls in col1 or col2
D. Removes rows where col1 equals col2
Which method removes duplicate rows considering all columns in a Spark DataFrame?
A. dropDuplicates()
B. distinct()
C. dropNulls()
D. filterDuplicates()
Why should you handle null values before analysis?
A. Nulls are always replaced with zeros automatically
B. Nulls improve model accuracy
C. Nulls increase dataset size
D. Nulls can cause errors or bias in results
How can you count nulls in each column of a Spark DataFrame?
A. Using count()
B. Using dropDuplicates()
C. Using sum(when(col.isNull(),1).otherwise(0))
D. Using filter(col.isNotNull())
Explain how to detect and handle null values in a Spark DataFrame.
Think about checking, counting, and then cleaning nulls.
Describe the difference between removing duplicates with dropDuplicates() and distinct() in Spark.
Focus on which columns each method considers.