Apache Spark · ~10 mins

Null and duplicate detection in Apache Spark - Interactive Code Practice

Practice: 5 Tasks
Answer the questions below.
Task 1 · Fill in the blank (easy)

Complete the code to count the number of null values in the 'age' column of the DataFrame.

Apache Spark
null_count = df.filter(df['age'].[1]()).count()
A. isNull
B. isNotNull
C. dropna
D. distinct
Common Mistakes
Using '==' or '!=' with None; under Spark's three-valued logic these comparisons evaluate to NULL, so the filter matches nothing.
Confusing the Column method isNull() with DataFrame-level methods like dropna() or distinct().
Task 2 · Fill in the blank (medium)

Complete the code to drop duplicate rows from the DataFrame.

Apache Spark
df_no_duplicates = df.[1]()
A. drop
B. dropDuplicates
C. distinct
D. dropna
Common Mistakes
Using dropna(), which removes rows containing nulls, not duplicate rows.
Reaching for distinct(), which also removes duplicate rows but always compares all columns; with no arguments dropDuplicates() behaves the same way, but only dropDuplicates() can be restricted to a subset of columns.
Task 3 · Fill in the blank (hard)

Complete the code to filter rows where the 'salary' column is not null.

Apache Spark
filtered_df = df.filter(df['salary'].[1]())
A. dropna
B. isNull
C. distinct
D. isNotNull
Common Mistakes
Using '!=' or '==' with None; under three-valued logic these evaluate to NULL and keep no rows.
Using DataFrame-level methods like dropna() where a Column-level test is needed.
Task 4 · Fill in the blank (hard)

Complete the code to drop duplicate rows based on the 'name' and 'age' columns.

Apache Spark
df_no_duplicates = df.dropDuplicates([[1], [2]])
A. 'name'
B. 'age'
C. 'salary'
D. df['name']
Common Mistakes
Using Column objects like df['name'] instead of strings.
Selecting incorrect columns like 'salary'.
Forgetting quotes around column names.
Task 5 · Fill in the blank (hard)

Fill all three blanks to count the number of duplicate groups based on 'name' and 'age' columns.

Apache Spark
dupe_groups_count = df.groupBy([1], [2]).count().filter(col('[3]') > 1).count()
A. 'name'
B. 'age'
C. count
D. 'salary'
Common Mistakes
Passing Column objects like df['name'] instead of strings to groupBy.
Filtering on the wrong column: the count() aggregation adds a column literally named 'count', so filtering on 'name' or 'age' is incorrect.
Omitting col(), or using the wrong syntax, in the filter expression.