Apache Spark · data · ~10 mins

Data quality assertions in Apache Spark - Interactive Code Practice

Practice - 5 Tasks
Answer the questions below
Task 1: fill in the blank (easy)

Complete the code to check if the DataFrame has any null values.

Apache Spark
df.selectExpr('count(*) as total', 'count([1]) as non_null').show()
A. null
B. column_name
C. *
D. count
Common Mistakes
The blank must name a column that actually exists in the DataFrame; referencing a missing column raises an AnalysisException.
Writing count(null) does not count null entries; count() skips nulls, so it always returns 0.
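The distinction this blank tests (SQL's count(*) counts every row, while count(column) skips nulls) can be sketched in plain Python without a Spark cluster. The email values below are invented for illustration:

```python
# Plain-Python sketch of count(*) vs count(column_name) semantics.
emails = ["a@x.com", None, "b@x.com", None]

total = len(emails)                            # count(*): every row, nulls included
non_null = sum(e is not None for e in emails)  # count(email): nulls skipped
null_count = total - non_null

print(total, non_null, null_count)  # 4 2 2
```

The gap between the two counts is exactly the number of nulls, which is why comparing them works as a data-quality check.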
Task 2: fill in the blank (medium)

Complete the code to assert that the column 'age' has no null values.

Apache Spark
assert df.filter(df.age.[1](None)).count() == 0, 'Null values found in age column'
A. isNotNull
B. isNull
C. isnan
D. isEmpty
Common Mistakes
Using isNotNull() keeps the non-null rows, the opposite of what this assertion needs.
isnan() detects NaN values in floating-point columns, not nulls.
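The assertion pattern (count the offending rows, then require that count to be zero) can be sketched in plain Python, with a list of dicts standing in for df.filter(df.age.isNull()).count(); the rows are invented:

```python
# Rows standing in for a collected DataFrame; values are invented.
rows = [{"age": 25}, {"age": None}, {"age": 30}]

# Equivalent of df.filter(df.age.isNull()).count()
null_rows = sum(r["age"] is None for r in rows)

try:
    assert null_rows == 0, "Null values found in age column"
except AssertionError as err:
    print(err)  # prints: Null values found in age column
```

Because one row has a null age, the assertion fires and the custom message is printed.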
Task 3: fill in the blank (hard)

Fix the error in the code to assert that all values in 'salary' are positive.

Apache Spark
assert df.filter(df.salary [1] 0).count() == 0, 'Negative or zero salary found'
A. <=
B. >=
C. <
D. >
Common Mistakes
Using '>=' (or '>') selects the valid salaries instead of the offending ones, so the assertion tests the wrong set.
Using '<' catches negative salaries but lets zero through.
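The same check, sketched in plain Python with invented salary rows; the predicate mirrors df.filter(df.salary <= 0).count():

```python
# Invented rows: one valid salary, one zero, one negative.
rows = [{"salary": 5000}, {"salary": 0}, {"salary": -100}]

bad = sum(r["salary"] <= 0 for r in rows)  # '<=' catches both zero and negative
print(bad)  # 2 offending rows, so the assertion would fail
```

Note that with '<' instead of '<=' the zero salary would slip through and bad would be 1.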
Task 4: fill in the blank (hard)

Fill both blanks to create a dictionary of counts for each unique value in 'department' where count is greater than 5.

Apache Spark
dept_counts = {row['[1]']: row['[2]'] for row in df.groupBy('department').count().collect() if row['count'] > 5}
A. department
B. count
C. dept
D. value
Common Mistakes
Using keys such as 'dept' or 'value', which don't exist in the grouped rows; the only columns are 'department' and 'count'.
Confusing 'count' with other column names.
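Since collect() returns rows that can be indexed by column name, the comprehension runs unchanged over plain dicts; the departments and counts below are invented:

```python
# Stand-ins for the rows returned by df.groupBy('department').count().collect().
collected = [
    {"department": "eng", "count": 8},
    {"department": "hr", "count": 3},
    {"department": "sales", "count": 12},
]

dept_counts = {row["department"]: row["count"]
               for row in collected if row["count"] > 5}
print(dept_counts)  # {'eng': 8, 'sales': 12}
```

'hr' is dropped by the filter clause because its count is not greater than 5.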
Task 5: fill in the blank (hard)

Fill all three blanks to create a filtered DataFrame with no nulls in 'email' and 'phone' columns and only rows where 'age' is greater than 18.

Apache Spark
filtered_df = df.filter(df.email.[1]() & df.phone.[2]() & (df.age [3] 18))
A. isNotNull
B. >
C. isNull
D. <
Common Mistakes
Using isNull() instead of isNotNull() includes nulls.
Using '<' instead of '>' filters the wrong age range, keeping minors instead of adults.
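The three-way predicate can be sketched in plain Python with invented rows; the condition mirrors df.filter(df.email.isNotNull() & df.phone.isNotNull() & (df.age > 18)):

```python
# Invented rows: each fails (or passes) a different condition.
rows = [
    {"email": "a@x.com", "phone": "555-1", "age": 25},  # passes all three
    {"email": None,      "phone": "555-2", "age": 30},  # null email
    {"email": "c@x.com", "phone": None,    "age": 40},  # null phone
    {"email": "d@x.com", "phone": "555-4", "age": 17},  # under 18
]

filtered = [r for r in rows
            if r["email"] is not None and r["phone"] is not None and r["age"] > 18]
print(len(filtered))  # 1 row survives all three conditions
```

In actual PySpark, remember that the '&' operator requires each comparison to be parenthesized, as (df.age > 18) is in the exercise.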