Which of the following is a key reason why DataFrames are preferred over RDDs in Apache Spark?
Think about how Spark optimizes query execution for structured data.
DataFrames carry a schema and are executed through Spark's Catalyst optimizer, which plans and optimizes each query before it runs. RDDs expose no schema to Spark, so their operations cannot be optimized this way, making DataFrames faster and more efficient for structured data.
What will be the output of the following Spark code snippet?
rdd = spark.sparkContext.parallelize([(1,), (2,), (3,), (4,)])
df = rdd.toDF(['numbers'])
count_rdd = rdd.count()
count_df = df.count()
print(count_rdd, count_df)
Both RDD and DataFrame represent the same data here.
The RDD and the DataFrame built from it contain the same 4 rows, so both counts are 4 and the output is "4 4".
Given the following code, what is the schema of the DataFrame?
data = [(1, 'Alice', 29), (2, 'Bob', 31)]
df = spark.createDataFrame(data, schema=['id', 'name', 'age'])
df.printSchema()
Check the default data types inferred by Spark for integers and strings.
Spark infers Python integers as long (64-bit) and Python strings as string, with every column nullable = true. The printed schema is therefore id: long, name: string, age: long.
What error will occur when running this code?
rdd = spark.sparkContext.parallelize([(1, 'a'), (2, 'b')])
rdd.select('1').show()
Consider which Spark objects support the select() method.
select() is a DataFrame method; RDDs do not define it. Calling select() on an RDD therefore raises an AttributeError before any Spark job runs.
You have a large dataset and want to calculate the average age grouped by city. Which approach will be faster and why?
Think about how Spark optimizes structured data operations.
DataFrame aggregations run through Spark's Catalyst optimizer and the Tungsten execution engine, which optimize the grouping and the average, so a DataFrame groupBy is typically faster than hand-written RDD map/reduce operations.