Challenge - 5 Problems
Kafka Spark Reader Master
Get all challenges correct to earn this badge!
Test your skills under time pressure!
❓ Predict Output
Intermediate · 2:00 remaining
What is the output schema of this Spark Kafka read?
Given the following Spark code to read from Kafka, what is the schema of the resulting DataFrame?
Apache Spark
df = spark.read.format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "topic1") \
    .load()
df.printSchema()
Attempts:
2 left
💡 Hint
Kafka message keys and values are binary by default in Spark.
✗ Incorrect
When reading from Kafka, Spark returns a DataFrame whose key and value columns are binary, plus the metadata columns topic (string), partition (integer), offset (long), timestamp (timestamp), and timestampType (integer), all nullable.
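For reference, df.printSchema() for a Kafka source should print something like the following (the column names and types are fixed by the Kafka connector):

```text
root
 |-- key: binary (nullable = true)
 |-- value: binary (nullable = true)
 |-- topic: string (nullable = true)
 |-- partition: integer (nullable = true)
 |-- offset: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampType: integer (nullable = true)
```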
❓ Data Output
Intermediate · 1:30 remaining
What is the count of messages read from Kafka?
Assuming the Kafka topic 'topic1' has 100 messages, what will be the output of this code?
Apache Spark
df = spark.read.format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "topic1") \
    .load()
count = df.count()
print(count)
Attempts:
2 left
💡 Hint
Spark reads all available messages in the topic partition(s) when loading.
✗ Incorrect
The count() action returns the number of rows in the DataFrame. A batch Kafka read consumes each subscribed partition from its earliest to its latest available offset by default, so all messages in the topic are read. Since the topic has 100 messages, the count is 100.
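As a sketch, the batch defaults can be spelled out explicitly; for batch reads, startingOffsets defaults to "earliest" and endingOffsets to "latest", which is why the whole topic is counted. This assumes an existing SparkSession `spark` and a reachable broker at localhost:9092:

```python
# Batch read with the default offset options made explicit (assumed setup:
# a broker at localhost:9092 whose topic1 holds 100 messages).
df = (spark.read.format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "topic1")
      .option("startingOffsets", "earliest")  # batch default
      .option("endingOffsets", "latest")      # batch default
      .load())
print(df.count())  # 100 under the assumed setup
```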
🔧 Debug
Advanced · 2:00 remaining
Why does this Spark Kafka read code fail with a timeout?
This code snippet to read from Kafka fails with a timeout error. What is the most likely cause?
Apache Spark
df = spark.read.format("kafka") \
    .option("kafka.bootstrap.servers", "wronghost:9092") \
    .option("subscribe", "topic1") \
    .load()
Attempts:
2 left
💡 Hint
Timeout errors usually relate to network or server connection issues.
✗ Incorrect
A timeout error when reading from Kafka usually means Spark cannot connect to the Kafka bootstrap server. Here the address 'wronghost:9092' is unreachable, so the connection attempt hangs until it times out.
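A minimal sketch of the fix, assuming the broker actually runs at localhost:9092 (substitute your real broker address); note that any option prefixed with kafka. is passed through to the underlying Kafka client:

```python
# Point kafka.bootstrap.servers at a reachable broker. The host:port here
# (localhost:9092) is an assumption; use your cluster's actual address.
df = (spark.read.format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "topic1")
      .load())
```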
🚀 Application
Advanced · 2:30 remaining
How to convert Kafka message values to strings in Spark?
You read Kafka messages with Spark and want to convert the 'value' column from binary to string. Which code snippet correctly does this?
Apache Spark
from pyspark.sql.functions import col, expr

# df is the DataFrame read from Kafka
converted_df = ???
Attempts:
2 left
💡 Hint
Use Spark SQL functions or DataFrame API to cast binary to string.
✗ Incorrect
Casting the 'value' column from binary to string with col("value").cast("string") is the correct DataFrame-API approach. selectExpr("CAST(value AS STRING)") also works but is less flexible. value.toString() is not a valid Spark SQL expression.
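A sketch of both working forms, assuming df is the DataFrame loaded from the Kafka source as in the earlier snippets:

```python
from pyspark.sql.functions import col

# DataFrame API: cast the binary 'value' column to a UTF-8 string.
converted_df = df.select(col("value").cast("string").alias("value"))

# SQL-expression form: equivalent result via selectExpr.
converted_df = df.selectExpr("CAST(value AS STRING) AS value")
```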
🧠 Conceptual
Expert · 1:30 remaining
What happens if you use 'startingOffsets' option with 'latest' in Spark Kafka read?
When reading from Kafka with Spark, setting option startingOffsets to 'latest' means:
Attempts:
2 left
💡 Hint
Think about what 'latest' means in Kafka offset context.
✗ Incorrect
Setting startingOffsets to 'latest' tells Spark to start reading only new messages that arrive after the streaming query starts, ignoring older messages.
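For a streaming read, the option is set on readStream; a sketch assuming a broker at localhost:9092 ("latest" is also the default starting point for streaming queries):

```python
# Streaming read that begins at the latest offsets: records already in the
# topic are skipped; only messages arriving after the query starts are read.
stream_df = (spark.readStream.format("kafka")
             .option("kafka.bootstrap.servers", "localhost:9092")
             .option("subscribe", "topic1")
             .option("startingOffsets", "latest")  # streaming default
             .load())
```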