Challenge - 5 Problems
Structured Streaming Master
Get all challenges correct to earn this badge!
Test your skills under time pressure!
❓ Predict Output
intermediate · 2:00 remaining
Output of a simple streaming aggregation
What will be the output of the following Structured Streaming code snippet after processing the input data?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col

spark = SparkSession.builder.appName('Test').getOrCreate()

# Simulated streaming input
input_data = spark.readStream.format('rate').option('rowsPerSecond', 1).load()

# Aggregate count of rows per 10-second window
agg = input_data.groupBy(window(col('timestamp'), '10 seconds')).count()

query = agg.writeStream.format('console').outputMode('complete').start()
query.processAllAvailable()
query.stop()
💡 Hint
Think about how the 'rate' source generates data and how window aggregation works in streaming.
✗ Incorrect
The 'rate' source continuously generates rows with a `timestamp` and a monotonically increasing `value`. Grouping by a 10-second window and counting produces one row per window. Because the output mode is 'complete', every trigger prints the entire aggregation table (all windows seen so far, with their counts) to the console.
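The windowing behavior can be illustrated without Spark: a tumbling window simply buckets each timestamp by its window start. A minimal pure-Python sketch (the helper name and sample timestamps are invented for illustration, not part of the quiz code):

```python
from collections import Counter

def window_start(epoch_seconds, size_seconds):
    # Start of the tumbling window that contains this timestamp
    return epoch_seconds - (epoch_seconds % size_seconds)

# Hypothetical rate-source timestamps: one row per second for 25 seconds
timestamps = list(range(25))

# Count rows per 10-second window, mirroring groupBy(window(...)).count()
counts = Counter(window_start(ts, 10) for ts in timestamps)
# Windows [0, 10) and [10, 20) hold 10 rows each; [20, 30) holds 5
```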
❓ Predict Output
intermediate · 1:30 remaining
Number of rows in streaming output with append mode
Given a streaming DataFrame reading from a socket source with the following code, how many rows will be output after processing 3 lines of input if the output mode is 'append'?
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SocketTest').getOrCreate()

lines = spark.readStream.format('socket').option('host', 'localhost').option('port', 9999).load()

query = lines.writeStream.format('console').outputMode('append').start()
# Assume 3 lines are sent to the socket
query.processAllAvailable()
query.stop()
💡 Hint
Append mode outputs only new rows as they arrive.
✗ Incorrect
In append mode, only the rows added to the result table since the last trigger are output. Each line read from the socket becomes one new row, so 3 input lines produce 3 output rows.
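For a non-aggregating query like this one, append semantics can be sketched in plain Python: each micro-batch emits exactly its new rows, and nothing is ever re-emitted. (A simulation of the idea, not the Spark API; the function name and sample batches are illustrative.)

```python
def append_mode_output(micro_batches):
    # Append mode: each trigger emits only the rows that arrived in that batch
    emitted = []
    for batch in micro_batches:
        emitted.extend(batch)
    return emitted

# Three socket lines arriving across two triggers
out = append_mode_output([['hello'], ['spark', 'streaming']])
# 3 input lines -> 3 output rows
```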
🔧 Debug
advanced · 2:00 remaining
Identify the error in this streaming query
What error will this Structured Streaming code produce when executed?
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('ErrorTest').getOrCreate()

input_stream = spark.readStream.format('jdbc').option('header', 'true').load('/path/to/folder')

query = input_stream.writeStream.format('console').outputMode('append').start()
query.processAllAvailable()
query.stop()
💡 Hint
Check if the 'jdbc' format supports streaming input.
✗ Incorrect
The 'jdbc' format is a batch-only data source; it does not support streamed reading, so calling spark.readStream.format('jdbc')...load() makes Spark throw an AnalysisException.
🧠 Conceptual
advanced · 1:00 remaining
Understanding output modes in Structured Streaming
Which output mode should be used to update only the rows that have changed in a streaming aggregation query?
💡 Hint
Think about which mode outputs only changed rows without re-outputting all data.
✗ Incorrect
Update mode outputs only the result-table rows that have changed since the last trigger, which makes it more efficient than complete mode for large streaming aggregations.
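The contrast between update and complete mode for a running count can be simulated in plain Python (an illustrative sketch of the semantics, not the Spark API; names and sample batches are invented):

```python
from collections import Counter

def run_aggregation(micro_batches, mode):
    # Maintain a running count per key and emit per trigger according to the mode
    state, outputs = Counter(), []
    for batch in micro_batches:
        changed = set(batch)
        state.update(batch)
        if mode == 'complete':
            # Whole result table every trigger
            outputs.append(dict(state))
        elif mode == 'update':
            # Only the rows whose counts changed this trigger
            outputs.append({k: state[k] for k in changed})
    return outputs

batches = [['a', 'b'], ['a']]
# complete: [{'a': 1, 'b': 1}, {'a': 2, 'b': 1}]
# update:   [{'a': 1, 'b': 1}, {'a': 2}]
```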
🚀 Application
expert · 2:30 remaining
Predict the output schema of a streaming aggregation
Given the following streaming aggregation code, what will be the schema of the output DataFrame?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col

spark = SparkSession.builder.appName('SchemaTest').getOrCreate()

input_stream = spark.readStream.format('rate').load()

agg = input_stream.groupBy(
    window(col('timestamp'), '5 minutes'),
    (col('value') % 10).alias('mod_value')
).count()
# What is the schema of 'agg'?
💡 Hint
Consider the groupBy keys and aggregation columns.
✗ Incorrect
The output schema has three columns: `window`, a struct with `start` and `end` timestamp fields; `mod_value`, the second grouping key (value % 10), of type long; and `count`, also of type long.
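The shape of that result can be mimicked in plain Python by grouping hypothetical (timestamp, value) rows on the same two keys the quiz code uses: the 5-minute window start and value % 10 (the sample rows and helper are invented for illustration):

```python
from collections import Counter

# Hypothetical rate-source rows: (epoch_seconds, value)
rows = [(0, 1), (10, 11), (30, 2), (310, 3)]

def window_start(ts, size_seconds):
    # Start of the tumbling window containing this timestamp
    return ts - ts % size_seconds

# Keys mirror groupBy(window(timestamp, '5 minutes'), value % 10)
counts = Counter((window_start(ts, 300), v % 10) for ts, v in rows)
# Values 1 and 11 share window [0, 300) and mod_value 1, so they collapse into one group
```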