Challenge - 5 Problems
Output Modes Mastery
Get all challenges correct to earn this badge!
Test your skills under time pressure!
❓ Predict Output
Intermediate · 2:00 remaining
Output mode behavior with streaming aggregation
Given a streaming DataFrame that counts words in a stream, what will be the output after the first batch if the output mode is set to 'append'?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, explode
import os, tempfile

spark = SparkSession.builder.appName('Test').getOrCreate()

# Simulate streaming input by writing the line to a directory read as a
# file stream ('writeStream' requires a streaming DataFrame, not a static one)
input_dir = tempfile.mkdtemp()
with open(os.path.join(input_dir, 'input.txt'), 'w') as f:
    f.write('hello world hello')

lines = spark.readStream.text(input_dir)
words = lines.select(explode(split(lines.value, ' ')).alias('word'))
word_counts = words.groupBy('word').count()

query = word_counts.writeStream.outputMode('append').format('console').start()
query.processAllAvailable()
query.stop()
💡 Hint
Think about which output modes support aggregation in streaming queries.
✗ Incorrect
The 'append' output mode only emits rows that have been added to the result table and will never change. Aggregated counts update existing rows, so they generally require 'complete' or 'update' mode. Without a watermark, Spark rejects this query with an AnalysisException at start; with a watermark, an aggregated row is emitted only once the watermark finalizes it, so the first batch produces no output.
❓ Data Output
Intermediate · 2:00 remaining
Result of 'complete' output mode on streaming aggregation
What is the output of a streaming word count query after processing the first batch when using 'complete' output mode?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, explode
import os, tempfile

spark = SparkSession.builder.appName('Test').getOrCreate()

# Write the sample line to a directory read as a file stream
# ('writeStream' requires a streaming DataFrame, not a static one)
input_dir = tempfile.mkdtemp()
with open(os.path.join(input_dir, 'input.txt'), 'w') as f:
    f.write('apple banana apple')

lines = spark.readStream.text(input_dir)
words = lines.select(explode(split(lines.value, ' ')).alias('word'))
word_counts = words.groupBy('word').count()

query = word_counts.writeStream.outputMode('complete').format('console').start()
query.processAllAvailable()
query.stop()
💡 Hint
Remember that 'complete' mode outputs the full aggregated result table after each batch.
✗ Incorrect
'Complete' mode outputs the entire aggregation result after each batch. Since 'apple' appears twice and 'banana' once, the output shows counts 2 and 1 respectively.
🧠 Conceptual
Advanced · 1:30 remaining
Understanding 'update' output mode in streaming
Which statement correctly describes the 'update' output mode in Apache Spark Structured Streaming?
💡 Hint
Think about how 'update' differs from 'append' and 'complete' modes.
✗ Incorrect
'Update' mode outputs only rows that have changed since the last trigger. It includes new rows and updates to existing rows but does not output the full table like 'complete' mode.
🔧 Debug
Advanced · 2:00 remaining
Error caused by using 'append' mode with aggregation
What error will this code raise when running a streaming aggregation query with output mode set to 'append'?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, explode
import os, tempfile

spark = SparkSession.builder.appName('Test').getOrCreate()

# Write the sample line to a directory read as a file stream
# ('writeStream' requires a streaming DataFrame, not a static one)
input_dir = tempfile.mkdtemp()
with open(os.path.join(input_dir, 'input.txt'), 'w') as f:
    f.write('cat dog cat')

lines = spark.readStream.text(input_dir)
words = lines.select(explode(split(lines.value, ' ')).alias('word'))
word_counts = words.groupBy('word').count()

query = word_counts.writeStream.outputMode('append').format('console').start()
query.processAllAvailable()
query.stop()
💡 Hint
Consider which output modes support aggregation in streaming queries.
✗ Incorrect
Using 'append' mode with a streaming aggregation and no watermark causes an AnalysisException at query start: 'append' can only emit rows that will never change, and an unwatermarked aggregation never finalizes any row.
🚀 Application
Expert · 2:30 remaining
Choosing output mode for a streaming deduplication job
You have a streaming job that removes duplicate records based on a unique key and outputs the latest unique records. Which output mode should you use to ensure only updated unique records are output after each batch?
💡 Hint
Think about which mode outputs only changed rows without the full table.
✗ Incorrect
'Update' mode outputs only the rows that have changed since the last trigger, which is ideal for deduplication where only updated unique records should be output.