Apache Spark · ~20 mins

Output modes (append, complete, update) in Apache Spark - Practice Problems & Coding Challenges

Challenge - 5 Problems
🎖️
Output Modes Mastery
Get all challenges correct to earn this badge!
Predict Output · intermediate
Output mode behavior with streaming aggregation
Given a streaming DataFrame that counts words in a stream, what will be the output after the first batch if the output mode is set to 'append'?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, explode

spark = SparkSession.builder.appName('Test').getOrCreate()

# Streaming input: 'input_dir' holds one text file containing
# the single line "hello world hello".
# (writeStream requires a streaming DataFrame, so we use readStream here.)
df = spark.readStream.text('input_dir').withColumnRenamed('value', 'line')

words = df.select(explode(split(df.line, ' ')).alias('word'))

word_counts = words.groupBy('word').count()

query = word_counts.writeStream.outputMode('append').format('console').start()

query.processAllAvailable()
query.stop()
A.
[word=hello, count=2]
[word=world, count=1]
B. No output, because 'append' mode does not support aggregation queries
C.
[word=hello, count=1]
[word=world, count=1]
D.
[word=hello, count=2]
[word=world, count=1]
[word=hello, count=3]
💡 Hint
Think about which output modes support aggregation in streaming queries.
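The hint can be made concrete without running Spark. Below is a minimal pure-Python sketch (an illustration of the semantics, not Spark code): in 'append' mode a row may be written to the sink only once, but a running aggregate row keeps changing across batches, so without a watermark no aggregate row is ever final — which is why Spark rejects such a query when it starts.

```python
# Illustration only: why 'append' cannot emit an unbounded running aggregate.
# Each micro-batch updates the running word counts; the row for 'hello'
# changes between batches, so it can never be appended as a final row.
from collections import Counter

counts = Counter()

counts.update("hello world hello".split())
after_batch1 = dict(counts)   # running counts after batch 1

counts.update(["hello"])
after_batch2 = dict(counts)   # running counts after batch 2

# The 'hello' row changed between triggers, so appending it after batch 1
# would have emitted a stale count:
assert after_batch1['hello'] != after_batch2['hello']
```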
Data Output · intermediate
Result of 'complete' output mode on streaming aggregation
What is the output of a streaming word count query after processing the first batch when using 'complete' output mode?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, explode

spark = SparkSession.builder.appName('Test').getOrCreate()

# Streaming input: 'input_dir' holds one text file containing
# the single line "apple banana apple".
df = spark.readStream.text('input_dir').withColumnRenamed('value', 'line')

words = df.select(explode(split(df.line, ' ')).alias('word'))

word_counts = words.groupBy('word').count()

query = word_counts.writeStream.outputMode('complete').format('console').start()

query.processAllAvailable()
query.stop()
A.
[word=apple, count=2]
[word=banana, count=1]
B.
[word=apple, count=1]
[word=banana, count=1]
C.
[word=apple, count=2]
[word=banana, count=2]
D. No output because 'complete' mode is not supported for aggregation
💡 Hint
Remember that 'complete' mode outputs the full aggregated result table after each batch.
🧠 Conceptual · advanced
Understanding 'update' output mode in streaming
Which statement correctly describes the 'update' output mode in Apache Spark Structured Streaming?
A. 'Update' mode outputs the entire result table after every trigger, regardless of changes.
B. 'Update' mode outputs only new rows appended to the result table, ignoring updates to existing rows.
C. 'Update' mode outputs only the rows that have changed since the last trigger, including new and updated rows, but not the entire result table.
D. 'Update' mode is not supported for streaming aggregations.
💡 Hint
Think about how 'update' differs from 'append' and 'complete' modes.
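To contrast the three modes side by side, here is a hedged pure-Python sketch (a simulation of the semantics, not Spark's API) of what each mode would emit for a running word count over two micro-batches: 'complete' re-emits the whole result table, 'update' emits only the rows whose counts changed in that trigger, and 'append' emits nothing because no aggregate row is ever final.

```python
from collections import Counter

def emit(mode, counts, changed):
    """Simulate what one trigger writes to the sink under each output mode."""
    if mode == 'complete':
        return dict(counts)                      # entire result table
    if mode == 'update':
        return {w: counts[w] for w in changed}   # only rows changed this trigger
    if mode == 'append':
        return {}                                # no aggregate row is ever final
    raise ValueError(mode)

counts = Counter()
batches = ["apple banana apple".split(), ["banana"]]

for batch in batches:
    changed = set(batch)
    counts.update(batch)
    complete_out = emit('complete', counts, changed)
    update_out = emit('update', counts, changed)

# After the second batch:
assert complete_out == {'apple': 2, 'banana': 2}  # full table re-emitted
assert update_out == {'banana': 2}                # only the changed row
```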
🔧 Debug · advanced
Error caused by using 'append' mode with aggregation
What error will this code raise when running a streaming aggregation query with output mode set to 'append'?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, explode

spark = SparkSession.builder.appName('Test').getOrCreate()

# Streaming input: 'input_dir' holds one text file containing
# the single line "cat dog cat".
df = spark.readStream.text('input_dir').withColumnRenamed('value', 'line')

words = df.select(explode(split(df.line, ' ')).alias('word'))

word_counts = words.groupBy('word').count()

query = word_counts.writeStream.outputMode('append').format('console').start()

query.processAllAvailable()
query.stop()
A. org.apache.spark.sql.AnalysisException: Append output mode not supported with aggregations
B. No error, outputs counts correctly
C. SyntaxError: invalid syntax in outputMode parameter
D. RuntimeError: Streaming source not found
💡 Hint
Consider which output modes support aggregation in streaming queries.
🚀 Application · expert
Choosing output mode for a streaming deduplication job
You have a streaming job that removes duplicate records based on a unique key and outputs the latest unique records. Which output mode should you use to ensure only updated unique records are output after each batch?
A. append
B. complete
C. overwrite
D. update
💡 Hint
Think about which mode outputs only changed rows without the full table.
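The deduplication scenario can also be sketched in pure Python (illustration only, not Spark's stateful implementation): the job keeps per-key state, and under 'update' semantics each trigger emits only the keys whose latest record changed in that batch — not the full table.

```python
# Illustration only: per-key state for a streaming "latest record per key"
# job. Under 'update' semantics each trigger emits only the keys whose
# stored value changed in that micro-batch.
state = {}

def process_batch(batch):
    changed = {}
    for key, value in batch:
        if state.get(key) != value:   # new key, or the latest value changed
            state[key] = value
            changed[key] = value      # emitted this trigger
    return changed

out1 = process_batch([("u1", "a"), ("u2", "b"), ("u1", "a")])  # duplicate u1 dropped
out2 = process_batch([("u2", "b"), ("u3", "c")])               # u2 unchanged; only u3 emitted

assert out1 == {"u1": "a", "u2": "b"}
assert out2 == {"u3": "c"}
```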