Challenge - 5 Problems
Output Modes Mastery
Get all challenges correct to earn this badge!
Test your skills under time pressure!
❓ Predict Output
Intermediate · 2:00 remaining
Output mode behavior with streaming aggregation
Given a streaming DataFrame that counts words in a stream, what will be the output after the first batch if the output mode is set to 'append'?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, explode
import os, tempfile

spark = SparkSession.builder.appName('Test').getOrCreate()

# Simulate streaming input by writing the line to a directory read as a
# file stream ('writeStream' requires a streaming DataFrame, not a static one)
input_dir = tempfile.mkdtemp()
with open(os.path.join(input_dir, 'input.txt'), 'w') as f:
    f.write('hello world hello')

lines = spark.readStream.text(input_dir)
words = lines.select(explode(split(lines.value, ' ')).alias('word'))
word_counts = words.groupBy('word').count()

query = word_counts.writeStream.outputMode('append').format('console').start()
query.processAllAvailable()
query.stop()
💡 Hint
Think about which output modes support aggregation in streaming queries.
✗ Incorrect
The 'append' output mode only emits rows that have been added to the result table and will never change. Aggregated counts update existing rows, so they generally require 'complete' or 'update' mode. Without a watermark, Spark rejects this query with an AnalysisException at start; with a watermark, an aggregated row is emitted only once the watermark finalizes it, so the first batch produces no output.
❓ Data Output
Intermediate · 2:00 remaining
Result of 'complete' output mode on streaming aggregation
What is the output of a streaming word count query after processing the first batch when using 'complete' output mode?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, explode
import os, tempfile

spark = SparkSession.builder.appName('Test').getOrCreate()

# Write the sample line to a directory read as a file stream
# ('writeStream' requires a streaming DataFrame, not a static one)
input_dir = tempfile.mkdtemp()
with open(os.path.join(input_dir, 'input.txt'), 'w') as f:
    f.write('apple banana apple')

lines = spark.readStream.text(input_dir)
words = lines.select(explode(split(lines.value, ' ')).alias('word'))
word_counts = words.groupBy('word').count()

query = word_counts.writeStream.outputMode('complete').format('console').start()
query.processAllAvailable()
query.stop()
💡 Hint
Remember that 'complete' mode outputs the full aggregated result table after each batch.
✗ Incorrect
'Complete' mode outputs the entire aggregation result after each batch. Since 'apple' appears twice and 'banana' once, the output shows counts 2 and 1 respectively.
🧠 Conceptual
Advanced · 1:30 remaining
Understanding 'update' output mode in streaming
Which statement correctly describes the 'update' output mode in Apache Spark Structured Streaming?
💡 Hint
Think about how 'update' differs from 'append' and 'complete' modes.
✗ Incorrect
'Update' mode outputs only rows that have changed since the last trigger. It includes new rows and updates to existing rows but does not output the full table like 'complete' mode.
🔧 Debug
Advanced · 2:00 remaining
Error caused by using 'append' mode with aggregation
What error will this code raise when running a streaming aggregation query with output mode set to 'append'?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, explode
import os, tempfile

spark = SparkSession.builder.appName('Test').getOrCreate()

# Write the sample line to a directory read as a file stream
# ('writeStream' requires a streaming DataFrame, not a static one)
input_dir = tempfile.mkdtemp()
with open(os.path.join(input_dir, 'input.txt'), 'w') as f:
    f.write('cat dog cat')

lines = spark.readStream.text(input_dir)
words = lines.select(explode(split(lines.value, ' ')).alias('word'))
word_counts = words.groupBy('word').count()

query = word_counts.writeStream.outputMode('append').format('console').start()
query.processAllAvailable()
query.stop()
💡 Hint
Consider which output modes support aggregation in streaming queries.
✗ Incorrect
Using 'append' mode with a streaming aggregation and no watermark causes an AnalysisException at query start: 'append' can only emit rows that will never change, and an unwatermarked aggregation never finalizes any row.
🚀 Application
Expert · 2:30 remaining
Choosing output mode for a streaming deduplication job
You have a streaming job that removes duplicate records based on a unique key and outputs the latest unique records. Which output mode should you use to ensure only updated unique records are output after each batch?
💡 Hint
Think about which mode outputs only changed rows without the full table.
✗ Incorrect
'Update' mode outputs only the rows that have changed since the last trigger, which is ideal for deduplication where only updated unique records should be output.