Apache Spark · ~20 mins

Why streaming enables real-time analytics in Apache Spark - Challenge Your Understanding

Challenge - 5 Problems
🎖️ Streaming Analytics Master: answer all five challenges correctly to earn this badge. Test your skills under time pressure!
🧠 Conceptual
intermediate · 2:00 remaining
Why does streaming data support real-time analytics?

Choose the best explanation for why streaming data processing enables real-time analytics.

A. Streaming collects all data first, then processes it in large batches to improve accuracy.
B. Streaming processes data continuously as it arrives, allowing immediate insights without waiting for batch completion.
C. Streaming delays data processing until a fixed time interval to reduce system load.
D. Streaming stores data permanently before any analysis can begin.
Attempts: 2 left
💡 Hint

Think about how data is handled in streaming versus batch processing.
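To make the hint concrete, here is a plain-Python sketch (no Spark required) of the batch-versus-streaming distinction; the function names are illustrative only:

```python
from typing import Iterator, List

def batch_process(events: List[int]) -> List[int]:
    # Batch: wait until the full dataset is collected, then transform it all at once.
    return [e * 2 for e in events]

def stream_process(events: Iterator[int]) -> Iterator[int]:
    # Streaming: transform each event the moment it arrives.
    for e in events:
        yield e * 2

incoming = [1, 2, 3]
batch_result = batch_process(incoming)                # available only after all events arrive
stream_result = list(stream_process(iter(incoming)))  # each item was ready immediately
```

Both produce the same values; the difference is *when* each result becomes available, which is what makes streaming suitable for real-time analytics.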

Predict Output
intermediate · 2:00 remaining
Output of Spark Structured Streaming query

What will be the output of the following Spark Structured Streaming code snippet after processing 3 micro-batches?

Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.appName('streaming').getOrCreate()

# Simulated streaming input
input_data = spark.readStream.format('rate').option('rowsPerSecond', 1).load()

# Simple transformation
transformed = input_data.select(expr('value % 2 as parity'))

query = transformed.writeStream.format('console').outputMode('append').start()

query.processAllAvailable()
query.processAllAvailable()
query.processAllAvailable()
query.stop()
A. An error because 'processAllAvailable()' is not a valid method.
B. No output because the query is not started.
C. Three rows printed with parity values 0, 1, 0 respectively.
D. Infinite output because the stream never stops.
Attempts: 2 left
💡 Hint

Consider what the 'rate' source generates and how the modulo operation works.
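As background for the hint: Spark's 'rate' source emits a monotonically increasing 'value' column starting at 0. The modulo arithmetic itself can be checked in plain Python:

```python
# Simulate the first few values of the 'rate' source (0, 1, 2, ...)
values = list(range(5))

# value % 2 alternates between 0 and 1
parity = [v % 2 for v in values]
```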

Data Output
advanced · 2:30 remaining
Resulting DataFrame after windowed aggregation

Given a streaming DataFrame with timestamps and values, what is the output after applying a 1-minute tumbling window aggregation counting events?

Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col

spark = SparkSession.builder.appName('streaming').getOrCreate()

# Sample static data simulating streaming input
data = [
  ('2024-06-01 12:00:10', 1),
  ('2024-06-01 12:00:40', 1),
  ('2024-06-01 12:01:05', 1),
  ('2024-06-01 12:01:50', 1)
]

schema = 'timestamp STRING, value INT'

df = spark.createDataFrame(data, schema=schema)

# Convert timestamp string to timestamp type
from pyspark.sql.functions import to_timestamp

df = df.withColumn('timestamp', to_timestamp(col('timestamp')))

# Apply window aggregation
windowed = df.groupBy(window(col('timestamp'), '1 minute')).count().orderBy('window')

result = windowed.collect()
A. [Row(window=Row(start=datetime.datetime(2024, 6, 1, 12, 0), end=datetime.datetime(2024, 6, 1, 12, 1)), count=2), Row(window=Row(start=datetime.datetime(2024, 6, 1, 12, 1), end=datetime.datetime(2024, 6, 1, 12, 2)), count=2)]
B. [Row(window=Row(start=datetime.datetime(2024, 6, 1, 12, 0), end=datetime.datetime(2024, 6, 1, 12, 1)), count=4)]
C. [Row(window=Row(start=datetime.datetime(2024, 6, 1, 12, 0), end=datetime.datetime(2024, 6, 1, 12, 1)), count=1), Row(window=Row(start=datetime.datetime(2024, 6, 1, 12, 1), end=datetime.datetime(2024, 6, 1, 12, 2)), count=3)]
D. [Row(window=Row(start=datetime.datetime(2024, 6, 1, 12, 0), end=datetime.datetime(2024, 6, 1, 12, 1)), count=0), Row(window=Row(start=datetime.datetime(2024, 6, 1, 12, 1), end=datetime.datetime(2024, 6, 1, 12, 2)), count=4)]
Attempts: 2 left
💡 Hint

Count how many timestamps fall into each 1-minute window.
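The bucketing a 1-minute tumbling window performs can be sketched in plain Python (using hypothetical sample times, not the ones in the question): each timestamp is truncated down to its window boundary, then events per boundary are counted.

```python
from datetime import datetime
from collections import Counter

def window_start(ts: datetime) -> datetime:
    # For a 1-minute tumbling window, truncate the timestamp to the minute.
    return ts.replace(second=0, microsecond=0)

# Hypothetical sample times, chosen only to illustrate the bucketing
times = [
    datetime(2024, 6, 1, 9, 0, 5),
    datetime(2024, 6, 1, 9, 0, 55),
    datetime(2024, 6, 1, 9, 1, 10),
]

# Two events fall in the 09:00 window, one in the 09:01 window
counts = Counter(window_start(t) for t in times)
```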

🔧 Debug
advanced · 2:30 remaining
Identify the error in this streaming aggregation code

What error will this Spark Structured Streaming code raise?

Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import count

spark = SparkSession.builder.appName('streaming').getOrCreate()

input_stream = spark.readStream.format('socket').option('host', 'localhost').option('port', 9999).load()

agg = input_stream.groupBy('value').agg(count('*').alias('count'))

query = agg.writeStream.format('console').outputMode('complete').start()

query.awaitTermination()
A. No error; code runs and outputs counts correctly.
B. AnalysisException: 'value' column does not exist in input stream.
C. StreamingQueryException due to socket connection failure.
D. TypeError because count('*') is invalid syntax.
Attempts: 2 left
💡 Hint

Check the schema of the input stream from socket source.
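For context: Spark's socket source delivers each incoming line as a single string column named 'value'. Grouping by 'value' and counting therefore tallies identical lines, which can be simulated in plain Python:

```python
from collections import Counter

# Each element stands in for one line read from the socket ('value' column)
lines = ["a", "b", "a"]

# Equivalent of groupBy('value').agg(count('*')) over these lines
counts = Counter(lines)
```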

🚀 Application
expert · 3:00 remaining
Choosing streaming for real-time fraud detection

You want to build a fraud detection system that flags suspicious transactions immediately after they occur. Which streaming feature is most critical to enable this real-time analytics?

A. Using static datasets updated daily for model training.
B. High throughput batch processing to handle large volumes of data at once.
C. Storing all transactions first for offline analysis later.
D. Low latency processing that analyzes each transaction as it arrives without waiting for batches.
Attempts: 2 left
💡 Hint

Think about what 'real-time' means for fraud detection.
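The per-event decision style the hint alludes to can be sketched in plain Python; the rule and threshold here are purely hypothetical:

```python
def flag_if_suspicious(txn: dict, threshold: float = 10_000.0) -> bool:
    # Low latency: decide on each transaction as it arrives,
    # rather than queueing it for the next batch run.
    return txn["amount"] > threshold

# Each dict stands in for one event arriving on the stream
alerts = [t["id"] for t in (
    {"id": 1, "amount": 50.0},
    {"id": 2, "amount": 25_000.0},
) if flag_if_suspicious(t)]
```

Because each transaction is evaluated the moment it arrives, a flagged transaction can trigger an alert immediately instead of waiting for a scheduled batch.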