Choose the best explanation for why streaming data processing enables real-time analytics.
Think about how data is handled in streaming versus batch processing.
Streaming processes data continuously as it arrives, enabling analytics to happen immediately. This contrasts with batch processing, which waits for all data before starting analysis.
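The contrast can be sketched in plain Python (not Spark), with a simple list of events standing in for a data source:

```python
events = [3, 1, 4, 1, 5]

# Batch: wait until all data has arrived, then analyze everything at once.
def batch_total(all_events):
    return sum(all_events)

# Streaming: update the result incrementally as each event arrives,
# so an answer is available after every event, not only at the end.
def streaming_totals(event_iter):
    running = 0
    for e in event_iter:
        running += e
        yield running  # a fresh result per event

print(batch_total(events))             # one result after all data: 14
print(list(streaming_totals(events)))  # a result per event: [3, 4, 8, 9, 14]
```

The streaming version produces an up-to-date answer after every event, which is what makes immediate analytics possible.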
What will be the output of the following Spark Structured Streaming code snippet after processing 3 micro-batches?
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.appName('streaming').getOrCreate()

# Simulated streaming input: the rate source emits rows with an
# increasing 'value' column at the configured rate
input_data = spark.readStream.format('rate').option('rowsPerSecond', 1).load()

# Simple transformation
transformed = input_data.select(expr('value % 2 as parity'))

query = transformed.writeStream.format('console').outputMode('append').start()
query.processAllAvailable()
query.processAllAvailable()
query.processAllAvailable()
query.stop()
```
Consider what the 'rate' source generates and how the modulo operation works.
The 'rate' source generates rows with a monotonically increasing 'value' column starting at 0, so value % 2 alternates between parity 0 and 1. The printed parity values therefore follow the pattern 0, 1, 0, 1, ... Note, however, that each processAllAvailable() call processes whatever rows have arrived so far, so the exact number of rows per micro-batch depends on timing; three rows with parity 0, 1, 0 is the expected output only if exactly one row lands in each micro-batch.
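The alternation can be checked with plain Python, assuming the rate source starts counting at value 0:

```python
# Values as the rate source would generate them: 0, 1, 2, ...
values = list(range(6))

# value % 2 maps even values to parity 0 and odd values to parity 1
parities = [v % 2 for v in values]
print(parities)  # [0, 1, 0, 1, 0, 1]
```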
Given a streaming DataFrame with timestamps and values, what is the output after applying a 1-minute tumbling window aggregation counting events?
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col, to_timestamp

spark = SparkSession.builder.appName('streaming').getOrCreate()

# Sample static data simulating streaming input
data = [
    ('2024-06-01 12:00:10', 1),
    ('2024-06-01 12:00:40', 1),
    ('2024-06-01 12:01:05', 1),
    ('2024-06-01 12:01:50', 1),
]
schema = 'timestamp STRING, value INT'
df = spark.createDataFrame(data, schema=schema)

# Convert timestamp string to timestamp type
df = df.withColumn('timestamp', to_timestamp(col('timestamp')))

# Apply a 1-minute tumbling window aggregation
windowed = df.groupBy(window(col('timestamp'), '1 minute')).count().orderBy('window')
result = windowed.collect()
```
Count how many timestamps fall into each 1-minute window.
The first window covers 12:00:00 to 12:01:00 and includes two timestamps (12:00:10 and 12:00:40). The second window covers 12:01:00 to 12:02:00 and includes two timestamps (12:01:05 and 12:01:50). The aggregation therefore returns a count of 2 for each window.
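The window assignment can be reproduced without Spark by truncating each timestamp to its minute; a 1-minute tumbling window is equivalent to grouping on the truncated value (a minimal sketch of the same logic):

```python
from collections import Counter
from datetime import datetime

timestamps = [
    '2024-06-01 12:00:10',
    '2024-06-01 12:00:40',
    '2024-06-01 12:01:05',
    '2024-06-01 12:01:50',
]

# A 1-minute tumbling window starts at the timestamp truncated to the minute
def window_start(ts):
    return datetime.strptime(ts, '%Y-%m-%d %H:%M:%S').replace(second=0)

counts = Counter(window_start(ts) for ts in timestamps)
for start in sorted(counts):
    print(start, counts[start])
# 2024-06-01 12:00:00 2
# 2024-06-01 12:01:00 2
```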
What error will this Spark Structured Streaming code raise?
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import count

spark = SparkSession.builder.appName('streaming').getOrCreate()

# The socket source yields a single string column named 'value'
input_stream = (spark.readStream.format('socket')
                .option('host', 'localhost')
                .option('port', 9999)
                .load())

agg = input_stream.groupBy('value').agg(count('*').alias('count'))

query = agg.writeStream.format('console').outputMode('complete').start()
query.awaitTermination()
```
Check the schema of the input stream from socket source.
The socket source creates a DataFrame with a single column named 'value' of type string. Grouping by 'value' and counting occurrences is a valid aggregation, and the 'complete' output mode is supported for aggregated streaming queries, so the code raises no error from Spark itself. (At runtime it still requires a socket server listening on localhost:9999; otherwise the query fails with a connection error.)
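The groupBy/count logic itself can be illustrated with plain Python; the lines below are hypothetical input standing in for rows of the socket source's 'value' column:

```python
from collections import Counter

# Hypothetical lines, each one row in the socket source's 'value' column
lines = ['error', 'ok', 'error', 'ok', 'error']

# groupBy('value').agg(count('*')) is a per-key occurrence count
counts = Counter(lines)
print(counts)  # Counter({'error': 3, 'ok': 2})
```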
You want to build a fraud detection system that flags suspicious transactions immediately after they occur. Which streaming feature is most critical to enable this real-time analytics?
Think about what 'real-time' means for fraud detection.
Real-time fraud detection requires analyzing each transaction immediately as it arrives. Low-latency stream processing enables this by avoiding the delays inherent in waiting for a batch of data to accumulate before analysis begins.