
Structured Streaming basics in Apache Spark

Introduction

Structured Streaming is Spark's engine for processing live data as it arrives. Built on the Spark SQL engine, it makes working with real-time data simple and reliable.

When you want to analyze live sensor data from machines.
When you need to monitor social media feeds in real-time.
When you want to track website clicks as they happen.
When you want to update dashboards with fresh data continuously.
When you want to detect fraud instantly from transaction streams.
Syntax
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StreamingExample").getOrCreate()

# Read streaming data
streamingDF = spark.readStream.format("source_format").option("option_name", "value").load()

# Process data (example: select columns)
processedDF = streamingDF.select("column1", "column2")

# Write streaming output
query = processedDF.writeStream.format("console").start()

query.awaitTermination()

Use readStream to read live data.

Use writeStream to output results continuously.
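The readStream → transform → writeStream flow amounts to the engine applying the same query to each new batch of input as it arrives. A plain-Python sketch of that idea (purely conceptual, not Spark code; the names here are made up for illustration):

```python
# Conceptual micro-batch loop: the same transformation is applied to each
# batch of newly arrived records, and results are emitted continuously.
def transform(batch):
    """The 'query': select one field per record and uppercase it."""
    return [record["value"].upper() for record in batch]

incoming = [
    [{"value": "hello"}, {"value": "spark"}],  # batch 1
    [{"value": "stream"}],                     # batch 2
]

output = []
for batch in incoming:  # in Spark, this loop is driven by the engine itself
    output.extend(transform(batch))

print(output)  # ['HELLO', 'SPARK', 'STREAM']
```

In real Structured Streaming you never write this loop yourself; you declare the transformation once and the engine runs it over every incoming batch.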

Examples
Read CSV files as they arrive in a folder. Streaming file sources require an explicit schema.
streamingDF = spark.readStream.format("csv").option("header", "true").schema("column1 STRING, column2 STRING").load("/path/to/folder")
Print streaming data to the console for quick checks.
query = streamingDF.writeStream.format("console").start()
Store streaming results in memory for SQL queries.
query = streamingDF.writeStream.format("memory").queryName("tableName").start()
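With the memory sink, each micro-batch is appended to an in-memory table that you can then query, for example with spark.sql("SELECT * FROM tableName"). A toy plain-Python picture of that accumulation (illustrative only, not Spark internals):

```python
# Toy model of the memory sink: each micro-batch is appended to a named
# in-memory table that later queries read from.
memory_tables = {}

def write_batch(query_name, rows):
    """Append one micro-batch of rows to the named in-memory table."""
    memory_tables.setdefault(query_name, []).extend(rows)

write_batch("tableName", [("a", 1), ("b", 2)])  # batch 1
write_batch("tableName", [("c", 3)])            # batch 2

# A query against the table sees every row received so far
print(memory_tables["tableName"])  # [('a', 1), ('b', 2), ('c', 3)]
```

The memory sink is intended for testing and debugging, since the whole table lives in driver memory.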
Sample Program

This program reads live text data from a network socket, converts all text to uppercase, and prints it to the console continuously.

from pyspark.sql import SparkSession
from pyspark.sql.functions import upper

spark = SparkSession.builder.appName("StructuredStreamingExample").getOrCreate()

# Read streaming data from a socket (for example, localhost:9999)
streamingDF = spark.readStream.format("socket").option("host", "localhost").option("port", 9999).load()

# Convert the text column to uppercase
processedDF = streamingDF.select(upper(streamingDF.value).alias("upper_value"))

# Write the results to the console
query = processedDF.writeStream.format("console").start()

query.awaitTermination()
Important Notes

Structured Streaming treats live data as an unending table.
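To picture the unending-table model, think of each new record as a row appended to the input table, with the query's result table updated after every batch. A toy illustration in plain Python (a running word count, chosen here just as an example):

```python
from collections import Counter

input_table = []         # the unbounded input table: rows keep arriving
word_counts = Counter()  # the continuously updated result table

def append_rows(rows):
    """New data arrives: append rows and update the result incrementally."""
    input_table.extend(rows)
    for row in rows:
        word_counts.update(row.split())

append_rows(["apache spark", "spark streaming"])  # batch 1
append_rows(["spark"])                            # batch 2

print(word_counts["spark"])  # 3 — the result reflects all rows so far
```

Spark maintains the result table incrementally in exactly this spirit, rather than rescanning all past data on every batch.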

Always call awaitTermination() to keep the stream running.

Use checkpointing (the checkpointLocation option on writeStream) to save progress and recover from failures without data loss.
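In practice this means passing .option("checkpointLocation", "/path/to/checkpoint") on writeStream; Spark persists offsets and state there so a restarted query resumes where it left off. The core idea in miniature (plain Python, illustrative only, nothing like Spark's actual checkpoint format):

```python
import json
import os
import tempfile

checkpoint = os.path.join(tempfile.mkdtemp(), "offsets.json")

def load_offset():
    """Resume point: how many records were processed before the last stop."""
    if os.path.exists(checkpoint):
        with open(checkpoint) as f:
            return json.load(f)["offset"]
    return 0

def process(records):
    """Process only records after the saved offset, then persist progress."""
    start = load_offset()
    for record in records[start:]:
        pass  # handle each new record here
    with open(checkpoint, "w") as f:
        json.dump({"offset": len(records)}, f)

records = ["r1", "r2", "r3"]
process(records)
print(load_offset())  # 3 — a restart would skip already-processed records
```

Because progress lives outside the running process, a crashed or restarted query picks up from the saved offset instead of reprocessing (or losing) data.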

Summary

Structured Streaming lets you process live data easily.

Use readStream to get data and writeStream to output results.

It works like running SQL queries continuously on new data.