Streaming lets us see and analyze data as it arrives. This helps make quick decisions based on fresh information.
Why streaming enables real-time analytics in Apache Spark
spark.readStream.format("source_format").option("option_name", "value").load()
This pattern defines a streaming DataFrame that reads continuously from a source. Note that no data actually flows until you start a query with writeStream and start().
Replace "source_format" with your data source, such as "kafka", "socket", or a file format like "csv", "json", or "parquet" for monitoring a directory of files.
spark.readStream.format("socket").option("host", "localhost").option("port", 9999).load()
spark.readStream.format("kafka").option("kafka.bootstrap.servers", "server:9092").option("subscribe", "topic1").load()
This program reads text data from a network socket as it arrives, splits the text into words, and maintains a running count of each word, printing the updated counts to the console after every micro-batch. To try it locally, you can feed the socket with a simple TCP server such as nc -lk 9999.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StreamingExample").getOrCreate()

# Read streaming data from a socket
lines = spark.readStream.format("socket").option("host", "localhost").option("port", 9999).load()

# Split lines into words
words = lines.selectExpr("explode(split(value, ' ')) as word")

# Count words in real-time
wordCounts = words.groupBy("word").count()

# Start running the query that prints the counts to the console
query = wordCounts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
Streaming data is processed in small chunks called micro-batches.
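The micro-batch idea can be sketched in plain Python, independent of Spark. This is only an illustration of the concept; the micro_batches helper and the batch size here are made up for the example, not part of any Spark API:

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Group an (in principle unbounded) iterator into small batches."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# A stand-in for an endless stream of incoming lines.
incoming = iter(["spark streams data", "data arrives fast", "spark scales"])

running_counts = {}
for batch in micro_batches(incoming, batch_size=2):
    # Each micro-batch is processed as a small, ordinary job,
    # and its results are folded into the running state.
    for line in batch:
        for word in line.split():
            running_counts[word] = running_counts.get(word, 0) + 1

print(running_counts)
```

Spark's engine does essentially this at scale: it repeatedly runs a small batch job over newly arrived data and keeps aggregation state between batches.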
Output modes control what is written on each trigger: "complete" rewrites the full result table, "update" emits only the rows that changed, and "append" emits only newly added rows (for running aggregations, "append" additionally requires a watermark).
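The difference between output modes can be simulated in plain Python. This sketch contrasts "complete" and "update" for a running word count (append is omitted because aggregated rows are never final without a watermark); the batch data and variable names are invented for illustration:

```python
# Two micro-batches of words arriving over time.
batches = [["spark", "spark", "data"], ["data", "fast"]]

totals = {}
for i, batch in enumerate(batches, start=1):
    changed = set()
    for word in batch:
        totals[word] = totals.get(word, 0) + 1
        changed.add(word)
    # "complete" mode: emit the entire result table every trigger.
    complete_output = dict(totals)
    # "update" mode: emit only the rows that changed in this trigger.
    update_output = {w: totals[w] for w in changed}
    print(f"batch {i} complete: {complete_output}")
    print(f"batch {i} update:   {update_output}")
```

On the second batch, "complete" re-emits all three words, while "update" emits only "data" and "fast", the rows that changed.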
A streaming query runs continuously until it is stopped or fails; when no new data is available, it simply waits for the next trigger.
Streaming lets you analyze data as it arrives, not after it is stored.
This helps make fast decisions and see live updates.
Apache Spark makes streaming easy with simple commands to read and process data continuously.