What is Micro Batch in Spark: Explanation and Example
Micro batching is a technique used in Spark Structured Streaming where data is processed in small, fixed-size batches at regular intervals. This lets Spark handle streaming data efficiently by treating it as a series of mini batch jobs.

How It Works
Imagine you are watching a video, but instead of seeing it all at once, you get small clips every few seconds. Micro batching in Spark works similarly by collecting streaming data into small chunks or batches over short time windows. Spark then processes each batch as a mini job.
This approach balances real-time processing and system efficiency. Instead of processing each record one by one, Spark groups data into micro batches, which reduces overhead and improves throughput. It’s like delivering mail in small bundles rather than one letter at a time.
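The grouping idea above can be sketched in plain Python, independent of Spark. This is a minimal illustration, not Spark's implementation: the hypothetical `micro_batch` helper cuts batches by record count, whereas Spark cuts them by a wall-clock trigger interval.

```python
def process_batch(batch):
    # Stand-in for a Spark "mini job": here we just count records in the batch.
    return len(batch)

def micro_batch(stream, interval_records):
    """Group an incoming record stream into fixed-size micro batches.

    `interval_records` stands in for a time-based trigger: real Spark
    cuts batches by elapsed time, not record counts.
    """
    batch = []
    results = []
    for record in stream:
        batch.append(record)
        if len(batch) == interval_records:
            results.append(process_batch(batch))  # one mini job per batch
            batch = []
    if batch:  # flush the final partial batch
        results.append(process_batch(batch))
    return results

print(micro_batch(["a", "b", "c", "d", "e"], interval_records=2))  # → [2, 2, 1]
```

The key point is that `process_batch` runs once per bundle, not once per record, which is where the overhead savings come from.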
Example
This example shows how to create a simple Spark Structured Streaming job that reads data from a socket and processes it in micro batches every 5 seconds.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MicroBatchExample").getOrCreate()

# Read streaming data from a socket
lines = spark.readStream.format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

# Split lines into words
words = lines.selectExpr("explode(split(value, ' ')) as word")

# Count words in each micro batch
wordCounts = words.groupBy("word").count()

# Start streaming query with a trigger interval of 5 seconds (micro batch duration)
query = wordCounts.writeStream \
    .outputMode("complete") \
    .format("console") \
    .trigger(processingTime='5 seconds') \
    .start()

query.awaitTermination()
When to Use
Micro batching is ideal when you need near real-time processing but can tolerate small delays (seconds). It works well for applications like log processing, monitoring, and alerting where data arrives continuously but can be processed in small groups.
Use micro batching when you want a simple, fault-tolerant streaming solution that leverages Spark’s batch engine. It is less complex than pure event-by-event streaming and offers good performance for many real-world streaming tasks.
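The fault-tolerance point can be made concrete with a toy sketch, again in plain Python rather than Spark. Because work is committed at batch boundaries, recovery only needs to record which batch finished last. The `run_with_checkpoint` helper and its JSON checkpoint file are hypothetical stand-ins for the checkpoint directory Structured Streaming maintains.

```python
import json
import os
import tempfile

def run_with_checkpoint(batches, checkpoint_path, fail_after=None):
    """Process numbered micro batches, recording progress after each one.

    On restart, the checkpoint is reread and completed batches are skipped,
    mirroring (in miniature) how a micro-batch engine recovers from the
    last committed batch boundary.
    """
    done = -1
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)["last_batch"]
    processed = []
    for i, batch in enumerate(batches):
        if i <= done:
            continue  # already committed before the failure
        if fail_after is not None and i == fail_after:
            raise RuntimeError("simulated crash")
        processed.append(sum(batch))  # the "mini job" for this batch
        with open(checkpoint_path, "w") as f:
            json.dump({"last_batch": i}, f)  # commit the batch boundary
    return processed

path = os.path.join(tempfile.mkdtemp(), "offsets.json")
batches = [[1, 2], [3, 4], [5, 6]]
try:
    run_with_checkpoint(batches, path, fail_after=2)
except RuntimeError:
    pass
# Restart: batches 0 and 1 are skipped, only batch 2 reruns.
print(run_with_checkpoint(batches, path))  # → [11]
```

Continuous event-by-event engines must track progress per record, which is harder to make exactly-once; committing at batch boundaries is what keeps the micro-batch model simple.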
Key Points
- Micro batch processes streaming data in small fixed intervals.
- It balances latency and throughput for efficient streaming.
- Implemented in Spark Structured Streaming using triggers.
- Good for near real-time applications with small delay tolerance.
- Simpler and more mature than Spark's experimental continuous processing mode, with fault tolerance via checkpointing at batch boundaries.