Apache Spark · Concept · Beginner · 4 min read

What is Micro Batch in Spark: Explanation and Example

In Apache Spark, micro-batching is a technique used in Structured Streaming where incoming data is grouped into small batches and processed at regular, fixed time intervals. This lets Spark handle streaming data efficiently by treating it as a series of mini batch jobs.
⚙️

How It Works

Imagine you are watching a video, but instead of seeing it all at once, you get small clips every few seconds. Micro batching in Spark works similarly by collecting streaming data into small chunks or batches over short time windows. Spark then processes each batch as a mini job.

This approach balances real-time processing and system efficiency. Instead of processing each record one by one, Spark groups data into micro batches, which reduces overhead and improves throughput. It’s like delivering mail in small bundles rather than one letter at a time.
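The grouping step can be sketched in a few lines of plain Python. This is a toy simulation of time-windowed batching, not Spark itself; the function and parameter names are illustrative:

```python
def micro_batch(records, interval):
    """Toy micro-batching: group (timestamp, value) records into
    batches by fixed time windows of `interval` seconds, the way a
    micro-batch trigger carves a stream into mini jobs."""
    batches = {}
    for ts, value in records:
        window = int(ts // interval)  # which micro batch this record falls into
        batches.setdefault(window, []).append(value)
    # Return batches in arrival order; each inner list is one "mini job"
    return [batches[w] for w in sorted(batches)]

# Records arriving over ~13 seconds, batched every 5 seconds:
events = [(0.5, "a"), (1.2, "b"), (4.9, "c"), (5.1, "d"), (12.0, "e")]
print(micro_batch(events, 5))  # [['a', 'b', 'c'], ['d'], ['e']]
```

Notice that the batches are not fixed in size: how many records land in each one depends on how fast data arrives, while the interval stays constant.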

💻

Example

This example shows how to create a simple Spark Structured Streaming job that reads data from a socket and processes it in micro batches every 5 seconds.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MicroBatchExample").getOrCreate()

# Read streaming data from a socket (e.g. started with: nc -lk 9999)
lines = spark.readStream.format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

# Split each line into words
words = lines.selectExpr("explode(split(value, ' ')) as word")

# Count words across all data seen so far
wordCounts = words.groupBy("word").count()

# Start the streaming query; the trigger interval sets the micro-batch duration
query = wordCounts.writeStream \
    .outputMode("complete") \
    .format("console") \
    .trigger(processingTime='5 seconds') \
    .start()

query.awaitTermination()
```
Output

```
-------------------------------------------
Batch: 0
-------------------------------------------
+----+-----+
|word|count|
+----+-----+
| foo|    3|
| bar|    2|
+----+-----+

-------------------------------------------
Batch: 1
-------------------------------------------
+----+-----+
|word|count|
+----+-----+
| foo|    5|
| bar|    2|
| baz|    1|
+----+-----+
```
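Because the query uses "complete" output mode, each batch prints the full word count over everything seen so far, which is why the counts grow between batches. The cumulative behavior can be illustrated in plain Python (a toy stand-in for the console sink, not Spark itself):

```python
from collections import Counter

def complete_mode_counts(batches):
    """Yield the full cumulative word count after each micro batch,
    mimicking Structured Streaming's 'complete' output mode, where
    every trigger emits the entire result table so far."""
    totals = Counter()
    for batch in batches:
        for line in batch:
            totals.update(line.split())
        yield dict(totals)

# Two micro batches of raw lines, matching the counts shown above:
batches = [["foo foo bar", "foo bar"], ["foo foo baz"]]
for i, counts in enumerate(complete_mode_counts(batches)):
    print(f"Batch {i}: {counts}")
# Batch 0: {'foo': 3, 'bar': 2}
# Batch 1: {'foo': 5, 'bar': 2, 'baz': 1}
```

With "update" or "append" output modes, Spark would instead emit only the rows that changed or were added in that batch.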
🎯

When to Use

Micro batching is ideal when you need near real-time processing but can tolerate small delays (seconds). It works well for applications like log processing, monitoring, and alerting where data arrives continuously but can be processed in small groups.

Use micro batching when you want a simple, fault-tolerant streaming solution that leverages Spark’s batch engine. It is less complex than pure event-by-event streaming and offers good performance for many real-world streaming tasks.

Key Points

  • Micro-batching processes streaming data at small, fixed intervals.
  • It balances latency and throughput for efficient streaming.
  • It is implemented in Spark Structured Streaming using triggers.
  • Good for near real-time applications that can tolerate small delays.
  • Simpler than Spark's experimental continuous processing mode, and supports exactly-once fault-tolerance guarantees.

Key Takeaways

  • Micro-batching processes streaming data in small, fixed time intervals for efficiency.
  • In Spark Structured Streaming, the trigger interval controls how often each micro batch runs.
  • Ideal for near real-time applications that can tolerate small delays.
  • Simplifies streaming by leveraging Spark's batch processing engine.
  • Balances system performance and latency for continuous data streams.