Lambda architecture (batch + streaming) in Hadoop

Introduction

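Lambda architecture processes large datasets by running two paths side by side: a fast real-time path for new data and a slower batch path over the full history. Queries combine both, so results are quick to arrive and are corrected to accurate values once the batch path catches up.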

Use Lambda architecture:
When you want to analyze data as it arrives and also keep a full history.
When you need quick insights but also want to correct errors later.
When your data is too big to process all at once.
When you want to combine live updates with detailed reports.
When you want a system that is fault-tolerant and scalable.
Syntax
Lambda Architecture has three layers:

1. Batch Layer:
   - Stores all raw data.
   - Processes data in large batches.
   - Creates a master dataset and batch views.

2. Speed Layer (Streaming Layer):
   - Processes data in real-time.
   - Handles recent data quickly.
   - Creates real-time views.

3. Serving Layer:
   - Merges batch views and real-time views.
   - Answers queries combining both data sources.

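The batch layer recomputes its views from the full dataset, so its results are complete but lag behind by the length of a batch run.

The speed layer covers only the data that has arrived since the last batch run, so its results are fresh but may be approximate until the batch layer catches up.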
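
The three layers above can be sketched in plain Python. This is a minimal illustration, not a Hadoop implementation: the event lists and the function names `build_batch_view`, `update_realtime_view`, and `query` are hypothetical and exist only for this sketch.

```python
from collections import Counter

# Hypothetical sample data: each event is just its event type.
master_dataset = ["click", "view", "click"]   # covered by the last batch run
recent_events = ["view", "purchase"]          # arrived after the last batch run


def build_batch_view(events):
    """Batch layer: recompute counts from scratch over the full master dataset."""
    return Counter(events)


def update_realtime_view(view, event):
    """Speed layer: update counts incrementally, one event at a time."""
    view[event] += 1
    return view


def query(batch_view, realtime_view, event_type):
    """Serving layer: answer a query by adding the two views together."""
    return batch_view.get(event_type, 0) + realtime_view.get(event_type, 0)


batch_view = build_batch_view(master_dataset)

realtime_view = Counter()
for event in recent_events:
    update_realtime_view(realtime_view, event)

for event_type in ("click", "view", "purchase"):
    print(event_type, query(batch_view, realtime_view, event_type))
# click 2
# view 2
# purchase 1
```

The batch view is rebuilt in full each time, while the real-time view only ever sees events that arrived after the last rebuild. That split is what lets the serving layer add the two counts without double counting.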

Examples
This example shows how tools from the Hadoop ecosystem fit into the Lambda architecture.
Batch Layer: Use Hadoop MapReduce to process all data every hour.
Speed Layer: Use Apache Storm to process data as it arrives.
Serving Layer: Store batch and speed views in HBase and combine them at query time.
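
With an hourly batch job, the serving layer needs a rule for which events belong to which view. The sketch below assumes one simple rule: the batch view covers everything before the start of the current hour, and the speed view covers everything after. The sample events, the cutoff rule, and the names `last_batch_cutoff`, `count_by_type`, and `serve` are illustrative assumptions; no MapReduce, Storm, or HBase API is used.

```python
from datetime import datetime

# Hypothetical events as (timestamp, event_type) pairs.
events = [
    (datetime(2024, 1, 1, 9, 15), "click"),
    (datetime(2024, 1, 1, 9, 50), "view"),
    (datetime(2024, 1, 1, 10, 5), "click"),
    (datetime(2024, 1, 1, 10, 20), "click"),
]


def last_batch_cutoff(now):
    """Assumption: the hourly batch job covers all data before the current hour."""
    return now.replace(minute=0, second=0, microsecond=0)


def count_by_type(rows):
    """Count events per event type."""
    counts = {}
    for _, event_type in rows:
        counts[event_type] = counts.get(event_type, 0) + 1
    return counts


now = datetime(2024, 1, 1, 10, 30)
cutoff = last_batch_cutoff(now)

# Batch view: everything before the cutoff. Speed view: everything from it onward.
batch_view = count_by_type([e for e in events if e[0] < cutoff])
realtime_view = count_by_type([e for e in events if e[0] >= cutoff])


def serve(event_type):
    """Serving layer: combine both views at query time."""
    return batch_view.get(event_type, 0) + realtime_view.get(event_type, 0)


print(serve("click"))  # 3
print(serve("view"))   # 1
```

Because every event falls on exactly one side of the cutoff, the two views never overlap, and each new batch run moves the cutoff forward so the speed view can discard what the batch view now covers.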
This example uses Kafka and Spark Streaming for real-time data.
Batch Layer: Store raw logs in HDFS.
Speed Layer: Stream logs with Kafka and process with Spark Streaming.
Serving Layer: Query combined results with Apache Hive.
Sample Program

This code sketches the batch and speed layers of a Lambda architecture using Spark on Hadoop. Batch data is processed once, while streaming data is processed continuously as new files arrive. Both results are printed to the console; the serving layer is not implemented here.

from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col

# Initialize Spark session
spark = SparkSession.builder.appName('LambdaExample').getOrCreate()

# Batch Layer: Read historical data from HDFS
batch_data = spark.read.json('hdfs://path/to/historical/data')

# Process batch data: count events per event type
batch_counts = batch_data.groupBy('eventType').count()

# Speed Layer: Simulate streaming data by watching a directory for new JSON files
streaming_data = spark.readStream.schema(batch_data.schema).json('hdfs://path/to/streaming/data')

# Process streaming data: count events per type in 1-minute windows
# (the cast is a no-op if 'timestamp' was already read as a timestamp)
stream_counts = streaming_data.groupBy(
    window(col('timestamp').cast('timestamp'), '1 minute'), 'eventType'
).count()

# Start streaming query to console (for demo)
query = stream_counts.writeStream.outputMode('complete').format('console').start()

# Show batch results
batch_counts.show()

# Wait up to 10 seconds for the demo (in real use, streaming runs continuously)
query.awaitTermination(10)

# Stop the streaming query before shutting down Spark
query.stop()
spark.stop()
Important Notes

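The batch layer is slow but accurate, because it recomputes from the complete dataset.

The speed layer is fast, but its results may be approximate until the next batch run replaces them.

The serving layer combines both, so queries get the full history plus the latest updates.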
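
One way the speed layer can end up approximate is duplicate delivery: if the stream delivers an event twice, a fast counter that does not deduplicate will overcount. The sketch below assumes that scenario, which is only one possible source of inaccuracy and is not taken from the text above. The event IDs and the function names `speed_count` and `batch_count` are hypothetical.

```python
# Hypothetical events as (event_id, event_type); "e2" was delivered twice.
stream = [("e1", "click"), ("e2", "click"), ("e2", "click"), ("e3", "view")]


def speed_count(events):
    """Speed layer: count every delivery as it arrives, without deduplication."""
    counts = {}
    for _, event_type in events:
        counts[event_type] = counts.get(event_type, 0) + 1
    return counts


def batch_count(events):
    """Batch layer: deduplicate by event id over the full dataset, then count."""
    unique = {event_id: event_type for event_id, event_type in events}
    counts = {}
    for event_type in unique.values():
        counts[event_type] = counts.get(event_type, 0) + 1
    return counts


realtime_view = speed_count(stream)
batch_view = batch_count(stream)

print(realtime_view)  # {'click': 3, 'view': 1}
print(batch_view)     # {'click': 2, 'view': 1}
```

Until the batch run finishes, queries see the overcounted value from the real-time view; afterwards the batch view replaces it for that period, which is how errors get corrected later.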

Summary

Lambda architecture mixes batch and streaming to handle big data.

Batch layer processes all data; speed layer processes new data fast.

Serving layer merges results for queries.