Lambda architecture (batch + streaming) in Hadoop - Step-by-Step Execution

Concept Flow - Lambda architecture (batch + streaming)
Raw Data Input → Batch Layer → Batch Views → Serving Layer
Raw Data Input → Speed Layer → Real-time Views → Serving Layer
Serving Layer → User Queries
Data flows from raw input into two paths: batch for large-scale processing and speed for real-time updates. Both results combine in the serving layer to answer user queries.
Execution Sample
1. Collect raw data continuously
2. Batch layer processes data in large chunks
3. Speed layer processes data in real-time
4. Serving layer merges batch and speed views
5. User queries get combined results
This shows how data moves through batch and speed layers, then merges for user queries.
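The five steps above can be sketched in a few lines of Python. This is a toy, in-memory illustration rather than Hadoop code: the `LambdaPipeline` class, its method names, and the per-user event counts are all hypothetical, and the batch run is assumed to finish instantly, which a real batch job does not.

```python
from collections import Counter


class LambdaPipeline:
    """Toy in-memory stand-in for the three layers (all names are hypothetical)."""

    def __init__(self):
        self.master = []             # append-only raw data
        self.batch_view = Counter()  # recomputed from scratch by the batch layer
        self.speed_view = Counter()  # updated per event by the speed layer

    def ingest(self, user):
        # Steps 1 and 3: store the raw event, then update the real-time view.
        self.master.append(user)
        self.speed_view[user] += 1

    def run_batch(self):
        # Step 2: recompute the batch view over ALL raw data, then discard
        # the speed view because the batch view now covers those events.
        self.batch_view = Counter(self.master)
        self.speed_view = Counter()

    def query(self, user):
        # Steps 4 and 5: merge both views at query time.
        return self.batch_view[user] + self.speed_view[user]


pipeline = LambdaPipeline()
for user in ["alice", "bob", "alice"]:
    pipeline.ingest(user)
pipeline.run_batch()            # batch view now holds alice=2, bob=1
pipeline.ingest("alice")        # arrives after the batch run
print(pipeline.query("alice"))  # 3
print(pipeline.query("bob"))    # 1
```

The query for "alice" combines two events from the batch view with one from the speed view, which is the merge that Step 4 describes.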
Execution Table
| Step | Layer | Action | Data Processed | Output Produced |
|------|-------|--------|----------------|-----------------|
| 1 | Raw Data Input | Collect data stream | Events 1-1000 | Raw data stored |
| 2 | Batch Layer | Process batch data | Events 1-1000 | Batch view updated |
| 3 | Speed Layer | Process real-time data | Events 1001-1100 | Real-time view updated |
| 4 | Serving Layer | Merge batch and speed views | Batch + Real-time views | Unified view ready |
| 5 | User Queries | Query unified view | Unified view | Query results returned |
| 6 | Batch Layer | Next batch processing | Events 1-2000 | Batch view updated |
| 7 | Speed Layer | Process new real-time data | Events 2001-2100 | Real-time view updated |
| 8 | Serving Layer | Merge updated views | Batch + Real-time views | Unified view refreshed |
| 9 | User Queries | Query refreshed view | Unified view | Updated query results |
| 10 | End | No more data or queries | - | - |
💡 Execution stops when no new data arrives and no further queries are made.
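The event ranges in the Execution Table can be replayed to check how many events the unified view covers after each merge. The sketch below models each view as a Python `range` of event IDs, which is purely an illustrative assumption; the variable names are hypothetical.

```python
# Each view is modelled as the range of event IDs it covers (an assumption
# made for illustration; real views hold aggregates, not IDs).
batch_view = range(1, 1001)      # Step 2: batch layer processes events 1-1000
speed_view = range(1001, 1101)   # Step 3: speed layer processes events 1001-1100
unified_first = len(batch_view) + len(speed_view)   # Step 4: first merge

batch_view = range(1, 2001)      # Step 6: next batch run covers events 1-2000
speed_view = range(2001, 2101)   # Step 7: older real-time results are superseded
unified_second = len(batch_view) + len(speed_view)  # Step 8: second merge

# The two views must not overlap, or events would be counted twice.
assert set(batch_view).isdisjoint(speed_view)

print(unified_first, unified_second)  # 1100 2100
```

Note that at Step 7 the speed view no longer contains events 1001-1100, because the batch run at Step 6 already covers them.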
Variable Tracker
| Variable | Start | After Step 2 | After Step 3 | After Step 4 | After Step 6 | After Step 7 | After Step 8 |
|----------|-------|--------------|--------------|--------------|--------------|--------------|--------------|
| Raw Data | Empty | Events 1-1000 | Events 1-1100 | Events 1-1100 | Events 1-2000 | Events 1-2100 | Events 1-2100 |
| Batch View | Empty | Processed Events 1-1000 | Processed Events 1-1000 | Processed Events 1-1000 | Processed Events 1-2000 | Processed Events 1-2000 | Processed Events 1-2000 |
| Speed View | Empty | Empty | Processed Events 1001-1100 | Processed Events 1001-1100 | Processed Events 1001-1100 | Processed Events 2001-2100 | Processed Events 2001-2100 |
| Unified View | Empty | Empty | Empty | Batch + Speed Views merged | Stale (merge pending) | Stale (merge pending) | Batch + Speed Views merged |
Key Moments - 3 Insights
Why do we need both batch and speed layers instead of just one?
The batch layer processes the full dataset accurately but slowly (see Steps 2 and 6). The speed layer processes recent data quickly, but its results may be approximate (see Steps 3 and 7). Combining both gives fast and accurate results (Steps 4 and 8).
How does the serving layer combine data from batch and speed layers?
The serving layer merges batch views (historical data) and speed views (real-time data) to create a unified view for queries (Steps 4 and 8 in the Execution Table).
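One way to picture this merge is as a per-key sum of two lookup tables. The sketch below assumes the views hold additive counts (here, page views per URL) and that the two views cover disjoint sets of events; the function name `merge_views` and the sample data are hypothetical.

```python
def merge_views(batch_view, speed_view):
    """Sum per-key counts from both views; a key may appear in only one view."""
    merged = dict(batch_view)
    for key, count in speed_view.items():
        merged[key] = merged.get(key, 0) + count
    return merged


batch_view = {"/home": 950, "/cart": 50}      # computed from events 1-1000
speed_view = {"/home": 80, "/checkout": 20}   # computed from events 1001-1100
unified = merge_views(batch_view, speed_view)
print(unified)  # {'/home': 1030, '/cart': 50, '/checkout': 20}
```

Aggregates that are not simply additive, such as distinct counts or averages, would need a different merge rule than this sum.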
What happens if the speed layer misses some data?
The batch layer eventually reprocesses all raw data in large chunks, so any events the speed layer missed are picked up in the next batch run, ensuring accuracy over time (compare Steps 2 and 6).
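This correction can be illustrated with a small simulation in which the speed layer drops one event. It is a sketch under stated assumptions: the `ingest` function, the `speed_layer_up` flag, and the sensor names are hypothetical, and the raw data store is assumed never to lose events.

```python
from collections import Counter

master = []             # append-only raw data, treated as the source of truth
speed_view = Counter()  # real-time counts per sensor


def ingest(sensor, speed_layer_up=True):
    master.append(sensor)         # raw data is always stored
    if speed_layer_up:
        speed_view[sensor] += 1   # skipped while the speed layer is down


ingest("s1")
ingest("s1", speed_layer_up=False)  # the speed layer misses this event
ingest("s2")

before = speed_view["s1"]           # undercounts: 1 instead of 2

batch_view = Counter(master)        # batch run recomputes from all raw data
speed_view.clear()                  # superseded real-time results are dropped
after = batch_view["s1"] + speed_view["s1"]

print(before, after)  # 1 2
```

The undercount lasts only until the next batch run, which is why the speed layer's errors are temporary as long as the raw data is complete.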
Visual Quiz - 3 Questions
Test your understanding
Look at the Execution Table. What data does the speed layer process at Step 3?
A. Events 1001-1100
B. Events 1-1000
C. Events 1-1100
D. Events 2001-2100
💡 Hint
Check the 'Data Processed' column for Step 3 in the Execution Table.
At which step does the serving layer first merge batch and speed views?
A. Step 2
B. Step 3
C. Step 4
D. Step 6
💡 Hint
Look for 'Serving Layer' and 'Merge batch and speed views' in the Execution Table.
If the batch layer processes data more frequently, how would the Batch View row in the Variable Tracker change?
A. Batch view updates less often
B. Batch view updates more often with more data
C. Speed view updates more often
D. Unified view stops updating
💡 Hint
Refer to the 'Batch View' row in the Variable Tracker and think about batch processing frequency.
Concept Snapshot
Lambda Architecture combines batch and streaming data processing.
Batch layer processes large data sets slowly but accurately.
Speed layer processes recent data quickly but less accurately.
Serving layer merges both views for fast and accurate queries.
This design balances latency and accuracy in big data systems.
Full Transcript
Lambda architecture splits data processing into batch and speed layers. Raw data flows into both layers. The batch layer processes large chunks of data to create accurate batch views. The speed layer processes data in real-time to create fast but approximate views. The serving layer merges these views to provide users with up-to-date and accurate query results. This approach ensures low latency and high accuracy by combining the strengths of both batch and streaming processing.