
Flume for log collection in Hadoop - Step-by-Step Execution

Concept Flow - Flume for log collection
Log Source (e.g., server logs)
Flume Agent: Source
Flume Agent: Channel (Memory/Disk)
Flume Agent: Sink
HDFS or other storage system
Logs stored
Logs flow from their source into Flume agents, which buffer them in channels and then send them to storage like HDFS.
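The flow above can be sketched as a small Python simulation (a hypothetical illustration, not Flume code — a queue stands in for the channel, a list for HDFS):

```python
from queue import Queue

def source(lines, channel):
    """Source: capture new log lines and put them on the channel."""
    for line in lines:
        channel.put(line)

def sink(channel, storage):
    """Sink: drain buffered lines from the channel into storage (stand-in for HDFS)."""
    while not channel.empty():
        storage.append(channel.get())

channel = Queue()   # the channel buffers events between source and sink
storage = []        # stand-in for a file in HDFS
source(["log line 1", "log line 2"], channel)
sink(channel, storage)
print(storage)      # both lines reach storage; the channel is empty again
```

The key point the sketch shows: the source never talks to storage directly, so either side can pause without the other noticing.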
Execution Sample
Flume agent configuration (properties file):
agent.sources = source1
agent.channels = channel1
agent.sinks = sink1

agent.sources.source1.type = exec
agent.sources.source1.command = tail -F /var/log/syslog
agent.sources.source1.channels = channel1

agent.channels.channel1.type = memory

agent.sinks.sink1.type = hdfs
agent.sinks.sink1.hdfs.path = /logs/syslog/
agent.sinks.sink1.channel = channel1
This configuration sets up a Flume agent that collects syslog entries and stores them in HDFS; note that the source and sink must each be bound to the channel for events to flow.
Execution Table
Step | Component | Action | Data State | Result
1 | Source | Reads new log lines from /var/log/syslog | New log lines available | Lines captured by source
2 | Channel | Buffers log lines in memory | Lines stored in channel | Data safely held for sink
3 | Sink | Writes buffered lines to HDFS path /logs/syslog/ | Lines written to HDFS | Logs stored persistently
4 | Source | Waits for new log lines | No new lines | Idle until new data arrives
5 | Channel | Empty after sink writes | No buffered lines | Ready for next batch
6 | Sink | Waits for new data in channel | No data | Idle until channel fills
💡 Process runs continuously; stops only if Flume agent is stopped or source ends.
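The six table steps repeat as one continuous cycle. A rough Python sketch of that loop (hypothetical names; input batches are fed in for illustration, where an empty batch models the idle steps 4-6):

```python
from queue import Queue

def run_agent(batches):
    """Simulate the source -> channel -> sink cycle over a list of input batches."""
    channel = Queue()
    stored = []
    for batch in batches:
        for line in batch:       # step 1: source reads new lines
            channel.put(line)    # step 2: channel buffers them
        while not channel.empty():
            stored.append(channel.get())  # step 3: sink writes to storage
        # steps 4-6: nothing buffered; source, channel, and sink sit idle
        # until the next batch arrives
    return stored

print(run_agent([["a", "b"], [], ["c"]]))  # -> ['a', 'b', 'c']
```

A real agent loops forever rather than over a finite list, which matches the note above: it stops only when the agent is shut down or the source ends.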
Variable Tracker
Component | Start | After Step 1 | After Step 2 | After Step 3 | After Step 5
Source | No data | New log lines captured | Same | Same | Waiting
Channel | Empty | Empty | Buffered lines | Empty | Empty
Sink | Idle | Idle | Idle | Logs written | Idle
Key Moments - 3 Insights
Why does Flume use a channel between source and sink?
The channel buffers data so that if the sink is slow or temporarily unavailable, the source can keep collecting logs without losing data, as shown in Execution Table steps 2 and 3.
What happens if the source has no new log lines?
The source waits idle for new data, as seen in Execution Table step 4, so Flume does not consume resources unnecessarily.
Why is memory channel used in this example?
A memory channel is fast for buffering, but its contents are lost if the agent crashes; it suits logs where speed matters more than guaranteed delivery (a file channel makes the opposite trade-off), as the Variable Tracker implies.
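The memory channel's behavior can be illustrated with a bounded in-memory queue (the names and capacity here are illustrative, mirroring the channel's capacity setting, not Flume's API):

```python
from queue import Queue, Full

channel = Queue(maxsize=2)   # like a memory channel with capacity = 2
dropped = []

for event in ["e1", "e2", "e3"]:
    try:
        channel.put_nowait(event)   # fast: no disk I/O on the hot path
    except Full:
        dropped.append(event)       # channel full: the source must back off and retry
        # everything buffered in memory also vanishes if the process crashes

print(channel.qsize(), dropped)     # -> 2 ['e3']
```

This shows both halves of the trade-off: buffering is just a memory write, but capacity is finite and the buffer is volatile.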
Visual Quiz - 3 Questions
Test your understanding
Looking at the Execution Table, what is the state of the channel after step 2?
A. Channel is empty
B. Sink is writing to HDFS
C. Buffered lines stored in memory
D. Source is idle
💡 Hint
Check the 'Data State' column for step 2 in the Execution Table.
At which step does the sink write logs to HDFS?
A. Step 3
B. Step 1
C. Step 5
D. Step 6
💡 Hint
Look at the 'Action' column describing sink activity in the Execution Table.
If the source stops receiving new logs, what happens next according to the Execution Table?
A. Channel buffers more data
B. Source waits idle
C. Sink writes more logs
D. Flume agent stops automatically
💡 Hint
See step 4 of the Execution Table for the source's behavior when no new data arrives.
Concept Snapshot
Flume collects logs by reading from sources,
buffers them in channels,
and writes to storage sinks like HDFS.
Channels ensure data is not lost if sinks are slow.
Configuration defines sources, channels, and sinks.
Runs continuously to gather live log data.
Full Transcript
Flume is a tool that collects logs from sources like server log files. It uses agents that have three parts: source, channel, and sink. The source reads new log lines, the channel temporarily stores them, and the sink writes them to storage such as HDFS. This process runs continuously, buffering data to avoid loss if the sink is slow. The example configuration shows how to set up a Flume agent to tail syslog and store it in HDFS. The execution table traces how data moves step-by-step from source to channel to sink. Key points include the role of the channel as a buffer and what happens when no new logs arrive. The visual quiz tests understanding of these steps and states.