
Flume for log collection in Hadoop - Step-by-Step Execution

Concept Flow - Flume for log collection
Log Source (e.g., server logs)
Flume Agent: Source
Flume Agent: Channel (Memory/Disk)
Flume Agent: Sink
HDFS or other storage system
Logs stored
Logs flow from their source into Flume agents, which buffer them in channels and then send them to storage like HDFS.
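The flow above can be sketched as a small Python simulation (a hypothetical illustration, not Flume code — a queue stands in for the channel, a list for HDFS):

```python
from queue import Queue

def source(lines, channel):
    """Source: capture new log lines and put them on the channel."""
    for line in lines:
        channel.put(line)

def sink(channel, storage):
    """Sink: drain buffered lines from the channel into storage (stand-in for HDFS)."""
    while not channel.empty():
        storage.append(channel.get())

channel = Queue()   # the channel buffers events between source and sink
storage = []        # stand-in for a file in HDFS
source(["log line 1", "log line 2"], channel)
sink(channel, storage)
print(storage)      # both lines reach storage; the channel is empty again
```

The key point the sketch shows: the source never talks to storage directly, so either side can pause without the other noticing.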
Execution Sample
Flume agent configuration (properties file):
agent.sources = source1
agent.channels = channel1
agent.sinks = sink1

agent.sources.source1.type = exec
agent.sources.source1.command = tail -F /var/log/syslog
agent.sources.source1.channels = channel1

agent.channels.channel1.type = memory

agent.sinks.sink1.type = hdfs
agent.sinks.sink1.hdfs.path = /logs/syslog/
agent.sinks.sink1.channel = channel1
This configuration sets up a Flume agent that collects syslog entries and stores them in HDFS; note that the source and sink must each be bound to the channel for events to flow.
Execution Table
Step | Component | Action | Data State | Result
1 | Source | Reads new log lines from /var/log/syslog | New log lines available | Lines captured by source
2 | Channel | Buffers log lines in memory | Lines stored in channel | Data safely held for sink
3 | Sink | Writes buffered lines to HDFS path /logs/syslog/ | Lines written to HDFS | Logs stored persistently
4 | Source | Waits for new log lines | No new lines | Idle until new data arrives
5 | Channel | Empty after sink writes | No buffered lines | Ready for next batch
6 | Sink | Waits for new data in channel | No data | Idle until channel fills
💡 Process runs continuously; stops only if Flume agent is stopped or source ends.
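The six table steps repeat as one continuous cycle. A rough Python sketch of that loop (hypothetical names; input batches are fed in for illustration, where an empty batch models the idle steps 4-6):

```python
from queue import Queue

def run_agent(batches):
    """Simulate the source -> channel -> sink cycle over a list of input batches."""
    channel = Queue()
    stored = []
    for batch in batches:
        for line in batch:       # step 1: source reads new lines
            channel.put(line)    # step 2: channel buffers them
        while not channel.empty():
            stored.append(channel.get())  # step 3: sink writes to storage
        # steps 4-6: nothing buffered; source, channel, and sink sit idle
        # until the next batch arrives
    return stored

print(run_agent([["a", "b"], [], ["c"]]))  # -> ['a', 'b', 'c']
```

A real agent loops forever rather than over a finite list, which matches the note above: it stops only when the agent is shut down or the source ends.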
Variable Tracker
Component | Start | After Step 1 | After Step 2 | After Step 3 | After Step 5
Source | No data | New log lines captured | Same | Same | Waiting
Channel | Empty | Empty | Buffered lines | Empty | Empty
Sink | Idle | Idle | Idle | Logs written | Idle
Key Moments - 3 Insights
Why does Flume use a channel between source and sink?
The channel buffers data so that if the sink is slow or temporarily unavailable, the source can keep collecting logs without losing data, as shown in Execution Table steps 2 and 3.
What happens if the source has no new log lines?
The source waits idle for new data, as seen in Execution Table step 4, so Flume does not consume resources unnecessarily.
Why is memory channel used in this example?
A memory channel is fast for buffering, but its contents are lost if the agent crashes; it suits logs where speed matters more than guaranteed delivery (a file channel makes the opposite trade-off), as the Variable Tracker implies.
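The memory channel's behavior can be illustrated with a bounded in-memory queue (the names and capacity here are illustrative, mirroring the channel's capacity setting, not Flume's API):

```python
from queue import Queue, Full

channel = Queue(maxsize=2)   # like a memory channel with capacity = 2
dropped = []

for event in ["e1", "e2", "e3"]:
    try:
        channel.put_nowait(event)   # fast: no disk I/O on the hot path
    except Full:
        dropped.append(event)       # channel full: the source must back off and retry
        # everything buffered in memory also vanishes if the process crashes

print(channel.qsize(), dropped)     # -> 2 ['e3']
```

This shows both halves of the trade-off: buffering is just a memory write, but capacity is finite and the buffer is volatile.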
Visual Quiz - 3 Questions
Test your understanding
Looking at the Execution Table, what is the state of the channel after step 2?
A. Channel is empty
B. Sink is writing to HDFS
C. Buffered lines stored in memory
D. Source is idle
💡 Hint
Check the 'Data State' column for step 2 in the Execution Table.
At which step does the sink write logs to HDFS?
A. Step 3
B. Step 1
C. Step 5
D. Step 6
💡 Hint
Look at the 'Action' column describing sink activity in the Execution Table.
If the source stops receiving new logs, what happens next according to the Execution Table?
A. Channel buffers more data
B. Source waits idle
C. Sink writes more logs
D. Flume agent stops automatically
💡 Hint
See step 4 of the Execution Table for the source's behavior when no new data arrives.
Concept Snapshot
Flume collects logs by reading from sources,
buffers them in channels,
and writes to storage sinks like HDFS.
Channels ensure data is not lost if sinks are slow.
Configuration defines sources, channels, and sinks.
Runs continuously to gather live log data.
Full Transcript
Flume is a tool that collects logs from sources like server log files. It uses agents that have three parts: source, channel, and sink. The source reads new log lines, the channel temporarily stores them, and the sink writes them to storage such as HDFS. This process runs continuously, buffering data to avoid loss if the sink is slow. The example configuration shows how to set up a Flume agent to tail syslog and store it in HDFS. The execution table traces how data moves step-by-step from source to channel to sink. Key points include the role of the channel as a buffer and what happens when no new logs arrive. The visual quiz tests understanding of these steps and states.