Hadoop · ~30 mins

Flume for log collection in Hadoop - Mini Project: Build & Apply

Collecting Logs Using Flume in Hadoop
📖 Scenario: You are working as a system administrator for a company that needs to collect and store application logs efficiently. You will use Apache Flume to collect logs from a source and send them to Hadoop's HDFS for storage and analysis.
🎯 Goal: Build a simple Flume configuration to collect logs from a local file source and write them into HDFS.
📋 What You'll Learn
Create a Flume agent configuration file with a source, channel, and sink
Configure the source to read from a local log file
Configure the channel as a memory channel
Configure the sink to write logs to HDFS
Use exact names for the agent, source, channel, and sink as specified
💡 Why This Matters
🌍 Real World
Companies use Flume to collect logs from many servers and store them centrally in Hadoop for analysis and monitoring.
💼 Career
Understanding Flume configuration is important for roles in big data engineering, system administration, and data pipeline development.
1
Create Flume Agent and Source Configuration
Create a Flume agent named agent1. Add a source named src1 of type exec that runs the command tail -F /var/log/syslog to collect logs continuously.
💡 Hint: The source type exec runs a shell command to collect logs. Use tail -F /var/log/syslog to follow the system log file.
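The step above can be sketched as the opening lines of the agent's configuration file (the file name flume.conf is an assumption; the agent name, source name, and command come from the task):

```properties
# flume.conf — declare the agent's source and configure it
# to tail the system log continuously
agent1.sources = src1
agent1.sources.src1.type = exec
agent1.sources.src1.command = tail -F /var/log/syslog
```

The exec source runs the given command and turns each line of its output into a Flume event.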

2
Configure Memory Channel
Add a channel named ch1 of type memory to the agent agent1. Set the capacity to 10000 and the transaction capacity to 1000.
💡 Hint: The memory channel temporarily stores events in memory. Set capacity and transactionCapacity to control buffer sizes.
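Following the same flume.conf sketch, the channel section for this step would look like:

```properties
# Declare a memory channel with the buffer sizes from the task
agent1.channels = ch1
agent1.channels.ch1.type = memory
# capacity: maximum number of events buffered in the channel
agent1.channels.ch1.capacity = 10000
# transactionCapacity: maximum events per put/take transaction
agent1.channels.ch1.transactionCapacity = 1000
```

A memory channel is fast but loses buffered events if the agent process dies; a file channel trades speed for durability.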

3
Configure HDFS Sink
Add a sink named sink1 of type hdfs to the agent agent1. Set the HDFS path to /user/logs/ and the file prefix to log-.
💡 Hint: The sink writes events to HDFS. Set the hdfs.path and hdfs.filePrefix to organize files.
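Continuing the flume.conf sketch, the sink section for this step would look like (the path and prefix values come from the task):

```properties
# Declare an HDFS sink that writes collected logs under /user/logs/
agent1.sinks = sink1
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /user/logs/
# Files will be named log-<timestamp> inside the target directory
agent1.sinks.sink1.hdfs.filePrefix = log-
```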

4
Connect Source, Channel, and Sink
Connect the source src1, channel ch1, and sink sink1 in the agent agent1 by setting the source's channel to ch1 and the sink's channel to ch1.
💡 Hint: Sources and sinks must be connected to channels by setting sources.src1.channels and sinks.sink1.channel to the channel name. Note that the source property is channels (plural, since a source can feed several channels) while the sink property is channel (singular).
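The final wiring step adds two lines to the flume.conf sketch:

```properties
# Bind the source and sink to the channel, completing the pipeline
agent1.sources.src1.channels = ch1
agent1.sinks.sink1.channel = ch1
```

With all four sections in one file, the agent can then be started with the standard Flume launcher (directory and file names here are assumptions): flume-ng agent --conf conf --conf-file flume.conf --name agent1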