0
0
Hadoopdata~30 mins

Log management and troubleshooting in Hadoop - Mini Project: Build & Apply

Choose your learning style9 modes available
Log Management and Troubleshooting with Hadoop
📖 Scenario: You are a data engineer working with Hadoop. You receive a log file from a Hadoop cluster. The log contains entries with timestamps, log levels (INFO, WARN, ERROR), and messages. Your task is to analyze the log data to find how many ERROR messages occurred each hour.
🎯 Goal: Build a simple Hadoop MapReduce job to count the number of ERROR log entries per hour from the log data.
📋 What You'll Learn
Create a sample log dataset with timestamps and log levels
Set a filter to select only ERROR log entries
Write the MapReduce logic to count ERROR entries per hour
Output the count of ERROR messages grouped by hour
💡 Why This Matters
🌍 Real World
Analyzing logs helps find problems in Hadoop clusters quickly by showing when errors happen most often.
💼 Career
Data engineers and system administrators use log analysis to monitor system health and troubleshoot issues.
Progress0 / 4 steps
1
Create the sample log data
Create a list called logs with these exact entries as strings: "2024-06-01 10:15:23 INFO Job started", "2024-06-01 10:45:00 ERROR Disk failure", "2024-06-01 11:05:12 WARN Memory usage high", "2024-06-01 11:20:45 ERROR Network timeout", "2024-06-01 12:00:00 INFO Job finished"
Hadoop
Need a hint?

Use a Python list with the exact log strings inside double quotes.

2
Set the filter for ERROR logs
Create a variable called error_logs that contains only the entries from logs which include the exact substring "ERROR".
Hadoop
Need a hint?

Use a list comprehension to filter logs containing the substring "ERROR".

3
Count ERROR logs per hour
Create a dictionary called error_counts. Use a for loop with variable log to iterate over error_logs. Extract the hour from the timestamp (the substring from index 11 to 13) and count how many ERROR logs occur in each hour.
Hadoop
Need a hint?

Use string slicing to get the hour from the timestamp and a dictionary to count occurrences.

4
Print the ERROR counts per hour
Write a print statement to display the error_counts dictionary.
Hadoop
Need a hint?

Use print(error_counts) to show the dictionary.