
Log management and troubleshooting in Hadoop - Mini Project: Build & Apply


Log Management and Troubleshooting with Hadoop

📖 Scenario: You are a data engineer working with Hadoop. You receive a log file from a Hadoop cluster. The log contains entries with timestamps, log levels (INFO, WARN, ERROR), and messages. Your task is to analyze the log data to find how many ERROR messages occurred each hour.

🎯 Goal: Build a simple MapReduce-style job, simulated here in plain Python, that counts the number of ERROR log entries per hour in the log data.

📋 What You'll Learn

Create a sample log dataset with timestamps and log levels

Set a filter to select only ERROR log entries

Write the MapReduce logic to count ERROR entries per hour

Output the count of ERROR messages grouped by hour

💡 Why This Matters

🌍 Real World

Analyzing logs helps find problems in Hadoop clusters quickly by showing when errors happen most often.

💼 Career

Data engineers and system administrators use log analysis to monitor system health and troubleshoot issues.


Create the sample log data

Create a list called logs with these exact entries as strings:

"2024-06-01 10:15:23 INFO Job started"

"2024-06-01 10:45:00 ERROR Disk failure"

"2024-06-01 11:05:12 WARN Memory usage high"

"2024-06-01 11:20:45 ERROR Network timeout"

"2024-06-01 12:00:00 INFO Job finished"


# Create the list called logs with the exact entries
# Your code here

Need a hint?

Use a Python list with the exact log strings inside double quotes.

Set the filter for ERROR logs

Create a variable called error_logs that contains only the entries from logs which include the exact substring "ERROR".


logs = [
    "2024-06-01 10:15:23 INFO Job started",
    "2024-06-01 10:45:00 ERROR Disk failure",
    "2024-06-01 11:05:12 WARN Memory usage high",
    "2024-06-01 11:20:45 ERROR Network timeout",
    "2024-06-01 12:00:00 INFO Job finished"
]
# Create error_logs list with only entries containing "ERROR"
# Your code here

Need a hint?

Use a list comprehension to filter logs containing the substring "ERROR".
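
Substring matching is enough for the sample data, but it would also select an entry of another level whose message happens to contain the word ERROR. As a minimal sketch of a stricter filter, assuming every entry follows the "date time LEVEL message" layout used above, the level can be compared as its own field. The helper name is_error is hypothetical and not part of the exercise.

```python
logs = [
    "2024-06-01 10:15:23 INFO Job started",
    "2024-06-01 10:45:00 ERROR Disk failure",
    "2024-06-01 11:05:12 WARN Memory usage high",
    "2024-06-01 11:20:45 ERROR Network timeout",
    "2024-06-01 12:00:00 INFO Job finished",
]


def is_error(entry):
    """Return True when the third whitespace-separated field is ERROR.

    Assumes the layout "date time LEVEL message"; malformed entries
    with fewer than three fields are treated as non-errors.
    """
    parts = entry.split(maxsplit=3)
    return len(parts) >= 3 and parts[2] == "ERROR"


# Same list-comprehension shape as the exercise, with a stricter test.
error_logs = [log for log in logs if is_error(log)]
print(error_logs)
```

For the five sample entries this selects the same two lines as the substring test, so either approach satisfies the step.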

Count ERROR logs per hour

Create a dictionary called error_counts. Use a for loop with variable log to iterate over error_logs. Extract the hour from the timestamp with the slice log[11:13] (the two characters at indices 11 and 12, since the end index is exclusive) and count how many ERROR logs occur in each hour.


logs = [
    "2024-06-01 10:15:23 INFO Job started",
    "2024-06-01 10:45:00 ERROR Disk failure",
    "2024-06-01 11:05:12 WARN Memory usage high",
    "2024-06-01 11:20:45 ERROR Network timeout",
    "2024-06-01 12:00:00 INFO Job finished"
]

error_logs = [log for log in logs if "ERROR" in log]

# Create error_counts dictionary and count ERROR logs per hour
# Your code here

Need a hint?

Use string slicing to get the hour from the timestamp and a dictionary to count occurrences.
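
To see why the slice picks out the hour, the short sketch below applies it to one sample entry and cross-checks the result by parsing the timestamp with the standard datetime module. The variable names are illustrative, and the parsing variant assumes the timestamp always occupies the first 19 characters.

```python
from datetime import datetime

entry = "2024-06-01 10:45:00 ERROR Disk failure"

# Indices 0-9 hold the date, index 10 is a space,
# so indices 11 and 12 hold the two-digit hour.
hour_by_slice = entry[11:13]

# Alternative: parse the full timestamp, then format the hour
# back to two digits so it matches the sliced string.
timestamp = datetime.strptime(entry[:19], "%Y-%m-%d %H:%M:%S")
hour_by_parse = f"{timestamp.hour:02d}"

print(hour_by_slice, hour_by_parse)  # 10 10
```

Slicing is shorter and is what the exercise expects; parsing is slower but fails loudly on a malformed timestamp instead of silently returning the wrong characters.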

Print the ERROR counts per hour

Write a print statement to display the error_counts dictionary. For the sample data, the output is {'10': 1, '11': 1}.


logs = [
    "2024-06-01 10:15:23 INFO Job started",
    "2024-06-01 10:45:00 ERROR Disk failure",
    "2024-06-01 11:05:12 WARN Memory usage high",
    "2024-06-01 11:20:45 ERROR Network timeout",
    "2024-06-01 12:00:00 INFO Job finished"
]

error_logs = [log for log in logs if "ERROR" in log]

error_counts = {}
for log in error_logs:
    hour = log[11:13]
    if hour in error_counts:
        error_counts[hour] += 1
    else:
        error_counts[hour] = 1

# Print the error_counts dictionary
# Your code here

Need a hint?

Use print(error_counts) to show the dictionary.
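
The loop above folds filtering, hour extraction, and counting into one pass. To relate it to the MapReduce model named in the goal, the sketch below splits the same work into a map phase, a sort (shuffle) phase, and a reduce phase. It is plain Python that only imitates the model and does not use any Hadoop API; the function names mapper and reducer are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

logs = [
    "2024-06-01 10:15:23 INFO Job started",
    "2024-06-01 10:45:00 ERROR Disk failure",
    "2024-06-01 11:05:12 WARN Memory usage high",
    "2024-06-01 11:20:45 ERROR Network timeout",
    "2024-06-01 12:00:00 INFO Job finished",
]


def mapper(entry):
    """Emit one (hour, 1) pair for an ERROR entry, nothing otherwise."""
    if "ERROR" in entry:
        yield entry[11:13], 1


def reducer(hour, counts):
    """Sum all the counts emitted for a single hour."""
    return hour, sum(counts)


# Map phase: run the mapper over every entry and collect the pairs.
pairs = [pair for entry in logs for pair in mapper(entry)]

# Shuffle phase: sort by key so pairs for the same hour are adjacent.
pairs.sort(key=itemgetter(0))

# Reduce phase: group adjacent pairs by hour and sum each group.
error_counts = dict(
    reducer(hour, (count for _, count in group))
    for hour, group in groupby(pairs, key=itemgetter(0))
)

print(error_counts)  # {'10': 1, '11': 1}
```

The result matches the dictionary built by the for loop, which is a quick way to check that both versions count the same thing.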