Challenge - 5 Problems

Hadoop Log Mastery

❓ Predict Output

intermediate

Understanding Hadoop Log Levels

Consider the following Hadoop log4j.properties snippet, which sets a different log level for each logger:

log4j.logger.org.apache.hadoop=INFO, console
log4j.logger.org.apache.hadoop.hdfs=WARN, console
log4j.logger.org.apache.hadoop.mapreduce=ERROR, console

Assume three log events: one of level INFO from org.apache.hadoop.hdfs, one of level WARN from org.apache.hadoop.hdfs, and one of level WARN from org.apache.hadoop.mapreduce. Which will be printed to the console?

A. Only the INFO log from org.apache.hadoop.hdfs is printed

B. Only the WARN log from org.apache.hadoop.hdfs is printed

C. Only the WARN log from org.apache.hadoop.mapreduce is printed

D. Both INFO from org.apache.hadoop.hdfs and WARN from org.apache.hadoop.mapreduce are printed
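
The level lookup behind this question can be sketched in a few lines. This is a simplified model of log4j 1.x level inheritance, not the real log4j API: the names `effective_level` and `is_enabled` are hypothetical, the root level of INFO is an assumption (the snippet defines no root logger), and appender additivity is ignored.

```python
# Simplified model of log4j 1.x level inheritance; not the real log4j API.
LEVELS = {"DEBUG": 10, "INFO": 20, "WARN": 30, "ERROR": 40, "OFF": 100}

# Levels taken from the configuration snippet in the question.
CONFIGURED = {
    "org.apache.hadoop": "INFO",
    "org.apache.hadoop.hdfs": "WARN",
    "org.apache.hadoop.mapreduce": "ERROR",
}


def effective_level(logger_name, configured=CONFIGURED, root_level="INFO"):
    """Return the level of the logger itself or of its nearest configured ancestor."""
    name = logger_name
    while name:
        if name in configured:
            return configured[name]
        # Drop the last dotted component to move up to the parent logger.
        name = name.rpartition(".")[0]
    return root_level  # assumed root level; the snippet does not define one


def is_enabled(logger_name, event_level):
    """True if an event at event_level passes the logger's effective threshold."""
    return LEVELS[event_level] >= LEVELS[effective_level(logger_name)]


print(is_enabled("org.apache.hadoop.hdfs.server.namenode", "INFO"))   # False
print(is_enabled("org.apache.hadoop.hdfs.server.namenode", "ERROR"))  # True
print(is_enabled("org.apache.hadoop.yarn", "INFO"))                   # True
```

A child logger without its own entry inherits the threshold of the closest configured parent, which is why the namenode logger above follows the hdfs setting.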

❓ Data Output

intermediate

Parsing Hadoop Logs for Error Counts

Given a Hadoop log file with lines like:

2024-06-01 12:00:01,234 INFO Client: Connection established
2024-06-01 12:00:02,345 ERROR DataNode: Disk failure detected
2024-06-01 12:00:03,456 WARN NameNode: High memory usage

Which Python code snippet most reliably counts the ERROR log entries by checking the log-level field itself, rather than matching text anywhere in the line?

log_lines = [
    '2024-06-01 12:00:01,234 INFO Client: Connection established',
    '2024-06-01 12:00:02,345 ERROR DataNode: Disk failure detected',
    '2024-06-01 12:00:03,456 WARN NameNode: High memory usage'
]

# Count ERROR logs

A. error_count = len([line for line in log_lines if 'error' in line.lower()])

B. error_count = len([line for line in log_lines if line.startswith('ERROR')])

C. error_count = sum(1 for line in log_lines if line.split()[2] == 'ERROR')

D. error_count = sum(1 for line in log_lines if 'ERROR' in line)
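
The difference between matching the level field and matching text anywhere in the line shows up once a message body mentions another level. The sketch below assumes the line layout from the sample (date, time, level, source, message); `count_levels` is a hypothetical helper, and the fourth sample line is invented for illustration.

```python
import re
from collections import Counter

# Assumed layout: "<date> <time,millis> <LEVEL> <source>: <message>".
LINE_PATTERN = re.compile(r"^\S+ \S+ (?P<level>[A-Z]+) ")


def count_levels(lines):
    """Count log entries per level, reading only the third whitespace-separated field."""
    counts = Counter()
    for line in lines:
        match = LINE_PATTERN.match(line)
        if match:  # skip lines that do not follow the assumed layout
            counts[match.group("level")] += 1
    return counts


sample = [
    "2024-06-01 12:00:01,234 INFO Client: Connection established",
    "2024-06-01 12:00:02,345 ERROR DataNode: Disk failure detected",
    "2024-06-01 12:00:03,456 WARN NameNode: High memory usage",
    # Hypothetical line: an INFO entry whose message mentions ERROR.
    "2024-06-01 12:00:04,567 INFO Client: Retrying after ERROR response",
]

counts = count_levels(sample)
print(counts["ERROR"])  # 1
print(counts["INFO"])   # 2
```

A plain substring test such as `'ERROR' in line` would count the hypothetical fourth line as an error, while the field-based match does not.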

🔧 Debug

advanced

Troubleshooting Missing Hadoop Logs

A Hadoop cluster admin notices that logs from org.apache.hadoop.mapreduce are missing from the log output, even though tasks are running. The log4j.properties file contains:

log4j.logger.org.apache.hadoop=INFO, console
log4j.logger.org.apache.hadoop.mapreduce=OFF, console

What is the cause of missing logs for MapReduce?

A. The console appender is not defined, so logs are not saved

B. The log4j.properties file is missing the root logger configuration

C. The log level OFF disables all logging for org.apache.hadoop.mapreduce

D. The INFO level is too low to capture MapReduce logs
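
When logs go missing, one quick check is to scan the properties file for loggers whose level silences them. The following is a minimal sketch under the assumption that logger entries use the `log4j.logger.<name>=<LEVEL>, <appender>` form shown above; `find_disabled_loggers` is a hypothetical helper, not part of Hadoop or log4j.

```python
def find_disabled_loggers(properties_text):
    """Return the names of loggers whose configured level is OFF."""
    prefix = "log4j.logger."
    disabled = []
    for raw_line in properties_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and Java properties comments ("#" or "!").
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        key = key.strip()
        if not sep or not key.startswith(prefix):
            continue
        # The level is the first comma-separated item; appenders follow it.
        level = value.split(",")[0].strip().upper()
        if level == "OFF":
            disabled.append(key[len(prefix):])
    return disabled


config = """
log4j.logger.org.apache.hadoop=INFO, console
log4j.logger.org.apache.hadoop.mapreduce=OFF, console
"""

print(find_disabled_loggers(config))  # ['org.apache.hadoop.mapreduce']
```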

🚀 Application

advanced

Visualizing Hadoop Log Error Trends

You have extracted daily counts of ERROR logs from Hadoop over 7 days:

errors = [5, 7, 3, 8, 6, 10, 4]

Which Python code using matplotlib will correctly plot these error counts as a line chart with days on the x-axis labeled from 1 to 7?

import matplotlib.pyplot as plt
errors = [5, 7, 3, 8, 6, 10, 4]

A.
plt.plot(range(1, 8), errors)
plt.xlabel('Day')
plt.ylabel('Error Count')
plt.title('Hadoop Error Trends')
plt.show()

B.
plt.plot(errors)
plt.xlabel('Error Count')
plt.ylabel('Day')
plt.title('Hadoop Error Trends')
plt.show()

C.
plt.bar(range(7), errors)
plt.xlabel('Day')
plt.ylabel('Error Count')
plt.title('Hadoop Error Trends')
plt.show()

D.
plt.scatter(range(1, 8), errors)
plt.xlabel('Error Count')
plt.ylabel('Day')
plt.title('Hadoop Error Trends')
plt.show()
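
The x values deserve a second look, because calling `plt.plot` with only the y values numbers the points from 0. A small sketch, without any plotting, of how the 1-based day labels can be built and paired with the counts (the variable names `days`, `points`, `peak_day`, and `peak_count` are illustrative):

```python
errors = [5, 7, 3, 8, 6, 10, 4]

# One label per count, starting at day 1 rather than index 0.
days = list(range(1, len(errors) + 1))

# Pair each day with its count; these pairs are what a line chart would connect.
points = list(zip(days, errors))

# The day with the highest error count.
peak_day, peak_count = max(points, key=lambda point: point[1])

print(days)                  # [1, 2, 3, 4, 5, 6, 7]
print(peak_day, peak_count)  # 6 10
```

Deriving `days` from `len(errors)` keeps the labels in step with the data if more days are added later.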

🧠 Conceptual

expert

Root Cause Analysis Using Hadoop Logs

During a Hadoop job failure, the logs show repeated java.net.ConnectException: Connection refused errors from DataNode to NameNode. Which is the most likely root cause?

A. NameNode service is down or unreachable from DataNode

B. DataNode has disk failure causing connection errors

C. Network bandwidth is saturated causing slow connections

D. MapReduce job configuration has incorrect memory settings
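
A "Connection refused" error means the TCP connection itself was rejected, so a basic reachability check from the affected host is a reasonable first diagnostic. The sketch below uses Python's standard `socket.create_connection`; the helper name `can_connect` is hypothetical, and the host and port in the commented usage line are placeholders to be replaced with the values from your own fs.defaultFS setting.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Example usage (placeholder host and port, not taken from the source):
# can_connect("namenode.example.internal", 8020)
```

A False result from the DataNode host points toward the NameNode process being down, listening on a different address or port, or blocked by a firewall; it does not distinguish between these cases.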
