Which scenario best describes when Hadoop is the most suitable option for data storage in a modern data stack?
Think about Hadoop's strengths: storing and processing big data at scale, and tolerating hardware failures.
Hadoop is designed to store and process huge amounts of unstructured data distributed across many machines, providing fault tolerance. It is not optimized for real-time or transactional workloads.
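As an illustration of the distributed-storage side, the HDFS command line can load a file into the cluster and control its replication factor (the mechanism behind fault tolerance). A minimal sketch, assuming a running HDFS cluster; the paths are hypothetical:

```shell
# Copy a local log file into HDFS; blocks are spread across data nodes.
hdfs dfs -put access.log /logs/access.log

# Raise the replication factor to 3 so the file survives node failures.
hdfs dfs -setrep 3 /logs/access.log

# Confirm the file and its replication factor (second column of the listing).
hdfs dfs -ls /logs
```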
What is the output of this simplified Python simulation of a Hadoop MapReduce word count job?
data = ['apple', 'banana', 'apple', 'orange', 'banana', 'apple']

# Map step
mapped = [(word, 1) for word in data]

# Shuffle and sort step (group by word)
grouped = {}
for word, count in mapped:
    grouped[word] = grouped.get(word, 0) + count

# Reduce step
reduced = {word: count for word, count in grouped.items()}
print(reduced)
Count how many times each word appears in the list.
The code counts the occurrences of each word: 'apple' appears 3 times, 'banana' 2 times, and 'orange' once.
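For comparison, the same word count can be written with the standard library's `collections.Counter`, which performs the grouping (shuffle) and summing (reduce) in one step:

```python
from collections import Counter

# Same input as the MapReduce simulation above.
data = ['apple', 'banana', 'apple', 'orange', 'banana', 'apple']

# Counter tallies occurrences in a single pass.
counts = Counter(data)
print(dict(counts))  # {'apple': 3, 'banana': 2, 'orange': 1}
```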
Given a dataset stored in Hadoop HDFS, what is the main advantage of using Apache Spark on top of Hadoop for processing?
Choose the option that best describes the result of this combination.
Think about Spark's in-memory processing and Hadoop's storage.
Spark can process data stored in Hadoop's HDFS much faster by using in-memory computation, enabling complex analytics and machine learning on big data.
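A plain-Python analog of Spark's pipeline style can make the contrast concrete. In real Spark, `lines` would come from HDFS (e.g. `spark.read.text("hdfs://...")`) and intermediate results would be cached in executor memory with `rdd.cache()`, avoiding the disk writes between MapReduce stages; the sketch below only mimics the map/reduceByKey pattern with the standard library, and the sample lines are invented for illustration:

```python
from functools import reduce

# Stand-in for lines read from HDFS.
lines = ["error disk full", "info job done", "error timeout"]

# flatMap analog: split each line into words.
words = [w for line in lines for w in line.split()]

# reduceByKey analog: fold per-word counts into a dict.
def add_count(acc, word):
    acc[word] = acc.get(word, 0) + 1
    return acc

counts = reduce(add_count, words, {})
print(counts)
```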
Assuming the directory /data/output already exists in HDFS, what error will this Hadoop streaming command produce?
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -input /data/input \
    -output /data/output \
    -mapper 'python mapper.py' \
    -reducer 'python reducer.py' \
    -file mapper.py \
    -file reducer.py
Check if the output directory exists before running a Hadoop job.
Hadoop streaming jobs fail if the output directory already exists. You must delete or rename it before running the job again.
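The usual fix is to remove the stale output directory before re-running; a sketch, assuming an HDFS cluster and the paths from the example above:

```shell
# Recursively delete the previous run's output directory.
hdfs dfs -rm -r /data/output

# Equivalent with the older "hadoop fs" front end:
# hadoop fs -rm -r /data/output
```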
You are designing a modern data stack for a company with massive historical logs and need to store petabytes of data cost-effectively. The data is mostly unstructured and accessed for batch analytics. Which option best justifies using Hadoop in this architecture?
Consider Hadoop's strengths in storage and batch processing for large unstructured data.
Hadoop is designed for scalable, distributed storage and batch processing of very large unstructured datasets, making it cost-effective for petabyte-scale historical logs. It is not suited for real-time or transactional workloads.