Challenge - 5 Problems

🎖️

Row Key Mastery

Get all challenges correct to earn this badge!

Test your skills under time pressure!

🧠 Conceptual

intermediate

2:00remaining

Why is row key design important in Hadoop HBase?

In HBase, the design of the row key affects performance and data retrieval. Why is it important to choose a good row key?

ABecause row keys control the replication factor of data in HDFS.

BBecause row keys determine the file format used to store data on disk.

CBecause a well-designed row key ensures even data distribution and fast lookups.

DBecause row keys decide the compression algorithm applied to data.

Attempts:

2 left

❓ Predict Output

intermediate

2:00remaining

What is the output of this row key generation code?

Consider this Python code generating row keys for time-series data:

keys = [f"user_{i}_{20240101 + i}" for i in range(3)]
print(keys)

What is the output?

Hadoop

keys = [f"user_{i}_{20240101 + i}" for i in range(3)]
print(keys)

A["user_0_20240101", "user_1_20240102", "user_2_20240103"]

B["user_0_20240101", "user_1_20240101", "user_2_20240101"]

C["user_0_20240101", "user_1_20240103", "user_2_20240105"]

D["user_0_20240100", "user_1_20240101", "user_2_20240102"]

Attempts:

2 left

❓ data_output

advanced

2:00remaining

How many regions will be created with this row key pattern?

In HBase, row keys are prefixed with a hash of the user ID modulo 4 to distribute load:

def generate_key(user_id, timestamp):
    prefix = user_id % 4
    return f"{prefix}_{user_id}_{timestamp}"

keys = [generate_key(i, 20240101) for i in range(8)]
regions = set(k.split('_')[0] for k in keys)
print(len(regions))

How many unique region prefixes are there?

Hadoop

def generate_key(user_id, timestamp):
    prefix = user_id % 4
    return f"{prefix}_{user_id}_{timestamp}"

keys = [generate_key(i, 20240101) for i in range(8)]
regions = set(k.split('_')[0] for k in keys)
print(len(regions))

Attempts:

2 left

🔧 Debug

advanced

2:00remaining

Identify the error in this row key design code snippet

This code tries to create a row key by concatenating user ID and timestamp, but it raises an error:

user_id = 123
timestamp = 20240101
row_key = user_id + "_" + timestamp
print(row_key)

What error does this code raise?

Hadoop

user_id = 123
timestamp = 20240101
row_key = user_id + "_" + timestamp
print(row_key)

ANameError: name 'timestamp' is not defined

BTypeError: unsupported operand type(s) for +: 'int' and 'str'

CSyntaxError: invalid syntax

DValueError: invalid literal for int() with base 10

Attempts:

2 left

🚀 Application

expert

3:00remaining

Which row key design best avoids hotspotting for sequential timestamps?

You have a dataset with many records per user, each with a sequential timestamp. You want to design row keys in HBase to avoid hotspotting (too much load on one region). Which row key design is best?

AUse row keys as <code>hash(userID)_timestamp</code> where hash distributes users evenly.

BUse row keys as <code>reverse_timestamp_userID</code> where timestamp digits are reversed.

CUse row keys as <code>userID_timestamp</code> with timestamps increasing sequentially.

DUse row keys as <code>timestamp_userID</code> with timestamps increasing sequentially.

Attempts:

2 left