
Row key design strategies in Hadoop - Practice Problems & Coding Challenges

Challenge - 5 Problems
🎖️ Row Key Mastery: get all challenges correct to earn this badge!
Test your skills under time pressure!
🧠 Conceptual (intermediate)
Why is row key design important in Hadoop HBase?

In HBase, the design of the row key affects performance and data retrieval. Why is it important to choose a good row key?

A. Because row keys control the replication factor of data in HDFS.
B. Because row keys determine the file format used to store data on disk.
C. Because a well-designed row key ensures even data distribution and fast lookups.
D. Because row keys decide the compression algorithm applied to data.
💡 Hint

Think about how data is stored and accessed in HBase tables.
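To make the hint concrete, here is an illustrative sketch contrasting a timestamp-led key with a salted one (the `salted_key` helper and its 4-bucket salt are assumptions for illustration, not part of the challenge):

```python
import hashlib

def sequential_key(user_id: int, ts: int) -> str:
    # Timestamp-first keys sort all new writes next to each other,
    # so a single region absorbs the whole write load.
    return f"{ts}_{user_id}"

def salted_key(user_id: int, ts: int, buckets: int = 4) -> str:
    # A short, deterministic hash prefix spreads users across regions
    # while still allowing point lookups by user_id.
    salt = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16) % buckets
    return f"{salt}_{user_id}_{ts}"

seq_prefixes = {sequential_key(u, 20240101).split("_")[0] for u in range(8)}
salt_prefixes = {salted_key(u, 20240101).split("_")[0] for u in range(8)}
print(len(seq_prefixes))   # all sequential keys share one prefix
print(len(salt_prefixes))  # salted keys fan out over multiple prefixes
```

Keys are stored in sorted order, so the leading bytes of the key decide which region takes the write and how fast a lookup can narrow its scan.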

Predict Output (intermediate)
What is the output of this row key generation code?

Consider this Python code generating row keys for time-series data:

keys = [f"user_{i}_{20240101 + i}" for i in range(3)]
print(keys)

What is the output?

A. ["user_0_20240101", "user_1_20240102", "user_2_20240103"]
B. ["user_0_20240101", "user_1_20240101", "user_2_20240101"]
C. ["user_0_20240101", "user_1_20240103", "user_2_20240105"]
D. ["user_0_20240100", "user_1_20240101", "user_2_20240102"]
💡 Hint

Look at how the date part increments with i.
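The key fact behind the hint is that an f-string evaluates the expression inside each pair of braces on every iteration. A smaller parallel example (with a made-up `base` value, so it does not spell out the answer) shows the pattern:

```python
# Each brace pair is evaluated per iteration, so the numeric
# suffix advances together with the loop index.
base = 100
labels = [f"item_{i}_{base + i}" for i in range(3)]
print(labels)  # ['item_0_100', 'item_1_101', 'item_2_102']
```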

Data Output (advanced)
How many regions will be created with this row key pattern?

In HBase, row keys are prefixed with a hash of the user ID modulo 4 to distribute load:

def generate_key(user_id, timestamp):
    prefix = user_id % 4
    return f"{prefix}_{user_id}_{timestamp}"

keys = [generate_key(i, 20240101) for i in range(8)]
regions = set(k.split('_')[0] for k in keys)
print(len(regions))

How many unique region prefixes are there?

A. 1
B. 8
C. 2
D. 4
💡 Hint

Consider the modulo operation and how many distinct values it produces for 0 to 7.
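As a smaller parallel check (using 3 buckets instead of the challenge's 4, so it does not reveal the answer), `i % n` cycles through `0..n-1` once the input range covers a full cycle:

```python
# i % 3 yields 0, 1, 2, 0, 1, 2 for i = 0..5:
# six inputs collapse onto three distinct prefixes.
prefixes = {str(i % 3) for i in range(6)}
print(sorted(prefixes))  # ['0', '1', '2']
```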

🔧 Debug (advanced)
Identify the error in this row key design code snippet

This code tries to create a row key by concatenating user ID and timestamp, but it raises an error:

user_id = 123
timestamp = 20240101
row_key = user_id + "_" + timestamp
print(row_key)

What error does this code raise?

A. NameError: name 'timestamp' is not defined
B. TypeError: unsupported operand type(s) for +: 'int' and 'str'
C. SyntaxError: invalid syntax
D. ValueError: invalid literal for int() with base 10
💡 Hint

Check the data types of variables used with the plus operator.
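Once the failing line is identified, one possible repair (a sketch, not the challenge's official solution) is to convert the integers before concatenating, or to let an f-string handle the conversion:

```python
user_id = 123
timestamp = 20240101

# Explicit casts make + a pure string concatenation.
row_key = str(user_id) + "_" + str(timestamp)

# An f-string formats the ints without any explicit cast.
row_key_f = f"{user_id}_{timestamp}"

print(row_key, row_key_f)  # 123_20240101 123_20240101
```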

🚀 Application (expert)
Which row key design best avoids hotspotting for sequential timestamps?

You have a dataset with many records per user, each with a sequential timestamp. You want to design row keys in HBase to avoid hotspotting (too much load on one region). Which row key design is best?

A. Use row keys as hash(userID)_timestamp, where the hash distributes users evenly.
B. Use row keys as reverse_timestamp_userID, where the timestamp digits are reversed.
C. Use row keys as userID_timestamp, with timestamps increasing sequentially.
D. Use row keys as timestamp_userID, with timestamps increasing sequentially.
💡 Hint

Think about how to spread writes evenly across regions.
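To make one of the candidate designs concrete, here is a hedged sketch of digit reversal as mentioned in option B (the `reverse_ts_key` helper is illustrative, not a standard HBase API). Reversing the decimal digits puts the fastest-changing digit first, so consecutive timestamps no longer share a leading prefix:

```python
def reverse_ts_key(user_id: int, ts: int) -> str:
    # Reverse the decimal digits of the timestamp so that the
    # least-significant (fastest-changing) digit leads the key.
    return f"{str(ts)[::-1]}_{user_id}"

for ts in (20240101, 20240102, 20240103):
    print(reverse_ts_key(7, ts))
# 10104202_7
# 20104202_7
# 30104202_7
```

Hashing the user ID, as in option A, spreads writes in a similar way while keeping all of one user's rows contiguous, which matters if scans by user are a common access pattern.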