In HBase, the design of the row key affects performance and data retrieval. Why is it important to choose a good row key?
Think about how data is stored and accessed in HBase tables.
Row keys in HBase determine how data is sorted and distributed across regions. A good row key design avoids hotspots and improves query speed.
Consider this Python code generating row keys for time-series data:
keys = [f"user_{i}_{20240101 + i}" for i in range(3)]
print(keys)What is the output?
keys = [f"user_{i}_{20240101 + i}" for i in range(3)] print(keys)
Look at how the date part increments with i.
The date part adds i to 20240101, so it increases by 1 each time.
In HBase, row keys are prefixed with a hash of the user ID modulo 4 to distribute load:
def generate_key(user_id, timestamp):
prefix = user_id % 4
return f"{prefix}_{user_id}_{timestamp}"
keys = [generate_key(i, 20240101) for i in range(8)]
regions = set(k.split('_')[0] for k in keys)
print(len(regions))How many unique region prefixes are there?
def generate_key(user_id, timestamp): prefix = user_id % 4 return f"{prefix}_{user_id}_{timestamp}" keys = [generate_key(i, 20240101) for i in range(8)] regions = set(k.split('_')[0] for k in keys) print(len(regions))
Consider the modulo operation and how many distinct values it produces for 0 to 7.
Modulo 4 of numbers 0 to 7 produces 4 distinct prefixes: 0,1,2,3.
This code tries to create a row key by concatenating user ID and timestamp, but it raises an error:
user_id = 123 timestamp = 20240101 row_key = user_id + "_" + timestamp print(row_key)
What error does this code raise?
user_id = 123 timestamp = 20240101 row_key = user_id + "_" + timestamp print(row_key)
Check the data types of variables used with the plus operator.
Adding an integer and a string directly causes a TypeError in Python.
You have a dataset with many records per user, each with a sequential timestamp. You want to design row keys in HBase to avoid hotspotting (too much load on one region). Which row key design is best?
Think about how to spread writes evenly across regions.
Hashing userID prefixes the row key with a value that distributes data across regions, avoiding hotspotting caused by sequential timestamps.