0
0
Hadoopdata~30 mins

Row key design strategies in Hadoop - Mini Project: Build & Apply

Choose your learning style9 modes available
Row Key Design Strategies in Hadoop
📖 Scenario: You work for a company that stores large amounts of user activity data in Hadoop. Efficient data retrieval depends on how you design the row keys in your tables.Good row key design helps speed up queries and avoid data hotspots.
🎯 Goal: Build a simple example to understand how to create row keys using different strategies and see their effect on data organization.
📋 What You'll Learn
Create a dictionary called user_activities with user IDs as keys and lists of activity timestamps as values
Create a configuration variable called use_salted_key to decide if row keys should be salted
Write code to generate row keys for each user activity using either plain or salted keys based on use_salted_key
Print the list of generated row keys
💡 Why This Matters
🌍 Real World
Row key design is critical in Hadoop to ensure fast data access and balanced storage across servers.
💼 Career
Understanding row key strategies helps data engineers optimize big data storage and query performance.
Progress0 / 4 steps
1
Create the user activity data
Create a dictionary called user_activities with these exact entries: 'user1': ['2024-06-01T10:00:00', '2024-06-01T11:00:00'], 'user2': ['2024-06-01T10:30:00'], and 'user3': ['2024-06-01T09:45:00', '2024-06-01T12:00:00'].
Hadoop
Need a hint?

Use curly braces to create a dictionary. Keys are user IDs as strings, values are lists of timestamp strings.

2
Add a configuration variable for salted keys
Create a boolean variable called use_salted_key and set it to True.
Hadoop
Need a hint?

Just create a variable named use_salted_key and assign it the value True.

3
Generate row keys using the chosen strategy
Write code to create a list called row_keys. For each user and each timestamp in user_activities, generate a row key. If use_salted_key is True, prefix the user ID with a salt number (0 or 1) alternating for each activity. If use_salted_key is False, use the user ID directly. The row key format is salt_userID_timestamp or userID_timestamp. Use a for loop with variables user and timestamps to iterate over user_activities.items().
Hadoop
Need a hint?

Use nested loops: outer for user, timestamps, inner for i, timestamp with enumerate. Use i % 2 for salt. Use f-strings to build keys.

4
Print the generated row keys
Write a print statement to display the row_keys list.
Hadoop
Need a hint?

Use print(row_keys) to show the list of keys.