0
0
Hadoopdata~30 mins

Bucketing for sampling in Hadoop - Mini Project: Build & Apply

Choose your learning style9 modes available
Bucketing for Sampling in Hadoop
📖 Scenario: You work at a company that stores large amounts of user data in Hadoop. You want to practice sampling data efficiently using bucketing. Bucketing helps split data into fixed parts, making sampling faster and easier.
🎯 Goal: You will create a Hive table with bucketing on user IDs, set the number of buckets, insert sample data, and then write a query to sample data from a specific bucket.
📋 What You'll Learn
Create a Hive table called users with columns user_id (int) and user_name (string)
Bucket the table by user_id into 4 buckets
Insert 8 sample users with specific user_id and user_name
Write a query to select users from bucket number 2
💡 Why This Matters
🌍 Real World
Bucketing is used in big data systems like Hadoop to split large datasets into manageable parts. This helps in faster sampling and querying.
💼 Career
Data engineers and analysts use bucketing to optimize data storage and speed up queries in Hadoop ecosystems.
Progress0 / 4 steps
1
Create a bucketed Hive table
Write a Hive CREATE TABLE statement to create a table called users with columns user_id INT and user_name STRING. Bucket the table by user_id into 4 buckets using CLUSTERED BY (user_id) INTO 4 BUCKETS. Use STORED AS ORC format.
Hadoop
Need a hint?

Use CLUSTERED BY (user_id) INTO 4 BUCKETS to bucket by user_id.

2
Insert sample data into the bucketed table
Write an INSERT INTO TABLE users statement to add these 8 users exactly: (1, 'Alice'), (2, 'Bob'), (3, 'Charlie'), (4, 'David'), (5, 'Eva'), (6, 'Frank'), (7, 'Grace'), (8, 'Hannah'). Use VALUES syntax.
Hadoop
Need a hint?

Use INSERT INTO TABLE users VALUES (...), (...), ...; to add all users.

3
Write a query to sample users from bucket 2
Write a Hive SELECT query to get all columns from users where the bucket number is 2. Use the function pmod(hash(user_id), 4) = 2 to filter the bucket.
Hadoop
Need a hint?

Use pmod(hash(user_id), 4) = 2 in the WHERE clause to filter bucket 2.

4
Display the sampled users from bucket 2
Write a Hive SELECT statement to display the user_id and user_name of users from bucket 2 using the same filter pmod(hash(user_id), 4) = 2. Use ORDER BY user_id to sort the output.
Hadoop
Need a hint?

Use SELECT user_id, user_name FROM users WHERE pmod(hash(user_id), 4) = 2 ORDER BY user_id; to show the sampled users.