Hadoopdata~30 mins

Bucketing for sampling in Hadoop - Mini Project: Build & Apply

Choose your learning style9 modes available

Learn Why Deep Visual Try Challenge Project Recall Time

Bucketing for Sampling in Hadoop

📖 Scenario: You work at a company that stores large amounts of user data in Hadoop. You want to practice sampling data efficiently using bucketing. Bucketing helps split data into fixed parts, making sampling faster and easier.

🎯 Goal: You will create a Hive table with bucketing on user IDs, set the number of buckets, insert sample data, and then write a query to sample data from a specific bucket.

📋 What You'll Learn

Create a Hive table called users with columns user_id (int) and user_name (string)

Bucket the table by user_id into 4 buckets

Insert 8 sample users with specific user_id and user_name

Write a query to select users from bucket number 2

💡 Why This Matters

🌍 Real World

Bucketing is used in big data systems like Hadoop to split large datasets into manageable parts. This helps in faster sampling and querying.

💼 Career

Data engineers and analysts use bucketing to optimize data storage and speed up queries in Hadoop ecosystems.

Progress0 / 4 steps

Create a bucketed Hive table

Write a Hive CREATE TABLE statement to create a table called users with columns user_id INT and user_name STRING. Bucket the table by user_id into 4 buckets using CLUSTERED BY (user_id) INTO 4 BUCKETS. Use STORED AS ORC format.

Hadoop

-- Write the CREATE TABLE statement for 'users' with bucketing
-- Your code here

Need a hint?

Use CLUSTERED BY (user_id) INTO 4 BUCKETS to bucket by user_id.

Insert sample data into the bucketed table

Write an INSERT INTO TABLE users statement to add these 8 users exactly: (1, 'Alice'), (2, 'Bob'), (3, 'Charlie'), (4, 'David'), (5, 'Eva'), (6, 'Frank'), (7, 'Grace'), (8, 'Hannah'). Use VALUES syntax.

Hadoop

CREATE TABLE users (
  user_id INT,
  user_name STRING
)
CLUSTERED BY (user_id) INTO 4 BUCKETS
STORED AS ORC;

-- Insert 8 users into 'users' table
-- Your code here

Need a hint?

Use INSERT INTO TABLE users VALUES (...), (...), ...; to add all users.

Write a query to sample users from bucket 2

Write a Hive SELECT query to get all columns from users where the bucket number is 2. Use the function pmod(hash(user_id), 4) = 2 to filter the bucket.

Hadoop

CREATE TABLE users (
  user_id INT,
  user_name STRING
)
CLUSTERED BY (user_id) INTO 4 BUCKETS
STORED AS ORC;

INSERT INTO TABLE users VALUES
  (1, 'Alice'),
  (2, 'Bob'),
  (3, 'Charlie'),
  (4, 'David'),
  (5, 'Eva'),
  (6, 'Frank'),
  (7, 'Grace'),
  (8, 'Hannah');

-- Select users from bucket number 2
-- Your code here

Need a hint?

Use pmod(hash(user_id), 4) = 2 in the WHERE clause to filter bucket 2.

Display the sampled users from bucket 2

Write a Hive SELECT statement to display the user_id and user_name of users from bucket 2 using the same filter pmod(hash(user_id), 4) = 2. Use ORDER BY user_id to sort the output.

Hadoop

CREATE TABLE users (
  user_id INT,
  user_name STRING
)
CLUSTERED BY (user_id) INTO 4 BUCKETS
STORED AS ORC;

INSERT INTO TABLE users VALUES
  (1, 'Alice'),
  (2, 'Bob'),
  (3, 'Charlie'),
  (4, 'David'),
  (5, 'Eva'),
  (6, 'Frank'),
  (7, 'Grace'),
  (8, 'Hannah');

-- Display user_id and user_name from bucket 2 ordered by user_id
-- Your code here

Need a hint?

Use SELECT user_id, user_name FROM users WHERE pmod(hash(user_id), 4) = 2 ORDER BY user_id; to show the sampled users.