What Is Bucketing in Hive on Hadoop: Explanation and Example
Bucketing is a way to divide a table's data into a fixed number of parts, based on a hash of a column, to improve query performance. It organizes data into manageable files, which can make operations such as joins and sampling faster and more efficient.
How It Works
Imagine you have a big box of mixed colored balls and you want to sort them into smaller boxes by color. Bucketing in Hive works similarly: it divides a large dataset into a fixed number of smaller parts, called buckets, based on the value of a specific column. Hive applies a hash function to the chosen column and uses the result, modulo the number of buckets, to decide which bucket each row belongs to, so rows with the same column value always land in the same bucket.
This division can let Hive find and process only the relevant buckets instead of scanning the entire dataset. It is like looking in just the red-ball box instead of searching through all the balls. Bucketing is especially useful when you join tables on the bucketed column or when you want to sample data efficiently.
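The hash-then-modulo idea can be sketched with Hive's built-in `hash()` and `pmod()` functions. This is only an illustration: the bucket count of 4 and the sample values are assumptions, and the hash function Hive uses internally to place rows can differ between Hive versions, so the numbers returned here may not match the bucket files Hive actually writes.

```sql
-- Illustrative only: map two sample ids to one of 4 hypothetical buckets.
-- pmod() returns a non-negative remainder, so the result is always in 0..3.
SELECT pmod(hash(1), 4) AS bucket_for_id_1,
       pmod(hash(6), 4) AS bucket_for_id_6;
```

The point of the sketch is that the bucket number depends only on the column value and the bucket count, which is why equal values always end up together.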
Example
CREATE TABLE employee (
  id INT,
  name STRING,
  salary FLOAT
)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC;

INSERT INTO TABLE employee VALUES
  (1, 'Alice', 5000),
  (2, 'Bob', 6000),
  (3, 'Charlie', 7000),
  (4, 'David', 8000);

-- To see the bucket files, you can check the table directory in HDFS.
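As a follow-up to the example, the statements below sketch two ways to look at the bucketing of the `employee` table from within Hive. They assume the table was created and loaded as shown above; on older Hive releases (before 2.0), `SET hive.enforce.bucketing = true;` may also be needed before the insert so that rows are actually written into buckets.

```sql
-- Show table metadata; the output includes the number of buckets
-- and the bucketing column(s).
DESCRIBE FORMATTED employee;

-- Sample one of the four buckets, using the bucketing column.
SELECT id, name, salary
FROM employee TABLESAMPLE(BUCKET 1 OUT OF 4 ON id) e;
```

Because the sample is taken on the same column the table is clustered by, Hive can in principle read a single bucket rather than the whole table.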
When to Use
Use bucketing in Hive when you want to improve query speed on large datasets, especially for join operations on the bucketed column. Splitting the data into smaller parts can reduce how much data a query has to scan.
For example, if you have a large sales dataset and often join it with a customer table on customer ID, bucketing both tables by customer ID can make joins faster, provided the two tables use the same number of buckets or bucket counts that are multiples of each other. Bucketing also helps when you want to take random samples of data efficiently.
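The sales and customer scenario could look roughly like the sketch below. The table names, column names, and the bucket count of 8 are hypothetical and chosen only for illustration. Whether Hive actually performs a bucket-aware join depends on the Hive version, execution engine, and configuration.

```sql
-- Hypothetical tables; names, columns, and bucket count are illustrative.
CREATE TABLE customer (
  customer_id INT,
  name STRING
)
CLUSTERED BY (customer_id) INTO 8 BUCKETS
STORED AS ORC;

CREATE TABLE sales (
  sale_id INT,
  customer_id INT,
  amount DOUBLE
)
CLUSTERED BY (customer_id) INTO 8 BUCKETS
STORED AS ORC;

-- Ask Hive to consider a bucket map join (this setting is off by default).
SET hive.optimize.bucketmapjoin = true;

-- Join on the column both tables are bucketed by.
SELECT c.customer_id, c.name, SUM(s.amount) AS total_sales
FROM sales s
JOIN customer c ON s.customer_id = c.customer_id
GROUP BY c.customer_id, c.name;
```

Both tables are clustered on `customer_id` with the same bucket count, so matching customer IDs fall into corresponding buckets on each side of the join.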
Key Points
- Bucketing divides data into a fixed number of parts based on a hash of a column.
- It can improve query performance by reducing the amount of data scanned.
- It is useful for joins and sampling on the bucketed column.
- You must specify the bucketing column and the number of buckets when creating the table.
- Bucketed tables store the data for each bucket in separate files.