
What Is a Bucket in Hive (Hadoop): Explanation and Example

In Hive, a bucket is one of a fixed number of parts into which a table's rows are divided, based on a hash of a chosen column. Bucketing organizes the data of a table or partition into manageable files, which can make operations such as joins on the bucketed column and sampling faster and more efficient.

How It Works

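Imagine you have a big box of mixed colored balls and a fixed number of smaller boxes, with a rule that maps each color to one box. Several colors may share a box, but a given color always lands in the same box. Bucketing in Hive works similarly: it divides a large dataset into a fixed number of parts, called buckets, based on the value of a specific column. Hive applies a hash function to the chosen column and takes the result modulo the number of buckets to decide which bucket each row belongs to.

Because a given value always maps to the same bucket, Hive can in some cases process only the relevant buckets instead of scanning the entire dataset, for example when a query filters on an exact value of the bucketed column and the Hive version and execution engine support bucket pruning. It is like looking only in the box that holds the red balls instead of searching through all the balls. Bucketing is especially useful when you join tables on the bucketed column or when you want to sample data efficiently.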
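
The hashing step can be approximated directly in HiveQL. The following is a minimal sketch, assuming a table named employee with an INT column id and 4 buckets, like the one created in the Example section. hash() and pmod() are Hive built-in functions, but the exact hash Hive uses when writing bucket files depends on the Hive version and table settings, so treat the computed number as an illustration of the idea rather than a guarantee of the physical file.

```sql
-- Illustration only: estimate a bucket number for each row.
-- Assumes a table employee(id INT, ...) declared with 4 buckets.
SELECT
  id,
  pmod(hash(id), 4) AS approx_bucket  -- non-negative remainder of the hash
FROM employee;
```

Rows with the same id always get the same approx_bucket value, which is the property that bucketed joins and sampling rely on.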

Example

This example shows how to create a bucketed table in Hive and insert data into it.
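```sql
-- Create a table whose rows are hashed on id into 4 buckets.
CREATE TABLE employee (
  id INT,
  name STRING,
  salary FLOAT
)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC;

-- On Hive versions before 2.0, bucketing on insert had to be enabled first:
-- SET hive.enforce.bucketing = true;

INSERT INTO TABLE employee VALUES
(1, 'Alice', 5000),
(2, 'Bob', 6000),
(3, 'Charlie', 7000),
(4, 'David', 8000);

-- To see the bucket files, you can check the table directory in HDFS.
```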
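Output
The table employee is created with 4 buckets hashed on id, and each inserted row is written to the bucket that its id hashes to. (This summarizes the effect of the statements; it is not literal console output.)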
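
To confirm the bucketing from inside Hive instead of browsing HDFS, something like the following can be used. This is a sketch that assumes the employee table above exists and has been populated. DESCRIBE FORMATTED and the INPUT__FILE__NAME virtual column are standard Hive features, but the file names and directory layout you see will vary with the Hive version and table type.

```sql
-- Show table metadata, including the bucket column,
-- the number of buckets, and the storage location.
DESCRIBE FORMATTED employee;

-- List the data file each row was read from and count rows per file.
SELECT INPUT__FILE__NAME AS data_file, COUNT(*) AS row_count
FROM employee
GROUP BY INPUT__FILE__NAME;
```

With only four rows, some buckets may hold a single row or none; on a real dataset the counts give a quick view of how evenly the hash spreads the data.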

When to Use

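Use bucketing in Hive when you want to improve query speed on large datasets, especially for join operations on the bucketed column. Because rows with the same column value always land in the same bucket, Hive can match buckets against each other or read only a subset of them instead of scanning all the data.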

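For example, if you have a large sales dataset and often join it with a customer table on customer ID, bucketing both tables by customer ID can make joins faster, provided the bucket counts are equal or one is a multiple of the other. Bucketing also helps when you want to sample data efficiently, because a sample can be read from one or a few buckets. Note that such a sample is selected by hash, so it is repeatable rather than truly random.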
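
The join and sampling cases can be sketched as follows. The table names sales and customer, their columns, and the bucket counts 8 and 16 are hypothetical and chosen only for illustration. hive.optimize.bucketmapjoin and TABLESAMPLE are existing Hive features, but whether the optimizer actually picks a bucket map join depends on the Hive version, execution engine, and other settings.

```sql
-- Hypothetical tables; names, columns, and bucket counts are illustrative.
CREATE TABLE customer (
  customer_id INT,
  name STRING
)
CLUSTERED BY (customer_id) INTO 8 BUCKETS
STORED AS ORC;

CREATE TABLE sales (
  sale_id INT,
  customer_id INT,
  amount DOUBLE
)
CLUSTERED BY (customer_id) INTO 16 BUCKETS  -- a multiple of 8
STORED AS ORC;

-- Allow the optimizer to consider a bucket map join.
SET hive.optimize.bucketmapjoin = true;

-- Join on the column both tables are bucketed by.
SELECT c.name, SUM(s.amount) AS total_amount
FROM sales s
JOIN customer c ON s.customer_id = c.customer_id
GROUP BY c.name;

-- Sample a single bucket of sales (about 1/16 of the rows
-- if customer_id values are evenly distributed).
SELECT *
FROM sales TABLESAMPLE(BUCKET 1 OUT OF 16 ON customer_id) s;
```

Sampling on the same column and bucket count the table was clustered by lets Hive read only the matching bucket; sampling on a different column generally requires a full scan.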

Key Points

  • Bucketing divides data into fixed parts based on a hash of a column.
  • It can improve query performance by reducing the data scanned or shuffled.
  • Useful for joins and sampling on bucketed columns.
  • Requires specifying the number of buckets and the bucket column when creating the table.
  • Bucketed tables store data in separate files for each bucket.

Key Takeaways

Bucketing splits data into fixed parts using a hash of a column to improve query speed.
It is especially helpful for optimizing joins and sampling in Hive.
You must define the bucket column and number of buckets when creating the table.
Bucketed data is stored in separate files, one or more per bucket, inside the table or partition directory.
Use bucketing on large datasets where query performance matters.