
Partition vs Bucket in Hive in Hadoop: Key Differences and Usage

In Hive on Hadoop, partitioning divides a table into directories based on column values, so queries that filter on those columns can skip (prune) irrelevant data. Bucketing further divides the data within each partition, or within an unpartitioned table, into a fixed number of files by hashing a column, which helps with sampling and join optimization. Partitioning is coarse-grained data organization; bucketing is fine-grained.

Quick Comparison

Here is a quick side-by-side comparison of Partition and Bucket in Hive:

| Feature | Partition | Bucket |
|---|---|---|
| Definition | Divides a table into parts based on column values | Divides data inside partitions into a fixed number of files |
| Data Organization | Coarse-grained, by column values | Fine-grained, by hashing column values |
| Number of Files | One directory per partition value | Fixed number of files per partition (or per table if unpartitioned) |
| Query Performance | Speeds up queries by pruning partitions | Improves sampling and join efficiency |
| Use Case | Filter large datasets by column | Optimize joins and sampling |
| Creation | `PARTITIONED BY` at table creation; partitions added later with `ALTER TABLE` | `CLUSTERED BY ... INTO n BUCKETS` at table creation |

Key Differences

Partition in Hive splits a table into separate directories based on the distinct values of one or more columns. This means when you query the table with a filter on the partition column, Hive reads only the relevant directories, reducing data scanned and improving query speed.

Bucketing divides the data inside each partition (or the whole table, if it is not partitioned) into a fixed number of files, called buckets, by hashing a column and taking the result modulo the bucket count. Rows with the same value in the bucketed column always land in the same bucket, which enables efficient sampling and lets Hive optimize joins on that column.
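
As a rough illustration of the hash-then-modulo idea, the query below computes a bucket number by hand using Hive's built-in `hash` and `pmod` functions. It is a sketch, not Hive's internal code path: the sample ids and the bucket count of 4 are made up here, and newer Hive versions may use a different hash function for bucketed tables, so the numbers need not match the files Hive actually writes.

```sql
-- Illustration only: approximate bucket assignment as hash(column) mod bucket count.
-- The ids and the bucket count (4) are hypothetical sample values.
SELECT id,
       pmod(hash(id), 4) AS approx_bucket
FROM (
  SELECT 1 AS id
  UNION ALL SELECT 2 AS id
  UNION ALL SELECT 5 AS id
  UNION ALL SELECT 6 AS id
) sample_ids;
```

Rows whose ids produce the same `approx_bucket` value would be stored together, which is what allows a join on that column to compare bucket against bucket.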

Partitions create a directory structure in HDFS, while buckets create files inside those directories. Partitioning is best for columns with low to moderate cardinality used in filtering, while bucketing is useful for columns used in joins or sampling where data distribution matters.


Partition Example

This example shows how to create a partitioned table in Hive and insert data into partitions.

```sql
CREATE TABLE sales (
  item STRING,
  price FLOAT
)
PARTITIONED BY (year INT, month INT);

-- Insert data into a specific partition
INSERT INTO TABLE sales PARTITION (year=2023, month=6) VALUES ('apple', 1.2);

-- Filter on the partition columns so Hive reads only the year=2023/month=6 directory
SELECT * FROM sales WHERE year=2023 AND month=6;
```
Output (SELECT * also returns the partition columns)
```
item  | price | year | month
------|-------|------|------
apple | 1.2   | 2023 | 6
```
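
To see the directory-per-partition layout from the Hive side, the partitions of the `sales` table can be listed and managed directly. This is a short sketch that assumes the table and the single row inserted in the example; the `month=7` partition is added purely for illustration.

```sql
-- List the partitions Hive knows about; each one maps to a directory in HDFS.
SHOW PARTITIONS sales;

-- Register an empty partition ahead of loading data (month=7 is a made-up example).
ALTER TABLE sales ADD IF NOT EXISTS PARTITION (year=2023, month=7);

-- Remove a partition that is no longer needed.
-- For a managed (non-external) table this also deletes the partition's data.
ALTER TABLE sales DROP IF EXISTS PARTITION (year=2023, month=7);
```

Right after the insert in the example, `SHOW PARTITIONS sales` lists a single entry, `year=2023/month=6`, which mirrors the nested directory path under the table's location.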

Bucket Equivalent

This example shows how to create a bucketed table in Hive and insert data into buckets.

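```sql
CREATE TABLE employee (
  id INT,
  name STRING,
  salary FLOAT
)
CLUSTERED BY (id) INTO 4 BUCKETS;

-- On Hive versions before 2.0, run this first so inserts honor the bucket count:
-- SET hive.enforce.bucketing = true;

-- Insert data; Hive assigns each row to a bucket by hashing id
INSERT INTO TABLE employee VALUES (1, 'John', 5000), (2, 'Jane', 6000);

-- Query data
SELECT * FROM employee;
```
Output (row order is not guaranteed)
```
id | name | salary
---|------|-------
1  | John | 5000.0
2  | Jane | 6000.0
```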
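
Because the table is clustered on `id`, a bucket-based sample can read a single bucket instead of scanning the whole table. This is a minimal sketch using the `employee` table from the example; which rows come back depends on how Hive hashed the ids, so no output is shown.

```sql
-- Read roughly one quarter of the table: bucket 1 of the 4 buckets defined on id.
SELECT *
FROM employee TABLESAMPLE(BUCKET 1 OUT OF 4 ON id);
```

Sampling on the same column and bucket count used in `CLUSTERED BY` is what lets Hive map the sample onto an existing bucket file; sampling on a different column generally falls back to scanning all rows.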

When to Use Which

Choose partitioning when your queries often filter on specific column values and you want to reduce the amount of data Hive scans. It is ideal for large datasets with predictable, low-to-moderate cardinality filter columns such as date or region.

Choose bucketing when you want to optimize joins or sampling on a column, especially when the data is large but queries do not always filter on that column. Bucketing on a well-distributed, high-cardinality column spreads data fairly evenly across files and speeds up operations that benefit from rows with the same key being stored together.
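
The following is a sketch of how a bucketed join might be set up. The `customers` and `orders` tables are hypothetical, and whether Hive actually chooses a bucket map join depends on the Hive version, execution engine, and configuration; `hive.optimize.bucketmapjoin` is a standard Hive property, but it may be unnecessary or behave differently in your setup.

```sql
-- Hypothetical tables, both bucketed on the join key with the same bucket count.
CREATE TABLE customers (
  customer_id INT,
  name STRING
)
CLUSTERED BY (customer_id) INTO 8 BUCKETS;

CREATE TABLE orders (
  order_id INT,
  customer_id INT,
  amount DOUBLE
)
CLUSTERED BY (customer_id) INTO 8 BUCKETS;

-- Allow Hive to consider a bucket map join for tables bucketed on the join key.
SET hive.optimize.bucketmapjoin = true;

-- Join on the bucketed column so matching keys sit in corresponding buckets.
SELECT o.order_id, c.name, o.amount
FROM orders o
JOIN customers c
  ON o.customer_id = c.customer_id;
```

Using the same bucket count on both sides (or counts where one is a multiple of the other) is what allows bucket files to be paired up during the join.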

In many cases, combining partitioning and bucketing gives the best performance by pruning data and optimizing joins.
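
A combined layout could look like the sketch below. The `page_views` table, its columns, and the bucket count of 8 are assumptions chosen for illustration, not a recommendation for any particular workload.

```sql
-- Hypothetical table: partitioned by year and month, bucketed by user within each partition.
CREATE TABLE page_views (
  user_id INT,
  url STRING
)
PARTITIONED BY (year INT, month INT)
CLUSTERED BY (user_id) INTO 8 BUCKETS;

-- Partition pruning narrows the scan to the year=2023/month=6 directory;
-- inside it, rows are spread across 8 bucket files by user_id.
SELECT url, COUNT(*) AS views
FROM page_views
WHERE year = 2023 AND month = 6
GROUP BY url;
```

Note that `PARTITIONED BY` must come before `CLUSTERED BY` in the table definition.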

Key Takeaways

Partitioning divides data into directories by column values to speed up filtered queries.
Bucketing divides data inside partitions into a fixed number of files for better join and sampling performance.
Partitioning reduces the data scanned by pruning irrelevant partitions.
Bucketing stores rows with the same key in the same bucket, which makes joins on that key more efficient.
Use partitioning for filtering columns and bucketing for join or sampling optimization.