What is Bucketing for sampling in Hadoop?

Introduction

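Bucketing is a Hive table feature on Hadoop that splits a large table into a fixed number of smaller parts (buckets) by hashing a chosen column. This makes it cheaper to pick samples, and queries run faster because they can work on part of the data instead of all of it.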

Bucketing is useful in these situations:

When you want to divide a large dataset into roughly equal parts for easier analysis.

When you need to sample data evenly from different groups.

When you want to speed up queries by working on smaller chunks.

When you want to join big tables efficiently by matching buckets.

When you want to balance data processing across multiple machines.

Syntax

Hadoop

CREATE TABLE table_name (
  column1 TYPE,
  column2 TYPE,
  ...
)
CLUSTERED BY (column_name) INTO num_buckets BUCKETS
STORED AS file_format;

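Use CLUSTERED BY to specify the column for bucketing. Hive hashes this column's value to decide which bucket a row belongs to.

The number after INTO sets how many buckets to create. Each bucket is typically stored as a separate file.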
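
To see roughly how rows map to buckets, Hive's built-in hash() function can be used to compute an approximate bucket number. This is only a sketch: it assumes a table like the 'users' table from the Examples section (bucketed by 'id' into 4 parts) and Hive's older hash-and-modulo bucketing. Newer Hive versions may use a different hash function, so the numbers can differ from the files Hive actually writes.

```sql
-- Approximate the bucket each row lands in for a table declared with
-- CLUSTERED BY (id) INTO 4 BUCKETS.
-- Assumption: legacy bucketing, i.e. (hash & Integer.MAX_VALUE) % num_buckets.
SELECT
  id,
  (hash(id) & 2147483647) % 4 AS approx_bucket  -- 0-based bucket index
FROM users;
```

The index here starts at 0, while TABLESAMPLE numbers buckets from 1, so approx_bucket = 0 corresponds to BUCKET 1.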

Examples

This creates a table 'users' bucketed by 'id' into 4 parts.

Hadoop

CREATE TABLE users (
  id INT,
  name STRING,
  age INT
)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC;

This creates a 'sales' table bucketed by 'product' into 10 buckets.

Hadoop

CREATE TABLE sales (
  sale_id INT,
  product STRING,
  amount FLOAT
)
CLUSTERED BY (product) INTO 10 BUCKETS
STORED AS PARQUET;
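
CREATE TABLE only declares the bucket layout. Rows are hashed into buckets when they are written through an INSERT. Below is a minimal sketch of loading the 'sales' table, assuming a hypothetical unbucketed staging table named sales_staging with the same columns (the name is illustrative and not part of the example above).

```sql
-- On Hive versions before 2.0, enable this first so INSERT honors the buckets.
-- Later versions always enforce bucketing, so it is left commented out here.
-- SET hive.enforce.bucketing = true;

-- sales_staging is a hypothetical, unbucketed source table.
INSERT OVERWRITE TABLE sales
SELECT sale_id, product, amount
FROM sales_staging;
```

Writing through INSERT ... SELECT lets Hive compute the hash of 'product' for each row and route it to the right bucket. Copying files into the table directory by hand would generally skip that step.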

Sample Program

This example creates an 'employees' table bucketed by 'department' into 3 buckets. Then it inserts some data. Finally, it selects data from bucket 1 as a sample.

Hadoop

CREATE TABLE employees (
  emp_id INT,
  name STRING,
  department STRING
)
CLUSTERED BY (department) INTO 3 BUCKETS
STORED AS TEXTFILE;

-- Insert sample data
INSERT INTO TABLE employees VALUES
(1, 'Alice', 'HR'),
(2, 'Bob', 'IT'),
(3, 'Charlie', 'HR'),
(4, 'David', 'Finance'),
(5, 'Eve', 'IT');

-- Query to select one bucket for sampling
SELECT * FROM employees TABLESAMPLE(BUCKET 1 OUT OF 3);
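
TABLESAMPLE also accepts an ON clause that names the column to sample on. When that column matches the CLUSTERED BY column, Hive can read only the matching bucket files instead of the whole table. The denominator does not have to equal the table's bucket count. The sketch below uses the 'users' table from the Examples section (4 buckets on 'id') and follows the behavior described in the Hive sampling documentation.

```sql
-- Exactly one of the four buckets (about a quarter of the rows).
SELECT * FROM users TABLESAMPLE(BUCKET 1 OUT OF 4 ON id);

-- Denominator smaller than the bucket count: each sample covers
-- 4 / 2 = 2 stored buckets, so this reads buckets 1 and 3 (about half the rows).
SELECT * FROM users TABLESAMPLE(BUCKET 1 OUT OF 2 ON id);

-- Denominator larger than the bucket count: about half of bucket 1.
SELECT * FROM users TABLESAMPLE(BUCKET 1 OUT OF 8 ON id);

-- Sampling ON rand() (or on a non-bucketing column) still works,
-- but Hive has to scan the whole table to build the sample.
SELECT * FROM users TABLESAMPLE(BUCKET 1 OUT OF 4 ON rand());
```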

Important Notes

Bucketing works best when the bucket column has many distinct values, so that rows spread evenly across the buckets.

Sampling by bucket returns a subset of the data without scanning everything. The subset is a fair sample only if the bucket column is unrelated to what you are measuring.

When joining bucketed tables, bucket both tables on the join column and use the same number of buckets, or numbers that are multiples of each other, for better performance.
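
A bucketed join needs both tables bucketed on the join column, with bucket counts that are equal or multiples of each other. The following is a sketch under those assumptions: 'orders' is a hypothetical table introduced only for illustration, joined to the 'users' table from the Examples section. Whether the optimizer actually chooses a bucket map join depends on the Hive version and execution engine.

```sql
-- Hypothetical second table, bucketed on the join key.
-- 8 is a multiple of the 4 buckets used by 'users'.
CREATE TABLE orders (
  order_id INT,
  user_id INT,
  total DOUBLE
)
CLUSTERED BY (user_id) INTO 8 BUCKETS
STORED AS ORC;

-- Ask the optimizer to consider a bucket map join
-- (a setting for the MapReduce engine; other engines use different settings).
SET hive.optimize.bucketmapjoin = true;

SELECT u.name, o.total
FROM users u
JOIN orders o ON u.id = o.user_id;
```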

Summary

Bucketing splits a table into a fixed number of parts based on the hash of a column.

It helps with sampling and faster queries on big data.

Use CLUSTERED BY and specify the number of buckets with INTO ... BUCKETS when creating tables.