
Bucketing for sampling in Hadoop

Introduction

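Bucketing splits a large table into a fixed number of smaller parts (buckets) based on the value of a column. Sampling and some queries can then read one bucket instead of the whole table, which makes them faster.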

Use bucketing in these cases:
When you want to divide a large dataset into roughly equal parts for easier analysis.
When you need to sample data evenly from different groups.
When you want to speed up queries by working on smaller chunks.
When you want to join big tables efficiently by matching buckets.
When you want to balance data processing across multiple machines.
Syntax
Hadoop
CREATE TABLE table_name (
  column1 TYPE,
  column2 TYPE,
  ...
)
CLUSTERED BY (column_name) INTO num_buckets BUCKETS
STORED AS file_format;
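This is HiveQL syntax; Hive is the SQL layer that runs on Hadoop.
Use CLUSTERED BY to specify the column for bucketing. Each row goes to the bucket given by the hash of that column modulo the number of buckets.
The number after INTO sets how many buckets to create.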
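To make the bucket assignment concrete, the sketch below uses Hive's built-in hash() and pmod() functions to estimate which bucket a value would fall into. It is an approximation under assumptions: the literal values and the bucket count of 4 are illustrative, and the hash Hive actually uses for bucketing can differ between versions, so treat the result as a guide and not as the exact file a row lands in.

```sql
-- Estimate the bucket for an integer key and a string key, assuming 4 buckets.
-- hash() and pmod() are Hive built-ins; pmod keeps the result non-negative.
SELECT pmod(hash(42), 4)   AS approx_bucket_for_42,
       pmod(hash('HR'), 4) AS approx_bucket_for_hr;
```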
Examples
This creates a table 'users' bucketed by 'id' into 4 parts.
Hadoop
CREATE TABLE users (
  id INT,
  name STRING,
  age INT
)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC;
This creates a 'sales' table bucketed by 'product' into 10 buckets.
Hadoop
CREATE TABLE sales (
  sale_id INT,
  product STRING,
  amount FLOAT
)
CLUSTERED BY (product) INTO 10 BUCKETS
STORED AS PARQUET;
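After creating a bucketed table, it can be worth checking that Hive recorded the bucketing as intended. A minimal check, using the 'users' and 'sales' tables from the examples above:

```sql
-- Show table metadata; the output includes "Num Buckets" and "Bucket Columns".
DESCRIBE FORMATTED users;
DESCRIBE FORMATTED sales;
```

For 'users' the metadata should report 4 buckets on 'id', and for 'sales' 10 buckets on 'product'.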
Sample Program

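This example creates an 'employees' table bucketed by 'department' into 3 buckets, inserts five rows, and then selects bucket 1 as a sample. Because all rows with the same department hash to the same bucket, the sample contains whole departments, not a random mix of rows.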

Hadoop
CREATE TABLE employees (
  emp_id INT,
  name STRING,
  department STRING
)
CLUSTERED BY (department) INTO 3 BUCKETS
STORED AS TEXTFILE;

-- Insert sample data
-- (on Hive versions before 2.0, first run: SET hive.enforce.bucketing = true;)
INSERT INTO TABLE employees VALUES
(1, 'Alice', 'HR'),
(2, 'Bob', 'IT'),
(3, 'Charlie', 'HR'),
(4, 'David', 'Finance'),
(5, 'Eve', 'IT');

-- Query to select one bucket for sampling
SELECT * FROM employees TABLESAMPLE(BUCKET 1 OUT OF 3);
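The TABLESAMPLE clause has a couple of variants worth knowing. The sketch below reuses the 'employees' table from the sample program; the bucket numbers chosen are illustrative.

```sql
-- Name the sampling column explicitly. When it matches the CLUSTERED BY column,
-- Hive can read just the matching bucket instead of scanning the table.
SELECT * FROM employees TABLESAMPLE(BUCKET 2 OUT OF 3 ON department);

-- Sample on rand() to get a random subset of rows instead of whole departments.
-- This ignores the table's bucketing, so Hive scans the full table.
SELECT * FROM employees TABLESAMPLE(BUCKET 1 OUT OF 3 ON rand());
```

The first form is cheap but returns entire departments; the second is closer to a random row sample, at the cost of a full scan.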
Important Notes

Bucketing works best when the bucket column has many distinct values, so that the hash spreads rows evenly across the buckets.
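
One way to check how evenly a column would spread across buckets is to count rows per estimated bucket. This sketch uses the 'employees' table from the sample program and Hive's built-in hash() and pmod() functions. As an assumption, it only approximates Hive's bucket assignment, which may use a different hash depending on the Hive version.

```sql
-- Count rows per estimated bucket to spot skew before choosing a bucket column.
SELECT pmod(hash(department), 3) AS approx_bucket,
       COUNT(*)                  AS row_count
FROM employees
GROUP BY pmod(hash(department), 3)
ORDER BY approx_bucket;
```

If one bucket holds most of the rows, or some buckets are empty, a column with more distinct values is probably a better choice.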

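Sampling by bucket reads only the selected bucket, so you get a subset without scanning the whole table. The subset is representative only if the bucket column spreads rows evenly.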

For efficient joins, bucket both tables on the join column and use the same number of buckets, or make one bucket count a multiple of the other.
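
A bucketed join could look like the sketch below. The 'orders' and 'customers' tables are hypothetical and not part of this tutorial. hive.optimize.bucketmapjoin is a Hive setting that enables bucket map joins; whether you need to set it depends on your Hive version and execution engine.

```sql
-- Hypothetical tables, both bucketed on the join column with the same bucket count.
CREATE TABLE customers (
  customer_id INT,
  name STRING
)
CLUSTERED BY (customer_id) INTO 8 BUCKETS
STORED AS ORC;

CREATE TABLE orders (
  order_id INT,
  customer_id INT,
  amount DOUBLE
)
CLUSTERED BY (customer_id) INTO 8 BUCKETS
STORED AS ORC;

-- Allow a bucket map join, where each bucket of one table is joined
-- only with the matching bucket of the other.
SET hive.optimize.bucketmapjoin = true;

SELECT o.order_id, c.name, o.amount
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;
```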

Summary

Bucketing splits data into a fixed number of parts based on the hash of a column.

It helps with sampling and faster queries on big data.

Use CLUSTERED BY and specify the number of buckets when creating tables.