
Partitioning for query performance in Hadoop

Introduction

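Partitioning splits a large table into smaller parts based on the values of one or more columns. Queries that filter on those columns read only the matching parts, which makes searching and analyzing data faster. The examples here use Hive's SQL dialect (HiveQL).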

Consider partitioning in these cases:
When you have a large dataset and want to speed up queries.
When you often filter data by a specific column, like date or region.
When you want to reduce the amount of data scanned during analysis.
When you want to organize data for easier management and updates.
Syntax
Hadoop
CREATE TABLE table_name (
  column1 TYPE,
  column2 TYPE,
  ...
)
PARTITIONED BY (partition_column TYPE);
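Partition column values are not stored inside the data files. Each distinct value gets its own folder under the table's location (for example, year=2023), and the value is read from the folder name.
Queries that filter on partition columns read only the folders for the matching partitions, so less data is scanned.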
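As a rough illustration of the two notes above, the sketch below creates a small partitioned table and lists its partitions. The table name `page_views`, its columns, and the partition values are hypothetical, and the directory paths in the comments only show the usual naming pattern; the actual location depends on your warehouse configuration.

```sql
-- Hypothetical table, used only for illustration.
CREATE TABLE page_views (
  user_id INT,
  url STRING
)
PARTITIONED BY (view_date STRING);

-- Register two empty partitions. Each one typically gets its own
-- directory under the table's location, named <column>=<value>:
--   .../page_views/view_date=2024-01-01/
--   .../page_views/view_date=2024-01-02/
ALTER TABLE page_views ADD PARTITION (view_date='2024-01-01');
ALTER TABLE page_views ADD PARTITION (view_date='2024-01-02');

-- List the partitions Hive has registered for the table.
SHOW PARTITIONS page_views;

-- view_date comes from the directory name, not from the data files,
-- but it can be selected and filtered like an ordinary column.
SELECT user_id, url, view_date
FROM page_views
WHERE view_date = '2024-01-02';
```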
Examples
This creates a sales table partitioned by year. Data for each year is stored separately.
Hadoop
CREATE TABLE sales (
  id INT,
  amount FLOAT
)
PARTITIONED BY (year INT);
This table stores logs partitioned by date, so queries on specific dates are faster.
Hadoop
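-- user and date are reserved words in recent Hive versions,
-- so they are quoted with backticks here.
CREATE TABLE logs (
  event STRING,
  `user` STRING
)
PARTITIONED BY (`date` STRING);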
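To see why the partition column matters at query time, here is a minimal sketch that assumes the sales table defined above already exists. The inserted rows are made-up sample values, not real data.

```sql
-- Made-up sample rows, one INSERT per target partition.
INSERT INTO TABLE sales PARTITION (year=2023) VALUES (1, 19.99), (2, 5.00);
INSERT INTO TABLE sales PARTITION (year=2024) VALUES (3, 42.50);

-- Filter on the partition column:
-- only the year=2024 partition needs to be read.
SELECT id, amount
FROM sales
WHERE year = 2024;

-- Filter on a regular column only:
-- every partition has to be scanned to find matching rows.
SELECT id, year
FROM sales
WHERE amount > 10;
```

The second query still returns correct results; it just cannot skip any partition, so it gains nothing from the partitioning.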
Sample Program

This example creates a user_activity table partitioned by country. It inserts data into two partitions and queries only the US partition, which is faster than scanning all data.

Hadoop
CREATE TABLE user_activity (
  user_id INT,
  activity STRING
)
PARTITIONED BY (country STRING);

-- Add data to partitions
INSERT INTO TABLE user_activity PARTITION (country='US') VALUES (1, 'login');
INSERT INTO TABLE user_activity PARTITION (country='CA') VALUES (2, 'logout');

-- Query data only for US users
SELECT * FROM user_activity WHERE country = 'US';
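One way to check that the query really touches only the US partition is to ask Hive what it plans to read. The sketch below assumes the user_activity table and the two inserts above; the exact output format can vary between Hive versions.

```sql
-- List the partitions created by the two INSERT statements.
-- This should show one entry for country=CA and one for country=US.
SHOW PARTITIONS user_activity;

-- Report, as JSON, which tables and partitions the query depends on.
-- If partition pruning applies, the input_partitions entry should
-- mention only the country=US partition.
EXPLAIN DEPENDENCY
SELECT * FROM user_activity WHERE country = 'US';
```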
Important Notes

Partition columns should be chosen based on query patterns for best performance.

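Too many partitions can slow down queries: each partition is a separate folder with its own files and metadata, so thousands of tiny partitions add overhead to query planning and file listing. Aim for partitions that are selective for your common filters but still hold a reasonable amount of data each.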
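The sketch below shows one way to apply both notes. The table `events_daily` and its columns are hypothetical; the point is only the choice of a low-cardinality partition column (one value per day) over a high-cardinality one (one value per user).

```sql
-- Hypothetical table: one partition per day gives a few hundred
-- partitions per year, which is usually manageable.
CREATE TABLE events_daily (
  user_id INT,
  event STRING,
  event_time TIMESTAMP
)
PARTITIONED BY (event_date STRING);

-- Partitioning by user_id instead would create one folder per user,
-- each holding very little data. Keep such columns as regular columns
-- and filter on them inside a single partition:
SELECT event, event_time
FROM events_daily
WHERE event_date = '2024-01-15'
  AND user_id = 42;
```

This works best when most queries filter by day; if they mostly filter by some other column, that column may be the better partition key.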

Summary

Partitioning splits data into smaller, manageable parts.

It speeds up queries by reading only needed partitions.

Choose partition columns wisely based on how you query data.