beginner

What is bucketing in Hadoop?

Bucketing is a technique that divides a table's data into a fixed number of files, called buckets, by applying a hash function to a chosen column. Because a row's bucket is determined by its column value, the data is organized for efficient sampling and querying.
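
To make the hashing step concrete, here is a minimal Python sketch of how rows might be assigned to buckets. It assumes integer keys that hash to their own value and a non-negative modulo over the bucket count; `bucket_for` and the sample keys are hypothetical names used only for illustration, not part of any Hadoop or Hive API.

```python
def bucket_for(key: int, num_buckets: int) -> int:
    # Assumption: an integer key hashes to itself. The mask keeps the
    # value non-negative before taking the modulo.
    return (key & 0x7FFFFFFF) % num_buckets


NUM_BUCKETS = 4
keys = [101, 102, 103, 104, 105, 106, 107, 108]

# Each bucket stands in for one file on disk.
buckets = {b: [] for b in range(NUM_BUCKETS)}
for key in keys:
    buckets[bucket_for(key, NUM_BUCKETS)].append(key)

print(buckets)
# {0: [104, 108], 1: [101, 105], 2: [102, 106], 3: [103, 107]}
```

The number of buckets stays fixed at 4 no matter how many distinct keys arrive, which is the property the answer above describes.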

beginner

How does bucketing help in sampling data?

Bucketing allows sampling by reading only specific buckets instead of scanning the whole dataset. This reduces the amount of data scanned and speeds up queries.
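
The saving can be illustrated with a small Python simulation. This is a sketch under stated assumptions: 1,000 hypothetical `user_id` values, 4 buckets, and integer keys that hash to themselves. Each list in `bucket_files` stands in for one bucket file.

```python
NUM_BUCKETS = 4


def bucket_for(key: int, num_buckets: int) -> int:
    # Assumption: integer keys hash to their own value.
    return (key & 0x7FFFFFFF) % num_buckets


# Simulate writing 1,000 rows into 4 bucket files.
bucket_files = [[] for _ in range(NUM_BUCKETS)]
for user_id in range(1, 1001):
    bucket_files[bucket_for(user_id, NUM_BUCKETS)].append(user_id)

# Sampling reads a single bucket file and ignores the other three.
sample = bucket_files[0]
rows_scanned = len(sample)
total_rows = sum(len(f) for f in bucket_files)

print(rows_scanned, total_rows)  # 250 1000
```

In Hive, a comparable sample is typically requested with a `TABLESAMPLE(BUCKET 1 OUT OF 4 ON user_id)` clause on a table bucketed by `user_id`; the column name here is only an example.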

intermediate

What is the difference between bucketing and partitioning?

Partitioning divides data into separate folders, one per distinct value of the partition column, while bucketing divides data into a fixed number of files using a hash of the bucketing column. Bucketing is better suited to sampling and joins.
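
The folder-versus-file distinction can be sketched in Python by building the storage path for a row. Everything here is an illustrative assumption: the table name `users`, the partition column `country`, the bucketing column `user_id`, and the bucket file naming pattern. The sketch also assumes a table that uses both techniques at once.

```python
def storage_path(table: str, country: str, user_id: int, num_buckets: int) -> str:
    # Partitioning: the column VALUE picks the folder, so the number of
    # folders grows with the number of distinct values.
    partition_dir = f"country={country}"

    # Bucketing: a HASH of the column picks one of a fixed number of files.
    # Assumption: integer keys hash to their own value.
    bucket = (user_id & 0x7FFFFFFF) % num_buckets
    bucket_file = f"{bucket:06d}_0"  # illustrative file name

    return f"{table}/{partition_dir}/{bucket_file}"


print(storage_path("users", "US", 42, 4))  # users/country=US/000002_0
print(storage_path("users", "IN", 42, 4))  # users/country=IN/000002_0
```

The same `user_id` lands in the same bucket number in every partition, while the folder changes with the partition value.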

intermediate

How do you create a bucketed table in Hive?

Use the CLUSTERED BY clause with a column and specify INTO n BUCKETS. For example: CREATE TABLE t (id INT) CLUSTERED BY (id) INTO 4 BUCKETS;

intermediate

Why is bucketing useful for joins in Hadoop?

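Bucketing ensures that rows with the same key go to the same bucket. When both tables are bucketed on the join key with compatible bucket counts, matching buckets can be joined directly on the map side, without shuffling all the data.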
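
A bucket-by-bucket join can be sketched in Python as follows. This is a simplified simulation, not Hive's implementation: it assumes both tables are bucketed on the same integer key with the same bucket count, and the `orders` and `users` rows are hypothetical.

```python
from collections import defaultdict

NUM_BUCKETS = 4


def bucket_for(key: int, num_buckets: int) -> int:
    # Assumption: integer keys hash to their own value.
    return (key & 0x7FFFFFFF) % num_buckets


def bucketize(rows, num_buckets):
    # Place each (key, value) row into the bucket chosen by its key.
    buckets = [[] for _ in range(num_buckets)]
    for key, value in rows:
        buckets[bucket_for(key, num_buckets)].append((key, value))
    return buckets


orders = [(1, "book"), (2, "pen"), (5, "lamp"), (6, "desk")]
users = [(1, "ann"), (2, "bob"), (5, "eve"), (7, "dan")]

order_buckets = bucketize(orders, NUM_BUCKETS)
user_buckets = bucketize(users, NUM_BUCKETS)

joined = []
for i in range(NUM_BUCKETS):
    # Only bucket i of each table is compared: a key can never match
    # a row stored in a different bucket number.
    lookup = defaultdict(list)
    for key, name in user_buckets[i]:
        lookup[key].append(name)
    for key, item in order_buckets[i]:
        for name in lookup.get(key, []):
            joined.append((key, name, item))

print(sorted(joined))
# [(1, 'ann', 'book'), (2, 'bob', 'pen'), (5, 'eve', 'lamp')]
```

Each loop iteration touches only one pair of buckets, so the work can be split across independent tasks without moving rows between them.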

What does bucketing in Hadoop do?

A. Divides data into a fixed number of files based on a hash function

B. Splits data into folders based on column values

C. Compresses data to save space

D. Encrypts data for security

Which clause is used to create a bucketed table in Hive?

A. CLUSTERED BY

B. SORTED BY

C. PARTITIONED BY

D. DISTRIBUTED BY

Bucketing helps sampling by:

A. Compressing data files

B. Selecting specific buckets instead of a full data scan

C. Encrypting data buckets

D. Sorting data within partitions

What is a key difference between partitioning and bucketing?

A. Partitioning divides data into files, bucketing into folders

B. Partitioning sorts data, bucketing samples data

C. Partitioning divides data into folders, bucketing into files

D. Partitioning compresses data, bucketing encrypts data

Why is bucketing useful for joins?

A. It compresses join data

B. It sorts data for faster joins

C. It encrypts join keys

D. It ensures the same keys go to the same bucket, enabling map-side joins

Explain what bucketing is and how it helps with sampling in Hadoop.

Describe the difference between partitioning and bucketing and why bucketing is useful for joins.