
Bucketing for sampling in Hadoop - Cheat Sheet & Quick Revision

Recall & Review
beginner
What is bucketing in Hadoop?
Bucketing divides a table's data into a fixed number of files (buckets) by applying a hash function to a chosen column: each row goes to bucket hash(column) mod number of buckets. This organizes the data for efficient sampling and querying.
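To make the hash-then-modulo idea concrete, here is a minimal Python sketch. It is not Hive code: `bucket_for` and `assign_buckets` are hypothetical helpers, and the hash is assumed to be the integer value itself, which is a simplification of what an engine does for other column types.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # illustrative, in the spirit of "INTO 4 BUCKETS"


def bucket_for(key: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Map an integer key to a bucket index.

    Assumption: the hash of an integer is the integer itself; the mask
    keeps the value non-negative before taking the modulo.
    """
    return (key & 0x7FFFFFFF) % num_buckets


def assign_buckets(keys, num_buckets: int = NUM_BUCKETS) -> dict:
    """Group keys by bucket index, like rows landing in bucket files."""
    buckets = defaultdict(list)
    for key in keys:
        buckets[bucket_for(key, num_buckets)].append(key)
    return dict(buckets)


buckets = assign_buckets(range(1, 11))
for index in sorted(buckets):
    print(index, buckets[index])
```

The number of buckets stays fixed no matter how many distinct keys exist; only the number of rows per bucket grows.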
beginner
How does bucketing help in sampling data?
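Bucketing allows sampling by reading only selected buckets instead of scanning the whole dataset, which reduces the data scanned and speeds up queries. In Hive this is done with TABLESAMPLE, for example: SELECT * FROM t TABLESAMPLE(BUCKET 1 OUT OF 4 ON id); the scan is limited to the chosen buckets only when the sampled column is the column the table is bucketed on.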
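The effect of bucket sampling can be sketched in Python as follows. This is a simulation under stated assumptions, not Hive's implementation: `build_buckets` and `sample` are hypothetical helpers, keys are non-negative integers hashed by value, and `y` is assumed to divide the number of buckets.

```python
def build_buckets(keys, num_buckets):
    """Place each non-negative integer key into bucket key % num_buckets."""
    buckets = [[] for _ in range(num_buckets)]
    for key in keys:
        buckets[key % num_buckets].append(key)
    return buckets


def sample(buckets, x, y):
    """Mimic 'bucket x out of y': read only buckets whose index i
    satisfies i % y == x - 1 (x is 1-based). Other buckets are never read."""
    if not 1 <= x <= y:
        raise ValueError("x must be between 1 and y")
    chosen = [i for i in range(len(buckets)) if i % y == x - 1]
    rows = [row for i in chosen for row in buckets[i]]
    return chosen, rows


buckets = build_buckets(range(100), num_buckets=4)

chosen, rows = sample(buckets, x=1, y=4)   # one bucket of four
print(chosen, len(rows))

chosen_half, rows_half = sample(buckets, x=1, y=2)  # two buckets of four
print(chosen_half, len(rows_half))
```

With four buckets, sampling one bucket out of four touches a quarter of the rows, and one out of two touches half, without inspecting the buckets that were not chosen.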
intermediate
What is the difference between bucketing and partitioning?
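Partitioning splits data into one folder per distinct value of the partition column, so the number of folders grows with the number of values. Bucketing splits data into a fixed number of files using a hash function, inside the table or inside each partition, which makes it a better fit for high-cardinality columns, sampling, and joins.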
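The two layouts can be combined, and a small Python sketch may help picture the result. Everything here is illustrative: `storage_path`, the `country` and `user_id` columns, and the file naming are assumptions chosen to resemble a warehouse layout, not output from a real cluster.

```python
def storage_path(row, num_buckets=2):
    """Return a hypothetical relative path for a row.

    The partition column (country) picks the folder; the hash of the
    bucketing column (user_id, hashed by value here) picks the file.
    """
    partition_dir = f"country={row['country']}"
    bucket_file = f"{row['user_id'] % num_buckets:06d}_0"
    return f"{partition_dir}/{bucket_file}"


rows = [
    {"user_id": 1, "country": "US"},
    {"user_id": 2, "country": "US"},
    {"user_id": 3, "country": "IN"},
    {"user_id": 4, "country": "IN"},
    {"user_id": 5, "country": "US"},
]

paths = sorted({storage_path(row) for row in rows})
for path in paths:
    print(path)
```

Adding a new country would add a folder, while adding more users would only add rows to the existing bucket files.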
intermediate
How do you create a bucketed table in Hive?
Use the CLUSTERED BY clause with a column and specify INTO n BUCKETS. For example: CREATE TABLE t (id INT) CLUSTERED BY (id) INTO 4 BUCKETS;
intermediate
Why is bucketing useful for joins in Hadoop?
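Bucketing ensures rows with the same key go to the same bucket. When both tables are bucketed on the join key, and their bucket counts are equal or multiples of each other, each bucket of one table only needs to be joined with the matching bucket of the other, enabling efficient map-side joins without shuffling all the data.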
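The bucket-by-bucket join can be sketched in Python. This is a toy model, assuming both inputs use the same number of buckets and integer keys hashed by value; `bucketize` and `bucket_join` are hypothetical names, not part of any Hadoop API.

```python
def bucketize(rows, num_buckets):
    """Split (key, value) rows into buckets by key % num_buckets."""
    buckets = [[] for _ in range(num_buckets)]
    for key, value in rows:
        buckets[key % num_buckets].append((key, value))
    return buckets


def bucket_join(left, right, num_buckets=4):
    """Join two inputs bucket by bucket.

    Bucket i of the left side is only ever compared with bucket i of the
    right side, because equal keys always hash to the same bucket index.
    """
    joined = []
    for left_bucket, right_bucket in zip(bucketize(left, num_buckets),
                                         bucketize(right, num_buckets)):
        # Small in-memory lookup for one bucket, as a map-side join would build.
        lookup = {}
        for key, value in right_bucket:
            lookup.setdefault(key, []).append(value)
        for key, left_value in left_bucket:
            for right_value in lookup.get(key, []):
                joined.append((key, left_value, right_value))
    return joined


orders = [(1, "order-a"), (2, "order-b"), (5, "order-c"), (7, "order-d")]
users = [(1, "alice"), (2, "bob"), (5, "carol"), (6, "dave")]

joined = sorted(bucket_join(orders, users))
print(joined)
```

Each lookup table holds only one bucket of the right-hand input, which is why this style of join avoids moving the full dataset across the cluster.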
What does bucketing in Hadoop do?
A. Divides data into a fixed number of files based on a hash function
B. Splits data into folders based on column values
C. Compresses data to save space
D. Encrypts data for security
Which clause is used to create a bucketed table in Hive?
A. CLUSTERED BY
B. SORTED BY
C. PARTITIONED BY
D. DISTRIBUTED BY
Bucketing helps sampling by:
A. Compressing data files
B. Selecting specific buckets instead of a full data scan
C. Encrypting data buckets
D. Sorting data within partitions
What is a key difference between partitioning and bucketing?
A. Partitioning divides data into files, bucketing into folders
B. Partitioning sorts data, bucketing samples data
C. Partitioning divides data into folders, bucketing into files
D. Partitioning compresses data, bucketing encrypts data
Why is bucketing useful for joins?
A. It compresses join data
B. It sorts data for faster joins
C. It encrypts join keys
D. It ensures the same keys go to the same bucket, enabling map-side joins
Explain what bucketing is and how it helps with sampling in Hadoop.
Think about dividing data into smaller parts to read less data.
Describe the difference between partitioning and bucketing and why bucketing is useful for joins.
Focus on how data is organized and used in joins.