Recall & Review
beginner
What is bucketing in Hadoop?
Bucketing is a technique that divides data into a fixed number of files, called buckets, by applying a hash function to a column. It organizes data for efficient sampling and querying.
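A minimal sketch of the idea, written in Python rather than Hive itself. It assumes the bucket index is the non-negative hash of the key modulo the bucket count, and that an integer key hashes to itself; `bucket_for` is a hypothetical helper name used only for illustration.

```python
def bucket_for(key: int, num_buckets: int) -> int:
    """Return the bucket index for an integer key (hypothetical helper)."""
    # Assumption: an integer key hashes to itself; the mask keeps it non-negative.
    return (key & 0x7FFFFFFF) % num_buckets


# Distribute ids 1..10 over a fixed set of 4 buckets.
buckets = {b: [] for b in range(4)}
for row_id in range(1, 11):
    buckets[bucket_for(row_id, 4)].append(row_id)

print(buckets)
```

However many rows arrive, the number of buckets stays at 4; only the size of each bucket grows.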
beginner
How does bucketing help in sampling data?
Bucketing allows sampling by reading only selected buckets instead of scanning the whole dataset. This reduces the amount of data scanned and speeds up queries.
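The effect can be sketched in Python, assuming 1,000 integer ids spread over 4 buckets by a simple modulo hash (the names below are hypothetical). In Hive, the corresponding query clause is `TABLESAMPLE(BUCKET x OUT OF y ON col)`.

```python
NUM_BUCKETS = 4


def bucket_for(key: int) -> int:
    """Bucket index for an integer key (assumed hash: the key itself)."""
    return (key & 0x7FFFFFFF) % NUM_BUCKETS


rows = list(range(1000))
buckets = {b: [r for r in rows if bucket_for(r) == b] for b in range(NUM_BUCKETS)}

# Sampling: read a single bucket instead of scanning every row.
sample = buckets[0]
fraction = len(sample) / len(rows)
print(fraction)
```

The sample is only as representative as the hash is even, so the bucketing column should have many distinct, well-spread values.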
intermediate
What is the difference between bucketing and partitioning?
Partitioning divides data into folders, one per distinct value of the partition column, while bucketing divides data into a fixed number of files using a hash function. Bucketing is the better fit for sampling and joins.
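The contrast can be sketched in Python. The rows, the `country` column, and the bucket count of 2 are made-up illustration data, not taken from any real table.

```python
from collections import defaultdict

NUM_BUCKETS = 2
rows = [
    {"id": 1, "country": "US"},
    {"id": 2, "country": "US"},
    {"id": 3, "country": "IN"},
    {"id": 4, "country": "IN"},
    {"id": 5, "country": "FR"},
]

# Partitioning: one folder per distinct column value, so the count follows the data.
partitions = defaultdict(list)
for row in rows:
    partitions[f"country={row['country']}"].append(row["id"])

# Bucketing: the number of files is fixed up front, whatever the values are.
buckets = defaultdict(list)
for row in rows:
    buckets[row["id"] % NUM_BUCKETS].append(row["id"])

print(dict(partitions))
print(dict(buckets))
```

Adding a row for a new country creates a new partition folder, but it still lands in one of the same two buckets.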
intermediate
How do you create a bucketed table in Hive?
Use the CLUSTERED BY clause with a column and specify INTO n BUCKETS. For example: CREATE TABLE t (id INT) CLUSTERED BY (id) INTO 4 BUCKETS;
intermediate
Why is bucketing useful for joins in Hadoop?
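Bucketing ensures that rows with the same key go to the same bucket. When both tables are bucketed on the join key with compatible bucket counts, each bucket of one table only needs to be matched against the corresponding bucket of the other, which enables efficient map-side joins without shuffling all the data.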
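A minimal Python sketch of a bucket-by-bucket join, assuming both tables use the same bucket count and the same modulo hash on the join key. The table contents and the `bucketize` helper are hypothetical.

```python
NUM_BUCKETS = 4


def bucketize(rows):
    """Group (key, value) rows by bucket index (hypothetical helper)."""
    buckets = {b: [] for b in range(NUM_BUCKETS)}
    for key, value in rows:
        buckets[(key & 0x7FFFFFFF) % NUM_BUCKETS].append((key, value))
    return buckets


orders = bucketize([(1, "book"), (2, "pen"), (5, "lamp"), (8, "desk")])
users = bucketize([(1, "ann"), (2, "bob"), (5, "cy"), (7, "dee")])

joined = []
for b in range(NUM_BUCKETS):
    # Bucket b of one table is matched only against bucket b of the other.
    lookup = dict(users[b])  # small in-memory hash table for this bucket
    for key, item in orders[b]:
        if key in lookup:
            joined.append((key, lookup[key], item))

joined.sort()
print(joined)
```

Because matching keys can never sit in different bucket numbers, each pair of buckets can be joined independently and no global shuffle is needed.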
What does bucketing in Hadoop do?
Bucketing divides data into a fixed number of files (buckets) using a hash function on a column.
Which clause is used to create a bucketed table in Hive?
The CLUSTERED BY clause specifies bucketing in Hive.
Bucketing helps sampling by:
Sampling can be done by reading only some buckets, which reduces the amount of data scanned.
What is a key difference between partitioning and bucketing?
Partitioning creates folders by column value; bucketing creates a fixed number of files using a hash.
Why is bucketing useful for joins?
Bucketing places rows with the same key in the same bucket, allowing efficient map-side joins.
Explain what bucketing is and how it helps with sampling in Hadoop.
Think about dividing data into smaller parts to read less data.
Describe the difference between partitioning and bucketing and why bucketing is useful for joins.
Focus on how data is organized and used in joins.