Why is bucketing used in Hadoop sampling when working with large datasets?
Think about how dividing data helps in processing and sampling.
Bucketing splits data into fixed buckets based on a hash of a column, which helps parallelize processing and allows sampling by selecting specific buckets efficiently.
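As a minimal sketch of the idea (a hypothetical helper, not Hive's internal implementation), records are routed to one of N fixed buckets by hashing the bucketing column, so a sampler can read a single bucket instead of the whole dataset. Note that CPython's built-in `hash` of a small integer is the integer itself, which matches the `hash(user_id) = user_id` assumption used in the answers below.

```python
# Hypothetical sketch of hash-based bucketing: route each record to one of
# num_buckets fixed buckets by hashing the bucketing column.
def assign_bucket(key, num_buckets):
    return hash(key) % num_buckets

user_ids = [101, 102, 103, 104, 105, 106, 107, 108]
buckets = {b: [] for b in range(4)}
for uid in user_ids:
    buckets[assign_bucket(uid, 4)].append(uid)

# Sampling "bucket 2" now means reading one small list, not the full dataset.
sample = buckets[2]
print(sample)  # [102, 106]
```

Because each bucket holds roughly 1/N of the data, reading one bucket gives an approximately uniform sample at a fraction of the I/O cost.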
Given a Hive table bucketed by user_id into 4 buckets, what is the output count when sampling bucket number 2 only?
Table data (user_id): [101, 102, 103, 104, 105, 106, 107, 108]
Hash function: bucket = hash(user_id) % 4
SELECT COUNT(*) FROM table WHERE input__file__name LIKE '%bucket_2%';
Calculate hash(user_id) % 4 for each user_id and count how many fall into bucket 2.
Assuming hash(user_id) = user_id, only 102 (102 % 4 = 2) and 106 (106 % 4 = 2) fall into bucket 2, so the count is 2.
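The count can be verified directly, under the same `hash(user_id) = user_id` assumption stated in the answer:

```python
# Verify the bucket-2 membership, assuming hash(user_id) = user_id.
user_ids = [101, 102, 103, 104, 105, 106, 107, 108]
in_bucket_2 = [uid for uid in user_ids if uid % 4 == 2]
print(in_bucket_2)       # [102, 106]
print(len(in_bucket_2))  # 2
```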
What error will this Hive query produce?
SELECT * FROM users TABLESAMPLE(BUCKET 3 OUT OF 5) WHERE age > 30;
Check the order of clauses and bucket numbers.
No error. The TABLESAMPLE clause is correctly placed immediately after the table reference and before the WHERE clause, and bucket 3 is within the valid range 1 to 5. Provided the table is bucketed into 5 buckets, the query runs without error.
What is the output of this pseudocode for sampling bucket 1 from a bucketed dataset with 3 buckets?
for record in dataset:
    if hash(record.key) % 3 == 1:
        print(record.key)
Dataset keys: [10, 11, 12, 13, 14]
Calculate hash(key) % 3 for each key and select those equal to 1.
Assuming hash(key) = key, keys 10 and 13 satisfy hash(key) % 3 == 1, so they are printed.
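A runnable version of the pseudocode confirms this. A simple namedtuple stands in for the record type (an illustrative assumption), and CPython's `hash` of a small integer is the integer itself, matching the hash(key) = key assumption:

```python
from collections import namedtuple

# Illustrative record type; the original pseudocode only specifies a .key field.
Record = namedtuple("Record", ["key"])
dataset = [Record(k) for k in [10, 11, 12, 13, 14]]

# Sample bucket 1 of 3: keep records whose key hashes to 1 mod 3.
sampled = [r.key for r in dataset if hash(r.key) % 3 == 1]
print(sampled)  # [10, 13]
```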
You have a large dataset bucketed by 'region' into 10 buckets. You want to perform stratified sampling to get a representative sample from each region. Which approach is best?
Think about how to get balanced samples from all regions.
Sampling an equal number of records from each bucket ensures representation from all regions, since buckets correspond to regions. The other approaches risk bias toward large regions or incomplete coverage of small ones.
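The approach can be sketched as follows. The bucket contents and sample size here are illustrative assumptions, not data from the original question:

```python
import random

random.seed(7)  # fixed seed so the example is reproducible

# Hypothetical region buckets: 10 regions with 50 records each.
buckets = {f"region_{i}": list(range(i * 100, i * 100 + 50)) for i in range(10)}

# Stratified sampling: draw the same number of records from every bucket.
samples_per_bucket = 5
stratified_sample = {
    region: random.sample(records, samples_per_bucket)
    for region, records in buckets.items()
}

# Every region contributes exactly samples_per_bucket records,
# so no region is over- or under-represented.
```

Because the strata are the buckets themselves, no extra shuffle is needed to group records by region before sampling.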