Why is bucketing used in Hadoop sampling when working with large datasets?
Think about how dividing data helps in processing and sampling.
Bucketing splits data into fixed buckets based on a hash of a column, which helps parallelize processing and allows sampling by selecting specific buckets efficiently.
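As a minimal sketch of the idea (a hypothetical helper, not Hive's internal implementation), records are routed to one of N fixed buckets by hashing the bucketing column, so a sampler can read a single bucket instead of the whole dataset. Note that CPython's built-in `hash` of a small integer is the integer itself, which matches the `hash(user_id) = user_id` assumption used in the answers below.

```python
# Hypothetical sketch of hash-based bucketing: route each record to one of
# num_buckets fixed buckets by hashing the bucketing column.
def assign_bucket(key, num_buckets):
    return hash(key) % num_buckets

user_ids = [101, 102, 103, 104, 105, 106, 107, 108]
buckets = {b: [] for b in range(4)}
for uid in user_ids:
    buckets[assign_bucket(uid, 4)].append(uid)

# Sampling "bucket 2" now means reading one small list, not the full dataset.
sample = buckets[2]
print(sample)  # [102, 106]
```

Because each bucket holds roughly 1/N of the data, reading one bucket gives an approximately uniform sample at a fraction of the I/O cost.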
Given a Hive table bucketed by user_id into 4 buckets, what is the output count when sampling bucket number 2 only?
Table data (user_id): [101, 102, 103, 104, 105, 106, 107, 108]
Hash function: bucket = hash(user_id) % 4
SELECT COUNT(*) FROM table WHERE input__file__name LIKE '%bucket_2%';
Calculate hash(user_id) % 4 for each user_id and count how many fall into bucket 2.
Assuming hash(user_id) = user_id, only 102 (102 % 4 = 2) and 106 (106 % 4 = 2) fall into bucket 2, so the count is 2.
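The count can be verified directly, under the same `hash(user_id) = user_id` assumption stated in the answer:

```python
# Verify the bucket-2 membership, assuming hash(user_id) = user_id.
user_ids = [101, 102, 103, 104, 105, 106, 107, 108]
in_bucket_2 = [uid for uid in user_ids if uid % 4 == 2]
print(in_bucket_2)       # [102, 106]
print(len(in_bucket_2))  # 2
```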
What error will this Hive query produce?
SELECT * FROM users TABLESAMPLE(BUCKET 3 OUT OF 5) WHERE age > 30;
Check the order of clauses and bucket numbers.
No error. The TABLESAMPLE clause is correctly placed immediately after the table reference and before the WHERE clause, and bucket 3 is within the valid range 1 to 5. Provided the table is bucketed into 5 buckets, the query runs without error.
What is the output of this pseudocode for sampling bucket 1 from a bucketed dataset with 3 buckets?
for record in dataset:
    if hash(record.key) % 3 == 1:
        print(record.key)
Dataset keys: [10, 11, 12, 13, 14]
Calculate hash(key) % 3 for each key and select those equal to 1.
Assuming hash(key) = key, keys 10 and 13 satisfy hash(key) % 3 == 1, so they are printed.
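A runnable version of the pseudocode confirms this. A simple namedtuple stands in for the record type (an illustrative assumption), and CPython's `hash` of a small integer is the integer itself, matching the hash(key) = key assumption:

```python
from collections import namedtuple

# Illustrative record type; the original pseudocode only specifies a .key field.
Record = namedtuple("Record", ["key"])
dataset = [Record(k) for k in [10, 11, 12, 13, 14]]

# Sample bucket 1 of 3: keep records whose key hashes to 1 mod 3.
sampled = [r.key for r in dataset if hash(r.key) % 3 == 1]
print(sampled)  # [10, 13]
```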
You have a large dataset bucketed by 'region' into 10 buckets. You want to perform stratified sampling to get a representative sample from each region. Which approach is best?
Think about how to get balanced samples from all regions.
Sampling an equal number of records from each bucket ensures representation from all regions, since buckets correspond to regions. The other approaches risk bias toward large regions or incomplete coverage of small ones.
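The approach can be sketched as follows. The bucket contents and sample size here are illustrative assumptions, not data from the original question:

```python
import random

random.seed(7)  # fixed seed so the example is reproducible

# Hypothetical region buckets: 10 regions with 50 records each.
buckets = {f"region_{i}": list(range(i * 100, i * 100 + 50)) for i in range(10)}

# Stratified sampling: draw the same number of records from every bucket.
samples_per_bucket = 5
stratified_sample = {
    region: random.sample(records, samples_per_bucket)
    for region, records in buckets.items()
}

# Every region contributes exactly samples_per_bucket records,
# so no region is over- or under-represented.
```

Because the strata are the buckets themselves, no extra shuffle is needed to group records by region before sampling.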