Bucketing for sampling in Hadoop - Time & Space Complexity
When using bucketing for sampling in Hadoop, we want to know how the time to process data changes as the data grows.
We ask: How does bucketing affect the number of operations when sampling large datasets?
Analyze the time complexity of the following HiveQL snippet (Hive is the SQL layer that runs on Hadoop).
```sql
INSERT OVERWRITE TABLE sample_table
SELECT * FROM big_table TABLESAMPLE(BUCKET 10 OUT OF 100 ON id);
```
This query samples the data by hashing each row's `id` into 100 buckets and keeping only the rows that land in bucket 10, roughly 1/100 of the table.
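The bucket assignment can be sketched in Python. This is a simplified model, not Hive's implementation: the function name `bucket_sample` is made up, and a plain modulo stands in for Hive's hash function.

```python
def bucket_sample(rows, bucket, out_of, key="id"):
    """Keep rows whose key hashes into the chosen 1-based bucket.

    Modulo stands in for Hive's hash of the ON column; every row must
    be hashed, even the ones that are ultimately discarded.
    """
    return [r for r in rows if r[key] % out_of == bucket - 1]

rows = [{"id": i} for i in range(1000)]
sample = bucket_sample(rows, bucket=10, out_of=100)
print(len(sample))  # 10: the ids 9, 109, ..., 909 fall in bucket 10
```

Note that the list comprehension touches every one of the 1000 rows to produce the 10-row sample.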
Look at what repeats when processing the data.
- Primary operation: Scanning all rows in the big_table to assign each to a bucket.
- How many times: Once for each row in the dataset.
As the dataset grows, the number of rows to scan grows too.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | 10 row checks |
| 100 | 100 row checks |
| 1000 | 1000 row checks |
Pattern observation: The operations grow directly with the number of rows; doubling rows doubles work.
Time Complexity: O(n)
This means the time to sample grows linearly with the size of the input data.
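The linear pattern in the table above can be reproduced with a small instrumented sketch (the counter function `count_row_checks` is illustrative, not part of Hive):

```python
def count_row_checks(n, bucket=10, out_of=100):
    """Count how many rows are examined when sampling n rows."""
    checks = 0
    for i in range(n):
        checks += 1                    # every row is examined once
        _ = (i % out_of == bucket - 1)  # bucket decision per row
    return checks

for n in (10, 100, 1000):
    print(n, count_row_checks(n))  # checks always equals n
```

Doubling `n` doubles the check count, which is exactly the O(n) behavior described above.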
[X] Wrong: "Sampling with bucketing only processes the sampled bucket, so time stays the same no matter the data size."
[OK] Correct: Unless the table was pre-bucketed (CLUSTERED BY) on `id`, Hive must read every row and hash its `id` just to decide whether it belongs to the sampled bucket, so time grows linearly with data size. (On a table already bucketed on `id`, Hive can prune to the matching bucket files and skip the full scan.)
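A quick sketch makes the correct statement concrete: the number of row checks is the same whether the sample keeps 1/100 of the rows or 1/2 of them (the helper `checks_for_sample` is hypothetical, for illustration only).

```python
def checks_for_sample(n, out_of):
    """Count row checks when keeping one bucket out of `out_of`."""
    checks = 0
    for i in range(n):
        checks += 1              # the id is hashed even for discarded rows
        _ = (i % out_of == 0)    # bucket decision
    return checks

# Keeping a smaller fraction does not reduce the scan:
assert checks_for_sample(1000, 100) == checks_for_sample(1000, 2) == 1000
```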
Understanding how bucketing affects processing time helps you explain efficient data sampling in big data tools like Hadoop.
What if we changed the sample to BUCKET 10 OUT OF 20, so that roughly 1/20 of the rows are kept instead of 1/100? How would the time complexity change?