Partitioning for query performance in Hadoop - Time & Space Complexity
When we use partitioning in Hadoop, we split data into parts to speed up queries.
We want to know how this splitting affects the time it takes to run queries.
Analyze the time complexity of the following code snippet.
-- Example of reading data with partition pruning
SELECT * FROM sales
WHERE year = 2023 AND region = 'US';
-- Data is stored in folders by year and region
-- Hadoop reads only matching partitions
This code reads only the partitions for year 2023 and region US, skipping others.
Identify the loops, recursion, array traversals that repeat.
- Primary operation: Scanning data files in selected partitions.
- How many times: Once per relevant partition, not all data.
When data grows, partitioning lets queries skip irrelevant parts.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 partitions | Scan 1-2 partitions |
| 100 partitions | Scan 1-2 partitions |
| 1000 partitions | Scan 1-2 partitions |
Pattern observation: The work stays about the same because only needed partitions are scanned.
Time Complexity: O(k)
This means the query time grows with the number of partitions scanned, not total data size.
[X] Wrong: "Partitioning always makes queries run in constant time regardless of data."
[OK] Correct: If the query filters are not on partition keys, all partitions may be scanned, so time grows with data.
Understanding partitioning time helps you explain how big data queries stay fast by reading less data.
"What if we changed the query to filter on a non-partitioned column? How would the time complexity change?"