0
0
Hadoopdata~10 mins

Partitioning for query performance in Hadoop - Step-by-Step Execution

Choose your learning style9 modes available
Concept Flow - Partitioning for query performance
Start Query
Check Partition Key
Filter Partitions
Scan Only Needed Partitions
Return Results
End Query
When a query runs, it uses the partition key to filter data and scans only relevant partitions, speeding up the query.
Execution Sample
Hadoop
SELECT * FROM sales
WHERE year = 2023 AND region = 'US';
This query selects sales data only for year 2023 and region US, using partitioning to scan less data.
Execution Table
StepActionPartition Key UsedPartitions ScannedResult
1Start QueryN/AN/AQuery begins
2Check WHERE clause for partition keysyear, regionN/AIdentified partitions to filter
3Filter partitions by year=2023year=2023Partitions with year=2023Reduced partitions
4Filter partitions by region='US'region='US'Partitions with year=2023 and region='US'Further reduced partitions
5Scan filtered partitionsyear=2023, region='US'Only relevant partitionsData scanned efficiently
6Return resultsN/AN/AQuery results returned
7End QueryN/AN/AQuery finished
💡 Query ends after scanning only partitions matching year=2023 and region='US', improving performance.
Variable Tracker
VariableStartAfter Step 3After Step 4Final
Partitions to scanAll partitionsPartitions with year=2023Partitions with year=2023 and region='US'Partitions with year=2023 and region='US'
Key Moments - 2 Insights
Why does the query scan fewer partitions after filtering by partition keys?
Because the query uses the WHERE clause on partition keys (year and region), it can skip partitions that don't match, as shown in steps 3 and 4 of the execution table.
What happens if the query does not filter by partition keys?
If no partition keys are used in the WHERE clause, the query scans all partitions, which is slower. This is implied by the 'Partitions to scan' variable starting as 'All partitions'.
Visual Quiz - 3 Questions
Test your understanding
Look at the execution table, at which step does the query identify the partition keys used?
AStep 2
BStep 3
CStep 5
DStep 6
💡 Hint
Check the 'Partition Key Used' column in the execution table at each step.
According to the variable tracker, what is the state of 'Partitions to scan' after step 4?
AAll partitions
BPartitions with year=2023
CPartitions with year=2023 and region='US'
DNo partitions
💡 Hint
Look at the 'After Step 4' column for 'Partitions to scan' in the variable tracker.
If the query did not filter by region, how would the partitions scanned change at step 4?
APartitions would be filtered by region only
BPartitions would remain filtered by year only
CPartitions would include all years and regions
DPartitions would be empty
💡 Hint
Refer to the filtering steps in the execution table and how partition keys affect partitions scanned.
Concept Snapshot
Partitioning splits data into parts by keys (like year, region).
Queries use WHERE on these keys to scan only needed parts.
This reduces data scanned and speeds up queries.
Without partition filters, all data is scanned, slowing queries.
Full Transcript
Partitioning helps queries run faster by dividing data into parts based on keys like year or region. When a query uses these keys in its WHERE clause, it scans only the matching partitions instead of all data. For example, a query filtering year=2023 and region='US' scans only partitions with those values. This reduces the amount of data read and speeds up the query. If no partition keys are used, the query scans all partitions, which is slower. The execution table shows each step: starting the query, identifying partition keys, filtering partitions step-by-step, scanning only relevant partitions, and returning results. The variable tracker shows how the set of partitions to scan shrinks after each filter. Understanding this flow helps write efficient queries on partitioned data.