Hive query optimization in Hadoop - Step-by-Step Execution

Concept Flow - Hive query optimization
1. Write Hive Query
2. Parse Query
3. Generate Logical Plan
4. Apply Optimizations
5. Generate Physical Plan
6. Execute Query on Hadoop
7. Return Results
The Hive query goes through parsing, logical plan creation, optimization, physical plan generation, and then execution on Hadoop.
Execution Sample
SELECT dept, COUNT(*) FROM employees WHERE salary > 50000 GROUP BY dept;
This query counts, for each department, the employees whose salary is greater than 50000.
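To make the query's meaning concrete, here is a minimal Python sketch that computes the same per-department counts over a small in-memory list. The sample rows and the `count_high_earners` helper are hypothetical and exist only for illustration; Hive runs the query as distributed jobs, not as a single loop like this.

```python
from collections import Counter

# Hypothetical sample rows standing in for the employees table.
employees = [
    {"name": "Ana", "dept": "eng", "salary": 72000},
    {"name": "Ben", "dept": "eng", "salary": 48000},
    {"name": "Cho", "dept": "ops", "salary": 55000},
    {"name": "Dev", "dept": "ops", "salary": 61000},
    {"name": "Eli", "dept": "hr", "salary": 39000},
]

def count_high_earners(rows, threshold=50000):
    """Mimic: SELECT dept, COUNT(*) ... WHERE salary > threshold GROUP BY dept."""
    # WHERE keeps rows above the threshold; GROUP BY + COUNT(*) tallies them per dept.
    return dict(Counter(row["dept"] for row in rows if row["salary"] > threshold))

result = count_high_earners(employees)
print(result)
```

Departments with no matching rows (here `hr`) do not appear in the result, which matches how GROUP BY behaves after a WHERE filter.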
Execution Table
| Step | Action | Details | Effect |
|------|--------|---------|--------|
| 1 | Parse Query | Check syntax and build parse tree | Valid parse tree created |
| 2 | Generate Logical Plan | Create plan with filter and group by | Logical plan with filter salary>50000 and group by dept |
| 3 | Apply Optimizations | Push filter before group by | Filter applied early to reduce data |
| 4 | Generate Physical Plan | Create MapReduce or Tez jobs | Physical plan optimized for execution |
| 5 | Execute Query | Run jobs on Hadoop cluster | Data processed with less resource usage |
| 6 | Return Results | Aggregate counts per dept | Final counts returned |
💡 Query execution completes after returning aggregated results.
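The effect of step 3 can be sketched with a small Python simulation. It is an illustration under stated assumptions, not Hive's actual execution: the sample rows and the `run` function are hypothetical, and "rows moved" stands in for the data that would be sent to the grouping stage.

```python
from collections import defaultdict

# Hypothetical (dept, salary) rows standing in for the employees table.
rows = [("eng", 72000), ("eng", 48000), ("ops", 55000),
        ("ops", 61000), ("hr", 39000), ("hr", 41000)]

def run(rows, filter_before_shuffle):
    """Return (counts per dept, number of rows sent to the grouping stage)."""
    if filter_before_shuffle:
        # Optimized: drop non-matching rows before they reach the grouping stage.
        rows = [r for r in rows if r[1] > 50000]
    shuffled = defaultdict(list)
    for dept, salary in rows:
        shuffled[dept].append(salary)
    moved = sum(len(v) for v in shuffled.values())
    # Grouping stage: the predicate is re-applied here, so both variants agree.
    counts = {}
    for dept, salaries in shuffled.items():
        n = sum(1 for s in salaries if s > 50000)
        if n:
            counts[dept] = n
    return counts, moved

late_counts, late_moved = run(rows, filter_before_shuffle=False)
early_counts, early_moved = run(rows, filter_before_shuffle=True)
print(late_counts == early_counts, late_moved, early_moved)
```

Both variants return the same counts; the only difference is how many rows reach the grouping stage, which is the saving the optimization aims for.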
Variable Tracker
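| Variable | Start | After Step 2 | After Step 3 | After Step 4 | Final |
|----------|-------|--------------|--------------|--------------|-------|
| Query Plan | Raw parse tree | Logical plan with filter and group by | Optimized logical plan with filter pushed down | Physical execution plan | Executed and results ready |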
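The transition from "After Step 2" to "After Step 3" can be pictured as a rewrite of a plan tree. The sketch below uses nested tuples as a toy plan and a hypothetical `push_filter_into_scan` rule; it is not Hive's internal representation or optimizer, only an illustration of the idea of pushing a filter down to the table scan.

```python
# Toy plan nodes as nested tuples; this is not Hive's internal representation.
logical_plan = (
    "group_by", "dept",
    ("filter", "salary > 50000",
     ("scan", "employees", None)),  # None = the scan carries no predicate yet
)

def push_filter_into_scan(plan):
    """Rewrite filter(scan) into a scan that carries the predicate."""
    op = plan[0]
    if op == "filter" and plan[2][0] == "scan" and plan[2][2] is None:
        _, predicate, (_, table, _) = plan
        return ("scan", table, predicate)
    if op in ("group_by", "filter"):
        # Recurse into the single child and rebuild the node.
        return (op, plan[1], push_filter_into_scan(plan[2]))
    return plan  # scans are leaves

optimized_plan = push_filter_into_scan(logical_plan)
print(optimized_plan)
```

Applying the rule a second time leaves the plan unchanged, since there is no separate filter node left to push.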
Key Moments - 3 Insights
Why is pushing the filter before the group by important?
Applying the filter early reduces the number of rows that have to be shuffled and grouped, which makes the query faster (see step 3 of the Execution Table).
What happens if the query is not optimized before execution?
The query still runs, but it may use more resources and take longer because the physical plan is less efficient (see steps 4 and 5).
How does Hive decide which physical plan to use?
Hive builds the physical plan for the configured execution engine, such as MapReduce or Tez (the hive.execution.engine setting), and can use cost estimates to choose among candidate plans, as indicated in step 4.
Visual Quiz - 3 Questions
Test your understanding
According to the Execution Table, at which step is the filter condition applied early to reduce data?
A. Step 4
B. Step 2
C. Step 3
D. Step 5
💡 Hint
Check the 'Apply Optimizations' step in the Execution Table, where filter pushdown is mentioned.
According to the Variable Tracker, what is the state of the query plan after step 3?
A. Optimized logical plan with filter pushed down
B. Physical execution plan
C. Raw parse tree
D. Executed and results ready
💡 Hint
Look at the 'After Step 3' column in the Variable Tracker for 'Query Plan'.
If the filter were not pushed down, how would the Execution Table change?
A. Step 4 would be skipped
B. Step 3 would show no optimization and more data processed
C. Step 5 would return no results
D. Step 1 would fail parsing
💡 Hint
Refer to step 3 in the Execution Table, where the optimization is applied to reduce data.
Concept Snapshot
Hive Query Optimization:
- Parse query to build logical plan
- Push filters early to reduce data
- Generate efficient physical plan
- Execute on Hadoop with less resource use
- Return aggregated results faster
Full Transcript
Hive query optimization involves several steps: first, the query is parsed to check syntax and create a parse tree. Then, a logical plan is generated that includes operations like filtering and grouping. Optimizations are applied, such as pushing filters before grouping to reduce data early. Next, a physical plan is created to run the query efficiently on Hadoop using engines like MapReduce or Tez. Finally, the query executes and returns results. This process reduces resource use and speeds up query execution.