Challenge - 5 Problems
Hive Query Optimization Master
❓ Predict Output
intermediate
Output of a Hive query with partition pruning
Consider a Hive table sales partitioned by year. What is the output count of this query?
SELECT COUNT(*) FROM sales WHERE year = 2023;
💡 Hint
Partition pruning skips non-matching partitions before any data is scanned, so only rows in the year = 2023 partition are counted.
Because the filter is on the partition column year, Hive prunes every other partition and scans only year = 2023. The result is a single row holding the number of rows in that partition.
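One way to check that pruning actually happens is to ask Hive which partitions the query depends on. The sketch below assumes a minimal schema for sales; the columns order_id and amount and the ORC storage format are illustrative assumptions, not part of the challenge.

```sql
-- Hypothetical schema: order_id and amount are assumed columns.
CREATE TABLE IF NOT EXISTS sales (
  order_id BIGINT,
  amount   DOUBLE
)
PARTITIONED BY (year INT)
STORED AS ORC;

-- List the partitions that currently exist for the table.
SHOW PARTITIONS sales;

-- Report the tables and partitions the query reads.
-- With pruning, only the year=2023 partition should be listed.
EXPLAIN DEPENDENCY
SELECT COUNT(*) FROM sales WHERE year = 2023;
```

If partitions other than year=2023 show up in the dependency output, the filter is probably not being applied to the partition column directly.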
🧠 Conceptual
intermediate
Effect of using OR vs UNION ALL in Hive queries
Which option best explains the performance difference between these two Hive queries?
Query 1: SELECT * FROM table WHERE col = 'A' OR col = 'B';
Query 2: SELECT * FROM table WHERE col = 'A' UNION ALL SELECT * FROM table WHERE col = 'B';
💡 Hint
Splitting filters can help Hive optimize each scan separately.
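UNION ALL splits the query into two simpler scans that can each be optimized on their own, which can improve performance when the optimizer handles the combined OR predicate poorly. The trade-off is that the table may be read twice, so for a simple OR on a single column one scan is often just as fast; compare both plans before rewriting.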
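Rather than assuming which form is faster, the plans can be compared directly. This is a minimal sketch; events is a hypothetical table name standing in for the placeholder used in the question, and col is its assumed string column.

```sql
-- Plan for the single-scan form with OR.
EXPLAIN
SELECT * FROM events WHERE col = 'A' OR col = 'B';

-- Plan for the split form; look for two table scan operators on events.
EXPLAIN
SELECT * FROM events WHERE col = 'A'
UNION ALL
SELECT * FROM events WHERE col = 'B';

-- An equivalent single-scan predicate that is often the simplest choice.
SELECT * FROM events WHERE col IN ('A', 'B');
```

The three queries return the same rows here because a row cannot have col equal to both 'A' and 'B', so UNION ALL introduces no duplicates.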
🔧 Debug
advanced
Identify the cause of slow Hive query with joins
This Hive query runs very slowly:
SELECT a.id, b.value FROM table_a a JOIN table_b b ON a.key = b.key WHERE a.date = '2023-01-01';
What is the most likely cause of the slow performance?
💡 Hint
Check if both tables are partitioned and filters applied correctly.
The filter on a.date restricts only table_a. If table_b is large and is neither partitioned nor filtered, the join has to scan all of it, which slows the query.
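One possible fix is to filter table_b before the join so that less data reaches it. The sketch below assumes table_b also has a date column (ideally its partition column) with the same meaning as in table_a; that column is an assumption and does not appear in the original query.

```sql
-- Assumption: table_b has a `date` column comparable to table_a's.
-- Backticks guard against `date` being treated as a reserved keyword.
SELECT a.id, b.value
FROM table_a a
JOIN (
  -- Read only the rows and columns of table_b that the join needs.
  SELECT key, value
  FROM table_b
  WHERE `date` = '2023-01-01'
) b
  ON a.key = b.key
WHERE a.`date` = '2023-01-01';
```

This rewrite returns the same rows only if matching rows in both tables always share the same date, so it needs to be checked against the actual data model before use.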
❓ data_output
advanced
Result of using map-side join in Hive
Given table_small is small and table_large is very large, what is the output of this query?
SELECT /*+ MAPJOIN(table_small) */ l.id, s.value FROM table_large l JOIN table_small s ON l.key = s.key LIMIT 5;
💡 Hint
MAPJOIN loads the small table into memory to speed up join.
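The MAPJOIN hint asks Hive to load the small table into memory and join it with the large table on the map side, which avoids a shuffle of the large table. The query returns up to 5 correctly joined rows; without an ORDER BY, which 5 rows come back is not guaranteed. In recent Hive versions the hint is honored only when hive.ignore.mapjoin.hint is set to false; otherwise Hive decides on its own through hive.auto.convert.join.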
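The two ways of getting a map-side join can be sketched as follows. The threshold value shown is the documented default for hive.mapjoin.smalltable.filesize and is only illustrative; naming the alias s in the hint is an assumption about how the hint resolves when table aliases are used.

```sql
-- Option 1: let Hive convert the join automatically when the
-- small table is below the size threshold (in bytes).
SET hive.auto.convert.join=true;
SET hive.mapjoin.smalltable.filesize=25000000;

SELECT l.id, s.value
FROM table_large l
JOIN table_small s ON l.key = s.key
LIMIT 5;

-- Option 2: make Hive honor an explicit hint instead.
SET hive.ignore.mapjoin.hint=false;

SELECT /*+ MAPJOIN(s) */ l.id, s.value
FROM table_large l
JOIN table_small s ON l.key = s.key
LIMIT 5;
```

Running EXPLAIN on either query and looking for a Map Join Operator in the plan is a quick way to confirm which strategy Hive actually picked.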
🚀 Application
expert
Optimizing a Hive query with skewed data
You have a Hive table with skewed keys causing slow joins. Which option is the best approach to optimize the join performance?
💡 Hint
Hive has built-in features to handle skewed joins efficiently.
Enabling hive.optimize.skewjoin lets Hive detect skewed keys at run time and handle them separately, which avoids overloading a single reducer and improves join speed.
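A minimal sketch of turning the feature on for a session is shown below. The tables orders and customers and their columns are hypothetical, and the threshold is the documented default for hive.skewjoin.key; how much this helps depends on the Hive version and execution engine.

```sql
-- Enable run-time skew join handling (off by default).
SET hive.optimize.skewjoin=true;

-- A key is treated as skewed once it has more than this many rows.
SET hive.skewjoin.key=100000;

-- Hypothetical join where a few customer_id values dominate orders.
SELECT o.order_id, c.name
FROM orders o
JOIN customers c
  ON o.customer_id = c.customer_id;
```

Lowering hive.skewjoin.key makes Hive treat more keys as skewed, so the value is usually tuned against the observed key distribution rather than left at a guess.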