0
0
Hadoopdata~30 mins

Hive query optimization in Hadoop - Mini Project: Build & Apply

Choose your learning style9 modes available
Hive Query Optimization
📖 Scenario: You work as a data analyst in a company that stores large sales data in Hive tables. You want to write efficient Hive queries to get insights quickly without wasting resources.
🎯 Goal: Build a Hive query step-by-step that selects sales data for a specific year and calculates total sales per product, using optimization techniques like filtering early and selecting only needed columns.
📋 What You'll Learn
Create a Hive table with sales data
Add a variable for the target year to filter data
Write a Hive query that selects product and sales amount filtered by the target year
Print the total sales per product for the target year
💡 Why This Matters
🌍 Real World
Companies use Hive to store and analyze large datasets. Writing optimized queries saves time and computing resources.
💼 Career
Data analysts and engineers must write efficient Hive queries to handle big data and deliver insights quickly.
Progress0 / 4 steps
1
Create the sales data table
Create a Hive table called sales_data with columns product (string), year (int), and sales_amount (int). Insert these exact rows: ('Apple', 2022, 100), ('Banana', 2022, 150), ('Apple', 2023, 200), ('Banana', 2023, 100).
Hadoop
Need a hint?

Use CREATE TABLE to define the table and INSERT INTO TABLE to add rows.

2
Set the target year variable
Create a Hive variable called target_year and set it to 2022 to filter sales data for that year.
Hadoop
Need a hint?

Use SET variable_name=value; syntax to create a variable in Hive.

3
Write the optimized query to select filtered data
Write a Hive query that selects product and sales_amount from sales_data where year equals the variable ${target_year}. Use this exact query format: SELECT product, sales_amount FROM sales_data WHERE year = ${target_year};
Hadoop
Need a hint?

Use the variable ${target_year} in the WHERE clause to filter data.

4
Calculate and display total sales per product
Write a Hive query that calculates the total sales per product for the year stored in ${target_year}. Use GROUP BY product and SUM(sales_amount). Print the results with columns product and total_sales.
Hadoop
Need a hint?

Use SUM(sales_amount) AS total_sales and GROUP BY product to get totals.