How to Use Join in Hive in Hadoop: Syntax and Examples
In Hive on Hadoop, you use JOIN to combine rows from two or more tables based on a related column. The basic syntax is SELECT ... FROM table1 JOIN table2 ON table1.key = table2.key, which returns the combined rows whose keys match in both tables.
Syntax
The basic syntax for a join in Hive is:
- SELECT: Choose columns to display.
- FROM: Specify the first table.
- JOIN: Specify the second table to join.
- ON: Define the condition to match rows between tables.
Hive supports several join types, including INNER JOIN, LEFT OUTER JOIN, RIGHT OUTER JOIN, and FULL OUTER JOIN. A plain JOIN, as in the statement below, is an inner join.
sql
SELECT a.column1, b.column2 FROM table1 a JOIN table2 b ON a.id = b.id;
Example
This example shows how to join two tables employees and departments on the dept_id column to get employee names with their department names.
sql
-- Create the two tables
CREATE TABLE employees (emp_id INT, emp_name STRING, dept_id INT);
CREATE TABLE departments (dept_id INT, dept_name STRING);
-- Load sample rows
INSERT INTO employees VALUES (1, 'Alice', 10), (2, 'Bob', 20), (3, 'Charlie', 10);
INSERT INTO departments VALUES (10, 'HR'), (20, 'Engineering');
-- Inner join on the shared dept_id column
SELECT e.emp_name, d.dept_name
FROM employees e
JOIN departments d
  ON e.dept_id = d.dept_id;
Output
emp_name dept_name
Alice HR
Bob Engineering
Charlie HR
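An inner join silently drops any employee whose dept_id has no match in departments. As a rough sketch of the difference, the query below reuses the tables above, adds one hypothetical employee (Dana, with a dept_id of 30 that does not exist in departments), and switches to LEFT OUTER JOIN so that she still appears, with NULL as the department name. The extra row is an assumption made only for illustration.

```sql
-- Hypothetical extra row: dept_id 30 has no entry in departments
INSERT INTO employees VALUES (4, 'Dana', 30);

-- LEFT OUTER JOIN keeps every employee, matched or not
SELECT e.emp_id, e.emp_name, d.dept_name
FROM employees e
LEFT OUTER JOIN departments d
  ON e.dept_id = d.dept_id
ORDER BY emp_id;

-- Expected rows:
-- 1  Alice    HR
-- 2  Bob      Engineering
-- 3  Charlie  HR
-- 4  Dana     NULL
```

With a plain JOIN, the Dana row would be missing from the result, which is the kind of "missing rows" surprise that choosing the wrong join type can cause.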
Common Pitfalls
Common mistakes when using joins in Hive include:
- Not specifying the ON condition correctly, causing a Cartesian product (every row of one table paired with every row of the other).
- Using JOIN without qualifying columns, leading to ambiguous column errors.
- Ignoring join type differences, which can cause missing or extra rows.
- Joining large tables without proper filtering, causing slow queries.
sql
-- Wrong: missing ON condition causes a Cartesian product
SELECT * FROM employees e JOIN departments d;
-- Right: always specify the ON condition
SELECT * FROM employees e JOIN departments d ON e.dept_id = d.dept_id;
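The ambiguous-column pitfall can be illustrated the same way. Both employees and departments from the example have a dept_id column, so an unqualified reference to it is ambiguous and Hive is expected to reject the query at compile time; the exact error text varies by version. A minimal sketch:

```sql
-- Wrong: dept_id exists in both tables, so the reference is ambiguous
SELECT emp_name, dept_id
FROM employees e
JOIN departments d
  ON e.dept_id = d.dept_id;

-- Right: prefix each column with its table alias
SELECT e.emp_name, e.dept_id, d.dept_name
FROM employees e
JOIN departments d
  ON e.dept_id = d.dept_id;
```

Qualifying every column, even the ones that are currently unique, also protects the query if a column with the same name is later added to the other table.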
Quick Reference
| Join Type | Description | Example |
|---|---|---|
| INNER JOIN | Returns rows with matching keys in both tables | SELECT * FROM A JOIN B ON A.id = B.id; |
| LEFT OUTER JOIN | Returns all rows from the left table, with NULLs where the right table has no match | SELECT * FROM A LEFT OUTER JOIN B ON A.id = B.id; |
| RIGHT OUTER JOIN | Returns all rows from the right table, with NULLs where the left table has no match | SELECT * FROM A RIGHT OUTER JOIN B ON A.id = B.id; |
| FULL OUTER JOIN | Returns all rows from both tables, with NULLs on whichever side has no match | SELECT * FROM A FULL OUTER JOIN B ON A.id = B.id; |
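To make the differences in the table concrete, here is a small sketch using two throwaway tables named a and b. The table names, column names, and values are hypothetical and chosen so that one id matches and one id on each side does not.

```sql
-- Hypothetical demo tables: id 2 is the only key present in both
CREATE TABLE a (id INT, a_val STRING);
CREATE TABLE b (id INT, b_val STRING);
INSERT INTO a VALUES (1, 'a1'), (2, 'a2');
INSERT INTO b VALUES (2, 'b2'), (3, 'b3');

-- FULL OUTER JOIN keeps unmatched rows from both sides
SELECT a.id AS a_id, a.a_val, b.id AS b_id, b.b_val
FROM a
FULL OUTER JOIN b
  ON a.id = b.id;

-- Expected rows (order is not guaranteed without ORDER BY):
-- a_id  a_val  b_id  b_val
-- 1     a1     NULL  NULL
-- 2     a2     2     b2
-- NULL  NULL   3     b3
```

Replacing FULL OUTER JOIN with JOIN would return only the id 2 row, LEFT OUTER JOIN would return the rows for ids 1 and 2, and RIGHT OUTER JOIN would return the rows for ids 2 and 3.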
Key Takeaways
- Always specify the ON condition to avoid unintended large results.
- Use the appropriate join type based on the data you want to retrieve.
- Qualify column names when joining tables to prevent ambiguity.
- Joining large tables without filters can slow down your Hive queries.
- Test joins with small data samples to verify correctness before running on big data.
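The last two points can be combined in one query. The sketch below, which assumes the employees and departments tables from the example, narrows the employees side to the needed rows and columns in a subquery before joining, and caps the result with LIMIT while the join logic is still being checked. The filter value and the limit are arbitrary choices for illustration.

```sql
-- Filter and project one side before the join, then cap the output
SELECT e.emp_name, d.dept_name
FROM (
  SELECT emp_name, dept_id
  FROM employees
  WHERE dept_id = 10        -- arbitrary filter, for illustration only
) e
JOIN departments d
  ON e.dept_id = d.dept_id
LIMIT 10;                   -- small sample while verifying the join
```

Hive's optimizer can often push simple filters down on its own, so the explicit subquery may not change the execution plan; it mainly makes the intent visible to whoever reads the query.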