

Internal vs External Table in Hive: Key Differences and Usage

In Hive, an internal (managed) table stores its data in Hive's warehouse directory, and dropping the table deletes that data. An external table points to data stored at a location you specify, and dropping it removes only the table metadata while the data stays in place. Hive controls the full data lifecycle of internal tables, whereas external tables let other tools share the same data.

⚖️

Quick Comparison

This table summarizes the main differences between internal and external tables in Hive.

Feature	Internal Table	External Table
Data Storage Location	Hive's warehouse directory (default)	User-specified external location
Data Deletion on Drop	Deletes both table and data	Deletes only table metadata, data remains
Use Case	Hive manages full data lifecycle	Data shared with other systems or tools
Creation Syntax	CREATE TABLE ...	CREATE EXTERNAL TABLE ...
Data Ownership	Hive owns data	User owns data
Data Backup	Backup needed before drop	Data safe after drop

⚖️

Key Differences

Internal tables store data inside Hive's default warehouse directory, usually at /user/hive/warehouse. When you drop an internal table, Hive deletes both the table schema and the actual data files. This means Hive fully controls the data lifecycle.
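
To check where a table's data lives and whether Hive treats it as managed, you can inspect its metadata. This is a minimal sketch that assumes a table named employees already exists. The exact layout of the output varies between Hive versions.

```sql
-- Print the configured warehouse root (commonly /user/hive/warehouse)
SET hive.metastore.warehouse.dir;

-- Show detailed metadata for the table.
-- Look for the "Location:" row (where the files live) and the
-- "Table Type:" row (MANAGED_TABLE for internal, EXTERNAL_TABLE for external).
DESCRIBE FORMATTED employees;
```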

In contrast, external tables point to data at a location you specify, typically an HDFS directory outside Hive's warehouse or an object store such as Amazon S3. Dropping an external table removes only its metadata from the Hive metastore and leaves the data files untouched. This allows multiple tools or users to access the same data without Hive deleting it.

Because of this, internal tables are best when Hive is the sole data manager, while external tables are ideal for sharing data or when data is managed outside Hive.
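
If a table's role changes, Hive lets you switch its type through a table property instead of recreating it. The sketch below assumes an existing table named employees. Behavior depends on the Hive version and configuration: transactional (ACID) managed tables in Hive 3, for example, may refuse the conversion, so treat this as an illustration rather than a guaranteed recipe.

```sql
-- Convert a managed table to external so that DROP TABLE keeps the files.
-- Some Hive versions are case-sensitive about the value, so 'TRUE' is upper-case here.
ALTER TABLE employees SET TBLPROPERTIES ('EXTERNAL' = 'TRUE');

-- Confirm the change: "Table Type:" should now read EXTERNAL_TABLE.
DESCRIBE FORMATTED employees;

-- Convert back to a managed table if Hive should own the data again.
ALTER TABLE employees SET TBLPROPERTIES ('EXTERNAL' = 'FALSE');
```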

⚖️

Code Comparison

Here is how you create and drop an internal table in Hive:

sql

CREATE TABLE employees (
  id INT,
  name STRING,
  salary FLOAT
);

-- Insert sample data
INSERT INTO TABLE employees VALUES (1, 'Alice', 50000), (2, 'Bob', 60000);

-- Drop the table
DROP TABLE employees;

Output

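In effect (Hive's actual console messages differ): the employees table is created, two rows are inserted, and the DROP TABLE statement removes both the table metadata and its data files.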

↔️

External Table Equivalent

Here is how you create and drop an external table that points to existing data. The example assumes comma-delimited text files already exist under /user/data/employees:

sql

CREATE EXTERNAL TABLE employees_ext (
  id INT,
  name STRING,
  salary FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/data/employees';

-- Drop the table
DROP TABLE employees_ext;

Output

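In effect (Hive's actual console messages differ): the external table employees_ext is created over /user/data/employees, and the DROP TABLE statement removes only the table metadata, so the files at /user/data/employees remain intact.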
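
Because the files survive the drop, the same directory can be registered again and queried immediately. A minimal sketch, where the table name employees_restored is hypothetical and the schema and delimiter are assumed to match the files at that location:

```sql
-- Re-register the same directory under a new (hypothetical) table name.
CREATE EXTERNAL TABLE employees_restored (
  id INT,
  name STRING,
  salary FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/data/employees';

-- The rows are read from the files that survived the earlier DROP TABLE.
SELECT id, name, salary FROM employees_restored LIMIT 10;
```

No data is copied or moved by the CREATE EXTERNAL TABLE statement. Hive only records the schema and the location in its metastore.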

🎯

When to Use Which

Choose internal tables when Hive should fully manage the data lifecycle, including storage and deletion. This is good for temporary or Hive-exclusive datasets.

Choose external tables when data is shared across multiple tools or users, or when data already exists outside Hive. External tables prevent accidental data loss by preserving data on drop.
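
The two table types can also be combined: an external table exposes files that already exist, and a managed table holds a Hive-owned copy. The sketch below assumes an external table named employees_ext is currently defined. The name employees_managed and the choice of ORC as the storage format are illustrative assumptions.

```sql
-- Copy the rows from the external table into a new managed table.
-- Hive writes the new files under its warehouse directory and owns them.
CREATE TABLE employees_managed
STORED AS ORC
AS
SELECT id, name, salary
FROM employees_ext;

-- Dropping the managed copy deletes only the copied files;
-- the original files behind employees_ext are not touched.
DROP TABLE employees_managed;
```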

✅

Key Takeaways

Internal tables store data inside Hive and delete data when dropped.

External tables link to external data and keep data after table drop.

Use internal tables for Hive-managed data lifecycle.

Use external tables to share data or protect data from deletion.

Dropping external tables only removes metadata, not data files.