In Hadoop, tables (typically defined through Hive) organize data stored in files. Whether a table is managed or external determines who controls the underlying data files and what happens to them when the table is dropped.
External vs managed tables in Hadoop
CREATE [EXTERNAL] TABLE table_name (column1 TYPE, column2 TYPE, ...)
STORED AS file_format
[LOCATION 'path_to_data'];
-- For external tables, include the EXTERNAL keyword and the LOCATION clause.
Managed tables normally omit the LOCATION clause; their data is stored in Hadoop's default warehouse directory.
External tables use LOCATION to point to data files that the table does not own.
CREATE TABLE managed_table (
id INT,
name STRING
)
STORED AS PARQUET;
CREATE EXTERNAL TABLE external_table (
id INT,
name STRING
)
STORED AS PARQUET
LOCATION '/user/data/external_table/';
The following example creates one managed and one external table, inserts a row into each, and queries both. The managed table's data is stored in the default warehouse directory. The external table's data is stored at the specified location.
CREATE TABLE managed_employees (
emp_id INT,
emp_name STRING
)
STORED AS TEXTFILE;
CREATE EXTERNAL TABLE external_employees (
emp_id INT,
emp_name STRING
)
STORED AS TEXTFILE
LOCATION '/user/hadoop/external_employees/';
-- After creating, insert some data
INSERT INTO managed_employees VALUES (1, 'Alice');
INSERT INTO external_employees VALUES (2, 'Bob');
-- Query both tables
SELECT * FROM managed_employees;
SELECT * FROM external_employees;
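To confirm which kind of table was created, Hive's DESCRIBE FORMATTED statement can be used. This is a minimal sketch, assuming the two example tables above exist. The output includes a "Table Type" row (MANAGED_TABLE or EXTERNAL_TABLE) and a "Location" row showing where the data files are stored.

```sql
-- Inspect the table metadata; look for the "Table Type" and "Location" rows.
DESCRIBE FORMATTED managed_employees;
DESCRIBE FORMATTED external_employees;
```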
Dropping a managed table (DROP TABLE) deletes both its metadata and its data files.
Dropping an external table deletes only the table metadata; the data files stay at their location.
Use external tables to share data across different systems without moving files.
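The difference in drop behavior can be sketched as follows. This assumes the managed_employees and external_employees tables from the earlier example and default table properties. Some setups change this behavior, for example by enabling purge on external tables.

```sql
-- Dropping the managed table removes its metadata and its data files.
DROP TABLE managed_employees;

-- Dropping the external table removes only the metadata;
-- the files under /user/hadoop/external_employees/ stay in place.
DROP TABLE external_employees;

-- Re-creating the external table over the same location
-- makes the existing rows queryable again without reloading data.
CREATE EXTERNAL TABLE external_employees (
  emp_id INT,
  emp_name STRING
)
STORED AS TEXTFILE
LOCATION '/user/hadoop/external_employees/';

SELECT * FROM external_employees;
```

Because the files survive the drop, the same directory can also be read by other tools or attached to a table in another system.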
Managed tables let Hadoop control data storage and cleanup.
External tables keep data files where you choose and only manage metadata.
Choose based on whether you want Hadoop to manage your data files or not.