How to Use Sqoop for Data Import in Hadoop Easily
Use the sqoop import command to transfer data from relational databases such as MySQL into Hadoop HDFS. Specify the connection details, target directory, and table name in the command to import data efficiently.

Syntax

The basic syntax of the sqoop import command includes the database connection, the target directory in HDFS, and the table to import.
- --connect: JDBC URL of the database
- --username: Database username
- --password: Database password
- --table: Name of the table to import
- --target-dir: HDFS directory to store imported data
- --split-by: Column used to split data for parallel import
```bash
sqoop import \
  --connect jdbc:mysql://hostname:3306/dbname \
  --username user \
  --password pass \
  --table tablename \
  --target-dir /user/hadoop/tablename \
  --split-by id
```
Example
This example imports the employees table from a MySQL database into HDFS directory /user/hadoop/employees. It uses the emp_id column to split the import into parallel tasks.
```bash
sqoop import \
  --connect jdbc:mysql://localhost:3306/companydb \
  --username root \
  --password rootpass \
  --table employees \
  --target-dir /user/hadoop/employees \
  --split-by emp_id
```
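The --split-by column is what drives the parallelism: Sqoop queries the column's minimum and maximum values and divides that range evenly across the mappers, one range per parallel task. A rough sketch of that arithmetic, assuming emp_id runs from 1 to 1000 and the default of 4 mappers (the boundary values here are made up for illustration):

```shell
# Sketch of how Sqoop partitions a numeric --split-by column (assumed min/max).
min=1; max=1000; mappers=4
step=$(( (max - min + 1) / mappers ))   # rows per mapper, roughly

lo=$min
for i in $(seq 1 "$mappers"); do
  if [ "$i" -eq "$mappers" ]; then
    hi=$max                             # last mapper takes any remainder
  else
    hi=$(( lo + step - 1 ))
  fi
  echo "mapper $i: emp_id BETWEEN $lo AND $hi"
  lo=$(( hi + 1 ))
done
```

A skewed or non-unique split column makes these ranges uneven, which is why the pitfalls below recommend a well-distributed unique column.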
Output

```
INFO mapreduce.ImportJobBase: Beginning import of employees
INFO mapreduce.ImportJobBase: Transferred 1000 records
INFO mapreduce.ImportJobBase: Completed import of employees
```
Common Pitfalls

- Not specifying --split-by on a column with unique, well-distributed values can cause the import to run in a single task, slowing down the process.
- An incorrect JDBC URL or wrong credentials will cause connection failures.
- For large tables, not setting --target-dir properly may overwrite existing data.
- For password security, avoid using --password directly; use --password-file instead.
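To act on the --password-file recommendation, the file must contain only the password, ideally with no trailing newline, and should be readable only by its owner. A minimal sketch, where the filename and password are examples (Sqoop can read the file from the local filesystem via a file:// URI or from HDFS):

```shell
# Create a password file readable only by its owner (name and value are examples).
umask 077                          # files created below get mode 600
printf '%s' 'rootpass' > sqoop.pw  # printf avoids a trailing newline

# The file could then be referenced as, e.g.:
#   --password-file file://$PWD/sqoop.pw
# or copied into HDFS first:
#   hdfs dfs -put sqoop.pw /user/hadoop/sqoop.pw
```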
```bash
# Wrong way: password on the command line, no target directory or split column
sqoop import --connect jdbc:mysql://localhost:3306/companydb --username root --password rootpass --table employees

# Right way: password read from a file, explicit target directory and split column
sqoop import --connect jdbc:mysql://localhost:3306/companydb --username root --password-file /path/to/passwordfile --table employees --target-dir /user/hadoop/employees --split-by emp_id
```
Quick Reference
| Option | Description |
|---|---|
| --connect | JDBC URL of the source database |
| --username | Database username |
| --password | Database password (avoid in scripts) |
| --password-file | File containing password for security |
| --table | Name of the table to import |
| --target-dir | HDFS directory to store imported data |
| --split-by | Column to split data for parallel import |
| --num-mappers | Number of parallel tasks (default 4) |
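Pulling the options above together, one way to assemble the command in a script is a bash array, which keeps each flag and its value as separate words and avoids quoting mistakes. The values are the example ones used earlier (the password-file path is assumed); the command is only echoed here, not executed:

```shell
#!/usr/bin/env bash
# Build the sqoop import command as an array (all values are illustrative).
cmd=(sqoop import
  --connect jdbc:mysql://localhost:3306/companydb
  --username root
  --password-file /user/hadoop/sqoop.pw
  --table employees
  --target-dir /user/hadoop/employees
  --split-by emp_id
  --num-mappers 8)

echo "${cmd[@]}"   # inspect the full command; run it later with "${cmd[@]}"
```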
Key Takeaways

- Use sqoop import with correct connection and table details to import data into Hadoop.
- Always specify a unique column with --split-by for faster parallel imports.
- Avoid putting passwords directly in commands; use --password-file for security.
- Set --target-dir to control where data lands in HDFS and prevent overwriting existing data.
- Check connection details carefully to avoid import failures.