Hadoop · How-To · Beginner · 3 min read

How to Use Sqoop for Data Import in Hadoop Easily

Use the sqoop import command to transfer data from a relational database such as MySQL into Hadoop HDFS. Specify the connection details, the source table, and the HDFS target directory, and Sqoop runs the transfer as a parallel MapReduce job.

📝 Syntax

The basic syntax of the sqoop import command includes specifying the database connection, target directory in HDFS, and the table to import.

  • --connect: JDBC URL of the database
  • --username: Database username
  • --password: Database password
  • --table: Name of the table to import
  • --target-dir: HDFS directory to store imported data
  • --split-by: Column used to split data for parallel import
bash
sqoop import \
  --connect jdbc:mysql://hostname:3306/dbname \
  --username user \
  --password pass \
  --table tablename \
  --target-dir /user/hadoop/tablename \
  --split-by id
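
Once the import finishes, it is worth confirming the data landed where you expected. A quick check with standard HDFS commands, assuming the target directory from the syntax above:

bash
# List the files Sqoop wrote; each map task produces one part-m-NNNNN file
hdfs dfs -ls /user/hadoop/tablename

# Peek at the first few imported rows (fields are comma-delimited by default)
hdfs dfs -cat /user/hadoop/tablename/part-m-00000 | head -n 5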

💻 Example

This example imports the employees table from a MySQL database into HDFS directory /user/hadoop/employees. It uses the emp_id column to split the import into parallel tasks.

bash
sqoop import \
  --connect jdbc:mysql://localhost:3306/companydb \
  --username root \
  --password rootpass \
  --table employees \
  --target-dir /user/hadoop/employees \
  --split-by emp_id
Output
INFO mapreduce.ImportJobBase: Beginning import of employees
INFO mapreduce.ImportJobBase: Transferred 1000 records
INFO mapreduce.ImportJobBase: Completed import of employees
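
Sqoop runs an import as four parallel map tasks by default. For a large table you can raise the parallelism with --num-mappers, provided the --split-by column spreads rows evenly. This is a sketch with placeholder values (including the target directory), not a tuned setting:

bash
sqoop import \
  --connect jdbc:mysql://localhost:3306/companydb \
  --username root \
  --password rootpass \
  --table employees \
  --target-dir /user/hadoop/employees_parallel \
  --split-by emp_id \
  --num-mappers 8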

⚠️ Common Pitfalls

  • If the source table has no primary key and you omit --split-by, the import fails unless you force a single map task with -m 1, which serializes the whole transfer; pick a uniformly distributed column instead.
  • Using an incorrect JDBC URL or wrong credentials will cause the connection to fail before any data moves.
  • If the --target-dir already exists in HDFS, the import fails; adding --delete-target-dir lets it proceed but silently replaces the existing data.
  • Avoid passing --password on the command line, where it is visible in shell history and process listings; use --password-file instead, as shown below.
bash
# Wrong way: plaintext password on the command line, no target directory or split column
sqoop import --connect jdbc:mysql://localhost:3306/companydb --username root --password rootpass --table employees

# Right way: password read from a protected file, explicit target directory and split column
sqoop import --connect jdbc:mysql://localhost:3306/companydb --username root --password-file /path/to/passwordfile --table employees --target-dir /user/hadoop/employees --split-by emp_id
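
The password file itself needs care: Sqoop reads the file's entire contents as the password, so a trailing newline will break authentication, and the file should be readable only by its owner. A minimal setup sketch, with placeholder paths:

bash
# Write the password without a trailing newline (-n) and lock down permissions
echo -n "rootpass" > /home/hadoop/.mysql-pass
chmod 400 /home/hadoop/.mysql-pass

# --password-file also accepts an HDFS path, which works across cluster nodes
hdfs dfs -put /home/hadoop/.mysql-pass /user/hadoop/.mysql-pass
hdfs dfs -chmod 400 /user/hadoop/.mysql-pass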

📊 Quick Reference

Option             Description
--connect          JDBC URL of the source database
--username         Database username
--password         Database password (avoid in scripts)
--password-file    File containing the password (more secure)
--table            Name of the table to import
--target-dir       HDFS directory to store imported data
--split-by         Column used to split data for parallel import
--num-mappers      Number of parallel map tasks (default 4)
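
Putting the reference options together, a complete command might look like the following; the host, user, and paths are placeholders for your environment:

bash
sqoop import \
  --connect jdbc:mysql://dbhost:3306/companydb \
  --username etl_user \
  --password-file /user/hadoop/.mysql-pass \
  --table employees \
  --target-dir /user/hadoop/employees \
  --split-by emp_id \
  --num-mappers 8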

✅ Key Takeaways

  • Use sqoop import with correct connection and table details to move data into HDFS.
  • Specify a uniformly distributed column with --split-by so the import runs in parallel.
  • Avoid putting passwords directly in commands; use --password-file instead.
  • Set --target-dir to control where data lands in HDFS; an existing directory will make the import fail.
  • Double-check the JDBC URL and credentials to avoid connection failures.