How to Read Data from Database Using PySpark
To read data from a database in PySpark, use the jdbc data source via
spark.read.format('jdbc'), supplying the database URL, the table name, and connection properties such as the user and password. This loads the database table into a PySpark DataFrame for analysis.
Syntax
The basic syntax for reading from a database in PySpark uses the jdbc format. You specify the database connection URL, the table to read, and connection properties such as the username and password.
- url: The JDBC URL of the database.
- dbtable: The table name or query to read.
- properties: A dictionary with connection details like user and password.
```python
df = spark.read.format('jdbc') \
    .option('url', 'jdbc:postgresql://host:port/database') \
    .option('dbtable', 'schema.table_name') \
    .option('user', 'username') \
    .option('password', 'password') \
    .load()
```
Example
This example shows how to read a table named employees from a PostgreSQL database into a PySpark DataFrame and display the first 5 rows.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('ReadFromDB').getOrCreate()

jdbc_url = 'jdbc:postgresql://localhost:5432/mydatabase'
connection_properties = {
    'user': 'myuser',
    'password': 'mypassword',
    'driver': 'org.postgresql.Driver'
}

df = spark.read.format('jdbc') \
    .option('url', jdbc_url) \
    .option('dbtable', 'public.employees') \
    .options(**connection_properties) \
    .load()

# Show first 5 rows
df.show(5)
```
Output
+---+-------+------+
| id|   name|salary|
+---+-------+------+
|  1|  Alice| 50000|
|  2|    Bob| 60000|
|  3|Charlie| 55000|
|  4|  David| 70000|
|  5|   Emma| 65000|
+---+-------+------+
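The chained option() calls above can also be written with the DataFrameReader.jdbc() shorthand, which takes the URL, table name, and a properties dictionary directly. The sketch below assumes the same placeholder connection details as the example; the pyspark import is deferred so the URL helper can be used on its own.

```python
def postgres_url(host, port, database):
    # Build a JDBC URL in the form the PostgreSQL driver expects.
    return f'jdbc:postgresql://{host}:{port}/{database}'

def read_table(table, jdbc_url, properties):
    # Deferred import: only needed once a Spark session is actually built.
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName('ReadFromDB').getOrCreate()
    # spark.read.jdbc() is equivalent to format('jdbc') with the same options.
    return spark.read.jdbc(url=jdbc_url, table=table, properties=properties)

# Example call (placeholder credentials):
# df = read_table('public.employees',
#                 postgres_url('localhost', 5432, 'mydatabase'),
#                 {'user': 'myuser', 'password': 'mypassword',
#                  'driver': 'org.postgresql.Driver'})
```

Both forms produce the same DataFrame; the shorthand is convenient when the connection properties are already collected in a dictionary.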
Common Pitfalls
Common mistakes when reading from databases in PySpark include:
- Not specifying the correct JDBC driver in the driver option.
- Using the wrong JDBC URL format for the database.
- Forgetting to include user and password in connection properties.
- Trying to read a table that does not exist or misspelling the table name.
- Not having the JDBC driver jar file available in Spark's classpath.
Always verify your connection details and ensure the driver is accessible.
```python
# This will raise an error because the table does not exist:
wrong_df = spark.read.format('jdbc') \
    .option('url', 'jdbc:postgresql://localhost:5432/mydatabase') \
    .option('dbtable', 'wrong_table_name') \
    .option('user', 'myuser') \
    .option('password', 'mypassword') \
    .load()

# Correct way:
correct_df = spark.read.format('jdbc') \
    .option('url', 'jdbc:postgresql://localhost:5432/mydatabase') \
    .option('dbtable', 'public.employees') \
    .option('user', 'myuser') \
    .option('password', 'mypassword') \
    .load()
```
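The last pitfall, a missing driver jar, is usually addressed when the SparkSession is built. One common approach, sketched below with a hypothetical jar path, is the spark.jars config; passing --jars to spark-submit achieves the same thing.

```python
# Hypothetical path -- point this at the driver jar you downloaded.
DRIVER_JAR = '/path/to/postgresql-driver.jar'

def build_session(jar_path):
    # Deferred import: only needed once a session is actually created.
    from pyspark.sql import SparkSession
    # spark.jars puts the driver jar on the classpath of driver and executors.
    return (SparkSession.builder
            .appName('ReadFromDB')
            .config('spark.jars', jar_path)
            .getOrCreate())

# Equivalent from the command line:
#   spark-submit --jars /path/to/postgresql-driver.jar my_script.py
```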
Quick Reference
Here is a quick cheat sheet for reading from databases in PySpark:
| Option | Description | Example |
|---|---|---|
| format | Data source format | 'jdbc' |
| url | JDBC connection string | 'jdbc:postgresql://host:port/db' |
| dbtable | Table name or query | 'schema.table' or '(select * from table) as alias' |
| user | Database username | 'myuser' |
| password | Database password | 'mypassword' |
| driver | JDBC driver class | 'org.postgresql.Driver' |
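As the dbtable row notes, you can push a query down to the database by wrapping it in parentheses and giving it an alias. A small sketch, reusing the employees table from the example above (the helper names are illustrative):

```python
def salary_filter_query(min_salary):
    # JDBC requires a subquery to be parenthesized and aliased.
    return ('(SELECT id, name, salary FROM public.employees '
            f'WHERE salary >= {min_salary}) AS filtered_employees')

def read_query(spark, jdbc_url, query, user, password):
    # The subquery goes where a plain table name would normally go.
    return (spark.read.format('jdbc')
            .option('url', jdbc_url)
            .option('dbtable', query)
            .option('user', user)
            .option('password', password)
            .load())

# Example (assumes the spark session and credentials from earlier):
# df = read_query(spark, jdbc_url, salary_filter_query(60000),
#                 'myuser', 'mypassword')
```

Because the database executes the subquery, only the filtered rows cross the network, which can be much cheaper than loading the whole table and filtering in Spark.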
Key Takeaways
- Use spark.read.format('jdbc') with the url, dbtable, user, and password options to read from databases.
- Always specify the correct JDBC driver and ensure its jar is on Spark's classpath.
- Check your database URL and table names carefully to avoid errors.
- You can load any table or query result as a PySpark DataFrame for analysis.
- Common errors come from missing drivers, wrong URLs, or incorrect table names.