
How to Read Data from Database Using PySpark

To read data from a database in PySpark, use the spark.read.format('jdbc') method with the database URL, table name, and connection properties like user and password. This loads the database table into a PySpark DataFrame for analysis.

Syntax

The basic syntax to read from a database in PySpark uses the jdbc format. You specify the database connection URL, the table to read, and connection properties such as username and password.

  • url: The JDBC URL of the database.
  • dbtable: The table name to read, or a SQL query wrapped in parentheses and given an alias.
  • properties: A dictionary with connection details like user and password.
python
df = spark.read.format('jdbc') \
    .option('url', 'jdbc:postgresql://host:port/database') \
    .option('dbtable', 'schema.table_name') \
    .option('user', 'username') \
    .option('password', 'password') \
    .load()
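Because dbtable accepts a query as well as a table name, you can push filtering down to the database so only matching rows reach Spark. A minimal sketch of the two string forms; the table, columns, and threshold here are illustrative assumptions:

```python
# Sketch: two forms the 'dbtable' option accepts.
# (Table, column, and filter values are assumptions for illustration.)
table_option = 'public.employees'

# A subquery must be parenthesized and aliased so the database
# treats it as a derived table:
query_option = ('(SELECT id, name, salary FROM public.employees '
                'WHERE salary > 60000) AS high_earners')

# Either string can then be passed to the reader, e.g.:
# spark.read.format('jdbc').option('dbtable', query_option)...
```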

Example

This example shows how to read a table named employees from a PostgreSQL database into a PySpark DataFrame and display the first 5 rows.

python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('ReadFromDB').getOrCreate()

jdbc_url = 'jdbc:postgresql://localhost:5432/mydatabase'
connection_properties = {
    'user': 'myuser',
    'password': 'mypassword',
    'driver': 'org.postgresql.Driver'
}

df = spark.read.format('jdbc') \
    .option('url', jdbc_url) \
    .option('dbtable', 'public.employees') \
    .options(**connection_properties) \
    .load()

# Show first 5 rows
df.show(5)
Output
+-------+----------+-------+
|     id|      name| salary|
+-------+----------+-------+
|      1|     Alice|  50000|
|      2|       Bob|  60000|
|      3|   Charlie|  55000|
|      4|     David|  70000|
|      5|      Emma|  65000|
+-------+----------+-------+
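For large tables, a single JDBC connection can become a bottleneck. Spark's JDBC reader can split the read across parallel connections when you supply a numeric partition column and its bounds. A sketch of the extra options; the column name and bound values below are assumptions:

```python
# Sketch: options for a parallel JDBC read (values are assumptions).
# Spark opens numPartitions connections, each reading a slice of the
# partitionColumn range between lowerBound and upperBound.
partition_options = {
    'partitionColumn': 'id',  # numeric, date, or timestamp column
    'lowerBound': '1',
    'upperBound': '100000',
    'numPartitions': '8',
}

# Merged into the reader alongside url/user/password, e.g.:
# spark.read.format('jdbc').options(**partition_options)...
```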

Common Pitfalls

Common mistakes when reading from databases in PySpark include:

  • Not specifying the correct JDBC driver class in the driver option.
  • Using the wrong JDBC URL format for the database.
  • Forgetting to include user and password in the connection properties.
  • Trying to read a table that does not exist, or misspelling the table name.
  • Not having the JDBC driver jar on Spark's classpath.

Always verify your connection details and ensure the driver is accessible.
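One common way to make the driver accessible is at submit time. The driver version (42.7.3) and script name below are assumptions; match them to your database server and your own script:

```shell
# Pull the PostgreSQL JDBC driver from Maven at submit time
# (version 42.7.3 is an assumption; pick one matching your server):
spark-submit --packages org.postgresql:postgresql:42.7.3 read_from_db.py

# Or point Spark at a driver jar you already downloaded:
spark-submit --jars /path/to/postgresql-42.7.3.jar read_from_db.py
```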

python
wrong_df = spark.read.format('jdbc') \
    .option('url', 'jdbc:postgresql://localhost:5432/mydatabase') \
    .option('dbtable', 'wrong_table_name') \
    .option('user', 'myuser') \
    .option('password', 'mypassword') \
    .load()

# This will raise an error because the table does not exist.

# Correct way:
correct_df = spark.read.format('jdbc') \
    .option('url', 'jdbc:postgresql://localhost:5432/mydatabase') \
    .option('dbtable', 'public.employees') \
    .option('user', 'myuser') \
    .option('password', 'mypassword') \
    .load()

Quick Reference

Here is a quick cheat sheet for reading from databases in PySpark:

| Option   | Description            | Example                                            |
|----------|------------------------|----------------------------------------------------|
| format   | Data source format     | 'jdbc'                                             |
| url      | JDBC connection string | 'jdbc:postgresql://host:port/db'                   |
| dbtable  | Table name or query    | 'schema.table' or '(select * from table) as alias' |
| user     | Database username      | 'myuser'                                           |
| password | Database password      | 'mypassword'                                       |
| driver   | JDBC driver class      | 'org.postgresql.Driver'                            |

Key Takeaways

  • Use spark.read.format('jdbc') with the url, dbtable, user, and password options to read from databases.
  • Always specify the correct JDBC driver and ensure it is on Spark's classpath.
  • Check your database URL and table names carefully to avoid errors.
  • You can load any table or query result as a PySpark DataFrame for analysis.
  • Common errors come from missing drivers, wrong URLs, or incorrect table names.