How to Write Data to Database Using PySpark
To write data to a database in PySpark, use the
DataFrame.write method with the jdbc format. Specify the database URL, the target table name, connection properties such as user and password, and the write mode, such as append or overwrite.
Syntax
The basic syntax to write a PySpark DataFrame to a database is:
- df.write: Starts the write operation on the DataFrame.
- .format('jdbc'): Specifies the JDBC format for the database connection.
- .option('url', 'jdbc:db_type://host:port/db_name'): Sets the database connection URL.
- .option('dbtable', 'table_name'): Defines the target table in the database.
- .option('user', 'username') and .option('password', 'password'): Provide the database credentials.
- .mode('append' or 'overwrite'): Chooses how to write the data (add to or replace the table).
- .save(): Executes the write operation.
```python
df.write.format('jdbc') \
    .option('url', 'jdbc:postgresql://localhost:5432/mydb') \
    .option('dbtable', 'my_table') \
    .option('user', 'myuser') \
    .option('password', 'mypassword') \
    .option('driver', 'org.postgresql.Driver') \
    .mode('append') \
    .save()
```
Example
This example shows how to create a simple DataFrame and write it to a PostgreSQL database table named people. It uses append mode to add data without deleting existing rows.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('WriteToDB').getOrCreate()

# Sample data
data = [(1, 'Alice'), (2, 'Bob'), (3, 'Charlie')]
columns = ['id', 'name']

# Create DataFrame
df = spark.createDataFrame(data, columns)

# Write DataFrame to PostgreSQL database
jdbc_url = 'jdbc:postgresql://localhost:5432/mydb'
connection_properties = {
    'user': 'myuser',
    'password': 'mypassword',
    'driver': 'org.postgresql.Driver'
}
df.write.jdbc(url=jdbc_url, table='people', mode='append', properties=connection_properties)

spark.stop()
```
Output
The script itself prints nothing to the console. On success, the three rows are added to the 'people' table in the database.
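To confirm the write actually landed, you can read the table back with spark.read.jdbc. A minimal sketch, assuming the same jdbc_url and connection_properties from the example above are still in scope (and that spark.stop() has not yet been called):

```python
# Read the 'people' table back to verify the write
# (assumes jdbc_url and connection_properties from the example above)
people_df = spark.read.jdbc(url=jdbc_url, table='people',
                            properties=connection_properties)
people_df.show()
```

This round-trip check is also a quick way to confirm that the URL, credentials, and driver are all configured correctly before running a larger job.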
Common Pitfalls
- Missing JDBC driver: You must have the correct JDBC driver jar file for your database in Spark's classpath.
- Wrong URL format: The JDBC URL must match your database type and connection details exactly.
- Incorrect write mode: overwrite deletes existing data; use append to add data safely.
- Not specifying the driver: Some databases require explicitly setting the driver option.
- Authentication errors: Check the username and password carefully.
```python
# Wrong way (missing driver, and overwrite mode replaces existing data):
df.write.format('jdbc') \
    .option('url', 'jdbc:mysql://localhost:3306/mydb') \
    .option('dbtable', 'my_table') \
    .option('user', 'user') \
    .option('password', 'pass') \
    .mode('overwrite') \
    .save()

# Right way (include the driver and use append mode):
df.write.format('jdbc') \
    .option('url', 'jdbc:mysql://localhost:3306/mydb') \
    .option('dbtable', 'my_table') \
    .option('user', 'user') \
    .option('password', 'pass') \
    .option('driver', 'com.mysql.cj.jdbc.Driver') \
    .mode('append') \
    .save()
```
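The "missing JDBC driver" pitfall is usually fixed when launching Spark rather than in the code. A sketch of two common ways to put the driver on Spark's classpath with spark-submit (the jar path, version number, and script name are examples, not fixed values):

```shell
# Option 1: pass a locally downloaded driver jar (path is an example)
spark-submit --jars /path/to/postgresql-42.7.3.jar my_script.py

# Option 2: let Spark resolve the driver from Maven (version is an example)
spark-submit --packages org.postgresql:postgresql:42.7.3 my_script.py
```

The same flags work with pyspark when launching an interactive shell.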
Quick Reference
Remember these key points when writing to a database with PySpark:
- Use df.write.format('jdbc') to specify JDBC.
- Set the url, dbtable, user, and password options.
- Include the driver option if needed.
- Choose the write mode carefully: append to add, overwrite to replace.
- Ensure the JDBC driver jar is on Spark's classpath.
Key Takeaways
- Use DataFrame.write.format('jdbc') with the proper options to write data to a database.
- Always specify the correct JDBC URL, table name, user, and password.
- Include the JDBC driver option if your database requires it.
- Choose the write mode carefully to avoid unwanted data loss.
- Make sure the JDBC driver jar is available in Spark's environment.