How to Write Data to Database Using PySpark
To write data to a database in PySpark, use the
DataFrame.write method with the jdbc format. Specify the database URL, the target table name, connection properties such as user and password, and the write mode, such as append or overwrite.
Syntax
The basic syntax to write a PySpark DataFrame to a database is:
- df.write: Starts the write operation on the DataFrame.
- .format('jdbc'): Specifies the JDBC format for the database connection.
- .option('url', 'jdbc:db_type://host:port/db_name'): Sets the database connection URL.
- .option('dbtable', 'table_name'): Defines the target table in the database.
- .option('user', 'username') and .option('password', 'password'): Provide the database credentials.
- .mode('append' or 'overwrite'): Chooses how to write the data (add to or replace the table).
- .save(): Executes the write operation.
```python
df.write.format('jdbc') \
    .option('url', 'jdbc:postgresql://localhost:5432/mydb') \
    .option('dbtable', 'my_table') \
    .option('user', 'myuser') \
    .option('password', 'mypassword') \
    .option('driver', 'org.postgresql.Driver') \
    .mode('append') \
    .save()
```
Example
This example shows how to create a simple DataFrame and write it to a PostgreSQL database table named people. It uses append mode to add data without deleting existing rows.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('WriteToDB').getOrCreate()

# Sample data
data = [(1, 'Alice'), (2, 'Bob'), (3, 'Charlie')]
columns = ['id', 'name']

# Create DataFrame
df = spark.createDataFrame(data, columns)

# Write DataFrame to PostgreSQL database
jdbc_url = 'jdbc:postgresql://localhost:5432/mydb'
connection_properties = {
    'user': 'myuser',
    'password': 'mypassword',
    'driver': 'org.postgresql.Driver'
}
df.write.jdbc(url=jdbc_url, table='people', mode='append', properties=connection_properties)

spark.stop()
```
Output
The script itself prints nothing to the console. On success, the three rows are added to the 'people' table in the database.
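To confirm the write actually landed, you can read the table back with spark.read.jdbc. A minimal sketch, assuming the same jdbc_url and connection_properties from the example above are still in scope (and that spark.stop() has not yet been called):

```python
# Read the 'people' table back to verify the write
# (assumes jdbc_url and connection_properties from the example above)
people_df = spark.read.jdbc(url=jdbc_url, table='people',
                            properties=connection_properties)
people_df.show()
```

This round-trip check is also a quick way to confirm that the URL, credentials, and driver are all configured correctly before running a larger job.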
Common Pitfalls
- Missing JDBC driver: You must have the correct JDBC driver jar file for your database in Spark's classpath.
- Wrong URL format: The JDBC URL must match your database type and connection details exactly.
- Incorrect write mode: overwrite deletes existing data; use append to add data safely.
- Not specifying the driver: Some databases require explicitly setting the driver option.
- Authentication errors: Check the username and password carefully.
```python
# Wrong way (missing driver, and overwrite mode replaces existing data):
df.write.format('jdbc') \
    .option('url', 'jdbc:mysql://localhost:3306/mydb') \
    .option('dbtable', 'my_table') \
    .option('user', 'user') \
    .option('password', 'pass') \
    .mode('overwrite') \
    .save()

# Right way (include the driver and use append mode):
df.write.format('jdbc') \
    .option('url', 'jdbc:mysql://localhost:3306/mydb') \
    .option('dbtable', 'my_table') \
    .option('user', 'user') \
    .option('password', 'pass') \
    .option('driver', 'com.mysql.cj.jdbc.Driver') \
    .mode('append') \
    .save()
```
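The "missing JDBC driver" pitfall is usually fixed when launching Spark rather than in the code. A sketch of two common ways to put the driver on Spark's classpath with spark-submit (the jar path, version number, and script name are examples, not fixed values):

```shell
# Option 1: pass a locally downloaded driver jar (path is an example)
spark-submit --jars /path/to/postgresql-42.7.3.jar my_script.py

# Option 2: let Spark resolve the driver from Maven (version is an example)
spark-submit --packages org.postgresql:postgresql:42.7.3 my_script.py
```

The same flags work with pyspark when launching an interactive shell.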
Quick Reference
Remember these key points when writing to a database with PySpark:
- Use df.write.format('jdbc') to specify JDBC.
- Set the url, dbtable, user, and password options.
- Include the driver option if needed.
- Choose the write mode carefully: append to add, overwrite to replace.
- Ensure the JDBC driver jar is on Spark's classpath.
Key Takeaways
- Use DataFrame.write.format('jdbc') with the proper options to write data to a database.
- Always specify the correct JDBC URL, table name, user, and password.
- Include the JDBC driver option if your database requires it.
- Choose the write mode carefully to avoid unwanted data loss.
- Make sure the JDBC driver jar is available in Spark's environment.