How to Write CSV Files in PySpark: Syntax and Examples
In PySpark, you write a CSV file using the DataFrame.write.csv() method. You specify the output path and can add options such as header=True to include column names in the file.
Syntax
The basic syntax to write a CSV file in PySpark is:
DataFrame.write.csv(path, mode, header, sep)
- path: The folder path where the CSV files will be saved.
- mode: How to handle existing data (e.g., 'overwrite', 'append').
- header: Whether to write column names as the first row (True/False).
- sep: The delimiter character; the default is a comma.
```python
df.write.csv(path='output_folder', mode='overwrite', header=True, sep=',')
```
Example
This example creates a simple DataFrame and writes it as a CSV file with headers included. It overwrites any existing data in the output folder.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('WriteCSVExample').getOrCreate()

# Create sample data
data = [(1, 'Alice', 29), (2, 'Bob', 31), (3, 'Cathy', 25)]
columns = ['id', 'name', 'age']

# Create DataFrame
df = spark.createDataFrame(data, columns)

# Write DataFrame to CSV
output_path = 'output_csv'
df.write.csv(path=output_path, mode='overwrite', header=True)

spark.stop()
```
Output
The DataFrame is saved as one or more CSV part files inside the 'output_csv' folder, each including a header row.
Common Pitfalls
Common mistakes when writing CSV in PySpark include:
- Not setting header=True when you want column names in the CSV.
- Using mode='error' (the default), which fails if the output folder already exists.
- Expecting a single CSV file; PySpark writes multiple part files by default.
- Not specifying the correct path or permissions for the output folder.
Example of a wrong and right way:
```python
# Wrong: no header, default mode (fails if the folder already exists)
df.write.csv('output_csv')

# Right: include header and overwrite existing data
df.write.csv('output_csv', mode='overwrite', header=True)
```
Quick Reference
| Option | Description | Example |
|---|---|---|
| path | Folder path to save CSV files | 'output_folder' |
| mode | Write mode: 'overwrite', 'append', 'ignore', 'error' (default) | 'overwrite' |
| header | Write column names as first row (True/False) | True |
| sep | Field delimiter character | ',' |
| quote | Character for quoting fields | "'" |
| escape | Character to escape quotes inside fields | '\' |
Key Takeaways
- Use DataFrame.write.csv() with a path and header=True to save CSV with column names.
- Set mode='overwrite' to replace existing files and avoid errors.
- PySpark writes multiple part files by default, not a single CSV file.
- Always check the output folder path and permissions before writing.
- Use options like sep and quote to customize the CSV format.