
How to Write CSV Files in PySpark: Syntax and Examples

In PySpark, you write a DataFrame to CSV using the DataFrame.write.csv() method. You specify the output path (a folder, not a single file) and can add options like header=True to include column names in the output.

Syntax

The basic syntax to write a CSV file in PySpark is:

  • DataFrame.write.csv(path, mode=None, header=None, sep=None)
  • path: The folder path where the CSV part files will be saved.
  • mode: How to handle existing data ('overwrite', 'append', 'ignore', 'error').
  • header: Whether to write column names as the first row (True/False); defaults to False.
  • sep: The field delimiter character; defaults to ','.
```python
df.write.csv(path='output_folder', mode='overwrite', header=True, sep=',')
```

Example

This example creates a simple DataFrame and writes it as a CSV file with headers included. It overwrites any existing data in the output folder.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('WriteCSVExample').getOrCreate()

# Create sample data
data = [(1, 'Alice', 29), (2, 'Bob', 31), (3, 'Cathy', 25)]
columns = ['id', 'name', 'age']

# Create DataFrame
df = spark.createDataFrame(data, columns)

# Write DataFrame to CSV
output_path = 'output_csv'
df.write.csv(path=output_path, mode='overwrite', header=True)

spark.stop()
```

Output

Spark creates a folder named 'output_csv' containing one or more part files; each part file is a CSV that includes the header row.

Common Pitfalls

Common mistakes when writing CSV in PySpark include:

  • Not setting header=True when you want column names in the CSV.
  • Relying on the default mode ('error', also called 'errorifexists'), which fails if the output folder already exists.
  • Expecting a single CSV file; PySpark writes a folder of part files by default.
  • Pointing at an output path that is wrong or that you lack write permissions for.

Example of a wrong and right way:

```python
# Wrong: no header, default mode (fails if the folder exists)
df.write.csv('output_csv')

# Right: include header and overwrite existing data
df.write.csv('output_csv', mode='overwrite', header=True)
```

Quick Reference

| Option | Description | Example |
|--------|-------------|---------|
| path | Folder path to save CSV files | 'output_folder' |
| mode | Write mode: 'overwrite', 'append', 'ignore', 'error' (default) | 'overwrite' |
| header | Write column names as first row (True/False) | True |
| sep | Field delimiter character | ',' |
| quote | Character for quoting fields | "'" |
| escape | Character to escape quotes inside fields | '\' |

Key Takeaways

  • Use DataFrame.write.csv() with path and header=True to save CSV with column names.
  • Set mode='overwrite' to replace existing files and avoid errors.
  • PySpark writes multiple part files by default, not a single CSV file.
  • Always check the output folder path and permissions before writing.
  • Use options like sep and quote to customize the CSV format.