
How to Write CSV Files in PySpark: Syntax and Examples

In PySpark, you write a DataFrame to CSV using the DataFrame.write.csv() method. You specify the output path (a folder, not a single file) and can add options like header=True to include column names in the output.

Syntax

The basic syntax to write a CSV file in PySpark is:

  • DataFrame.write.csv(path, mode=None, header=None, sep=None)
  • path: The folder path where the CSV part files will be saved.
  • mode: How to handle existing data ('overwrite', 'append', 'ignore', 'error').
  • header: Whether to write column names as the first row (True/False); defaults to False.
  • sep: The field delimiter character; defaults to ','.
```python
df.write.csv(path='output_folder', mode='overwrite', header=True, sep=',')
```

Example

This example creates a simple DataFrame and writes it as a CSV file with headers included. It overwrites any existing data in the output folder.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('WriteCSVExample').getOrCreate()

# Create sample data
data = [(1, 'Alice', 29), (2, 'Bob', 31), (3, 'Cathy', 25)]
columns = ['id', 'name', 'age']

# Create DataFrame
df = spark.createDataFrame(data, columns)

# Write DataFrame to CSV
output_path = 'output_csv'
df.write.csv(path=output_path, mode='overwrite', header=True)

spark.stop()
```

Output

Spark creates a folder named 'output_csv' containing one or more part files; each part file is a CSV that includes the header row.

Common Pitfalls

Common mistakes when writing CSV in PySpark include:

  • Not setting header=True when you want column names in the CSV.
  • Relying on the default mode ('error', also called 'errorifexists'), which fails if the output folder already exists.
  • Expecting a single CSV file; PySpark writes a folder of part files by default.
  • Pointing at an output path that is wrong or that you lack write permissions for.

Example of a wrong and right way:

```python
# Wrong: no header, default mode (fails if the folder exists)
df.write.csv('output_csv')

# Right: include header and overwrite existing data
df.write.csv('output_csv', mode='overwrite', header=True)
```

Quick Reference

| Option | Description | Example |
|--------|-------------|---------|
| path | Folder path to save CSV files | 'output_folder' |
| mode | Write mode: 'overwrite', 'append', 'ignore', 'error' (default) | 'overwrite' |
| header | Write column names as first row (True/False) | True |
| sep | Field delimiter character | ',' |
| quote | Character for quoting fields | "'" |
| escape | Character to escape quotes inside fields | '\' |

Key Takeaways

  • Use DataFrame.write.csv() with path and header=True to save CSV with column names.
  • Set mode='overwrite' to replace existing files and avoid errors.
  • PySpark writes multiple part files by default, not a single CSV file.
  • Always check the output folder path and permissions before writing.
  • Use options like sep and quote to customize the CSV format.