
How to Read CSV Files in PySpark: Simple Guide

To read a CSV file in PySpark, use spark.read.csv(path, options). This loads the CSV into a DataFrame, where you can specify options like header=True to use the first row as column names and inferSchema=True to detect data types automatically.

Syntax

The basic syntax to read a CSV file in PySpark is:

  • spark.read.csv(path, options): Reads the CSV file from the given path.
  • path: The location of the CSV file (local or distributed storage).
  • options: Optional parameters like header and inferSchema.

Common options include:

  • header=True: Treats the first row as column names.
  • inferSchema=True: Automatically detects data types.
  • sep=',': Defines the delimiter (default is comma).
```python
df = spark.read.csv('path/to/file.csv', header=True, inferSchema=True)
```

Example

This example shows how to read a CSV file with a header and inferred schema, then display the data.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('ReadCSVExample').getOrCreate()

# Read CSV file with header and schema inference
file_path = 'example.csv'
df = spark.read.csv(file_path, header=True, inferSchema=True)

df.show()

spark.stop()
```
Output

```
+---+-------+-----+
| id|   name|score|
+---+-------+-----+
|  1|  Alice| 85.0|
|  2|    Bob| 90.5|
|  3|Charlie| 78.0|
+---+-------+-----+
```

Common Pitfalls

Some common mistakes when reading CSV files in PySpark include:

  • Not setting header=True when the CSV has column names, causing the first row to be treated as data.
  • Forgetting inferSchema=True, which results in all columns being read as strings.
  • Using the wrong file path, which raises a file-not-found error at read time.
  • Not specifying the correct delimiter if the CSV uses a separator other than a comma.

Here is an example of the wrong and the right way:

```python
# Wrong way: header missing, schema not inferred
wrong_df = spark.read.csv('example.csv')
wrong_df.show()

# Right way: header and schema specified
right_df = spark.read.csv('example.csv', header=True, inferSchema=True)
right_df.show()
```
Output

```
+-------+
|    _c0|
+-------+
|     id|
|      1|
|      2|
|      3|
+-------+

+---+-------+-----+
| id|   name|score|
+---+-------+-----+
|  1|  Alice| 85.0|
|  2|    Bob| 90.5|
|  3|Charlie| 78.0|
+---+-------+-----+
```

Quick Reference

| Option | Description | Example |
|---|---|---|
| header | Use first row as column names | `header=True` |
| inferSchema | Detect data types automatically | `inferSchema=True` |
| sep | Set delimiter character | `sep=';'` |
| mode | Handle corrupt records (e.g., 'DROPMALFORMED') | `mode='DROPMALFORMED'` |
| encoding | Set file encoding | `encoding='UTF-8'` |

Key Takeaways

  • Use spark.read.csv with header=True and inferSchema=True to read CSV files properly.
  • Always check the file path and delimiter to avoid read errors.
  • Without inferSchema, all columns are read as strings by default.
  • Setting header=True ensures the first row is used as column names.
  • Use options like mode and encoding to handle special CSV cases.