
How to Read CSV Files in PySpark: Simple Guide

To read a CSV file in PySpark, use spark.read.csv(path, options). This loads the CSV into a DataFrame, where you can specify options like header=True to use the first row as column names and inferSchema=True to detect data types automatically.

Syntax

The basic syntax to read a CSV file in PySpark is:

  • spark.read.csv(path, options): Reads the CSV file from the given path.
  • path: The location of the CSV file (local or distributed storage).
  • options: Optional parameters like header and inferSchema.

Common options include:

  • header=True: Treats the first row as column names.
  • inferSchema=True: Automatically detects data types.
  • sep=',': Defines the delimiter (default is comma).
```python
df = spark.read.csv('path/to/file.csv', header=True, inferSchema=True)
```

Example

This example shows how to read a CSV file with a header and inferred schema, then display the data.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('ReadCSVExample').getOrCreate()

# Read CSV file with header and schema inference
file_path = 'example.csv'
df = spark.read.csv(file_path, header=True, inferSchema=True)

df.show()

spark.stop()
```
Output

```
+---+-------+-----+
| id|   name|score|
+---+-------+-----+
|  1|  Alice| 85.0|
|  2|    Bob| 90.5|
|  3|Charlie| 78.0|
+---+-------+-----+
```

Common Pitfalls

Some common mistakes when reading CSV files in PySpark include:

  • Not setting header=True when the CSV has column names, causing the first row to be treated as data.
  • Forgetting inferSchema=True, which results in all columns being read as strings.
  • Using the wrong file path, which raises a file-not-found error at read time.
  • Not specifying the correct delimiter if the CSV uses a separator other than a comma.

Here is an example of the wrong and the right way:

```python
# Wrong way: header missing, schema not inferred
wrong_df = spark.read.csv('example.csv')
wrong_df.show()

# Right way: header and schema specified
right_df = spark.read.csv('example.csv', header=True, inferSchema=True)
right_df.show()
```
Output

```
+-------+
|    _c0|
+-------+
|     id|
|      1|
|      2|
|      3|
+-------+

+---+-------+-----+
| id|   name|score|
+---+-------+-----+
|  1|  Alice| 85.0|
|  2|    Bob| 90.5|
|  3|Charlie| 78.0|
+---+-------+-----+
```

Quick Reference

| Option | Description | Example |
|---|---|---|
| header | Use first row as column names | `header=True` |
| inferSchema | Detect data types automatically | `inferSchema=True` |
| sep | Set delimiter character | `sep=';'` |
| mode | Handle corrupt records (e.g., 'DROPMALFORMED') | `mode='DROPMALFORMED'` |
| encoding | Set file encoding | `encoding='UTF-8'` |

Key Takeaways

  • Use spark.read.csv with header=True and inferSchema=True to read CSV files properly.
  • Always check the file path and delimiter to avoid read errors.
  • Without inferSchema, all columns are read as strings by default.
  • Setting header=True ensures the first row is used as column names.
  • Use options like mode and encoding to handle special CSV cases.