How to Create DataFrame from CSV in PySpark: Simple Guide
To create a DataFrame from a CSV file in PySpark, use
spark.read.csv() with the file path and options like header=True to read the first row as column names. This returns a DataFrame you can use for analysis.

Syntax
The basic syntax to create a DataFrame from a CSV file in PySpark is:
spark.read.csv(path, header=True, inferSchema=True)

- path: The location of the CSV file.
- header=True: Treats the first row as column names.
- inferSchema=True: Automatically detects the data types of columns.
```python
df = spark.read.csv('path/to/file.csv', header=True, inferSchema=True)
```
Example
This example shows how to create a Spark session, read a CSV file into a DataFrame, and display its content.
```python
from pyspark.sql import SparkSession

# Create Spark session
spark = SparkSession.builder.appName('CSVExample').getOrCreate()

# Read CSV file into DataFrame
file_path = 'example.csv'
df = spark.read.csv(file_path, header=True, inferSchema=True)

# Show DataFrame content
df.show()
```
Output
+---+-------+-----+
| id|   name|score|
+---+-------+-----+
|  1|  Alice|   85|
|  2|    Bob|   90|
|  3|Charlie|   78|
+---+-------+-----+
Common Pitfalls
Common mistakes when creating DataFrames from CSV in PySpark include:
- Not setting header=True causes the first row to be treated as data, not column names.
- Skipping inferSchema=True results in all columns being read as strings.
- An incorrect file path or missing file causes errors.

Always check the file path and use options to correctly read the CSV.
```python
# No header or schema
wrong_df = spark.read.csv('example.csv')
wrong_df.show()

# Correct way
correct_df = spark.read.csv('example.csv', header=True, inferSchema=True)
correct_df.show()
```
Output
+---+-------+-----+
|_c0|    _c1|  _c2|
+---+-------+-----+
| id|   name|score|
|  1|  Alice|   85|
|  2|    Bob|   90|
|  3|Charlie|   78|
+---+-------+-----+
+---+-------+-----+
| id|   name|score|
+---+-------+-----+
|  1|  Alice|   85|
|  2|    Bob|   90|
|  3|Charlie|   78|
+---+-------+-----+
Quick Reference
| Option | Description | Default |
|---|---|---|
| path | File path to the CSV file | Required |
| header | Use first row as column names | False |
| inferSchema | Automatically detect data types | False |
| sep | Field delimiter | , |
| mode | How corrupt records are handled ('PERMISSIVE', 'DROPMALFORMED', 'FAILFAST') | 'PERMISSIVE' |
Key Takeaways
- Use spark.read.csv() with header=True and inferSchema=True to create a DataFrame from a CSV file.
- Always verify the file path to avoid file-not-found errors.
- Without header=True, the first CSV row is treated as data, not column names.
- Without inferSchema=True, all columns are read as strings by default.
- Use .show() to quickly inspect the loaded DataFrame content.