
How to Create a DataFrame from CSV in PySpark: Simple Guide

To create a DataFrame from a CSV file in PySpark, use spark.read.csv() with the file path and options like header=True to read the first row as column names. This returns a DataFrame you can use for analysis.

Syntax

The basic syntax to create a DataFrame from a CSV file in PySpark is:

  • spark.read.csv(path, header=True, inferSchema=True)
  • path: The location of the CSV file.
  • header=True: Treats the first row as column names.
  • inferSchema=True: Automatically detects column data types, at the cost of an extra pass over the file.
python
df = spark.read.csv('path/to/file.csv', header=True, inferSchema=True)

Example

This example shows how to create a Spark session, read a CSV file into a DataFrame, and display its content.

python
from pyspark.sql import SparkSession

# Create Spark session
spark = SparkSession.builder.appName('CSVExample').getOrCreate()

# Read CSV file into DataFrame
file_path = 'example.csv'
df = spark.read.csv(file_path, header=True, inferSchema=True)

# Show DataFrame content
df.show()
Output
+---+-------+-----+
| id|   name|score|
+---+-------+-----+
|  1|  Alice|   85|
|  2|    Bob|   90|
|  3|Charlie|   78|
+---+-------+-----+
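
To run the example locally, example.csv has to exist in the working directory. Here is a minimal sketch that creates it with Python's standard csv module. The file contents are an assumption inferred from the output above, since the original file is not shown:

```python
import csv

# Rows inferred from the output table above; replace with your own data.
rows = [
    ["id", "name", "score"],
    [1, "Alice", 85],
    [2, "Bob", 90],
    [3, "Charlie", 78],
]

# newline="" lets the csv module control line endings itself.
with open("example.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```

The first row becomes the column names once the file is read with header=True.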

Common Pitfalls

Common mistakes when creating DataFrames from CSV in PySpark include:

  • Omitting header=True causes the first row to be treated as data, and the columns get default names such as _c0, _c1, _c2.
  • Skipping inferSchema=True results in all columns being read as strings.
  • An incorrect file path or a missing file raises an AnalysisException reporting that the path does not exist.

Always check the file path and use options to correctly read the CSV.
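
For a file on the local filesystem, the path can be checked before it is handed to Spark. The helper below is a hypothetical sketch: local_csv_path is not part of PySpark, and it assumes Spark runs in local mode, so it does not apply to remote paths such as hdfs:// or s3a:// URIs.

```python
from pathlib import Path


def local_csv_path(path: str) -> str:
    """Return the absolute path as a string, or raise if no such file exists.

    Hypothetical helper for local files only; not a PySpark API.
    """
    p = Path(path).expanduser().resolve()
    if not p.is_file():
        raise FileNotFoundError(f"CSV file not found: {p}")
    return str(p)


# Intended use (requires an active SparkSession named spark):
# df = spark.read.csv(local_csv_path("example.csv"), header=True, inferSchema=True)
```

Failing early this way gives a plain Python error that names the resolved path, which can be easier to read than the Spark stack trace.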

python
wrong_df = spark.read.csv('example.csv')  # No header or schema
wrong_df.show()

# Correct way
correct_df = spark.read.csv('example.csv', header=True, inferSchema=True)
correct_df.show()
Output
+---+-------+-----+
|_c0|    _c1|  _c2|
+---+-------+-----+
| id|   name|score|
|  1|  Alice|   85|
|  2|    Bob|   90|
|  3|Charlie|   78|
+---+-------+-----+

+---+-------+-----+
| id|   name|score|
+---+-------+-----+
|  1|  Alice|   85|
|  2|    Bob|   90|
|  3|Charlie|   78|
+---+-------+-----+

Quick Reference

Option      | Description                                      | Default
path        | File path to the CSV file                        | Required
header      | Use first row as column names                    | False
inferSchema | Automatically detect data types                  | False
sep         | Field delimiter                                  | ,
mode        | Handling of corrupt records (e.g., 'PERMISSIVE') | 'PERMISSIVE'

Key Takeaways

  • Use spark.read.csv() with header=True and inferSchema=True to create a DataFrame from a CSV file.
  • Always verify the file path to avoid path-not-found errors.
  • Without header=True, the first CSV row is treated as data, not column names.
  • Without inferSchema=True, all columns are read as strings.
  • Use .show() to quickly inspect the loaded DataFrame content.