How to Create DataFrame from CSV in PySpark: Simple Guide
To create a DataFrame from a CSV file in PySpark, use
spark.read.csv() with the file path and options like header=True to read the first row as column names. This returns a DataFrame you can use for analysis.

Syntax
The basic syntax to create a DataFrame from a CSV file in PySpark is:
spark.read.csv(path, header=True, inferSchema=True)

- path: The location of the CSV file.
- header=True: Treats the first row as column names.
- inferSchema=True: Automatically detects the data types of columns.
```python
df = spark.read.csv('path/to/file.csv', header=True, inferSchema=True)
```
Example
This example shows how to create a Spark session, read a CSV file into a DataFrame, and display its content.
```python
from pyspark.sql import SparkSession

# Create Spark session
spark = SparkSession.builder.appName('CSVExample').getOrCreate()

# Read CSV file into DataFrame
file_path = 'example.csv'
df = spark.read.csv(file_path, header=True, inferSchema=True)

# Show DataFrame content
df.show()
```
Output
+---+-------+-----+
| id|   name|score|
+---+-------+-----+
|  1|  Alice|   85|
|  2|    Bob|   90|
|  3|Charlie|   78|
+---+-------+-----+
Common Pitfalls
Common mistakes when creating DataFrames from CSV in PySpark include:
- Not setting header=True causes the first row to be treated as data, not column names.
- Skipping inferSchema=True results in all columns being read as strings.
- An incorrect file path or missing file causes errors.

Always check the file path and use options to correctly read the CSV.
```python
# No header or schema
wrong_df = spark.read.csv('example.csv')
wrong_df.show()

# Correct way
correct_df = spark.read.csv('example.csv', header=True, inferSchema=True)
correct_df.show()
```
Output
+---+-------+-----+
|_c0|    _c1|  _c2|
+---+-------+-----+
| id|   name|score|
|  1|  Alice|   85|
|  2|    Bob|   90|
|  3|Charlie|   78|
+---+-------+-----+
+---+-------+-----+
| id|   name|score|
+---+-------+-----+
|  1|  Alice|   85|
|  2|    Bob|   90|
|  3|Charlie|   78|
+---+-------+-----+
Quick Reference
| Option | Description | Default |
|---|---|---|
| path | File path to the CSV file | Required |
| header | Use first row as column names | False |
| inferSchema | Automatically detect data types | False |
| sep | Field delimiter | , |
| mode | How corrupt records are handled ('PERMISSIVE', 'DROPMALFORMED', 'FAILFAST') | 'PERMISSIVE' |
Key Takeaways
- Use spark.read.csv() with header=True and inferSchema=True to create a DataFrame from a CSV file.
- Always verify the file path to avoid file-not-found errors.
- Without header=True, the first CSV row is treated as data, not column names.
- Without inferSchema=True, all columns are read as strings by default.
- Use .show() to quickly inspect the loaded DataFrame content.