How to Read CSV Files in PySpark: Simple Guide
To read a CSV file in PySpark, use spark.read.csv(path, options). This loads the CSV into a DataFrame. You can pass options such as header=True to use the first row as column names and inferSchema=True to detect data types automatically.
Syntax
The basic syntax to read a CSV file in PySpark is:
spark.read.csv(path, options): Reads the CSV file from the given path.
- path: The location of the CSV file (local or distributed storage).
- options: Optional parameters like header and inferSchema.
Common options include:
- header=True: Treats the first row as column names.
- inferSchema=True: Automatically detects data types.
- sep=',': Defines the delimiter (default is a comma).
```python
df = spark.read.csv('path/to/file.csv', header=True, inferSchema=True)
```
Example
This example shows how to read a CSV file with a header and inferred schema, then display the data.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('ReadCSVExample').getOrCreate()

# Read CSV file with header and schema inference
file_path = 'example.csv'
df = spark.read.csv(file_path, header=True, inferSchema=True)
df.show()

spark.stop()
```
Output
+---+-------+-----+
| id| name|score|
+---+-------+-----+
| 1| Alice| 85.0|
| 2| Bob| 90.5|
| 3|Charlie| 78.0|
+---+-------+-----+
Common Pitfalls
Some common mistakes when reading CSV files in PySpark include:
- Not setting header=True when the CSV has column names, causing the first row to be treated as data.
- Forgetting inferSchema=True, which results in all columns being read as strings.
- Using a wrong file path, which raises a missing-file error.
- Not specifying the correct delimiter if the CSV uses a separator other than a comma.
Example of a wrong and right way:
```python
# Wrong way: header missing, schema not inferred
wrong_df = spark.read.csv('example.csv')
wrong_df.show()

# Right way: header and schema specified
right_df = spark.read.csv('example.csv', header=True, inferSchema=True)
right_df.show()
```
Output
+---+-------+-----+
|_c0|    _c1|  _c2|
+---+-------+-----+
| id|   name|score|
|  1|  Alice| 85.0|
|  2|    Bob| 90.5|
|  3|Charlie| 78.0|
+---+-------+-----+
+---+-------+-----+
| id| name|score|
+---+-------+-----+
| 1| Alice| 85.0|
| 2| Bob| 90.5|
| 3|Charlie| 78.0|
+---+-------+-----+
Quick Reference
| Option | Description | Example |
|---|---|---|
| header | Use first row as column names | header=True |
| inferSchema | Detect data types automatically | inferSchema=True |
| sep | Set delimiter character | sep=';' |
| mode | Handle corrupt records (e.g., 'DROPMALFORMED') | mode='DROPMALFORMED' |
| encoding | Set file encoding | encoding='UTF-8' |
Key Takeaways
- Use spark.read.csv with header=True and inferSchema=True to read CSV files properly.
- Always check the file path and delimiter to avoid read errors.
- Without inferSchema, all columns are read as strings by default.
- Setting header=True ensures the first row is used as column names.
- Use options like mode and encoding to handle special CSV cases.