Reading CSV files with options in Apache Spark

Introduction

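Options tell Spark exactly how to parse a CSV file: whether it has a header row, which character separates the fields, how to treat malformed lines, and so on. Setting them correctly makes Spark load the right column names, values, and data types. Options are typically needed in these cases: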

When the CSV file has a header row with column names.
When the data uses a separator other than a comma, like a semicolon or tab.
When you want to ignore bad or corrupted lines in the file.
When you want to specify how to handle missing values.
When you want to control how Spark infers data types.
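
The missing-values case in the list above can be sketched as follows. This is a minimal example, assuming a hypothetical file in which the literal text NA marks a missing value; nullValue is the CSV reader option that maps such a string to null. The helper name read_with_nulls and the NA marker are illustrative assumptions, not part of the original tutorial.

```python
# Options for a hypothetical CSV in which the text "NA" marks a missing value.
NULL_OPTIONS = {
    "header": "true",     # first row holds the column names
    "nullValue": "NA",    # read the literal string NA as null
}


def read_with_nulls(spark, path):
    """Read a CSV with the options above (spark is an existing SparkSession)."""
    reader = spark.read
    for name, value in NULL_OPTIONS.items():
        reader = reader.option(name, value)  # each option() call returns the reader
    return reader.csv(path)
```

With a live session, read_with_nulls(spark, "data.csv") returns a DataFrame in which every NA cell appears as null.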
Syntax
Apache Spark
spark.read.option("option_name", "option_value").csv("file_path")

You can chain multiple option() calls to set different options.

Common options include header, sep, inferSchema, and mode.
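
Chained option() calls are not the only way to set several options. As a sketch, the three hypothetical helpers below configure the same header and separator settings in three ways: chained option() calls, a single options() call, and keyword arguments of csv(). Each helper assumes that spark is an existing SparkSession; the helper names are illustrative.

```python
def read_chained(spark, path):
    """Set options one at a time with chained option() calls."""
    return spark.read.option("header", "true").option("sep", ";").csv(path)


def read_with_options(spark, path):
    """Set the same options in a single options() call."""
    return spark.read.options(header="true", sep=";").csv(path)


def read_with_keywords(spark, path):
    """Pass the same settings as keyword arguments of csv()."""
    return spark.read.csv(path, header=True, sep=";")
```

Which form to use is mostly a matter of taste; options() is convenient when the settings already live in a dictionary and can be unpacked with **.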

Examples
Reads a CSV file where the first row has column names.
Apache Spark
df = spark.read.option("header", "true").csv("data.csv")
Reads a CSV file with semicolon separators and a header row.
Apache Spark
df = spark.read.option("sep", ";").option("header", "true").csv("data.csv")
Reads a CSV file and tries to guess the data types of each column.
Apache Spark
df = spark.read.option("inferSchema", "true").csv("data.csv")
Reads a CSV file and skips lines that are corrupted or badly formatted.
Apache Spark
df = spark.read.option("mode", "DROPMALFORMED").csv("data.csv")
Sample Program

This program reads a CSV file named example.csv that uses semicolons to separate values and has a header row. It also asks Spark to guess the data types of each column. Then it prints the schema and the data.

Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CSVOptionsExample").getOrCreate()

# Read CSV with header, semicolon separator, and infer schema
file_path = "example.csv"
df = spark.read.option("header", "true")\
               .option("sep", ";")\
               .option("inferSchema", "true")\
               .csv(file_path)

# Show the schema and the data
print("DataFrame schema:")
df.printSchema()
print("DataFrame content:")
df.show()

spark.stop()
Important Notes

Setting header to true tells Spark to use the first row as column names. Without it, Spark treats the first row as data and names the columns _c0, _c1, and so on.

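Without inferSchema, Spark reads every column as a string. Setting inferSchema to true makes Spark scan the data an extra time to guess each column's type, which can slow down reading for big files, and the guessed types are not guaranteed to be the ones you intend.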
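
When the column types are known in advance, declaring them avoids the cost of inference. The sketch below assumes a hypothetical file with the columns id, name, and price; the schema string, the helper name, and the semicolon separator are illustrative assumptions. The schema() method of the reader accepts a DDL-formatted string like this one or a StructType.

```python
# Hypothetical column layout; adjust the names and types to match the real file.
SCHEMA_DDL = "id INT, name STRING, price DOUBLE"


def read_with_schema(spark, path):
    """Read a semicolon-separated CSV with a declared schema instead of inferSchema."""
    return (
        spark.read
        .schema(SCHEMA_DDL)        # declared types, so no inference pass is needed
        .option("header", "true")  # first row holds the column names
        .option("sep", ";")
        .csv(path)
    )
```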

The mode option controls how Spark handles bad lines: PERMISSIVE (the default) keeps the row and sets fields it cannot parse to null, DROPMALFORMED skips the row, and FAILFAST raises an error.
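
To make the choice of mode explicit in code, a small wrapper can validate the mode name before handing it to Spark. This is only a sketch: the helper read_with_mode and the constant VALID_MODES are hypothetical names, and spark is assumed to be an existing SparkSession.

```python
VALID_MODES = ("PERMISSIVE", "DROPMALFORMED", "FAILFAST")


def read_with_mode(spark, path, mode="PERMISSIVE"):
    """Read a CSV with a header row using the given parse mode."""
    mode = mode.upper()
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    return spark.read.option("header", "true").option("mode", mode).csv(path)
```

For example, read_with_mode(spark, "data.csv", "FAILFAST") suits pipelines where a single bad line should stop the job, while DROPMALFORMED suits exploratory work where losing a few rows is acceptable.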

Summary

Use options to customize how Spark reads CSV files.

Common options include header, sep, inferSchema, and mode.

Setting options correctly helps Spark read data accurately and avoid errors.