
Creating DataFrames from files (CSV, JSON, Parquet) in Apache Spark

Introduction

A DataFrame organizes data into named columns and rows, much like a table. Loading data from files such as CSV, JSON, or Parquet is the usual first step in analyzing real-world data with Spark.

You have a CSV file with sales data and want to analyze it.
You receive JSON logs from a web server and want to explore them.
You want to load a Parquet file from a data lake for fast processing.
You need to combine data from different file formats into one table.
You want to quickly check the contents of a data file before analysis.
Syntax
Apache Spark
spark.read.format("file_format").option("option_name", "option_value").load("file_path")

Replace file_format with csv, json, or parquet.

Options such as header (treat the first line as column names) and inferSchema (sample the data to guess column types) tell Spark how to interpret the file.

Examples
Load a CSV file with headers and automatically detect data types.
Apache Spark
df_csv = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("data/sales.csv")
Load a JSON file without extra options.
Apache Spark
df_json = spark.read.format("json").load("data/logs.json")
Load a Parquet file, which is already optimized for Spark.
Apache Spark
df_parquet = spark.read.format("parquet").load("data/records.parquet")
Sample Program

This program creates a Spark session, loads three types of files into DataFrames, and prints the first 3 rows of each. It shows how easy it is to start working with data from different file formats.

Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("LoadFilesExample").getOrCreate()

# Load CSV file with header and schema inference
csv_df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("./data/sample.csv")

# Load JSON file
json_df = spark.read.format("json").load("./data/sample.json")

# Load Parquet file
parquet_df = spark.read.format("parquet").load("./data/sample.parquet")

# Show first 3 rows of each DataFrame
print("CSV DataFrame:")
csv_df.show(3)

print("JSON DataFrame:")
json_df.show(3)

print("Parquet DataFrame:")
parquet_df.show(3)

spark.stop()
Important Notes

Make sure the file path is correct and readable by Spark; on a cluster, the path must be accessible from every executor.

Without the header option Spark treats the first CSV row as data, and without inferSchema every column is read as a string.

Parquet files store their schema inside the file itself, so no extra options are usually needed.

Summary

DataFrames organize data in rows and columns for easy analysis.

You can load CSV, JSON, and Parquet files using spark.read.format().load().

Options help Spark understand the data better, especially for CSV files.