
Creating DataFrames from files (CSV, JSON, Parquet) in Apache Spark

Introduction

A DataFrame organizes data into named columns and rows, much like a table. Loading data from files such as CSV, JSON, or Parquet is the usual first step in analyzing real-world data with Spark.

You have a CSV file with sales data and want to analyze it.
You receive JSON logs from a web server and want to explore them.
You want to load a Parquet file from a data lake for fast processing.
You need to combine data from different file formats into one table.
You want to quickly check the contents of a data file before analysis.
Syntax
Apache Spark
spark.read.format("file_format").option("option_name", "option_value").load("file_path")

Replace file_format with csv, json, or parquet.

Options such as header (treat the first line as column names) and inferSchema (sample the data to guess column types) tell Spark how to interpret the file.

Examples
Load a CSV file with headers and automatically detect data types.
Apache Spark
df_csv = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("data/sales.csv")
Load a JSON file without extra options.
Apache Spark
df_json = spark.read.format("json").load("data/logs.json")
Load a Parquet file, which is already optimized for Spark.
Apache Spark
df_parquet = spark.read.format("parquet").load("data/records.parquet")
Sample Program

This program creates a Spark session, loads three types of files into DataFrames, and prints the first 3 rows of each. It shows how easy it is to start working with data from different file formats.

Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("LoadFilesExample").getOrCreate()

# Load CSV file with header and schema inference
csv_df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("./data/sample.csv")

# Load JSON file
json_df = spark.read.format("json").load("./data/sample.json")

# Load Parquet file
parquet_df = spark.read.format("parquet").load("./data/sample.parquet")

# Show first 3 rows of each DataFrame
print("CSV DataFrame:")
csv_df.show(3)

print("JSON DataFrame:")
json_df.show(3)

print("Parquet DataFrame:")
parquet_df.show(3)

spark.stop()
Important Notes

Make sure the file path is correct and readable by Spark; on a cluster, the path must be accessible from every executor.

Without the header option Spark treats the first CSV row as data, and without inferSchema every column is read as a string.

Parquet files store their schema inside the file itself, so no extra options are usually needed.

Summary

DataFrames organize data in rows and columns for easy analysis.

You can load CSV, JSON, and Parquet files using spark.read.format().load().

Options help Spark understand the data better, especially for CSV files.