We use DataFrames to organize data in rows and columns. Loading data from files such as CSV, JSON, or Parquet is the usual first step in analyzing real-world data.
Creating DataFrames from files (CSV, JSON, Parquet) in Apache Spark
Introduction
You have a CSV file with sales data and want to analyze it.
You receive JSON logs from a web server and want to explore them.
You want to load a Parquet file from a data lake for fast processing.
You need to combine data from different file formats into one table.
You want to quickly check the contents of a data file before analysis.
Syntax
Apache Spark
spark.read.format("file_format").option("option_name", "option_value").load("file_path")
Replace file_format with csv, json, or parquet.
Options like header or inferSchema help Spark understand the data better.
Examples
Load a CSV file with headers and automatically detect data types.
Apache Spark
df_csv = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("data/sales.csv")
Load a JSON file without extra options.
Apache Spark
df_json = spark.read.format("json").load("data/logs.json")
Load a Parquet file, which is already optimized for Spark.
Apache Spark
df_parquet = spark.read.format("parquet").load("data/records.parquet")
Sample Program
This program creates a Spark session, loads three types of files into DataFrames, and prints the first 3 rows of each. It shows how easy it is to start working with data from different file formats.
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("LoadFilesExample").getOrCreate()

# Load CSV file with header and schema inference
csv_df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("./data/sample.csv")

# Load JSON file
json_df = spark.read.format("json").load("./data/sample.json")

# Load Parquet file
parquet_df = spark.read.format("parquet").load("./data/sample.parquet")

# Show first 3 rows of each DataFrame
print("CSV DataFrame:")
csv_df.show(3)
print("JSON DataFrame:")
json_df.show(3)
print("Parquet DataFrame:")
parquet_df.show(3)

spark.stop()
Important Notes
Make sure the file path is correct and accessible by Spark.
CSV files often need the header and inferSchema options to be read correctly.
Parquet files store schema inside, so no extra options are usually needed.
Summary
DataFrames organize data in rows and columns for easy analysis.
You can load CSV, JSON, and Parquet files using spark.read.format().load().
Options help Spark understand the data better, especially for CSV files.