
Parquet format and columnar storage in Apache Spark

Introduction

Parquet is a columnar file format: it stores data by column rather than by row, so files compress well and queries can read only the columns they need. This makes it a good fit for analyzing big data efficiently.

Common situations where Parquet is a good choice:

When you want to save large datasets for fast reading later.
When you only need to read a few columns from a big table.
When you want to reduce storage space for your data.
When working with big data tools like Apache Spark or Hadoop.
When you want to improve query speed on large datasets.
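To see why columnar storage speeds up column-level reads, here is a minimal pure-Python sketch (plain Python data structures, not Spark or the actual Parquet file format) comparing a row-wise layout to a column-wise one:

```python
# Conceptual sketch: row-oriented vs column-oriented storage.
# This is plain Python, not the real Parquet format.

# Row-oriented: each record is stored together.
rows = [
    {"id": 1, "name": "Alice", "age": 29},
    {"id": 2, "name": "Bob", "age": 31},
    {"id": 3, "name": "Cathy", "age": 25},
]

# Column-oriented: each column is stored together.
columns = {
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Cathy"],
    "age": [29, 31, 25],
}

# To read only "age" from the row layout, every full record must be touched:
ages_from_rows = [r["age"] for r in rows]

# In the column layout, the "age" values are already stored contiguously:
ages_from_columns = columns["age"]

print(ages_from_rows)     # [29, 31, 25]
print(ages_from_columns)  # [29, 31, 25]
```

A Parquet file organizes bytes on disk the second way, which is why selecting a few columns from a wide table avoids scanning the rest.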
Syntax
Apache Spark
df.write.parquet("path/to/save")
df = spark.read.parquet("path/to/save")

Use write.parquet() to save a DataFrame in Parquet format.

Use read.parquet() to load Parquet files back into a DataFrame.

Examples
Saves the DataFrame df in Parquet format; Spark creates a directory named data/output.parquet containing one or more part files.
Apache Spark
df.write.parquet("data/output.parquet")
Loads the Parquet file back into a new DataFrame df2.
Apache Spark
df2 = spark.read.parquet("data/output.parquet")
Saves only the name and age columns to Parquet, reducing both file size and future read time.
Apache Spark
df.select("name", "age").write.parquet("data/names_ages.parquet")
Sample Program

This program creates a small DataFrame with people data, saves it as a Parquet file, then reads it back and shows the content. It demonstrates how Parquet stores data efficiently and can be reused easily.

Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ParquetExample").getOrCreate()

# Create a simple DataFrame
data = [(1, "Alice", 29), (2, "Bob", 31), (3, "Cathy", 25)]
columns = ["id", "name", "age"]
df = spark.createDataFrame(data, columns)

# Save DataFrame as Parquet
parquet_path = "./people.parquet"
df.write.mode("overwrite").parquet(parquet_path)

# Read the Parquet file back
df_parquet = spark.read.parquet(parquet_path)

# Show the loaded data
print("Data loaded from Parquet:")
df_parquet.show()

# Stop the session when done
spark.stop()
Important Notes

Parquet files store data by columns, so reading only needed columns is faster.

Parquet compresses data automatically, saving disk space.

By default, write.parquet() fails if the target path already exists. Use mode("overwrite") to replace existing files; other save modes are "append", "ignore", and "error" (the default).

Summary

Parquet format stores data column-wise for faster and smaller storage.

Use write.parquet() and read.parquet() in Spark to save and load data.

Parquet is great for big data and improves query speed and storage efficiency.