
Parquet format and columnar storage in Apache Spark

Introduction

Parquet is a columnar file format: it stores data by column rather than by row, so files compress well and queries can read only the columns they need. This makes it a good fit for analyzing big data efficiently.

Common situations where Parquet is a good choice:

When you want to save large datasets for fast reading later.
When you only need to read a few columns from a big table.
When you want to reduce storage space for your data.
When working with big data tools like Apache Spark or Hadoop.
When you want to improve query speed on large datasets.
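To see why columnar storage speeds up column-level reads, here is a minimal pure-Python sketch (plain Python data structures, not Spark or the actual Parquet file format) comparing a row-wise layout to a column-wise one:

```python
# Conceptual sketch: row-oriented vs column-oriented storage.
# This is plain Python, not the real Parquet format.

# Row-oriented: each record is stored together.
rows = [
    {"id": 1, "name": "Alice", "age": 29},
    {"id": 2, "name": "Bob", "age": 31},
    {"id": 3, "name": "Cathy", "age": 25},
]

# Column-oriented: each column is stored together.
columns = {
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Cathy"],
    "age": [29, 31, 25],
}

# To read only "age" from the row layout, every full record must be touched:
ages_from_rows = [r["age"] for r in rows]

# In the column layout, the "age" values are already stored contiguously:
ages_from_columns = columns["age"]

print(ages_from_rows)     # [29, 31, 25]
print(ages_from_columns)  # [29, 31, 25]
```

A Parquet file organizes bytes on disk the second way, which is why selecting a few columns from a wide table avoids scanning the rest.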
Syntax
Apache Spark
df.write.parquet("path/to/save")
df = spark.read.parquet("path/to/save")

Use write.parquet() to save a DataFrame in Parquet format.

Use read.parquet() to load Parquet files back into a DataFrame.

Examples
Saves the DataFrame df in Parquet format; Spark creates a directory named data/output.parquet containing one or more part files.
Apache Spark
df.write.parquet("data/output.parquet")
Loads the Parquet file back into a new DataFrame df2.
Apache Spark
df2 = spark.read.parquet("data/output.parquet")
Saves only the name and age columns to Parquet, reducing both file size and future read time.
Apache Spark
df.select("name", "age").write.parquet("data/names_ages.parquet")
Sample Program

This program creates a small DataFrame with people data, saves it as a Parquet file, then reads it back and shows the content. It demonstrates how Parquet stores data efficiently and can be reused easily.

Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ParquetExample").getOrCreate()

# Create a simple DataFrame
data = [(1, "Alice", 29), (2, "Bob", 31), (3, "Cathy", 25)]
columns = ["id", "name", "age"]
df = spark.createDataFrame(data, columns)

# Save DataFrame as Parquet
parquet_path = "./people.parquet"
df.write.mode("overwrite").parquet(parquet_path)

# Read the Parquet file back
df_parquet = spark.read.parquet(parquet_path)

# Show the loaded data
print("Data loaded from Parquet:")
df_parquet.show()

# Stop the session when done
spark.stop()
Important Notes

Parquet files store data by columns, so reading only needed columns is faster.

Parquet compresses data automatically, saving disk space.

By default, write.parquet() fails if the target path already exists. Use mode("overwrite") to replace existing files; other save modes are "append", "ignore", and "error" (the default).

Summary

Parquet format stores data column-wise for faster and smaller storage.

Use write.parquet() and read.parquet() in Spark to save and load data.

Parquet is great for big data and improves query speed and storage efficiency.