Parquet format and columnar storage in Apache Spark

Parquet stores data by columns rather than rows, which makes files smaller on disk and faster to read, especially for analytical queries that touch only a few columns.
df.write.parquet("path/to/save")
df = spark.read.parquet("path/to/save")
Use write.parquet() to save a DataFrame in Parquet format.
Use read.parquet() to load Parquet files back into a DataFrame.
Save df as a Parquet file in the folder data/output.parquet:
df.write.parquet("data/output.parquet")
Load it back into df2:
df2 = spark.read.parquet("data/output.parquet")
Write only the name and age columns to Parquet, saving space and time:
df.select("name", "age").write.parquet("data/names_ages.parquet")
This program creates a small DataFrame of people, saves it as a Parquet file, then reads it back and displays the contents. It demonstrates that Parquet data is stored efficiently and can easily be reloaded later.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ParquetExample").getOrCreate()

# Create a simple DataFrame
data = [(1, "Alice", 29), (2, "Bob", 31), (3, "Cathy", 25)]
columns = ["id", "name", "age"]
df = spark.createDataFrame(data, columns)

# Save DataFrame as Parquet
parquet_path = "./people.parquet"
df.write.mode("overwrite").parquet(parquet_path)

# Read the Parquet file back
df_parquet = spark.read.parquet(parquet_path)

# Show the loaded data
print("Data loaded from Parquet:")
df_parquet.show()
Parquet files store data by columns, so reading only needed columns is faster.
Parquet compresses data automatically, saving disk space.
Always use mode("overwrite") if you want to replace existing Parquet files.
Parquet format stores data column-wise for faster and smaller storage.
Use write.parquet() and read.parquet() in Spark to save and load data.
Parquet is great for big data and improves query speed and storage efficiency.