Apache Spark · data · ~10 mins

Parquet format and columnar storage in Apache Spark - Step-by-Step Execution

Concept Flow - Parquet format and columnar storage
Start: Data in rows
Convert to columns
Store each column separately
Apply compression on columns
Save as Parquet file
Read Parquet file
Load only needed columns
Process data efficiently
End
Data is transformed from row-based to column-based storage, compressed, saved as Parquet, and read efficiently by loading only needed columns.
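The row-to-column transformation in the flow above can be sketched in plain Python (no Spark required; the sample records are made up):

```python
# Row-based storage keeps whole records together (like CSV);
# columnar storage keeps each column's values together (like Parquet).
rows = [
    {"id": 1, "name": "a", "score": 10},
    {"id": 2, "name": "b", "score": 20},
    {"id": 3, "name": "c", "score": 30},
]

# Convert rows to columns: one list of values per column name.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Reading one column now touches only that column's values.
print(columns["score"])  # [10, 20, 30]
```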
Execution Sample
Apache Spark
df = spark.read.csv('data.csv', header=True)     # Step 1: read row-based CSV into a DataFrame
df.write.parquet('data.parquet')                 # Step 2: write it out in columnar Parquet format
parquet_df = spark.read.parquet('data.parquet')  # Step 3: read the Parquet file back
parquet_df = parquet_df.select('column1')        # Step 4: keep only the column we need
parquet_df.show()                                # Step 5: trigger execution and print the values
Read CSV data, save it as Parquet, then read Parquet and select one column to show.
Execution Table
Step | Action | Input Data Shape | Storage Format | Output
1 | Read CSV file | Rows with all columns | CSV (row-based) | DataFrame with rows
2 | Write DataFrame as Parquet | Rows with all columns | Parquet (columnar) | Parquet file with columns stored separately
3 | Read Parquet file | Parquet file | Parquet (columnar) | DataFrame with columns loaded
4 | Select one column | DataFrame with all columns | In-memory column data | DataFrame with only selected column
5 | Show data | Selected column data | In-memory | Printed column values
6 | End | - | - | -
💡 Process ends after showing selected column data from Parquet file.
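The five steps in the table can be simulated in plain Python (a sketch only: an inline string stands in for data.csv, and an in-memory dict stands in for the Parquet file):

```python
import csv
import io

# Step 1: read row-based CSV data (inline sample instead of data.csv).
csv_text = "column1,column2\n1,a\n2,b\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Step 2: "write" the data column by column, as Parquet does on disk.
columnar = {name: [row[name] for row in rows] for name in rows[0]}

# Steps 3-4: read back only the column we need.
selected = {"column1": columnar["column1"]}

# Step 5: show the selected column's values.
print(selected["column1"])  # ['1', '2']
```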
Variable Tracker
Variable | Start | After Step 1 | After Step 2 | After Step 3 | After Step 4 | Final
df | None | DataFrame with all rows and columns | Same DataFrame | Same DataFrame | Same DataFrame | Same DataFrame
parquet_df | None | None | None | DataFrame loaded from Parquet | DataFrame with selected column | DataFrame with selected column
Key Moments - 3 Insights
Why does Parquet store data in columns instead of rows?
Storing data by columns lets a query read only the columns it needs (column pruning), which saves both I/O time and memory. This is shown in execution table step 4, where only one column is selected and loaded.
How does Parquet improve storage size?
Parquet compresses each column separately, which is efficient because values of the same type and similar content sit together, so encodings such as dictionary and run-length encoding work well. This corresponds to the concept flow step 'Apply compression on columns'.
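A rough illustration of why per-column compression pays off, using plain Python's zlib as a stand-in for Parquet's codecs (the column contents are hypothetical):

```python
import zlib

# A low-cardinality column, e.g. a country code repeated many times.
country = ["US"] * 500 + ["DE"] * 500

# Stored together (columnar), the values are highly repetitive bytes.
raw = ",".join(country).encode()
packed = zlib.compress(raw)

# Similar values grouped together compress extremely well.
print(len(raw), len(packed))  # the compressed column is far smaller
```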
When reading Parquet, do we load all data or only what we need?
Only the columns we need, as shown in execution table step 4, where just 'column1' is selected and loaded, which improves performance.
Visual Quiz - 3 Questions
Test your understanding
Looking at the execution table, what is the storage format after writing the DataFrame?
A. CSV (row-based)
B. JSON
C. Parquet (columnar)
D. Text file
💡 Hint
Check step 2 in the execution table, where the DataFrame is written as Parquet.
At which step do we select only one column from the DataFrame?
A. Step 2
B. Step 4
C. Step 3
D. Step 5
💡 Hint
Look at execution table step 4, where the 'Select one column' action happens.
If we read the Parquet file without selecting columns, what happens?
A. All columns are loaded into memory
B. Only one column is loaded
C. No data is loaded
D. File is converted back to CSV
💡 Hint
Refer to execution table step 3, where the Parquet file is read with all its columns before any selection is applied.
Concept Snapshot
Parquet format stores data by columns, not rows.
This allows reading only needed columns, saving time and space.
Each column is compressed separately for efficiency.
Use spark.read.parquet() to load Parquet files.
Select columns to load less data and speed up processing.
Full Transcript
This visual execution shows how data is read from a CSV file into a DataFrame, then saved as a Parquet file which stores data in a columnar format. The Parquet format stores each column separately and compresses it, making storage efficient. When reading the Parquet file, you can select only the columns you need, which loads less data and speeds up processing. The execution table traces each step: reading CSV, writing Parquet, reading Parquet, selecting columns, and showing data. Variable tracking shows how the DataFrame changes from all columns to selected columns. Key moments clarify why columnar storage is faster and smaller, and how Parquet reads only needed columns. The quiz tests understanding of storage format, selection step, and data loading behavior. The snapshot summarizes the main points about Parquet and columnar storage.