
Data serialization (Avro, Parquet, ORC) in Hadoop - Step-by-Step Execution

Concept Flow - Data serialization (Avro, Parquet, ORC)
Raw Data
Choose Format (Avro, Parquet, or ORC)
Serialize Data
Store in Hadoop
Read & Deserialize
Use Data for Analysis
Data flows from raw form to serialization in Avro, Parquet, or ORC formats, then stored and later read back for analysis.
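The flow above can be sketched in plain Python. This is a minimal illustration that uses the standard library's pickle module as a stand-in for Avro/Parquet/ORC (the real formats require Hadoop or Spark libraries) to show the serialize → store → read-back roundtrip:

```python
import os
import pickle
import tempfile

# Raw data: a small table of (id, name) rows, mirroring the flow above.
rows = [(1, "Alice"), (2, "Bob")]

# Serialize: turn the in-memory rows into bytes and store them in a file.
path = os.path.join(tempfile.mkdtemp(), "data.pkl")
with open(path, "wb") as f:
    pickle.dump(rows, f)

# Read & deserialize: load the bytes back into Python objects for analysis.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored)
```

The same roundtrip shape (serialize, store, read, deserialize) is what the Parquet example below performs, just with a columnar on-disk format instead of pickle's generic byte stream.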
Execution Sample
Hadoop
from pyspark.sql import SparkSession
# Start (or reuse) a Spark session.
spark = SparkSession.builder.getOrCreate()
# Build a small two-row DataFrame with columns 'id' and 'name'.
df = spark.createDataFrame([(1, 'Alice'), (2, 'Bob')], ['id', 'name'])
# Serialize the DataFrame to columnar Parquet files on disk.
df.write.parquet('data.parquet')
# Read the Parquet files back, deserializing into a new DataFrame.
read_df = spark.read.parquet('data.parquet')
read_df.show()
This code saves a small table in Parquet format and reads it back to show the data.
Execution Table
Step | Action | Input Data | Format Used | Output / Result
1 | Create DataFrame | [(1, 'Alice'), (2, 'Bob')] | N/A | DataFrame with 2 rows
2 | Write DataFrame to file | DataFrame | Parquet | File 'data.parquet' created
3 | Read file back | File 'data.parquet' | Parquet | DataFrame with 2 rows
4 | Show DataFrame | DataFrame | N/A | Displays rows: (1, Alice), (2, Bob)
5 | End | N/A | N/A | Process complete
💡 All steps completed successfully; data serialized and deserialized using Parquet.
Variable Tracker
Variable | Start | After Step 1 | After Step 2 | After Step 3 | After Step 4 | Final
df | None | DataFrame with 2 rows | unchanged | unchanged | unchanged | unchanged
read_df | None | None | None | DataFrame with 2 rows | unchanged | unchanged
Key Moments - 2 Insights
Why do we need to write data in a special format like Parquet instead of just saving as text?
Parquet stores data in a compact, efficient way with schema info, making reading faster and saving space, as shown in step 2 where the DataFrame is saved in Parquet format.
What happens if we try to read the data without specifying the correct format?
The system may fail or read data incorrectly because it expects the format used during writing, as seen in step 3 where reading uses Parquet format to correctly deserialize.
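This second insight can be demonstrated without Hadoop. In the sketch below, plain Python's pickle stands in for the writer's format and json for a mismatched reader; these are illustrative stand-ins, not the actual Parquet/Avro machinery:

```python
import json
import os
import pickle
import tempfile

rows = [(1, "Alice"), (2, "Bob")]
path = os.path.join(tempfile.mkdtemp(), "data.bin")

# Write the data using one serializer (pickle's binary encoding).
with open(path, "wb") as f:
    pickle.dump(rows, f)

# Attempting to read it back with a different format (JSON) fails,
# because the reader cannot interpret bytes produced by another serializer.
try:
    with open(path, "r", encoding="utf-8") as f:
        json.load(f)
    mismatch_failed = False
except (UnicodeDecodeError, json.JSONDecodeError):
    mismatch_failed = True

# Reading with the matching format succeeds.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(mismatch_failed, restored)
```

The same rule applies in Spark: data written with `df.write.parquet` must be read back with `spark.read.parquet`, as in step 3 of the execution table.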
Visual Quiz - 3 Questions
Test your understanding
Looking at the execution table, what is the content of 'df' after step 1?
A. DataFrame with 2 rows
B. File 'data.parquet'
C. Empty DataFrame
D. None
💡 Hint
Check the 'df' variable state in variable_tracker after step 1.
At which step is the data actually saved to disk in Parquet format?
A. Step 1
B. Step 2
C. Step 3
D. Step 4
💡 Hint
Look at the 'Action' and 'Output / Result' columns in execution_table.
If we changed the format from Parquet to Avro in step 2, what would change in step 3?
A. DataFrame would have more rows
B. No change, still read as Parquet
C. We would read the file using Avro format instead of Parquet
D. DataFrame would be empty
💡 Hint
Refer to the 'Format Used' column in execution_table for steps 2 and 3.
Concept Snapshot
Data serialization saves data in formats like Avro, Parquet, or ORC.
These formats store data efficiently with schema info.
Use df.write to save and spark.read to load data.
Choose format based on use case: Avro for row-based, Parquet/ORC for column-based.
Serialization improves storage and query speed.
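The row-based vs column-based distinction above can be illustrated with plain Python containers. This is only a conceptual sketch of the two layouts, not how Avro or Parquet actually encode bytes on disk:

```python
# Row-based layout (Avro-style): complete records stored one after
# another; good for writing and reading whole records at a time.
row_store = [
    {"id": 1, "name": "Alice"},
    {"id": 2, "name": "Bob"},
]

# Column-based layout (Parquet/ORC-style): each column stored
# contiguously; good for scanning one column across many records.
column_store = {
    "id": [1, 2],
    "name": ["Alice", "Bob"],
}

# A column scan (e.g. "sum all ids") touches only one list in the
# columnar layout, but has to visit every record in the row layout.
row_scan = sum(record["id"] for record in row_store)
column_scan = sum(column_store["id"])
print(row_scan, column_scan)
```

This is why analytic queries that aggregate a few columns over many rows favor Parquet/ORC, while record-at-a-time pipelines favor Avro.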
Full Transcript
This visual execution shows how raw data is converted into serialized formats like Avro, Parquet, or ORC for storage in Hadoop. The flow starts with raw data, choosing a format, serializing, storing, then reading back for analysis. The sample code creates a small table, saves it as Parquet, reads it back, and displays the data. The execution table traces each step, showing data creation, writing to Parquet, reading from Parquet, and displaying results. Variable tracking shows how the DataFrame variables change over steps. Key moments clarify why special formats are used and the importance of matching read and write formats. The quiz tests understanding of variable states, steps of saving, and format changes. The snapshot summarizes key points about data serialization formats and their use in Hadoop environments.