
Data serialization (Avro, Parquet, ORC) in Hadoop - Step-by-Step Execution

Concept Flow - Data serialization (Avro, Parquet, ORC)
Raw Data
Choose Format (Avro, Parquet, or ORC)
Serialize Data
Store in Hadoop
Read & Deserialize
Use Data for Analysis
Data flows from raw form to serialization in Avro, Parquet, or ORC formats, then stored and later read back for analysis.
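The flow above can be sketched in plain Python. This is a minimal illustration that uses the standard library's pickle module as a stand-in for Avro/Parquet/ORC (the real formats require Hadoop or Spark libraries) to show the serialize → store → read-back roundtrip:

```python
import os
import pickle
import tempfile

# Raw data: a small table of (id, name) rows, mirroring the flow above.
rows = [(1, "Alice"), (2, "Bob")]

# Serialize: turn the in-memory rows into bytes and store them in a file.
path = os.path.join(tempfile.mkdtemp(), "data.pkl")
with open(path, "wb") as f:
    pickle.dump(rows, f)

# Read & deserialize: load the bytes back into Python objects for analysis.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored)
```

The same roundtrip shape (serialize, store, read, deserialize) is what the Parquet example below performs, just with a columnar on-disk format instead of pickle's generic byte stream.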
Execution Sample
Hadoop
from pyspark.sql import SparkSession
# Start (or reuse) a Spark session.
spark = SparkSession.builder.getOrCreate()
# Build a small two-row DataFrame with columns 'id' and 'name'.
df = spark.createDataFrame([(1, 'Alice'), (2, 'Bob')], ['id', 'name'])
# Serialize the DataFrame to columnar Parquet files on disk.
df.write.parquet('data.parquet')
# Read the Parquet files back, deserializing into a new DataFrame.
read_df = spark.read.parquet('data.parquet')
read_df.show()
This code saves a small table in Parquet format and reads it back to show the data.
Execution Table
Step | Action | Input Data | Format Used | Output / Result
1 | Create DataFrame | [(1, 'Alice'), (2, 'Bob')] | N/A | DataFrame with 2 rows
2 | Write DataFrame to file | DataFrame | Parquet | File 'data.parquet' created
3 | Read file back | File 'data.parquet' | Parquet | DataFrame with 2 rows
4 | Show DataFrame | DataFrame | N/A | Displays rows: (1, Alice), (2, Bob)
5 | End | N/A | N/A | Process complete
💡 All steps completed successfully; data serialized and deserialized using Parquet.
Variable Tracker
Variable | Start | After Step 1 | After Step 2 | After Step 3 | After Step 4 | Final
df | None | DataFrame with 2 rows | unchanged | unchanged | unchanged | unchanged
read_df | None | None | None | DataFrame with 2 rows | unchanged | unchanged
Key Moments - 2 Insights
Why do we need to write data in a special format like Parquet instead of just saving as text?
Parquet stores data in a compact, efficient way with schema info, making reading faster and saving space, as shown in step 2 where the DataFrame is saved in Parquet format.
What happens if we try to read the data without specifying the correct format?
The system may fail or read data incorrectly because it expects the format used during writing, as seen in step 3 where reading uses Parquet format to correctly deserialize.
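This second insight can be demonstrated without Hadoop. In the sketch below, plain Python's pickle stands in for the writer's format and json for a mismatched reader; these are illustrative stand-ins, not the actual Parquet/Avro machinery:

```python
import json
import os
import pickle
import tempfile

rows = [(1, "Alice"), (2, "Bob")]
path = os.path.join(tempfile.mkdtemp(), "data.bin")

# Write the data using one serializer (pickle's binary encoding).
with open(path, "wb") as f:
    pickle.dump(rows, f)

# Attempting to read it back with a different format (JSON) fails,
# because the reader cannot interpret bytes produced by another serializer.
try:
    with open(path, "r", encoding="utf-8") as f:
        json.load(f)
    mismatch_failed = False
except (UnicodeDecodeError, json.JSONDecodeError):
    mismatch_failed = True

# Reading with the matching format succeeds.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(mismatch_failed, restored)
```

The same rule applies in Spark: data written with `df.write.parquet` must be read back with `spark.read.parquet`, as in step 3 of the execution table.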
Visual Quiz - 3 Questions
Test your understanding
Looking at the execution table, what is the content of 'df' after step 1?
A. DataFrame with 2 rows
B. File 'data.parquet'
C. Empty DataFrame
D. None
💡 Hint
Check the 'df' variable state in variable_tracker after step 1.
At which step is the data actually saved to disk in Parquet format?
A. Step 1
B. Step 2
C. Step 3
D. Step 4
💡 Hint
Look at the 'Action' and 'Output / Result' columns in execution_table.
If we changed the format from Parquet to Avro in step 2, what would change in step 3?
A. DataFrame would have more rows
B. No change, still read as Parquet
C. We would read the file using Avro format instead of Parquet
D. DataFrame would be empty
💡 Hint
Refer to the 'Format Used' column in execution_table for steps 2 and 3.
Concept Snapshot
Data serialization saves data in formats like Avro, Parquet, or ORC.
These formats store data efficiently with schema info.
Use df.write to save and spark.read to load data.
Choose format based on use case: Avro for row-based, Parquet/ORC for column-based.
Serialization improves storage and query speed.
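The row-based vs column-based distinction above can be illustrated with plain Python containers. This is only a conceptual sketch of the two layouts, not how Avro or Parquet actually encode bytes on disk:

```python
# Row-based layout (Avro-style): complete records stored one after
# another; good for writing and reading whole records at a time.
row_store = [
    {"id": 1, "name": "Alice"},
    {"id": 2, "name": "Bob"},
]

# Column-based layout (Parquet/ORC-style): each column stored
# contiguously; good for scanning one column across many records.
column_store = {
    "id": [1, 2],
    "name": ["Alice", "Bob"],
}

# A column scan (e.g. "sum all ids") touches only one list in the
# columnar layout, but has to visit every record in the row layout.
row_scan = sum(record["id"] for record in row_store)
column_scan = sum(column_store["id"])
print(row_scan, column_scan)
```

This is why analytic queries that aggregate a few columns over many rows favor Parquet/ORC, while record-at-a-time pipelines favor Avro.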
Full Transcript
This visual execution shows how raw data is converted into serialized formats like Avro, Parquet, or ORC for storage in Hadoop. The flow starts with raw data, choosing a format, serializing, storing, then reading back for analysis. The sample code creates a small table, saves it as Parquet, reads it back, and displays the data. The execution table traces each step, showing data creation, writing to Parquet, reading from Parquet, and displaying results. Variable tracking shows how the DataFrame variables change over steps. Key moments clarify why special formats are used and the importance of matching read and write formats. The quiz tests understanding of variable states, steps of saving, and format changes. The snapshot summarizes key points about data serialization formats and their use in Hadoop environments.