
Data serialization (Avro, Parquet, ORC) in Hadoop

Introduction

Data serialization stores data in a compact binary form so it takes less space and is faster to read and write. Use a serialization format such as Avro, Parquet, or ORC when:

You want to store large amounts of data in a compact way.
You need to transfer data between different systems quickly.
You want to read data fast for analysis or reporting.
You want to keep data organized with a schema.
You want to save data in a format that works well with big data tools like Hadoop and Spark.
Syntax
Use tools or libraries to write/read data in formats like Avro, Parquet, or ORC.
Example: df.write.format('parquet').save('path')
Avro stores data row by row and embeds the schema in the file, which makes writing and reading whole records easy.
Parquet and ORC store data column by column, which is faster when queries read only a few of the columns.
Examples
Read data stored in Avro format using Spark (Avro support comes from the external spark-avro package).
spark.read.format('avro').load('data.avro')
Save a DataFrame in Parquet format for efficient storage.
df.write.format('parquet').save('output/path')
Load data stored in ORC format for fast columnar access.
spark.read.format('orc').load('data.orc')
Sample Program

This program creates a small table with three people, saves it in Avro, Parquet, and ORC formats, then reads each format back and displays the rows, showing how each serialization format works with Spark.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SerializationExample').getOrCreate()

# Create sample data
data = [(1, 'Alice', 29), (2, 'Bob', 31), (3, 'Cathy', 25)]
columns = ['id', 'name', 'age']
df = spark.createDataFrame(data, columns)

# Save data in Avro format (needs the external spark-avro package on the classpath)
avro_path = '/tmp/data_avro'
df.write.format('avro').mode('overwrite').save(avro_path)

# Read back Avro data
df_avro = spark.read.format('avro').load(avro_path)
print('Avro Data:')
df_avro.show()

# Save data in Parquet format
parquet_path = '/tmp/data_parquet'
df.write.format('parquet').mode('overwrite').save(parquet_path)

# Read back Parquet data
df_parquet = spark.read.format('parquet').load(parquet_path)
print('Parquet Data:')
df_parquet.show()

# Save data in ORC format
orc_path = '/tmp/data_orc'
df.write.format('orc').mode('overwrite').save(orc_path)

# Read back ORC data
df_orc = spark.read.format('orc').load(orc_path)
print('ORC Data:')
df_orc.show()

spark.stop()
Important Notes
Avro is a good choice when you want the schema stored with the data, for easy reading, writing, and schema evolution.
Parquet and ORC are better for analytical queries because they store data by column, so queries that touch only a few columns read less data.
Avro support is not bundled with Spark; start Spark with the spark-avro package (for example, spark-submit --packages org.apache.spark:spark-avro_2.12:<your-spark-version>).
Make sure you have permission to write to the storage path, and use mode('overwrite') if the path may already contain data.
Summary
Data serialization saves data in formats like Avro, Parquet, and ORC to make storage and reading efficient.
Avro stores data row-wise with schema, while Parquet and ORC store data column-wise for faster queries.
These formats are widely used in big data tools like Hadoop and Spark.