Hadoop · ~30 mins

Data serialization (Avro, Parquet, ORC) in Hadoop - Mini Project: Build & Apply

Working with Data Serialization Formats: Avro, Parquet, and ORC
📖 Scenario: You work in a company that collects sales data daily. The data is stored in different formats to save space and speed up processing. You want to practice reading and writing data using popular serialization formats: Avro, Parquet, and ORC.
🎯 Goal: Learn how to create a simple dataset, configure the output format, write the data in the chosen serialization format, and then read it back to see the stored data.
📋 What You'll Learn
Create a sample dataset as a list of dictionaries
Set a variable to choose the serialization format
Write the dataset to a file in the chosen format
Read the file back and print the data
💡 Why This Matters
🌍 Real World
Data serialization formats like Avro, Parquet, and ORC are used in big data systems to store and transfer data efficiently.
💼 Career
Knowing how to read and write these formats is important for data engineers and data scientists working with Hadoop and big data tools.
1
Create the sample sales data
Create a variable called sales_data that is a list of dictionaries. Each dictionary should have these exact keys and values: {'date': '2024-04-01', 'product': 'apple', 'quantity': 10}, {'date': '2024-04-01', 'product': 'banana', 'quantity': 5}, and {'date': '2024-04-02', 'product': 'apple', 'quantity': 7}.
Need a hint?

Use a list with three dictionaries exactly as shown.
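A minimal sketch of this step in plain Python:

```python
# Sample sales data: a list of dictionaries, one record per sale
sales_data = [
    {'date': '2024-04-01', 'product': 'apple', 'quantity': 10},
    {'date': '2024-04-01', 'product': 'banana', 'quantity': 5},
    {'date': '2024-04-02', 'product': 'apple', 'quantity': 7},
]
```

This row-oriented shape (one dict per record) is what PyArrow's `Table.from_pylist` expects later when converting the data for serialization.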

2
Set the serialization format
Create a variable called format_choice and set it to the string 'parquet'. This variable will decide which serialization format to use.
Need a hint?

Set the variable exactly as format_choice = 'parquet'.
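A one-line sketch of this step:

```python
# Choose which serialization format to use: 'parquet', 'avro', or 'orc'
format_choice = 'parquet'
```

Keeping the choice in a single variable means the write and read steps can branch on it without hard-coding a format.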

3
Write the data to a file in the chosen format
Use the format_choice variable to write sales_data to a file named sales_data with the correct extension: .parquet for 'parquet', .avro for 'avro', and .orc for 'orc'. Use the PyArrow or fastavro library to write the data. Write the code to handle the 'parquet' case only.
Need a hint?

Use PyArrow's Table.from_pylist and write_table functions for parquet.

4
Read the data back and print it
Read the data from the sales_data.parquet file using PyArrow and print the result as a list of dictionaries.
Need a hint?

Use pq.read_table and to_pylist() to read and convert the data.