Hadoop · ~30 mins

Data serialization (Avro, Parquet, ORC) in Hadoop - Mini Project: Build & Apply

Working with Data Serialization Formats: Avro, Parquet, and ORC
📖 Scenario: You work in a company that collects sales data daily. The data is stored in different formats to save space and speed up processing. You want to practice reading and writing data using popular serialization formats: Avro, Parquet, and ORC.
🎯 Goal: Learn how to create a simple dataset, configure the output format, write the data in the chosen serialization format, and then read it back to see the stored data.
📋 What You'll Learn
Create a sample dataset as a list of dictionaries
Set a variable to choose the serialization format
Write the dataset to a file in the chosen format
Read the file back and print the data
💡 Why This Matters
🌍 Real World
Data serialization formats like Avro, Parquet, and ORC are used in big data systems to store and transfer data efficiently.
💼 Career
Knowing how to read and write these formats is important for data engineers and data scientists working with Hadoop and big data tools.
1
Create the sample sales data
Create a variable called sales_data that is a list of dictionaries. Each dictionary should have these exact keys and values: {'date': '2024-04-01', 'product': 'apple', 'quantity': 10}, {'date': '2024-04-01', 'product': 'banana', 'quantity': 5}, and {'date': '2024-04-02', 'product': 'apple', 'quantity': 7}.
Need a hint?

Use a list with three dictionaries exactly as shown.
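A minimal sketch of this step in plain Python:

```python
# Sample sales data: a list of dictionaries, one record per sale
sales_data = [
    {'date': '2024-04-01', 'product': 'apple', 'quantity': 10},
    {'date': '2024-04-01', 'product': 'banana', 'quantity': 5},
    {'date': '2024-04-02', 'product': 'apple', 'quantity': 7},
]
```

This row-oriented shape (one dict per record) is what PyArrow's `Table.from_pylist` expects later when converting the data for serialization.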

2
Set the serialization format
Create a variable called format_choice and set it to the string 'parquet'. This variable will decide which serialization format to use.
Need a hint?

Set the variable exactly as format_choice = 'parquet'.
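A one-line sketch of this step:

```python
# Choose which serialization format to use: 'parquet', 'avro', or 'orc'
format_choice = 'parquet'
```

Keeping the choice in a single variable means the write and read steps can branch on it without hard-coding a format.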

3
Write the data to a file in the chosen format
Use the format_choice variable to write sales_data to a file named sales_data with the correct extension: .parquet for 'parquet', .avro for 'avro', and .orc for 'orc'. Use the PyArrow or fastavro library to write the data. Write the code to handle the 'parquet' case only.
Need a hint?

Use PyArrow's Table.from_pylist and write_table functions for parquet.

4
Read the data back and print it
Read the data from the sales_data.parquet file using PyArrow and print the result as a list of dictionaries.
Need a hint?

Use pq.read_table and to_pylist() to read and convert the data.