
Data serialization (Avro, Parquet, ORC) in Hadoop - Practice Problems & Coding Challenges

Challenge - 5 Problems
Predict Output (intermediate)
What is the output of this Avro schema validation code?

Given this Avro schema and a JSON record, what will the validation output be?

{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"}
  ]
}

Record: {"name": "Alice", "age": "twenty"}
from fastavro import parse_schema
from fastavro.validation import validate

schema = {
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"}
  ]
}

parsed_schema = parse_schema(schema)

record = {"name": "Alice", "age": "twenty"}

# raise_errors=False makes validate return a bool instead of raising
is_valid = validate(record, parsed_schema, raise_errors=False)
print(is_valid)
A. TypeError
B. True
C. False
D. KeyError
💡 Hint

Check the data types in the record compared to the schema.
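The check the hint describes can be sketched without fastavro: walk the schema's fields and compare each record value against the declared Avro type. The type map and helper function below are illustrative assumptions, not fastavro internals.

```python
# Minimal sketch of Avro-style type validation (not fastavro itself):
# map Avro primitive type names to Python types and check each field.
AVRO_TO_PY = {"string": str, "int": int}

def validate_record(record, schema):
    """Return True only if every field matches its declared Avro type."""
    for field in schema["fields"]:
        value = record.get(field["name"])
        if not isinstance(value, AVRO_TO_PY[field["type"]]):
            return False
    return True

schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
    ],
}

print(validate_record({"name": "Alice", "age": "twenty"}, schema))  # prints False
```

The string "twenty" fails the `int` check for the `age` field, which is exactly why the quiz record does not validate against the schema.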

📊 Data Output (intermediate)
How many rows does this Parquet metadata snippet report?

Assume you read a Parquet file with this Python snippet using PyArrow:

import pyarrow.parquet as pq

pq_file = pq.ParquetFile('data.parquet')
num_rows = pq_file.metadata.num_rows
print(num_rows)

If the file contains 3 row groups with 1000, 1500, and 500 rows respectively, what is the output?

A. 3000
B. 1000
C. 500
D. 1500
💡 Hint

Remember that the total row count is the sum of the rows in all row groups.
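The arithmetic behind the hint is simple to verify: Parquet's file-level `num_rows` is the sum of the per-row-group counts, using the row-group sizes given in the problem statement.

```python
# The file-level num_rows in Parquet metadata is the sum of the
# per-row-group counts; these numbers come from the problem statement.
row_group_rows = [1000, 1500, 500]
total_rows = sum(row_group_rows)
print(total_rows)  # prints 3000
```

With PyArrow, the same per-group counts are available via `pq_file.metadata.row_group(i).num_rows`, and their sum equals `pq_file.metadata.num_rows`.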

🔧 Debug (advanced)
Why does this ORC file read code raise an error?

Consider this Python code snippet using pyorc to read an ORC file:

import pyorc

with open('data.orc', 'rb') as file:
    reader = pyorc.Reader(file)
    for row in reader:
        print(row)

The code raises a ValueError: 'Invalid ORC file'. What is the most likely cause?

A. The file is not a valid ORC file or is corrupted
B. The file was opened in text mode instead of binary mode
C. pyorc.Reader requires a file path string, not a file object
D. The ORC file is empty, causing the reader to fail
💡 Hint

Check the file format and integrity before reading.
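One cheap way to act on the hint is to check the file's magic bytes before handing it to a reader: per the ORC specification, a valid ORC file begins with the 3-byte magic `ORC`. The helper and demo filename below are assumptions for illustration, not part of pyorc.

```python
# Quick integrity pre-check: per the ORC spec, a valid ORC file
# begins with the 3-byte magic b"ORC". A file failing this check is
# not ORC (or is corrupted), matching the error in the problem.
def looks_like_orc(path):
    with open(path, "rb") as f:
        return f.read(3) == b"ORC"

# Hypothetical non-ORC file written just for this demonstration:
with open("not_orc.bin", "wb") as f:
    f.write(b"id,name\n1,Alice\n")

print(looks_like_orc("not_orc.bin"))  # prints False
```

A failed magic-byte check points to answer A: the bytes on disk simply are not an ORC file, regardless of how correctly the reader code is written.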

🧠 Conceptual (advanced)
Which data serialization format supports schema evolution best?

Among Avro, Parquet, and ORC, which format is designed to handle schema changes like adding or removing fields without breaking existing data?

A. Parquet
B. Avro
C. ORC
D. None of these support schema evolution
💡 Hint

Think about which format stores schema with data and supports forward/backward compatibility.
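The mechanism the hint alludes to can be sketched in plain Python: in Avro-style schema resolution, a field added to the reader's schema with a `"default"` is filled in when reading data written under the old schema. The `resolve` helper and field names below are illustrative assumptions, not an Avro library API.

```python
# Sketch of Avro-style schema resolution (not an Avro library call):
# a field added in the new (reader) schema with a "default" can be
# filled in when reading records written with the old schema.
new_schema_fields = [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"},
    {"name": "email", "type": "string", "default": ""},  # newly added field
]

def resolve(old_record, reader_fields):
    resolved = {}
    for field in reader_fields:
        if field["name"] in old_record:
            resolved[field["name"]] = old_record[field["name"]]
        else:
            resolved[field["name"]] = field["default"]  # schema evolution
    return resolved

old_record = {"name": "Alice", "age": 30}  # written with the old schema
print(resolve(old_record, new_schema_fields))  # email filled from its default
```

Because Avro stores the writer's schema with the data and resolves it against the reader's schema like this, adding or removing fields with defaults keeps old and new data mutually readable.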

🚀 Application (expert)
Which option produces the smallest file size for a large dataset with many repeated string values?

You have a large dataset with many repeated string values in one column. You want to serialize it efficiently using Hadoop-compatible formats. Which format will likely produce the smallest file size?

A. ORC without compression
B. Avro with Snappy compression
C. CSV compressed with gzip
D. Parquet with dictionary encoding and Snappy compression
💡 Hint

Consider columnar formats and encoding techniques for repeated values.
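The encoding the hint points at can be sketched in a few lines: dictionary encoding, which Parquet applies per column, stores each distinct string once and replaces the column with small integer codes. The sample column below is an assumption for illustration.

```python
# Sketch of dictionary encoding, the technique Parquet applies to
# columns with many repeated values: store each distinct string once
# and replace the column with compact integer indices.
column = ["US", "UK", "US", "US", "FR", "UK"] * 1000

dictionary = sorted(set(column))            # distinct values, stored once
index_of = {v: i for i, v in enumerate(dictionary)}
encoded = [index_of[v] for v in column]     # 6000 small integers

print(len(dictionary), len(encoded))  # prints: 3 6000
```

Only 3 distinct strings back 6000 values, so the encoded column is far smaller than repeating the strings, and the highly regular integer stream compresses further under Snappy, which is why option D wins.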