Given this Avro schema and a JSON record, what will the validation output be?
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"}
  ]
}
Record: {"name": "Alice", "age": "twenty"}

from fastavro import parse_schema, validate

schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"}
    ]
}

parsed_schema = parse_schema(schema)
record = {"name": "Alice", "age": "twenty"}

# raise_errors=False makes validate() return a boolean instead of
# raising a ValidationError on the first mismatch.
is_valid = validate(record, parsed_schema, raise_errors=False)
print(is_valid)
Check the data types in the record compared to the schema.
The schema expects an integer for 'age', but the record provides the string 'twenty', so validation fails and validate() returns False.
Assume you read a Parquet file with this Python snippet using PyArrow:
import pyarrow.parquet as pq
pq_file = pq.ParquetFile('data.parquet')
num_rows = pq_file.metadata.num_rows
print(num_rows)

If the file contains 3 row groups with 1000, 1500, and 500 rows respectively, what is the output?
Remember that the total row count is the sum of the rows in all row groups.
metadata.num_rows returns the total number of rows across all row groups in the Parquet file: 1000 + 1500 + 500 = 3000.
Consider this Python code snippet using pyorc to read an ORC file:
import pyorc
with open('data.orc', 'rb') as file:
    reader = pyorc.Reader(file)
    for row in reader:
        print(row)

The code raises a ValueError: 'Invalid ORC file'. What is the most likely cause?
Check the file format and integrity before reading.
The ValueError 'Invalid ORC file' usually means the file is not in valid ORC format (for example, data in another format that was renamed to .orc) or is corrupted, so pyorc cannot parse it.
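A cheap sanity check before handing a file to pyorc is to look for the ORC magic bytes: per the ORC specification, files begin with the 3-byte string "ORC". A minimal sketch (looks_like_orc is a hypothetical helper name):

```python
def looks_like_orc(path: str) -> bool:
    """Return True if the file starts with the ORC magic bytes b'ORC'."""
    with open(path, "rb") as f:
        return f.read(3) == b"ORC"

# Example: a CSV file renamed to .orc fails the check.
with open("not_really.orc", "w") as f:
    f.write("name,age\nAlice,30\n")

print(looks_like_orc("not_really.orc"))  # False
```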
Among Avro, Parquet, and ORC, which format is designed to handle schema changes like adding or removing fields without breaking existing data?
Think about which format stores schema with data and supports forward/backward compatibility.
Avro stores the writer's schema with the data and supports schema evolution well, allowing fields to be added or removed (with defaults) without breaking compatibility.
You have a large dataset with many repeated string values in one column. You want to serialize it efficiently using Hadoop-compatible formats. Which format will likely produce the smallest file size?
Consider columnar formats and encoding techniques for repeated values.
Parquet is a columnar format that supports dictionary encoding, which stores each distinct string once and replaces occurrences with small integer indices. Combined with a codec such as Snappy, it will likely produce smaller files than row-oriented Avro, or than ORC when ORC's compression is not enabled.