Given this Avro schema and a JSON record, what will the validation output be?
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"}
  ]
}
Record: {"name": "Alice", "age": "twenty"}

from fastavro import parse_schema, validate

schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"}
    ]
}

parsed_schema = parse_schema(schema)
record = {"name": "Alice", "age": "twenty"}

# raise_errors=False makes validate() return a boolean instead of
# raising a ValidationError on the first mismatch.
is_valid = validate(record, parsed_schema, raise_errors=False)
print(is_valid)
Check the data types in the record compared to the schema.
The schema expects an integer for 'age', but the record provides the string 'twenty', so validation fails and validate() returns False.
Assume you read a Parquet file with this Python snippet using PyArrow:
import pyarrow.parquet as pq
pq_file = pq.ParquetFile('data.parquet')
num_rows = pq_file.metadata.num_rows
print(num_rows)

If the file contains 3 row groups with 1000, 1500, and 500 rows respectively, what is the output?
Remember that the total row count is the sum of the rows in all row groups.
metadata.num_rows returns the total number of rows across all row groups in the Parquet file: 1000 + 1500 + 500 = 3000.
Consider this Python code snippet using pyorc to read an ORC file:
import pyorc
with open('data.orc', 'rb') as file:
    reader = pyorc.Reader(file)
    for row in reader:
        print(row)

The code raises a ValueError: 'Invalid ORC file'. What is the most likely cause?
Check the file format and integrity before reading.
The ValueError 'Invalid ORC file' usually means the file is not in valid ORC format (for example, data in another format that was renamed to .orc) or is corrupted, so pyorc cannot parse it.
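A cheap sanity check before handing a file to pyorc is to look for the ORC magic bytes: per the ORC specification, files begin with the 3-byte string "ORC". A minimal sketch (looks_like_orc is a hypothetical helper name):

```python
def looks_like_orc(path: str) -> bool:
    """Return True if the file starts with the ORC magic bytes b'ORC'."""
    with open(path, "rb") as f:
        return f.read(3) == b"ORC"

# Example: a CSV file renamed to .orc fails the check.
with open("not_really.orc", "w") as f:
    f.write("name,age\nAlice,30\n")

print(looks_like_orc("not_really.orc"))  # False
```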
Among Avro, Parquet, and ORC, which format is designed to handle schema changes like adding or removing fields without breaking existing data?
Think about which format stores schema with data and supports forward/backward compatibility.
Avro stores the writer's schema with the data and supports schema evolution well, allowing fields to be added or removed (with defaults) without breaking compatibility.
You have a large dataset with many repeated string values in one column. You want to serialize it efficiently using Hadoop-compatible formats. Which format will likely produce the smallest file size?
Consider columnar formats and encoding techniques for repeated values.
Parquet is a columnar format that supports dictionary encoding, which stores each distinct string once and replaces occurrences with small integer indices. Combined with a codec such as Snappy, it will likely produce smaller files than row-oriented Avro, or than ORC when ORC's compression is not enabled.