Data serialization (Avro, Parquet, ORC) in Hadoop - Time & Space Complexity
When working with big data, we often store data in serialization formats such as Avro (row-oriented) or Parquet and ORC (columnar). Understanding how long it takes to read or write these formats helps us plan and speed up data processing.
We want to know: how does the time to serialize or deserialize data grow as the data size grows?
Analyze the time complexity of the following Hadoop code snippet, which writes data using the Parquet format.
Job job = Job.getInstance(conf);
job.setOutputFormatClass(ParquetOutputFormat.class);
ParquetOutputFormat.setWriteSupportClass(job, AvroWriteSupport.class);
ParquetOutputFormat.setCompression(job, CompressionCodecName.SNAPPY);
// Mapper writes records to Parquet files
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    // Convert input to an Avro record (conversion elided here; the result is avroRecord)
    // Write the record to ParquetOutputFormat; the key is null because the output format uses only the value
    context.write(null, avroRecord);
}
This code writes many records into Parquet files using an Avro schema inside a Hadoop MapReduce job.
Look for the repeated actions that take the most time.
- Primary operation: Writing each record to the Parquet file inside the map function.
- How many times: Once for every input record, so the number of records (n).
As the number of records grows, the time to write grows too.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | 10 writes |
| 100 | 100 writes |
| 1000 | 1000 writes |
Pattern observation: The time grows roughly in direct proportion to the number of records.
Time Complexity: O(n)
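This means the total time to serialize data grows linearly with the number of records n, assuming each record has a bounded size so that encoding and compressing one record takes roughly constant time. Parquet buffers records in memory and flushes them in row groups, so the cost of an individual write call varies, but the total work is still proportional to n.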
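The linear pattern in the table can be reproduced with a small, self-contained sketch. It does not use Hadoop or Parquet. `CountingRecordWriter` and `LinearCostDemo` are hypothetical names introduced only for illustration, and each record is assumed to be a fixed 12 bytes so that the per-record cost is constant.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;

// Hypothetical stand-in for a record writer; not part of Hadoop or Parquet.
class CountingRecordWriter {
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    private long writes = 0;

    // Serialize one fixed-size record: an 8-byte id followed by a 4-byte value.
    void write(long id, int value) {
        byte[] bytes = ByteBuffer.allocate(12).putLong(id).putInt(value).array();
        buffer.write(bytes, 0, bytes.length);
        writes++;
    }

    long writes() { return writes; }

    int bytesWritten() { return buffer.size(); }
}

class LinearCostDemo {
    // Write n records and return how many write calls were performed.
    static long countWrites(int n) {
        CountingRecordWriter writer = new CountingRecordWriter();
        for (int i = 0; i < n; i++) {
            writer.write(i, i * 2);
        }
        return writer.writes();
    }

    // Write n records and return the total number of serialized bytes.
    static int countBytes(int n) {
        CountingRecordWriter writer = new CountingRecordWriter();
        for (int i = 0; i < n; i++) {
            writer.write(i, i * 2);
        }
        return writer.bytesWritten();
    }

    public static void main(String[] args) {
        for (int n : new int[] {10, 100, 1000}) {
            System.out.println(n + " records -> " + countWrites(n) + " writes, "
                    + countBytes(n) + " bytes");
        }
    }
}
```

Both the number of write calls and the number of bytes scale by the same factor as n, which is the O(n) behavior described above. A real Parquet writer adds encoding and compression, so its constant factor is larger, but the shape of the growth is the same under the bounded-record-size assumption.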
[X] Wrong: "Serialization time stays the same no matter how much data we write."
[OK] Correct: Each record must be processed and written, so more data means more work and more time.
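The point that every record must be processed holds even when writes are grouped. The sketch below is a toy model, not a Hadoop or Parquet API: `BatchingWriter` and `BatchingDemo` are hypothetical names, and "encoding" is reduced to a counter. It assumes records are buffered and flushed once a batch is full, which loosely resembles how a columnar writer accumulates records before writing a group.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical batching writer used only to count work; not a Hadoop or Parquet API.
class BatchingWriter {
    private final int batchSize;
    private final List<String> pending = new ArrayList<>();
    private long recordsEncoded = 0;
    private long flushes = 0;

    BatchingWriter(int batchSize) {
        this.batchSize = batchSize;
    }

    // Buffer one record and flush when the batch is full.
    void write(String record) {
        pending.add(record);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    // Flush the buffered records; every buffered record is still encoded once.
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        recordsEncoded += pending.size();
        flushes++;
        pending.clear();
    }

    long recordsEncoded() { return recordsEncoded; }

    long flushes() { return flushes; }
}

class BatchingDemo {
    // Returns {records encoded, flushes} after writing n records.
    static long[] run(int n, int batchSize) {
        BatchingWriter writer = new BatchingWriter(batchSize);
        for (int i = 0; i < n; i++) {
            writer.write("record-" + i);
        }
        writer.flush(); // flush any partial final batch
        return new long[] {writer.recordsEncoded(), writer.flushes()};
    }

    public static void main(String[] args) {
        long[] unbatched = run(1000, 1);
        long[] batched = run(1000, 100);
        System.out.println("batch size 1:   encoded=" + unbatched[0] + ", flushes=" + unbatched[1]);
        System.out.println("batch size 100: encoded=" + batched[0] + ", flushes=" + batched[1]);
    }
}
```

In this model, a batch size of b reduces the number of flushes from n to about n/b, which lowers the fixed overhead per flush, but the number of records encoded stays at n. Batching therefore improves the constant factor while the total time remains O(n).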
Knowing how serialization time grows helps you explain data pipeline speed and design better systems. It shows you understand how data size affects processing time.
"What if we changed from writing data record-by-record to writing in batches? How would the time complexity change?"