
Data serialization (Avro, Parquet, ORC) in Hadoop - Time & Space Complexity

Time Complexity: Data serialization (Avro, Parquet, ORC)
O(n)
Understanding Time Complexity

When working with big data, we often save data in special formats like Avro, Parquet, or ORC. Understanding how long it takes to read or write these formats helps us plan and speed up data processing.

We want to know: how does the time to serialize or deserialize data grow as the data size grows?

Scenario Under Consideration

Analyze the time complexity of the following Hadoop code snippet that writes data using Parquet format.


    // Job configuration: write Parquet files with Snappy compression,
    // using Avro records as the in-memory representation.
    Job job = Job.getInstance(conf);
    job.setOutputFormatClass(ParquetOutputFormat.class);
    ParquetOutputFormat.setWriteSupportClass(job, AvroWriteSupport.class);
    ParquetOutputFormat.setCompression(job, CompressionCodecName.SNAPPY);

    // Mapper: performs one write per input record
    public static class ParquetMapper extends Mapper<LongWritable, Text, Void, GenericRecord> {
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // toAvroRecord (not shown) converts the input line into an Avro record
            GenericRecord avroRecord = toAvroRecord(value);
            // The output format serializes the record into the Parquet file
            context.write(null, avroRecord);
        }
    }
    

This code writes many records into Parquet files using an Avro schema inside a Hadoop MapReduce job.
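To make the per-record cost concrete, here is a minimal plain-Java stand-in for what the mapper does on each record: parse the input line, encode it, and append it to the output. `RecordSerializer` and its pipe-delimited "encoding" are illustrative inventions, not part of the Parquet or Avro APIs.

```java
// Simplified stand-in for the per-record work in the mapper above.
public class RecordSerializer {
    private final StringBuilder sink = new StringBuilder();

    // Called once per input record: parse, encode, append.
    public void write(String line) {
        String[] fields = line.split(",");                  // parse the input line
        sink.append(String.join("|", fields)).append('\n'); // stand-in for Avro encoding + Parquet write
    }

    public int charsWritten() {
        return sink.length();
    }
}
```

Each call does a bounded amount of work per record, which is why the total cost scales with the number of calls.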

Identify Repeating Operations

Look for repeated actions that take most time.

  • Primary operation: Writing each record to the Parquet file inside the map function.
  • How many times: Once per input record, so n times for n records.
How Execution Grows With Input

As the number of records grows, the time to write grows too.

  Input Size (n)    Approx. Operations
  10                10 writes
  100               100 writes
  1000              1000 writes

Pattern observation: The time grows roughly in direct proportion to the number of records.
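The table above can be reproduced by simply counting the dominant operation. This small sketch (my own illustration, not Hadoop code) simulates serializing n records and counts the write operations performed:

```java
// Count the dominant operation: one write per record, so n records -> n writes.
public class WriteCounter {
    // Simulate serializing n records; return how many write operations occurred.
    public static long countWrites(long n) {
        long writes = 0;
        for (long i = 0; i < n; i++) {
            writes++; // one Parquet write per record in the mapper
        }
        return writes;
    }
}
```

For n = 10, 100, and 1000 the counter returns exactly 10, 100, and 1000 writes, matching the table: linear growth, O(n).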

Final Time Complexity

Time Complexity: O(n)

This means the time to serialize data grows linearly with the number of records.

Common Mistake

[X] Wrong: "Serialization time stays the same no matter how much data we write."

[OK] Correct: Each record must be processed and written, so more data means more work and more time.

Interview Connect

Knowing how serialization time grows helps you explain data pipeline speed and design better systems. It shows you understand how data size affects processing time.

Self-Check

"What if we changed from writing data record-by-record to writing in batches? How would the time complexity change?"