
Compression codecs (Snappy, LZO, Gzip) in Hadoop - Step-by-Step Execution

Concept Flow - Compression codecs (Snappy, LZO, Gzip)
Input Data
Choose Codec: Snappy / LZO / Gzip
Compress Data
Store Compressed Data
Read Compressed Data
Decompress Data
Output Original Data
Data flows from the input through a chosen compression codec, is stored in compressed form, and is later read back and decompressed to recover the original data.
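The flow above can be sketched in Python. Snappy and LZO require third-party bindings, so this sketch uses the standard library's gzip codec as a stand-in; the round trip is the same for any codec.

```python
import gzip

# 1. Input data
input_data = b"Example data to compress"

# 2-3. Choose a codec and compress (gzip here, from the standard library)
compressed = gzip.compress(input_data)

# 4. Store the compressed bytes
with open("data.gz", "wb") as f:
    f.write(compressed)

# 5. Read the compressed bytes back
with open("data.gz", "rb") as f:
    stored = f.read()

# 6-7. Decompress with the same codec to recover the original
decompressed = gzip.decompress(stored)
assert decompressed == input_data
```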
Execution Sample
# Requires the python-snappy package: pip install python-snappy
import snappy

input_data = b'Example data to compress'
compressed = snappy.compress(input_data)      # binary, not human-readable
decompressed = snappy.decompress(compressed)  # must use the same codec
print(decompressed)
This code compresses a byte string with the Snappy codec and then decompresses it back to the original.
Execution Table
Step | Action | Input | Codec | Output | Notes
1 | Start with input data | 'Example data to compress' | - | 'Example data to compress' | Initial raw data
2 | Select codec | - | Snappy | - | Snappy chosen for compression
3 | Compress data | 'Example data to compress' | Snappy | b'compressed_bytes_snappy' | Data compressed using Snappy
4 | Store compressed data | b'compressed_bytes_snappy' | - | Stored compressed data | Data saved in compressed form
5 | Read compressed data | Stored compressed data | - | b'compressed_bytes_snappy' | Compressed data loaded
6 | Decompress data | b'compressed_bytes_snappy' | Snappy | 'Example data to compress' | Data decompressed back to original
7 | Output data | 'Example data to compress' | - | 'Example data to compress' | Final output matches input
💡 Process ends after decompression returns original data
Variable Tracker
Variable | Start | After Compression | After Storage | After Read | After Decompression | Final
input_data | 'Example data to compress' | 'Example data to compress' | 'Example data to compress' | 'Example data to compress' | 'Example data to compress' | 'Example data to compress'
codec | - | Snappy | Snappy | Snappy | Snappy | Snappy
compressed | - | b'compressed_bytes_snappy' | b'compressed_bytes_snappy' | b'compressed_bytes_snappy' | - | -
decompressed | - | - | - | - | 'Example data to compress' | 'Example data to compress'
Key Moments - 3 Insights
Why does the compressed data look like unreadable bytes?
Compressed data is binary and optimized for size, not readability. See step 3 in execution_table where output is shown as bytes.
Can we decompress data with a different codec than used for compression?
No, decompression must use the same codec as compression to restore original data. See step 6 where Snappy is used again.
Why do we store compressed data instead of original?
Storing compressed data saves space and speeds up data transfer. See step 4 where compressed data is stored.
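The first two insights can be demonstrated with the standard library's gzip and zlib codecs (stand-ins, since Snappy bindings are third-party): the compressed output is binary, and a different codec cannot decompress it.

```python
import gzip
import zlib

data = b"Example data to compress"
compressed = gzip.compress(data)

# Compressed output is binary: gzip streams start with the magic bytes 1f 8b,
# not readable text
print(compressed[:4])

# Decompressing with a different codec fails: zlib expects a zlib header,
# not a gzip one, so it rejects the stream
try:
    zlib.decompress(compressed)
except zlib.error as e:
    print("wrong codec:", e)

# The same codec restores the original
assert gzip.decompress(compressed) == data
```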
Visual Quiz - 3 Questions
Test your understanding
Look at the execution_table: what is the output after step 3 (Compress data)?
A. Compressed binary bytes
B. Original readable string
C. Decompressed string
D. Empty data
💡 Hint
Check the Output column at step 3 in execution_table
At which step does the data get restored to its original form?
A. Step 3
B. Step 5
C. Step 6
D. Step 4
💡 Hint
Look for decompression action and output matching original data in execution_table
If we change codec from Snappy to Gzip, what changes in the execution_table?
A. Decompressed data changes
B. Only the codec column changes, and the compressed output changes accordingly
C. Input data changes
D. No changes at all
💡 Hint
Codec choice affects compression and decompression steps, see codec column in execution_table
Concept Snapshot
Compression codecs reduce data size for storage and transfer.
Common codecs: Snappy (very fast, moderate compression), LZO (fast, splittable when indexed), Gzip (slower, higher compression).
Process: input data → compress → store → read → decompress → original data.
Use same codec for compress and decompress.
Compressed data is binary and not human-readable.
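The speed-versus-ratio trade-off in the snapshot can be seen by compressing the same payload with several codecs. Snappy and LZO need third-party bindings, so this sketch compares the standard library's gzip, bz2, and lzma codecs, which illustrate the same trade-off; the payload is a made-up repetitive string.

```python
import bz2
import gzip
import lzma

# Hypothetical payload: repetitive text compresses well
data = b"Example data to compress " * 1000

for name, codec in (("gzip", gzip), ("bz2", bz2), ("lzma", lzma)):
    compressed = codec.compress(data)
    # Round trip only works with the same codec that compressed the data
    assert codec.decompress(compressed) == data
    print(f"{name}: {len(data)} -> {len(compressed)} bytes")
```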
Full Transcript
Compression codecs like Snappy, LZO, and Gzip help reduce data size in Hadoop. The process starts with input data, which is compressed using a chosen codec. The compressed data is stored and later read back. To get the original data, the compressed data is decompressed using the same codec. Compressed data appears as unreadable bytes because it is binary. Using the wrong codec for decompression will fail to restore the original data. This process saves storage space and speeds up data transfer.
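In Hadoop itself, the codec is usually selected through job configuration rather than in application code. As an illustrative (not exhaustive) example, a mapred-site.xml fragment like the following enables Snappy compression for intermediate map output:

```xml
<!-- Illustrative fragment: compress intermediate map output with Snappy -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```

Swapping in org.apache.hadoop.io.compress.GzipCodec (or another available codec class) changes the codec without any code changes; Hadoop reads the codec from the stored file's metadata when decompressing.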