Compression codecs shrink data so it takes less disk space and moves faster over the network, which matters when working with big data in Hadoop.
Compression codecs (Snappy, LZO, Gzip) in Hadoop
Introduction
When you want to save disk space by storing data in a smaller size.
When you need to speed up data transfer between Hadoop nodes.
When you want to reduce the time to read and write large files.
When you want to balance between compression speed and compression ratio.
When you want to choose a codec that works well with your data type and processing needs.
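The speed-versus-ratio trade-off above can be seen locally with Python's zlib module, which implements DEFLATE, the same algorithm Gzip uses. This is only an illustration, not Hadoop itself, and the sample payload is made up:

```python
import time
import zlib

# Made-up repetitive payload, similar in spirit to log data.
data = b"Hadoop stores large volumes of repetitive log data.\n" * 2000

# Lower levels compress faster; higher levels compress smaller.
for level in (1, 6, 9):
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    ratio = len(data) / len(compressed)
    print(f"level={level} size={len(compressed)} ratio={ratio:.1f} time={elapsed:.4f}s")
```

Level 1 behaves like a fast codec such as Snappy (quick, larger output), while level 9 behaves like Gzip at its strongest (slow, smallest output).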
Syntax
Hadoop
hadoop jar your-job.jar -D mapreduce.output.fileoutputformat.compress=true \
-D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
-D mapreduce.output.fileoutputformat.compress.type=BLOCK \
input_path output_path
You set compression codecs in the Hadoop job configuration using codec class names.
Common codecs are Snappy, LZO, and Gzip, each with different speed and compression levels.
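The -D properties shown above can also be assembled programmatically before launching a job. A minimal sketch; build_hadoop_command, the jar name, and the HDFS paths are hypothetical placeholders:

```python
# Sketch: build the `hadoop jar` command line with output compression enabled.
# The jar name and HDFS paths are hypothetical placeholders.
def build_hadoop_command(jar, codec_class, input_path, output_path):
    return [
        "hadoop", "jar", jar,
        "-D", "mapreduce.output.fileoutputformat.compress=true",
        "-D", f"mapreduce.output.fileoutputformat.compress.codec={codec_class}",
        "-D", "mapreduce.output.fileoutputformat.compress.type=BLOCK",
        input_path, output_path,
    ]

cmd = build_hadoop_command(
    "your-job.jar",
    "org.apache.hadoop.io.compress.SnappyCodec",
    "/user/test/input",
    "/user/test/output",
)
print(" ".join(cmd))
# To actually launch the job (requires a Hadoop client):
# subprocess.run(cmd, check=True)
```

Swapping the codec_class argument is all it takes to switch between Snappy, LZO, and Gzip.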
Examples
This sets the codec to Snappy, which is very fast with moderate compression, making it a good default for intermediate data and latency-sensitive jobs.
Hadoop
mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec
This sets the codec to LZO, which is fast with moderate compression and, once indexed, produces splittable files, but needs extra setup.
Hadoop
mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzoCodec
This sets the codec to Gzip, which compresses very well but is slower.
Hadoop
mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec
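Because GzipCodec writes standard gzip streams, job output compressed with it can be read back by ordinary tools. A local sketch using Python's gzip module; the file contents and temporary path are only illustrative:

```python
import gzip
import os
import tempfile

# Write gzip data in the same format GzipCodec produces, then read it back.
text = "part-r-00000 style output line\n" * 3

path = os.path.join(tempfile.mkdtemp(), "sample.txt.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    f.write(text)

with gzip.open(path, "rt", encoding="utf-8") as f:
    restored = f.read()

print(restored == text)  # True: contents survive the roundtrip
```

The same interoperability holds in the other direction: a .gz file produced outside Hadoop can be consumed by a job using GzipCodec.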
Sample Program
This Python example writes a text file compressed with Gzip to Hadoop HDFS, then reads it back and prints the content. Note that pydoop's hdfs.open does not compress transparently, so the data is compressed and decompressed with Python's gzip module.
Python
import gzip
from pydoop import hdfs

# Write a Gzip-compressed text file to Hadoop HDFS.
file_path = '/user/test/sample.txt.gz'
text = 'Hello Hadoop compression codecs!\n' * 5
with hdfs.open(file_path, 'wb') as f:
    f.write(gzip.compress(text.encode('utf-8')))

# Read the file back and decompress it.
with hdfs.open(file_path, 'rb') as f:
    content = gzip.decompress(f.read()).decode('utf-8')
print(content)
Output
Hello Hadoop compression codecs!
Hello Hadoop compression codecs!
Hello Hadoop compression codecs!
Hello Hadoop compression codecs!
Hello Hadoop compression codecs!
Important Notes
Snappy is fast but compresses less than Gzip.
LZO requires installing native libraries on all Hadoop nodes.
Gzip compresses best but uses more CPU and is slower.
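The trade-offs above can be condensed into a simple selection rule. A minimal sketch; choose_codec and its priority labels are hypothetical, while the codec class names are the ones used earlier on this page:

```python
# Hypothetical helper mapping a priority to a codec class name,
# following the trade-offs listed in the notes above.
CODECS = {
    "speed": "org.apache.hadoop.io.compress.SnappyCodec",
    "balanced": "com.hadoop.compression.lzo.LzoCodec",  # needs native libraries
    "ratio": "org.apache.hadoop.io.compress.GzipCodec",
}

def choose_codec(priority):
    """Return the codec class name for a priority: speed, balanced, or ratio."""
    try:
        return CODECS[priority]
    except KeyError:
        raise ValueError(f"unknown priority: {priority!r}")

print(choose_codec("speed"))  # org.apache.hadoop.io.compress.SnappyCodec
```

In practice the choice also depends on cluster setup, for example whether the LZO native libraries are installed.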
Summary
Compression codecs reduce data size to save space and speed up Hadoop jobs.
Snappy, LZO, and Gzip are popular codecs with different speed and compression trade-offs.
Choose the codec based on your need for speed, compression ratio, and cluster setup.