
Compression (gzip, snappy, lz4) in Kafka - Deep Dive

Overview - Compression (gzip, snappy, lz4)
What is it?
Compression in Kafka means making data smaller before sending it over the network or saving it to disk. It uses algorithms like gzip, snappy, and lz4 to shrink message sizes. This helps Kafka move data faster and use less storage. Each compression type balances speed and size differently.
Why it matters
Without compression, Kafka would send and store much larger messages, slowing down data flow and increasing costs. Compression saves bandwidth and storage, making Kafka more efficient and cheaper to run. It also helps keep systems responsive when handling lots of data.
Where it fits
Before learning compression, you should understand Kafka basics like producers, consumers, and topics. After compression, you can explore Kafka performance tuning and monitoring to see how compression affects throughput and latency.
Mental Model
Core Idea
Compression shrinks data to move and store it more efficiently by trading off CPU work for smaller size.
Think of it like...
Compression is like packing a suitcase tightly before a trip so you can carry more clothes in less space.
┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│ Original Data │───▶ │  Compression  │───▶ │ Smaller Data  │
└───────────────┘     └───────────────┘     └───────────────┘
       │                                         │
       ▼                                         ▼
  Larger size                               Less bandwidth
  More storage                             Faster transfer
Build-Up - 7 Steps
1. Foundation - What is Data Compression
Concept: Compression reduces the size of data by encoding it more efficiently.
Data compression uses algorithms to find patterns and remove redundancy in data. This means the same information takes fewer bytes. For example, repeating words or numbers can be stored once with a count instead of many times.
Result
Data size becomes smaller, saving space and transfer time.
Understanding compression basics helps you see why smaller data moves faster and costs less to store.
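The idea of replacing repetition with a count can be sketched with a toy run-length encoder. This is illustrative only; real codecs like gzip, snappy, and lz4 use far more sophisticated dictionary and entropy coding, but the core principle is the same: redundancy becomes shorter references.

```python
# Toy run-length encoder: stores each repeated character once with a count.
# Illustrative only -- not how gzip/snappy/lz4 actually work internally.

def rle_encode(text: str) -> str:
    """Encode runs of repeated characters as <char><count>."""
    if not text:
        return ""
    out = []
    prev, count = text[0], 1
    for ch in text[1:]:
        if ch == prev:
            count += 1
        else:
            out.append(f"{prev}{count}")
            prev, count = ch, 1
    out.append(f"{prev}{count}")
    return "".join(out)

encoded = rle_encode("aaaaabbbbbbcccc")
print(encoded)  # a5b6c4 -- 6 characters instead of 15
```

The same information now takes fewer bytes, which is exactly the trade Kafka makes on every batch.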
2. Foundation - Kafka Message Flow Basics
Concept: Kafka sends messages from producers to brokers and then to consumers.
In Kafka, producers create messages and send them to topics on brokers. Consumers read messages from these topics. Messages can be large or small, and many messages flow continuously.
Result
Kafka moves data between systems reliably and quickly.
Knowing how Kafka moves messages sets the stage for why compression matters in this flow.
3. Intermediate - How Compression Works in Kafka
🤔 Before reading on: do you think Kafka compresses each message individually or in batches? Commit to your answer.
Concept: Kafka compresses batches of messages together, not single messages.
Kafka groups messages into batches before sending. Compression algorithms then compress the whole batch as one unit. This improves compression efficiency because patterns across messages can be found and reduced.
Result
Batches become smaller, reducing network and storage use.
Knowing Kafka compresses batches explains why batch size affects compression effectiveness.
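A small sketch shows why batching matters: neighboring messages usually share structure, and a codec can only exploit that redundancy when it sees the messages together. The message contents below are made up for illustration, and gzip stands in for whichever codec the producer uses.

```python
import gzip

# Messages with shared structure, as event streams typically have.
messages = [
    f'{{"user_id": {i}, "event": "page_view", "url": "/products/{i % 10}"}}'.encode()
    for i in range(100)
]

# Compressing each message on its own: per-message codec overhead,
# and no cross-message redundancy is exploited.
per_message = sum(len(gzip.compress(m)) for m in messages)

# Compressing the whole batch as one unit, as a Kafka producer does.
batch = len(gzip.compress(b"\n".join(messages)))

print(f"per-message total: {per_message} bytes, batched: {batch} bytes")
```

The batched result is dramatically smaller, which is why larger `batch.size` and `linger.ms` settings tend to improve compression ratios.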
4. Intermediate - Comparing gzip, snappy, and lz4
🤔 Before reading on: which do you think is fastest: gzip, snappy, or lz4? Commit to your answer.
Concept: Different compression algorithms balance speed and compression ratio differently.
gzip compresses well but is slower. Snappy is very fast but compresses less. LZ4 is also very fast and compresses better than snappy but not as well as gzip. Choosing depends on whether speed or size matters more.
Result
You can pick the right compression for your Kafka workload needs.
Understanding these trade-offs helps optimize Kafka performance and resource use.
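The snappy and lz4 bindings for Python are third-party packages, but the same speed-versus-ratio trade-off can be seen within the standard library by varying zlib's compression level: a low level behaves like snappy/lz4 (fast, larger output), a high level behaves like gzip (slow, smaller output). The sample data is made up for illustration.

```python
import time
import zlib

# Repetitive event-like data, so compression has something to work with.
data = b'{"user_id": 42, "event": "page_view"}\n' * 5000

# Level 1 ~ fast/larger (snappy/lz4-like); level 9 ~ slow/smaller (gzip-like).
for level in (1, 9):
    start = time.perf_counter()
    out = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    ratio = len(data) / len(out)
    print(f"level {level}: {elapsed * 1000:.2f} ms, ratio {ratio:.1f}x")
```

Run on your own data: the "best" codec is the one whose CPU cost and output size fit your workload, not the one with the biggest ratio.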
5. Intermediate - Configuring Compression in Kafka Producers
Concept: Kafka producers can be set to use a compression type when sending messages.
In the producer configuration, you set 'compression.type' to 'gzip', 'snappy', or 'lz4'. This tells Kafka which algorithm to use when compressing message batches before sending to brokers.
Result
Messages sent by the producer are compressed accordingly.
Knowing how to configure compression lets you control Kafka's efficiency and speed.
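As a sketch, a producer configuration enabling lz4 might look like the fragment below. The values are illustrative, not recommendations; compression.type, batch.size, and linger.ms are standard Kafka producer settings.

```properties
# producer.properties -- illustrative values, tune for your workload
compression.type=lz4    # one of: none, gzip, snappy, lz4, zstd
batch.size=16384        # bytes per batch; larger batches compress better
linger.ms=10            # wait up to 10 ms to fill a batch before sending
```

batch.size and linger.ms matter here because compression is applied per batch: fuller batches give the codec more redundancy to remove.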
6. Advanced - Impact of Compression on Kafka Performance
🤔 Before reading on: does compression always improve Kafka throughput? Commit to your answer.
Concept: Compression can improve throughput but adds CPU overhead and latency.
Compression reduces data size, so less network bandwidth is used and brokers store less data. This can increase throughput. However, compressing and decompressing data uses CPU, which can add latency and reduce performance if CPU is limited.
Result
Compression improves performance when CPU is sufficient but can hurt if CPU is a bottleneck.
Balancing CPU and network resources is key to effective Kafka compression use.
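A back-of-envelope model makes the trade-off concrete: shipping a batch costs compression CPU time plus network transfer time. All numbers below are made up for illustration; the point is the crossover, not the specific figures.

```python
def send_time_ms(batch_bytes, bandwidth_mbps, ratio=1.0, cpu_ms=0.0):
    """Time to ship one batch: compress on CPU, then push bytes over the wire."""
    wire_bytes = batch_bytes / ratio
    network_ms = wire_bytes * 8 / (bandwidth_mbps * 1000)  # Mbps -> bits per ms
    return cpu_ms + network_ms

batch = 1_000_000  # 1 MB batch

# On a constrained 100 Mbps link, paying 10 ms of CPU for 4x compression wins.
slow_plain = send_time_ms(batch, bandwidth_mbps=100)
slow_comp = send_time_ms(batch, bandwidth_mbps=100, ratio=4.0, cpu_ms=10.0)
print(f"100 Mbps: {slow_plain:.0f} ms uncompressed vs {slow_comp:.0f} ms compressed")

# On a fast 10 Gbps link, the same CPU cost dominates and compression loses.
fast_plain = send_time_ms(batch, bandwidth_mbps=10_000)
fast_comp = send_time_ms(batch, bandwidth_mbps=10_000, ratio=4.0, cpu_ms=10.0)
print(f"10 Gbps: {fast_plain:.2f} ms uncompressed vs {fast_comp:.2f} ms compressed")
```

The same codec that helps on a saturated network can hurt on a fast one, which is why CPU and network metrics both belong in compression tuning.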
7. Expert - Kafka Compression Internals and Broker Handling
🤔 Before reading on: do Kafka brokers decompress messages when storing or forwarding? Commit to your answer.
Concept: Kafka brokers store and forward compressed data without decompressing it.
Kafka brokers normally treat compressed batches as opaque blobs: they do not decompress or recompress messages (the exception is when a topic's compression.type is set to a different codec than the producer used, which forces the broker to recompress). Consumers decompress batches after receiving them. This design reduces broker CPU load and speeds up message forwarding.
Result
Brokers efficiently handle compressed data, improving cluster scalability.
Knowing brokers don't decompress explains why compression saves CPU cluster-wide and why consumers must support decompression.
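The end-to-end path can be modeled in a few lines: the producer compresses a batch, the "broker" stores the bytes untouched, and only the consumer decompresses. This is a toy model, not Kafka's actual binary log format; gzip stands in for any codec.

```python
import gzip

messages = [b"order created", b"order paid", b"order shipped"]

# Producer side: group into a batch, then compress the batch as one unit.
compressed_batch = gzip.compress(b"\n".join(messages))

# Broker side: the batch is an opaque blob -- stored and forwarded as-is.
broker_log = [compressed_batch]
forwarded = broker_log[0]
assert forwarded is compressed_batch  # no decompression, no CPU spent here

# Consumer side: only here is the batch decompressed back into messages.
received = gzip.decompress(forwarded).split(b"\n")
print(received == messages)  # True
```

Because the broker never touches the payload, every consumer must support whichever codec the producer chose.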
Under the Hood
Compression algorithms scan data to find repeated patterns or redundant information. They replace these with shorter codes or references. Kafka applies compression to batches of messages, producing a compressed byte stream. Brokers store and forward this stream without unpacking it. Consumers decompress the stream to get original messages.
Why designed this way?
Kafka compresses batches to maximize compression efficiency and reduce overhead. Brokers avoid decompressing to save CPU and keep message flow fast. This design balances resource use across producers, brokers, and consumers. Alternatives like decompressing at brokers would slow the system and increase complexity.
Producer ──▶ Compress Batch ──▶ Broker (store compressed) ──▶ Consumer ──▶ Decompress Batch

[Producer] --(batch)--> [Compression Algorithm] --(compressed batch)--> [Broker]

[Broker] --(compressed batch)--> [Consumer] --(decompressed messages)--> [Application]
Myth Busters - 4 Common Misconceptions
Quick: Does Kafka compress each message individually or in batches? Commit to your answer.
Common Belief: Kafka compresses each message separately before sending.
Reality: Kafka compresses entire batches of messages together, not individual messages.
Why it matters: Thinking compression is per message leads to misunderstanding batch size effects and poor tuning decisions.
Quick: Is gzip always the best compression for Kafka? Commit to your answer.
Common Belief: gzip is always the best because it compresses data the most.
Reality: gzip compresses well but is slower; snappy and lz4 are faster with less compression, better for real-time needs.
Why it matters: Choosing gzip blindly can cause high CPU use and latency, hurting Kafka performance.
Quick: Do Kafka brokers decompress messages when forwarding? Commit to your answer.
Common Belief: Brokers decompress and recompress messages to optimize storage.
Reality: Brokers store and forward compressed data as-is without decompressing.
Why it matters: Expecting brokers to decompress can cause confusion about CPU usage and message flow.
Quick: Does compression always improve Kafka throughput? Commit to your answer.
Common Belief: Compression always makes Kafka faster by reducing data size.
Reality: Compression helps if CPU is available; if CPU is limited, compression overhead can reduce throughput.
Why it matters: Ignoring CPU costs can lead to worse performance despite smaller data.
Expert Zone
1. Kafka compression effectiveness depends heavily on batch size; small batches compress poorly.
2. The choice of codec affects not just network use but also CPU load, and it hits producers (compressing) and consumers (decompressing) differently.
3. Some Kafka clients support additional compression codecs (like Zstd) which offer better trade-offs but require compatible consumers.
When NOT to use
Avoid compression when messages are very small (codec overhead can outweigh the savings) or when latency is critical and CPU is limited; in those cases, send uncompressed data or use a fast codec like lz4. For very large messages, consider compressing them externally before sending to Kafka.
Production Patterns
In production, teams often use lz4 for low latency and snappy for balanced speed and compression. gzip is used when storage savings are more important than speed. Monitoring CPU and network metrics guides compression tuning.
Connections
Data Serialization
Compression often works together with serialization to reduce message size.
Understanding serialization formats like Avro or Protobuf helps optimize compression by structuring data efficiently before compression.
Network Bandwidth Optimization
Compression reduces data size, directly impacting bandwidth usage.
Knowing network limits and costs helps decide when and how much to compress Kafka messages.
Human Language Compression (Linguistics)
Both use patterns and redundancy removal to convey information efficiently.
Recognizing that compression algorithms mimic how humans shorten language reveals universal principles of efficient communication.
Common Pitfalls
#1 Setting compression without considering batch size.
Wrong approach (producer.properties):
  compression.type=lz4
  batch.size=1
Correct approach (producer.properties):
  compression.type=lz4
  batch.size=16384
Root cause: Small batches compress poorly, wasting CPU while forfeiting the network and storage benefits.
#2 Using gzip compression on CPU-limited systems.
Wrong approach (producer.properties):
  compression.type=gzip
Correct approach (producer.properties):
  compression.type=snappy
Root cause: gzip uses more CPU, causing latency and throughput drops on limited hardware.
#3 Assuming brokers decompress messages.
Wrong approach: Expecting broker logs to show decompressed data or CPU spikes from decompression.
Correct approach: Understand that brokers forward compressed batches as-is; decompression happens only on consumers.
Root cause: Misunderstanding Kafka's design leads to wrong troubleshooting and performance expectations.
Key Takeaways
Compression in Kafka reduces message size by encoding batches efficiently, saving bandwidth and storage.
Kafka compresses batches, not individual messages, which improves compression ratio but requires tuning batch size.
Choosing the right compression codec balances speed and size: gzip compresses best but is slowest, while snappy and lz4 are faster with lower compression ratios.
Kafka brokers do not decompress messages, which saves CPU and speeds message forwarding; consumers handle decompression.
Compression improves throughput only if CPU resources are sufficient; otherwise, it can add latency and reduce performance.