Computer Vision · ML · ~15 mins

Model optimization (pruning, quantization) in Computer Vision - Deep Dive

Overview - Model optimization (pruning, quantization)
What is it?
Model optimization means making a machine learning model smaller and faster without losing much accuracy. Two common ways are pruning, which removes unnecessary parts of the model, and quantization, which uses simpler numbers to represent the model's data. These techniques help models run well on devices like phones or cameras. They keep the model smart but use less memory and power.
Why it matters
Without optimization, models can be too big and slow to use on everyday devices. This would limit AI to powerful computers only, making it hard to have smart apps on phones or cameras. Optimization lets AI work everywhere, saving energy and making devices respond faster. It also reduces costs and helps protect privacy by running AI locally instead of sending data to the cloud.
Where it fits
Before learning model optimization, you should understand how neural networks work and how models are trained. After this, you can explore advanced topics like model distillation, hardware-aware training, and deploying models on edge devices. Optimization is a key step between training a model and making it practical for real-world use.
Mental Model
Core Idea
Model optimization shrinks and speeds up AI models by cutting unneeded parts and using simpler numbers, preserving accuracy while saving memory and power.
Think of it like...
Imagine packing a suitcase for a trip: pruning is like removing clothes you won't wear, and quantization is like folding clothes tightly to save space. You still have what you need but carry less weight and fit everything easily.
┌───────────────┐       ┌───────────────┐
│ Original Model│──────▶│ Pruning       │
│ (Big & Heavy) │       │ (Remove parts)│
└───────────────┘       └───────────────┘
         │                      │
         │                      ▼
         │              ┌────────────────┐
         │              │ Quantization   │
         │              │ (Simplify nums)│
         ▼              └────────────────┘
┌────────────────┐             │
│ Optimized Model│◀────────────┘
│ (Small, Fast,  │
│  Efficient)    │
└────────────────┘
Build-Up - 7 Steps
1
Foundation: What is model pruning?
🤔
Concept: Pruning means cutting out parts of a model that are not very important.
Neural networks have many connections (weights). Some of these connections have very small values and do not affect the output much. Pruning removes these small connections to make the model smaller. For example, if a weight is close to zero, we can set it to zero and ignore it during calculations.
Result
The model becomes smaller and faster because it has fewer connections to process.
Understanding pruning shows how models can be simplified by focusing only on the important parts, which saves memory and speeds up computation.
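Magnitude pruning can be sketched in a few lines of NumPy. This is an illustrative sketch, not a production implementation; the function name and the 50% sparsity target are assumptions for the example:

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the given fraction of smallest-magnitude weights."""
    k = int(weights.size * sparsity)               # how many connections to remove
    if k == 0:
        return weights.copy()
    threshold = np.sort(np.abs(weights), axis=None)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0      # cut connections at or below threshold
    return pruned

w = np.array([0.9, -0.02, 0.5, 0.01, -0.7, 0.03])
pruned = magnitude_prune(w, sparsity=0.5)          # removes the 3 smallest weights
print(pruned)                                      # [ 0.9  0.   0.5  0.  -0.7  0. ]
```

Real frameworks (for example PyTorch's `torch.nn.utils.prune`) apply the same idea per layer via masks, so the zeros can be skipped or compressed later.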
2
Foundation: What is quantization?
🤔
Concept: Quantization means using simpler numbers to represent the model's data.
Models usually use 32-bit floating-point numbers to store weights and activations. Quantization changes these to smaller types like 8-bit integers. This reduces the size of the model and speeds up calculations because simpler numbers need less memory and can be processed faster by hardware.
Result
The model uses less memory and runs faster, often with little loss in accuracy.
Knowing quantization helps you see how changing number formats can make AI models more efficient without retraining.
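The float-to-int8 mapping is usually done with a scale and a zero-point. A minimal NumPy sketch of this affine scheme (the function names are illustrative; real toolkits do this per tensor or per channel):

```python
import numpy as np

def quantize(x: np.ndarray, num_bits: int = 8):
    """Map float values onto unsigned integers via a scale and zero-point."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)    # float step per integer level
    zero_point = round(qmin - x.min() / scale)     # integer that represents 0.0
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate float values from the integers."""
    return scale * (q.astype(np.float32) - zero_point)

x = np.array([-1.0, 0.0, 0.5, 1.0], dtype=np.float32)
q, scale, zp = quantize(x)
x_hat = dequantize(q, scale, zp)
print(q)       # uint8 codes
print(x_hat)   # close to x, off by at most about one scale step
```

The reconstruction error per value is bounded by roughly one scale step, which is why accuracy usually drops only slightly.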
3
Intermediate: How pruning affects model accuracy
🤔 Before reading on: do you think pruning always reduces model accuracy significantly? Commit to your answer.
Concept: Pruning can reduce accuracy if too many important connections are removed, but careful pruning keeps accuracy high.
When pruning, we remove weights with small impact. If we prune too aggressively, the model loses important information and accuracy drops. Techniques like gradual pruning remove weights slowly and retrain the model to recover accuracy. This balance keeps the model efficient and accurate.
Result
Pruned models can be almost as accurate as original models if pruning is done carefully.
Understanding the trade-off between pruning amount and accuracy helps in applying pruning effectively without hurting model performance.
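Gradual pruning is often driven by a sparsity schedule that ramps up over training, with retraining between steps. A sketch of the commonly used polynomial schedule (the 80% target and cubic power are example values, not recommendations):

```python
def pruning_schedule(step: int, total_steps: int,
                     final_sparsity: float = 0.8, power: int = 3) -> float:
    """Polynomial schedule: sparsity rises quickly at first, then levels off."""
    progress = min(step / total_steps, 1.0)
    return final_sparsity * (1.0 - (1.0 - progress) ** power)

# Sparsity targets at a few points in training; retraining happens between prunes.
for step in [0, 25, 50, 75, 100]:
    print(f"step {step:3d}: prune to {pruning_schedule(step, 100):.0%} sparsity")
```

Pruning early and fast while the model can still recover, then slowing down near the target, is what lets heavily pruned models stay close to the original accuracy.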
4
Intermediate: Types of quantization methods
🤔 Before reading on: do you think quantization always requires retraining the model? Commit to your answer.
Concept: There are different ways to quantize models, some needing retraining and some not.
Post-training quantization converts weights after training without changing the model. Quantization-aware training simulates quantization during training, helping the model adapt and keep accuracy. Choosing the right method depends on the use case and accuracy needs.
Result
Quantization can be done quickly or with more effort to keep accuracy high.
Knowing quantization methods lets you pick the best approach for your project’s speed and accuracy goals.
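Post-training quantization typically needs a calibration pass: run a small set of representative inputs through the model and record activation ranges to pick the scale. A hedged sketch of min/max calibration, with synthetic data standing in for real activations:

```python
import numpy as np

def calibrate_scale(activation_batches, num_bits: int = 8):
    """Observe activation ranges over calibration data to pick scale/zero-point."""
    lo = min(float(batch.min()) for batch in activation_batches)
    hi = max(float(batch.max()) for batch in activation_batches)
    scale = (hi - lo) / (2 ** num_bits - 1)
    zero_point = round(-lo / scale)
    return scale, zero_point

# Fake calibration data standing in for activations from a few input images.
rng = np.random.default_rng(0)
batches = [rng.normal(0.0, 1.0, size=100).astype(np.float32) for _ in range(5)]
scale, zp = calibrate_scale(batches)
print(f"scale={scale:.4f}, zero_point={zp}")
```

Quantization-aware training goes further: it inserts simulated quantize/dequantize ops into the training graph so gradients flow through the rounding, which is why it preserves accuracy better when post-training calibration is not enough.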
5
Intermediate: Combining pruning and quantization
🤔 Before reading on: do you think pruning and quantization can be applied together without problems? Commit to your answer.
Concept: Pruning and quantization can be combined to make models even smaller and faster.
First, pruning removes unnecessary weights. Then quantization reduces number precision. Together, they multiply the benefits. However, combining them requires careful tuning to avoid accuracy loss. Sometimes retraining after both steps helps the model adjust.
Result
Models become very efficient while maintaining good accuracy.
Understanding how these techniques interact helps in building highly optimized models for real devices.
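The prune-then-quantize order can be sketched end to end. Note the deliberate choice of symmetric quantization here: it maps 0.0 exactly to integer 0, so pruned weights stay exactly zero after quantization. The function and parameter names are illustrative:

```python
import numpy as np

def prune_then_quantize(w: np.ndarray, sparsity: float = 0.5, num_bits: int = 8):
    """Prune the smallest weights to zero, then quantize survivors to int8."""
    # Step 1: magnitude pruning.
    k = int(w.size * sparsity)
    threshold = np.sort(np.abs(w), axis=None)[k - 1] if k > 0 else -np.inf
    w = np.where(np.abs(w) <= threshold, 0.0, w)
    # Step 2: symmetric quantization, so 0.0 maps exactly to integer 0.
    scale = np.abs(w).max() / (2 ** (num_bits - 1) - 1)
    q = np.round(w / scale).astype(np.int8)
    return q, scale

rng = np.random.default_rng(1)
w = rng.normal(size=(4, 4)).astype(np.float32)
q, scale = prune_then_quantize(w)
print(q)   # int8 weights; at least half of them are exactly zero
```

An asymmetric scheme would map 0.0 to a nonzero code, destroying the sparsity pattern, which is one concrete example of why combining the two techniques needs careful tuning.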
6
Advanced: Hardware impact of optimization
🤔 Before reading on: do you think all hardware benefits equally from pruning and quantization? Commit to your answer.
Concept: Different devices handle pruning and quantization differently, affecting speed and power use.
Some hardware like CPUs and GPUs support 8-bit math natively, so quantization speeds up inference a lot. Pruning helps more on devices that can skip zero weights efficiently. Specialized AI chips may have unique support for these optimizations. Knowing hardware details guides which optimization to prioritize.
Result
Optimizations can lead to big speed and power gains on some devices but less on others.
Knowing hardware effects prevents wasted effort on optimizations that don’t help your target device.
7
Expert: Surprising effects of aggressive pruning
🤔 Before reading on: do you think pruning more always makes the model worse? Commit to your answer.
Concept: Sometimes pruning a lot can help the model by forcing it to focus on stronger features.
In some cases, heavy pruning acts like regularization, reducing overfitting and improving generalization. This means the model performs better on new data. However, this effect depends on the model and data. Experts use pruning schedules and retraining to exploit this.
Result
Aggressive pruning can sometimes improve accuracy, not just reduce size.
Understanding this counterintuitive effect helps experts tune pruning for both efficiency and better model quality.
Under the Hood
Pruning works by setting small weights to zero, effectively removing connections in the neural network graph. This reduces the number of multiplications during inference. Quantization changes the data type of weights and activations from high-precision floats to lower-precision integers, which reduces memory bandwidth and allows faster integer arithmetic on hardware. Both methods rely on the model's tolerance to small changes in parameters and computations.
Why designed this way?
Models were originally designed for accuracy without concern for size or speed. As AI moved to devices with limited resources, pruning and quantization were developed to reduce model demands. Pruning leverages the observation that many weights have little impact, while quantization exploits hardware efficiency with simpler number formats. Alternatives like model redesign or distillation exist, but pruning and quantization are direct, effective, and widely supported.
┌──────────────────┐
│ Neural Network   │
│ Weights (float32)│
└────────┬─────────┘
         │ Pruning: zero small weights
         ▼
┌──────────────────┐
│ Sparse Network   │
│ (many zeros)     │
└────────┬─────────┘
         │ Quantization: convert float32 to int8
         ▼
┌──────────────────┐
│ Optimized Model  │
│ (smaller, fast)  │
└──────────────────┘
Myth Busters - 4 Common Misconceptions
Quick: Does pruning always cause a big drop in model accuracy? Commit to yes or no.
Common Belief: Pruning always makes the model much less accurate because it removes important parts.
Reality: Pruning removes mostly unimportant weights, so with careful pruning and retraining, accuracy stays almost the same.
Why it matters: Believing pruning ruins accuracy may stop people from using it, missing out on big efficiency gains.
Quick: Is quantization just about changing number types without any effect on model behavior? Commit to yes or no.
Common Belief: Quantization is a simple number-format change that does not affect model predictions.
Reality: Quantization changes how numbers are stored and computed, which can slightly change model outputs and sometimes reduce accuracy.
Why it matters: Ignoring quantization effects can lead to unexpected accuracy drops in deployed models.
Quick: Can pruning and quantization be applied in any order without impact? Commit to yes or no.
Common Belief: The order of pruning and quantization does not matter for model performance.
Reality: The order and method of applying pruning and quantization affect final accuracy and efficiency; pruning first and then quantizing usually works best.
Why it matters: The wrong order can cause a bigger accuracy loss or a less efficient model.
Quick: Does pruning always reduce model size on disk? Commit to yes or no.
Common Belief: Pruning always makes the saved model file smaller.
Reality: Pruning creates sparse models, but without a sparse storage format the saved file may not shrink much, because the zeros are still stored explicitly.
Why it matters: Expecting smaller files without proper sparse storage leads to confusion and wasted storage.
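This myth is easy to demonstrate: saving a 90%-sparse matrix densely stores every zero, while storing only the nonzero values plus their indices shrinks the file. A NumPy sketch (the helper name and sparsity level are assumptions for the example):

```python
import io
import numpy as np

# A 90%-sparse weight matrix, as might result from aggressive pruning.
rng = np.random.default_rng(0)
w = rng.normal(size=(100, 100)).astype(np.float32)
w[rng.random(w.shape) < 0.9] = 0.0

def saved_bytes(**arrays) -> int:
    """Serialize arrays with np.savez and return the resulting size in bytes."""
    buf = io.BytesIO()
    np.savez(buf, **arrays)
    return buf.getbuffer().nbytes

dense_size = saved_bytes(w=w)                       # every zero stored explicitly
idx = np.nonzero(w)
sparse_size = saved_bytes(values=w[idx].astype(np.float32),
                          rows=idx[0].astype(np.int32),
                          cols=idx[1].astype(np.int32))
print(dense_size, sparse_size)   # the sparse form is much smaller at 90% sparsity
```

Formats like compressed sparse row (e.g. `scipy.sparse.csr_matrix`) apply the same idea more efficiently in practice.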
Expert Zone
1
Pruning schedules that gradually increase pruning amount during training yield better accuracy than one-shot pruning.
2
Quantization-aware training simulates low-precision math during training, allowing the model to adapt and maintain accuracy better than post-training quantization.
3
Hardware support varies widely; some accelerators benefit more from pruning sparsity, others from quantization, so optimization must be hardware-aware.
When NOT to use
Avoid pruning and quantization when maximum accuracy is critical and model size or speed is not a concern, such as in research or offline analysis. Instead, use full precision models or explore model architecture improvements. For very small models, pruning may have little effect, and quantization can cause unacceptable accuracy loss.
Production Patterns
In production, models are often pruned during training with gradual schedules, then quantized using quantization-aware training. Deployment pipelines include calibration steps to tune quantization parameters. Some systems use hardware-specific libraries to exploit sparsity and low-precision math for maximum speed and energy efficiency.
Connections
Data Compression
Model optimization is similar to data compression, both reduce size while preserving essential information.
Understanding compression techniques helps grasp how pruning and quantization remove redundancy and simplify data without losing meaning.
Human Memory Efficiency
Like the brain forgetting unimportant details to save space, pruning removes less useful connections to keep the model efficient.
Knowing how humans optimize memory storage gives intuition about why pruning works well in neural networks.
Digital Signal Processing (DSP)
Quantization in models is related to quantization in DSP, where signals are approximated with fewer bits to save bandwidth.
Familiarity with DSP quantization helps understand the trade-offs between precision and resource use in AI models.
Common Pitfalls
#1 Pruning too much at once without retraining.
Wrong approach: Remove 50% of weights in one step and use the model immediately without retraining.
Correct approach: Gradually prune weights over multiple steps, retraining after each step to recover accuracy.
Root cause: The belief that pruning is a one-time cut ignores the model's need to adapt to the change.
#2 Applying quantization without calibration or awareness.
Wrong approach: Convert weights to int8 directly after training without any calibration or retraining.
Correct approach: Use quantization-aware training or post-training calibration to adapt the model to low precision.
Root cause: Assuming quantization is a simple data-type change with no impact on model behavior.
#3 Ignoring hardware capabilities when optimizing.
Wrong approach: Apply pruning and quantization blindly, without considering what the target device supports.
Correct approach: Analyze the hardware's features and tailor optimization methods accordingly for best performance.
Root cause: Overlooking hardware differences leads to suboptimal or ineffective optimizations.
Key Takeaways
Model optimization uses pruning and quantization to make AI models smaller and faster while keeping accuracy high.
Pruning removes unimportant connections gradually, often with retraining to maintain performance.
Quantization changes number precision to reduce memory and speed up computation, sometimes requiring model adaptation.
Combining pruning and quantization can greatly improve efficiency but needs careful tuning and hardware awareness.
Understanding these techniques is essential for deploying AI on real-world devices with limited resources.