TensorFlow · ML · ~15 mins

L1 and L2 regularization in TensorFlow - Deep Dive

Overview - L1 and L2 regularization
What is it?
L1 and L2 regularization are techniques used to make machine learning models simpler and better at predicting new data. They add a small penalty to the model's complexity during training, which helps prevent the model from memorizing the training data too closely. L1 regularization encourages the model to use fewer features by making some weights exactly zero, while L2 regularization makes weights smaller but rarely zero. Both help the model generalize better to unseen data.
Why it matters
Without regularization, models often learn too much from the training data, including noise and random details, causing poor performance on new data. This is called overfitting. L1 and L2 regularization help avoid overfitting by keeping the model simpler and more focused on important patterns. This leads to more reliable predictions in real-world applications like image recognition, speech processing, or medical diagnosis.
Where it fits
Before learning regularization, you should understand basic machine learning concepts like models, training, loss functions, and overfitting. After mastering L1 and L2 regularization, you can explore advanced topics like dropout, batch normalization, and other regularization methods to improve model robustness.
Mental Model
Core Idea
L1 and L2 regularization add a penalty to large model weights during training to keep the model simple and improve its ability to predict new data.
Think of it like...
Imagine packing a suitcase for a trip. L1 regularization is like only taking the most essential items, leaving out everything else, while L2 regularization is like packing everything but making sure nothing is too bulky or heavy.
Model weights
  ↓
┌───────────────┐
│   Training    │
│   Process     │
└───────────────┘
    ↓       ↓
L1 penalty  L2 penalty
  ↓           ↓
Sparse weights  Small weights
  ↓           ↓
Simpler model  Simpler model
Build-Up - 7 Steps
1
Foundation: Understanding model weights and overfitting
🤔
Concept: Learn what model weights are and why overfitting happens.
In machine learning, a model learns by adjusting numbers called weights. These weights decide how much each input feature affects the output. Overfitting happens when the model learns the training data too well, including noise and random details, making it bad at predicting new data.
Result
You understand that weights control model behavior and that overfitting means the model is too closely tied to training data.
Knowing that weights control predictions helps you see why controlling their size can improve model generalization.
2
Foundation: What is regularization in simple terms
🤔
Concept: Introduce the idea of adding a penalty to model complexity to prevent overfitting.
Regularization adds a small cost to the training process based on how complex the model is. This cost discourages the model from having very large weights, which often cause overfitting. Think of it as a rule that says 'keep your weights small or pay a penalty.'
Result
You grasp that regularization helps keep the model simpler by limiting weight sizes.
Understanding regularization as a penalty clarifies why it helps models avoid memorizing noise.
3
Intermediate: L1 regularization and sparsity
🤔 Before reading on: do you think L1 regularization makes weights smaller or sets some weights exactly to zero? Commit to your answer.
Concept: L1 regularization adds the sum of absolute values of weights as a penalty, encouraging many weights to become zero.
L1 regularization adds the sum of the absolute values of all weights multiplied by a small factor to the loss function. This pushes many weights to zero, effectively removing some features from the model. This is called sparsity and helps the model focus on the most important inputs.
Result
The model ends up with fewer active features, making it simpler and easier to understand.
Knowing that L1 creates sparsity helps you understand how it can perform feature selection automatically.
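To make the penalty arithmetic concrete, here is a small sketch using Keras' built-in L1 regularizer on a hypothetical weight vector (the values are chosen only to illustrate the calculation):

```python
import tensorflow as tf

# Hypothetical weights, chosen only to illustrate the arithmetic.
w = tf.constant([0.5, -0.2, 0.0, 1.3])

l1 = tf.keras.regularizers.L1(0.01)   # strength lambda = 0.01
penalty = float(l1(w))                # 0.01 * (0.5 + 0.2 + 0.0 + 1.3)
print(penalty)  # ≈ 0.02
```

During training this value is added to the loss, so larger absolute weights cost more.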
4
Intermediate: L2 regularization and weight shrinkage
🤔 Before reading on: does L2 regularization set weights to zero or just make them smaller? Commit to your answer.
Concept: L2 regularization adds the sum of squared weights as a penalty, making weights smaller but rarely zero.
L2 regularization adds the sum of the squares of all weights multiplied by a small factor to the loss function. This penalty encourages the model to keep weights small but does not force them to zero. It spreads the importance across many features, reducing the chance of overfitting.
Result
Weights become smaller and more balanced, leading to a smoother model.
Understanding that L2 shrinks weights evenly explains why it helps models generalize without losing features.
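The same kind of sketch works for L2, using Keras' built-in L2 regularizer on a hypothetical weight vector (note that large weights are penalized much more heavily because the values are squared):

```python
import tensorflow as tf

# Hypothetical weights for illustration.
w = tf.constant([0.5, -0.2, 0.0, 1.3])

l2 = tf.keras.regularizers.L2(0.01)   # strength lambda = 0.01
penalty = float(l2(w))                # 0.01 * (0.25 + 0.04 + 0.0 + 1.69)
print(penalty)  # ≈ 0.0198
```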
5
Intermediate: Implementing L1 and L2 in TensorFlow
🤔
Concept: Learn how to add L1 and L2 regularization to a TensorFlow model.
In TensorFlow, you can add L1 or L2 regularization by passing tf.keras.regularizers.L1 or tf.keras.regularizers.L2 as a layer's kernel_regularizer. For example:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu',
                          kernel_regularizer=tf.keras.regularizers.L1(0.01)),
    tf.keras.layers.Dense(10, activation='softmax')
])

This adds L1 regularization with strength 0.01 to the first layer's weights.
Result
Your model now penalizes large weights during training, helping prevent overfitting.
Knowing how to apply regularization in code bridges theory and practice, enabling better model training.
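One way to confirm the penalty is actually attached: Keras collects each layer's regularization term in model.losses, and model.fit adds their sum to the task loss automatically. A minimal sketch (the input shape and layer sizes are placeholders):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation='relu',
                          kernel_regularizer=tf.keras.regularizers.L1(0.01)),
    tf.keras.layers.Dense(10, activation='softmax'),
])

# Keras collects each layer's regularization penalty in model.losses;
# model.fit adds their sum to the task loss automatically.
n_penalties = len(model.losses)
print(n_penalties)  # 1: the L1 penalty from the first Dense layer
```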
6
Advanced: Choosing between L1, L2, or both
🤔 Before reading on: do you think combining L1 and L2 regularization is beneficial or confusing? Commit to your answer.
Concept: Explore when to use L1, L2, or a combination called Elastic Net regularization.
L1 is good for feature selection because it creates sparse models. L2 is better for keeping all features but controlling their size. Sometimes, combining both (Elastic Net) gives the best of both worlds: sparsity and smoothness. The choice depends on your data and goals.
Result
You can tailor regularization to your problem, improving model performance and interpretability.
Understanding the strengths of each regularization type helps you design better models for different tasks.
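Keras ships Elastic Net as the built-in L1L2 regularizer, which applies both penalties at once. A short sketch (the strengths below are placeholders to be tuned for your data):

```python
import tensorflow as tf

# Elastic Net in Keras: the L1L2 regularizer applies both penalties.
# The strengths are placeholders; tune them for your data.
reg = tf.keras.regularizers.L1L2(l1=0.01, l2=0.01)

# Attach it to a layer exactly like L1 or L2 alone:
layer = tf.keras.layers.Dense(64, activation='relu', kernel_regularizer=reg)

# On a weight tensor the penalty is l1*sum(|w|) + l2*sum(w^2):
w = tf.constant([1.0, -2.0])
penalty = float(reg(w))
print(penalty)  # 0.01*3 + 0.01*5 = 0.08
```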
7
Expert: Regularization impact on optimization and training dynamics
🤔 Before reading on: does adding regularization always slow down training or can it sometimes help? Commit to your answer.
Concept: Learn how L1 and L2 regularization affect the training process and optimization landscape.
Regularization changes the loss surface by adding penalties, which can make optimization harder or easier. L2 regularization adds smooth quadratic penalties, which often help gradient descent converge faster and avoid sharp minima. L1 regularization adds sharp corners due to absolute values, which can make optimization trickier but leads to sparse solutions. Understanding these effects helps in tuning learning rates and choosing optimizers.
Result
You gain insight into how regularization influences training speed and model quality.
Knowing regularization's effect on optimization helps prevent training issues and improves model tuning.
Under the Hood
L1 regularization adds the sum of absolute values of weights to the loss, creating a penalty that grows linearly with weight size. This penalty encourages weights to become exactly zero because its gradient has constant magnitude away from zero, so even small weights keep being pushed toward zero until they reach it. L2 regularization adds the sum of squared weights to the loss, creating a smooth quadratic penalty. Its gradient grows linearly with weight size, gently pulling weights toward zero but rarely making them exactly zero. Both penalties modify the loss function, changing the gradients used in training to control weight sizes.
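The gradient behavior described above can be checked directly with a GradientTape on two hypothetical weights (the values and the strength 0.01 are illustrative only):

```python
import tensorflow as tf

w = tf.Variable([0.5, -2.0])

with tf.GradientTape(persistent=True) as tape:
    l1_pen = 0.01 * tf.reduce_sum(tf.abs(w))     # L1: lambda * sum |w|
    l2_pen = 0.01 * tf.reduce_sum(tf.square(w))  # L2: lambda * sum w^2

g1 = tape.gradient(l1_pen, w).numpy()  # lambda * sign(w): constant magnitude
g2 = tape.gradient(l2_pen, w).numpy()  # 2 * lambda * w: proportional to w
print(g1)  # [ 0.01 -0.01]
print(g2)  # [ 0.01 -0.04]
```

The L1 gradient has the same magnitude for the small weight and the large one, while the L2 gradient pulls the large weight four times harder than the small one.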
Why designed this way?
L1 and L2 regularization were designed to solve overfitting by controlling model complexity. L2 was inspired by ridge regression, which stabilizes solutions by shrinking weights. L1 was introduced to encourage sparsity and feature selection, which was not possible with L2 alone. The choice of absolute vs squared penalties reflects a tradeoff between sparsity and smoothness. These methods were preferred over more complex alternatives for their mathematical simplicity and effectiveness.
Training Data
   ↓
Model with Weights
   ↓
Loss Function + Regularization Penalty
   ↓
Gradient Calculation
   ↓
Weight Update

Regularization Penalty:
┌───────────────┐
│ L1: sum |w|   │
│ L2: sum w²    │
└───────────────┘

Effect:
L1 → Sparse weights (many zeros)
L2 → Small weights (smooth shrinkage)
Myth Busters - 4 Common Misconceptions
Quick: Does L2 regularization set weights exactly to zero? Commit to yes or no.
Common Belief: L2 regularization makes some weights exactly zero, just like L1.
Reality: L2 regularization shrinks weights toward zero but rarely makes them exactly zero; only L1 can create exact zeros.
Why it matters: Believing L2 creates sparsity can lead to wrong expectations about feature selection and model simplicity.
Quick: Is regularization only useful for very large models? Commit to yes or no.
Common Belief: Regularization is only needed for very complex or large models.
Reality: Regularization can help even small models by preventing overfitting, especially with noisy data or limited samples.
Why it matters: Ignoring regularization in small models can cause unexpected poor generalization.
Quick: Does increasing regularization strength always improve model accuracy? Commit to yes or no.
Common Belief: Stronger regularization always makes the model better.
Reality: Too much regularization can underfit the model, making it too simple and hurting accuracy.
Why it matters: Over-regularizing wastes model capacity and reduces performance.
Quick: Does adding L1 or L2 regularization change the model architecture? Commit to yes or no.
Common Belief: Regularization changes the model's structure or number of layers.
Reality: Regularization only changes the training loss, not the model's architecture or layers.
Why it matters: Confusing regularization with architecture changes can lead to wrong debugging and design choices.
Expert Zone
1
L1 regularization can cause instability in training because its gradient is not smooth at zero, requiring careful tuning of optimizers and learning rates.
2
L2 regularization acts like a Gaussian prior on weights in Bayesian terms, linking regularization to probabilistic interpretations.
3
Elastic Net regularization balances L1 and L2 penalties, which can improve performance on correlated features where pure L1 or L2 may fail.
When NOT to use
Avoid L1 regularization when you need smooth gradients for stable training or when feature selection is not desired. Avoid L2 if you want sparse models or interpretability through zero weights. Alternatives include dropout for regularization without weight penalties, or Bayesian methods for uncertainty estimation.
Production Patterns
In production, L2 regularization is commonly used as a default to improve generalization with minimal tuning. L1 is used when feature selection or model interpretability is important. Elastic Net is popular in domains like genomics or finance where correlated features exist. Regularization strengths are often tuned via cross-validation or automated hyperparameter search.
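A minimal sketch of the tuning loop described above: train the same architecture with several L2 strengths and keep the one with the best validation loss. The synthetic data, architecture, and candidate strengths are all placeholders; in practice you would use your real training and validation sets and a proper search tool.

```python
import numpy as np
import tensorflow as tf

# Synthetic placeholder data; substitute your real train/validation sets.
rng = np.random.default_rng(0)
x = rng.random((128, 20)).astype('float32')
y = rng.random((128, 1)).astype('float32')

def build(l2_strength):
    return tf.keras.Sequential([
        tf.keras.Input(shape=(20,)),
        tf.keras.layers.Dense(64, activation='relu',
                              kernel_regularizer=tf.keras.regularizers.L2(l2_strength)),
        tf.keras.layers.Dense(1),
    ])

best = None
for strength in [1e-4, 1e-3, 1e-2]:
    model = build(strength)
    model.compile(optimizer='adam', loss='mse')
    history = model.fit(x[:96], y[:96], validation_data=(x[96:], y[96:]),
                        epochs=2, verbose=0)
    val = min(history.history['val_loss'])
    if best is None or val < best[0]:
        best = (val, strength)  # keep the best (val_loss, strength) pair

print('best strength:', best[1])
```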
Connections
Dropout
Alternative regularization method
Dropout randomly disables neurons during training to prevent co-adaptation, offering a different way to reduce overfitting compared to weight penalties.
Bayesian Inference
Mathematical interpretation
L2 regularization corresponds to assuming a Gaussian prior on weights, linking machine learning regularization to probability theory and uncertainty modeling.
Minimalism in Art
Shared principle of simplicity
Just as minimalism removes unnecessary elements to highlight core beauty, regularization removes or shrinks unnecessary model weights to highlight essential patterns.
Common Pitfalls
#1Applying very strong regularization without tuning.
Wrong approach:

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, kernel_regularizer=tf.keras.regularizers.L2(10.0)),
    tf.keras.layers.Dense(10)
])

Correct approach:

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, kernel_regularizer=tf.keras.regularizers.L2(0.01)),
    tf.keras.layers.Dense(10)
])
Root cause:Misunderstanding that regularization strength should be small and tuned; too large values cause underfitting.
#2Using L1 regularization expecting smooth training without adjusting optimizer.
Wrong approach:

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, kernel_regularizer=tf.keras.regularizers.L1(0.01)),
    tf.keras.layers.Dense(10)
])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)

Correct approach:

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, kernel_regularizer=tf.keras.regularizers.L1(0.01)),
    tf.keras.layers.Dense(10)
])
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
Root cause:Not adjusting optimizer and learning rate for L1's non-smooth gradients leads to unstable training.
#3Confusing regularization with dropout and applying both incorrectly.
Wrong approach:

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, kernel_regularizer=tf.keras.regularizers.L2(0.01)),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10)
])
model.compile(optimizer='adam', loss='mse')
# Using dropout during inference

Correct approach:

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, kernel_regularizer=tf.keras.regularizers.L2(0.01)),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10)
])
model.compile(optimizer='adam', loss='mse')
# Dropout is automatically disabled during inference by Keras
Root cause:Misunderstanding dropout behavior causes misuse during prediction, confusing regularization effects.
Key Takeaways
L1 and L2 regularization help models avoid overfitting by adding penalties to large weights during training.
L1 regularization creates sparse models by pushing some weights exactly to zero, enabling automatic feature selection.
L2 regularization shrinks weights smoothly, keeping all features but reducing their impact to improve generalization.
Choosing the right regularization type and strength is crucial and depends on the problem and data characteristics.
Understanding how regularization affects training dynamics and optimization helps in building better and more reliable models.