PyTorch · ML · ~15 mins

Activation functions (ReLU, Sigmoid, Softmax) in PyTorch - Deep Dive

Overview - Activation functions (ReLU, Sigmoid, Softmax)
What is it?
Activation functions are simple mathematical formulas used inside neural networks to decide if a neuron should be activated or not. They help the network learn complex patterns by adding non-linearity. Common activation functions include ReLU, Sigmoid, and Softmax, each serving different purposes in the network.
Why it matters
Without activation functions, neural networks would behave like simple linear models, unable to solve complex problems like recognizing images or understanding speech. Activation functions allow networks to learn and represent complicated relationships in data, making AI useful in real life.
Where it fits
Before learning activation functions, you should understand basic neural networks and how neurons connect. After this, you can explore training techniques like backpropagation and optimization, which rely on activation functions to update the network.
Mental Model
Core Idea
Activation functions decide how much signal passes through a neuron, enabling neural networks to learn complex, non-linear patterns.
Think of it like...
Activation functions are like gates on a water pipe: they control how much water (signal) flows through, sometimes letting it all pass, sometimes only a little, or distributing it among many pipes.
Input Signal
   │
   ▼
[Activation Function]
   │
   ▼
Output Signal

Common types:
 ┌───────┐   ┌─────────┐   ┌─────────┐
 │ ReLU  │   │ Sigmoid │   │ Softmax │
 └───────┘   └─────────┘   └─────────┘
Build-Up - 7 Steps
1
Foundation · What is an Activation Function?
🤔
Concept: Introduce the basic idea of activation functions as simple formulas that transform neuron outputs.
In a neural network, each neuron calculates a number from inputs and weights. The activation function takes this number and changes it to decide if the neuron should 'fire' or not. This helps the network learn more than just straight lines.
Result
You understand that activation functions add decision-making power to neurons beyond simple sums.
Knowing that activation functions control neuron output is key to understanding how networks learn complex patterns.
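To make this concrete, here is a minimal sketch of a single neuron; the inputs, weights, and bias are made-up illustration values:

```python
import torch

# A single neuron: weighted sum of inputs, then an activation.
# Inputs, weights, and bias are made-up illustration values.
x = torch.tensor([0.5, -1.0, 2.0])   # inputs
w = torch.tensor([0.8, 0.3, -0.5])   # weights
b = 0.1                              # bias

z = torch.dot(w, x) + b   # raw weighted sum: can be any real number
a = torch.relu(z)         # the activation decides how much signal passes

print(z.item())  # approximately -0.8 (raw sum is negative)
print(a.item())  # 0.0 (ReLU blocks it: the neuron does not fire)
```

The raw sum can take any value; the activation function is what turns it into a "fire or don't fire" decision.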
2
Foundation · Why Non-Linearity Matters
🤔
Concept: Explain why activation functions must be non-linear to solve complex problems.
If activation functions were just straight lines, stacking layers would still be like one big line. Non-linear functions let networks combine simple pieces into complex shapes, like curves or corners, to recognize patterns like faces or voices.
Result
You see why simple linear functions are not enough and why activation functions must bend or change the signal.
Understanding non-linearity explains why activation functions are essential for neural networks to be powerful.
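A quick sketch of this collapse (layer sizes are arbitrary illustration values): two linear layers with no activation between them compute exactly the same function as one combined linear layer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two stacked linear layers with NO activation in between...
l1 = nn.Linear(3, 4, bias=False)
l2 = nn.Linear(4, 2, bias=False)

# ...are equivalent to a single linear layer whose weight is W2 @ W1.
combined_weight = l2.weight @ l1.weight

x = torch.randn(3)
out_stacked = l2(l1(x))
out_single = combined_weight @ x

# Identical up to float rounding: the extra depth added no expressive power.
print(torch.allclose(out_stacked, out_single, atol=1e-6))  # True
```

Inserting a non-linearity such as ReLU between the two layers is what breaks this equivalence and lets depth pay off.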
3
Intermediate · ReLU: The Simple Gate
🤔 Before reading on: do you think ReLU outputs negative values or blocks them? Commit to your answer.
Concept: Introduce ReLU (Rectified Linear Unit) which passes positive signals and blocks negatives.
ReLU outputs the input if it is positive, otherwise it outputs zero. This means it lets positive signals pass and stops negative ones. It is simple and fast, making it popular in many networks. PyTorch example:

import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, 0.0, 3.0])
relu_output = F.relu(x)
print(relu_output)  # tensor([0., 0., 3.])
Result
ReLU turns negative inputs into zero and keeps positive inputs unchanged.
Knowing ReLU blocks negatives helps understand why it speeds up learning and avoids some problems like vanishing gradients.
4
Intermediate · Sigmoid: Smooth Probability Output
🤔 Before reading on: does Sigmoid output values between 0 and 1, or can it output negative numbers? Commit to your answer.
Concept: Explain Sigmoid function that squashes inputs into a smooth curve between 0 and 1, useful for probabilities.
Sigmoid turns any input number into a value between 0 and 1, like a smooth step. This is useful when you want to predict probabilities, such as whether an image contains a cat. PyTorch example:

import torch

x = torch.tensor([-2.0, 0.0, 3.0])
sigmoid_output = torch.sigmoid(x)
print(sigmoid_output)  # tensor([0.1192, 0.5000, 0.9526])
Result
Sigmoid outputs smooth values between 0 and 1, representing probabilities.
Understanding Sigmoid helps grasp how networks can output probabilities for binary decisions.
5
Intermediate · Softmax: Choosing Among Many
🤔 Before reading on: does Softmax output independent probabilities or probabilities that sum to 1? Commit to your answer.
Concept: Introduce Softmax function that converts a list of numbers into probabilities that add up to 1, useful for multi-class classification.
Softmax takes a list of numbers and turns them into probabilities that sum to 1. This helps the network pick one class out of many, like recognizing whether a picture is a cat, dog, or bird. PyTorch example:

import torch
import torch.nn.functional as F

x = torch.tensor([2.0, 1.0, 0.1])
softmax_output = F.softmax(x, dim=0)
print(softmax_output)  # tensor([0.6590, 0.2424, 0.0986])
Result
Softmax outputs probabilities for each class that add up to 1.
Knowing Softmax outputs a probability distribution is key for understanding multi-class predictions.
6
Advanced · Activation Functions and Backpropagation
🤔 Before reading on: do you think activation functions affect how neural networks learn via gradients? Commit to your answer.
Concept: Explain how activation functions influence gradient flow during training and why some functions cause problems.
During training, networks adjust weights using gradients of the loss, and activation functions shape these gradients. For example, Sigmoid can cause gradients to vanish (become very small), slowing learning; ReLU avoids this by having a gradient of 1 for positive inputs, speeding up training. PyTorch snippet showing the gradient:

import torch

x = torch.tensor([-1.0, 0.0, 1.0], requires_grad=True)
y = torch.relu(x)
y.sum().backward()
print(x.grad)  # tensor([0., 0., 1.])
Result
Activation functions shape how gradients flow, impacting training speed and success.
Understanding gradient behavior explains why some activations are preferred in deep networks.
7
Expert · Softmax Numerical Stability Tricks
🤔 Before reading on: do you think directly computing Softmax on large inputs is safe or can cause errors? Commit to your answer.
Concept: Reveal how computing Softmax directly can cause numerical errors and how subtracting the max input stabilizes it.
Softmax uses exponentials, which can overflow with large inputs. To avoid this, subtract the maximum input value from all inputs before exponentiating; this does not change the result but prevents overflow. PyTorch's built-in softmax already applies this trick internally, but any manual implementation must do it explicitly. PyTorch stable Softmax example:

import torch

x = torch.tensor([1000.0, 1001.0, 1002.0])
stable_x = x - x.max()  # shift so the largest value is 0
softmax_output = torch.exp(stable_x) / torch.exp(stable_x).sum()
print(softmax_output)  # tensor([0.0900, 0.2447, 0.6652])
Result
Stable Softmax avoids overflow and produces correct probabilities even with large inputs.
Knowing numerical stability tricks prevents subtle bugs and crashes in real-world models.
Under the Hood
Activation functions transform the weighted sum of inputs inside each neuron. ReLU outputs zero for negatives and identity for positives, creating sparsity and avoiding gradient vanishing. Sigmoid uses the formula 1/(1+e^-x), smoothly mapping inputs to (0,1), but saturates at extremes causing small gradients. Softmax exponentiates inputs and normalizes them to sum to 1, creating a probability distribution. During backpropagation, derivatives of these functions determine how errors flow backward to update weights.
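The saturation effect described above can be checked directly with autograd: Sigmoid's derivative is s(x)·(1 − s(x)), which peaks at 0.25 for x = 0 and collapses toward zero for large |x|.

```python
import torch

# Sigmoid's gradient shrinks toward zero as inputs grow in magnitude.
for v in [0.0, 2.0, 10.0]:
    x = torch.tensor(v, requires_grad=True)
    torch.sigmoid(x).backward()   # d/dx sigmoid(x) = s(x) * (1 - s(x))
    print(v, x.grad.item())
# At 0 the gradient is 0.25; at 10 it is about 4.5e-5, nearly vanished.
```

This near-zero gradient at the extremes is exactly the saturation that makes deep Sigmoid stacks slow to train.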
Why designed this way?
Activation functions were designed to introduce non-linearity, enabling networks to learn complex patterns. ReLU was chosen for simplicity and efficiency, avoiding vanishing gradients common in Sigmoid and Tanh. Sigmoid was historically used for binary outputs due to its probabilistic interpretation. Softmax was created to handle multi-class outputs by converting raw scores into probabilities. Numerical stability techniques evolved to handle computational limits of exponentials.
Input Vector
   │
   ▼
[Weighted Sum]
   │
   ▼
┌───────────────┐
│ Activation Fn │
│ ┌───────────┐ │
│ │ ReLU      │ │
│ │ Sigmoid   │ │
│ │ Softmax   │ │
│ └───────────┘ │
└───────────────┘
   │
   ▼
Output Vector

Backpropagation:
Output Error
   │
   ▼
Derivative of Activation
   │
   ▼
Weight Updates
Myth Busters - 4 Common Misconceptions
Quick: Does ReLU output negative values for negative inputs? Commit to yes or no.
Common Belief: ReLU outputs negative values for negative inputs, just scaled down.
Reality: ReLU outputs zero for all negative inputs, completely blocking negative signals.
Why it matters: Believing ReLU outputs negative values leads to misunderstanding how it prevents negative activations and why it speeds up training.
Quick: Does Sigmoid output values outside 0 to 1? Commit to yes or no.
Common Belief: Sigmoid can output values less than 0 or greater than 1 depending on the input.
Reality: Sigmoid always outputs values strictly between 0 and 1, never outside this range.
Why it matters: Misunderstanding Sigmoid's range can cause errors when interpreting outputs as probabilities.
Quick: Does Softmax output independent probabilities for each class? Commit to yes or no.
Common Belief: Softmax outputs independent probabilities for each class without affecting the others.
Reality: Softmax outputs probabilities that sum to 1, so increasing one class's probability decreases the others.
Why it matters: Ignoring the sum-to-one constraint leads to wrong assumptions about class independence in multi-class problems.
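This coupling is easy to verify (the scores below are arbitrary illustration values):

```python
import torch
import torch.nn.functional as F

scores = torch.tensor([2.0, 1.0, 0.1])
p = F.softmax(scores, dim=0)
print(p.sum().item())  # ~1.0: the probabilities always sum to 1

# Raise only the first score: its probability grows, the others must shrink.
boosted = torch.tensor([4.0, 1.0, 0.1])
p2 = F.softmax(boosted, dim=0)
print(p)   # class 0 already holds most of the mass
print(p2)  # class 0 gains even more; classes 1 and 2 both drop
```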
Quick: Can you safely compute Softmax by directly exponentiating large input values? Commit to yes or no.
Common Belief: Directly exponentiating large inputs in a hand-written Softmax is safe and accurate.
Reality: Direct computation can overflow; subtracting the max input before exponentiation is necessary for stability.
Why it matters: Skipping stability tricks causes crashes or wrong outputs in real models.
Expert Zone
1
ReLU creates sparse activations which can improve efficiency but may cause 'dead neurons' that never activate.
2
Sigmoid's gradient saturation at extremes slows learning in deep networks, leading to preference for ReLU variants.
3
Softmax outputs are sensitive to input scale; temperature scaling can adjust confidence in predictions.
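As a sketch of the temperature idea (the logits are arbitrary illustration values): dividing logits by a temperature T before Softmax sharpens the distribution for T < 1 and flattens it for T > 1.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.1])

# T < 1 sharpens the distribution (more confident),
# T > 1 flattens it (less confident); T = 1 is plain Softmax.
for T in [0.5, 1.0, 2.0]:
    probs = F.softmax(logits / T, dim=0)
    print(T, probs, probs.max().item())
```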
When NOT to use
Avoid Sigmoid in hidden layers of deep networks due to vanishing gradients; prefer ReLU or its variants. Softmax is only suitable for final layers in classification tasks, not hidden layers. For regression tasks, use linear or no activation in output layers.
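These rules translate into output-layer choices like the following sketch (layer sizes are arbitrary illustration values; hidden layers keep ReLU):

```python
import torch
import torch.nn as nn

# Binary classification: Sigmoid on a single output unit.
binary_clf = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 1), nn.Sigmoid())

# Regression: no activation on the output layer.
regressor = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 1))

# Multi-class: output raw logits; nn.CrossEntropyLoss applies
# log-Softmax internally, so no Softmax layer is needed here.
multi_clf = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 3))

x = torch.randn(4, 10)
print(binary_clf(x).shape, regressor(x).shape, multi_clf(x).shape)
# torch.Size([4, 1]) torch.Size([4, 1]) torch.Size([4, 3])
```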
Production Patterns
In production, ReLU is the default hidden layer activation for speed and stability. Sigmoid is used in binary classification outputs. Softmax is standard for multi-class classification outputs. Numerical stability tricks like subtracting max logits before Softmax are always applied. Sometimes, leaky ReLU or parametric ReLU replace ReLU to fix dead neuron issues.
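For the dead-neuron fix mentioned above, Leaky ReLU is a drop-in replacement (the input values are arbitrary illustration values):

```python
import torch
import torch.nn as nn

x = torch.tensor([-2.0, 0.0, 3.0])

# Leaky ReLU passes a small fraction of negative inputs (slope 0.01 by
# default), so neurons stuck in the negative region still receive gradients.
leaky = nn.LeakyReLU(negative_slope=0.01)
print(leaky(x))  # tensor([-0.0200, 0.0000, 3.0000])
```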
Connections
Decision Trees
Activation functions create non-linear decision boundaries similar to how decision trees split data non-linearly.
Understanding activation functions helps see how neural networks can approximate complex splits like decision trees but in a smooth, differentiable way.
Probability Theory
Softmax outputs a probability distribution over classes, linking neural networks to probabilistic models.
Knowing Softmax connects neural networks to probability helps interpret outputs as confidence levels, useful in risk-sensitive applications.
Electrical Circuits
Activation functions act like electronic components controlling current flow, shaping signals through the network.
Seeing activations as signal gates clarifies how networks modulate information, similar to how circuits control electricity.
Common Pitfalls
#1: Using Sigmoid activation in all layers of a deep network.
Wrong approach:

import torch.nn as nn

model = nn.Sequential(
    nn.Linear(10, 20),
    nn.Sigmoid(),
    nn.Linear(20, 10),
    nn.Sigmoid()
)

Correct approach:

import torch.nn as nn

model = nn.Sequential(
    nn.Linear(10, 20),
    nn.ReLU(),
    nn.Linear(20, 10),
    nn.Sigmoid()  # only in the output layer
)

Root cause: Not realizing that Sigmoid saturates in hidden layers, causing vanishing gradients that slow or stop learning.
#2: Computing Softmax manually without a numerical stability adjustment on large inputs.
Wrong approach:

import torch

x = torch.tensor([1000.0, 1001.0, 1002.0])
softmax_output = torch.exp(x) / torch.exp(x).sum()
print(softmax_output)  # tensor([nan, nan, nan]): exp overflows to inf

Correct approach:

import torch

x = torch.tensor([1000.0, 1001.0, 1002.0])
stable_x = x - x.max()
softmax_output = torch.exp(stable_x) / torch.exp(stable_x).sum()
print(softmax_output)  # tensor([0.0900, 0.2447, 0.6652])

Root cause: Ignoring that exponentials of large numbers overflow to infinity, producing NaN outputs. (PyTorch's built-in F.softmax applies the max-subtraction trick internally, so this pitfall mainly bites hand-rolled implementations.)
#3: Expecting ReLU to output negative values for negative inputs.
Wrong approach:

import torch
import torch.nn.functional as F

x = torch.tensor([-3.0, -1.0, 0.0, 2.0])
output = F.relu(x)
print(output)  # expecting scaled-down negative values

Correct approach:

import torch
import torch.nn.functional as F

x = torch.tensor([-3.0, -1.0, 0.0, 2.0])
output = F.relu(x)
print(output)  # tensor([0., 0., 0., 2.])

Root cause: Misunderstanding ReLU's definition; it zeroes out negatives instead of scaling them.
Key Takeaways
Activation functions add non-linearity to neural networks, enabling them to learn complex patterns beyond simple lines.
ReLU is a fast, simple activation that blocks negative signals and helps avoid vanishing gradients, making it popular in hidden layers.
Sigmoid squashes inputs to values between 0 and 1, useful for binary probabilities but can slow learning in deep networks.
Softmax converts raw scores into probabilities that sum to one, essential for multi-class classification outputs.
Numerical stability tricks, like subtracting the max input before Softmax, are critical to prevent errors in real-world models.