TensorFlow · ~15 mins

Activation functions (ReLU, sigmoid, softmax) in TensorFlow - Deep Dive

Overview - Activation functions (ReLU, sigmoid, softmax)
What is it?
Activation functions are simple math formulas used inside neural networks to decide if a neuron should be active or not. They help the network learn complex patterns by adding non-linearity, which means the network can understand more than just straight lines. Common activation functions include ReLU, sigmoid, and softmax, each with a special role in how the network processes information. Without them, neural networks would be limited and unable to solve many real-world problems.
Why it matters
Activation functions let neural networks learn and solve complex tasks like recognizing images, understanding speech, or translating languages. Without them, networks would only do simple math and fail to capture the rich patterns in data. This would make many AI applications impossible or very weak, limiting the impact of machine learning in everyday life.
Where it fits
Before learning activation functions, you should understand basic neural networks and how neurons connect and pass signals. After mastering activation functions, you can explore advanced network designs, training techniques, and optimization methods that rely on these functions to work well.
Mental Model
Core Idea
Activation functions decide how much a neuron’s signal should pass forward, enabling neural networks to learn complex, non-linear patterns.
Think of it like...
Activation functions are like gates in a water pipe system that control how much water flows through each pipe, deciding which paths get more flow and which get less or none.
Input Layer ──▶ [Neuron + Activation Function] ──▶ Output Layer

Activation Function:
  ┌─────────────┐
  │   Input x   │
  └─────┬───────┘
        │
        ▼
  ┌─────────────┐
  │  Activation │
  │   Function  │
  └─────┬───────┘
        │
        ▼
  ┌─────────────┐
  │ Output y =  │
  │ f(x)        │
  └─────────────┘
Build-Up - 7 Steps
1
Foundation · What is an Activation Function?
🤔
Concept: Activation functions transform a neuron's input into an output signal that can be passed to the next layer.
In a neural network, each neuron sums its inputs and then applies an activation function to decide its output. This output is what the next layer receives. Without activation functions, the network would just do simple addition and multiplication, which limits its ability to learn complex patterns.
Result
Neurons produce outputs that can represent complex features, not just simple sums.
Understanding that activation functions add non-linearity is key to why neural networks can solve complex problems.
2
Foundation · Why Non-Linearity Matters
🤔
Concept: Non-linearity allows neural networks to model complex relationships beyond straight lines.
If all neurons just added inputs without activation functions, the whole network would behave like a single linear function. This means it could only solve simple problems like straight-line separation. Activation functions introduce curves and bends in the decision boundary, enabling the network to learn complicated patterns.
Result
Networks can learn to recognize shapes, sounds, and other complex data.
Knowing that non-linearity is essential explains why activation functions are a must-have in neural networks.
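The collapse described above can be checked directly in a few lines of plain Python (the 2x2 weight matrices and the input vector are made up for illustration):

```python
# Two linear layers with no activation collapse into one linear layer.

def matvec(W, x):
    # Multiply matrix W (a list of rows) by vector x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

W1 = [[1.0, 2.0], [0.0, 1.0]]   # "layer 1" weights (illustrative)
W2 = [[1.0, -1.0], [2.0, 0.0]]  # "layer 2" weights (illustrative)
x = [3.0, 4.0]

# Applying the two layers in sequence...
two_layers = matvec(W2, matvec(W1, x))

# ...equals one layer whose weight matrix is the product W2 @ W1:
W_combined = [[sum(W2[i][k] * W1[k][j] for k in range(2)) for j in range(2)]
              for i in range(2)]
one_layer = matvec(W_combined, x)

print(two_layers == one_layer)  # True: depth without activations adds nothing
```

No matter how many linear layers you stack, the result is always a single matrix product, which is why an activation between layers is what actually buys the network extra expressive power.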
3
Intermediate · ReLU: The Simple and Popular Choice
🤔Before reading on: do you think ReLU outputs negative values or zeros for negative inputs? Commit to your answer.
Concept: ReLU (Rectified Linear Unit) outputs zero for negative inputs and the input itself if positive, making it simple and efficient.
ReLU(x) = max(0, x). It means if the input is negative, output zero; if positive, output the same value. This helps networks learn faster and reduces problems like vanishing gradients. ReLU is widely used in hidden layers of deep networks.
Result
Neurons become active only when input is positive, speeding up learning.
Understanding ReLU’s behavior helps explain why it is the default choice for many deep learning models.
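A minimal pure-Python sketch of ReLU's rule (the sample inputs are made up):

```python
# ReLU: zero for negative inputs, the input itself for positive inputs.
def relu(x):
    return max(0.0, x)

outputs = [relu(v) for v in [-2.0, -0.5, 0.0, 1.5, 3.0]]
print(outputs)  # [0.0, 0.0, 0.0, 1.5, 3.0]
```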
4
Intermediate · Sigmoid: Smooth Probability Output
🤔Before reading on: do you think sigmoid outputs values between -1 and 1, or 0 and 1? Commit to your answer.
Concept: Sigmoid squashes input values into a smooth curve between 0 and 1, useful for probabilities.
Sigmoid(x) = 1 / (1 + exp(-x)). It turns any input into a value between 0 and 1, which can be interpreted as a probability. This makes it useful for binary classification tasks where the output is yes/no or true/false.
Result
Outputs can be treated as probabilities, enabling decision-making.
Knowing sigmoid’s range clarifies why it’s used for outputs that represent chances or likelihoods.
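A minimal pure-Python sketch of sigmoid and its (0, 1) range (the sample inputs are made up):

```python
import math

# Sigmoid squashes any real input into the open interval (0, 1).
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

outputs = [sigmoid(v) for v in (-10.0, -1.0, 0.0, 1.0, 10.0)]
print(sigmoid(0.0))                         # 0.5, the midpoint
print(all(0.0 < y < 1.0 for y in outputs))  # True: never reaches 0 or 1
```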
5
Intermediate · Softmax: Multi-Class Probability Distribution
🤔Before reading on: does softmax output independent probabilities or a set that sums to 1? Commit to your answer.
Concept: Softmax converts a vector of numbers into probabilities that sum to 1, useful for multi-class classification.
Softmax(x_i) = exp(x_i) / sum(exp(x_j)) for all j. It turns raw scores into a probability distribution over classes. The highest score gets the highest probability, and all probabilities add up to 1. This helps the model pick one class among many.
Result
Model outputs clear probabilities for each class, enabling confident predictions.
Understanding softmax’s normalization explains how models handle multiple choices simultaneously.
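A minimal pure-Python sketch of softmax (the class scores are made up); subtracting the maximum before exponentiating is the standard numerical-stability trick and does not change the result:

```python
import math

# Softmax turns raw scores into a probability distribution over classes.
def softmax(scores):
    m = max(scores)                          # stability: avoid exp overflow
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])       # made-up class scores
print(probs.index(max(probs)))         # 0: highest score -> highest probability
print(abs(sum(probs) - 1.0) < 1e-9)    # True: probabilities sum to 1
```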
6
Advanced · Activation Functions in TensorFlow
🤔Before reading on: do you think TensorFlow requires manual implementation of ReLU, sigmoid, and softmax or provides built-in functions? Commit to your answer.
Concept: TensorFlow provides built-in, optimized activation functions for easy use in models.
TensorFlow has tf.nn.relu, tf.nn.sigmoid, and tf.nn.softmax functions. You can use them directly in your model layers. For example, tf.keras.layers.Dense(10, activation='relu') applies ReLU automatically. These built-ins are optimized for speed and stability.
Result
You can quickly build models with reliable activation functions without extra code.
Knowing TensorFlow’s built-ins saves time and avoids common implementation errors.
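A short sketch of the built-ins mentioned above (assumes TensorFlow 2.x; the tensor values and layer sizes are made up for illustration):

```python
import tensorflow as tf

# Built-in activations applied directly to a tensor:
x = tf.constant([-2.0, 0.0, 3.0])
print(tf.nn.relu(x).numpy())     # negatives clipped to zero
print(tf.nn.sigmoid(x).numpy())  # each value squashed into (0, 1)
print(tf.nn.softmax(x).numpy())  # non-negative values summing to 1

# The same activations passed by name to Keras layers:
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])
```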
7
Expert · Why ReLU Can Cause Dead Neurons
🤔Before reading on: do you think ReLU neurons can stop learning forever? Commit to your answer.
Concept: ReLU can output zero for all inputs if weights push values negative, causing 'dead' neurons that never activate.
If a neuron’s input is always negative, ReLU outputs zero, and gradients become zero during training. This means the neuron stops updating and effectively dies. Techniques like Leaky ReLU or careful initialization help avoid this problem.
Result
Some neurons may become inactive and reduce model capacity if not handled properly.
Understanding dead neurons helps in designing robust networks and choosing better activation variants.
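The dying-ReLU mechanism can be illustrated with the gradients alone (pure-Python sketch; the pre-activation values are made up):

```python
# Gradients of ReLU vs. Leaky ReLU for negative inputs.
def relu_grad(x):
    # d/dx max(0, x): 1 when x > 0, else 0
    return 1.0 if x > 0 else 0.0

def leaky_relu_grad(x, alpha=0.1):
    # Leaky ReLU keeps a small slope alpha for x <= 0
    return 1.0 if x > 0 else alpha

# A neuron whose pre-activation is always negative gets zero gradient
# from ReLU on every example, so its weights never update:
inputs = [-3.0, -1.5, -0.2]
relu_grads = [relu_grad(v) for v in inputs]
leaky_grads = [leaky_relu_grad(v) for v in inputs]
print(relu_grads)   # [0.0, 0.0, 0.0] -> no learning signal at all
print(leaky_grads)  # [0.1, 0.1, 0.1] -> small but alive
```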
Under the Hood
Activation functions apply mathematical transformations to neuron inputs during the forward pass. During backpropagation, their derivatives control how errors flow backward to update weights. For example, ReLU’s derivative is 1 for positive inputs and 0 for negative ones, which gates gradient flow. Sigmoid’s derivative is expressed in terms of its own output, s(x)(1 − s(x)), and shrinks toward zero for inputs of large magnitude, leading to vanishing gradients. Softmax normalizes outputs into probabilities, and its gradient involves the Jacobian matrix, making it suitable for multi-class loss calculations.
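The saturation effect described above can be checked numerically (pure-Python sketch of sigmoid and its derivative):

```python
import math

# Sigmoid and its derivative, written in terms of the output: s * (1 - s).
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# The gradient peaks at x = 0 and vanishes for large-magnitude inputs:
print(sigmoid_grad(0.0))          # 0.25, the maximum possible value
print(sigmoid_grad(10.0) < 1e-4)  # True: saturated, almost no gradient flows
```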
Why designed this way?
Activation functions were designed to introduce non-linearity so networks can learn complex patterns. Early functions like sigmoid were inspired by the firing behavior of biological neurons. ReLU was introduced later for efficiency and to reduce vanishing gradients. Softmax was created to handle multi-class outputs by turning raw scores into probabilities. Other candidates fell out of favor due to slow training, gradient issues, or lack of interpretability.
Input x ──▶ [Neuron Sum] ──▶ Activation Function ──▶ Output y

Backpropagation:
Output Gradient ──▶ Activation Derivative ──▶ Weight Updates

Activation Functions:
  ReLU:    f(x) = max(0, x)
  Sigmoid: f(x) = 1 / (1 + e^(-x))
  Softmax: f(x_i) = exp(x_i) / sum_j exp(x_j)
Myth Busters - 4 Common Misconceptions
Quick: Does ReLU output negative values for negative inputs? Commit yes or no.
Common Belief: ReLU outputs negative values just like the input if the input is negative.
Reality: ReLU outputs zero for any negative input, never negative values.
Why it matters: Believing ReLU outputs negatives can lead to wrong assumptions about neuron activation and gradient flow, causing confusion when debugging.
Quick: Does sigmoid output values between -1 and 1? Commit yes or no.
Common Belief: Sigmoid outputs values ranging from -1 to 1.
Reality: Sigmoid outputs values strictly between 0 and 1.
Why it matters: Misunderstanding sigmoid’s range can cause errors in interpreting outputs as probabilities or in designing network architectures.
Quick: Does softmax output independent probabilities for each class? Commit yes or no.
Common Belief: Softmax outputs independent probabilities for each class without affecting others.
Reality: Softmax outputs probabilities that sum to 1, so increasing one class’s probability decreases the others.
Why it matters: Ignoring softmax’s normalization can lead to incorrect loss calculations and poor model training.
Quick: Can ReLU neurons stop learning forever? Commit yes or no.
Common Belief: ReLU neurons always learn and never stop updating.
Reality: ReLU neurons can die if they output zero for all inputs, causing zero gradients and no learning.
Why it matters: Not knowing about dead neurons can cause unexplained model performance drops and training failures.
Expert Zone
1
ReLU’s zero output for negatives speeds up training but can cause dead neurons, so variants like Leaky ReLU or Parametric ReLU are often preferred in practice.
2
Sigmoid’s output saturates at extremes, causing vanishing gradients; this is why it’s mostly used only in output layers for binary classification, not hidden layers.
3
Softmax’s gradient involves a Jacobian matrix, which makes its backpropagation more complex but essential for multi-class classification with cross-entropy loss.
When NOT to use
Avoid sigmoid in hidden layers of deep networks due to vanishing gradients; prefer ReLU or its variants. Softmax is only suitable for multi-class outputs, not for regression or binary tasks. When ReLU causes dead neurons, use Leaky ReLU or ELU instead.
Production Patterns
In production, models often use ReLU or its variants in hidden layers for efficiency and stability. Sigmoid is reserved for binary classification outputs, while softmax is standard for multi-class outputs. Careful initialization and batch normalization are combined with activations to improve training robustness.
Connections
Biological Neurons
Activation functions mimic the firing behavior of biological neurons, deciding when to pass signals.
Understanding biological neurons helps appreciate why activation functions introduce thresholds and non-linearity in artificial networks.
Probability Theory
Sigmoid and softmax functions output probabilities, linking neural network outputs to probabilistic interpretations.
Knowing probability theory clarifies why these activations are used for classification and how outputs can be interpreted as confidence scores.
Signal Processing
Activation functions act like filters shaping signals passing through layers, similar to filters in signal processing.
Recognizing this connection helps understand how activations transform data representations step-by-step.
Common Pitfalls
#1: Using sigmoid activation in all hidden layers, causing slow training.
Wrong approach:
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='sigmoid'),
    tf.keras.layers.Dense(64, activation='sigmoid'),
    tf.keras.layers.Dense(10, activation='softmax')
])
Correct approach:
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])
Root cause: Not realizing that sigmoid saturates in deep stacks of layers, causing vanishing gradients that slow or stop learning.
#2: Applying softmax activation to a single output neuron for binary classification.
Wrong approach:
model = tf.keras.Sequential([
    tf.keras.layers.Dense(1, activation='softmax')
])
Correct approach:
model = tf.keras.Sequential([
    tf.keras.layers.Dense(1, activation='sigmoid')
])
Root cause: Confusing softmax’s multi-class use with binary classification, where sigmoid is appropriate. Softmax over a single neuron always outputs 1.0, so the model can never discriminate between the two classes.
#3: Ignoring dead neurons caused by ReLU, leading to poor model performance.
Wrong approach:
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])  # No measures to prevent dead neurons
Correct approach:
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64),
    tf.keras.layers.LeakyReLU(alpha=0.1),
    tf.keras.layers.Dense(64),
    tf.keras.layers.LeakyReLU(alpha=0.1),
    tf.keras.layers.Dense(10, activation='softmax')
])
Root cause: Not knowing ReLU can cause neurons to stop updating if their inputs are always negative; LeakyReLU keeps a small gradient alive for negative inputs.
Key Takeaways
Activation functions add essential non-linearity to neural networks, enabling them to learn complex patterns beyond simple math.
ReLU is the most popular activation for hidden layers due to its simplicity and efficiency but can cause dead neurons if not managed.
Sigmoid outputs values between 0 and 1, making it ideal for binary classification outputs but problematic in deep hidden layers.
Softmax converts raw scores into a probability distribution over multiple classes, crucial for multi-class classification tasks.
Choosing the right activation function and understanding its behavior is key to building effective and efficient neural networks.