
Optimizers (SGD, Adam, RMSprop) in TensorFlow - Practice Problems & Coding Challenges

Challenge - 5 Problems
🧠 Conceptual
intermediate
Understanding the role of learning rate in optimizers

Which statement best describes the effect of a very high learning rate when using the Adam optimizer?

A. The model always converges but more slowly than with a low learning rate.
B. The model converges quickly to the best solution without overshooting.
C. The model ignores the learning rate and uses default values internally.
D. The model may fail to converge and the loss can oscillate or diverge.
💡 Hint

Think about what happens if you take too big steps when trying to find the lowest point on a hill.
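The "too big steps" picture in the hint can be sketched with plain gradient descent on f(x) = x**2. This is an illustration only: it uses a hand-written update rule rather than Adam, and the function name `gradient_descent` and the two learning rates are assumptions chosen for the demo. Adam rescales its steps, so the exact thresholds differ, but the overshooting behaviour is the same idea.

```python
def gradient_descent(learning_rate, start=1.0, steps=20):
    """Minimize f(x) = x**2 with plain gradient descent; return visited points."""
    x = start
    path = [x]
    for _ in range(steps):
        grad = 2.0 * x                 # derivative of x**2
        x = x - learning_rate * grad   # step against the gradient
        path.append(x)
    return path


small_steps = gradient_descent(learning_rate=0.1)
big_steps = gradient_descent(learning_rate=1.1)

print(abs(small_steps[-1]))  # shrinks toward the minimum at 0
print(abs(big_steps[-1]))    # grows: every step overshoots further
```

With the larger rate, each update jumps across the minimum and lands farther away on the other side, so the iterates alternate in sign and grow in magnitude.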

Predict Output
intermediate
Output of training loss with different optimizers

Given the following code snippet, which runs one training step of a simple model on dummy data with the RMSprop optimizer, what loss value is printed? Note that loss is computed inside the GradientTape block, before apply_gradients updates the weights.

TensorFlow
import tensorflow as tf
import numpy as np

x = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([[2.0], [4.0], [6.0], [8.0]])

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.01)

loss_fn = tf.keras.losses.MeanSquaredError()

with tf.GradientTape() as tape:
    predictions = model(x, training=True)
    loss = loss_fn(y, predictions)
grads = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))

print(round(float(loss), 3))
A. 15.0
B. 20.0
C. 10.0
D. 5.0
💡 Hint

The Dense layer's weights are initialized randomly and no seed is set, so the printed loss varies between runs; expect a value that is relatively high but not extremely large.
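To see how much the initial weights matter here, the sketch below computes the pre-update mean squared error of a one-weight linear model on the same dummy data for a few candidate starting weights. It assumes the bias starts at zero and that the kernel is drawn from a Glorot-uniform range for a 1-input, 1-output layer; both are assumptions about the layer's defaults, and `initial_mse` is a hypothetical helper, not part of TensorFlow.

```python
import math

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]


def initial_mse(w, b=0.0):
    """MSE of the linear model w * x + b on the dummy data above."""
    errors = [(w * xi + b - yi) ** 2 for xi, yi in zip(x, y)]
    return sum(errors) / len(errors)


# Assumption: Glorot-uniform kernel, drawn from [-limit, limit] with
# limit = sqrt(6 / (fan_in + fan_out)) and fan_in = fan_out = 1.
limit = math.sqrt(6.0 / (1 + 1))

for w in (-limit, 0.0, 1.0, limit):
    print(round(w, 3), round(initial_mse(w), 3))
```

Under these assumptions the loss spans a wide range depending on where the kernel starts, which is why a single fixed answer only holds if the initializer or seed is pinned.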

Model Choice
advanced
Choosing the best optimizer for sparse gradients

You are training a neural network with very sparse gradients (many zeros). Which optimizer is generally the best choice to handle sparse updates efficiently?

A. RMSprop optimizer
B. Adam optimizer
C. Stochastic Gradient Descent (SGD) without momentum
D. Batch Gradient Descent
💡 Hint

Consider which optimizer adapts learning rates per parameter and handles sparse gradients well.
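What "adapts learning rates per parameter" means can be sketched with the first step of an Adam-style update written in plain Python. This is a simplified illustration, not TensorFlow's implementation: `adam_first_step` is a hypothetical helper, and the hyperparameter values are assumptions chosen to mirror commonly used defaults.

```python
import math


def adam_first_step(grads, learning_rate=0.001, beta_1=0.9, beta_2=0.999,
                    epsilon=1e-7):
    """Return the per-parameter updates an Adam-style rule applies at step 1."""
    updates = []
    for g in grads:
        m = (1 - beta_1) * g        # first moment, starting from 0
        v = (1 - beta_2) * g * g    # second moment, starting from 0
        m_hat = m / (1 - beta_1)    # bias correction for step t = 1
        v_hat = v / (1 - beta_2)
        updates.append(-learning_rate * m_hat / (math.sqrt(v_hat) + epsilon))
    return updates


# One large, one tiny, and one zero gradient, as in a sparse update.
updates = adam_first_step([50.0, 0.02, 0.0])
print(updates)
```

The large and the tiny gradient produce steps of nearly the same size (about the learning rate), while the parameter with a zero gradient is left untouched on this step.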

Hyperparameter
advanced
Effect of momentum parameter in SGD

What is the effect of increasing the momentum parameter in SGD optimizer during training?

A. It can accelerate training by smoothing updates and helping the optimizer move past shallow local minima.
B. It decreases the learning rate automatically over time.
C. It slows down training by reducing step size.
D. It causes the optimizer to ignore gradients and update randomly.
💡 Hint

Think about how momentum in physics helps keep an object moving smoothly.
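The physics analogy can be made concrete with a small pure-Python sketch of SGD with momentum applied to a constant gradient. It assumes the update form velocity = momentum * velocity - learning_rate * gradient, and `run_momentum_sgd` is a hypothetical helper written for this illustration.

```python
def run_momentum_sgd(momentum, learning_rate=0.1, gradient=1.0, steps=200):
    """Apply SGD with momentum to a constant gradient; return the final velocity."""
    velocity = 0.0
    for _ in range(steps):
        # Keep a fraction of the previous step and add the new gradient step.
        velocity = momentum * velocity - learning_rate * gradient
    return velocity


for momentum in (0.0, 0.5, 0.9):
    print(momentum, round(run_momentum_sgd(momentum), 4))
```

For a steady gradient the velocity approaches -learning_rate * gradient / (1 - momentum), so a higher momentum value builds up a larger effective step in a consistent direction.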

🔧 Debug
expert
Identifying the cause of exploding gradients with Adam optimizer

Consider this training loop using Adam optimizer. The loss suddenly becomes NaN after several epochs. What is the most likely cause?

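import tensorflow as tf

# x and y are assumed to be float arrays like those in the earlier snippet.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(1),
])
optimizer = tf.keras.optimizers.Adam(learning_rate=1.0)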

for epoch in range(10):
    with tf.GradientTape() as tape:
        predictions = model(x)
        loss = tf.reduce_mean(tf.square(y - predictions))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    print(f"Epoch {epoch} Loss: {loss.numpy()}")
A. The learning rate is too high causing unstable updates and exploding gradients.
B. The model architecture is incorrect and causes NaN values.
C. The loss function is incompatible with Adam optimizer.
D. The input data x contains NaN values causing loss to become NaN.
💡 Hint

Check the learning rate value and its effect on training stability.

Practice

1. Which optimizer in TensorFlow uses momentum to accelerate gradient descent and reduce oscillations?
easy
A. SGD with momentum
B. Adam
C. RMSprop
D. Adagrad

Solution

  1. Step 1: Understand momentum in optimizers

    Momentum helps speed up SGD by accumulating past gradients to smooth updates.
  2. Step 2: Identify optimizer using momentum

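    SGD with momentum applies classical momentum directly to the gradient-descent update. Adam also keeps a momentum-like running average of gradients, but combines it with adaptive per-parameter scaling, and RMSprop is primarily an adaptive-learning-rate method (its momentum argument defaults to 0 in Keras).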
  3. Final Answer:

    SGD with momentum -> Option A
  4. Quick Check:

    Momentum = SGD with momentum [OK]
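Hint: Classical momentum is the defining feature of SGD with momentum; Adam and RMSprop are primarily adaptive-learning-rate methods [OK]
Common Mistakes:
  • Treating Adam's adaptive learning rates as the same thing as classical momentum
  • Assuming RMSprop applies momentum by default
  • Mixing up Adagrad with momentum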
2. Which of the following is the correct way to create an Adam optimizer in TensorFlow with a learning rate of 0.001?
easy
A. tf.optimizers.Adam(lr=0.001)
B. tf.AdamOptimizer(0.001)
C. tf.optimizers.Adam(learning_rate=0.001)
D. tf.optimizers.AdamOptimizer(learning_rate=0.001)

Solution

  1. Step 1: Recall TensorFlow 2.x optimizer syntax

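    In TensorFlow 2.x, optimizers are classes under tf.keras.optimizers (also exposed as tf.optimizers), and the learning rate is passed as the learning_rate argument.
  2. Step 2: Check correct Adam optimizer syntax

    The correct call is tf.optimizers.Adam(learning_rate=0.001). Option A uses the older lr keyword, and options B and D use AdamOptimizer names that are not part of the TF 2.x API.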
  3. Final Answer:

    tf.optimizers.Adam(learning_rate=0.001) -> Option C
  4. Quick Check:

    Correct syntax = tf.optimizers.Adam(learning_rate=0.001) [OK]
Hint: Use tf.optimizers.Adam with named learning_rate [OK]
Common Mistakes:
  • Using a TF1.x-style AdamOptimizer name (in TF1.x the class was tf.train.AdamOptimizer)
  • Using the older lr keyword instead of learning_rate
  • Using non-existent tf.optimizers.AdamOptimizer
3. What will be the loss value, recomputed after one training step, when using the RMSprop optimizer with learning rate 0.01 on a simple linear model (one weight and one bias) trained on data x=[1,2], y=[2,4]? Assume the weight and bias are initially zero and the loss is mean squared error.
medium
A. 0.5
B. 9.5
C. 1.0
D. 4.0

Solution

  1. Step 1: Calculate initial prediction and loss

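    With zero weight and bias, both predictions are 0, so the initial loss is mean([2², 4²]) = mean([4, 16]) = 10.
  2. Step 2: Perform one RMSprop update step

    RMSprop divides each gradient g by the root of a running average of g². Assuming a zero-initialized average and rho = 0.9, the first-step average is 0.1·g², whose square root is about 0.32·|g|, so each parameter moves by about 0.01 / 0.32 ≈ 0.032 against the sign of its gradient. The gradients are [-10, -6] for [w, b], so both parameters become about +0.032. The new predictions are about [0.063, 0.095], and the new loss is mean([1.937², 3.905²]) ≈ 9.5.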
  3. Final Answer:

    9.5 -> Option B
  4. Quick Check:

    Loss after step ≈ 9.5 [OK]
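Hint: RMSprop's first step is small (about learning_rate / sqrt(0.1) per parameter), so the loss only drops to about 9.5 [OK]
Common Mistakes:
  • Expecting a sharp loss drop after one step
  • Assuming the step size is learning_rate × gradient, as in plain SGD
  • Ignoring that zero initial weights make every initial prediction 0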
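The arithmetic in this solution can be checked with a short pure-Python sketch of a single RMSprop-style step. It assumes rho = 0.9, a zero-initialized squared-gradient average, and a small epsilon added outside the square root; the helper names (`mse`, `mse_gradients`, `rmsprop_first_step`) are hypothetical and this is not TensorFlow's implementation.

```python
import math

x = [1.0, 2.0]
y = [2.0, 4.0]


def mse(w, b):
    """Mean squared error of the linear model w * x + b."""
    return sum((w * xi + b - yi) ** 2 for xi, yi in zip(x, y)) / len(x)


def mse_gradients(w, b):
    """Gradients of the MSE with respect to w and b."""
    n = len(x)
    grad_w = sum(2.0 * (w * xi + b - yi) * xi for xi, yi in zip(x, y)) / n
    grad_b = sum(2.0 * (w * xi + b - yi) for xi, yi in zip(x, y)) / n
    return grad_w, grad_b


def rmsprop_first_step(param, grad, learning_rate=0.01, rho=0.9, epsilon=1e-7):
    """One RMSprop-style update starting from a zero squared-gradient average."""
    avg_sq = (1.0 - rho) * grad * grad
    return param - learning_rate * grad / (math.sqrt(avg_sq) + epsilon)


w, b = 0.0, 0.0
loss_before = mse(w, b)
grad_w, grad_b = mse_gradients(w, b)
w = rmsprop_first_step(w, grad_w)
b = rmsprop_first_step(b, grad_b)
loss_after = mse(w, b)

print(loss_before, (grad_w, grad_b), round(loss_after, 2))
```

Both parameters move by roughly the same amount even though their gradients differ, because the update is normalized by each gradient's own magnitude.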
4. You wrote this code to use Adam optimizer but get an error:
optimizer = tf.optimizers.Adam(lr=0.01)
model.compile(optimizer=optimizer, loss='mse')

What is the likely cause of the error?
medium
A. Model.compile does not accept optimizer objects
B. Adam optimizer does not accept float arguments
C. Loss function 'mse' is invalid
D. Learning rate must be named as learning_rate=0.01

Solution

  1. Step 1: Check Adam optimizer argument requirements

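    Current Keras optimizers take the learning rate as 'learning_rate='; 'lr=' is an older Keras spelling.
  2. Step 2: Identify error cause in code

    Depending on the TensorFlow/Keras version, lr=0.01 is either accepted with a deprecation warning or rejected as an unrecognized keyword argument. The portable form is tf.optimizers.Adam(learning_rate=0.01).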
  3. Final Answer:

    Learning rate must be named as learning_rate=0.01 -> Option D
  4. Quick Check:

    Named argument needed [OK]
Hint: Pass the rate to Adam as learning_rate=, not lr= [OK]
Common Mistakes:
  • Using the older Keras keyword 'lr=0.01'
  • Assuming 'mse' is invalid loss
  • Thinking optimizer object can't be passed
5. You want to train a model on noisy data that changes over time. Which optimizer is best suited to adapt learning rates per parameter and handle this noise effectively?
hard
A. Adam
B. Gradient Descent with fixed learning rate
C. RMSprop
D. SGD without momentum

Solution

  1. Step 1: Understand optimizer strengths for noisy data

    Adam adapts learning rates per parameter and combines momentum and RMSprop ideas, handling noise well.
  2. Step 2: Compare with other optimizers

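    SGD without momentum and gradient descent with a fixed learning rate have no per-parameter adaptation, so they tend to struggle with noisy, shifting gradients. RMSprop adapts rates per parameter, and Adam adds a momentum-like average of past gradients on top, which usually gives steadier updates.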
  3. Final Answer:

    Adam -> Option A
  4. Quick Check:

    Best for noisy data = Adam [OK]
Hint: Adam adapts learning rates per parameter and smooths noisy gradients [OK]
Common Mistakes:
  • Choosing plain SGD for noisy data
  • Forgetting that Adam adds a momentum-like term on top of RMSprop-style scaling
  • Ignoring adaptive learning rate benefits
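How Adam "combines momentum and RMSprop ideas" can be sketched by tracking its two running averages on a noisy gradient stream. The example below is an illustration in plain Python, not TensorFlow code: `adam_moments` is a hypothetical helper, the beta values are assumptions mirroring commonly used defaults, and the "noise" is a deterministic alternating pattern so the result is reproducible.

```python
def adam_moments(gradients, beta_1=0.9, beta_2=0.999):
    """Return bias-corrected first and second moment estimates per step."""
    m, v = 0.0, 0.0
    history = []
    for t, g in enumerate(gradients, start=1):
        m = beta_1 * m + (1 - beta_1) * g        # momentum-like average of g
        v = beta_2 * v + (1 - beta_2) * g * g    # RMSprop-like average of g**2
        m_hat = m / (1 - beta_1 ** t)            # bias correction
        v_hat = v / (1 - beta_2 ** t)
        history.append((m_hat, v_hat))
    return history


# Stand-in for noisy gradients: true slope 1.0 plus alternating noise of +/-3.
noisy = [1.0 + (3.0 if t % 2 == 0 else -3.0) for t in range(200)]
m_hat, v_hat = adam_moments(noisy)[-1]
print(round(m_hat, 3), round(v_hat, 3))
```

Each raw gradient is off from the true slope by 3, while the first-moment estimate stays close to 1.0; the second-moment estimate tracks the average squared gradient, which is what sets the per-parameter step scale.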