Which statement best describes the effect of a very high learning rate when using the Adam optimizer?
Think about what happens when you take steps that are too large while trying to find the lowest point on a hill.
A very high learning rate causes the optimizer to take large steps that overshoot the minimum; the loss then oscillates or even increases, and training fails to converge.
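A toy sketch of the effect, using plain gradient descent on f(x) = x² (Adam's adaptive scaling tempers this, but the overshoot mechanism is the same):

```python
# Minimize f(x) = x^2, whose gradient is 2x, with two different step sizes.
def descend(lr, steps=20, x=3.0):
    history = [x]
    for _ in range(steps):
        x = x - lr * 2 * x   # plain gradient-descent update
        history.append(x)
    return history

small = descend(lr=0.1)   # converges smoothly toward the minimum at 0
large = descend(lr=1.1)   # every step overshoots 0, so |x| grows each iteration
print(abs(small[-1]), abs(large[-1]))
```

With lr = 0.1 each update shrinks |x| by a factor of 0.8; with lr = 1.1 each update multiplies it by 1.2, so the iterate diverges instead of converging.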
Given the following code snippet training a simple model on dummy data, what loss value will be printed after one training step with the RMSprop optimizer?
import tensorflow as tf
import numpy as np

x = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([[2.0], [4.0], [6.0], [8.0]])

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.01)
loss_fn = tf.keras.losses.MeanSquaredError()

with tf.GradientTape() as tape:
    predictions = model(x, training=True)
    loss = loss_fn(y, predictions)
grads = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
print(round(float(loss), 3))
Initial weights are random, so loss will be relatively high but not extremely large.
The printed value is the mean squared error between the true values and the predictions from randomly initialized weights, computed before the update is applied. Because the initialization is random, it varies from run to run; in this setup it is typically on the order of 10.0.
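To see where a value of that size comes from, here is the same MSE computed by hand for a hypothetical initial weight w = 0.9 with bias 0 (Keras draws the weight randomly, so the actual initial loss will differ):

```python
# MSE for the quiz's dummy data with a hypothetical initial weight w = 0.9, bias 0.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]   # the true relationship is y = 2x
w = 0.9

preds = [w * xi for xi in x]
loss = sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(x)
print(round(loss, 3))  # 9.075
```

The errors are 1.1, 2.2, 3.3, 4.4, so the mean of their squares is 36.3 / 4 = 9.075, i.e. on the order of 10 as the answer states.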
You are training a neural network with very sparse gradients (many zeros). Which optimizer is generally the best choice to handle sparse updates efficiently?
Consider which optimizer adapts learning rates per parameter and handles sparse gradients well.
Adam adapts the learning rate for each parameter individually and handles sparse gradients well, making it a good choice in such cases.
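Adam inherits its per-parameter scaling from Adagrad/RMSprop. A minimal Adagrad-style accumulator (a simplified stand-in, not Adam's full update) shows why a rarely updated parameter keeps larger effective steps than a densely updated one:

```python
import math

lr, eps = 0.1, 1e-8
cache = [0.0, 0.0]   # running sum of squared gradients, one entry per parameter
steps = [[], []]     # effective step size taken at each update, per parameter

for t in range(100):
    # Parameter 0 receives a gradient every step; parameter 1 only every
    # 10th step, mimicking a sparse feature.
    grads = [1.0, 1.0 if t % 10 == 0 else 0.0]
    for i, g in enumerate(grads):
        if g != 0.0:
            cache[i] += g * g
            steps[i].append(lr * g / (math.sqrt(cache[i]) + eps))

avg_dense = sum(steps[0]) / len(steps[0])
avg_sparse = sum(steps[1]) / len(steps[1])
print(avg_dense, avg_sparse)
```

Because the sparse parameter accumulates far less squared gradient, its divisor stays small and its average per-update step remains much larger, so infrequent features still learn at a useful rate.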
What is the effect of increasing the momentum parameter in SGD optimizer during training?
Think about how momentum in physics helps keep an object moving smoothly.
Momentum in SGD helps accumulate past gradients to smooth updates, speeding up convergence and helping escape shallow local minima.
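A small sketch on an ill-conditioned quadratic (one shallow direction, one steep) illustrates the speed-up; the curvatures, learning rate, and step count are illustrative choices:

```python
def run(momentum, lr=0.018, steps=200):
    # Minimize f(x, y) = 0.5 * (x**2 + 50 * y**2); the gradient is (x, 50*y).
    pos = [1.0, 1.0]
    vel = [0.0, 0.0]
    curv = [1.0, 50.0]
    for _ in range(steps):
        for i in range(2):
            # Classical momentum: accumulate a velocity, then move along it.
            vel[i] = momentum * vel[i] - lr * curv[i] * pos[i]
            pos[i] += vel[i]
    return (pos[0] ** 2 + pos[1] ** 2) ** 0.5  # distance from the minimum

plain = run(momentum=0.0)
heavy = run(momentum=0.9)
print(plain, heavy)
```

The learning rate is capped by the steep direction, so plain SGD crawls along the shallow one; the accumulated velocity lets the momentum run finish far closer to the minimum in the same number of steps.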
Consider this training loop using Adam optimizer. The loss suddenly becomes NaN after several epochs. What is the most likely cause?
model = tf.keras.Sequential([tf.keras.layers.Dense(10, activation='relu'), tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.Adam(learning_rate=1.0)
for epoch in range(10):
    with tf.GradientTape() as tape:
        predictions = model(x)
        loss = tf.reduce_mean(tf.square(y - predictions))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    print(f"Epoch {epoch} Loss: {loss.numpy()}")

Check the learning rate value and its effect on training stability.
A learning rate of 1.0 is very high for Adam and can cause the model weights to update too aggressively, leading to exploding gradients and NaN loss.
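The overflow-to-NaN chain can be reproduced with plain Python floats. This sketch uses vanilla gradient descent on a single weight rather than Adam (Adam rescales its steps, but a divergent run ends in the same inf and NaN arithmetic):

```python
# Fit y = 2x from the single point (x, y) = (3.0, 6.0) with loss (w*x - y)^2.
# The stable learning-rate limit here is 2 / (2 * x^2) = 1/9, so lr = 1.0 diverges.
x_val, y_val = 3.0, 6.0
w, lr = 0.0, 1.0
loss = None
for _ in range(300):
    grad = 2 * x_val * (w * x_val - y_val)  # d/dw of (w*x - y)^2
    w -= lr * grad                          # oversized step: |w| grows every iteration
    err = w * x_val - y_val
    loss = err * err

print(w, loss)  # |w| explodes, overflows to inf, and inf - inf yields NaN
```

Each update multiplies the error by -17, so after a few hundred steps the weight overflows to infinity; the next update subtracts infinity from infinity and the loss is NaN from then on, exactly the symptom described in the question.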