The hidden state in an RNN acts like a memory that carries information from previous inputs, helping the model understand context when predicting the next character or word.
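To make the "memory" idea concrete, here is a minimal sketch that steps through a sequence by hand with `nn.RNNCell`. The sizes, the random input, and the zero initial state are illustrative assumptions, not values taken from the text.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Sizes are illustrative assumptions.
cell = nn.RNNCell(input_size=8, hidden_size=16)
sequence = torch.randn(10, 4, 8)   # (sequence_length, batch_size, input_size)
hidden = torch.zeros(4, 16)        # initial hidden state: no context yet

hidden_states = []
for x_t in sequence:
    # The new hidden state depends on the current input AND the previous state,
    # which is how information from earlier steps is carried forward.
    hidden = cell(x_t, hidden)
    hidden_states.append(hidden)

print(len(hidden_states), hidden.shape)
```

Each pass through the loop feeds the previous `hidden` back in, so the final state is a function of every input seen so far.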
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
inputs = torch.randn(4, 10, 8)
output, hidden = rnn(inputs)
print(output.shape)
The RNN output has shape (batch_size, sequence_length, hidden_size) when batch_first=True. Here, batch_size=4, sequence_length=10, and hidden_size=16, so the code prints torch.Size([4, 10, 16]).
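As a quick check of these shapes, the sketch below also inspects the second return value. It assumes the default single-layer, unidirectional `nn.RNN`, for which the final hidden state has shape (num_layers, batch_size, hidden_size) regardless of `batch_first`.

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
inputs = torch.randn(4, 10, 8)
output, hidden = rnn(inputs)

print(output.shape)  # torch.Size([4, 10, 16])
print(hidden.shape)  # torch.Size([1, 4, 16])

# For a single-layer, unidirectional RNN, the last time step of `output`
# holds the same values as the final hidden state.
same = torch.allclose(output[:, -1, :], hidden[0])
print(same)
```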
Moderate sequence lengths let the RNN learn useful context while avoiding the two main costs of very long sequences: high memory use during backpropagation through time, and vanishing gradients.
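One common way to keep sequences moderate is to cut a long token stream into fixed-length training windows. The helper below is a hypothetical sketch (the name `make_windows` and the window length of 10 are assumptions), pairing each window with a target shifted by one position for next-token prediction.

```python
def make_windows(tokens, seq_len):
    """Split a token list into (inputs, targets) pairs of length seq_len.

    Targets are the inputs shifted one position ahead.
    """
    pairs = []
    # Stop early enough that every window has a full shifted target.
    for start in range(0, len(tokens) - seq_len, seq_len):
        inputs = tokens[start:start + seq_len]
        targets = tokens[start + 1:start + seq_len + 1]
        pairs.append((inputs, targets))
    return pairs


pairs = make_windows(list(range(25)), seq_len=10)
for inputs, targets in pairs:
    print(inputs[0], "->", targets[0], "| window length:", len(inputs))
```

Trailing tokens that cannot fill a complete window are dropped in this sketch; a real pipeline might pad them instead.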
Lower perplexity means the model assigns higher probabilities to the correct next tokens, showing better prediction performance.
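Perplexity is commonly computed as the exponential of the average cross-entropy loss. The sketch below compares a model that guesses uniformly with one that favors the correct token. The vocabulary size, targets, and logits are made-up values for illustration.

```python
import torch
import torch.nn.functional as F

vocab_size = 5                          # hypothetical vocabulary size
targets = torch.tensor([0, 1, 2, 3])    # hypothetical correct next tokens


def perplexity(logits, targets):
    # exp(mean cross-entropy) over all predicted positions
    return torch.exp(F.cross_entropy(logits, targets)).item()


# Uniform guesser: every token gets the same score.
uniform_logits = torch.zeros(4, vocab_size)

# Confident model: the correct token gets a much higher score.
confident_logits = torch.zeros(4, vocab_size)
confident_logits[torch.arange(4), targets] = 5.0

ppl_uniform = perplexity(uniform_logits, targets)
ppl_confident = perplexity(confident_logits, targets)
print(ppl_uniform, ppl_confident)
```

The uniform guesser's perplexity equals the vocabulary size, while the confident model's perplexity approaches 1, the lowest possible value.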
optimizer = torch.optim.Adam(model.parameters(), lr=1.0)

for inputs, targets in dataloader:
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
    loss.backward()
    optimizer.step()
A very high learning rate such as 1.0 produces very large weight updates, which can make training diverge: gradients explode and the loss becomes NaN.
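A common mitigation, sketched here under assumptions, is to use a much smaller learning rate and clip the gradient norm before each update. The model, loss, data, the learning rate of 1e-3, and the clipping threshold of 1.0 are hypothetical stand-ins, not values from the text.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins for the model, loss, and data used in the training loop.
model = nn.Linear(8, 1)
loss_fn = nn.MSELoss()
dataloader = [(torch.randn(4, 8), torch.randn(4, 1)) for _ in range(5)]

# A learning rate far smaller than 1.0 (assumed value).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

losses = []
for inputs, targets in dataloader:
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    # Rescale gradients so their total norm is at most 1.0 before the update.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    losses.append(loss.item())

print(losses)
```

Clipping limits how far a single bad batch can move the weights, but it does not replace choosing a reasonable learning rate.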
