Complete the code to calculate the total number of parameters in a transformer model.

total_params = num_layers * num_heads * [1]

Explanation: The total number of parameters depends on the hidden size as well as the number of layers and heads.
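A minimal runnable sketch of this item, assuming (per the explanation) that the blank is `hidden_size`; the layer, head, and hidden-size values are made up for illustration. Note that this is the quiz's simplified formula, not an exact transformer parameter count (a real count also includes embedding and MLP weights).

```python
# Illustrative values (assumptions, roughly GPT-2-small-shaped)
num_layers = 12
num_heads = 12
hidden_size = 768

# Quiz formula with the blank filled as hidden_size
total_params = num_layers * num_heads * hidden_size
print(total_params)  # 110592
```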
Complete the code to compute how the loss decays over training steps.

loss = initial_loss * (step) [1] decay_rate

Explanation: Loss typically decays as a power law, so [1] is the power operator '**' (with decay_rate negative so that the loss actually decreases as step grows).
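A short sketch with the blank filled as '**', per the explanation; the initial loss and (negative) decay rate are made-up illustrative values.

```python
initial_loss = 4.0
decay_rate = -0.5  # negative exponent so the power law decays

# Loss at a few training steps; each value is smaller than the last
losses = [initial_loss * (step) ** decay_rate for step in (1, 100, 10000)]
print(losses)  # [4.0, 0.4, 0.04]
```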
Fix the error in the code to calculate the optimal model size given a compute budget.

optimal_size = compute_budget [1] (training_steps * batch_size)

Explanation: Optimal size is the compute budget divided by the total number of training tokens (steps * batch size), so [1] is '/'.
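A runnable sketch with the blank filled as '/', per the explanation; the budget, step count, and batch size are illustrative assumptions.

```python
compute_budget = 10**12   # made-up compute budget
training_steps = 10_000
batch_size = 512

# Budget divided by total training tokens (steps * batch size)
optimal_size = compute_budget / (training_steps * batch_size)
print(optimal_size)  # 195312.5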
Fill both blanks to create a dictionary of loss values over epochs where loss decreases.

losses = {epoch: initial_loss * (decay_rate [1] epoch) for epoch in range(1, num_epochs + 1) if epoch [2] max_epoch}

Explanation: Loss decreases exponentially with epochs, so [1] is '**' (with decay_rate below 1 so each epoch's loss is smaller), and we include epochs up to max_epoch, so [2] is '<='.
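A sketch with both blanks filled per the explanation ('**' and '<='); the loss, decay rate, and epoch counts are illustrative assumptions.

```python
initial_loss = 2.0
decay_rate = 0.9   # below 1, so decay_rate ** epoch shrinks each epoch
num_epochs = 5
max_epoch = 3      # filter keeps only epochs 1..3

losses = {epoch: initial_loss * (decay_rate ** epoch)
          for epoch in range(1, num_epochs + 1) if epoch <= max_epoch}
print(losses)
```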
Fill all three blanks to create a dictionary mapping model sizes to their expected loss if loss decreases with size.

expected_losses = {size: base_loss [1] (size [2] scale_factor) [3] 2 for size in model_sizes if size [2] scale_factor}

Explanation: Loss decreases with size: base_loss / ((size / scale_factor) ** 2), so [1] is '/', [2] is '/', and [3] is '**'. Note that the filter reuses [2], so 'if size / scale_factor' simply excludes sizes whose ratio to scale_factor is zero.
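A sketch with all three blanks filled per the explanation ('/', '/', '**'); the base loss, scale factor, and size list are illustrative assumptions. A size of 0 is included to show that the reused '/' in the filter only excludes sizes with a zero ratio.

```python
base_loss = 4.0
scale_factor = 100.0
model_sizes = [0, 100, 200, 400]  # 0 is filtered out (ratio is falsy)

# The filter is evaluated before the value expression, so size == 0
# never reaches the division in the value and cannot divide by zero.
expected_losses = {size: base_loss / (size / scale_factor) ** 2
                   for size in model_sizes if size / scale_factor}
print(expected_losses)  # {100: 4.0, 200: 1.0, 400: 0.25}
```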