LLM scaling laws show that performance improves as a power law in model size: gains are substantial, but they are neither linear nor logarithmic.
Training loss falls as a power law in compute, so each additional unit of compute buys a smaller reduction than the last, yet the improvement remains consistent.
For a fixed compute budget, LLM scaling laws suggest an optimal balance between model size and training steps that minimizes loss.
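The balance described above can be sketched numerically. This is a minimal illustration, not a fitted law: it assumes a loss surface of the form L(N, D) = E + A / N^alpha + B / D^beta, uses training tokens D as a stand-in for training steps, and approximates compute as 6 * N * D. The constants, the budget, and the helper name best_split are all hypothetical.

```python
# Assumed loss surface: L(N, D) = E + A / N**ALPHA + B / D**BETA,
# with compute approximated as 6 * N * D (N parameters, D training tokens).
# All constants are illustrative placeholders, not fitted values.
E, A, B, ALPHA, BETA = 1.7, 400.0, 400.0, 0.34, 0.28


def loss(n_params, n_tokens):
    return E + A / n_params ** ALPHA + B / n_tokens ** BETA


def best_split(compute, candidates):
    """Return (loss, N, D) with the lowest loss among the candidate sizes N."""
    results = []
    for n in candidates:
        d = compute / (6 * n)  # tokens affordable at this model size
        results.append((loss(n, d), n, d))
    return min(results)


compute = 1e21  # fixed FLOP budget (hypothetical)
candidates = [10 ** (k / 4) for k in range(28, 49)]  # 1e7 .. 1e12 parameters
best_loss, best_n, best_d = best_split(compute, candidates)
print(f"N = {best_n:.2e}, D = {best_d:.2e}, loss = {best_loss:.3f}")
```

Under these assumptions the minimum sits in the interior of the grid: a model that is too small or too large for the budget both end up with higher loss.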
Doubling parameters does not halve loss; improvements follow a power-law with diminishing returns, so this interpretation is incorrect.
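To make the point about doubling concrete, here is a minimal sketch assuming a loss curve of the form loss(N) = a * N^(-b) + c. The function name and the constants a = 10, b = 1/3, c = 0.1 are illustrative assumptions, not fitted values.

```python
# Hypothetical power-law loss curve: loss(N) = A * N**(-B) + C.
A, B, C = 10.0, 1 / 3, 0.1


def loss(n_params):
    return A * n_params ** (-B) + C


n = 1_000_000
before = loss(n)
after = loss(2 * n)

# Only the reducible part (loss - C) shrinks, and only by a factor of 2**(-B).
shrink = (after - C) / (before - C)

print(f"loss at N:  {before:.4f}")
print(f"loss at 2N: {after:.4f}")
print(f"reducible-loss ratio: {shrink:.4f}")
```

With b = 1/3 the ratio is 2^(-1/3), roughly 0.79, so doubling the parameter count removes about a fifth of the reducible loss rather than half of the total.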
Given loss = a * (N)^-b + c, where N is the number of parameters, a=10, b=1/3, and c=0.1, what is the training loss when N=1000000?

    a = 10
    b = 1/3
    c = 0.1
    N = 1000000
    loss = a * (N)**(-b) + c
    print(round(loss, 4))
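1000000 = 10^6, so (10^6)^{-1/3} = 10^{-6/3} = 10^{-2} = 0.01 exactly. Then 10 * 0.01 = 0.1, and adding c = 0.1 gives loss = 0.2. In floating point the unrounded value may differ from 0.2 in its last digits, but round(loss, 4) returns 0.2, so the code prints 0.2 (not 0.2000, since print does not pad a float with trailing zeros).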
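As a check on the arithmetic above, the following minimal sketch reruns the computation and compares the result against 0.2 with a tolerance. The tolerance value is an arbitrary choice for illustration.

```python
import math

a, b, c = 10, 1 / 3, 0.1
N = 1_000_000

loss = a * N ** (-b) + c

# 1/3 is not exactly representable in binary floating point, so the raw
# result may differ from 0.2 in its last digits; compare with a tolerance.
print(repr(loss))
print(math.isclose(loss, 0.2, rel_tol=1e-9))
print(round(loss, 4))
```

If fixed four-decimal output is wanted, a format string such as f"{loss:.4f}" produces it; round alone does not.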
