When studying how large language models (LLMs) improve as they get bigger, the key metric is the loss, specifically the cross-entropy loss. This loss measures how well the model predicts the next token (roughly, the next word): lower loss means better predictions.
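To make this concrete, here is a minimal sketch of how cross-entropy is computed for a single next-token prediction. The tiny vocabulary, probabilities, and target token are made up purely for illustration.

```python
import numpy as np

# Toy illustration (hypothetical numbers): cross-entropy for one next-token prediction.
# The model outputs a probability distribution over the vocabulary; the loss is the
# negative log-probability it assigned to the token that actually came next.
vocab = ["the", "cat", "sat", "mat"]          # tiny hypothetical vocabulary
probs = np.array([0.10, 0.20, 0.65, 0.05])    # model's predicted distribution
target = vocab.index("sat")                   # the token that actually appears next

loss = -np.log(probs[target])                 # cross-entropy for this single prediction
print(f"cross-entropy loss: {loss:.3f} nats") # ~0.431; a perfect prediction would give 0
```

In practice this is averaged over every token in the training or evaluation data, which is the single number that scaling laws track.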
We focus on loss because scaling laws show that it drops smoothly and predictably as model size, training data, and compute increase. This lets us estimate how much larger a model, or how much more training data, is needed to reach a target level of performance.
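One widely used parametric form for such a law (as in Hoffmann et al.'s "Chinchilla" analysis) expresses loss as a function of parameter count N and training tokens D. The sketch below uses constants roughly in the range reported there, but they should be read as illustrative assumptions rather than values fitted to any real training run.

```python
# A sketch of a common scaling-law form: L(N, D) = E + A / N**alpha + B / D**beta,
# where N is the parameter count and D is the number of training tokens.
# The constants below are illustrative assumptions, not results from a real fit.
E, A, B = 1.69, 406.4, 410.7       # assumed irreducible loss and scale constants
alpha, beta = 0.34, 0.28           # assumed exponents

def predicted_loss(n_params: float, n_tokens: float) -> float:
    """Predicted cross-entropy loss for a model with n_params trained on n_tokens."""
    return E + A / n_params**alpha + B / n_tokens**beta

# Loss falls smoothly as both model size and data grow together.
for n, d in [(1e9, 20e9), (10e9, 200e9), (70e9, 1.4e12)]:
    print(f"N={n:.0e} params, D={d:.0e} tokens -> predicted loss ~ {predicted_loss(n, d):.3f}")
```

Fitting a curve like this to a handful of small training runs is what lets practitioners extrapolate how much a larger model or a longer run should help before spending the compute.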