Translation with Hugging Face in NLP - Model Metrics & Evaluation

For translation tasks, the key metric is the BLEU score. BLEU measures how close a machine translation is to a human reference translation by comparing overlapping words and phrases. It tells us whether the model is producing accurate and natural sentences. Unlike simple word-by-word accuracy, BLEU looks at the quality of the whole sentence.
Translation does not use a confusion matrix the way classification does. Instead, we use the BLEU score, which ranges from 0 to 1 (often reported as 0 to 100). A BLEU score of 1 means the output exactly matches the human reference; 0 means no overlap at all.
Example BLEU scores against the reference "The cat is on the mat":
- Model output 1: "The cat is on the mat" --> BLEU = 1.0 (exact match)
- Model output 2: "Cat on mat" --> BLEU ≈ 0.5 (partial match)
- Model output 3: "Dog runs fast" --> BLEU ≈ 0.0 (no match)
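To make the arithmetic behind these scores concrete, here is a minimal sentence-level BLEU sketch in pure Python (standard library only). It is a simplified version of the real metric: it applies add-1 smoothing to each n-gram precision, so the exact numbers will differ slightly from libraries such as sacrebleu, but the ranking of outputs is the same.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of smoothed
    n-gram precisions, multiplied by a brevity penalty."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # clipped overlap: a candidate n-gram only counts as often as it
        # appears in the reference
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        # add-1 smoothing keeps one missing n-gram order from zeroing the score
        log_prec_sum += math.log((overlap + 1) / (total + 1))
    # brevity penalty: outputs shorter than the reference are discounted
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_prec_sum / max_n)

reference = "The cat is on the mat"
print(bleu("The cat is on the mat", reference))  # exact match -> 1.0
print(bleu("Cat on mat", reference))             # partial overlap, short output
print(bleu("Dog runs fast", reference))          # almost no overlap
```

The brevity penalty is why "Cat on mat" scores well below 1.0 even though every word it contains appears in the reference.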
In translation, precision and recall are less direct than in classification: precision reflects how many of the model's output words are actually correct, while recall reflects how many of the reference's words are covered. BLEU balances these by checking overlapping phrases and penalizing outputs that are too short.
For example, a model that outputs only a few very common words might have high precision but low recall (missing details), while a model that outputs many words might cover more of the meaning (higher recall) but include errors (lower precision). BLEU helps balance this tradeoff.
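The tradeoff can be seen directly by computing plain unigram precision and recall against a reference (a standard-library sketch; BLEU itself aggregates precision over several n-gram orders rather than using these values as-is):

```python
from collections import Counter

def unigram_precision_recall(candidate, reference):
    """Word-overlap precision and recall against a single reference."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    ref_counts = Counter(ref)
    # clipped overlap: each reference word can only be matched as often
    # as it occurs in the reference
    overlap = sum(min(c, ref_counts[w]) for w, c in Counter(cand).items())
    return overlap / len(cand), overlap / len(ref)

reference = "the black cat sat on the mat"
# short, safe output: every word is correct, but most of the reference is missed
print(unigram_precision_recall("the cat", reference))  # (1.0, ~0.29)
# long, noisy output: covers the whole reference, but adds wrong words
print(unigram_precision_recall(
    "the black cat quickly sat down on the big red mat", reference))
```

The short candidate gets perfect precision but poor recall; the long one gets perfect recall but diluted precision, exactly the failure modes described above.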
As rough guidance, a BLEU score above 0.5 (50%) is usually considered good for translation, meaning the model's output is quite close to a human translation. A score below 0.2 (20%) is bad, signaling a translation that is often incorrect or missing key words. Keep in mind that BLEU scores depend on the language pair and dataset difficulty; scores around 0.3-0.5 are common for many models.
- Overfitting: The model may memorize training sentences, scoring high BLEU on training data but low on new sentences.
- Data leakage: If test sentences appear in training, BLEU scores will be unrealistically high.
- BLEU limitations: BLEU does not capture meaning perfectly; a sentence can have a low BLEU but still be a good translation.
- Ignoring fluency: BLEU focuses on matching words, not grammar or natural flow.
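One of these pitfalls, data leakage, is cheap to check for before trusting a BLEU score. The sketch below (a hypothetical helper, standard library only) flags test sentences that appear verbatim in the training set after simple normalization:

```python
def leaked_sentences(train_sents, test_sents):
    """Return test sentences that also appear, after lowercasing and
    stripping whitespace, in the training data."""
    train_set = {s.strip().lower() for s in train_sents}
    return [s for s in test_sents if s.strip().lower() in train_set]

train = ["The cat is on the mat", "She reads a book"]
test = ["the cat is on the mat", "He plays football"]
print(leaked_sentences(train, test))  # -> ['the cat is on the mat']
```

Exact-match checks like this miss near-duplicates (paraphrases, small edits), so they are a lower bound on leakage, not proof of a clean split.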
Your translation model has a BLEU score of 0.98 on training data but only 0.25 on new sentences. Is it good for production? Why or why not?
Answer: No, it is not good for production. The very high training BLEU combined with the low BLEU on new data shows overfitting: the model memorized training sentences but does not generalize to new ones. To fix this, use more (and more diverse) training data, apply regularization, or stop training earlier.
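A simple sanity check for this situation is to compare BLEU on training and held-out data and flag a large gap. The helper and the 0.3 threshold below are illustrative assumptions, not standard values:

```python
def looks_overfit(train_bleu, test_bleu, max_gap=0.3):
    """Flag a suspiciously large train/test BLEU gap (illustrative threshold)."""
    return (train_bleu - test_bleu) > max_gap

print(looks_overfit(0.98, 0.25))  # -> True: gap of 0.73, model memorized training data
print(looks_overfit(0.45, 0.40))  # -> False: small gap, model likely generalizes
```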