In audio transcription, the goal is to convert speech into text accurately. The key metric is Word Error Rate (WER), which measures how far the model's output is from a reference transcript. It counts the word substitutions, deletions, and insertions in the output relative to the reference and divides that total by the number of words in the reference. A lower WER means better transcription quality, and because insertions are counted, WER can exceed 100%.
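As a rough illustration of the calculation, the sketch below computes WER with a word-level edit distance. The function name `word_error_rate` and the whitespace tokenization are assumptions made for this example; real evaluations usually also normalize case and punctuation before scoring, which is not done here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in reference."""
    ref = reference.split()   # assumption: words are separated by whitespace
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
print(round(word_error_rate("the cat sat on the mat", "the cat sit on mat"), 3))  # 0.333
```

The minimum edit distance is used because it gives the smallest number of word errors that explains the difference between the two transcripts.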
WER is important because it directly reflects how closely the transcription matches what was actually said. Character Error Rate (CER), which applies the same calculation to characters instead of words, can also be used, especially for languages without clear word boundaries.
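For comparison, here is a minimal sketch of CER under the same assumptions as WER, except that the units are characters. The names `edit_distance` and `char_error_rate` are hypothetical, and the text is scored as-is, with no whitespace or punctuation normalization.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Minimum number of substitutions, deletions, and insertions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # match (0) or substitution (1)
            )
        prev = curr
    return prev[-1]


def char_error_rate(reference: str, hypothesis: str) -> float:
    """CER = character-level edit distance / characters in reference."""
    if not reference:
        raise ValueError("reference must contain at least one character")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Chinese has no spaces between words, so characters are the natural unit here.
# One wrong character out of six in the reference.
print(round(char_error_rate("今天天气很好", "今天天汽很好"), 3))  # 0.167
```

Splitting this sentence on whitespace would yield a single "word", so WER could only report 0% or 100% for it, while CER still shows that five of the six characters are correct.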