
Audio transcription (Whisper) in Prompt Engineering / GenAI - Model Metrics & Evaluation

Which metric matters for Audio Transcription (Whisper) and WHY

For audio transcription, the main goal is to convert spoken words into text accurately. The key metric is Word Error Rate (WER). WER measures how many words the model got wrong compared to the true transcript. It counts substitutions, deletions, and insertions of words. A lower WER means better transcription quality.

WER is important because it directly shows how close the transcription is to the real speech. Other metrics like Character Error Rate (CER) can also be used, especially for languages without clear word boundaries.

Confusion Matrix or Equivalent Visualization

Unlike classification tasks, audio transcription does not use a confusion matrix. Instead, errors are counted as:

True Transcript:  "I love machine learning"
Model Output:     "I love machine"

Errors:
- Deletion: missing "learning"
- Substitution: none
- Insertion: none

WER = (Substitutions + Deletions + Insertions) / Number of words in true transcript
WER = (0 + 1 + 0) / 4 = 0.25 (25%)
    

This shows 1 error out of 4 words, so 25% WER. Note that because insertions are also counted against the reference length, WER can exceed 100% in extreme cases.
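The calculation above can be sketched as a word-level edit distance. Here is a minimal, stdlib-only Python sketch (production tools such as the `jiwer` library add text normalization and more careful alignment):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j - 1] + sub,   # substitution (or exact match)
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
            )
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("I love machine learning", "I love machine"))  # 0.25
```

Running it on the example above reproduces the 25% WER: one deletion ("learning") over four reference words.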

Precision vs Recall Tradeoff (Adapted for Transcription)

In transcription, precision and recall relate to how many words are correctly recognized versus missed or wrongly added.

  • Precision: Of all words the model wrote, how many are correct? High precision means few extra or wrong words.
  • Recall: Of all words spoken, how many did the model capture? High recall means few missed words.

For example, in a meeting transcript, missing important words (low recall) can cause misunderstanding. Adding wrong words (low precision) can confuse the meaning.

Balancing precision and recall is key. Whisper models aim to minimize overall errors (WER), which balances these aspects.
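As a rough sketch of the adaptation above, word-level precision and recall can be estimated from an in-order alignment of the two word sequences. This uses Python's `difflib` for the alignment and is an illustrative approximation, not a standard Whisper evaluation metric:

```python
from difflib import SequenceMatcher

def word_precision_recall(reference: str, hypothesis: str) -> tuple[float, float]:
    """Approximate word-level precision and recall via in-order matching."""
    ref, hyp = reference.split(), hypothesis.split()
    # Count words that match between reference and hypothesis, in order
    matched = sum(b.size for b in SequenceMatcher(None, ref, hyp).get_matching_blocks())
    precision = matched / len(hyp) if hyp else 0.0  # correct / words written
    recall = matched / len(ref) if ref else 0.0     # correct / words spoken
    return precision, recall

p, r = word_precision_recall("I love machine learning", "I love machine")
print(p, r)  # 1.0 0.75
```

Here the model wrote no wrong words (precision 1.0) but missed one of four spoken words (recall 0.75), matching the deletion-only example earlier.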

What "Good" vs "Bad" Metric Values Look Like

Good transcription: WER below 10% means the transcript is very close to the original speech. Most words are correct, and the text is easy to understand.

Bad transcription: WER above 30% means many words are wrong, missing, or extra. The transcript may be confusing or unusable.

Context matters: For noisy audio or multiple speakers, a WER around 15-20% might still be acceptable.
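The thresholds above can be collected into a small helper. The bands come straight from this section; the function itself is illustrative, not an industry standard:

```python
def transcript_quality(wer: float, noisy_audio: bool = False) -> str:
    """Map a WER value (as a fraction, e.g. 0.25 for 25%) to a rough quality band.

    Thresholds follow the guidance above: <10% good, >30% bad,
    and ~15-20% still acceptable for noisy or multi-speaker audio.
    """
    if wer < 0.10:
        return "good"
    if noisy_audio and wer < 0.20:
        return "acceptable for noisy/multi-speaker audio"
    if wer <= 0.30:
        return "marginal"
    return "bad"

print(transcript_quality(0.25))  # marginal
```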

Common Metrics Pitfalls
  • Ignoring context: Some errors may be minor and not affect meaning, but WER treats all errors equally.
  • Data leakage: Testing on audio the model has seen before inflates performance.
  • Overfitting: Model performs well on training accents or speakers but poorly on new ones.
  • Ignoring language specifics: Some languages have complex word boundaries affecting WER calculation.
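One common mitigation for the first pitfall is normalizing both transcripts before scoring, so that casing and punctuation differences are not counted as errors. A minimal sketch (evaluation toolkits typically apply richer normalization, e.g. number and abbreviation handling):

```python
import re

def normalize(text: str) -> str:
    """Lowercase and strip punctuation so trivial mismatches don't inflate WER."""
    return re.sub(r"[^\w\s]", "", text.lower()).strip()

print(normalize("Hello, world!"))  # hello world
```

Without this step, "Hello, world!" vs "hello world" would count as two substitutions even though the speech was transcribed correctly.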
Self-Check Question

Your Whisper model has a 98% accuracy on a simple yes/no speech test but a 40% Word Error Rate on a long conversation. Is it good for real use? Why or why not?

Answer: No, it is not good for real use. The high accuracy on a simple test shows it can recognize very limited speech well, but the 40% WER on real conversations means many words are wrong or missing. This makes the transcript unreliable for understanding or decision-making.

Key Result
Word Error Rate (WER) is the key metric; lower WER means better transcription accuracy.