For text-to-speech (TTS), the main goal is to produce natural, clear, and understandable speech from text, so the most useful metrics measure how close the generated speech is to real human speech. These include Mean Opinion Score (MOS), a human rating of speech quality, and Word Error Rate (WER), which measures how accurately a speech recognition system transcribes the generated audio. Lower WER means clearer speech. Mel Cepstral Distortion (MCD) measures how close the acoustic features of the generated speech are to those of a reference human recording. Together, these metrics tell us whether the TTS output sounds natural and is easy to understand.
Text-to-speech generation in Prompt Engineering / GenAI - Model Metrics & Evaluation
Unlike classification tasks, TTS does not use a confusion matrix. Instead, we use Word Error Rate (WER) calculated by comparing the text recognized from generated speech to the original text.
Original text: "Hello world"
Recognized text: "Hello word"
WER = (Substitutions + Deletions + Insertions) / Number of words in original
WER = (1 + 0 + 0) / 2 = 0.5 (50%), because "world" was recognized as "word" (one substitution, no deletions or insertions)
Lower WER means better speech clarity.
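The WER calculation above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `word_error_rate` is hypothetical, words are split on whitespace and lowercased (an assumption about normalization), and the total of substitutions, deletions, and insertions is taken as the word-level edit distance.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# The example from the text: one substitution out of two words.
print(word_error_rate("Hello world", "Hello word"))  # 0.5
```

In practice the hypothesis string would come from running a speech recognizer on the generated audio, which is outside the scope of this sketch.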
In TTS, precision and recall are not typical metrics. Instead, there is a tradeoff between naturalness and intelligibility. For example:
- If the model focuses too much on sounding natural (like human voice), it might produce unclear words, lowering intelligibility.
- If the model focuses too much on clear pronunciation, the speech might sound robotic and less natural.
Balancing these is key. High naturalness with low intelligibility means listeners enjoy the voice but can't understand it well. High intelligibility with low naturalness means clear words but a boring or robotic voice.
Good values:
- MOS: Around 4.0 to 5.0 (on a scale of 1 to 5) means listeners find the speech natural and pleasant.
- WER: Close to 0% means the speech is very clear and easy to understand.
- MCD: Lower values (e.g., below 5 dB) mean the generated speech closely matches real speech acoustics.
Bad values:
- MOS: Below 3.0 means the speech sounds unnatural or robotic.
- WER: High values (e.g., above 20%) mean the speech is hard to understand.
- MCD: High values mean the speech sounds very different from real human speech.
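To make the MCD numbers above more concrete, here is a rough sketch of the commonly used MCD formula, (10 / ln 10) * sqrt(2 * sum of squared coefficient differences), averaged over frames. It assumes the reference and generated frames are already time-aligned (for example by dynamic time warping) and that the energy coefficient has been dropped; the function name and input layout are hypothetical.

```python
import math


def mel_cepstral_distortion(ref_frames, gen_frames):
    """Mean MCD in dB over aligned frames of mel-cepstral coefficients.

    Each argument is a list of frames; each frame is a list of floats.
    Assumes both sequences are already time-aligned and equally long.
    """
    if not ref_frames or len(ref_frames) != len(gen_frames):
        raise ValueError("need two non-empty, equally long frame sequences")

    scale = (10.0 / math.log(10.0)) * math.sqrt(2.0)
    total = 0.0
    for ref, gen in zip(ref_frames, gen_frames):
        squared_diff = sum((r - g) ** 2 for r, g in zip(ref, gen))
        total += scale * math.sqrt(squared_diff)
    return total / len(ref_frames)


# Identical features give 0 dB; larger differences give larger values.
print(mel_cepstral_distortion([[1.0, 2.0]], [[1.0, 2.0]]))  # 0.0
```

Extracting the mel-cepstral coefficients from audio and aligning the two recordings are separate steps that a real evaluation pipeline would need to handle first.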
Common pitfalls:
- Relying only on automatic metrics: Metrics like WER and MCD do not capture how natural or pleasant the speech sounds. Human listening tests (MOS) are essential.
- Ignoring context: Some words or phrases may be harder to pronounce. Evaluating only on easy sentences can give a false sense of quality.
- Overfitting to training voices: The model may sound great on voices it trained on but poorly on new voices or accents.
- Data leakage: Using test sentences seen during training can inflate metric scores.
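The data leakage point can be checked mechanically before evaluation. The sketch below is one possible approach, with hypothetical helper names: it normalizes sentences (lowercase, punctuation removed, whitespace collapsed) and reports any test sentence that also appears in the training set. Exact-match comparison after normalization is an assumption; near-duplicates would need a fuzzier check.

```python
import re


def normalize(sentence: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    no_punct = re.sub(r"[^\w\s]", "", sentence.lower())
    return " ".join(no_punct.split())


def find_leaked_sentences(train_sentences, test_sentences):
    """Return the test sentences whose normalized form also occurs in training."""
    seen = {normalize(s) for s in train_sentences}
    return [s for s in test_sentences if normalize(s) in seen]


train = ["Hello, world!", "The weather is nice today."]
test = ["hello world", "Good morning"]
print(find_leaked_sentences(train, test))  # ['hello world']
```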
Your TTS model has a Mean Opinion Score (MOS) of 4.5 but a Word Error Rate (WER) of 30%. Is this good for production? Why or why not?
Answer: This means listeners find the speech very natural (high MOS), but the speech is hard to understand (high WER). For production, this is not good because users may enjoy the voice but struggle to understand the words. You should improve clarity (reduce WER) while keeping naturalness.
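The reasoning in this answer can be written as a simple release check. This is only an illustrative sketch: the function name is hypothetical, and the default thresholds are borrowed from the good and bad value ranges earlier in this section (MOS of at least 4.0, WER no higher than 20%), not from any standard. A real project would likely choose a stricter WER limit.

```python
def ready_for_production(mos, wer, min_mos=4.0, max_wer=0.20):
    """Return (ok, reasons) for a TTS model given its MOS and WER.

    The thresholds are illustrative assumptions, not fixed rules.
    """
    reasons = []
    if mos < min_mos:
        reasons.append(f"MOS {mos:.1f} is below {min_mos:.1f}: speech may sound unnatural")
    if wer > max_wer:
        reasons.append(f"WER {wer:.0%} is above {max_wer:.0%}: speech is hard to understand")
    return (not reasons, reasons)


# The scenario from the question: natural voice, poor intelligibility.
ok, reasons = ready_for_production(mos=4.5, wer=0.30)
print(ok)       # False
print(reasons)
```

Both conditions must pass, which reflects the point that a high MOS does not compensate for a high WER.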
Practice
Solution
Step 1: Understand the function of TTS
Text-to-speech technology changes written words into sound that can be heard.
Step 2: Compare options with TTS purpose
Only "To convert written text into spoken audio" describes converting text to speech, which matches TTS.
Final Answer:
To convert written text into spoken audio -> Option D
Quick Check:
TTS = convert text to speech [OK]
Common mistakes:
- Confusing TTS with translation
- Thinking TTS summarizes text
- Mixing TTS with emotion detection
Solution
Step 1: Identify libraries related to TTS
gTTS is a Python library designed for text-to-speech conversion.
Step 2: Eliminate unrelated libraries
NumPy, Matplotlib, and Pandas are for math, plotting, and data, not TTS.
Final Answer:
gTTS -> Option B
Quick Check:
gTTS = text-to-speech library [OK]
Common mistakes:
- Choosing data or plotting libraries by mistake
- Confusing gTTS with general Python packages
- Assuming TTS always needs complex libraries
from gtts import gTTS
text = 'Hello world'
tts = gTTS(text)
tts.save('hello.mp3')
print('Audio saved')
Solution
Step 1: Analyze the code steps
The code imports gTTS, creates speech from 'Hello world', saves it as 'hello.mp3', then prints a message.
Step 2: Check for errors or missing parts
gTTS defaults to English when no language is given, so omitting the language argument does not raise an error. gTTS needs an internet connection to fetch the audio; assuming one is available, the code runs to completion.
Final Answer:
An audio file named 'hello.mp3' is created and 'Audio saved' is printed -> Option A
Quick Check:
Code saves audio and prints message [OK]
Common mistakes:
- Thinking the language parameter is mandatory
- Assuming print outputs the text spoken
- Forgetting that gTTS needs internet access, even though the code itself is valid
from gtts import gTTS
tts = gTTS('Hello')
tts.save()
Solution
Step 1: Check gTTS usage
The gTTS constructor accepts a text string, and the language is optional, so there is no error there.
Step 2: Check save() method
save() requires a filename argument for the audio file. Calling it with no argument raises a TypeError.
Final Answer:
Missing filename argument in save() method -> Option A
Quick Check:
save() needs filename [OK]
Common mistakes:
- Assuming language is always required
- Thinking text must be a list
- Believing the import statement is wrong
Solution
Step 1: Understand multilingual TTS needs
The system must speak different languages based on user choice, so the language must be flexible.
Step 2: Evaluate options for language flexibility
"Use gTTS with a dynamic language parameter set from user input" sets the language at run time, so each request is spoken in the correct language. The other options fix the language or use static audio, which won't adapt.
Final Answer:
Use gTTS with a dynamic language parameter set from user input -> Option C
Quick Check:
Dynamic language parameter enables multilingual TTS [OK]
Common mistakes:
- Ignoring language parameter flexibility
- Assuming default English works for all
- Using static audio files for dynamic text
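As a rough sketch of the dynamic language idea, the code below maps a user's choice to a gTTS language code and falls back to English when the choice is not recognized. The `SUPPORTED_LANGUAGES` table, the helper names, and the fallback rule are assumptions made for illustration. The example assumes the gTTS package is installed and an internet connection is available when `synthesize` is called.

```python
# Hypothetical subset of language codes offered to the user.
SUPPORTED_LANGUAGES = {"en": "English", "fr": "French", "es": "Spanish"}


def resolve_language(user_choice: str, default: str = "en") -> str:
    """Map raw user input to a supported language code, or fall back."""
    code = user_choice.strip().lower()
    return code if code in SUPPORTED_LANGUAGES else default


def synthesize(text: str, user_choice: str, out_path: str) -> str:
    """Generate speech in the user's language and save it to out_path."""
    # Imported here so the language helper can be used without gTTS installed.
    from gtts import gTTS

    lang = resolve_language(user_choice)
    gTTS(text, lang=lang).save(out_path)
    return lang


print(resolve_language(" FR "))  # fr
print(resolve_language("xx"))    # en
```

A call such as `synthesize("Bonjour le monde", "fr", "bonjour.mp3")` would then write French audio to the given file, while an unsupported choice would be spoken with the English default.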
