0
0
Prompt Engineering / GenAIml~20 mins

Text-to-speech generation in Prompt Engineering / GenAI - Practice Problems & Coding Challenges

Choose your learning style9 modes available
Challenge - 5 Problems
🎖️
Text-to-Speech Master
Get all challenges correct to earn this badge!
Test your skills under time pressure!
🧠 Conceptual
intermediate
2:00remaining
What is the main role of the vocoder in text-to-speech systems?

In text-to-speech (TTS) systems, the vocoder is a key component. What does it do?

AIt analyzes the input text to detect emotions.
BIt converts text into phonemes for pronunciation.
CIt converts the acoustic features into audible speech waveforms.
DIt translates speech back into text.
Attempts:
2 left
💡 Hint

Think about how the system creates sound from intermediate data.

Predict Output
intermediate
2:00remaining
What is the output shape of the mel spectrogram in this TTS preprocessing code?

Given the code below that converts audio to a mel spectrogram, what is the shape of mel_spectrogram?

Prompt Engineering / GenAI
import numpy as np
import librosa

audio, sr = librosa.load('audio.wav', sr=22050)
mel_spectrogram = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=80, hop_length=256, n_fft=1024)
output_shape = mel_spectrogram.shape
A(number_of_frames, 80)
B(80, number_of_frames)
C(1024, 80)
D(256, number_of_frames)
Attempts:
2 left
💡 Hint

Check the documentation for librosa.feature.melspectrogram output dimensions.

Model Choice
advanced
2:00remaining
Which model architecture is best suited for generating natural-sounding speech waveforms in TTS?

Among these options, which model architecture is designed specifically to generate high-quality speech waveforms in text-to-speech systems?

AWaveNet
BBERT
CResNet
DTacotron 2
Attempts:
2 left
💡 Hint

Consider which model is a neural vocoder producing raw audio.

Hyperparameter
advanced
2:00remaining
Which hyperparameter directly controls the time resolution of mel spectrogram frames in TTS preprocessing?

In mel spectrogram extraction, which hyperparameter affects how often frames are sampled over time?

An_fft
Bn_mels
Cwin_length
Dhop_length
Attempts:
2 left
💡 Hint

Think about the step size between frames in the spectrogram.

Metrics
expert
2:00remaining
Which metric best evaluates the naturalness of synthesized speech in TTS systems?

When assessing how natural synthesized speech sounds, which metric is most appropriate?

AMean Opinion Score (MOS) from human listeners
BWord Error Rate (WER) from speech recognition on synthesized audio
CBLEU score comparing synthesized text to reference text
DMean Squared Error (MSE) between predicted and target spectrograms
Attempts:
2 left
💡 Hint

Naturalness is subjective and often measured by human judgment.