In text-to-speech (TTS) systems, the vocoder is a key component. What does it do?
Think about how the system creates sound from intermediate data.
The vocoder takes acoustic features like spectrograms and generates the actual sound wave, making speech audible.
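To make the spectrogram-to-waveform idea concrete, here is a toy classical vocoder: the Griffin-Lim algorithm, which recovers a waveform from a magnitude spectrogram by iteratively estimating the missing phase. This is a minimal NumPy sketch (all function names, FFT sizes, and the test signal are illustrative choices, not from the original text), and a neural vocoder like WaveNet replaces this whole procedure with a learned model:

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    # Window each frame and take its real FFT; result is (freq_bins, frames).
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.fft.rfft(np.array(frames), axis=1).T

def istft(S, n_fft=512, hop=128):
    # Inverse FFT each frame, then overlap-add with window-sum normalization.
    win = np.hanning(n_fft)
    frames = np.fft.irfft(S.T, n=n_fft, axis=1)
    length = hop * (frames.shape[0] - 1) + n_fft
    x, norm = np.zeros(length), np.zeros(length)
    for i, frame in enumerate(frames):
        x[i * hop:i * hop + n_fft] += frame * win
        norm[i * hop:i * hop + n_fft] += win ** 2
    return x / np.maximum(norm, 1e-8)

def griffin_lim(mag, n_iter=32):
    # Start from random phase; alternate between time and frequency domains,
    # keeping the known magnitudes and the current phase estimate.
    phase = np.exp(2j * np.pi * np.random.rand(*mag.shape))
    for _ in range(n_iter):
        phase = np.exp(1j * np.angle(stft(istft(mag * phase))))
    return istft(mag * phase)

np.random.seed(0)
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)   # one second of a 440 Hz tone
mag = np.abs(stft(tone))             # magnitude spectrogram; phase is discarded
recon = griffin_lim(mag)             # waveform recovered from magnitudes alone
```

The key point is the direction of the mapping: features in, audible waveform out, which is exactly the vocoder's job in a TTS pipeline.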
Given the code below that converts audio to a mel spectrogram, what is the shape of mel_spectrogram?
import numpy as np
import librosa

# Load audio at 22.05 kHz, then compute an 80-band mel spectrogram.
audio, sr = librosa.load('audio.wav', sr=22050)
mel_spectrogram = librosa.feature.melspectrogram(
    y=audio, sr=sr, n_mels=80, hop_length=256, n_fft=1024)
output_shape = mel_spectrogram.shape
Check the documentation for librosa.feature.melspectrogram output dimensions.
The mel spectrogram has shape (n_mels, time_frames). Here n_mels=80, and with librosa's default center=True padding the frame count is 1 + len(audio) // hop_length, so the shape is (80, number_of_frames).
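To make the frame count concrete, here is the arithmetic for a hypothetical 3-second clip at the question's 22050 Hz sample rate (the duration is an assumption for illustration; the formula assumes librosa's default center=True padding):

```python
sr, duration = 22050, 3.0        # hypothetical 3-second clip
hop_length, n_mels = 256, 80     # parameters from the question's code
n_samples = int(sr * duration)   # 66150 samples

# With librosa's default center=True padding, the STFT produces
# 1 + floor(n_samples / hop_length) frames.
n_frames = 1 + n_samples // hop_length
shape = (n_mels, n_frames)
print(shape)  # (80, 259)
```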
Which model architecture is designed specifically to generate high-quality speech waveforms in text-to-speech systems?
Consider which model is a neural vocoder producing raw audio.
WaveNet is a neural vocoder that generates raw audio waveforms sample-by-sample, producing natural speech sounds.
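Two ideas from WaveNet are easy to illustrate numerically: its dilated causal convolutions give each output sample an exponentially large receptive field, and generation is autoregressive, one sample at a time. The sketch below is a toy illustration, not the real architecture; the dilation pattern (kernel size 2, dilations doubling 1 to 512, repeated in blocks) follows the configuration described in the WaveNet paper, while the "model" in the loop is just a placeholder linear filter:

```python
import numpy as np

# Receptive field of a WaveNet-style stack: kernel size 2, dilations doubling
# 1, 2, 4, ..., 512, repeated in three blocks. Each generated sample then
# conditions on roughly 3070 past samples.
kernel_size = 2
dilations = [2 ** i for i in range(10)] * 3
receptive_field = 1 + (kernel_size - 1) * sum(dilations)

# Toy sample-by-sample generation loop. The "model" here is a fixed linear
# filter plus tanh -- a stand-in for the deep network, included only to show
# the autoregressive structure: each sample depends on previously generated ones.
rng = np.random.default_rng(0)
context = 16
w = rng.standard_normal(context) * 0.1
x = np.zeros(4000)
for t in range(context, len(x)):
    x[t] = np.tanh(w @ x[t - context:t]) + 0.01 * rng.standard_normal()
```

This sequential dependence is also why the original WaveNet is slow at inference time, which motivated later parallel vocoders.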
In mel spectrogram extraction, which hyperparameter affects how often frames are sampled over time?
Think about the step size between frames in the spectrogram.
The hop_length sets the number of audio samples between successive frames, controlling time resolution.
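As a quick illustration with a hypothetical 3-second clip at the question's 22050 Hz rate (clip length is an assumption; the frame-count formula assumes librosa's default center=True), halving or doubling hop_length changes both the time step between frames and the total frame count:

```python
sr = 22050
n_samples = sr * 3   # hypothetical 3-second clip
results = {}
for hop_length in (128, 256, 512):
    frame_period_ms = 1000 * hop_length / sr   # time step between frames
    n_frames = 1 + n_samples // hop_length     # with librosa's default center=True
    results[hop_length] = (frame_period_ms, n_frames)
    print(f"hop_length={hop_length}: {frame_period_ms:.1f} ms/frame, {n_frames} frames")
```

A smaller hop_length gives finer time resolution but more frames to process.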
When assessing how natural synthesized speech sounds, which metric is most appropriate?
Naturalness is subjective and often measured by human judgment.
The Mean Opinion Score (MOS) is a subjective score collected from human listeners who rate speech naturalness on a 5-point scale, making it the most appropriate metric.
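Computing a MOS is simple arithmetic over listener ratings. The ratings below are hypothetical numbers for one synthesized utterance, invented for illustration; reported MOS values usually come with a confidence interval, sketched here with a normal approximation:

```python
import statistics

# Hypothetical ratings from ten listeners on the usual 5-point scale
# (1 = bad ... 5 = excellent) for one synthesized utterance.
ratings = [4, 5, 4, 3, 5, 4, 4, 5, 3, 4]

mos = statistics.mean(ratings)                                 # 4.1
ci95 = 1.96 * statistics.stdev(ratings) / len(ratings) ** 0.5  # ~0.46
print(f"MOS = {mos:.2f} +/- {ci95:.2f}")
```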