0
0
Prompt Engineering / GenAIml~12 mins

Audio transcription (Whisper) in Prompt Engineering / GenAI - Model Pipeline Trace

Choose your learning style9 modes available
Model Pipeline - Audio transcription (Whisper)

This pipeline converts spoken audio into written text using the Whisper model. It listens to audio, processes it, and outputs the transcription.

Data Flow - 5 Stages
1Audio Input
1 audio file, 30 seconds, 16kHz sample rateRaw audio waveform loaded1 audio array, 480,000 samples
Audio of a person saying 'Hello, how are you?'
2Preprocessing
1 audio array, 480,000 samplesConvert waveform to log-Mel spectrogram1 spectrogram, 80 Mel bands x 3000 frames
Spectrogram showing energy patterns of speech sounds
3Feature Encoding
1 spectrogram, 80 x 3000Encode spectrogram into feature vectors1 feature tensor, 3000 x 512
Encoded features representing audio content
4Decoder (Text Generation)
1 feature tensor, 3000 x 512Generate text tokens from featuresSequence of text tokens, length 15
Tokens representing words: ['Hello', ',', 'how', 'are', 'you', '?']
5Postprocessing
Sequence of text tokens, length 15Convert tokens to readable textString: 'Hello, how are you?'
Final transcription text
Training Trace - Epoch by Epoch

Loss
2.5 |****
2.0 |*** 
1.5 |**  
1.0 |*   
0.5 |    
     +----
      1 2 3 4 5 Epochs
EpochLoss ↓Accuracy ↑Observation
12.30.45Model starts learning basic audio-text alignment
21.80.60Loss decreases, accuracy improves as model learns speech patterns
31.40.72Model better understands phonemes and word boundaries
41.10.80Improved transcription quality, fewer errors
50.90.85Model converges with good transcription accuracy
Prediction Trace - 5 Layers
Layer 1: Audio Input
Layer 2: Preprocessing
Layer 3: Feature Encoding
Layer 4: Decoder (Text Generation)
Layer 5: Postprocessing
Model Quiz - 3 Questions
Test your understanding
What does the preprocessing stage do to the audio?
ATurns audio into text tokens directly
BConverts audio into a spectrogram showing sound energy over time
CSplits audio into separate words
DRemoves noise from the audio
Key Insight
This visualization shows how Whisper transforms raw audio into text by first converting sound waves into visual patterns, then encoding these into features, and finally decoding them into words. Training improves the model's ability to match audio to text, reducing errors over time.