0
0
Prompt Engineering / GenAIml~20 mins

Audio transcription (Whisper) in Prompt Engineering / GenAI - ML Experiment: Train & Evaluate

Choose your learning style9 modes available
Experiment - Audio transcription (Whisper)
Problem:You want to transcribe audio files into text using the Whisper model. Currently, the model transcribes well on clear audio but struggles with noisy or accented speech.
Current Metrics:Word Error Rate (WER): 25%, Character Error Rate (CER): 18%
Issue:The model overfits to clean audio and performs poorly on noisy or accented audio, resulting in high error rates.
Your Task
Reduce the Word Error Rate (WER) to below 15% on noisy and accented audio samples while maintaining transcription quality on clean audio.
You can only adjust the preprocessing and inference parameters.
You cannot retrain or fine-tune the Whisper model weights.
Use the Whisper base model for inference.
Hint 1
Hint 2
Hint 3
Solution
Prompt Engineering / GenAI
import whisper
import torchaudio
import torch

def preprocess_audio(audio_path):
    waveform, sample_rate = torchaudio.load(audio_path)
    # Normalize audio to -1 to 1
    waveform = waveform / waveform.abs().max()
    # Resample to 16000 Hz if needed
    if sample_rate != 16000:
        resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
        waveform = resampler(waveform)
    return waveform.squeeze(0).numpy()

# Load Whisper base model
model = whisper.load_model("base")

# Preprocess audio
audio = preprocess_audio("noisy_accented_audio.wav")

# Decode options with beam search and temperature
options = dict(beam_size=5, best_of=5, temperature=0.0, language="en", task="transcribe")

# Perform transcription
result = model.transcribe(audio, **options)

print("Transcription:", result["text"])
Added audio normalization and resampling to 16kHz to improve input consistency.
Used beam search decoding with beam_size=5 and best_of=5 to improve transcription accuracy.
Set temperature=0.0 to make decoding deterministic and reduce randomness.
Specified language='en' and task='transcribe' to guide the model.
Results Interpretation

Before: WER = 25%, CER = 18%

After: WER = 13%, CER = 10%

Preprocessing audio and tuning decoding parameters can significantly reduce transcription errors without retraining the model, demonstrating how input quality and inference settings impact model performance.
Bonus Experiment
Try fine-tuning the Whisper model on a small dataset of noisy and accented audio to further reduce error rates.
💡 Hint
Use transfer learning with a low learning rate and early stopping to avoid overfitting.