EchoScript

Automatic transcription with speaker diarization using Whisper and PyAnnote.

May 20, 2024 · 6 min read

Problem

Generating accurate transcripts from multi-speaker audio requires both high-quality ASR and speaker attribution. Whisper alone produces accurate text but does not identify who is speaking.

Solution

EchoScript integrates OpenAI Whisper for ASR with PyAnnote.audio for speaker diarization, aligning speaker labels to transcript segments via timestamp matching.

Pipeline

  1. Preprocessing: resample to 16 kHz mono, normalize loudness.
  2. Diarization: PyAnnote.audio segments audio by speaker with RTTM output.
  3. Transcription: Whisper transcribes each diarized segment independently.
  4. Alignment: Word-level timestamps from Whisper are matched to diarization segments.
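The alignment step (4) can be sketched without the heavy model dependencies. This is a minimal illustration, not EchoScript's actual code: it assumes Whisper's word-level timestamps and PyAnnote's speaker turns have already been extracted into plain records, and labels each word with the speaker turn it overlaps most.

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float

@dataclass
class Turn:
    speaker: str
    start: float
    end: float

def assign_speakers(words: list[Word], turns: list[Turn]) -> list[tuple[str, str]]:
    """Label each word with the diarization turn that overlaps it most.

    Words falling outside every turn (e.g. in silence gaps) get "UNKNOWN".
    """
    labeled = []
    for w in words:
        best, best_overlap = "UNKNOWN", 0.0
        for t in turns:
            # Overlap of [w.start, w.end] with [t.start, t.end]; negative if disjoint.
            overlap = min(w.end, t.end) - max(w.start, t.start)
            if overlap > best_overlap:
                best, best_overlap = t.speaker, overlap
        labeled.append((w.text, best))
    return labeled
```

Maximum-overlap assignment is robust to small timestamp jitter between the two models, which matters because Whisper and PyAnnote segment the audio independently.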

Export Formats

  • SRT: speaker-labeled subtitle blocks with timecodes.
  • VTT: WebVTT format for browser-native subtitle tracks.
  • JSON: structured data with per-word timing and confidence scores.
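A speaker-labeled SRT export is straightforward once aligned segments exist. The following is a sketch under the assumption that each segment is a `(speaker, start, end, text)` tuple; SRT timecodes use the `HH:MM:SS,mmm` format with a comma before the milliseconds.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timecode, e.g. 3661.042 -> "01:01:01,042"."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments) -> str:
    """Render (speaker, start, end, text) tuples as numbered SRT blocks."""
    blocks = []
    for i, (speaker, start, end, text) in enumerate(segments, 1):
        blocks.append(
            f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{speaker}: {text}\n"
        )
    return "\n".join(blocks)
```

The VTT export differs mainly in the `WEBVTT` header, a dot instead of a comma in timecodes, and optional cue identifiers; the JSON export can serialize the word records from the alignment step directly.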

Frontend

WaveSurfer.js renders audio waveforms. Clicking a transcript segment seeks the audio to that position. Speaker colors are assigned per unique speaker ID.
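Per-speaker color assignment can happen server-side when building the JSON payload the frontend consumes. A minimal sketch (the palette and function name are illustrative, not EchoScript's actual implementation): each unique speaker ID gets the next palette color in order of first appearance, cycling if speakers outnumber colors.

```python
# Illustrative palette; any set of distinguishable hex colors works.
PALETTE = ["#e6194b", "#3cb44b", "#4363d8", "#f58231", "#911eb4", "#46f0f0"]

def speaker_colors(speaker_ids: list[str]) -> dict[str, str]:
    """Map each unique speaker ID to a palette color, in first-appearance order."""
    colors: dict[str, str] = {}
    for sid in speaker_ids:
        if sid not in colors:
            colors[sid] = PALETTE[len(colors) % len(PALETTE)]
    return colors
```

Assigning by first appearance rather than by hashing the ID keeps colors stable within one transcript while guaranteeing the first few speakers always get maximally distinct colors.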