WhisperBenchmarks

This repo has been altered to help me understand how to benchmark WhisperX. After a bit more research, I can refocus the idea into a simple script.

Videos

Videos are chosen for being short and for matching their given category.

Categories	Title	Link	Length	Instant Download
Poor mic placement	Body camera footage from July 10 traffic stop	Internet Archive	2:22	MP4
Thick accents	Moonshine for Medicine Popcorn Sutton	Internet Archive	1:35	MP4
Artifacts in audio	2002 007 Movie Trailer Commercial Bad Video	Internet Archive	0:14	MP4
Ideal audio (one speaker)	8 Bit Bookclub	Internet Archive	1:44	MP3
Long form (many speakers)	Bionic Woman "Black Magic" (1976)	Internet Archive	43:53	MP4

How to Run Whisper Benchmarks

Two tools are recommended: Hyperfine, a shell benchmarking tool written in Rust, and WhisperX, a re-implementation of Whisper that advertises transcription at up to 70x realtime speed.

To use WhisperX, I recommend creating a HuggingFace account and accepting the user conditions for these two model repos: (https://huggingface.co/pyannote/speaker-diarization-3.1/tree/main) and (https://huggingface.co/pyannote/segmentation-3.0). You will then have to include your HF token in every WhisperX command.
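As a rough sketch of how the two tools fit together, the Python helper below builds a Hyperfine command line that times one WhisperX run. The function name `build_hyperfine_command`, the file name `ideal_audio.mp3`, the `HF_TOKEN` environment variable, and the default model and run count are all assumptions made for illustration; adjust them to your own setup. It only relies on Hyperfine's `--runs` and `--export-json` options and WhisperX's `--model`, `--diarize`, and `--hf_token` options.

```python
import os
import shlex


def build_hyperfine_command(media_path, hf_token, model="large-v2", runs=3,
                            export_json="results.json"):
    """Return the argv list for one Hyperfine benchmark of a WhisperX run.

    The defaults (model, runs, export file) are illustrative assumptions.
    """
    # Hyperfine expects the benchmarked command as a single shell string,
    # so the WhisperX arguments are quoted and joined here.
    whisperx_cmd = shlex.join([
        "whisperx", media_path,
        "--model", model,
        "--diarize",
        "--hf_token", hf_token,
    ])
    return [
        "hyperfine",
        "--runs", str(runs),            # number of timed runs
        "--export-json", export_json,   # machine-readable results
        whisperx_cmd,
    ]


if __name__ == "__main__":
    # HF_TOKEN is a hypothetical environment variable holding your token.
    token = os.environ.get("HF_TOKEN", "")
    cmd = build_hyperfine_command("ideal_audio.mp3", token)
    # Print the command for inspection; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Note that the token ends up inside the benchmarked command string, so it will appear in Hyperfine's output and in the exported JSON; scrub it before sharing results.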

Results

Results are for the complete run, which includes loading the model, running VAD, and running the transcription. Links are embedded in the results for each category.
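Hyperfine reports times in seconds, while the tables here use m:s:ms. The sketch below shows one possible way to convert a Hyperfine JSON export into that format. The helper names `format_m_s_ms` and `mean_times` are hypothetical, and the zero-padding of seconds and milliseconds is an assumed convention, not something this repo prescribes. It reads the `results`, `command`, and `mean` fields of the JSON export.

```python
import json


def format_m_s_ms(seconds):
    """Format a duration given in seconds as m:s:ms (zero-padded s and ms)."""
    total_ms = round(seconds * 1000)
    minutes, rem_ms = divmod(total_ms, 60_000)
    secs, ms = divmod(rem_ms, 1000)
    return f"{minutes}:{secs:02d}:{ms:03d}"


def mean_times(json_path):
    """Map each benchmarked command to its mean wall-clock time as m:s:ms."""
    with open(json_path) as fh:
        data = json.load(fh)
    # Each entry in "results" describes one benchmarked command;
    # "mean" is the average run time in seconds.
    return {r["command"]: format_m_s_ms(r["mean"]) for r in data["results"]}
```

For example, `format_m_s_ms(83.456)` returns `"1:23:456"`, which can be pasted directly into a table cell.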

CPU Benchmarks

CPU Model	Poor mic placement (m:s:ms)	Thick accents (m:s:ms)	Artifacts in audio (m:s:ms)	Ideal audio (m:s:ms)	Long form (m:s:ms)	(Docker/Native)	Model

GPU Benchmarks

GPU Model	Poor mic placement (m:s:ms)	Thick accents (m:s:ms)	Artifacts in audio (m:s:ms)	Ideal audio (m:s:ms)	Long form (m:s:ms)	(Docker/Native)	Model
