brooke/whisperbenchmarks

Easy-to-use benchmarks using audio and video content from the Internet Archive, specifically targeting various challenging scenarios in audio recordings.

brooke db7c3726be Update README.md		2023-12-04 07:56:15 +00:00
benchmark-outputs	update	2023-12-03 00:00:48 -05:00
LICENSE	Initial commit	2023-12-02 03:16:17 +00:00
README.md	Update README.md	2023-12-04 07:56:15 +00:00

WhisperBenchmarks

This repository provides easy-to-use benchmarks using audio and video content from the Internet Archive, specifically targeting various challenging scenarios in audio recordings.

Based on https://gitlab.com/aadnk/whisper-webui

Models

Model	Command
faster-large-v3	--whisper_implementation faster-whisper --model large-v3
faster-medium	--whisper_implementation faster-whisper --model medium
faster-small	--whisper_implementation faster-whisper --model small
faster-tiny	--whisper_implementation faster-whisper --model tiny
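
The table above maps each model name to a set of command-line flags, so a benchmark script could keep that mapping in one place. The following is a minimal sketch under assumptions: `cli.py` as the entry point and the `build_command` helper are hypothetical names, and only the flags shown in the table come from this README.

```python
import shlex

# Model name -> flags, copied from the Models table.
MODEL_FLAGS = {
    "faster-large-v3": "--whisper_implementation faster-whisper --model large-v3",
    "faster-medium": "--whisper_implementation faster-whisper --model medium",
    "faster-small": "--whisper_implementation faster-whisper --model small",
    "faster-tiny": "--whisper_implementation faster-whisper --model tiny",
}


def build_command(model_name, media_path, entry_point="cli.py"):
    """Return an argument list suitable for subprocess.run.

    entry_point is a hypothetical placeholder for whatever script
    launches the transcription; adjust it to your own checkout.
    """
    flags = shlex.split(MODEL_FLAGS[model_name])
    return ["python", entry_point, *flags, media_path]
```

Keeping the flags as a lookup table means adding a model is a one-line change, and an unknown model name fails early with a `KeyError`.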

Videos

Clips are chosen to match their category and, apart from the long-form entry, to be short.

Category	Title	Link	Length	Instant Download
Poor mic placement	Body camera footage from July 10 traffic stop	Internet Archive	2:22	MP4
Thick accents	Moonshine for Medicine Popcorn Sutton	Internet Archive	1:35	MP4
Artifacts in audio	2002 007 Movie Trailer Commercial Bad Video	Internet Archive	0:14	MP4
Ideal audio (one speaker)	8 Bit Bookclub	Internet Archive	1:44	MP3
Long form (many speakers)	Bionic Woman "Black Magic" (1976)	Internet Archive	43:53	MP4

How to Run Whisper Benchmarks

-- TODO --

Results

Timings cover the complete run, which includes loading the model, running VAD, and running the transcription. Links are embedded in the results for each category.
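
Because the reported time covers the whole run, one way to measure it is to time the entire process from the outside. This is a rough sketch, not the method used for the numbers below: `time_command` and `format_elapsed` are hypothetical helpers, and the formatting assumes the last field of the `m:s:ms` columns holds hundredths of a second, which is a guess based on the two-digit values in the tables.

```python
import subprocess
import sys
import time


def format_elapsed(seconds):
    """Format a duration as mm:ss:cc (minutes, seconds, hundredths).

    Assumption: the two-digit final field in the results tables
    is hundredths of a second.
    """
    total_hundredths = round(seconds * 100)
    minutes, rest = divmod(total_hundredths, 6000)
    secs, hundredths = divmod(rest, 100)
    return f"{minutes:02d}:{secs:02d}:{hundredths:02d}"


def time_command(args):
    """Run a command to completion and return its wall-clock time in seconds.

    Each call starts a fresh process, so model loading is included.
    Raises CalledProcessError if the command exits with a non-zero status.
    """
    start = time.perf_counter()
    subprocess.run(args, check=True)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Stand-in command; replace it with the real transcription command line.
    elapsed = time_command([sys.executable, "-c", "pass"])
    print(format_elapsed(elapsed))
```

`time.perf_counter` is used instead of `time.time` because it is monotonic and intended for measuring intervals.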

CPU Benchmarks

CPU Model	Poor mic placement (m:s:ms)	Thick accents (m:s:ms)	Artifacts in audio (m:s:ms)	Ideal audio (m:s:ms)	Long form (m:s:ms)	Environment (Docker/Native)	Model

GPU Benchmarks

GPU Model	Poor mic placement (m:s:ms)	Thick accents (m:s:ms)	Artifacts in audio (m:s:ms)	Ideal audio (m:s:ms)	Long form (m:s:ms)	Environment (Docker/Native)	Model
RTX 2060S	00:11:12	00:06:96	00:04:40	00:08:31	03:34:00	Native	faster-medium
RTX A5000	00:10:86	00:07:84	00:07:75	00:08:91	03:17:10	Native	faster-large-v3

Todo:

Write simple Bash scripts that run a set of benchmarks and clean up afterwards
Finalize a standard format for exporting the data into a spreadsheet
Some outputs seem to contain artifacts when using faster-whisper. There should be an easy review process so that batches can be re-run quickly, perhaps a link that opens VLC with the new subtitles loaded
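
For the spreadsheet export item, one possible shape is a CSV file with one row per device and category. The format is not finalized, so this is only a sketch: the column names in `FIELDS` and the `rows_to_csv` helper are hypothetical.

```python
import csv
import io

# Hypothetical column names; the export format has not been finalized.
FIELDS = ["device", "category", "elapsed", "environment", "model"]


def rows_to_csv(rows):
    """Serialize a list of result dictionaries to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # One cell taken from the GPU Benchmarks table.
    print(rows_to_csv([
        {
            "device": "RTX 2060S",
            "category": "Poor mic placement",
            "elapsed": "00:11:12",
            "environment": "Native",
            "model": "faster-medium",
        }
    ]))
```

A long layout like this (one measurement per row) is usually easier to filter and pivot in a spreadsheet than the wide layout used in the tables above, and it allows several runs of the same clip to be recorded side by side.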

Notes:

Across multiple runs, timings for the shorter clips vary much more than I expected: "Poor mic placement" took anywhere from 8 s to 20 s on an A5000.
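
Given that spread, it may be more informative to repeat each clip several times and report a summary instead of a single timing. A minimal sketch, assuming durations are collected in seconds; `summarize_runs` is a hypothetical helper and not part of this repository.

```python
import statistics


def summarize_runs(durations):
    """Summarize repeated timings (in seconds) for a single clip."""
    if not durations:
        raise ValueError("at least one duration is required")
    return {
        "runs": len(durations),
        "min": min(durations),
        "max": max(durations),
        "median": statistics.median(durations),
        # The sample standard deviation needs at least two values.
        "stdev": statistics.stdev(durations) if len(durations) > 1 else 0.0,
    }
```

The median is less sensitive than the mean to a single slow run, for example one where the model files had to be read from a cold disk cache.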