This repo has been altered to help me understand the process of benchmarking WhisperX. After I take the time to do a bit more research, I can start refocusing the idea into a simple script.
| Category | Title | Source | Length (m:ss) | Download |
| --- | --- | --- | --- | --- |
| Poor mic placement | Body camera footage from July 10 traffic stop | [Internet Archive](https://archive.org/details/cobmn-Body_camera_footage_from_July_10_traffic_stop) | 2:22 | [MP4](https://archive.org/download/cobmn-Body_camera_footage_from_July_10_traffic_stop/Body_camera_footage_from_July_10_traffic_stop.mp4) |
| Thick accents | Moonshine for Medicine Popcorn Sutton | [Internet Archive](https://archive.org/details/this-is-the-last-dam-run-of-likker-ill-ever-make-full-movie/+Moonshine+for+Medicine++++Popcorn+Sutton.mp4) | 1:35 | [MP4](https://archive.org/download/this-is-the-last-dam-run-of-likker-ill-ever-make-full-movie/%20Moonshine%20for%20Medicine%20%20%20%20Popcorn%20Sutton.mp4) |
| Artifacts in audio | 2002 007 Movie Trailer Commercial Bad Video | [Internet Archive](https://archive.org/details/2002variouscommercials/2002+007+Movie+Trailer+Commercial+Bad+Video.mp4) | 0:14 | [MP4](https://archive.org/download/2002variouscommercials/2002%20A%20Touch%20Of%20Class%20Limos%20Bridal%20Show%20Wilton%20Mall%20Saratoga%20Commercial.mp4) |
Two tools are recommended: [Hyperfine](https://github.com/sharkdp/hyperfine), a command-line benchmarking tool written in Rust, and [WhisperX](https://github.com/m-bain/whisperX), a re-implementation of Whisper that advertises up to 70x real-time transcription speed.
To use WhisperX, I recommend creating a Hugging Face account and accepting the user conditions for these two gated repos: [pyannote/speaker-diarization-3.1](https://huggingface.co/pyannote/speaker-diarization-3.1/tree/main) and [pyannote/segmentation-3.0](https://huggingface.co/pyannote/segmentation-3.0).
Then, you'll have to include your HF token in every WhisperX command.
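
As a rough sketch of how the two tools can be combined, the script below times one WhisperX run with Hyperfine. The file name `sample.mp4`, the `large-v2` model, the run count, and the `results.md` output name are placeholders I picked for illustration, and it assumes your token is exported as `HF_TOKEN`. The token is left unexpanded in the command string so that Hyperfine's child shell reads it from the environment instead of printing it in the report.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Placeholder input file; pass a real path as the first argument.
AUDIO="${1:-sample.mp4}"

# \$HF_TOKEN stays literal here, so the shell that Hyperfine spawns
# expands it from the environment (assumes: export HF_TOKEN=...).
CMD="whisperx \"$AUDIO\" --model large-v2 --diarize --hf_token \"\$HF_TOKEN\""

if command -v hyperfine >/dev/null && command -v whisperx >/dev/null; then
  # Three timed runs; the summary table is written to results.md.
  hyperfine --runs 3 --export-markdown results.md "$CMD"
else
  # Tools not installed: just show what would be benchmarked.
  echo "would run: hyperfine --runs 3 --export-markdown results.md '$CMD'"
fi
```

Each timed run covers the whole command, so model loading is included in the measurement.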
Results are for the complete run, which includes loading the model, running VAD, and running the transcription. Links are embedded in the results for each category.
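
Hyperfine reports times in seconds, while the result tables use m:s:ms. A small helper along these lines can do the conversion; the name `to_msms` and the zero-padded output format are my own assumptions, not something the tools provide.

```shell
#!/usr/bin/env bash

# Hypothetical helper: convert seconds (e.g. Hyperfine's mean time)
# into m:s:ms, rounding to the nearest millisecond.
to_msms() {
  awk -v s="$1" 'BEGIN {
    total_ms = int(s * 1000 + 0.5)          # round to whole milliseconds
    m   = int(total_ms / 60000)             # whole minutes
    sec = int((total_ms % 60000) / 1000)    # remaining whole seconds
    ms  = total_ms % 1000                   # remaining milliseconds
    printf "%d:%02d:%03d\n", m, sec, ms
  }'
}

to_msms 83.4567   # prints 1:23:457
```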
| CPU Model | Poor mic placement (m:s:ms) | Thick accents (m:s:ms) | Artifacts in audio (m:s:ms) | Ideal audio (m:s:ms) | Long form (m:s:ms) | (Docker/Native) | Model |
| GPU Model | Poor mic placement (m:s:ms) | Thick accents (m:s:ms) | Artifacts in audio (m:s:ms) | Ideal audio (m:s:ms) | Long form (m:s:ms) | (Docker/Native) | Model |