Blog

Fun with the Whisper library and Python

A python application to add subtitles to your media

Iain (Bill) Wiseman

Morning Windy

python cuda

27 Apr 2025

6 min read

I built this project after running into a familiar problem: great shows, hard-to-follow dialogue, and subtitles that were missing or just not very good. So I put together a command line tool called Bibble Subs, a Python app that takes a video file, extracts the audio, runs speech to text with faster-whisper, and writes proper subtitle files you can drop straight into a media player.

The goal was practical, not academic. I wanted something I could point at a real episode, choose how much of it to process, and get usable results quickly, whether I was running on CPU or CUDA. Along the way, I added guardrails for file validation, time slicing, model download options, language selection, and multiple subtitle output formats including SRT, VTT, and ASS.

In this post, I will walk through the approach, key parts of the code, and what worked well in practice so you can build your own subtitle pipeline or adapt this one to your workflow.

The Problem

A while ago I wanted to watch television series "When the Boat Comes In". It is set in North East of England. I attach a clip of the start so you can see what you are up against.

Tooling

I didn’t want to build a speech‑recognition system from scratch — I just wanted subtitles that didn’t look like they’d been typed by someone who’d only heard English described to them once, in a pub, in 1973. So I leaned on a few solid tools to do the heavy lifting while I stitched everything together with Python.

Here’s what’s under the hood.

PyAV (FFmpeg under the hood) — the media workhorse

Every video file is a different flavour of chaos. I use PyAV, which sits on top of FFmpeg libraries, to open media files, decode audio streams, and write extracted audio for transcription.

faster‑whisper — Whisper, but without the gym membership

This is the speech‑to‑text engine. It runs on CTranslate2, which means:

fast on CPU
very fast on CUDA

PyTorch is also included for GPU device detection and CUDA support. It also gives me word‑level timestamps, which is perfect for building proper subtitle files instead of "vibes‑based timing".

Torch + CUDA — optional turbo mode

If you’ve got a GPU, Bibble Subs will happily use it.
If you don’t, it won’t sulk — it just switches to CPU and carries on.
The goal was flexibility, not “you must own a gaming PC to watch telly”.

External Whisper models — pick your fighter

You can choose from:

tiny
base
small
medium
large‑v3

The tool will download the model for you if it’s missing.
Bigger models give better accuracy; smaller ones are faster.
It’s like choosing between instant noodles and a slow‑cooked stew — both work, it just depends how hungry you are.

Python CLI — because a GUI would have slowed me down

The whole thing runs from the command line.
You point it at a video, choose a model, optionally pick a time range, and it spits out subtitles in:

SRT (classic)
VTT (web‑friendly)
ASS (for people who want karaoke effects or just enjoy chaos)

It’s simple, scriptable, and easy to automate.

How It Works

Bibble Subs follows a simple idea: take a video, pull the sound out of it, let a very clever model listen to it, and then turn the results into subtitles that don’t make you question your life choices.

Under the hood, the pipeline looks like this:

1. You point it at a video

Bibble Subs validates the path and checks for a real video track with MediaInfo before doing any heavy work. It can handle most formats supported by your local PyAV/FFmpeg build.

You can also choose:

where to start
how long to process
which model to use
which subtitle format you want

It’s your show.

2. PyAV extracts and slices audio

This is the “scoop the audio out of the video” step. The app opens the container, finds the audio stream, seeks to the requested start time, and processes frames until it reaches the requested duration.

This keeps transcription fast and avoids feeding Whisper an entire hour when you only needed the first scene.

3. faster‑whisper does the listening

This is where the magic happens.

The model:

loads the Whisper size you chose
runs on CUDA if you’ve got it
falls back to CPU if you don’t
detects the language (unless you tell it)
uses VAD to avoid hallucinating dialogue that never existed
produces word‑level timestamps

The result is a clean, structured transcription with segment timings and text that actually matches what people said.

4. The transcript gets turned into subtitles

Once the text is ready, Bibble Subs converts it into your chosen format:

SRT — the classic
VTT — for browsers and modern players
ASS — for people who enjoy karaoke effects or just want to feel powerful

Each format has its own writer, so the timing stays accurate and the output looks like something a human might have made.

5. You get a subtitle file you can drop straight into a media player

That’s it.
No cloud services, no accounts, no “upload your entire video to our server and hope for the best”.

Just a local tool that does one job well.

Future Improvements

Bibble Subs works well today, but there are a few things I want to polish so it feels less like a “developer tool I built for myself” and more like a “proper grown‑up application”.

1. Smarter output handling

Right now, if you don’t specify an output path, Bibble Subs politely drops your subtitles into:

/tmp/subtitles.srt

Which is fine… unless you forget to move it, reboot your machine, and then wonder why your subtitles have mysteriously evaporated. Future versions will:

create output files next to the original video
or inside a user‑specified folder
or inside a project‑level “subs/” directory
with automatic naming based on the video file

Basically: fewer disappearing subtitles, more “oh, there it is”.

2. Batch / folder processing

At the moment, Bibble Subs handles one file at a time.
That’s great for a single episode, less great when you’re staring at a folder containing an entire season.

Planned improvements include: