About Auto Subtitles
AntiUpload's Auto Subtitles is an in-browser speech-to-text transcriber that runs the Whisper model (the same open-source model that powers OpenAI's audio API) entirely on your device. Drop in an audio or video file and you get back a subtitle file in SRT, VTT, plain text, JSON, or TikTok-style word-by-word ASS format. The model runs in a Web Worker via the Transformers.js library (@xenova/transformers); no audio is uploaded to any server.
Compare the math: Rev / Otter / Trint charge $0.10-0.25 per minute of transcription. At those rates a 1-hour podcast costs $6-15 to transcribe (60 minutes × $0.10-0.25), with the audio uploaded to their servers and processed in their cloud. The same hour transcribes for $0 here, with the file never leaving your browser. The first transcription downloads the ~75 MB Whisper-tiny model from Hugging Face; subsequent runs reuse the cached model, so they start without a download.
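The arithmetic behind that comparison can be sketched in a few lines. This is only an illustration: `cloudCostRange` and its default rates are hypothetical names that take the $0.10-0.25 per-minute range quoted above as given, not any service's actual price list.

```typescript
// Hypothetical helper: cost range for a service that bills per minute of audio.
// The default rates are the $0.10-0.25/min range quoted in the text above.
interface CostRange {
  low: number;  // dollars at the cheapest rate
  high: number; // dollars at the most expensive rate
}

function cloudCostRange(minutes: number, lowRate = 0.10, highRate = 0.25): CostRange {
  // Round to whole cents so floating-point noise does not leak into the result.
  const toDollars = (rate: number): number => Math.round(minutes * rate * 100) / 100;
  return { low: toDollars(lowRate), high: toDollars(highRate) };
}

// A 1-hour podcast at the quoted rates.
const hour = cloudCostRange(60);
console.log(`$${hour.low} to $${hour.high}`); // prints "$6 to $15"
```

The in-browser path has no per-minute rate at all, so its cost stays at $0 regardless of duration; the only one-time cost is the model download.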
The tradeoff is accuracy in exchange for speed and a small download. Whisper-tiny is fast (roughly 2-5× faster than realtime on a modern laptop) but less accurate than Whisper-large (the model the cloud services use). For clean speech (podcasts, lectures, interviews), tiny gets ~95% of words right. For heavy accents, noisy backgrounds, or technical jargon, the larger Base model (145 MB, available as an option) helps. Translation mode forces the multilingual model and outputs English from any source language, which is useful for foreign-language source material that needs English captions.
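One way to picture how these options interact is a small selection function. The sketch below is an assumption about how such a choice could be made, not the tool's actual code: `pickModel`, `TranscribeOptions`, and the use of the `Xenova/whisper-*` checkpoint names on Hugging Face are all illustrative.

```typescript
// Illustrative only: maps the user-facing options to a Whisper checkpoint name.
type ModelSize = "tiny" | "base";

interface TranscribeOptions {
  size: ModelSize;    // "tiny" (default download) or "base" (larger, more accurate)
  language: string;   // "en", "auto", or another language code
  translate: boolean; // translation mode: output English from any source language
}

interface ModelChoice {
  modelId: string;
  task: "transcribe" | "translate";
}

function pickModel(opts: TranscribeOptions): ModelChoice {
  // The English-only checkpoint is usable only for plain English transcription.
  // Translation mode and non-English (or auto-detected) input need the multilingual one.
  const englishOnly = opts.language === "en" && !opts.translate;
  const suffix = englishOnly ? ".en" : "";
  return {
    modelId: `Xenova/whisper-${opts.size}${suffix}`,
    task: opts.translate ? "translate" : "transcribe",
  };
}

const fast = pickModel({ size: "tiny", language: "en", translate: false });
const foreign = pickModel({ size: "base", language: "auto", translate: true });
console.log(fast.modelId, foreign.modelId, foreign.task);
```

Keeping the decision in one pure function makes it easy to check that translation mode never ends up paired with an English-only checkpoint.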
How it works
- Drop an audio or video file. Accepts every common format: MP4, MOV, WebM, MKV, AVI for video; MP3, WAV, M4A, OGG, FLAC, AAC, OPUS for audio. The tool extracts the audio track via FFmpeg before running Whisper.
- Pick output format + language. SRT (universal), VTT (web video), TXT (plain transcript), JSON (with timestamps), or ASS (TikTok-style word-by-word captions). Set language to "en" for the fastest English-only model, "auto" for automatic detection across 99 languages, or pick a specific language for best accuracy.
- First run: download model (~75 MB, one-time). On the first transcription the Whisper-tiny model downloads from Hugging Face's CDN (~75 MB for English-only, ~78 MB for multilingual). The browser caches it, so subsequent runs start without a download.
- Transcribe. Whisper runs on your CPU (with SIMD acceleration). On a modern laptop expect roughly 2-5× faster than realtime: a 10-minute podcast transcribes in 2-5 minutes. Older devices are slower and mobile devices much slower (avoid mobile for files longer than 5 minutes).
- Download the subtitle file. Output is timestamp-aligned and ready to use. Drop the SRT into Premiere / DaVinci Resolve, or upload it to YouTube / Vimeo as a caption track. For TikTok-style burn-in, pair with our subtitle-burn-in tool.
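The final step, turning timestamped text into an SRT file, can be sketched as follows. This is a minimal illustration under stated assumptions: the `Segment` shape (start and end in seconds plus text) and the function names are hypothetical, and the tool's real output code may differ.

```typescript
// Hypothetical segment shape: times in seconds, as a transcriber might report them.
interface Segment {
  start: number;
  end: number;
  text: string;
}

// Left-pad a non-negative integer with zeros to the given width.
function pad(n: number, width: number): string {
  let s = String(n);
  while (s.length < width) s = "0" + s;
  return s;
}

// Format seconds as an SRT timestamp: HH:MM:SS,mmm (comma before milliseconds).
function srtTimestamp(seconds: number): string {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const s = Math.floor(totalMs / 1000) % 60;
  const m = Math.floor(totalMs / 60000) % 60;
  const h = Math.floor(totalMs / 3600000);
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(ms, 3)}`;
}

// Build the SRT body: numbered cues, each followed by a blank line.
function toSrt(segments: Segment[]): string {
  return segments
    .map((seg, i) =>
      `${i + 1}\n${srtTimestamp(seg.start)} --> ${srtTimestamp(seg.end)}\n${seg.text.trim()}\n`)
    .join("\n");
}

const srt = toSrt([
  { start: 0, end: 2.5, text: " Welcome to the show." },
  { start: 2.5, end: 5, text: " Today we talk about captions." },
]);
console.log(srt);
```

VTT differs mainly in its `WEBVTT` header line and in using a period instead of a comma before the milliseconds, so the same segment list could feed either format.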