AntiUpload// browser-resident file tools

100% free · Whisper local · No upload · No watermark

Auto Subtitles

Generate .srt or .vtt captions from speech using Whisper-tiny. It runs entirely in your browser and downloads a 75 MB model on first use.

100% free · No file size limit · No watermark · No sign-up
  1. Pick file
  2. Configure
  3. Download
First-run note: the Whisper model downloads from Hugging Face (~75 MB for English, ~78 MB multilingual) and caches in your browser. Subsequent runs reuse the cached copy and start immediately, with no re-download.
  • Files never leave your browser — processed entirely on your device
  • No upload, no queue, no waiting for a worker to free up
  • No file-size cap from us — limit is your device's RAM

About Auto Subtitles

AntiUpload's Auto Subtitles is an in-browser speech-to-text transcriber that runs Whisper (the open-source model family behind OpenAI's audio API) entirely on your device. Drop in an audio or video file and you get back a subtitle file in SRT, VTT, plain text, JSON, or TikTok-style word-by-word ASS format. The model runs in a Web Worker via the Transformers.js library (@xenova/transformers); no audio is uploaded to any server.

Compare the math: Rev / Otter / Trint charge $0.10-0.25 per minute of transcription. A 1-hour podcast costs $6-15 to transcribe on those services, with the audio uploaded to their servers and processed in their cloud. The same hour transcribes for $0 here, with the file never leaving your browser. The first transcription downloads the ~75 MB Whisper-tiny model from Hugging Face; subsequent runs reuse the cached model so they start instantly.
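The arithmetic behind that comparison can be checked with a few lines of TypeScript. This is only a sketch of the calculation: the per-minute rates are the ones quoted above, and the function name `transcriptionCost` is made up for illustration.

```typescript
// Cost in USD of transcribing `minutes` of audio at a flat per-minute rate.
function transcriptionCost(minutes: number, ratePerMinute: number): number {
  // Round to whole cents so floating-point noise does not leak into the result.
  return Math.round(minutes * ratePerMinute * 100) / 100;
}

const podcastMinutes = 60;
const lowEstimate = transcriptionCost(podcastMinutes, 0.10);  // low end of the quoted range
const highEstimate = transcriptionCost(podcastMinutes, 0.25); // high end of the quoted range
const localCost = transcriptionCost(podcastMinutes, 0);       // in-browser transcription

console.log(`Cloud: $${lowEstimate}-$${highEstimate}, local: $${localCost}`);
```

For a 60-minute episode this gives the $6-15 range quoted above, against $0 locally.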

The tradeoff is speed and accuracy. Whisper-tiny is fast (~2-5× realtime on a modern laptop) but less accurate than Whisper-large (the model the cloud services use). For clean speech (podcasts, lectures, interviews) tiny gets ~95% of words right. For heavy accents, noisy backgrounds, or technical jargon, the Base model (145 MB, available as an option) helps. Translation mode forces the multilingual model and outputs English from any supported source language, which is useful for foreign-language material destined for English-audience captions.
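To turn a realtime factor into a waiting time, divide the audio length by the factor. The sketch below assumes the 2-5× range quoted above holds for your machine; the function names are illustrative and not part of the tool.

```typescript
// Wall-clock minutes needed to transcribe `audioMinutes` of audio.
// realtimeFactor = 2 means the model processes audio twice as fast as it plays.
function estimateMinutes(audioMinutes: number, realtimeFactor: number): number {
  if (realtimeFactor <= 0) {
    throw new RangeError("realtimeFactor must be positive");
  }
  return audioMinutes / realtimeFactor;
}

// Best and worst case for the 2-5x range quoted for a modern laptop.
function estimateRange(audioMinutes: number): { best: number; worst: number } {
  return {
    best: estimateMinutes(audioMinutes, 5),
    worst: estimateMinutes(audioMinutes, 2),
  };
}

console.log(estimateRange(10));  // { best: 2, worst: 5 }
console.log(estimateRange(120)); // { best: 24, worst: 60 }
```

Older laptops and phones will sit below this range, so treat the worst case as optimistic on those devices.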

How it works

  1. Drop an audio or video file
     Accepts the common formats: MP4, MOV, WebM, MKV, AVI for video; MP3, WAV, M4A, OGG, FLAC, AAC, OPUS for audio. The tool extracts the audio track via FFmpeg before running Whisper.
  2. Pick output format + language
     SRT (universal), VTT (web video), TXT (plain transcript), JSON (with timestamps), or ASS (TikTok-style word-by-word captions). Set language to "en" for the fastest English-only model, "auto" for 99-language detection, or pick a specific language for best accuracy.
  3. First run: download model (~75 MB, one-time)
     On the first transcription the Whisper-tiny model downloads from Hugging Face's CDN (~75 MB for English-only, ~78 MB for multilingual). The browser caches it, so later runs skip the download.
  4. Transcribe
     Whisper runs on your CPU (with SIMD acceleration). On a modern laptop expect ~2-5× realtime: a 10-minute podcast transcribes in 2-5 minutes. Older devices are slower, and mobile devices much slower (avoid mobile for files longer than 5 min).
  5. Download the subtitle file
     Output is timestamp-aligned and ready to use. Drop the SRT into Premiere / DaVinci Resolve, or upload to YouTube / Vimeo as a caption track. For TikTok-style burn-in, pair with our subtitle-burn-in tool.
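The last step, turning timestamped segments into an SRT file, is easy to sketch. The code below is an illustration rather than the tool's own serializer: the `CaptionSegment` shape (start and end in seconds, plus text) is an assumption about what the transcription step returns.

```typescript
// Assumed shape of one transcribed segment; times are in seconds.
interface CaptionSegment {
  start: number;
  end: number;
  text: string;
}

// Left-pad a non-negative integer with zeros to at least `width` digits.
function zeroPad(value: number, width: number): string {
  let s = String(value);
  while (s.length < width) s = "0" + s;
  return s;
}

// Format seconds as an SRT timestamp: HH:MM:SS,mmm (comma before milliseconds).
function toSrtTimestamp(seconds: number): string {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const totalSec = Math.floor(totalMs / 1000);
  const s = totalSec % 60;
  const m = Math.floor(totalSec / 60) % 60;
  const h = Math.floor(totalSec / 3600);
  return `${zeroPad(h, 2)}:${zeroPad(m, 2)}:${zeroPad(s, 2)},${zeroPad(ms, 3)}`;
}

// Serialize segments as numbered SRT cues separated by blank lines.
function toSrt(segments: CaptionSegment[]): string {
  return segments
    .map((seg, i) =>
      `${i + 1}\n${toSrtTimestamp(seg.start)} --> ${toSrtTimestamp(seg.end)}\n${seg.text.trim()}\n`)
    .join("\n");
}

console.log(toSrt([
  { start: 0, end: 2.5, text: " Hello and welcome." },
  { start: 2.5, end: 4, text: " Today's topic is captions." },
]));
```

VTT output differs mainly in starting with a `WEBVTT` header line and using a period instead of a comma before the milliseconds.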

When to use Auto Subtitles

Captioning a podcast episode for YouTube
YouTube auto-captions are notoriously inaccurate; manual captions get prioritised in search. Generate an SRT locally, upload as the caption track, get cleaner copy + better YouTube SEO.
TikTok / Reels caption track without paying CapCut Pro
Pick the ASS (word-by-word) output, then burn in via our Subtitle Burn-In tool. Matches the CapCut auto-captions look without the $7.99/month subscription.
Transcribing interviews for journalism / research
Sending sensitive interview audio to Rev or Otter raises privacy / consent issues. Local transcription keeps the audio on your device — no third party hears it.
Foreign-language video → English subtitles for a client
Set task to "translate", input language to "auto" — Whisper produces English captions directly from any of 99 source languages. Skip the Google Translate round-trip.
Lecture / meeting transcription for accessibility
Generate a TXT transcript for searchability, or SRT for video accessibility compliance (ADA / EAA). Free and private — appropriate for sensitive educational content.

Frequently asked questions

How accurate is Whisper-tiny for transcription?
On clean English speech (podcasts, lectures), Whisper-tiny.en reaches ~95% word accuracy, comparable to most paid services on equivalent audio. Accuracy drops with heavy accents, technical jargon, or background noise. The optional Base model (145 MB, ~2× slower) recovers most of that loss. Whisper-large (what cloud services such as the OpenAI API run) is still the gold standard, but it is far too large to ship as a 75 MB browser download.
Why does the first run take so long?
First use downloads the Whisper model (~75 MB tiny, ~78 MB multilingual, ~145 MB base) from Hugging Face's CDN. The browser caches it in IndexedDB. Every subsequent transcription starts instantly — no re-download. If you clear your browser cache the model re-downloads on next use.
Is my audio really never uploaded?
Confirmed. Open DevTools → Network → reproduce a transcription. The only network requests are: the page JavaScript (small), the FFmpeg WASM engine on first use (~12 MB), the Whisper model on first use (~75 MB), and the page assets. No request contains your audio. Transcription runs in a Web Worker on your CPU.
What's the difference between transcribe and translate mode?
Transcribe: output text is in the same language as the source audio (Spanish in → Spanish out). Translate: output text is in English regardless of source language (Spanish in → English out). Translate forces the multilingual model and is slower / slightly less accurate than transcribe, but skips the manual translation step.
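The model choice implied by these two modes can be written down as a small rule. This is a hypothetical sketch, not the tool's code: the `Xenova/whisper-*` ids are the public Transformers.js conversions on Hugging Face, and this page does not say which repository the tool actually loads.

```typescript
type WhisperTask = "transcribe" | "translate";
type ModelSize = "tiny" | "base";

interface SubtitleOptions {
  language: string; // "en", "auto", or another language code
  task: WhisperTask;
  size: ModelSize;
}

// Hypothetical helper: pick a model id for the chosen options.
// The English-only ".en" variant is used only for English transcription;
// translate mode and every other language need the multilingual model.
function pickModel(opts: SubtitleOptions): string {
  const englishOnly = opts.task === "transcribe" && opts.language === "en";
  return `Xenova/whisper-${opts.size}${englishOnly ? ".en" : ""}`;
}

console.log(pickModel({ language: "en", task: "transcribe", size: "tiny" }));  // Xenova/whisper-tiny.en
console.log(pickModel({ language: "es", task: "translate", size: "tiny" }));   // Xenova/whisper-tiny
console.log(pickModel({ language: "auto", task: "transcribe", size: "base" })); // Xenova/whisper-base
```

Keeping the rule in one function makes it harder to pair translate mode with an English-only model by accident.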
Can it handle a 2-hour podcast / lecture?
Yes, but expect roughly 25-60 minutes of CPU time on a modern laptop (2-5× realtime). The audio is processed in chunks so memory usage stays bounded (Whisper-tiny needs about 200 MB of working RAM regardless of file length). For files longer than 30 minutes consider closing other tabs to free up CPU.
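A minimal sketch of how chunking keeps memory bounded: only one window of samples has to be handed to the model at a time. Whisper works on 30-second windows of 16 kHz audio; the 5-second overlap and the `planChunks` name are assumptions for illustration, not the tool's actual settings.

```typescript
// One window of audio, expressed as a half-open sample range [startSample, endSample).
interface AudioChunk {
  startSample: number;
  endSample: number;
}

// Split `totalSamples` of audio into overlapping fixed-length windows.
function planChunks(
  totalSamples: number,
  sampleRate: number = 16000,
  chunkSeconds: number = 30,
  overlapSeconds: number = 5, // assumed overlap so words at window edges are not cut
): AudioChunk[] {
  const chunkLen = chunkSeconds * sampleRate;
  const step = (chunkSeconds - overlapSeconds) * sampleRate;
  if (step <= 0) {
    throw new RangeError("overlap must be shorter than the chunk");
  }
  const chunks: AudioChunk[] = [];
  for (let start = 0; start < totalSamples; start += step) {
    const end = Math.min(start + chunkLen, totalSamples);
    chunks.push({ startSample: start, endSample: end });
    if (end === totalSamples) break; // the last window reached the end of the audio
  }
  return chunks;
}

// A 2-hour file at 16 kHz is split into many windows of at most 480,000 samples each.
const twoHours = planChunks(2 * 60 * 60 * 16000);
console.log(twoHours.length, twoHours[0], twoHours[twoHours.length - 1]);
```

A 30-second window of 16 kHz mono audio stored as 32-bit floats is 480,000 samples, about 1.9 MB, which is why file length has little effect on peak memory.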
Why is the SRT timing slightly off in places?
Whisper occasionally drifts on long silences (the model lacks the duration signal to anchor itself). We pre-filter silent regions with Voice Activity Detection to mitigate this — typical silence-heavy audio (podcasts with intro music) sees 30-50% speed and accuracy improvement from this step. For broadcast-grade timestamp accuracy you'd still want a paid service like Descript with manual editing.
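To illustrate the idea of a silence pre-filter, here is a deliberately simple energy-threshold detector. It is not the tool's VAD (production detectors are usually trained models); the frame size, the RMS threshold, and the `detectSpeech` name are all assumptions chosen for the example.

```typescript
// A detected stretch of speech, in seconds.
interface SpeechRegion {
  start: number;
  end: number;
}

// Mark frames whose RMS energy reaches `threshold` as speech and merge
// consecutive speech frames into regions.
function detectSpeech(
  samples: Float32Array,
  sampleRate: number = 16000,
  frameMs: number = 20,
  threshold: number = 0.01, // assumed loudness cutoff for samples in [-1, 1]
): SpeechRegion[] {
  const frameLen = Math.floor((sampleRate * frameMs) / 1000);
  const regions: SpeechRegion[] = [];
  let current: SpeechRegion | null = null;

  for (let offset = 0; offset < samples.length; offset += frameLen) {
    const end = Math.min(offset + frameLen, samples.length);
    let sumSquares = 0;
    for (let i = offset; i < end; i++) {
      sumSquares += samples[i] * samples[i];
    }
    const rms = Math.sqrt(sumSquares / (end - offset));

    if (rms >= threshold) {
      if (current === null) {
        current = { start: offset / sampleRate, end: end / sampleRate };
      } else {
        current.end = end / sampleRate; // extend the open region
      }
    } else if (current !== null) {
      regions.push(current); // silence closes the open region
      current = null;
    }
  }
  if (current !== null) regions.push(current);
  return regions;
}

// One second of silence, one second of signal, one second of silence.
const demo = new Float32Array(3 * 16000);
for (let i = 16000; i < 32000; i++) demo[i] = 0.5;
console.log(detectSpeech(demo));
```

Only the detected regions would then be passed to Whisper, with each region's start time added back to the segment timestamps so the captions still line up with the original file.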
How does this compare to the paid OpenAI Whisper API?
The paid API uses a Whisper-large model (significantly more accurate on noisy / accented audio) and runs on OpenAI's servers. It costs $0.006/min ($0.36/hour). Our tool runs the smaller Whisper-tiny model on your device for $0. For clean speech the accuracy gap is small (~3-5 percentage points of word error rate). For noisy / accented speech the API is meaningfully better. Trade-off: privacy, cost, and no file-size cap (browser) versus accuracy (cloud).
