OpenAI charges by the minute, so speed up your audio
Core trick: speeding audio to cut cost/time
- Original post describes using ffmpeg to speed a 40‑minute talk up 2–3× so it fits under OpenAI’s 25‑minute upload cap, cutting both cost and latency while still producing usable transcripts/summaries (see the ffmpeg sketch after this list).
- Several commenters report similar discoveries (e.g., 2× for social-media reels) and note it feels “obvious” once you think in terms of model time vs. wall‑clock time.
- Some point out this is conceptually similar to lowering sample rate or downsampling intermediate encoder layers in Whisper to gain throughput.
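A minimal sketch of the speed‑up step, assuming ffmpeg’s `atempo` filter and placeholder file names (the post’s exact command isn’t reproduced here). The factor is split across two stages so each stays within the 0.5–2.0 range that older ffmpeg builds accept:

```python
# Sketch: speed up audio with ffmpeg's atempo filter (pitch-corrected tempo
# change) before sending it to a transcription API. File names, bitrate, and
# the 2.5x factor are illustrative assumptions.
import subprocess

def speed_up(src: str, dst: str, factor: float = 2.5) -> None:
    """Re-encode `src` at `factor`x speed into `dst`."""
    # Split the factor across two atempo stages, e.g. 2.5x -> 2.0 * 1.25,
    # since some ffmpeg builds cap a single atempo stage at 2.0.
    first = min(factor, 2.0)
    second = factor / first
    filt = f"atempo={first},atempo={second}" if second != 1.0 else f"atempo={first}"
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-filter:a", filt,
         "-vn", "-ac", "1", "-b:a", "64k", dst],
        check=True,
    )

speed_up("talk_40min.m4a", "talk_16min.m4a", factor=2.5)  # 40 min -> 16 min
```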
Alternatives, pricing, and business angles
- Multiple people suggest bypassing OpenAI’s transcription API entirely (a local-run sketch follows this list):
  - Run Whisper (or faster‑whisper/whisper.cpp) locally, especially on Apple Silicon.
  - Use cheaper hosted Whisper from Groq, Cloudflare Workers AI, DeepInfra, etc., citing ~10× lower prices.
  - Use other LLMs with audio support (Gemini 2.x, Phi multimodal) or specialized ASR services.
- Some are already selling “speech is cheap”‑style APIs, arguing you must add value (classification, diarization, UI) beyond raw transcription.
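As a sketch of the local route, here is the faster‑whisper Python API; the model size, compute type, and file name are illustrative choices, not recommendations from the thread:

```python
# Sketch: transcribe locally with faster-whisper instead of a hosted API.
# Assumes `pip install faster-whisper`; "small" / int8 are CPU-friendly picks.
from faster_whisper import WhisperModel

model = WhisperModel("small", compute_type="int8")
segments, info = model.transcribe("talk_16min.m4a")  # placeholder file name
print(f"Detected language: {info.language}")
for seg in segments:
    print(f"[{seg.start:7.2f} -> {seg.end:7.2f}] {seg.text}")
```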
Accuracy, limits, and evaluation
- People question accuracy at 2–4× speed, asking for word error rate or diff‑based comparisons (a WER sketch follows this list); others argue what matters is summary fidelity, not verbatim text.
- Suggestions include:
  - LLM‑based evaluation of whether key themes persist across different speeds.
  - Measuring run‑to‑run variance by transcribing the same audio multiple times.
- An OpenAI engineer confirms 2–3× still works “reasonably well,” though with a likely measurable accuracy loss that grows with speed.
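A sketch of the diff/WER check commenters asked for, assuming the `jiwer` package and two placeholder transcript files (one produced from 1× audio, one from sped-up audio):

```python
# Sketch: compare a sped-up transcript against a 1x baseline via word error rate.
# Assumes `pip install jiwer`; the transcript files are placeholders.
from pathlib import Path
import jiwer

transcript_1x = Path("transcript_1x.txt").read_text()
transcript_3x = Path("transcript_3x.txt").read_text()

# Treat the 1x transcript as the reference and the sped-up one as the hypothesis.
error_rate = jiwer.wer(transcript_1x, transcript_3x)
print(f"WER of 3x transcript vs. 1x baseline: {error_rate:.1%}")
```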
Local vs. cloud, privacy, and efficiency
- Strong thread arguing that local Whisper is “good enough,” essentially free, and avoids sending personal interests or sensitive data to OpenAI.
- Others counter that newer proprietary models (e.g., gpt‑4o‑transcribe) can be faster or better, but can’t be run locally.
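For contrast with the local route, a minimal call to OpenAI’s hosted transcription endpoint using the gpt‑4o‑transcribe model mentioned above; this follows the current OpenAI Python SDK, and the file name is a placeholder:

```python
# Sketch: hosted transcription via the OpenAI API (requires OPENAI_API_KEY).
from openai import OpenAI

client = OpenAI()
with open("talk_16min.m4a", "rb") as audio:  # placeholder file name
    result = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio,
    )
print(result.text)
```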
Preprocessing tricks and tooling
- Multiple ffmpeg recipes shared to (a combined filter sketch follows this list):
  - Remove silence before transcription, cutting both cost and processing time.
  - Normalize audio to reduce hallucinations.
- Many tips on grabbing and using YouTube transcripts (yt‑dlp, unofficial APIs), and on playback‑speed extensions (up to 4–10×).
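A sketch combining the silence-removal and normalization recipes in one ffmpeg pass; the thresholds and durations are illustrative assumptions, not the thread’s exact settings:

```python
# Sketch: strip long silences and normalize loudness before transcription.
import subprocess

def preprocess(src: str, dst: str) -> None:
    filters = (
        # Drop silences longer than 1s that fall below -40 dB.
        "silenceremove=stop_periods=-1:stop_duration=1:stop_threshold=-40dB,"
        # EBU R128 loudness normalization to tame very quiet or clipped passages.
        "loudnorm"
    )
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-af", filters, "-vn", dst],
        check=True,
    )

preprocess("raw_recording.m4a", "cleaned.m4a")  # placeholder file names
```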
Meta: speed vs. understanding
- Substantial side‑discussion:
  - Some argue summaries and 2–3× playback are “contentmaxing” but degrade depth of thought.
  - Others say speeding content just matches their natural processing rate, and depth comes from intentional re‑watching and reflection.