OpenAI Audio Models

Overall Audio Quality & Style Control

  • Many find the voices impressive, with strong prosody control: the “vibe” box can change attitude, pacing, and emotion in surprisingly nuanced ways (e.g., pirates, villains, sleepy Bostonian, cows).
  • Others hear a metallic/vibrating timbre and clear “AI-ness,” sometimes worse than Siri or ElevenLabs; some voices (e.g., older ones) sound robotic or “NPC-like.”
  • Strong “uncanny valley” reactions: the voices come across as expressive but slightly off or theatrical; some prefer OpenAI’s Advanced Voice Mode or competitors for long-form listening.
  • Style steering works for simple, concise instructions and playful constraints (“replace every second word with potato”), but detailed or regional accent prompts (Somerset farmer, UK regional, AAVE) often fail or revert to generic/American-ish delivery.

Determinism, Safety, and Controls

  • Users report high non-determinism: identical text/voice/vibe can yield very different tones, accents, and quality, which is viewed as a major problem for assistants and production use.
  • Safety behavior seems persona-dependent: some “NYC cabbie”/edgy vibes freely read profanity and copypasta, while “Santa,” “Medieval Knight,” etc. refuse with policy messages.
  • Slurs are blocked, but users say homophonic workarounds often bypass filters.

Speech-to-Text: Capabilities & Gaps

  • OpenAI claims new STT models outperform Whisper on FLEURS; commenters note this benchmark focuses on read speech and may not reflect real-world conversational, shouted, or whispered audio.
  • Longstanding concerns: hallucinations, autocorrections, mixed-language handling, loss of word-level timestamps, and lack of diarization or dual-channel awareness.
  • Strong demand for: speaker attribution, word timestamps/speech marks, diarization, and models trained to preserve exact phrasing and numbers.
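
To make the requested features concrete, here is a minimal sketch of what word-level timestamps with speaker attribution could look like as data, and how such output might be folded into speaker turns. The `Word` and `Turn` types and the `group_into_turns` helper are hypothetical names invented for illustration; they are not the output format of any OpenAI model or other STT library.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """One transcribed word (hypothetical shape, not a real API response)."""
    text: str
    start: float   # seconds from the start of the audio
    end: float     # seconds from the start of the audio
    speaker: str   # diarization label, e.g. "A" or "B"


@dataclass
class Turn:
    """A run of consecutive words spoken by the same speaker."""
    speaker: str
    start: float
    end: float
    text: str


def group_into_turns(words: List[Word]) -> List[Turn]:
    """Merge consecutive words with the same speaker label into turns."""
    turns: List[Turn] = []
    for word in words:
        if turns and turns[-1].speaker == word.speaker:
            # Same speaker keeps talking: extend the current turn.
            turns[-1].end = word.end
            turns[-1].text += " " + word.text
        else:
            # Speaker changed (or first word): open a new turn.
            turns.append(Turn(word.speaker, word.start, word.end, word.text))
    return turns


if __name__ == "__main__":
    sample = [
        Word("call", 0.0, 0.3, "A"),
        Word("me", 0.3, 0.5, "A"),
        Word("at", 0.9, 1.0, "B"),
        Word("555-0100", 1.0, 1.8, "B"),
    ]
    for turn in group_into_turns(sample):
        print(f"[{turn.start:.1f}-{turn.end:.1f}] {turn.speaker}: {turn.text}")
```

Keeping the number as a verbatim token in the sample reflects the request that transcripts preserve exact phrasing and numbers rather than autocorrecting them.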

Ecosystem, Openness, and On-Device Use

  • Disappointment that new STT/TTS models are not open-sourced like Whisper and are not downloadable; OpenAI says they are too large for consumer hardware.
  • Several users need robust local STT/TTS (accessibility, AAC apps) and discuss alternatives (Whisper.cpp, Piper, Kokoro, Orpheus, Sesame, Nvidia Canary), typically with trade-offs in latency, quality, or language support.

Pricing, Competition, and Use Cases

  • Pricing (~$0.015/min TTS; significantly cheaper STT) is seen as dramatically undercutting ElevenLabs and competitive with Google TTS, enabling personal audiobooks and consumer apps.
  • Many still judge ElevenLabs ahead on naturalness and especially on speech-to-speech voice conversion that preserves timing and prosody.
  • Strong interest in using these models for agents, audiobooks, and long-form content, but concerns remain around nondeterminism, accent realism, and missing STT features.
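
The "personal audiobooks" point is easy to check with back-of-the-envelope arithmetic. The sketch below uses the approximate ~$0.015/min TTS figure quoted above; the 150 words-per-minute narration rate and the 90,000-word book length are illustrative assumptions, not figures from the discussion, and `estimate_tts_cost` is a hypothetical helper.

```python
# Approximate TTS price quoted in the discussion (USD per minute of audio).
TTS_PRICE_PER_MINUTE_USD = 0.015


def estimate_tts_cost(
    word_count: int,
    words_per_minute: float = 150.0,  # assumed narration speed
    price_per_minute: float = TTS_PRICE_PER_MINUTE_USD,
) -> float:
    """Estimate the cost in USD of narrating `word_count` words."""
    if words_per_minute <= 0:
        raise ValueError("words_per_minute must be positive")
    minutes = word_count / words_per_minute
    return minutes * price_per_minute


if __name__ == "__main__":
    book_words = 90_000  # assumed length of a typical novel
    cost = estimate_tts_cost(book_words)
    print(f"{book_words} words -> ${cost:.2f}")
```

Under these assumptions a novel-length book is about 600 minutes of audio and costs roughly $9 to narrate; actual billing units and narration speed may differ.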