2024-05-13

GPT-4o

Model capabilities & demos

GPT-4o is a new “flagship” multimodal model: text and image input are available via the API now, with end‑to‑end audio and video promised to a small set of partners soon.
Key claims: 2× faster and ~50% cheaper than GPT‑4 Turbo, with 5× higher rate limits; still 128k context.
Live demos highlighted: real‑time voice conversation with interruptions, video-based understanding (e.g., reading equations, commenting on scenes), translation, breathing/voice emotion cues, simple tutoring and coding help.
Some viewers found the demo “best ever” and close to sci‑fi (“Her”, universal translator); others saw it as evolutionary, not revolutionary.

Voice, emotion, and UX reactions

Audio‑to‑audio (no explicit text or TTS layer in between) is widely seen as a big leap: natural intonation, emotions, sarcasm, singing, responsive interruption.
Many dislike the default “over‑enthusiastic podcast host” personality and want concise, neutral or “stoic” modes; some already use custom instructions to reduce verbosity.
Strong uncanny‑valley reactions: laughter, flirting tone, and “AI girlfriend” implications made some users uneasy.

Performance, cost, and API details

New tokenizer (200k vocab) significantly reduces token counts, especially for non‑English languages (e.g., big gains for Gujarati, Japanese).
Developers report GPT‑4o is noticeably faster than GPT‑4 Turbo, sometimes approaching GPT‑3.5‑level latency, but not as fast as some specialized hosts (e.g., Llama 3 served on Groq).
As of the discussion, the API supports text and vision input; audio/video streaming and image output are not yet broadly exposed.
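The price cut and the more compact tokenizer compound, which is why commenters expect the largest savings for non‑English text. A minimal sketch of that arithmetic follows. The 0.5 price ratio comes from the “~50% cheaper” claim above. The token ratios are hypothetical placeholders, not measured values, and the helper name relative_cost is made up for illustration.

```python
def relative_cost(price_ratio: float, token_ratio: float) -> float:
    """Cost of the new model relative to the old one for the same text.

    price_ratio: new per-token price / old per-token price
                 (0.5 for the "~50% cheaper" claim).
    token_ratio: new token count / old token count for the same text
                 (below 1.0 when the new tokenizer is more compact).
    """
    if price_ratio <= 0 or token_ratio <= 0:
        raise ValueError("ratios must be positive")
    return price_ratio * token_ratio


# Hypothetical token ratios, for illustration only (not measured values).
examples = {
    "text with ~10% fewer tokens (assumed)": 0.9,
    "text with ~4x fewer tokens (assumed)": 0.25,
}

for label, token_ratio in examples.items():
    cost = relative_cost(price_ratio=0.5, token_ratio=token_ratio)
    print(f"{label}: {cost:.3f}x the old cost")
```

To get real token ratios, tokenize the same sample text with both tokenizers and divide the counts. The sketch only shows how the two effects multiply.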

Model quality, reasoning & benchmarks

Many say GPT‑4o is “not much smarter” than GPT‑4; some place it between GPT‑3.5 and GPT‑4 Turbo for reasoning, but better at “not being lazy” and at goal‑seeking across tool calls.
Some independent tests: modest improvement over 4‑Turbo on certain programming and reasoning tasks; big jump on one chess‑puzzle benchmark; but no clear GPT‑3→4‑style leap.
Multiple reports of increased hallucinations vs gpt‑4‑0125‑preview; some users are sticking with older 4‑Turbo for critical work.
Debate over scaling limits: some think reasoning has plateaued due to data constraints; others argue scaling and multimodal training still have runway.

Free vs paid, business model

GPT‑4o text+vision is being rolled out to free users with lower message limits; Plus gets ~5× higher limits and likely earlier access to future “frontier” models.
Many paid users question what they now get for $20–25/month beyond limits and early access; some consider canceling until GPT‑5 or a clearly superior model ships.
Others speculate this move signals either confidence in a much better upcoming model or competitive pressure from open models (e.g., Llama 3) and other providers.

Privacy, safety, and misuse

Real‑time screen‑sharing and continuous camera use are seen as both powerful and a “privacy nightmare.”
Deepfake and voice‑cloning concerns raised; current plan is preset voices only, no arbitrary custom cloning.
Obvious misuse vectors: romance scams, call‑center fraud, mass propaganda; many worry about older or vulnerable users.
Some expect regulators and platform policies to heavily constrain custom voices and agentic behaviors.

Accessibility and positive use cases

Strong excitement around applications for blind/low‑vision and DeafBlind users (e.g., Be My Eyes), navigation help, reading environments, playing instruments with guidance.
Real‑time translation + natural voice seen as potentially transformative for language learning and cross‑lingual collaboration, though current pronunciation/tones can be poor in some languages.

Broader implications & skepticism

Split sentiment: some see this as clear progress toward conversational AGI; others say it’s sophisticated “stochastic parroting” with no true world model.
Concerns about economic impact (job displacement, surveillance, enshittification via ad deals) and about training future models on AI‑generated, private conversational data.
Meta‑discussion on hype: many note advances are stunning, yet core reasoning hasn’t leapt; some predict an “AI crash” if expectations aren’t reset.
