4o Image Generation
Speed, Architecture, and Integration
- The livestream showed image generation taking ~20–30s; some found it “dialup‑era slow,” while others said it feels similar to DALL‑E and is acceptable given the quality.
- Debate over architecture: some think 4o generates image tokens autoregressively (like the original DALL‑E), which would enable top‑down streaming; others argue the UI animation is misleading and that 4o is calling a separate diffusion‑based image tool (both views are contrasted in the toy sketch after this list).
- Evidence for the tool‑call view: a visible post‑upscaling step, no image tokens visible in 4o’s context when queried, and API traces indicating a separate image‑generation tool.
- Others counter that 4o is explicitly described as a single multimodal model and may still use internal adapters/decoders; the exact design remains unclear.
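To make the two hypotheses concrete, here is a minimal illustrative sketch in Python. It is not based on any confirmed OpenAI internals; all names and numbers (the toy codebook size, grid size, `autoregressive_hypothesis`, `tool_call_hypothesis`) are hypothetical, and the point is only to contrast where the pixels would come from in each view.

```python
import random

CODEBOOK_SIZE = 8192   # hypothetical patch-token vocabulary
GRID = 32              # hypothetical 32x32 token grid -> 1024 image tokens

def autoregressive_hypothesis(prompt: str) -> list[int]:
    """Hypothesis A: 4o itself emits image tokens in raster order.
    Streaming rows of tokens as they are decoded would naturally
    produce the top-down preview seen in the UI."""
    tokens = []
    for _ in range(GRID * GRID):
        # Stand-in for conditional sampling p(token | prompt, tokens so far).
        tokens.append(random.randrange(CODEBOOK_SIZE))
    return tokens  # a separate decoder would turn these into pixels

def tool_call_hypothesis(prompt: str) -> dict:
    """Hypothesis B: 4o only emits a structured tool call; a separate
    diffusion model renders the image outside the chat model's context,
    which would explain why no image tokens are visible when queried."""
    return {"tool": "image_generation", "arguments": {"prompt": prompt}}

if __name__ == "__main__":
    print(len(autoregressive_hypothesis("a brim-full wine glass")))
    print(tool_call_hypothesis("a brim-full wine glass"))
```

Under hypothesis A the chat model's context would contain the image tokens themselves; under hypothesis B it would contain only the tool call, which is the distinction commenters point to when citing API traces.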
Quality, Capabilities, and Limitations
- Many commenters say this is the first time AI images “pass the uncanny valley,” especially for humans, whiteboards, UI mocks, and infographics; character consistency and text rendering are seen as major jumps.
- Strong prompt adherence and iterative editing via text (“keep everything, just change X”) impress users; demos also showed clean transparent backgrounds and popular “ghiblifying” of photos.
- Known weak spots persist: hands/fingers, some anatomy, reflections and physics correctness, clocks often stuck at 10:10, correct polygon/point counts (pentagons, stars), and transparent backgrounds in practice.
- Several “litmus tests” showed mixed but improved results: some users now get a truly brim‑full wine glass; others still don’t, likely also affected by rollout and model routing.
- Editing user photos (especially faces, outfits) is currently unreliable; OpenAI acknowledges a bug on face‑edit consistency.
Comparisons to Other Models
- Versus Midjourney/Flux/Imagen/Gemini:
  - Some say 4o is behind dedicated art models in raw aesthetics; others find its prompt following, layout, text rendering, and structural edits clearly ahead.
  - Gemini 2.0/2.5 has similar native multimodal image abilities but is described as harder to access and often weaker on text coherence and resolution.
- On video, several say OpenAI is behind Chinese video models (Kling, Hailuo, Wan, Hunyuan); 4o is seen as an image play, not a video leap.
Rollout Confusion and UX
- Many users initially saw DALL‑E‑style outputs and thought 4o was overhyped; only later did they realize the rollout is staggered and sometimes per‑server.
- Heuristics to detect the new model: top‑down progressive rendering, absence of “Created with DALL‑E” badges, different filename prefixes, or using sora.com where it’s already live.
- Frustration that OpenAI markets features as “available today” while access trickles out, with no clear UI indicator of which model actually answered.
Impact on Startups, Artists, and Society
- Some claim “tens of thousands” of image‑gen startups are now effectively dead and that digital artists are further squeezed; others argue this is incremental since DALL‑E already existed and specialized tools still matter (ControlNet/ComfyUI pipelines, LoRAs, motion control).
- Concerns about deepfakes and politics: “seeing is believing” is now clearly broken; some are openly frightened by how real people and scenes look.
- Others say society was already saturated with misleading visuals (Photoshop, social media) and this just accelerates an inevitable shift in trust models.
- Safety/moderation is a pain point: users report overly aggressive blocks on harmless edits (e.g., stylizing personal photos, maps of sensitive regions), while IP‑like styles and some copyrighted characters still slip through.
Technical Debates and Open Questions
- Long back‑and‑forth on autoregressive vs diffusion, how multimodal chains of thought over images might work, and whether this counts as “reasoning in pixel space.”
- Some envision “truly generative UIs” where each app frame is rendered by a model; others see this as impractical and terrifying from reliability and compute standpoints.
- Open questions: API pricing, guaranteed resolution/aspect control, whether DALL‑E remains accessible, and when/if an open‑weight competitor (possibly from China) will appear.