Open models by OpenAI

Release context & strategic motives

  • Many were surprised to see OpenAI ship strong open‑weight models (20B and 120B) with Apache 2.0, viewing it as a sharp pivot toward Meta’s “scorched earth” open‑model strategy.
  • A common hypothesis is that this precedes a significantly stronger GPT‑5: these models set a high “free floor” while preserving demand for a much better closed frontier tier.
  • Others argue it’s simply competitive pressure from Qwen/DeepSeek/GLM and a way to stay relevant in the open‑weights ecosystem, seed tooling, and generate future licensing/support revenue.

Performance, benchmarks & comparisons

  • Marketing claims of “near o3 / o4‑mini” performance drew skepticism. Early independent benchmarks and user tests show:
    • 120B: very strong reasoning for its active size and competitive on some reasoning benchmarks (e.g. GPQA Diamond, Humanity’s Last Exam); decent at coding but generally behind the top Qwen3/GLM‑4.5/Kimi models on agentic and coding tasks.
    • 20B: impressive for its size, often beating other mid‑size open models on certain tasks (spam filtering, some coding), but clearly not frontier level.
  • Many note the models hallucinate on factual queries (geography, history, niche trivia) and fail simple sanity tests (dates, “strawberry” letter count, some riddles), reinforcing the view that benchmarks are heavily gamed and not predictive of real‑world behavior.

Architecture, quantization & training

  • Both models are sparse MoE transformers with very low active‑parameter counts (~3.6B active for the 20B, ~5.1B for the 120B), standard GQA, RoPE+YaRN, and alternating sparse/dense attention.
  • The standout technical piece is native MXFP4 (≈4.25‑bit) quantization on >90% of the weights, which lets the 120B fit on a single 80GB GPU and run efficiently on Macs/consumer GPUs with minimal perceived quality loss (see the block‑quantization sketch after this list).
  • Commenters infer the real “secret sauce” is in training and distillation (likely heavy use of o‑series reasoning traces and synthetic data), not novel architecture.
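
A rough illustration of the MXFP4 idea (my own sketch, not OpenAI’s kernels): each 32‑element block of weights shares one power‑of‑two scale, and every element is stored as a 4‑bit sign + E2M1 value; the scale heuristic and function name below are illustrative.

```python
import numpy as np

# Magnitudes representable by the FP4 (E2M1) element format used in MXFP4.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_quantize_block(block: np.ndarray):
    """Illustrative MXFP4-style quantization of one 32-element block:
    a shared power-of-two scale plus a sign + E2M1 magnitude per element."""
    max_abs = float(np.max(np.abs(block)))
    # Pick a power-of-two scale so the largest magnitude fits inside the grid (<= 6.0).
    scale = 2.0 ** np.ceil(np.log2(max_abs / FP4_GRID[-1])) if max_abs > 0 else 1.0
    scaled = block / scale
    # Snap each scaled magnitude to the nearest representable FP4 value, keeping the sign.
    idx = np.argmin(np.abs(np.abs(scaled)[:, None] - FP4_GRID[None, :]), axis=1)
    dequantized = np.sign(scaled) * FP4_GRID[idx] * scale
    return idx, scale, dequantized

block = np.random.randn(32).astype(np.float32)
_, scale, approx = mxfp4_quantize_block(block)
print("shared scale:", scale, "max abs error:", float(np.max(np.abs(block - approx))))
```

Storing 4 bits per element plus one 8‑bit scale per 32 elements is where the ≈4.25 bits/weight figure comes from.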

Open weights vs “open source” debate

  • Lengthy argument over terminology:
    • One side: publishing weights without training data/recipes is like shipping only binaries; this should be called “open weights”, not “open source”.
    • Others counter that Apache 2.0 on weights plus full modifiability/fine‑tuning makes the release meaningfully open in practice, even if not fully reproducible.
  • Several propose a clear distinction: SaaS (API only) vs open‑weights vs truly open‑source (weights + training code + data/recipes).

Safety, censorship & misuse

  • Many experience the models as heavily “lobotomized”: frequent refusals, overcautious content filters, and degraded translation/creative writing performance, especially compared to relatively uncensored Chinese models.
  • Some speculate pre‑training data were aggressively filtered (e.g. CBRN content), making jailbreaks harder because the knowledge simply isn’t present.
  • Others show that with enough prompt steering the models can still output problematic technical details (e.g., lab protocols), though less readily than typical open models.
  • This fuels a split: some appreciate strong guardrails; others see the models as “safe but useless” for broad creative or research use.

Local deployment, tooling & performance in practice

  • Users report the 20B model running acceptably on:
    • 16–24GB VRAM GPUs (or Mac unified memory) with quantization; 30–70 tokens/s is common on mid‑range GPUs and high‑RAM M‑series Macs.
    • 8–16GB machines with offloading/low‑bit quants at slower but usable speeds.
  • The 120B model is viable on 80GB+ VRAM or 96–128GB unified memory; community MLX, llama.cpp, Ollama, LM Studio, and GGUF ports appeared within hours (a minimal local‑run sketch follows this list).
  • Harmony, the new response format, is praised as a cleaner multi‑channel structure (separate channels for reasoning, commentary, and the final answer) but currently breaks many agent/tool‑calling frameworks until they adapt (see the channel‑parsing sketch below).
  • Several people attempt to plug gpt‑oss into existing coding agents (Claude Code, Aider, Cline, Roo, etc.) with mixed success—quality of reasoning is promising but prefill latency and tool‑use reliability are still rough.
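
As a concrete example of the local‑deployment path, here is a minimal llama‑cpp‑python sketch, assuming a community GGUF conversion of the 20B has already been downloaded; the file name and parameter values are placeholders to adjust for your hardware.

```python
# Minimal sketch: run a GGUF quant of the 20B locally via llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="gpt-oss-20b.Q4_K_M.gguf",  # placeholder: any community GGUF conversion
    n_gpu_layers=-1,   # offload everything if it fits; lower this on 8-16GB machines
    n_ctx=8192,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain MoE routing in two sentences."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```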
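
For the Harmony point above, the sketch below shows one way to split a raw completion into its channels (e.g. "analysis" reasoning vs the "final" answer), assuming the <|start|>/<|channel|>/<|message|>/<|end|> token layout described in the release notes; real integrations should use the official openai‑harmony library rather than ad‑hoc regexes.

```python
import re

# Hedged sketch: split a raw Harmony-style completion into per-channel messages,
# assuming the token layout from the gpt-oss release notes.
HARMONY_MSG = re.compile(
    r"<\|start\|>assistant<\|channel\|>(\w+)<\|message\|>(.*?)<\|end\|>",
    re.DOTALL,
)

def split_channels(raw: str) -> dict:
    """Return the text of each channel (e.g. 'analysis' reasoning vs 'final' answer)."""
    return {channel: text.strip() for channel, text in HARMONY_MSG.findall(raw)}

raw_output = (
    "<|start|>assistant<|channel|>analysis<|message|>Thinking it through...<|end|>"
    "<|start|>assistant<|channel|>final<|message|>Here is the answer.<|end|>"
)
print(split_channels(raw_output)["final"])
```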

Ecosystem impact & outlook

  • Many see this as raising the floor for open models: a reasoning‑tuned, highly efficient 20B that runs on consumer hardware changes local‑first and hybrid architectures (cheap local “worker” + expensive cloud “expert”; see the routing sketch after this list).
  • Others note that Qwen3, GLM‑4.5, DeepSeek and Kimi still hold clear advantages in some domains (coding, multilingual knowledge, less censorship), so this does not obsolete existing open models.
  • Strategically, commenters expect a pattern of N‑1 open‑weight releases from US labs: last‑generation but still very strong models released as open weights to squeeze competitors’ margins and accelerate ecosystem innovation.
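
As a sketch of the hybrid pattern mentioned above (the names and the escalation rule are invented for illustration, not taken from the thread): send every request to the cheap local model first and escalate to the expensive cloud model only when a heuristic says the draft is not good enough.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Router:
    local_llm: Callable[[str], str]           # e.g. a 20B served on local hardware
    cloud_llm: Callable[[str], str]           # e.g. a frontier model behind an API
    needs_expert: Callable[[str, str], bool]  # heuristic or small classifier

    def answer(self, prompt: str) -> str:
        draft = self.local_llm(prompt)
        if self.needs_expert(prompt, draft):
            # Escalate with the local draft as context so the cloud call stays short.
            return self.cloud_llm(f"Improve this draft answer.\nQ: {prompt}\nDraft: {draft}")
        return draft

# Example wiring with trivial stand-ins for the two models.
router = Router(
    local_llm=lambda p: f"[local draft for: {p}]",
    cloud_llm=lambda p: f"[expert answer for: {p}]",
    needs_expert=lambda p, d: len(p) > 200,  # placeholder escalation rule
)
print(router.answer("Summarize this short note."))
```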