2026-06-22

GLM 5.2 vs. Opus

Test setup and validity

Many commenters see the comparison as a “vibe test” rather than a serious benchmark: a single zero‑shot task run only once, different harnesses (Claude Code vs Pi), an unknown GLM provider, and an unspecified “thinking level.”
Critics argue this makes conclusions about “hype being real” misleading and largely irrelevant to real-world, multi‑turn agent workflows.
Several suggest redoing the test with:
- Same harness and tools for both models.
- Multiple runs and progressively more detailed specs.
- Brownfield tasks on existing codebases, not greenfield “build a game” prompts.
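
The "same harness, multiple runs" suggestion can be sketched as a small repeated-trial loop. This is only a sketch under stated assumptions: `run_task` is a hypothetical callable standing in for whatever shared harness is used (it is not part of any real tool), it returns True when a run passes the task's checks, and the model labels and trial count are placeholders. A Wilson score interval is included to show how wide the uncertainty remains with few runs.

```python
import math
from typing import Callable, Dict, Iterable, Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a pass rate (roughly 95% when z = 1.96)."""
    if trials == 0:
        return (0.0, 1.0)
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


def compare(
    models: Iterable[str],
    run_task: Callable[[str, int], bool],  # hypothetical: (model, trial index) -> passed?
    trials: int = 10,
) -> Dict[str, Dict[str, float]]:
    """Run the same task `trials` times per model through one shared harness."""
    results: Dict[str, Dict[str, float]] = {}
    for model in models:
        passes = sum(1 for i in range(trials) if run_task(model, i))
        low, high = wilson_interval(passes, trials)
        results[model] = {"passes": passes, "trials": trials, "low": low, "high": high}
    return results


if __name__ == "__main__":
    # Placeholder stub, not real results: every other trial "passes".
    def stub(model: str, trial: int) -> bool:
        return trial % 2 == 0

    for name, stats in compare(["model_a", "model_b"], stub, trials=10).items():
        print(name, stats)
```

With a single run per model, as in the original test, the interval spans most of the range from 0 to 1, which is the critics' point about one-shot comparisons.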

Capabilities & behavior

Many report GLM‑5.2 as the first open‑weights model that feels close to recent frontier models for real coding work, especially in algorithms and code quality.
Others find it clearly behind Opus (and GPT‑5.5) on:
- UI/UX, aesthetics, “taste,” and adherence to existing project conventions.
- Steering, instruction‑following, and not “going its own way.”
GLM’s visible reasoning trace is widely praised because it lets users debug the model’s thought process and intervene early.

Speed, efficiency, and tokens

Multiple users experience GLM‑5.2 as slow, especially in time‑to‑first‑edit; it “overthinks” for minutes before coding.
Some say that despite its lower per‑token price, it burns many more tokens and more time per task, making the real cost similar to or worse than that of frontier models.
Others report acceptable token usage and see it as highly cost‑effective, especially on favorable hosted plans.
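
The disagreement above comes down to simple arithmetic: cost per task is tokens used times price per token, so a lower list price can be offset by higher token consumption. A minimal sketch follows; all prices and token counts are made-up placeholders chosen to illustrate the break-even case, not measurements or real list prices for any model.

```python
def cost_per_task(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Dollar cost of one task, given token counts and prices per million tokens."""
    total = input_tokens * input_price_per_mtok + output_tokens * output_price_per_mtok
    return total / 1_000_000


# Hypothetical numbers only: "cheap" is 5x cheaper per token but uses 5x the tokens.
cheap = cost_per_task(400_000, 150_000, input_price_per_mtok=1.0, output_price_per_mtok=4.0)
pricey = cost_per_task(80_000, 30_000, input_price_per_mtok=5.0, output_price_per_mtok=20.0)

print(f"cheap model:  ${cheap:.2f} per task")
print(f"pricey model: ${pricey:.2f} per task")
```

Under these assumed figures both tasks cost the same, so which model is cheaper in practice depends on measured token usage per task rather than on the list price alone.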

Pricing & subscription economics

API list prices favor GLM‑5.2 (≈Haiku cost for near‑Opus quality), but:
- Frontier vendors’ coding/Max plans can be cheaper in practice for intensive users.
- Some GLM subscriptions are hard to obtain or feel stingy; infra can be overloaded.
Commenters debate whether subscriptions are “subsidized,” though consumers mostly care about out‑of‑pocket cost.

Multimodality and vision

Opus has vision; GLM‑5.2 is text‑only. In this game‑building task, that’s a major advantage for Opus.
GLM resorted to crude pixel‑inspection scripts; critics note you could pair it with a separate vision sub‑agent, but that wasn’t done.

Open weights, local hosting, and trust

GLM‑5.2’s open weights are seen as strategically important: optional local/self‑hosted use, more control, data privacy, and a price ceiling on closed APIs.
Others stress that genuinely running it locally at competitive speeds requires very expensive hardware; for most, “open” mainly means more cloud providers and jurisdictional choice.

Workflows: one‑shot vs agentic

A strong thread argues that the real value lies in:
- Long‑running, tool‑using agents.
- Collaborative, stepwise coding with human steering.
- Adhering to guardrails, project conventions, and specs over time.
One‑shot “build X” tests are seen as:
- Still useful as a proxy for autonomy, intent inference, and problem‑solving.
- But easily gamed and misaligned with how serious engineers actually work.

Safety, guardrails, and helpfulness

GLM is noted for rarely refusing tasks and being less constrained than Anthropic models, which can over‑trigger safety filters (e.g., on country lists).
Some are increasingly frustrated with heavy guardrails and “LLM‑ish” writing style in proprietary models; others value Claude‑like “helpfulness” and hope GLM will match it.
