GLM-4.7-Flash

Hosting & Availability

  • Initially available from only a few cloud options: z.ai directly, Novita via OpenRouter, Hugging Face Inference, Cerebras, and DeepInfra, with more expected to follow.
  • One provider (Novita) is criticized for serving undisclosed quantized variants that noticeably degrade quality; there is concern that OpenRouter’s “cheapest by default” UX steers new users to those variants without their realizing it (see the provider-pinning sketch after this list).
  • Cerebras’ GLM-4.7 endpoint is praised for raw throughput but panned for per-minute rate limits and for counting cached tokens in full toward both rate limits and billing, which makes it effectively slow and expensive in practice.
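For readers worried about silently routed quantized variants, here is a minimal sketch of pinning a specific provider through OpenRouter’s OpenAI-compatible API; the model slug and provider name are assumptions for illustration, not confirmed listings.

```python
# Minimal sketch: pin an OpenRouter request to one named provider instead of
# letting the router pick the cheapest host. The model slug and provider name
# below are assumptions for illustration; check the OpenRouter model page for
# the real identifiers.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-or-...",  # your OpenRouter key
)

resp = client.chat.completions.create(
    model="z-ai/glm-4.7-flash",  # assumed slug
    messages=[{"role": "user", "content": "Write a binary search in Python."}],
    extra_body={
        # Provider-routing preferences: prefer a specific host and disable
        # fallback so a cheaper (possibly quantized) variant is never substituted.
        "provider": {
            "order": ["z-ai"],  # assumed provider name
            "allow_fallbacks": False,
        }
    },
)
print(resp.choices[0].message.content)
```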

Model Size, Architecture & Use Cases

  • Flash is a MoE model with ~30–32B total parameters and ~3–4B active per token (A3B/A3.9B), positioned as a “free-tier” / Haiku-equivalent counterpart to the full GLM-4.7.
  • Seen as attractive for home or small-server setups and fine-tuning experiments, though still larger than typical “tiny” local models (a rough sizing sketch follows this list).
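A back-of-the-envelope sizing sketch, taking the thread’s roughly 31B-total / 3.5B-active figures as assumptions: total parameters dominate memory, while only the active experts dominate per-token compute, which is why a ~30B A3B MoE can feel like a small dense model at generation time while still needing mid-sized hardware to hold the weights.

```python
# Rough MoE sizing arithmetic; ~31B total and ~3.5B active params are assumptions
# based on the figures quoted in the thread.
TOTAL_PARAMS = 31e9    # every expert must be resident in memory
ACTIVE_PARAMS = 3.5e9  # only the routed experts run for each token

def weight_gb(params: float, bits_per_param: float) -> float:
    """Approximate weight memory in GB (ignores KV cache and activations)."""
    return params * bits_per_param / 8 / 1e9

print(f"BF16 weights:   ~{weight_gb(TOTAL_PARAMS, 16):.0f} GB")   # ~62 GB
print(f"Q4_K_M weights: ~{weight_gb(TOTAL_PARAMS, 4.8):.0f} GB")  # ~19 GB + KV cache

# Per-token compute scales with *active* parameters (~2 FLOPs per param per token),
# so decoding cost is closer to a ~3.5B dense model than a 31B one.
print(f"Active/total ratio: {ACTIVE_PARAMS / TOTAL_PARAMS:.0%}")  # ~11%
```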

Performance & Comparisons

  • Benchmarks: a SWE-Bench Verified score of ~59 is viewed as strong for this size, though some point out that Devstral 2 Small (24B dense) scores higher.
  • Several users say GLM‑4.7 (full) is roughly Sonnet‑3.5‑level for code, clearly behind Sonnet 4.x and Opus, despite competitive benchmark numbers.
  • Some find the GLM models better than Qwen and acceptable replacements for mid-tier Claude models; others report poor general knowledge, invalid code, and looping behavior in early tests, especially with quantized variants.
  • Comparisons to GPT‑OSS‑20B/120B are mixed: on paper Flash looks good, but some users find GPT‑OSS‑20B more reliable in practice.

Benchmarks & Evaluation Skepticism

  • SWE-Bench Verified is criticized for its limited repository/language coverage and suspected memorization; alternatives such as SWE-REBench and Terminal Bench 2.0 are preferred.
  • Multiple comments emphasize that public benchmarks often fail to predict real‑world coding‑agent performance.

Local Inference, Tooling & Quantization

  • Users report running Flash via vLLM (including ROCm on MI300X), llama.cpp (GGUF), Ollama, and LM Studio, including MLX builds on Apple silicon (a minimal loading sketch follows this list).
  • Architecture is similar to DeepSeek V3; llama.cpp support landed quickly after a PR.
  • 4-bit and 8-bit GGUF quants (Unsloth, ngxson, byteshape, etc.) are popular; the ~30B Flash fits in ~20–22GB of VRAM at Q4_K_M with large contexts, versus ~60GB for BF16.
  • Some users see severe repetition, spelling errors, and broken tool calling with certain quants and frontends; others report it working “fine” in tools like OpenCode and Ollama once chat templates and runtime versions are updated.
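For the llama.cpp/GGUF route mentioned above, a minimal sketch using the llama-cpp-python bindings; the local file name, context size, and prompt are placeholders, and the exact quant (Unsloth, ngxson, byteshape, ...) is whatever you actually downloaded.

```python
# Minimal llama-cpp-python sketch for running a local GGUF quant. The model
# path and context size are placeholders; adjust them to the quant you pulled.
from llama_cpp import Llama

llm = Llama(
    model_path="glm-4.7-flash-Q4_K_M.gguf",  # hypothetical local file name
    n_ctx=32768,        # large contexts reportedly fit in ~20-22GB VRAM at Q4_K_M
    n_gpu_layers=-1,    # offload all layers to the GPU if they fit
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain what a mixture-of-experts model is."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```

If output degenerates into repetition or tool calls break, the thread’s advice amounts to updating llama.cpp (or the frontend) past the support PR and making sure the correct chat template is applied.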

Pricing & Plans

  • z.ai’s coding subscription is repeatedly praised as extremely cheap with generous limits, with Flash serving as a default mid-tier coding model.
  • Users note GLM‑4.7 models can feel slow due to more internal “thinking,” but reliability is generally considered good for the price.