Kimi K2 Thinking, a SOTA open-source trillion-parameter reasoning model
Deployment, Hosting, and Pricing
- Users welcome that the weights are open and already available on Hugging Face, OpenRouter, and MLX, but many want first‑party support on major clouds (Bedrock, Vertex, Azure) for data‑residency and reliability reasons.
- Some report OpenRouter being “laggy” or degraded (possibly due to quantization or provider choice); direct Moonshot API is perceived as higher quality.
- OpenRouter pricing is substantially lower than that of comparable frontier models; some argue US labs run with large margins, especially on cached tokens.
- There’s interest in subscription plans for agentic use, since “thinking” mode burns through many tokens.
Performance, Benchmarks, and UX
- ArtificialAnalysis benchmarks show very strong scores (e.g. HLE 44.9), with several people saying Kimi K2 Thinking feels better in practice than benchmarks suggest.
- Comparisons to Qwen 3 Max Thinking are mixed: Qwen benchmarks well but is widely reported as disappointing in real use; Kimi is expected to outperform relative to size.
- Subjective reports:
- Very good at coding, math, and multi-step reasoning; impresses on tricky “stacking” puzzles where other models flail.
- Strong writing and “non‑sycophantic” brainstorming, though some dislike older K2’s punchy, over‑structured style.
- The “pelican on a bicycle in SVG” test reappears as an informal visual‑reasoning benchmark; Kimi produces detailed, coherent SVGs. Some debate whether this correlates with “intelligence.”
Censorship, Alignment, and Country Topics
- Multiple users probe Tiananmen and Taiwan. Behavior varies:
- Non‑thinking mode may block Tiananmen queries; thinking mode sometimes gives relatively frank historical answers, though direct “massacre” phrasing can still be refused.
- Taiwan answers are reported as closer to English Wikipedia than earlier Chinese models.
- Some see these inconsistencies as “bugs” in safety layers; others expect censorship and argue it must be highlighted repeatedly, just like US‑centric election and geopolitical guardrails.
“Reasoning Model” Concept and Agentic Use
- Consensus: a “reasoning model” is one fine‑tuned/RL‑trained to produce hidden chain‑of‑thought in special tokens (scratchpad) rather than just prompted to “think step by step.”
- Users experiment with multi‑iteration agent frameworks (e.g. running multiple K2 sessions with an arbiter), noting that long chains can devolve into loops (“quantum attractor”) after a few iterations.
- Thoughtworks’ recent “AI antipatterns” (e.g., naive text‑to‑SQL, naive API→MCP conversion) are cited as real‑world areas where better reasoning and tool use matter.
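The scratchpad idea above can be sketched concretely: a reasoning model emits hidden chain‑of‑thought between special delimiters, and a client or agent framework strips it before showing the answer (or feeding it to the next agent in a loop). The `<think>…</think>` tags below are a common convention but an assumption here; K2 Thinking's actual special tokens and API fields may differ.

```python
import re

# Hypothetical scratchpad delimiters; the real special tokens vary by model
# and may be surfaced as a separate API field rather than inline tags.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(raw: str) -> tuple[str, str]:
    """Separate hidden chain-of-thought from the user-visible answer."""
    thoughts = "\n".join(THINK_RE.findall(raw))   # collect all scratchpad spans
    answer = THINK_RE.sub("", raw).strip()        # remove them from the reply
    return thoughts, answer

raw = "<think>2 + 2: add the units digits.</think>The answer is 4."
thoughts, answer = split_reasoning(raw)
# answer == "The answer is 4."
```

In a multi‑session setup with an arbiter, only `answer` would typically be passed between sessions, which keeps each iteration's context small but also discards the reasoning that might explain why chains drift into loops.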
Running Locally and Hardware
- K2 Thinking is a mixture‑of‑experts (MoE) model trained with INT4 in mind, so active parameters per token are far fewer than the 1T total; even so, full unquantized hosting is heavy: e.g., dual CPU + 8×L20 was reported at ~46 tok/s.
- Enthusiasts describe home builds: high‑end Xeon/EPYC with 512–768 GB DDR5 plus one or more GPUs can reach 6–10 tok/s via llama.cpp‑style engines; Mac Studio with 512 GB and MLX can also host large quants.
- Even with clever MoE and quantization, this remains beyond typical “consumer” hardware; many argue cloud remains cheaper than electricity and hardware for most users.
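The hardware claims above come down to simple weight‑storage arithmetic, sketched below. The figures (1T total parameters, ~32B active per token, INT4) are illustrative ballpark numbers, not taken from the thread, and the estimate ignores KV cache, activations, and runtime overhead, which add substantially in practice.

```python
def moe_footprint_gb(total_params_b: float, active_params_b: float,
                     bits_per_weight: int) -> tuple[float, float]:
    """Rough weight-storage estimate for an MoE model.

    Returns (total_weights_gb, active_weights_per_token_gb).
    Weights only: KV cache and activation memory are not included.
    """
    bytes_per_weight = bits_per_weight / 8
    total_gb = total_params_b * 1e9 * bytes_per_weight / 1e9
    active_gb = active_params_b * 1e9 * bytes_per_weight / 1e9
    return total_gb, active_gb

# Hypothetical ballpark: a 1T-param MoE with ~32B active params at INT4.
total_gb, active_gb = moe_footprint_gb(1000, 32, 4)
# total_gb == 500.0  -> all weights must fit somewhere (RAM + VRAM),
# active_gb == 16.0  -> but only ~16 GB of weights are touched per token
```

This is why a 512–768 GB DDR5 box with a single GPU can run such a model at all: the full 500 GB sits in system RAM while only the active experts' few GB are streamed per token, trading memory capacity for the 6–10 tok/s figures reported.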
Open Source vs Open Weights and Licensing
- Heated debate over calling these models “open source” when only weights (not training data/recipes) are released. Some insist “open weight” is more honest; others say the term has shifted in “model land.”
- K2’s license is essentially MIT with an attribution clause for very large commercial deployments, seen by some as a reasonable compromise.
- Critics point out weights alone are not reproducible and hide copyrighted or sensitive training data; supporters argue that full end‑to‑end reproducibility is economically and ecologically unrealistic at trillion‑parameter scale.
Model Size, Efficiency, and Small‑Model Hopes
- Long sub‑thread debates whether massive frontier models are the only path to strong coding/reasoning, or whether smaller specialized models + agents can catch up.
- Some emphasize empirical reality: small models remain far behind, despite much research; others see promise in better data, distillation, MoE, shared weights, sparse attention, and tool‑using agents.
- Practical takeaway in the discussion: for simple tasks, smaller local models often suffice and feel snappier; for multi‑hop reasoning and complex agentic coding, bigger frontier or “reasoning” models like K2 still dominate.
China vs US/Europe and Energy/Geopolitics
- Several note that multiple Chinese labs (Kimi, DeepSeek, Qwen, GLM, MiniMax) are pushing high‑end open‑weight models, while US labs mostly keep weights closed and European efforts (except a few like Mistral) lag.
- Explanations offered: US labs must monetize huge GPU investments; Chinese labs may lack unconstrained access to top GPUs and thus lean into open weights; Chinese domestic energy and data‑center build‑out is massive.
- Some speculate geopolitically: open Chinese models weaken US incumbents’ pricing power and enable Western startups to build on them instead of paying frontier‑lab API margins.