Kimi K2 Thinking, a SOTA open-source trillion-parameter reasoning model
Deployment, Hosting, and Pricing
- Users welcome that the weights are open and already available on Hugging Face, OpenRouter, and MLX, but many want first‑party support on major clouds (Bedrock, Vertex, Azure) for data‑residency and reliability reasons.
- Some report OpenRouter being “laggy” or degraded (possibly due to quantization or provider choice); direct Moonshot API is perceived as higher quality.
- OpenRouter pricing is substantially lower than that of comparable frontier models; some argue US labs run with large margins, especially on cached tokens.
- There’s interest in subscription plans for agentic use, since “thinking” mode burns through many tokens.
Performance, Benchmarks, and UX
- ArtificialAnalysis benchmarks show very strong scores (e.g. HLE 44.9), with several people saying Kimi K2 Thinking feels better in practice than benchmarks suggest.
- Comparisons to Qwen 3 Max Thinking are mixed: Qwen benchmarks well but is widely reported as disappointing in real use; Kimi is expected to outperform relative to size.
- Subjective reports:
- Very good at coding, math, and multi-step reasoning; impresses on tricky “stacking” puzzles where other models flail.
- Strong writing and “non‑sycophantic” brainstorming, though some dislike older K2’s punchy, over‑structured style.
- The “pelican on a bicycle in SVG” test reappears as an informal visual‑reasoning benchmark; Kimi produces detailed, coherent SVGs. Some debate whether this correlates with “intelligence.”
Censorship, Alignment, and Country Topics
- Multiple users probe Tiananmen and Taiwan. Behavior varies:
- Non‑thinking mode may block Tiananmen queries; thinking mode sometimes gives relatively frank historical answers, though direct “massacre” phrasing can still be refused.
- Taiwan answers are reported as closer to English Wikipedia than earlier Chinese models.
- Some see these inconsistencies as “bugs” in safety layers; others expect censorship and argue it must be highlighted repeatedly, just like US‑centric election and geopolitical guardrails.
“Reasoning Model” Concept and Agentic Use
- Consensus: a “reasoning model” is one fine‑tuned/RL‑trained to produce hidden chain‑of‑thought in special tokens (scratchpad) rather than just prompted to “think step by step.”
- Users experiment with multi‑iteration agent frameworks (e.g. running multiple K2 sessions with an arbiter), noting that long chains can devolve into loops (“quantum attractor”) after a few iterations.
- Thoughtworks’ recent “AI antipatterns” (e.g., naive text‑to‑SQL, naive API→MCP conversion) are cited as real‑world areas where better reasoning and tool use matter.
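The scratchpad idea above can be sketched concretely: a reasoning model emits hidden chain‑of‑thought between special delimiters, and a client or agent framework strips it before showing the answer (or feeding it to the next agent in a loop). The `<think>…</think>` tags below are a common convention but an assumption here; K2 Thinking's actual special tokens and API fields may differ.

```python
import re

# Hypothetical scratchpad delimiters; the real special tokens vary by model
# and may be surfaced as a separate API field rather than inline tags.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(raw: str) -> tuple[str, str]:
    """Separate hidden chain-of-thought from the user-visible answer."""
    thoughts = "\n".join(THINK_RE.findall(raw))   # collect all scratchpad spans
    answer = THINK_RE.sub("", raw).strip()        # remove them from the reply
    return thoughts, answer

raw = "<think>2 + 2: add the units digits.</think>The answer is 4."
thoughts, answer = split_reasoning(raw)
# answer == "The answer is 4."
```

In a multi‑session setup with an arbiter, only `answer` would typically be passed between sessions, which keeps each iteration's context small but also discards the reasoning that might explain why chains drift into loops.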
Running Locally and Hardware
- K2 Thinking is a mixture‑of‑experts (MoE) model trained with INT4 in mind, so active parameters per token are far fewer than the 1T total; even so, full unquantized hosting is heavy: e.g., dual CPU + 8×L20 was reported at ~46 tok/s.
- Enthusiasts describe home builds: high‑end Xeon/EPYC with 512–768 GB DDR5 plus one or more GPUs can reach 6–10 tok/s via llama.cpp‑style engines; Mac Studio with 512 GB and MLX can also host large quants.
- Even with clever MoE and quantization, this remains beyond typical “consumer” hardware; many argue cloud remains cheaper than electricity and hardware for most users.
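The hardware claims above come down to simple weight‑storage arithmetic, sketched below. The figures (1T total parameters, ~32B active per token, INT4) are illustrative ballpark numbers, not taken from the thread, and the estimate ignores KV cache, activations, and runtime overhead, which add substantially in practice.

```python
def moe_footprint_gb(total_params_b: float, active_params_b: float,
                     bits_per_weight: int) -> tuple[float, float]:
    """Rough weight-storage estimate for an MoE model.

    Returns (total_weights_gb, active_weights_per_token_gb).
    Weights only: KV cache and activation memory are not included.
    """
    bytes_per_weight = bits_per_weight / 8
    total_gb = total_params_b * 1e9 * bytes_per_weight / 1e9
    active_gb = active_params_b * 1e9 * bytes_per_weight / 1e9
    return total_gb, active_gb

# Hypothetical ballpark: a 1T-param MoE with ~32B active params at INT4.
total_gb, active_gb = moe_footprint_gb(1000, 32, 4)
# total_gb == 500.0  -> all weights must fit somewhere (RAM + VRAM),
# active_gb == 16.0  -> but only ~16 GB of weights are touched per token
```

This is why a 512–768 GB DDR5 box with a single GPU can run such a model at all: the full 500 GB sits in system RAM while only the active experts' few GB are streamed per token, trading memory capacity for the 6–10 tok/s figures reported.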
Open Source vs Open Weights and Licensing
- Heated debate over calling these models “open source” when only weights (not training data/recipes) are released. Some insist “open weight” is more honest; others say the term has shifted in “model land.”
- K2’s license is essentially MIT with an attribution clause for very large commercial deployments, seen by some as a reasonable compromise.
- Critics point out weights alone are not reproducible and hide copyrighted or sensitive training data; supporters argue that full end‑to‑end reproducibility is economically and ecologically unrealistic at trillion‑parameter scale.
Model Size, Efficiency, and Small‑Model Hopes
- Long sub‑thread debates whether massive frontier models are the only path to strong coding/reasoning, or whether smaller specialized models + agents can catch up.
- Some emphasize empirical reality: small models remain far behind, despite much research; others see promise in better data, distillation, MoE, shared weights, sparse attention, and tool‑using agents.
- Practical takeaway in the discussion: for simple tasks, smaller local models often suffice and feel snappier; for multi‑hop reasoning and complex agentic coding, bigger frontier or “reasoning” models like K2 still dominate.
China vs US/Europe and Energy/Geopolitics
- Several note that multiple Chinese labs (Kimi, DeepSeek, Qwen, GLM, MiniMax) are pushing high‑end open‑weight models, while US labs mostly keep weights closed and European efforts (except a few like Mistral) lag.
- Explanations offered: US labs must monetize huge GPU investments; Chinese labs may lack unconstrained access to top GPUs and thus lean into open weights; Chinese domestic energy and data‑center build‑out is massive.
- Some speculate geopolitically: open Chinese models weaken US incumbents’ pricing power and enable Western startups to build on them instead of paying frontier‑lab API margins.