Kimi K2 Thinking, a SOTA open-source trillion-parameter reasoning model

Deployment, Hosting, and Pricing

  • Users welcome that the weights are open and already available on Hugging Face, via OpenRouter, and in MLX format, but many want first‑party support on major clouds (Bedrock, Vertex, Azure) for data residency and reliability.
  • Some report OpenRouter as “laggy” or degraded (possibly due to quantization or provider choice); the direct Moonshot API is perceived as higher quality (see the API sketch after this list).
  • Pricing via OpenRouter looks substantially cheaper than comparable frontier models; some argue US labs run with large margins, especially on cached tokens.
  • There’s interest in subscriptions and agentic use, as “thinking” mode burns many tokens.
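
  For those weighing OpenRouter against the first‑party API, a minimal Python sketch of an OpenAI‑compatible call is below. The model slug "moonshotai/kimi-k2-thinking", the environment‑variable name, and the prompt are assumptions for illustration (check the provider's model list); switching the base URL and key is all that changes when pointing at Moonshot's own endpoint.

      import os
      from openai import OpenAI

      # Any OpenAI-compatible endpoint works the same way; only base_url and the key change.
      client = OpenAI(
          base_url="https://openrouter.ai/api/v1",   # OpenRouter; the direct Moonshot API has its own base URL
          api_key=os.environ["OPENROUTER_API_KEY"],  # assumed env var name
      )

      resp = client.chat.completions.create(
          model="moonshotai/kimi-k2-thinking",       # assumed slug, not a verified identifier
          messages=[{"role": "user", "content": "Explain MoE routing in two sentences."}],
          max_tokens=512,
      )
      print(resp.choices[0].message.content)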

Performance, Benchmarks, and UX

  • ArtificialAnalysis benchmarks show very strong scores (e.g. 44.9 on Humanity’s Last Exam (HLE)), with several people saying Kimi K2 Thinking feels better in practice than benchmarks suggest.
  • Comparisons to Qwen 3 Max Thinking are mixed: Qwen benchmarks well but is widely reported as disappointing in real use; Kimi is expected to outperform relative to size.
  • Subjective reports:
    • Very good at coding, math, and multi-step reasoning; impresses on tricky “stacking” puzzles where other models flail.
    • Strong writing and “non‑sycophantic” brainstorming, though some dislike older K2’s punchy, over‑structured style.
  • The “pelican on a bicycle in SVG” test reappears as an informal visual‑reasoning benchmark; Kimi produces detailed, coherent SVGs. Some debate whether this correlates with “intelligence.”

Censorship, Alignment, and Country Topics

  • Multiple users probe Tiananmen and Taiwan. Behavior varies:
    • Non‑thinking mode may block Tiananmen queries; thinking mode sometimes gives relatively frank historical answers, though direct “massacre” phrasing can still be refused.
    • Taiwan answers are reported as closer to English Wikipedia than earlier Chinese models.
  • Some see these inconsistencies as “bugs” in safety layers; others expect censorship and argue it must be highlighted repeatedly, just like US‑centric election and geopolitical guardrails.

“Reasoning Model” Concept and Agentic Use

  • Consensus: a “reasoning model” is one fine‑tuned/RL‑trained to produce hidden chain‑of‑thought in special tokens (scratchpad) rather than just prompted to “think step by step.”
  • Users experiment with multi‑iteration agent frameworks (e.g. running multiple K2 sessions with an arbiter; a sketch follows this list), noting that long chains can devolve into loops (“quantum attractor”) after a few iterations.
  • Thoughtworks’ recent “AI antipatterns” (e.g., naive text‑to‑SQL, naive API→MCP conversion) are cited as real‑world areas where better reasoning and tool use matter.
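
  A minimal sketch of that multi‑session‑plus‑arbiter pattern, assuming the same OpenAI‑compatible endpoint and model slug as in the earlier API sketch (both assumptions); the prompts, worker count, and CONVERGED sentinel are illustrative, and the hard round cap is one simple guard against the looping behavior described above.

      import os
      from openai import OpenAI

      client = OpenAI(base_url="https://openrouter.ai/api/v1",
                      api_key=os.environ["OPENROUTER_API_KEY"])
      MODEL = "moonshotai/kimi-k2-thinking"  # assumed slug

      def ask(prompt: str) -> str:
          """One independent K2 session (stateless single-turn call)."""
          r = client.chat.completions.create(
              model=MODEL,
              messages=[{"role": "user", "content": prompt}],
              max_tokens=2048,
          )
          return r.choices[0].message.content

      def solve_with_arbiter(task: str, workers: int = 3, max_rounds: int = 3) -> str:
          best = ""
          for round_no in range(max_rounds):   # hard cap so long chains cannot loop forever
              drafts = [
                  ask(f"Round {round_no + 1}. Solve the task:\n{task}\n\n"
                      f"Best answer so far (may be empty):\n{best}")
                  for _ in range(workers)
              ]
              # A separate arbiter session merges the drafts and signals convergence.
              best = ask(
                  "You are the arbiter. Merge or pick the best of these drafts, "
                  "and append the word CONVERGED if another round would not help.\n\n"
                  + "\n\n---\n\n".join(drafts)
              )
              if "CONVERGED" in best:
                  break
          return best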

Running Locally and Hardware

  • K2 Thinking is MoE and INT4‑aware, so the active parameters per token are far smaller than the 1T total, but hosting the full unquantized model is still heavy: e.g. a dual‑CPU server with 8×L20 GPUs is reported at ~46 tok/s (see the back‑of‑the‑envelope sketch after this list).
  • Enthusiasts describe home builds: a high‑end Xeon/EPYC box with 512–768 GB of DDR5 plus one or more GPUs can reach 6–10 tok/s via llama.cpp‑style engines; a Mac Studio with 512 GB of unified memory can also host large quants via MLX.
  • Even with clever MoE and quantization, this remains beyond typical “consumer” hardware; many argue cloud remains cheaper than electricity and hardware for most users.
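
  To put those numbers in context, here is a back‑of‑the‑envelope Python sketch under stated assumptions: a 1T‑total / ~32B‑active parameter split (roughly what is reported for K2‑class MoE) and typical memory bandwidths. It estimates the resident weight footprint and a bandwidth‑bound decode ceiling, ignoring KV cache, activations, and compute limits.

      # Rough sizing for a 1T-total / ~32B-active MoE; treat the parameter split
      # and the bandwidth figures below as assumptions, not measured numbers.
      TOTAL_PARAMS  = 1.0e12
      ACTIVE_PARAMS = 32e9

      def gb(n_bytes: float) -> float:
          return n_bytes / 1e9

      for fmt, bytes_per_param in [("INT4", 0.5), ("BF16", 2.0)]:
          weight_bytes    = TOTAL_PARAMS * bytes_per_param    # resident weight footprint
          per_token_bytes = ACTIVE_PARAMS * bytes_per_param   # weights touched per decoded token
          print(f"{fmt}: weights ~{gb(weight_bytes):.0f} GB, per-token reads ~{gb(per_token_bytes):.0f} GB")
          # Decode ceiling if purely memory-bandwidth-bound (ignores KV cache and compute):
          for hw, bandwidth_gbs in [("12-channel DDR5 (~500 GB/s)", 500),
                                    ("Mac Studio unified memory (~800 GB/s)", 800)]:
              print(f"  {hw}: at most ~{bandwidth_gbs / gb(per_token_bytes):.0f} tok/s")

  At INT4 this works out to roughly 500 GB of weights and ~16 GB read per token, which is broadly consistent with the single-digit to low-double-digit tok/s figures reported for CPU-heavy home builds.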

Open Source vs Open Weights and Licensing

  • Heated debate over calling these models “open source” when only weights (not training data/recipes) are released. Some insist “open weight” is more honest; others say the term has shifted in “model land.”
  • K2’s license is essentially MIT with an attribution clause for very large commercial deployments, seen by some as a reasonable compromise.
  • Critics point out weights alone are not reproducible and hide copyrighted or sensitive training data; supporters argue that full end‑to‑end reproducibility is economically and ecologically unrealistic at trillion‑parameter scale.

Model Size, Efficiency, and Small‑Model Hopes

  • A long sub‑thread debates whether massive frontier models are the only path to strong coding/reasoning, or whether smaller specialized models + agents can catch up.
  • Some emphasize empirical reality: small models remain far behind, despite much research; others see promise in better data, distillation, MoE, shared weights, sparse attention, and tool‑using agents.
  • Practical takeaway in the discussion: for simple tasks, smaller local models often suffice and feel snappier; for multi‑hop reasoning and complex agentic coding, bigger frontier or “reasoning” models like K2 still dominate.

China vs US/Europe and Energy/Geopolitics

  • Several note that multiple Chinese labs (Kimi, DeepSeek, Qwen, GLM, MiniMax) are pushing high‑end open‑weight models, while US labs mostly keep weights closed and European efforts (except a few like Mistral) lag.
  • Explanations offered: US labs must monetize huge GPU investments; Chinese labs may lack unconstrained access to top GPUs and thus lean into open weights; Chinese domestic energy and data‑center build‑out is massive.
  • Some speculate geopolitically: open Chinese models weaken US incumbents’ pricing power and enable Western startups to build on them instead of paying frontier‑lab API margins.