Kimi K2 1T model runs on two 512GB M3 Ultras
Model details and quantization
- The demo runs Kimi K2 at 4‑bit quantization across two 512GB M3 Ultras; several commenters argue the quantization should be stated explicitly, while others assume that fitting a 1T‑parameter model in 1TB of RAM implies heavy quantization anyway.
- There’s confusion between Kimi K2 and K2 Thinking (K2T): they are different models with very different capabilities and post‑training. K2T is seen as closer to top-tier models like Sonnet 4.5.
- Questions arise about context length and prefill speed; commenters warn that “it runs” at small context doesn’t imply usable performance at large, coding‑style contexts. The back‑of‑the‑envelope numbers below illustrate why.
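A rough sanity check on those claims. The figures below are illustrative assumptions (a DeepSeek-V3-style active-parameter count for Kimi K2, an optimistic fp16 throughput for the M3 Ultras), not published specs:

```python
# Back-of-the-envelope numbers for "1T model on two 512GB M3 Ultras".
# All figures are illustrative assumptions, not published specs.
total_params = 1.0e12    # ~1T total parameters (MoE)
active_params = 32e9     # assumed active params per token, DeepSeek-V3-like

# 4-bit weights ~= 0.5 bytes/param -> ~500 GB, hence two 512GB machines.
print(f"weights at 4-bit: ~{total_params * 0.5 / 1e9:.0f} GB")

# Prefill compute grows linearly with prompt length (~2 * active_params
# FLOPs per token). Assume ~25 TFLOP/s of usable fp16 per machine.
flops_per_s = 2 * 25e12  # two machines, an optimistic assumption
for ctx in (4_096, 32_768, 131_072):
    secs = 2 * active_params * ctx / flops_per_s
    print(f"prefill of a {ctx:>7}-token prompt: ~{secs:5.0f} s")
```

Under these assumptions the weights barely fit, but prefill of a 131K-token prompt takes minutes, which is the substance of the “usable at coding contexts” objection.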
Behavior, style, and use cases
- Kimi K2 is described as less capable than frontier models on complex reasoning but unusually strong at:
- Short-form writing (emails, communication)
- Editing and blunt critique
- “Emotional intelligence” / social nuances in messages
- Geospatial tasks
- It is perceived as unusually direct: willing to call out user mistakes and to say plainly that “there is no answer in the search results.” Some users value this non‑sycophantic style.
Instruction-following vs pushback
- One camp wants strict, assumption‑free instruction following (especially for coding), with the model asking clarifying questions rather than disagreeing.
- Another camp prefers agents that take initiative, push back on dubious instructions, and warn about dangerous consequences (e.g., potential SQL injection).
- A middle ground emerges: models should sometimes ask clarifying questions and sometimes challenge the request, rather than blindly complying or reflexively pushing back. A hypothetical prompt encoding this policy is sketched below.
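One way to make that middle ground concrete is to encode it in the system prompt. The wording below is invented for illustration; it is not from the thread or from any Kimi documentation:

```python
# Hypothetical system prompt encoding the middle-ground policy:
# clarify when ambiguous, push back when risky, otherwise comply.
MIDDLE_GROUND = """\
Follow the user's instructions exactly, with two exceptions:
1. If the request is ambiguous, ask one concise clarifying question
   before doing anything.
2. If the request is dangerous or likely a mistake (e.g. string-built
   SQL that invites injection), say so in one sentence, propose a safer
   alternative, and then do whatever the user decides.
Never silently substitute your own approach for what was asked."""

messages = [
    {"role": "system", "content": MIDDLE_GROUND},
    {"role": "user", "content": "Build the SQL query by concatenating "
                                "the user's search string into it."},
]
# `messages` can be sent to any OpenAI-compatible chat endpoint.
```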
Training, architecture, and RLHF
- Kimi is said to be based on a DeepSeek-style MoE architecture, trained with the Muon optimizer and “mainly finetuning” (a sketch of the Muon update follows this list).
- Debate over whether most Chinese models are downstream of DeepSeek/GPT; others point to Qwen, Mistral, Llama, ERNIE, etc. as independent efforts.
- Several comments criticize mainstream RLHF for over-optimizing for politeness and flattery; Kimi is praised as a counterexample.
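For context on the Muon mention: Muon’s core step orthogonalizes the momentum matrix with a Newton-Schulz iteration before applying it. This simplified sketch follows the structure of the public reference implementation (it omits Nesterov momentum and shape-based scaling) and is not Moonshot’s production MuonClip variant:

```python
import torch

def newton_schulz(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2D matrix (quintic iteration and
    coefficients as in the public Muon reference implementation)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)
    tall = X.size(0) > X.size(1)
    if tall:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if tall else X

def muon_step(weight, grad, momentum_buf, lr=0.02, beta=0.95):
    """One Muon update for a 2D weight matrix (simplified sketch)."""
    momentum_buf.mul_(beta).add_(grad)    # standard momentum buffer
    update = newton_schulz(momentum_buf)  # orthogonalized direction
    weight.add_(update, alpha=-lr)        # SGD-style application
```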
Benchmarks and prompting
- Kimi K2 reportedly performs unusually well on the “clock test” and EQBench (with the caveat that EQBench is LLMs grading LLMs).
- Discussion around more “linguistically technical” system prompts to force blunt, “bald-on-record” responses, illustrating how strongly prompt wording shapes behavior (see the example below).
- One commenter argues these are really “word models,” not true “language models,” since phrasing and register substantially affect outputs.
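A hypothetical illustration of that prompting idea: the same request under a vague prompt versus one phrased in politeness-theory terms (“bald-on-record” comes from Brown and Levinson’s politeness theory). Both prompt wordings are invented:

```python
# Two system prompts for the same request: a vague "be blunt" vs. an
# explicitly linguistic specification. Both wordings are illustrative.
PLAIN = "You are a helpful assistant. Please be blunt."
BALD_ON_RECORD = (
    "Use bald-on-record speech acts: no hedges, no politeness markers, "
    "no face-saving preambles. State disagreement as a flat declarative. "
    "If the provided context contains no answer, say exactly: "
    "'There is no answer in the search results.'"
)

request = {"role": "user", "content": "Review this paragraph for errors."}
prompt_variants = {
    "plain": [{"role": "system", "content": PLAIN}, request],
    "bald": [{"role": "system", "content": BALD_ON_RECORD}, request],
}
# Send both variants to the same model and compare the register of the
# replies; the claim is that the technical wording shifts it markedly.
```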
Local vs cloud, cost, and privacy
- Running a 1T model locally on dual M3 Ultras (~$19K) is viewed by many as uneconomical versus cloud inference, especially given low personal utilization and very fast providers (Groq, Cerebras, etc.); a rough break-even sketch follows this list.
- Others argue local is about:
- Privacy and sensitive data (including “record everything” workflows and codebases)
- Autonomy from future “enshittification” of cloud AI
- Hobbyist experimentation and research
- There’s disagreement over whether local only makes sense for privacy and hobby use, or also for future-proofing and high-value bespoke workloads.
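The break-even arithmetic, for scale: the hardware price is from the thread, while the cloud token price and local decode rate are assumptions chosen only to show the order of magnitude (electricity and resale value ignored):

```python
# Rough local-vs-cloud break-even. The ~$19K hardware price is from the
# thread; the cloud price and decode rate are illustrative assumptions.
hardware_usd = 19_000
cloud_usd_per_mtok = 2.50                    # assumed blended $/1M tokens
breakeven_tokens = hardware_usd / cloud_usd_per_mtok * 1e6
print(f"break-even: {breakeven_tokens / 1e9:.1f}B tokens")  # ~7.6B

tok_per_s = 15                               # assumed local decode rate
years = breakeven_tokens / tok_per_s / (3600 * 24 * 365)
print(f"~{years:.0f} years of nonstop generation")          # ~16 years
```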
Hardware and interconnect
- Some speculate about macOS RDMA over Thunderbolt; the demo is confirmed not to use it yet, with speedups expected once it does.
- Questions arise about Linux equivalents: vLLM can scale over standard Ethernet, but peak performance requires RDMA-class interconnects (see the bandwidth sketch below).
- Commenters also note refurbished/discounted M3 Ultras but point out that the lower-cost refurb configs don’t match the 512GB RAM spec in the demo.
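On why Thunderbolt suffices for this style of setup: a two-way pipeline-parallel split ships only one hidden-state vector across the link per generated token, so even Thunderbolt 5’s nominal 80 Gbit/s is far from a bottleneck during decode. The hidden size below is assumed DeepSeek-V3-like; the decode rate is an assumption:

```python
# Per-token traffic for a 2-way pipeline-parallel split: one hidden-state
# vector per generated token crosses the link.
hidden_size, act_bytes = 7168, 2             # assumed fp16 activations
per_token = hidden_size * act_bytes          # ~14 KB per token per hop
tb5_bytes_per_s = 80e9 / 8                   # Thunderbolt 5 nominal rate

decode_tok_per_s = 30                        # assumed decode rate
util = decode_tok_per_s * per_token / tb5_bytes_per_s
print(f"link utilization during decode: {util:.4%}")  # far below 1%
# Tensor parallelism would instead all-reduce activations every layer,
# which is where RDMA-class interconnects begin to matter.
```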