Kimi K2 1T model runs on 2 512GB M3 Ultras

Model details and quantization

  • The demo runs Kimi K2 at 4‑bit quantization on two 512GB M3 Ultras; several commenters argue the quantization level should be stated explicitly, while others assume that fitting "1T parameters" on this hardware already implies heavy quantization.
  • There’s confusion between Kimi K2 vs K2 Thinking (K2T): they are different models with very different capabilities and post‑training. K2T is seen as closer to top-tier models like Sonnet 4.5.
  • Questions arise about context length and prefill speed; commenters warn that “it runs” at small context doesn’t imply usable performance at large, coding‑style contexts.
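The two caveats above can be made concrete with back-of-the-envelope arithmetic. All numbers below are illustrative assumptions, not measurements from the demo:

```python
def weight_memory_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate weight storage, ignoring KV cache and runtime overhead."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

def prefill_seconds(context_tokens: int, prefill_tok_per_s: float) -> float:
    """Time to process the prompt before the first output token appears."""
    return context_tokens / prefill_tok_per_s

# ~1T parameters at 4-bit is roughly 500 GB of weights -- too big for one
# 512 GB machine once OS and KV-cache overhead are counted, hence two M3 Ultras.
print(f"{weight_memory_gb(1000, 4):.0f} GB of weights")  # 500 GB

# Assumed (hypothetical) prefill throughput: even at 100 tok/s, a 128K-token
# coding-style context takes over 20 minutes before the first output token.
print(f"{prefill_seconds(128_000, 100) / 60:.0f} min of prefill")
```

This is why "it runs" at small context says little about usability at agentic coding contexts.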

Behavior, style, and use cases

  • Kimi K2 is described as less capable than frontier models on complex reasoning but unusually strong at:
    • Short-form writing (emails, communication)
    • Editing and blunt critique
    • “Emotional intelligence” / social nuances in messages
    • Geospatial tasks
  • It is perceived as unusually direct, willing to call out user mistakes, and to clearly say “there is no answer in the search results.” Some users value this non‑sycophantic style.

Instruction-following vs pushback

  • One camp wants strict, assumption‑free instruction following (especially for coding), with the model asking clarifying questions rather than disagreeing.
  • Another camp prefers agents that take initiative, push back on dubious instructions, and warn about dangerous consequences (e.g., potential SQL injection).
  • A middle ground emerges: models should sometimes ask clarifying questions and sometimes challenge the request, but not blindly comply.

Training, architecture, and RLHF

  • Kimi is said to use a DeepSeek-style MoE architecture, trained with the Muon optimizer; one commenter characterizes the work beyond that base as "mainly finetuning."
  • Debate over whether most Chinese models are downstream of DeepSeek/GPT; others point to Qwen, Mistral, Llama, ERNIE, etc. as independent efforts.
  • Several comments criticize mainstream RLHF for over-optimizing for politeness and flattery; Kimi is praised as a counterexample.
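For readers unfamiliar with Muon: its distinguishing step is orthogonalizing the whole momentum matrix (rather than scaling elementwise, as Adam does) via a Newton–Schulz iteration. Below is a minimal NumPy sketch of that idea; it is not Kimi's training code, and the quintic coefficients are taken from the widely circulated open-source Muon reference implementation (an assumption on my part, not something stated in the thread):

```python
import numpy as np

def newton_schulz_orthogonalize(g: np.ndarray, steps: int = 5) -> np.ndarray:
    """Approximately orthogonalize a momentum/gradient matrix.

    Pushes all singular values of g toward ~1 using a quintic
    Newton-Schulz iteration, without an explicit (expensive) SVD.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)  # normalize so singular values <= 1
    transpose = x.shape[0] > x.shape[1]
    if transpose:                        # keep the x @ x.T product small
        x = x.T
    for _ in range(steps):
        xxt = x @ x.T
        x = a * x + (b * xxt + c * xxt @ xxt) @ x
    return x.T if transpose else x

def muon_step(w, grad, momentum, lr=0.02, beta=0.95):
    """One hypothetical Muon update: momentum, then an orthogonalized step."""
    momentum = beta * momentum + grad
    w = w - lr * newton_schulz_orthogonalize(momentum)
    return w, momentum
```

The design point is that orthogonalized updates equalize the "learning pressure" across directions in a weight matrix, which is where Muon's reported efficiency gains come from.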

Benchmarks and prompting

  • Kimi K2 reportedly performs unusually well on the “clock test” and EQBench (with the caveat that EQBench is LLMs grading LLMs).
  • Discussion around more “linguistically technical” system prompts to force blunt, “bald-on-record” responses, illustrating how prompt wording strongly shapes behavior.
  • One commenter argues these are really “word models,” not true “language models,” since phrasing and register substantially affect outputs.
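To illustrate the "register" point, here are two invented system prompts for the same task; neither is quoted from the thread, and the payload shape is just the common OpenAI-style message list:

```python
# Hypothetical prompts: same request, polite framing vs. "bald-on-record"
# (direct, unhedged, no softening) framing. Text invented for illustration.

POLITE_SYSTEM = (
    "You are a helpful, friendly assistant. Be encouraging and "
    "considerate when pointing out problems."
)

BALD_ON_RECORD_SYSTEM = (
    "Answer with maximum directness. State errors plainly, without "
    "hedging, apologies, or compliments. If the premise is wrong, "
    "say so in the first sentence."
)

def build_messages(system_prompt: str, user_text: str) -> list[dict]:
    """Assemble a chat payload; only the system line differs between runs."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_text},
    ]
```

Swapping one system line for the other typically changes tone far more than any sampling parameter, which is the commenters' point about phrasing and register shaping outputs.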

Local vs cloud, cost, and privacy

  • Running a 1T model locally on dual M3 Ultras (~$19K) is viewed by many as uneconomical versus cloud inference, especially given low personal utilization and very fast providers (Groq, Cerebras, etc.).
  • Others argue local is about:
    • Privacy and sensitive data (including “record everything” workflows and codebases)
    • Autonomy from future “enshittification” of cloud AI
    • Hobbyist experimentation and research
  • There’s disagreement over whether local inference makes sense only for privacy and hobby use, or also for future‑proofing and high‑value bespoke workloads.
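A rough break-even sketch shows why the economics argument keeps coming up. The cloud price and utilization figures below are illustrative assumptions, not quotes from any provider:

```python
# Hedged break-even sketch: ~$19K local rig vs. paying per cloud token.

HARDWARE_COST_USD = 19_000
CLOUD_PRICE_PER_M_TOKENS = 2.50   # assumed blended $/1M tokens
TOKENS_PER_DAY = 200_000          # assumed heavy personal use

daily_cloud_cost = TOKENS_PER_DAY / 1e6 * CLOUD_PRICE_PER_M_TOKENS
breakeven_days = HARDWARE_COST_USD / daily_cloud_cost

print(f"cloud cost: ${daily_cloud_cost:.2f}/day")
print(f"~{breakeven_days / 365:.0f} years to break even")
```

Under these assumptions the hardware never pays for itself on cost alone, which is why the counterarguments center on privacy, autonomy, and experimentation rather than price.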

Hardware and interconnect

  • Some speculate about macOS RDMA over Thunderbolt; the demo author confirms it is not yet in use, with speedups expected once it is.
  • Questions arise about Linux equivalents: vLLM can scale over standard Ethernet, but peak performance requires RDMA‑class interconnects.
  • Commenters also note refurbished/discounted M3 Ultras but point out that the lower-cost refurb configs don’t match the 512GB RAM spec in the demo.
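For the Linux/vLLM point above, a two-node launch over plain Ethernet looks roughly like the following. The model name and parallelism sizes are placeholders; the `ray` and `vllm serve` flags are standard, and `NCCL_IB_DISABLE=1` is the stock NCCL switch that forces TCP sockets instead of RDMA transports:

```shell
# On the head node: start a Ray cluster.
ray start --head --port=6379

# On each worker node: join the cluster.
ray start --address=<head-node-ip>:6379

# Back on the head node: e.g. 8 GPUs per node, tensor-parallel within a
# node and pipeline-parallel across nodes. NCCL_IB_DISABLE=1 pins NCCL to
# the "standard Ethernet" socket path; drop it to let NCCL pick
# InfiniBand/RoCE (RDMA) when available.
NCCL_IB_DISABLE=1 vllm serve <model-name> \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 2
```

This sketch illustrates the commenters' trade-off: the socket path works anywhere, but inter-node tensor traffic is exactly where RDMA-class interconnects earn their keep.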