Kimi K2 1T model runs on 2 512GB M3 Ultras

Model details and quantization

  • The demo runs Kimi K2 at 4‑bit quantization on two 512GB M3 Ultras; several commenters argue the quantization level should be stated explicitly, while others assume that fitting "1T parameters" on this hardware already implies heavy quantization.
  • There’s confusion between Kimi K2 vs K2 Thinking (K2T): they are different models with very different capabilities and post‑training. K2T is seen as closer to top-tier models like Sonnet 4.5.
  • Questions arise about context length and prefill speed; commenters warn that “it runs” at small context doesn’t imply usable performance at large, coding‑style contexts.
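The two caveats above can be made concrete with back-of-the-envelope arithmetic. All numbers below are illustrative assumptions, not measurements from the demo:

```python
def weight_memory_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate weight storage, ignoring KV cache and runtime overhead."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

def prefill_seconds(context_tokens: int, prefill_tok_per_s: float) -> float:
    """Time to process the prompt before the first output token appears."""
    return context_tokens / prefill_tok_per_s

# ~1T parameters at 4-bit is roughly 500 GB of weights -- too big for one
# 512 GB machine once OS and KV-cache overhead are counted, hence two M3 Ultras.
print(f"{weight_memory_gb(1000, 4):.0f} GB of weights")  # 500 GB

# Assumed (hypothetical) prefill throughput: even at 100 tok/s, a 128K-token
# coding-style context takes over 20 minutes before the first output token.
print(f"{prefill_seconds(128_000, 100) / 60:.0f} min of prefill")
```

This is why "it runs" at small context says little about usability at agentic coding contexts.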

Behavior, style, and use cases

  • Kimi K2 is described as less capable than frontier models on complex reasoning but unusually strong at:
    • Short-form writing (emails, communication)
    • Editing and blunt critique
    • “Emotional intelligence” / social nuances in messages
    • Geospatial tasks
  • It is perceived as unusually direct, willing to call out user mistakes, and to clearly say “there is no answer in the search results.” Some users value this non‑sycophantic style.

Instruction-following vs pushback

  • One camp wants strict, assumption‑free instruction following (especially for coding), with the model asking clarifying questions rather than disagreeing.
  • Another camp prefers agents that take initiative, push back on dubious instructions, and warn about dangerous consequences (e.g., potential SQL injection).
  • A middle ground emerges: models should sometimes ask clarifying questions and sometimes challenge the request, but not blindly comply.

Training, architecture, and RLHF

  • Kimi is said to use a DeepSeek-style MoE architecture, trained with the Muon optimizer; one commenter characterizes the work beyond that base as "mainly finetuning."
  • Debate over whether most Chinese models are downstream of DeepSeek/GPT; others point to Qwen, Mistral, Llama, ERNIE, etc. as independent efforts.
  • Several comments criticize mainstream RLHF for over-optimizing for politeness and flattery; Kimi is praised as a counterexample.
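For readers unfamiliar with Muon: its distinguishing step is orthogonalizing the whole momentum matrix (rather than scaling elementwise, as Adam does) via a Newton–Schulz iteration. Below is a minimal NumPy sketch of that idea; it is not Kimi's training code, and the quintic coefficients are taken from the widely circulated open-source Muon reference implementation (an assumption on my part, not something stated in the thread):

```python
import numpy as np

def newton_schulz_orthogonalize(g: np.ndarray, steps: int = 5) -> np.ndarray:
    """Approximately orthogonalize a momentum/gradient matrix.

    Pushes all singular values of g toward ~1 using a quintic
    Newton-Schulz iteration, without an explicit (expensive) SVD.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)  # normalize so singular values <= 1
    transpose = x.shape[0] > x.shape[1]
    if transpose:                        # keep the x @ x.T product small
        x = x.T
    for _ in range(steps):
        xxt = x @ x.T
        x = a * x + (b * xxt + c * xxt @ xxt) @ x
    return x.T if transpose else x

def muon_step(w, grad, momentum, lr=0.02, beta=0.95):
    """One hypothetical Muon update: momentum, then an orthogonalized step."""
    momentum = beta * momentum + grad
    w = w - lr * newton_schulz_orthogonalize(momentum)
    return w, momentum
```

The design point is that orthogonalized updates equalize the "learning pressure" across directions in a weight matrix, which is where Muon's reported efficiency gains come from.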

Benchmarks and prompting

  • Kimi K2 reportedly performs unusually well on the “clock test” and EQBench (with the caveat that EQBench is LLMs grading LLMs).
  • Discussion around more “linguistically technical” system prompts to force blunt, “bald-on-record” responses, illustrating how prompt wording strongly shapes behavior.
  • One commenter argues these are really “word models,” not true “language models,” since phrasing and register substantially affect outputs.
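To illustrate the "register" point, here are two invented system prompts for the same task; neither is quoted from the thread, and the payload shape is just the common OpenAI-style message list:

```python
# Hypothetical prompts: same request, polite framing vs. "bald-on-record"
# (direct, unhedged, no softening) framing. Text invented for illustration.

POLITE_SYSTEM = (
    "You are a helpful, friendly assistant. Be encouraging and "
    "considerate when pointing out problems."
)

BALD_ON_RECORD_SYSTEM = (
    "Answer with maximum directness. State errors plainly, without "
    "hedging, apologies, or compliments. If the premise is wrong, "
    "say so in the first sentence."
)

def build_messages(system_prompt: str, user_text: str) -> list[dict]:
    """Assemble a chat payload; only the system line differs between runs."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_text},
    ]
```

Swapping one system line for the other typically changes tone far more than any sampling parameter, which is the commenters' point about phrasing and register shaping outputs.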

Local vs cloud, cost, and privacy

  • Running a 1T model locally on dual M3 Ultras (~$19K) is viewed by many as uneconomical versus cloud inference, especially given low personal utilization and very fast providers (Groq, Cerebras, etc.).
  • Others argue local is about:
    • Privacy and sensitive data (including “record everything” workflows and codebases)
    • Autonomy from future “enshittification” of cloud AI
    • Hobbyist experimentation and research
  • There’s disagreement over whether local inference makes sense only for privacy and hobby use, or also for future‑proofing and high‑value bespoke workloads.
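A rough break-even sketch shows why the economics argument keeps coming up. The cloud price and utilization figures below are illustrative assumptions, not quotes from any provider:

```python
# Hedged break-even sketch: ~$19K local rig vs. paying per cloud token.

HARDWARE_COST_USD = 19_000
CLOUD_PRICE_PER_M_TOKENS = 2.50   # assumed blended $/1M tokens
TOKENS_PER_DAY = 200_000          # assumed heavy personal use

daily_cloud_cost = TOKENS_PER_DAY / 1e6 * CLOUD_PRICE_PER_M_TOKENS
breakeven_days = HARDWARE_COST_USD / daily_cloud_cost

print(f"cloud cost: ${daily_cloud_cost:.2f}/day")
print(f"~{breakeven_days / 365:.0f} years to break even")
```

Under these assumptions the hardware never pays for itself on cost alone, which is why the counterarguments center on privacy, autonomy, and experimentation rather than price.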

Hardware and interconnect

  • Some speculate about macOS RDMA over Thunderbolt; the demo author confirms it is not yet in use, with speedups expected once it is.
  • Questions arise about Linux equivalents: vLLM can scale over standard Ethernet, but peak performance requires RDMA‑class interconnects.
  • Commenters also note refurbished/discounted M3 Ultras but point out that the lower-cost refurb configs don’t match the 512GB RAM spec in the demo.
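For the Linux/vLLM point above, a two-node launch over plain Ethernet looks roughly like the following. The model name and parallelism sizes are placeholders; the `ray` and `vllm serve` flags are standard, and `NCCL_IB_DISABLE=1` is the stock NCCL switch that forces TCP sockets instead of RDMA transports:

```shell
# On the head node: start a Ray cluster.
ray start --head --port=6379

# On each worker node: join the cluster.
ray start --address=<head-node-ip>:6379

# Back on the head node: e.g. 8 GPUs per node, tensor-parallel within a
# node and pipeline-parallel across nodes. NCCL_IB_DISABLE=1 pins NCCL to
# the "standard Ethernet" socket path; drop it to let NCCL pick
# InfiniBand/RoCE (RDMA) when available.
NCCL_IB_DISABLE=1 vllm serve <model-name> \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 2
```

This sketch illustrates the commenters' trade-off: the socket path works anywhere, but inter-node tensor traffic is exactly where RDMA-class interconnects earn their keep.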