The path to ubiquitous AI (17k tokens/sec)
Demo experience & perceived speed
- Many tried the ChatJimmy demo and were shocked: multi-paragraph answers appear essentially instantly (15–17k tok/s), feeling like a page load rather than streaming “typing.”
- The UX feels qualitatively different: some found it delightful, others found “wall of text all at once” disorienting and suggested artificial throttling to match reading or interaction pace.
- A few noted bugs or odd behavior (caching issues, broken attachment handling), but accepted it as a tech demo.
Hardware approach & architecture
- The chip is an inference ASIC with the model’s weights hardwired in ROM and a limited KV cache in SRAM, on a large 6nm die (880 mm², ~53B transistors, ~200W).
- Clarification in the thread: the current demo (Llama‑3.1‑8B, Q3‑quantized, ~1k context) likely fits on a single chip; earlier claims of needing 10 chips for one model were later walked back.
- Future roadmap: larger “thinking” models, FP4 generation, and multi‑card setups for frontier‑class models; claimed ~2‑month turnaround from model to silicon.
- Comparisons to GPUs and TPUs suggest similar overall Pareto efficiency, but with access to an extreme low‑latency operating point that general‑purpose chips can’t reach.
Model quality, limitations, and what small models are for
- People repeatedly stress the demo uses an old, 8B, heavily quantized Llama, so hallucinations and wrong answers (sports trivia, Monty Python lines, counting letters, basic sentiment) are expected.
- Much of the debate traces to misunderstood LLM roles: small models are weak as encyclopedias but strong at:
- Converting unstructured → structured data
- Classification, tagging, routing, scoring
- Simple transformations (markdown, translations, schema filling)
- There’s a side argument over whether LLMs “just regurgitate text” vs genuinely solve novel problems; no consensus.
Proposed use cases for ultra-fast small models
- High‑throughput NLP:
- PII detection and log scanning
- Large‑scale summarization, Wikidata/Wikipedia enrichment
- Mass data extraction, email/attachment parsing, column‑wise database tagging
- Orchestration and agents:
- Routing in agent pipelines and API gateways
- Speculative decoding in front of frontier models
- Swarms of cheap “minion” models exploring many solution paths in parallel
- Real‑time / embedded:
- Voice assistants and turn detection with sub‑second response
- Robotics, drones, industrial automation, on‑device UX, “smart” appliances, possibly games/NPCs.
- Many note that for these, “good enough + insanely fast + cheap” often beats “frontier but slow/expensive.”
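The speculative‑decoding role mentioned above can be illustrated with toy token‑level models: a cheap draft model proposes a block of tokens, and the expensive target model verifies them, keeping the longest agreeing prefix. Both model functions are stand‑ins, and real implementations accept drafts probabilistically and score all positions in one batched pass; greedy matching keeps this sketch short.

```python
def draft_model(context: list[str], k: int) -> list[str]:
    # Stand-in for the fast draft model: guesses the next k tokens
    # (note its fourth token is wrong on purpose).
    canned = ["the", "chip", "is", "vary", "fast"]
    return canned[len(context):len(context) + k]

def target_model(context: list[str]) -> str:
    # Stand-in for the slow frontier model: one "correct" next token.
    canned = ["the", "chip", "is", "very", "fast"]
    return canned[len(context)] if len(context) < len(canned) else "<eos>"

def speculative_step(context: list[str], k: int = 4) -> list[str]:
    """Accept drafted tokens until the target disagrees."""
    proposal = draft_model(context, k)
    accepted: list[str] = []
    for tok in proposal:
        correct = target_model(context + accepted)
        if tok == correct:
            accepted.append(tok)      # draft agreed: a nearly free token
        else:
            accepted.append(correct)  # disagreement: take the target's token, stop
            break
    return accepted
```

Because verification is batched, every accepted draft token saves a sequential pass through the big model, which is why an ultra‑fast (if mediocre) drafter in front of a frontier model is attractive.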
Scaling, obsolescence, and economics
- Skeptics question:
- Whether this approach scales to 80B–800B models given SRAM limits, context constraints, and power.
- The value of etching a model that may be outdated in 6–12 months, raising e‑waste and depreciation concerns.
- Supporters reply:
- Models are approaching “good enough” for many workloads; once plateaued, fixed‑model silicon becomes attractive.
- Chips can remain useful for years as narrow specialists, especially with RAG, tool use, and LoRA‑style fine‑tuning.
- Broader implications:
- Possible shift from per‑token SaaS to “AI as appliance” (cards, cartridges, on‑prem).
- Could nibble at 5–10% of inference workloads (low‑latency, small‑context tasks) while GPUs remain dominant for training.
- Raises questions about evaluation (static benchmarks vs massive adversarial/agentic testing) and the need for safety “circuit breakers” when tokens become extremely cheap and fast.