Show HN: Needle: We Distilled Gemini Tool Calling into a 26M Model

Model & Capabilities

  • 26M-parameter, INT4-quantized (~14 MB) tool-calling model distilled from Gemini, intended to run on CPUs and tiny devices.
  • Focus is on selecting tools and filling arguments, not general conversation or “knowledge.”
  • Currently lacks robust in-context learning; authors say that’s “in the works.”
  • Architecturally notable for removing MLP/FFN blocks, relying on attention plus external tools/knowledge.
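
The size figure in the first bullet can be sanity-checked with simple arithmetic. The sketch below assumes 4 bits per weight and counts only raw weight storage; the helper name is illustrative. The gap between the computed value and the quoted ~14 MB is presumably overhead such as quantization scales or file metadata, though the thread does not break it down.

```python
def quantized_size_mb(num_params: int, bits_per_weight: int) -> float:
    """Raw weight storage in megabytes (10**6 bytes), ignoring any overhead."""
    return num_params * bits_per_weight / 8 / 1_000_000

params = 26_000_000                      # parameter count quoted in the thread
int4_mb = quantized_size_mb(params, 4)   # INT4: half a byte per weight
fp16_mb = quantized_size_mb(params, 16)  # FP16 baseline, for comparison only

print(f"INT4: {int4_mb} MB, FP16: {fp16_mb} MB")
```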

Use Cases & Integration Ideas

  • OS-level or shared “natural language parser” that all CLI programs can use (NL to program flags).
  • Voice assistants and smart-home control (timers, weather, lights), Siri-like behavior, Home Assistant integration.
  • Embedded in devices like watches, earphones, glasses, Raspberry Pi–based smart speakers.
  • As a thin tool router in larger agent systems (small model chooses tools, big model handles reasoning/summarization).
  • Used with MCP or similar to abstract away direct API integrations (“just give tools, let model figure it out”).
  • Local helpers for complex build/test infrastructures or privacy-first desktop/mobile apps.
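
The "thin tool router" idea above might look roughly like the following on the host side. This is a minimal sketch, not Needle's actual interface: the JSON shape (`name` plus `arguments`), the tool names, and the `dispatch` helper are all assumptions made for illustration, and the model's output is replaced by a hard-coded string.

```python
import json
from typing import Callable, Dict

# Hypothetical tools; names and signatures are illustrative only.
def set_timer(minutes: int) -> str:
    return f"timer set for {minutes} min"

def lights(room: str, on: bool) -> str:
    return f"{room} lights {'on' if on else 'off'}"

TOOLS: Dict[str, Callable[..., str]] = {"set_timer": set_timer, "lights": lights}

def dispatch(tool_call_json: str) -> str:
    """Parse a model-emitted tool call and run the matching function."""
    call = json.loads(tool_call_json)
    name = call.get("name")
    args = call.get("arguments", {})
    if name not in TOOLS:
        # Reject anything the host did not register, rather than guessing.
        raise ValueError(f"unknown tool: {name!r}")
    return TOOLS[name](**args)

# Stand-in for what a small router model might emit for
# "set a timer for 5 minutes" (assumed format).
example = '{"name": "set_timer", "arguments": {"minutes": 5}}'
result = dispatch(example)
print(result)
```

In a larger agent system, the string returned by `dispatch` would be handed to the bigger model for reasoning or summarization; the small model only has to get the name and arguments right.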

Demos, Deployment & Tooling

  • Initial tokenizer repo access issue on Hugging Face was fixed.
  • Community quickly deployed a Hugging Face Space with a very simple Dockerfile.
  • Suggestions for demos: short videos, terminal recordings (e.g., asciinema), WebGPU/WASM/Transformers.js browser demo.

Performance, Limitations & Open Questions

  • Some users report strong results for basic tasks (timers, shopping lists), even surpassing Siri for simple flows.
  • Others find limitations: confusion with overlapping tools, repeated/duplicated tool calls, weak handling of ambiguous requests, and reliance on tight contexts/prompts.
  • Multi-step workflows and stateful chains are only partially demonstrated; long-horizon tool planning behavior remains unclear.
  • Questions raised on ONNX/browser deployment and formal tool-use benchmarks; answers are not fully detailed in the thread.
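
One host-side mitigation for the repeated tool calls mentioned above is to filter exact duplicates before executing anything. The sketch below is an assumption about how a caller might do this, not something proposed in the thread; it treats a call as a dict with `name` and `arguments` keys (a hypothetical shape).

```python
import json
from typing import Any, Dict, List

def dedupe_tool_calls(calls: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Drop exact repeats (same name and arguments), keeping first-seen order."""
    seen = set()
    unique = []
    for call in calls:
        # Canonical JSON so that key order does not affect equality.
        key = json.dumps(call, sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique.append(call)
    return unique

calls = [
    {"name": "set_timer", "arguments": {"minutes": 5}},
    {"arguments": {"minutes": 5}, "name": "set_timer"},  # same call, keys reordered
    {"name": "set_timer", "arguments": {"minutes": 10}},
]
unique_calls = dedupe_tool_calls(calls)
print(len(unique_calls))
```

This only suits idempotent tools. A request such as "add milk to the list twice" legitimately produces two identical calls, so blanket deduplication would be wrong there.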

Distillation, Gemini Choice & Ethics

  • Several posts explain distillation as training a small “student” on a big model’s outputs; note it’s lossy but efficient.
  • Gemini chosen mainly for cheaper APIs and solid one-shot tool-calling, though multiple commenters say Gemini is weak at tool use compared with alternatives.
  • Concerns raised that Gemini’s ToS forbids using its outputs to train competing models; warnings about possible bans or degraded outputs via Google’s anti-distillation defenses.
  • Others highlight perceived double standards, given that large labs trained on web data without individual consent.
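
As a rough illustration of the "student trained on a big model's outputs" description in the first bullet of this section, the data-collection half of distillation can be sketched as below. Everything here is hypothetical: `fake_teacher` is a local stand-in rather than a real API call, and the record layout is an assumption. The thread does not describe the authors' actual pipeline.

```python
from typing import Callable, Dict, List

def build_distillation_set(
    prompts: List[str], teacher: Callable[[str], str]
) -> List[Dict[str, str]]:
    """Pair each prompt with the teacher's output; the student is later
    fine-tuned to reproduce `target` given `input`."""
    return [{"input": p, "target": teacher(p)} for p in prompts]

def fake_teacher(prompt: str) -> str:
    # Placeholder for a large model; returns a canned tool call.
    if "timer" in prompt:
        return '{"name": "set_timer", "arguments": {"minutes": 5}}'
    return '{"name": "none", "arguments": {}}'

dataset = build_distillation_set(
    ["set a timer for 5 minutes", "hello there"], fake_teacher
)
print(dataset[0]["target"])
```

The student sees only the teacher's final outputs, not its internal probabilities, which is one reason commenters describe the process as lossy.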