Show HN: Route your prompts to the best LLM

Concept & Architecture

  • Service routes each prompt to one of many LLMs based on predicted quality, latency, and cost.
  • Uses a separate neural network “router” (~20 ms inference); calls through the hosted public endpoints add roughly 150 ms on top, while on‑prem deployment avoids most of that added latency.
  • Router is trained supervised on open LLM datasets, using GPT‑4 (or similar) as a judge to generate scores; it learns a score function over prompts plus per‑model latent vectors.
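
A minimal sketch of how such a router could score and pick a model, assuming the predicted quality is a dot product between a prompt embedding and a per-model latent vector. The model names, vectors, prices, latencies, and the linear cost/latency penalty are all illustrative assumptions; the thread does not describe the actual scoring formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    name: str                   # hypothetical model identifier
    latent: tuple[float, ...]   # per-model latent vector (learned in a real router)
    cost_per_1k_tokens: float   # USD, made-up figure
    latency_s: float            # typical latency in seconds, made-up figure


def predicted_quality(prompt_embedding, model):
    """Assumed score function: dot product of prompt embedding and model vector."""
    return sum(p * m for p, m in zip(prompt_embedding, model.latent))


def route(prompt_embedding, models, cost_weight=0.0, latency_weight=0.0):
    """Pick the model with the best quality after cost and latency penalties."""
    def utility(model):
        return (predicted_quality(prompt_embedding, model)
                - cost_weight * model.cost_per_1k_tokens
                - latency_weight * model.latency_s)
    return max(models, key=utility)


MODELS = [
    ModelProfile("large-model", (0.9, 0.8), cost_per_1k_tokens=0.03, latency_s=2.0),
    ModelProfile("small-model", (0.7, 0.3), cost_per_1k_tokens=0.001, latency_s=0.4),
]

# Toy two-dimensional embeddings; a real router would compute these from the prompt text.
easy_prompt = (1.0, 0.0)
hard_prompt = (0.2, 1.0)

print(route(easy_prompt, MODELS, cost_weight=10, latency_weight=0.1).name)
print(route(hard_prompt, MODELS, cost_weight=10, latency_weight=0.1).name)
```

With these toy numbers the easy prompt goes to the cheaper model and the hard prompt to the larger one; setting both weights to zero routes purely on predicted quality.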

Use Cases & Benefits

  • Seen as most useful at scale, where inference cost and speed matter (sales call agents, copilots, autocomplete, real‑time UX).
  • Some users report quality gains by combining strengths of multiple models.
  • Platform also offers benchmarking: run your prompts against many models/providers to compare cost, speed, and judged quality; can be used even without routing.
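
The benchmarking idea can be sketched without any routing at all: run the same prompts through each candidate and record latency, cost, and a judged score. Everything here is a stand-in. The "models" are plain functions, the per-call prices are invented, and `length_judge` is a toy replacement for an LLM judge.

```python
import time
from statistics import mean


def benchmark(prompts, models, judge):
    """Run every prompt through every model and summarize the results.

    models: name -> (generate_fn, usd_per_call)
    judge:  (prompt, answer) -> score in [0, 1]
    """
    report = {}
    for name, (generate, usd_per_call) in models.items():
        latencies, scores = [], []
        for prompt in prompts:
            start = time.perf_counter()
            answer = generate(prompt)
            latencies.append(time.perf_counter() - start)
            scores.append(judge(prompt, answer))
        report[name] = {
            "mean_latency_s": mean(latencies),
            "mean_quality": mean(scores),
            "total_cost_usd": usd_per_call * len(prompts),
        }
    return report


def length_judge(prompt, answer):
    """Toy judge: rewards answers at least as long as the prompt."""
    return min(len(answer) / len(prompt), 1.0)


# Stand-in models: real code would call provider APIs here.
STAND_IN_MODELS = {
    "echo": (lambda p: p.upper(), 0.002),
    "terse": (lambda p: p[:3], 0.0005),
}

report = benchmark(["hello world", "route me"], STAND_IN_MODELS, length_judge)
for name, row in report.items():
    print(name, row)
```

The resulting table is useful on its own for choosing a single model, which matches the point that benchmarking does not require routing.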

Customization & Integrations

  • Supports training custom routers on app‑specific data to better match a given task.
  • Integrations mentioned: LlamaIndex RAG and LangChain‑style routing concepts. Planned: support for more models (e.g., Gemini variants such as Gemini Flash) and on‑prem/local deployment.
  • Future API planned to expose raw router scores so clients can keep routing logic and model‑specific prompts on their side.
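
If raw scores were exposed, client-side routing might look like the sketch below. The score API is only planned, so the response shape (a mapping from model name to score), the model names, the templates, and the `min_score` fallback rule are all assumptions made for illustration.

```python
# Model-specific prompts stay on the client; names and templates are hypothetical.
PROMPT_TEMPLATES = {
    "model-a": "You are concise.\n\n{question}",
    "model-b": "### Instruction\n{question}\n### Response",
}


def choose_model(scores, templates, default, min_score=0.0):
    """Pick the best-scoring model that the client has a tuned prompt for."""
    candidates = {m: s for m, s in scores.items() if m in templates}
    if not candidates:
        return default
    best = max(candidates, key=candidates.get)
    # Fall back to the default when the router is not confident enough.
    return best if candidates[best] >= min_score else default


def build_request(question, scores, templates=PROMPT_TEMPLATES,
                  default="model-a", min_score=0.5):
    """Return (model, prompt) using the client's own template for that model."""
    model = choose_model(scores, templates, default, min_score)
    return model, templates[model].format(question=question)


# Assumed shape of a raw-score response: model name -> score.
scores = {"model-a": 0.61, "model-b": 0.74, "model-c": 0.90}
model, prompt = build_request("Summarize this ticket.", scores)
print(model)
print(prompt)
```

Here `model-c` scores highest but is ignored because the client has no tuned prompt for it, which is one way to keep per-model prompt tuning while still using router scores.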

Data Usage & Privacy

  • By default, user data is used (anonymized) to improve the base router.
  • Opt‑out is supported; creator claims no downside other than losing that feedback signal.

Business Model & Sustainability

  • Currently passes through provider rates, takes no margin, and offers free credits to new signups.
  • Future revenue ideas: take a small margin on “optimized” router configs that still reduce user costs vs. a single model; possibly negotiate provider discounts.
  • Some commenters prefer explicit, stable pricing (e.g., fixed fee or small commission) to avoid future surprises.
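
The margin idea above comes down to simple arithmetic: the routed provider cost plus a margin must stay below what the user would have paid for a single model. The sketch below uses invented figures, and the assumption that the margin is charged as a percentage of the routed provider cost is not from the thread.

```python
def routed_price(baseline_cost, routed_cost, margin_rate):
    """Return (price_to_user, user_savings) for a routed configuration.

    baseline_cost: what the user would pay using one model for everything
    routed_cost:   provider cost after routing
    margin_rate:   assumed margin, as a fraction of routed_cost
    """
    price = routed_cost * (1 + margin_rate)
    return price, baseline_cost - price


def max_margin_rate(baseline_cost, routed_cost):
    """Largest margin rate at which the user still pays no more than baseline."""
    return baseline_cost / routed_cost - 1


# Hypothetical monthly spend in USD.
price, savings = routed_price(baseline_cost=100.0, routed_cost=60.0, margin_rate=0.10)
print(f"user pays {price:.2f}, saves {savings:.2f}")
print(f"break-even margin rate: {max_margin_rate(100.0, 60.0):.2%}")
```

A fixed fee or flat commission, as some commenters prefer, would replace `margin_rate` with a constant that does not depend on how much routing saves.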

Comparisons & Alternatives

  • Commenters compare the service to OpenRouter‑style abstraction layers, other AI gateways, and mixture‑of‑experts (MoE) / “composition of experts” architectures.
  • Key difference vs. MoE: operates at a higher level, routing between entire black‑box models, not internal layers or tokens.

Skepticism & Limitations

  • Several practitioners argue models are not interchangeable; prompts are heavily tuned per model and even minor changes or quantization shifts affect behavior.
  • Concern that dynamic routing undermines consistency, especially for complex or high‑stakes content generation and agentic systems.
  • Others see routing as overkill for many apps, with benchmarking and single‑endpoint access being the more broadly valuable features.