Show HN: Route your prompts to the best LLM
Concept & Architecture
- Service routes each prompt to one of many LLMs based on predicted quality, latency, and cost.
- Routing is done by a separate neural‑network “router” that takes ~20ms per inference; the public endpoints add roughly 150ms on top, while on‑prem deployment avoids most of that overhead.
- The router is trained with supervised learning on open LLM datasets, with GPT‑4 (or a similar model) acting as judge to produce the target scores; it learns a scoring function over prompts together with a latent vector per model.
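
The thread gives no implementation details, so the following is only a minimal sketch of how a score over prompts and per‑model latent vectors could be combined with cost and latency. The dot‑product scoring, the `ModelProfile` type, the weights, and every number are assumptions made for illustration, not the service's actual method.

```python
from dataclasses import dataclass


@dataclass
class ModelProfile:
    name: str
    latent: list[float]   # hypothetical learned per-model vector
    cost_per_1k: float    # USD per 1k tokens (made-up figure)
    latency_ms: float     # typical response latency (made-up figure)


def predicted_quality(prompt_vec: list[float], latent: list[float]) -> float:
    """Assumed scoring rule: dot product of prompt embedding and model vector."""
    return sum(p * m for p, m in zip(prompt_vec, latent))


def route(prompt_vec, models, cost_weight=0.0, latency_weight=0.0):
    """Pick the model with the best quality after cost and latency penalties."""
    def utility(m: ModelProfile) -> float:
        return (predicted_quality(prompt_vec, m.latent)
                - cost_weight * m.cost_per_1k
                - latency_weight * m.latency_ms / 1000.0)
    return max(models, key=utility)


# Illustrative catalogue; names and values are invented.
MODELS = [
    ModelProfile("large-model", [0.9, 0.8], cost_per_1k=0.03, latency_ms=900),
    ModelProfile("small-model", [0.7, 0.3], cost_per_1k=0.001, latency_ms=200),
]
```

With both weights at zero the sketch always picks the highest predicted quality; raising `cost_weight` or `latency_weight` shifts easy prompts toward the cheaper, faster model while prompts with a large quality gap still go to the stronger one.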
Use Cases & Benefits
- Seen as most useful at scale, where inference cost and speed matter (sales call agents, copilots, autocomplete, real‑time UX).
- Some users report quality gains by combining strengths of multiple models.
- Platform also offers benchmarking: run your prompts against many models/providers to compare cost, speed, and judged quality; can be used even without routing.
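
The benchmarking idea in the last bullet can be approximated locally. This is a sketch under assumptions: each model is represented by a hypothetical callable returning `(text, tokens_used)` plus a per‑1k‑token price, and `judge` stands in for whatever scoring model is used. None of these names come from the platform.

```python
import time
from statistics import mean


def benchmark(prompts, models, judge):
    """Run every prompt against every model and summarise cost, speed, quality.

    models: name -> (call_fn, usd_per_1k_tokens), where call_fn(prompt)
            returns (text, tokens_used). All of this is a hypothetical interface.
    judge:  function (prompt, text) -> quality score.
    """
    report = {}
    for name, (call_fn, price) in models.items():
        latencies, costs, scores = [], [], []
        for prompt in prompts:
            start = time.perf_counter()
            text, tokens = call_fn(prompt)
            latencies.append(time.perf_counter() - start)
            costs.append(tokens / 1000 * price)
            scores.append(judge(prompt, text))
        report[name] = {
            "avg_latency_s": mean(latencies),
            "total_cost_usd": sum(costs),
            "avg_quality": mean(scores),
        }
    return report


# Stand-ins for real provider calls and a real judge (purely illustrative).
def fake_upper(prompt):
    return prompt.upper(), 1000


def fake_echo(prompt):
    return prompt, 500


def toy_judge(prompt, text):
    return 1.0 if text.isupper() else 0.0
```

Swapping the stand‑ins for real provider clients and a real judge yields the same three‑column comparison without any routing involved.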
Customization & Integrations
- Supports training custom routers on app‑specific data to better match a given task.
- Integrations mentioned: LlamaIndex RAG and LangChain‑style routing concepts. Planned: support for more models (e.g., Gemini variants such as Gemini Flash) and on‑prem/local deployment.
- Future API planned to expose raw router scores so clients can keep routing logic and model‑specific prompts on their side.
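
Since the raw‑score API is only planned, its response shape is unknown. The sketch below assumes it returns a plain mapping from model name to score, and shows how a client might keep both the routing rule and the model‑specific prompts on its side. The template names, the `min_margin` rule, and the request format are all hypothetical.

```python
# Per-model prompt templates kept client-side (illustrative only).
PROMPT_TEMPLATES = {
    "model-a": "You are concise.\n\n{prompt}",
    "model-b": "### Instruction\n{prompt}\n### Response",
}


def choose_model(scores, allowed, default, min_margin=0.05):
    """Pick the top-scoring allowed model, but stay on the default unless
    another model beats it by at least min_margin (an assumed stability rule)."""
    candidates = {m: s for m, s in scores.items() if m in allowed}
    if not candidates:
        return default
    best = max(candidates, key=candidates.get)
    if (best != default and default in candidates
            and candidates[best] - candidates[default] < min_margin):
        return default
    return best


def build_request(prompt, scores, default="model-a"):
    """Combine the routing decision with the matching prompt template."""
    model = choose_model(scores, PROMPT_TEMPLATES.keys(), default)
    return {"model": model, "prompt": PROMPT_TEMPLATES[model].format(prompt=prompt)}
```

Restricting choices to models that already have a tuned template, and switching only on a clear score gap, is one way a client could limit the per‑model prompt drift that some commenters worry about.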
Data Usage & Privacy
- By default, user data is used (anonymized) to improve the base router.
- Opt‑out is supported; creator claims no downside other than losing that feedback signal.
Business Model & Sustainability
- Currently passes through provider rates, takes no margin, and offers free credits to new signups.
- Future revenue ideas: take a small margin on “optimized” router configs that still reduce user costs vs. a single model; possibly negotiate provider discounts.
- Some commenters prefer explicit, stable pricing (e.g., fixed fee or small commission) to avoid future surprises.
Comparisons & Alternatives
- Commenters compare the service to OpenRouter‑style abstraction layers, other AI gateways, and MoE/“composition of experts” architectures.
- Key difference vs. MoE: the router operates at a higher level, choosing between entire black‑box models rather than between internal layers or per‑token experts.
Skepticism & Limitations
- Several practitioners argue that models are not interchangeable: prompts are heavily tuned per model, and even minor model changes or quantization shifts affect behavior.
- Concern that dynamic routing undermines consistency, especially for complex or high‑stakes content generation and agentic systems.
- Others see routing as overkill for many apps, with benchmarking and single‑endpoint access being the more broadly valuable features.