2024-05-22

Show HN: Route your prompts to the best LLM

Concept & Architecture

The service routes each prompt to one of many LLMs based on predicted quality, latency, and cost.
The router is a separate neural network (~20 ms inference); calling the public endpoints adds roughly 150 ms more, while on‑prem deployment avoids most of that added latency.
The router is trained with supervised learning on open LLM datasets, using GPT‑4 (or a similar model) as a judge to generate scores; it learns a scoring function over prompts together with a latent vector per model.
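
To make the "score function over prompts plus per‑model latent vectors" idea concrete, here is a minimal sketch. It is not the service's actual router: the model names, vectors, latency and cost figures, and the toy `embed_prompt` encoder are all invented for illustration, and the dot‑product‑plus‑sigmoid score is only one plausible form such a function could take.

```python
import math

# Hypothetical per-model latent vectors. A real router would learn these;
# the numbers here are made up for illustration.
MODEL_VECTORS = {
    "model-a": [0.9, 0.1],
    "model-b": [0.2, 0.8],
}
# Hypothetical typical latency (seconds) and cost (USD per request).
MODEL_LATENCY_S = {"model-a": 2.0, "model-b": 0.6}
MODEL_COST_USD = {"model-a": 0.03, "model-b": 0.002}

CODE_WORDS = {"code", "function", "bug"}


def embed_prompt(prompt: str) -> list[float]:
    """Toy stand-in for a learned prompt encoder: two hand-picked features."""
    words = prompt.lower().split()
    n = max(len(words), 1)
    length_feature = min(len(words) / 50.0, 1.0)
    code_feature = sum(w in CODE_WORDS for w in words) / n
    return [length_feature, code_feature]


def predicted_quality(prompt_vec: list[float], model_vec: list[float]) -> float:
    """Sigmoid of the dot product, so the score falls strictly between 0 and 1."""
    dot = sum(p * m for p, m in zip(prompt_vec, model_vec))
    return 1.0 / (1.0 + math.exp(-dot))


def route(prompt: str, cost_weight: float = 0.0, latency_weight: float = 0.0) -> str:
    """Return the model with the best quality score after cost/latency penalties."""
    prompt_vec = embed_prompt(prompt)

    def utility(name: str) -> float:
        quality = predicted_quality(prompt_vec, MODEL_VECTORS[name])
        return (quality
                - cost_weight * MODEL_COST_USD[name]
                - latency_weight * MODEL_LATENCY_S[name])

    return max(MODEL_VECTORS, key=utility)
```

Raising `latency_weight` or `cost_weight` pushes the choice toward the faster or cheaper model, which is the quality/latency/cost trade‑off the thread describes.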

Use Cases & Benefits

Seen as most useful at scale, where inference cost and speed matter (sales call agents, copilots, autocomplete, real‑time UX).
Some users report quality gains by combining strengths of multiple models.
Platform also offers benchmarking: run your prompts against many models/providers to compare cost, speed, and judged quality; can be used even without routing.
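
The benchmarking idea (same prompts, many models, compare cost, speed, and judged quality) can be sketched without any routing at all. The code below is an assumption‑laden illustration, not the platform's API: `Candidate`, the per‑character pricing unit, and the `judge` callback are hypothetical names, and in practice `call` would wrap a real provider client and `judge` would wrap an LLM‑as‑judge call.

```python
import time
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass
class Candidate:
    name: str
    call: Callable[[str], str]   # hypothetical: prompt -> completion
    usd_per_1k_chars: float      # hypothetical pricing unit for this sketch


def benchmark(prompts: list[str],
              candidates: list[Candidate],
              judge: Callable[[str, str], float]) -> list[dict]:
    """Run every prompt through every candidate and summarize the results."""
    rows = []
    for cand in candidates:
        latencies, costs, scores = [], [], []
        for prompt in prompts:
            start = time.perf_counter()
            output = cand.call(prompt)
            latencies.append(time.perf_counter() - start)
            # Cost is approximated from prompt + completion length.
            costs.append((len(prompt) + len(output)) / 1000 * cand.usd_per_1k_chars)
            scores.append(judge(prompt, output))
        rows.append({
            "model": cand.name,
            "mean_latency_s": mean(latencies),
            "total_cost_usd": sum(costs),
            "mean_score": mean(scores),
        })
    return rows
```

Sorting the returned rows by `mean_score`, `total_cost_usd`, or `mean_latency_s` gives the kind of side‑by‑side comparison that commenters found useful on its own.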

Customization & Integrations

Supports training custom routers on app‑specific data to better match a given task.
Integrations mentioned: LlamaIndex RAG and LangChain‑style routing concepts. Planned: support for more models (e.g., Gemini variants, including Gemini Flash) and on‑prem/local deployment.
Future API planned to expose raw router scores so clients can keep routing logic and model‑specific prompts on their side.
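
If raw router scores were exposed, client‑side logic might look like the sketch below. The planned API is not described in the thread, so the shape of `scores` (a mapping from model name to score), the model names, and the templates are all assumptions. The `min_margin` rule is an added design choice, not something the author proposed: it keeps the client on its default model unless another model wins clearly.

```python
# Hypothetical model-specific prompt templates kept on the client side.
PROMPT_TEMPLATES = {
    "model-a": "You are a careful assistant.\n\n{task}",
    "model-b": "### Instruction\n{task}\n### Response",
}


def choose_model(scores: dict[str, float], default: str,
                 min_margin: float = 0.05) -> str:
    """Pick the top-scoring model we have a template for, but only switch
    away from `default` when the winner beats it by at least `min_margin`."""
    candidates = {m: s for m, s in scores.items() if m in PROMPT_TEMPLATES}
    if default not in candidates:
        raise ValueError(f"no score for default model {default!r}")
    best = max(candidates, key=candidates.get)
    if candidates[best] - candidates[default] >= min_margin:
        return best
    return default


def build_request(task: str, scores: dict[str, float],
                  default: str = "model-b") -> tuple[str, str]:
    """Return the chosen model and the prompt rendered with its own template."""
    model = choose_model(scores, default)
    return model, PROMPT_TEMPLATES[model].format(task=task)
```

Models without a tuned template are ignored even if they score highest, so a prompt is never sent to a model it was not written for.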

Data Usage & Privacy

By default, user data is used (anonymized) to improve the base router.
Opt‑out is supported; creator claims no downside other than losing that feedback signal.

Business Model & Sustainability

Currently passes through provider rates, takes no margin, and offers free credits to new signups.
Future revenue ideas: take a small margin on “optimized” router configs that still reduce user costs vs. a single model; possibly negotiate provider discounts.
Some commenters prefer explicit, stable pricing (e.g., fixed fee or small commission) to avoid future surprises.
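
The "small margin that still reduces user costs" idea comes down to simple arithmetic, sketched here with invented numbers. The per‑request prices, the traffic split, and the function names are hypothetical; nothing below reflects the service's actual pricing.

```python
def routed_cost(single_model_cost: float, cheap_model_cost: float,
                cheap_share: float) -> float:
    """Blended provider cost per request when `cheap_share` of traffic
    is routed to the cheaper model and the rest stays on the original one."""
    return cheap_share * cheap_model_cost + (1 - cheap_share) * single_model_cost


def user_price(provider_cost: float, margin: float) -> float:
    """What the user pays if the service adds a fractional margin on top."""
    return provider_cost * (1 + margin)


def break_even_margin(single_model_cost: float, blended_cost: float) -> float:
    """Largest margin at which the user pays no more than the single-model baseline."""
    return single_model_cost / blended_cost - 1


# Hypothetical figures: $0.03 per request on one model, $0.002 on a cheaper one,
# with 60% of requests routed to the cheaper model.
blended = routed_cost(0.03, 0.002, cheap_share=0.6)
price_with_margin = user_price(blended, margin=0.10)
```

Under these assumed numbers the user's price with a 10% margin stays well below the single‑model baseline. Commenters who prefer a fixed fee are in effect asking for that margin to be stated up front.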

Comparisons & Alternatives

Commenters compare it to OpenRouter‑style abstraction layers, other AI gateways, and MoE/“composition of experts” architectures.
Key difference vs. MoE: operates at a higher level, routing between entire black‑box models, not internal layers or tokens.

Skepticism & Limitations

Several practitioners argue models are not interchangeable; prompts are heavily tuned per model and even minor changes or quantization shifts affect behavior.
Concern that dynamic routing undermines consistency, especially for complex or high‑stakes content generation and agentic systems.
Others see routing as overkill for many apps, with benchmarking and single‑endpoint access being the more broadly valuable features.