What we've learned from a year of building with LLMs

Fine-tuning vs RAG

  • Strong debate over when to fine-tune vs rely on RAG and prompting.
  • Some argue fine-tuning is “on the way out” for most apps: it’s costly, hard to do well, and RAG is better for injecting new, changing domain knowledge.
  • Others say fine-tuning is essential in some cases, e.g., teaching a model a custom DSL or getting small models to match larger ones on narrow tasks.
  • Several note it’s not either/or: RAG + prompting often comes first; fine-tuning may be added for style, robustness, or specific workflows.
  • Disagreement over whether small (e.g., 8B) models are “worth” fine-tuning and whether they’re already “saturated” with knowledge; no clear resolution.
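
To make the "RAG + prompting first" position concrete, here is a minimal sketch of injecting retrieved text into a prompt before reaching for fine-tuning. It is an illustration only: the word-overlap scoring stands in for a real embedding or search index, and the names `score`, `retrieve`, and `build_prompt` are hypothetical, not taken from the discussion.

```python
from collections import Counter


def score(query: str, doc: str) -> int:
    """Toy relevance score: number of overlapping word occurrences.

    Assumption: a real system would use embeddings or a search index.
    """
    q = Counter(query.lower().split())
    d = Counter(doc.lower().split())
    return sum(min(q[w], d[w]) for w in q)


def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents with the highest toy relevance score."""
    ranked = sorted(docs, key=lambda doc: score(query, doc), reverse=True)
    return ranked[:k]


def build_prompt(query: str, docs: list[str], k: int = 2) -> str:
    """Assemble a prompt that carries the retrieved context with the question."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, docs, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


docs = [
    "refund policy allows returns within 30 days",
    "shipping takes 5 days",
    "our office is in Berlin",
]
prompt = build_prompt("what is the refund policy", docs, k=1)
```

Because the knowledge lives in `docs` rather than in model weights, updating it is a data change, which is the main argument made for RAG when domain knowledge changes often.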

Real-world use cases and skepticism

  • Repeated demand for “show me real production use cases” before accepting long lists of best practices.
  • Examples offered:
    • BI / analytics assistants (text-to-SQL, query generation and refinement).
    • Automated triage and extraction for mail, fax, and phone documents, which freed up several staff members.
    • High-volume unstructured data analysis and anomaly surfacing.
    • Web data extraction with LLM-generated scrapers.
    • Domain-specific assistants (observability, real estate CRM, freight operations).
    • Internal tools for policy drafting, DnD content, summarization, translation, regex help, and code snippets.
  • Some users report disappointing experiences with current “AI features” in products and remain unconvinced for accuracy-critical tasks.

Workflows, agents, and multi-step processes

  • Many emphasize multi-step workflows over “god prompts”: break tasks into phases, maintain intermediate artifacts, and orchestrate with code, queues, and databases.
  • Suggestions include agent systems, task decomposition prompts, document-by-section generation, and having one LLM supervise another.
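
The "phases plus intermediate artifacts" idea can be sketched as ordinary code that calls a model once per step. This is a rough sketch under assumptions: `llm` is any function from prompt string to reply string, `fake_llm` is a stand-in so the example runs without a model, and the `Artifacts` container and prompt wording are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable

LLM = Callable[[str], str]  # assumption: any prompt -> text function


@dataclass
class Artifacts:
    """Intermediate results kept between phases (could live in a database)."""
    outline: list[str] = field(default_factory=list)
    sections: dict[str, str] = field(default_factory=dict)


def make_outline(llm: LLM, topic: str) -> list[str]:
    """Phase 1: ask only for section titles, one per line."""
    raw = llm(f"List section titles for a document about {topic}, one per line.")
    return [line.strip() for line in raw.splitlines() if line.strip()]


def write_section(llm: LLM, topic: str, title: str) -> str:
    """Phase 2: generate one section at a time with a small, focused prompt."""
    return llm(f"Write the '{title}' section of a document about {topic}.")


def run_pipeline(llm: LLM, topic: str) -> Artifacts:
    """Orchestrate the phases in code instead of one large prompt."""
    artifacts = Artifacts()
    artifacts.outline = make_outline(llm, topic)
    for title in artifacts.outline:
        artifacts.sections[title] = write_section(llm, topic, title)
    return artifacts


def fake_llm(prompt: str) -> str:
    """Stand-in model used only to make the sketch runnable."""
    if prompt.startswith("List section titles"):
        return "Intro\nDetails\n"
    return f"[draft for: {prompt}]"


result = run_pipeline(fake_llm, "caching")
```

Keeping `outline` and `sections` as explicit artifacts means a failed step can be retried or inspected on its own, rather than rerunning the whole generation.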

Structured output and JSON constraints

  • Persistent pain around getting reliable JSON/schema-conformant output at scale.
  • Techniques mentioned: grammar-based constrained decoding, custom parsers, post-processing + retries, or splitting outputs into simpler units.
  • Tension between wanting strict schemas and relying on hosted APIs that don’t fully support constrained decoding.
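
For hosted APIs without constrained decoding, the "post-processing + retries" approach might look like the sketch below. It is illustrative rather than a recommended implementation: the `REQUIRED` schema, the function names, and the feedback wording are assumptions, and `llm` is any prompt-to-text callable.

```python
import json
from typing import Callable

# Hypothetical schema: required keys and their accepted Python types.
REQUIRED = {"name": str, "amount": (int, float)}


def extract_json(text: str) -> object:
    """Pull the outermost {...} span out of a reply and parse it."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found")
    return json.loads(text[start:end + 1])  # JSONDecodeError is a ValueError


def validate(obj: object) -> list[str]:
    """Return a list of schema problems; an empty list means the object is valid."""
    if not isinstance(obj, dict):
        return ["top-level value must be an object"]
    errors = []
    for key, expected in REQUIRED.items():
        if key not in obj:
            errors.append(f"missing key: {key}")
        elif isinstance(obj[key], bool) or not isinstance(obj[key], expected):
            errors.append(f"wrong type for {key}")
    return errors


def generate_structured(llm: Callable[[str], str], prompt: str,
                        max_attempts: int = 3) -> dict:
    """Call the model, validate the reply, and retry with error feedback."""
    feedback = ""
    for _ in range(max_attempts):
        raw = llm(prompt + feedback)
        try:
            obj = extract_json(raw)
        except ValueError as exc:
            feedback = f"\nPrevious output was not valid JSON ({exc}). Return only JSON."
            continue
        errors = validate(obj)
        if not errors:
            return obj
        feedback = "\nPrevious output had problems: " + "; ".join(errors)
    raise RuntimeError(f"no valid output after {max_attempts} attempts")
```

Retries raise cost and latency with each failure, which is why commenters who control their own inference stack tend to prefer grammar-based constrained decoding instead.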

Hallucinations, reliability, and “good enough”

  • RAG is seen as helpful but not a cure for hallucinations; legal-domain results were cited as only ~65% accurate.
  • Some argue that source-quoting and guardrails make LLMs “good enough” for many real-world apps; others call this disingenuous because users rarely verify citations.
  • Broader split: some see LLMs as transformative new compute, others as unreliable probabilistic text tools suited mainly to transformation, not high-stakes reasoning.
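
One guardrail implied by the source-quoting argument is to check quotes mechanically instead of relying on users to verify them. The sketch below is a minimal illustration, assuming the application already has the quoted spans and the source text as strings; `normalize` and `check_quotes` are hypothetical names. It only confirms that a quote appears in the source, not that the quote supports the claim built on it.

```python
import re


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so formatting differences do not matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def check_quotes(quotes: list[str], source: str) -> dict[str, bool]:
    """Map each quote to True if it appears verbatim (after normalization) in the source."""
    normalized_source = normalize(source)
    return {quote: normalize(quote) in normalized_source for quote in quotes}


source = "The lease may be terminated\nwith 60 days written notice."
report = check_quotes(
    ["terminated with 60 days written notice", "terminated with 30 days notice"],
    source,
)
unsupported = [quote for quote, found in report.items() if not found]
```

Any quote in `unsupported` can be dropped or flagged before the answer reaches the user.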