2024-05-31

What we've learned from a year of building with LLMs

Fine-tuning vs RAG

Strong debate over when to fine-tune vs rely on RAG and prompting.
Some argue fine-tuning is “on the way out” for most apps: it’s costly, hard to do well, and RAG is better for injecting new, changing domain knowledge.
Others say fine-tuning is essential in some cases, e.g., teaching a model a custom DSL or getting small models to match larger ones on narrow tasks.
Several note it’s not either/or: RAG + prompting often comes first; fine-tuning may be added for style, robustness, or specific workflows.
Disagreement over whether small (e.g., 8B) models are “worth” fine-tuning and whether they’re already “saturated” with knowledge; no clear resolution.
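A rough illustration of why RAG suits changing knowledge: the answerable facts live in a document list that can be edited at any time, with no retraining. This is a toy sketch only. The word-overlap scoring stands in for a real embedding search, and the names `retrieve` and `build_prompt` are hypothetical, not taken from the discussion.

```python
import re
from typing import List


def tokens(text: str) -> set:
    """Lowercased alphanumeric words of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, docs: List[str], k: int = 2) -> List[str]:
    """Rank documents by word overlap with the query (assumed stand-in for vector search)."""
    q = tokens(query)
    ranked = sorted(docs, key=lambda d: len(q & tokens(d)), reverse=True)
    # Keep at most k documents, and drop any that share no words with the query.
    return [d for d in ranked[:k] if q & tokens(d)]


def build_prompt(query: str, docs: List[str]) -> str:
    """Assemble a prompt that injects the retrieved documents as context."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
```

Editing an entry in `docs` changes what the model is shown on the next call, which is the property the pro-RAG side of the debate points to.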

Real-world use cases and skepticism

Repeated demand for “show me real production use cases” before accepting long lists of best practices.
Examples offered:
- BI / analytics assistants (text-to-SQL, query generation and refinement).
- Automated mail/fax/phone document triage and extraction, freeing several staff.
- High-volume unstructured data analysis and anomaly surfacing.
- Web data extraction with LLM-generated scrapers.
- Domain-specific assistants (observability, real estate CRM, freight operations).
- Internal tools for policy drafting, DnD content, summarization, translation, regex help, and code snippets.
Some users report disappointing experiences with current “AI features” in products and remain unconvinced for accuracy-critical tasks.

Workflows, agents, and multi-step processes

Many emphasize multi-step workflows over “god prompts”: break tasks into phases, maintain intermediate artifacts, and orchestrate with code, queues, and databases.
Suggestions include agent systems, task decomposition prompts, document-by-section generation, and having one LLM supervise another.
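The phased approach above can be sketched roughly as follows: one small prompt per phase, every intermediate artifact kept, and plain code doing the orchestration. The `LLM` callable type, the prompts, and `fake_llm` are illustrative assumptions; a real system would likely persist artifacts in a database or queue rather than a dict.

```python
from typing import Callable, Dict, List

# Hypothetical type: any function mapping a prompt string to a completion string.
LLM = Callable[[str], str]


def run_pipeline(topic: str, llm: LLM) -> Dict[str, object]:
    """Outline -> per-section drafts -> review, keeping each intermediate artifact."""
    artifacts: Dict[str, object] = {"topic": topic}

    # Phase 1: ask only for an outline, one section title per line.
    outline = llm(f"List section titles for a document about: {topic}")
    sections: List[str] = [ln.strip() for ln in outline.splitlines() if ln.strip()]
    artifacts["outline"] = sections

    # Phase 2: draft each section with its own small prompt.
    drafts = {
        title: llm(f"Write the section '{title}' for a document about: {topic}")
        for title in sections
    }
    artifacts["drafts"] = drafts

    # Phase 3: a separate call reviews the assembled draft (one LLM supervising another).
    full_text = "\n\n".join(drafts[t] for t in sections)
    artifacts["review"] = llm(f"Review this draft and list problems:\n{full_text}")
    return artifacts


def fake_llm(prompt: str) -> str:
    """Deterministic stand-in so the sketch runs without any API."""
    if prompt.startswith("List section titles"):
        return "Background\nMethod\n"
    if prompt.startswith("Write the section"):
        return "Draft for " + prompt.split("'")[1]
    return "No problems found"
```

Because each phase writes to `artifacts`, a failed phase can be inspected or rerun on its own, which is the main practical argument against a single "god prompt".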

Structured output and JSON constraints

Persistent pain around getting reliable JSON/schema-conformant output at scale.
Techniques mentioned: grammar-based constrained decoding, custom parsers, post-processing + retries, or splitting outputs into simpler units.
Tension between wanting strict schemas and relying on hosted APIs that don’t fully support constrained decoding.
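Of the techniques listed, post-processing plus retries is the one that works against any hosted API, so here is a minimal sketch of it. The `LLM` callable, the `REQUIRED_FIELDS` schema, and the retry wording are assumptions for illustration; grammar-based constrained decoding would replace this loop where the serving stack supports it.

```python
import json
from typing import Callable, Optional

# Hypothetical type: prompt in, raw model text out.
LLM = Callable[[str], str]

# Illustrative schema: field name -> accepted Python type(s).
REQUIRED_FIELDS = {"name": str, "amount": (int, float)}


def extract_json(text: str) -> Optional[dict]:
    """Parse the span from the first '{' to the last '}'; return None on failure."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        obj = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None
    return obj if isinstance(obj, dict) else None


def is_valid(obj: dict) -> bool:
    """Check that every required field is present with an accepted type."""
    return all(isinstance(obj.get(k), t) for k, t in REQUIRED_FIELDS.items())


def get_structured(prompt: str, llm: LLM, max_attempts: int = 3) -> dict:
    """Call the model, validate the reply, and retry with feedback on failure."""
    last_reply = ""
    for attempt in range(max_attempts):
        full_prompt = prompt if attempt == 0 else (
            f"{prompt}\n\nYour previous reply was not valid JSON with the "
            f"required fields:\n{last_reply}\nReply with JSON only."
        )
        last_reply = llm(full_prompt)
        obj = extract_json(last_reply)
        if obj is not None and is_valid(obj):
            return obj
    raise ValueError(f"no valid JSON after {max_attempts} attempts")
```

The retry cap matters at scale: an unbounded loop turns one malformed reply into unbounded cost, so failures are surfaced as an exception instead.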

Hallucinations, reliability, and “good enough”

RAG is seen as helpful but not a cure for hallucinations; legal-domain results cited as only ~65% accurate.
Some argue that source-quoting and guardrails make LLMs “good enough” for many real-world apps; others call this disingenuous because users rarely verify citations.
Broader split: some see LLMs as transformative new compute, others as unreliable probabilistic text tools suited mainly to transformation, not high-stakes reasoning.
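One way to address the objection that users rarely verify citations is to verify them in code. The sketch below assumes the model was asked to put source quotes in double quotes, and checks that each quoted span appears verbatim (ignoring case and whitespace) in a retrieved source. `check_quotes` and the report format are hypothetical. Note the limit: this confirms a quote exists, not that it supports the claim made.

```python
import re
from typing import Dict, List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def check_quotes(answer: str, sources: Dict[str, str]) -> List[dict]:
    """For each double-quoted span in the answer, list the sources that contain it."""
    normalized = {doc_id: normalize(body) for doc_id, body in sources.items()}
    report = []
    for quote in re.findall(r'"([^"]+)"', answer):
        needle = normalize(quote)
        found_in = [doc_id for doc_id, body in normalized.items() if needle in body]
        # An empty found_in list marks a quote that no source backs up.
        report.append({"quote": quote, "found_in": found_in})
    return report
```

An answer with any empty `found_in` entry could be rejected or flagged before it reaches the user, rather than relying on the user to check.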
