What we've learned from a year of building with LLMs
Fine-tuning vs RAG
- Strong debate over when to fine-tune vs rely on RAG and prompting.
- Some argue fine-tuning is “on the way out” for most apps: it’s costly and hard to do well, and RAG is better suited to injecting new or changing domain knowledge.
- Others say fine-tuning is essential in some cases, e.g., teaching a model a custom DSL or getting small models to match larger ones on narrow tasks.
- Several note it’s not either/or: RAG + prompting often comes first; fine-tuning may be added for style, robustness, or specific workflows.
- Disagreement over whether small (e.g., 8B-parameter) models are “worth” fine-tuning and whether they’re already “saturated” with knowledge; the thread reached no clear resolution.
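The "RAG + prompting often comes first" position can be illustrated with a minimal sketch. It assumes a toy word-overlap retriever standing in for a real embedding or search index; the names `tokenize`, `retrieve`, and `build_prompt` are hypothetical and not taken from the thread.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank docs by word overlap with the query (a toy stand-in for vector search)."""
    query_tokens = tokenize(query)
    ranked = sorted(docs, key=lambda d: len(query_tokens & tokenize(d)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, docs: list[str], k: int = 2) -> str:
    """Inject the top-k retrieved passages into the prompt as numbered context."""
    passages = retrieve(query, docs, k)
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
```

Because the knowledge lives in `docs` rather than in model weights, updating the documents changes the answers without retraining, which is the property commenters favored for changing domain knowledge.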
Real-world use cases and skepticism
- Repeated demand for “show me real production use cases” before accepting long lists of best practices.
- Examples offered:
  - BI / analytics assistants (text-to-SQL, query generation and refinement).
  - Automated mail/fax/phone document triage and extraction, freeing several staff.
  - High-volume unstructured data analysis and anomaly surfacing.
  - Web data extraction with LLM-generated scrapers.
  - Domain-specific assistants (observability, real estate CRM, freight operations).
  - Internal tools for policy drafting, DnD content, summarization, translation, regex help, and code snippets.
- Some users report disappointing experiences with current “AI features” in products and remain unconvinced for accuracy-critical tasks.
Workflows, agents, and multi-step processes
- Many emphasize multi-step workflows over “god prompts”: break tasks into phases, maintain intermediate artifacts, and orchestrate with code, queues, and databases.
- Suggestions include agent systems, task decomposition prompts, document-by-section generation, and having one LLM supervise another.
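The multi-step approach described above might look roughly like the following sketch. It assumes the model is exposed as a plain `prompt -> text` callable; `run_pipeline` and `fake_llm` are hypothetical names, and `fake_llm` is a canned stub used only so the example runs without a real model.

```python
from typing import Callable

LLM = Callable[[str], str]  # assumption: any function mapping a prompt to text


def run_pipeline(topic: str, llm: LLM) -> dict:
    """Generate a document in phases, keeping every intermediate artifact."""
    artifacts: dict = {}

    # Phase 1: task decomposition. Ask only for an outline.
    raw_outline = llm(f"List section titles for a document about {topic}, one per line.")
    artifacts["outline"] = [line.strip() for line in raw_outline.splitlines() if line.strip()]

    # Phase 2: one small, focused call per section instead of one "god prompt".
    artifacts["sections"] = {}
    for title in artifacts["outline"]:
        artifacts["sections"][title] = llm(
            f"Write the section '{title}' of a document about {topic}."
        )

    # Phase 3: assembly is ordinary code, not a model call.
    artifacts["document"] = "\n\n".join(
        f"{title}\n{body}" for title, body in artifacts["sections"].items()
    )
    return artifacts


def fake_llm(prompt: str) -> str:
    """Canned stand-in for a model so the pipeline can be exercised offline."""
    if prompt.startswith("List section titles"):
        return "Background\nApproach"
    return f"(draft for: {prompt})"
```

Keeping `artifacts` around means a failed section can be retried or inspected on its own. In a production setting the dict would likely be replaced by the queues and databases the commenters mention.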
Structured output and JSON constraints
- Persistent pain around getting reliable JSON/schema-conformant output at scale.
- Techniques mentioned: grammar-based constrained decoding, custom parsers, post-processing + retries, or splitting outputs into simpler units.
- Tension between wanting strict schemas and relying on hosted APIs that don’t fully support constrained decoding.
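Of the techniques listed, post-processing plus retries is the easiest to apply on top of a hosted API. The sketch below assumes a `prompt -> text` callable and a flat schema given as a mapping from key to Python type; `extract_json`, `validate`, and `generate_structured` are hypothetical helpers, not part of any particular library.

```python
import json
from typing import Callable


def extract_json(text: str) -> dict:
    """Pull the outermost {...} span out of a reply that may contain extra chatter."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found")
    return json.loads(text[start:end + 1])  # JSONDecodeError is a ValueError


def validate(obj: dict, schema: dict[str, type]) -> None:
    """Check that every required key is present with the expected type."""
    for key, expected in schema.items():
        if key not in obj:
            raise ValueError(f"missing key: {key}")
        if not isinstance(obj[key], expected):
            raise ValueError(f"wrong type for key: {key}")


def generate_structured(
    llm: Callable[[str], str],
    prompt: str,
    schema: dict[str, type],
    max_attempts: int = 3,
) -> dict:
    """Call the model, parse and validate, and retry with the error fed back."""
    last_error: Exception | None = None
    for _ in range(max_attempts):
        full_prompt = prompt
        if last_error is not None:
            full_prompt += f"\n\nPrevious output was invalid ({last_error}). Return only valid JSON."
        try:
            obj = extract_json(llm(full_prompt))
            validate(obj, schema)
            return obj
        except ValueError as exc:
            last_error = exc
    raise RuntimeError(f"no valid output after {max_attempts} attempts: {last_error}")
```

Retries add cost and latency at scale, which is part of why commenters preferred grammar-based constrained decoding where the serving stack supports it.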
Hallucinations, reliability, and “good enough”
- RAG is seen as helpful but not a cure for hallucinations; commenters cited legal-domain RAG results of only about 65% accuracy.
- Some argue that source-quoting and guardrails make LLMs “good enough” for many real-world apps; others call this disingenuous because users rarely verify citations.
- Broader split: some see LLMs as transformative new compute, others as unreliable probabilistic text tools suited mainly to transformation, not high-stakes reasoning.
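One way to reduce reliance on users checking citations themselves is to verify quoted passages in code. The following is a minimal sketch, assuming the model is instructed to put quotations in straight double quotes and that a verbatim match (ignoring case and whitespace) is required; `unsupported_quotes` and its helpers are hypothetical names.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line wrapping does not break matches."""
    return " ".join(text.lower().split())


def extract_quotes(answer: str) -> list[str]:
    """Return every span enclosed in straight double quotes."""
    return re.findall(r'"([^"]+)"', answer)


def unsupported_quotes(answer: str, sources: list[str]) -> list[str]:
    """List quotes from the answer that appear in none of the source texts."""
    haystacks = [normalize(s) for s in sources]
    return [
        quote
        for quote in extract_quotes(answer)
        if not any(normalize(quote) in h for h in haystacks)
    ]
```

A non-empty result can trigger a retry or a warning to the user. This only confirms that a quote exists in the sources, not that it supports the claim attached to it, so it narrows the objection rather than answering it.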