Show HN: Forge – Guardrails take an 8B model from 53% to 99% on agentic tasks
Overview of Forge & “Guardrails”
- Forge is a harness around LLMs focused on mechanical reliability, not model quality.
- “Guardrails” here means structural tool-call correctness and workflow enforcement, not safety/moderation.
- Core idea: small and mid-size models can perform complex agentic tasks if the framework keeps them from failing on format/order errors and lets them self-correct.
How Forge Works
- Validates each tool call against declared tools and schemas before execution.
- Adds “rescue parsing” to extract tool calls from various ad-hoc formats (JSON in code fences, custom bracket/XML syntaxes) into a canonical tool_calls schema.
- Uses retry loops with structured, domain-agnostic nudges (e.g., “you must call a tool,” “you skipped a prerequisite step”) sent as tool results.
- Supports optional workflow-level constraints (e.g., “must read before edit”) but can also run with only a tool set and no predefined plan.
- Includes context management with tiered compaction; current implementation uses token thresholds, with future interest in more task-shaped triggers.
Eval Methodology & Findings
- Evals are framed as stress-tests of the recovery loop, not overall agent quality.
- Scenarios range from simple 2-step flows to longer workflows with prerequisites, dead ends, and misleading cues.
- Guardrails significantly increase completion rates (e.g., 53%→99% on an 8B model in the article; similar trends mentioned for other sizes and tasks).
- There is an observed “effective attention” limit on smaller models: they degrade on long sessions even within nominal context.
Backend / Serving-Layer Effects
- Same model weights show large performance differences across backends (llama.cpp server native FC vs prompt mode vs llamafile vs Ollama).
- Exact reasons are unclear; suspected factors include function-calling templates and low-level decoding / chat template differences.
- Commenters find the magnitude “bonkers” and under-discussed; they request full eval settings and configs.
Relation to Other Harnesses & Tools
- Similar ideas exist in other frameworks: structured output enforcement, JSON/grammar constraints, state machines, retry nudges, and coding-specific harnesses.
- Forge emphasizes tool-call-level recovery more than workflow-level orchestration and can act as proxy middleware or as a Python library.
Use Cases, Benefits & Limits
- Targeted especially at local small models (8B–30B class), but also helpful for frontier models by reducing thrashing and failed calls.
- Suggested applications include coding agents, home assistants, and external agent frameworks; proxy mode minimizes integration friction.
- Acknowledged limit: guardrails don’t fix bad reasoning or bad plans (e.g., booking wrong things); they only make chosen actions execute reliably.
Critiques & Confusions
- Some readers find the README unclear about what Forge actually does and how it differs from plain tool-calling with type enforcement.
- The overloaded term “guardrails” is noted as potentially confusing given its other uses (safety, sandboxing, PII filtering).
- A few commenters express skepticism about LLM-written project descriptions and “AI slop,” while others push back and focus on technical merits.