Show HN: Forge – Guardrails take an 8B model from 53% to 99% on agentic tasks

Overview of Forge & “Guardrails”

  • Forge is a harness around LLMs focused on mechanical reliability, not model quality.
  • “Guardrails” here means structural tool-call correctness and workflow enforcement, not safety/moderation.
  • Core idea: small and mid-size models can perform complex agentic tasks if the framework keeps them from failing on format/order errors and lets them self-correct.

How Forge Works

  • Validates each tool call against declared tools and schemas before execution.
  • Adds “rescue parsing” to extract tool calls from various ad-hoc formats (JSON in code fences, custom bracket/XML syntaxes) into a canonical tool_calls schema.
  • Uses retry loops with structured, domain-agnostic nudges (e.g., “you must call a tool,” “you skipped a prerequisite step”) sent as tool results.
  • Supports optional workflow-level constraints (e.g., “must read before edit”) but can also run with only a tool set and no predefined plan.
  • Includes context management with tiered compaction; current implementation uses token thresholds, with future interest in more task-shaped triggers.

Eval Methodology & Findings

  • Evals are framed as stress-tests of the recovery loop, not overall agent quality.
  • Scenarios range from simple 2-step flows to longer workflows with prerequisites, dead ends, and misleading cues.
  • Guardrails significantly increase completion rates (e.g., 53%→99% on an 8B model in the article; similar trends mentioned for other sizes and tasks).
  • There is an observed “effective attention” limit on smaller models: they degrade on long sessions even within nominal context.

Backend / Serving-Layer Effects

  • Same model weights show large performance differences across backends (llama.cpp server native FC vs prompt mode vs llamafile vs Ollama).
  • Exact reasons are unclear; suspected factors include function-calling templates and low-level decoding / chat template differences.
  • Commenters find the magnitude “bonkers” and under-discussed; they request full eval settings and configs.

Relation to Other Harnesses & Tools

  • Similar ideas exist in other frameworks: structured output enforcement, JSON/grammar constraints, state machines, retry nudges, and coding-specific harnesses.
  • Forge emphasizes tool-call-level recovery more than workflow-level orchestration and can act as proxy middleware or as a Python library.

Use Cases, Benefits & Limits

  • Targeted especially at local small models (8B–30B class), but also helpful for frontier models by reducing thrashing and failed calls.
  • Suggested applications include coding agents, home assistants, and external agent frameworks; proxy mode minimizes integration friction.
  • Acknowledged limit: guardrails don’t fix bad reasoning or bad plans (e.g., booking wrong things); they only make chosen actions execute reliably.

Critiques & Confusions

  • Some readers find the README unclear about what Forge actually does and how it differs from plain tool-calling with type enforcement.
  • The overloaded term “guardrails” is noted as potentially confusing given its other uses (safety, sandboxing, PII filtering).
  • A few commenters express skepticism about LLM-written project descriptions and “AI slop,” while others push back and focus on technical merits.