2026-05-19

Show HN: Forge – Guardrails take an 8B model from 53% to 99% on agentic tasks

Overview of Forge & “Guardrails”

Forge is a harness around LLMs focused on mechanical reliability, not model quality.
“Guardrails” here means structural tool-call correctness and workflow enforcement, not safety/moderation.
Core idea: small and mid-size models can perform complex agentic tasks if the framework keeps them from failing on format/order errors and lets them self-correct.

How Forge Works

Validates each tool call against declared tools and schemas before execution.
Adds “rescue parsing” to extract tool calls from various ad-hoc formats (JSON in code fences, custom bracket/XML syntaxes) into a canonical tool_calls schema.
Uses retry loops with structured, domain-agnostic nudges (e.g., “you must call a tool,” “you skipped a prerequisite step”) sent as tool results.
Supports optional workflow-level constraints (e.g., “must read before edit”) but can also run with only a tool set and no predefined plan.
Includes context management with tiered compaction; current implementation uses token thresholds, with future interest in more task-shaped triggers.

Eval Methodology & Findings

Evals are framed as stress-tests of the recovery loop, not overall agent quality.
Scenarios range from simple 2-step flows to longer workflows with prerequisites, dead ends, and misleading cues.
Guardrails significantly increase completion rates (e.g., 53%→99% on an 8B model in the article; similar trends mentioned for other sizes and tasks).
There is an observed “effective attention” limit on smaller models: they degrade on long sessions even within nominal context.

Backend / Serving-Layer Effects

Same model weights show large performance differences across backends (llama.cpp server native FC vs prompt mode vs llamafile vs Ollama).
Exact reasons are unclear; suspected factors include function-calling templates and low-level decoding / chat template differences.
Commenters find the magnitude “bonkers” and under-discussed; they request full eval settings and configs.

Relation to Other Harnesses & Tools

Similar ideas exist in other frameworks: structured output enforcement, JSON/grammar constraints, state machines, retry nudges, and coding-specific harnesses.
Forge emphasizes tool-call-level recovery more than workflow-level orchestration and can act as proxy middleware or as a Python library.

Use Cases, Benefits & Limits

Targeted especially at local small models (8B–30B class), but also helpful for frontier models by reducing thrashing and failed calls.
Suggested applications include coding agents, home assistants, and external agent frameworks; proxy mode minimizes integration friction.
Acknowledged limit: guardrails don’t fix bad reasoning or bad plans (e.g., booking wrong things); they only make chosen actions execute reliably.

Critiques & Confusions

Some readers find the README unclear about what Forge actually does and how it differs from plain tool-calling with type enforcement.
The overloaded term “guardrails” is noted as potentially confusing given its other uses (safety, sandboxing, PII filtering).
A few commenters express skepticism about LLM-written project descriptions and “AI slop,” while others push back and focus on technical merits.

Related topics