Show HN: Factorio Learning Environment – Agents Build Factories

Project and Setup

  • Framework exposes Factorio through a text-based Python API, built on hot-loadable tools that run in-game over the remote console (RCON).
  • Agents write small programs that call these tools; interaction is turn-based rather than pixel/mouse RL.
  • No post-training: all models are off-the-shelf, given tool signatures, docstrings, and short “manuals” with examples.
  • Humans have completed all early “lab” tasks using only the API, showing it is technically sufficient but slower than normal play.
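The turn-based loop described above can be sketched as follows. This is a hypothetical illustration: the tool names (`nearest`, `place_entity`), their signatures, and the stub behavior are assumptions standing in for the real RCON-backed API, not FLE's actual interface.

```python
from dataclasses import dataclass

@dataclass
class Position:
    x: int
    y: int

# --- stub "game" tools, standing in for the RCON-backed API ---
def nearest(resource: str) -> Position:
    """Return the position of the closest patch of a resource (stubbed)."""
    return Position(10, 4)

def place_entity(name: str, position: Position, direction: str = "north") -> dict:
    """Place an entity and return its observed state (stubbed)."""
    return {"name": name, "position": position, "direction": direction}

# --- one "turn": the kind of short program a model would author ---
def agent_turn() -> dict:
    ore = nearest("iron-ore")
    drill = place_entity("burner-mining-drill", ore, direction="south")
    return drill
```

Because interaction is programs rather than pixels, each turn is a small, inspectable script like `agent_turn`, and its return value becomes part of the next turn's context.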

Model Capabilities and Failures

  • Strong correlation between coding ability and in-game performance; top models progress beyond early-game resource extraction.
  • Two main weaknesses:
    • Spatial reasoning: off‑by‑one placements, tangled layouts, mis-rotated inserters, and pipes carrying incompatible fluids placed adjacent to each other.
    • Long‑term planning: agents focus on local fixes and small constructs, rarely develop scalable production or compounding growth.
  • Models often loop on failing actions (“target fixation”) and struggle to recover from earlier design mistakes (e.g. broken topology).
  • Oil and complex production chains are notably hard; agents can handle simple “essential tasks” in isolation but don’t reliably invoke them in open‑ended “build the biggest factory” episodes.

Spatial Representation and Modalities

  • Current eval is text-only over object lists with coordinates and neighborhood info.
  • Attempts with Mermaid diagrams, visual DSLs, and screenshots (via VLMs) did not help; as entity counts grew, models became more confused and hallucinated more.
  • ASCII/grid or Unicode encodings are discussed but raise token-budget and tokenization issues; sparse symbolic encodings seem less confusing.
  • Future ideas: relative-position vectors between entities, factorio-specific visual encoders, and a dedicated “FLE‑V” visual benchmark.
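A toy illustration of why a sparse symbolic encoding can be less confusing than a grid: each entity becomes one compact line (name, coordinates, orientation), so token cost scales with entity count rather than map area. The field names and line format here are assumptions for illustration, not FLE's actual encoding.

```python
def encode_entities(entities: list[dict]) -> str:
    """Serialize an entity list as compact, sorted, one-line records."""
    lines = []
    # Sort by position so the encoding is stable across turns.
    for e in sorted(entities, key=lambda e: (e["y"], e["x"])):
        lines.append(f'{e["name"]} @({e["x"]},{e["y"]}) dir={e["dir"]}')
    return "\n".join(lines)

factory = [
    {"name": "inserter", "x": 5, "y": 3, "dir": "east"},
    {"name": "burner-mining-drill", "x": 4, "y": 2, "dir": "south"},
]
print(encode_entities(factory))
```

An ASCII grid of the same two entities would spend tokens on every empty tile between them; the sparse form spends tokens only on what exists.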

Benchmarks, Metrics, and Tasks

  • Main metric is “production score” (value-weighted total output), with milestones for first-time automation of items; science per minute (SPM) is tracked but is not the primary metric.
  • Community suggests richer tasks: tower‑defense style biter waves, factory-debugging and throughput optimization, belt balancers, train signaling, and large banks of tiny, auto-generated “IQ test” scenarios.
  • Commenters draw reward-shaping analogies to Pokémon RL: grant incremental rewards for automating each new item or science pack.
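A value-weighted production score of the kind described above can be sketched in a few lines. The item values below are made up for illustration; the benchmark's actual value table is not reproduced here.

```python
# Assumed per-item values (illustrative, not the benchmark's real table).
ITEM_VALUE = {
    "iron-plate": 1.0,
    "electronic-circuit": 8.0,
    "automation-science-pack": 40.0,
}

def production_score(output_counts: dict[str, int]) -> float:
    """Sum value * units over everything the factory produced."""
    return sum(ITEM_VALUE.get(item, 0.0) * n for item, n in output_counts.items())

score = production_score({"iron-plate": 500, "electronic-circuit": 40})
# 500 * 1.0 + 40 * 8.0 = 820.0
```

Weighting by value rather than raw count rewards moving up the tech tree: forty circuits outscore hundreds of plates, which pushes agents toward compounding production rather than stockpiling ore.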

Classical AI, Tools, and Game AI Implications

  • Several argue Factorio could largely be “solved” with GOFAI/OR/metaheuristics; FLE agents can, in principle, write or call such solvers (e.g., Z3), but none has done so yet.
  • Broader view: LLMs should orchestrate specialized planners rather than directly micromanage all actions.
  • Debate over using LLMs as in‑game opponents: many doubt it’s necessary or fun for most genres, but see promise for coaches, strategy AIs, and diplomacy.
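A toy example of the kind of OR-style subproblem an agent could delegate to a solver rather than micromanage: sizing production stages to hit a target output rate. The machine rates and recipe ratio below are assumed numbers for illustration, not real game values.

```python
from fractions import Fraction
from math import ceil

# Items per second a single machine produces (assumed, not real game rates).
RATE = {"gear": Fraction(1, 2), "belt": Fraction(1, 1)}
# Gears consumed per belt produced (assumed).
GEARS_PER_BELT = Fraction(1, 2)

def machines_for(target_belts_per_s: Fraction) -> dict[str, int]:
    """Compute how many machines each stage needs for the target rate."""
    belt_machines = ceil(target_belts_per_s / RATE["belt"])
    gear_demand = target_belts_per_s * GEARS_PER_BELT
    gear_machines = ceil(gear_demand / RATE["gear"])
    return {"belt": belt_machines, "gear": gear_machines}

plan = machines_for(Fraction(4))  # target: 4 belts/s
```

Exact rational arithmetic plus a ceiling at each stage is the hand-rolled version; the same constraint structure is what an agent could hand to Z3 or an ILP solver for larger chains.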