Show HN: Factorio Learning Environment – Agents Build Factories
Project and Setup
- Framework exposes Factorio via a text-based Python API, built on remote-console (RCON) tools that run in-game and can be hot-loaded.
- Agents write small programs that call these tools; interaction is turn-based rather than pixel/mouse RL.
- No post-training: all models are off-the-shelf, given tool signatures, docstrings, and short “manuals” with examples.
- Humans have completed all early “lab” tasks using only the API, showing it is technically sufficient but slower than normal play.
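The turn-based, program-writing interaction can be sketched as follows. The tool names and signatures here (`place_entity`, the world list) are illustrative stand-ins, not the framework's actual API:

```python
# Illustrative sketch of an agent-authored program against a text-based
# tool API. Tool names and signatures are hypothetical stand-ins.

def place_entity(name, x, y, world):
    """Stub tool: record an entity placement and return a text observation."""
    world.append((name, x, y))
    return f"placed {name} at ({x}, {y})"

def agent_turn(world):
    """One turn: the agent emits a small program that calls tools in sequence
    and gets back text observations instead of pixels."""
    return [
        place_entity("burner-mining-drill", 10, 5, world),
        place_entity("stone-furnace", 12, 5, world),
    ]

world = []
print("\n".join(agent_turn(world)))
```

The point is the interaction shape: the model reads docstrings and observations as text, and each turn is a short program rather than a mouse action.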
Model Capabilities and Failures
- Strong correlation between coding ability and in-game performance; top models progress beyond early-game resource extraction.
- Two main weaknesses:
  - Spatial reasoning: off‑by‑one placements, tangled layouts, mis-rotated inserters, and adjacent pipes carrying incompatible fluids.
  - Long‑term planning: agents focus on local fixes and small constructs, and rarely develop scalable production or compounding growth.
- Models often loop on failing actions (“target fixation”) and struggle to recover from earlier design mistakes (e.g. broken topology).
- Oil and complex production chains are notably hard; agents can handle simple “essential tasks” in isolation but don’t reliably invoke them in open‑ended “build the biggest factory” episodes.
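The "target fixation" loop is mechanical enough to sketch a guard for. This is a hypothetical mitigation, not something the framework or the discussed agents implement: refuse an action that has already failed identically several times in a row.

```python
def should_abort(failed_actions, action, max_repeats=3):
    """Hypothetical guard against 'target fixation': return True when the
    proposed action matches the last `max_repeats` failed actions exactly,
    i.e. the agent is looping on the same failing fix."""
    recent = failed_actions[-max_repeats:]
    return len(recent) == max_repeats and all(a == action for a in recent)

history = ["rotate inserter", "rotate inserter", "rotate inserter"]
should_abort(history, "rotate inserter")  # True: same failing action three times
```

A real harness would compare structured actions rather than strings, but the failure mode it targets is the one described above.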
Spatial Representation and Modalities
- Current eval is text-only over object lists with coordinates and neighborhood info.
- Attempts with Mermaid diagrams, visual DSLs, and screenshots (via VLMs) did not help; as entity counts grew, models became more confused and hallucinated more.
- ASCII/grid or Unicode encodings are discussed but raise token-budget and tokenization issues; sparse symbolic encodings seem less confusing.
- Future ideas: relative-position vectors between entities, factorio-specific visual encoders, and a dedicated “FLE‑V” visual benchmark.
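The token-budget tension between dense grids and sparse listings is easy to see concretely. A toy comparison (entity symbols and coordinates are made up):

```python
def dense_grid(entities, width, height):
    """Dense ASCII encoding: one character per tile, '.' for empty ground.
    Size grows with map area regardless of how much is actually built."""
    grid = [["." for _ in range(width)] for _ in range(height)]
    for symbol, x, y in entities:
        grid[y][x] = symbol
    return "\n".join("".join(row) for row in grid)

def sparse_listing(entities):
    """Sparse symbolic encoding: one line per entity, empty tiles omitted.
    Size grows only with entity count."""
    return "\n".join(f"{symbol} @ ({x}, {y})" for symbol, x, y in entities)

entities = [("D", 3, 1), ("I", 4, 1), ("F", 5, 1)]  # drill, inserter, furnace
print(len(dense_grid(entities, 40, 20)))  # scales with width * height
print(len(sparse_listing(entities)))      # scales with len(entities)
```

For sparse factories the listing is far cheaper in tokens, which matches the observation that sparse symbolic encodings confuse models less than grids.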
Benchmarks, Metrics, and Tasks
- Main metric is “production score” (value-weighted total output), with milestones for first-time automation of items; SPM (science per minute) is tracked but is not the primary metric.
- Community suggests richer tasks: tower‑defense style biter waves, factory-debugging and throughput optimization, belt balancers, train signaling, and large banks of tiny, auto-generated “IQ test” scenarios.
- Reward shaping analogies to Pokémon RL: incremental rewards for automating new items/science.
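A value-weighted production score reduces to a weighted sum over items produced. A minimal sketch, assuming a per-item value table (the weights below are invented, not the benchmark's actual values):

```python
# Sketch of a value-weighted production score. The value table is made up
# for illustration; the benchmark's actual item weights may differ.
ITEM_VALUE = {
    "iron-plate": 1.0,
    "iron-gear-wheel": 2.5,
    "electronic-circuit": 4.0,
}

def production_score(output_counts):
    """Sum of items produced, each weighted by its assumed value;
    unknown items contribute nothing."""
    return sum(ITEM_VALUE.get(item, 0.0) * n for item, n in output_counts.items())

production_score({"iron-plate": 100, "electronic-circuit": 10})  # 100*1.0 + 10*4.0 = 140.0
```

Milestone bonuses for first-time automation would sit on top of this as one-off rewards, in the spirit of the Pokémon-style reward shaping mentioned above.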
Classical AI, Tools, and Game AI Implications
- Several argue Factorio could largely be “solved” with GOFAI/OR/metaheuristics; FLE agents can, in principle, write or call such solvers (e.g., Z3), but none have yet.
- Broader view: LLMs should orchestrate specialized planners rather than directly micromanage all actions.
- Debate over using LLMs as in‑game opponents: many doubt it’s necessary or fun for most genres, but see promise for coaches, strategy AIs, and diplomacy.
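The GOFAI argument can be made concrete: expanding a production target into raw-resource requirements is plain graph traversal, no learning needed. A toy sketch with an invented recipe table (quantities do not match the real game):

```python
# Toy classical planner: expand a target item into base-resource needs by
# walking a recipe graph. Recipes and quantities are illustrative only.
RECIPES = {
    "electronic-circuit": {"iron-plate": 1, "copper-cable": 3},
    "copper-cable": {"copper-plate": 0.5},
    "iron-plate": {"iron-ore": 1},
    "copper-plate": {"copper-ore": 1},
}

def raw_requirements(item, amount=1.0):
    """Recursively expand an item into base resources (items with no recipe)."""
    if item not in RECIPES:
        return {item: amount}
    totals = {}
    for ingredient, per_unit in RECIPES[item].items():
        for raw, n in raw_requirements(ingredient, per_unit * amount).items():
            totals[raw] = totals.get(raw, 0.0) + n
    return totals

raw_requirements("electronic-circuit")  # {'iron-ore': 1.0, 'copper-ore': 1.5}
```

An LLM agent in FLE could in principle emit exactly this kind of solver and then only orchestrate it, which is the division of labor argued for above.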