Show HN: Factorio Learning Environment – Agents Build Factories

Project and Setup

  • Framework exposes Factorio through a text-based Python API, built on hot-loadable tools that run in-game over the remote console (RCON).
  • Agents write small programs that call these tools; interaction is turn-based rather than pixel/mouse RL.
  • No post-training: all models are off-the-shelf, given tool signatures, docstrings, and short “manuals” with examples.
  • Humans have completed all early “lab” tasks using only the API, showing it is technically sufficient but slower than normal play.
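The turn-based loop described above can be sketched as follows. This is a hypothetical illustration: the tool names (`nearest`, `place_entity`), their signatures, and the stub behavior are assumptions standing in for the real RCON-backed API, not FLE's actual interface.

```python
from dataclasses import dataclass

@dataclass
class Position:
    x: int
    y: int

# --- stub "game" tools, standing in for the RCON-backed API ---
def nearest(resource: str) -> Position:
    """Return the position of the closest patch of a resource (stubbed)."""
    return Position(10, 4)

def place_entity(name: str, position: Position, direction: str = "north") -> dict:
    """Place an entity and return its observed state (stubbed)."""
    return {"name": name, "position": position, "direction": direction}

# --- one "turn": the kind of short program a model would author ---
def agent_turn() -> dict:
    ore = nearest("iron-ore")
    drill = place_entity("burner-mining-drill", ore, direction="south")
    return drill
```

Because interaction is programs rather than pixels, each turn is a small, inspectable script like `agent_turn`, and its return value becomes part of the next turn's context.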

Model Capabilities and Failures

  • Strong correlation between coding ability and in-game performance; top models progress beyond early-game resource extraction.
  • Two main weaknesses:
    • Spatial reasoning: off‑by‑one placements, tangled layouts, mis-rotated inserters, and pipes carrying incompatible fluids placed adjacent to each other.
    • Long‑term planning: agents focus on local fixes and small constructs, rarely develop scalable production or compounding growth.
  • Models often loop on failing actions (“target fixation”) and struggle to recover from earlier design mistakes (e.g. broken topology).
  • Oil and complex production chains are notably hard; agents can handle simple “essential tasks” in isolation but don’t reliably invoke them in open‑ended “build the biggest factory” episodes.

Spatial Representation and Modalities

  • Current eval is text-only over object lists with coordinates and neighborhood info.
  • Attempts with Mermaid diagrams, visual DSLs, and screenshots (via VLMs) did not help; as entity counts grew, models became more confused and hallucinated more.
  • ASCII/grid or Unicode encodings are discussed but raise token-budget and tokenization issues; sparse symbolic encodings seem less confusing.
  • Future ideas: relative-position vectors between entities, factorio-specific visual encoders, and a dedicated “FLE‑V” visual benchmark.
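A toy illustration of why a sparse symbolic encoding can be less confusing than a grid: each entity becomes one compact line (name, coordinates, orientation), so token cost scales with entity count rather than map area. The field names and line format here are assumptions for illustration, not FLE's actual encoding.

```python
def encode_entities(entities: list[dict]) -> str:
    """Serialize an entity list as compact, sorted, one-line records."""
    lines = []
    # Sort by position so the encoding is stable across turns.
    for e in sorted(entities, key=lambda e: (e["y"], e["x"])):
        lines.append(f'{e["name"]} @({e["x"]},{e["y"]}) dir={e["dir"]}')
    return "\n".join(lines)

factory = [
    {"name": "inserter", "x": 5, "y": 3, "dir": "east"},
    {"name": "burner-mining-drill", "x": 4, "y": 2, "dir": "south"},
]
print(encode_entities(factory))
```

An ASCII grid of the same two entities would spend tokens on every empty tile between them; the sparse form spends tokens only on what exists.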

Benchmarks, Metrics, and Tasks

  • Main metric is “production score” (value-weighted total output), with milestones for first-time automation of items; science per minute (SPM) is tracked but is not the primary metric.
  • Community suggests richer tasks: tower‑defense style biter waves, factory-debugging and throughput optimization, belt balancers, train signaling, and large banks of tiny, auto-generated “IQ test” scenarios.
  • Commenters draw reward-shaping analogies to Pokémon RL: grant incremental rewards for automating each new item or science pack.
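A value-weighted production score of the kind described above can be sketched in a few lines. The item values below are made up for illustration; the benchmark's actual value table is not reproduced here.

```python
# Assumed per-item values (illustrative, not the benchmark's real table).
ITEM_VALUE = {
    "iron-plate": 1.0,
    "electronic-circuit": 8.0,
    "automation-science-pack": 40.0,
}

def production_score(output_counts: dict[str, int]) -> float:
    """Sum value * units over everything the factory produced."""
    return sum(ITEM_VALUE.get(item, 0.0) * n for item, n in output_counts.items())

score = production_score({"iron-plate": 500, "electronic-circuit": 40})
# 500 * 1.0 + 40 * 8.0 = 820.0
```

Weighting by value rather than raw count rewards moving up the tech tree: forty circuits outscore hundreds of plates, which pushes agents toward compounding production rather than stockpiling ore.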

Classical AI, Tools, and Game AI Implications

  • Several argue Factorio could largely be “solved” with GOFAI/OR/metaheuristics; FLE agents can, in principle, write or call such solvers (e.g., Z3), but none has done so yet.
  • Broader view: LLMs should orchestrate specialized planners rather than directly micromanage all actions.
  • Debate over using LLMs as in‑game opponents: many doubt it’s necessary or fun for most genres, but see promise for coaches, strategy AIs, and diplomacy.
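A toy example of the kind of OR-style subproblem an agent could delegate to a solver rather than micromanage: sizing production stages to hit a target output rate. The machine rates and recipe ratio below are assumed numbers for illustration, not real game values.

```python
from fractions import Fraction
from math import ceil

# Items per second a single machine produces (assumed, not real game rates).
RATE = {"gear": Fraction(1, 2), "belt": Fraction(1, 1)}
# Gears consumed per belt produced (assumed).
GEARS_PER_BELT = Fraction(1, 2)

def machines_for(target_belts_per_s: Fraction) -> dict[str, int]:
    """Compute how many machines each stage needs for the target rate."""
    belt_machines = ceil(target_belts_per_s / RATE["belt"])
    gear_demand = target_belts_per_s * GEARS_PER_BELT
    gear_machines = ceil(gear_demand / RATE["gear"])
    return {"belt": belt_machines, "gear": gear_machines}

plan = machines_for(Fraction(4))  # target: 4 belts/s
```

Exact rational arithmetic plus a ceiling at each stage is the hand-rolled version; the same constraint structure is what an agent could hand to Z3 or an ILP solver for larger chains.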