We gave terabytes of CI logs to an LLM

Practical effectiveness of LLMs on CI logs

  • Some commenters report strong success using recent models to debug tricky, flaky infra/CI issues from logs, when paired with good tooling and instructions.
  • Others note earlier attempts often hallucinated causes because failures are multi-factor and spread across large, noisy logs.
  • The Mendral team and others claim the approach does work in production for CI failures (especially flaky tests), including identifying root causes and proposing fixes, but emphasize that the setup and orchestration matter more than raw model capability.

Context management, agents, and orchestration

  • A recurring theme: let the model pull relevant context via tools instead of pushing huge logs into the prompt.
  • Described pattern: a main “planner” agent (stronger model) creates an investigation plan, then spawns sub‑agents (cheaper/faster model) to scan restricted log slices and return only relevant snippets or patterns.
  • This “recursive” or agentic style is likened to “Recursive Language Models” or coding agents with a REPL, even though the underlying LLM is unchanged.
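The planner/sub-agent pattern above can be sketched as follows. This is a minimal illustration, not Mendral's implementation: the model calls are stubbed out (a real system would call a strong model in `plan_investigation` and a cheap model in `scan_slice`), and all function names and log slices are hypothetical.

```python
def plan_investigation(failure_summary: str) -> list[dict]:
    # A stronger "planner" model would emit this investigation plan;
    # hard-coded here for illustration.
    return [
        {"slice": "build.log", "pattern": "ERROR"},
        {"slice": "test.log", "pattern": "timeout"},
    ]

def scan_slice(log_lines: list[str], pattern: str, max_snippets: int = 5) -> list[str]:
    # A cheaper "scanner" model (stubbed as a plain substring match) reads a
    # restricted log slice and returns only the relevant snippets.
    return [line for line in log_lines if pattern in line][:max_snippets]

def investigate(logs: dict[str, list[str]], failure_summary: str) -> list[str]:
    # Only these few snippets flow back into the planner's context,
    # instead of the full logs being pushed into the prompt.
    snippets = []
    for step in plan_investigation(failure_summary):
        snippets += scan_slice(logs.get(step["slice"], []), step["pattern"])
    return snippets

logs = {
    "build.log": ["INFO compiling", "ERROR linker exited 1", "INFO done"],
    "test.log": ["PASS test_a", "FAIL test_b: timeout after 30s"],
}
print(investigate(logs, "CI job failed"))
```

The point of the structure is that the planner never sees the raw logs, only the snippets the sub-agents select.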

Logs, noise, and preprocessing

  • Many highlight that logs are extremely noisy; only a tiny fraction of lines matter, and cause/effect often spans services or containers.
  • Good logging quality is seen as a hard, separate problem; if logs were clear enough for LLMs, humans would also debug faster.
  • Two main strategies emerge:
    • Pre-filter/compress logs before the LLM (e.g., TF‑IDF/BERT classifiers, pattern clustering, log compression like CLP).
    • Avoid heavy ingestion-time filtering and instead invest in schema/indexes so agents can issue efficient queries that filter at retrieval time.
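The pre-filtering strategy can be approximated even without a trained classifier. The sketch below uses an IDF-style rarity score over log lines, stdlib only; it is a toy stand-in for the TF-IDF/BERT or clustering approaches mentioned above, and the example log is invented.

```python
import math
from collections import Counter

def rare_lines(lines: list[str], keep: int = 3) -> list[str]:
    # Score each line by how rare its tokens are across the whole log
    # (an IDF-style heuristic): boilerplate repeats, anomalies don't.
    tokenized = [set(line.split()) for line in lines]
    df = Counter()
    for toks in tokenized:
        df.update(toks)
    n = len(lines)

    def score(toks: set) -> float:
        return sum(math.log(n / df[t]) for t in toks) / max(len(toks), 1)

    ranked = sorted(range(n), key=lambda i: score(tokenized[i]), reverse=True)
    # Keep the top-k rarest lines, restored to original log order.
    return [lines[i] for i in sorted(ranked[:keep])]

log = ["GET /health 200"] * 50 + ["worker 7 OOM-killed after 3.2 GB", "GET /health 200"]
print(rare_lines(log, keep=1))
```

Real systems would add pattern clustering and cross-service correlation; this only shows why "a tiny fraction of lines matter" is tractable to exploit.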

LLMs and SQL for observability

  • Several argue SQL is an ideal “common language” between agents and observability data: models generate good SQL when given schemas, and humans can easily review queries.
  • Tools mentioned include Text2SQL engines for Prometheus/Loki/Splunk and ClickHouse‑backed log viewers where agents directly emit SQL.
  • Others caution that LLM‑generated SQL for analytics remains hit‑or‑miss and must be heavily guided; the model's stated reasoning and the SQL it actually generates can diverge.
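The "SQL as common language" argument can be made concrete with a toy example. The schema and data below are invented, and sqlite3 stands in for the ClickHouse or Text2SQL backends mentioned above; the point is only the shape of the interaction: the agent sees a schema, emits a query, and a human can review that query before it runs.

```python
import sqlite3

# Hypothetical minimal logs table; in practice this would be a ClickHouse
# table or an observability store behind a Text2SQL engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (ts TEXT, service TEXT, level TEXT, msg TEXT)")
conn.executemany(
    "INSERT INTO logs VALUES (?, ?, ?, ?)",
    [
        ("2024-05-01T10:00:00", "ci-runner", "INFO", "job started"),
        ("2024-05-01T10:03:12", "ci-runner", "ERROR", "step 'test' exited 1"),
        ("2024-05-01T10:03:12", "cache", "WARN", "cache miss for key abc"),
    ],
)

# A query an agent might emit given the schema. Unlike an opaque chain of
# tool calls, this is short enough for a human to review before execution.
query = """
    SELECT ts, service, msg
    FROM logs
    WHERE level = 'ERROR'
    ORDER BY ts
"""
for row in conn.execute(query):
    print(row)
```

The reviewability is the selling point: a plausible-looking but wrong query fails visibly at this step rather than silently shaping the agent's conclusions.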

Risk, cost, and human oversight

  • Commenters stress nondeterminism and “review fatigue”: long successful sessions can suddenly produce bad output, which is risky for business‑critical analytics or automated fixes.
  • Mendral’s workflow keeps a human approval step for remediation/PRs, despite customers asking for full automation.
  • There are questions about token cost at scale; Mendral says per‑investigation costs are significant but currently profitable, and they’re optimizing orchestration to reduce spend.

Product scope and skepticism

  • Mendral is positioned as automating a platform engineer’s CI debugging workflow: reading logs, inspecting commits/tests, suggesting fixes, and opening PRs.
  • Some see this as disciplined, well-scoped RAG/agent design; others criticize the blog post as marketing-heavy, under‑quantified (no success rates), or “what existing tools already do.”