ETH Zurich and EPFL to release a LLM developed on public infrastructure

Respecting Crawling Opt-Outs & Data Completeness

  • Several comments debate the claim that respecting robots.txt and opt-outs causes “virtually no performance degradation” (a minimal compliance check is sketched after this list).
  • One side argues that models which skip blocked content are inherently disadvantaged, especially for niche documentation (e.g. a specific API) that may be hosted only on opted-out sites.
  • Others reply that:
    • Intelligence is more than memorization; models (like humans) can often infer missing details probabilistically.
    • The empirical gap appears small per the linked paper, and architecture, training duration, and fine-tuning may matter more than squeezing out the last scraps of data.
    • Some blocked content is effectively still captured indirectly via mirrors and copies by less-compliant scrapers.
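
A minimal sketch of what “respecting robots.txt” means in practice, using Python's standard-library robotparser; the user agent and URLs here are placeholders, and real pipelines typically also honor other opt-out signals (e.g. X-Robots-Tag headers) that this snippet does not cover.

```python
from urllib.robotparser import RobotFileParser

# Placeholder crawler identity and target URL; AI crawlers use agents such as
# "GPTBot" or "CCBot", which site owners can disallow explicitly in robots.txt.
USER_AGENT = "ExampleResearchCrawler"
TARGET_URL = "https://example.com/docs/api.html"

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse the site's robots.txt

if rp.can_fetch(USER_AGENT, TARGET_URL):
    print("Allowed: the page may be crawled and considered as training data.")
else:
    print("Disallowed: a compliant pipeline drops this URL entirely.")
```

The debate above is essentially about how much model quality is lost when the "Disallowed" branch is taken at web scale.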

Open Training Data & Legal Constraints

  • Strong enthusiasm for transparent, reproducible training data; this is seen as a major differentiator vs existing “open-weight” models.
  • Clarification: the data is “transparent and reproducible,” but likely not fully redistributable due to copyright; expect recipes/URLs rather than a packaged dataset (a reproduction sketch follows this list).
  • Practical issues raised: web content mutability, huge size (tens of TB), and legal differences:
    • One commenter claims EU text-and-data-mining exceptions allow training on copyrighted data if opt-outs are respected.
    • Another counters that EU authorities say those exceptions don’t apply to LLM training at scale, and that Swiss law requires licenses. This remains unresolved in the thread.
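
Because the dataset will likely be published as a recipe (URL lists plus filtering steps) rather than as redistributable text, reproducing it roughly means re-fetching and verifying documents. The snippet below is a hypothetical sketch of that idea; the recipe format and checksum field are assumptions, and it also illustrates the mutability problem raised above: re-fetched pages may no longer match a published checksum.

```python
import hashlib
import urllib.request

# Hypothetical recipe entry: a published (url, sha256-of-original-bytes) pair.
RECIPE = [
    ("https://example.org/some/page.html", "<expected-sha256-hex>"),
]

def fetch_sha256(url: str) -> str:
    """Download a document and return the SHA-256 of its raw bytes."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return hashlib.sha256(resp.read()).hexdigest()

for url, expected in RECIPE:
    actual = fetch_sha256(url)
    # A mismatch means the page changed (or vanished) since the recipe was built.
    print(url, "match" if actual == expected else "content drifted or unavailable")
```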

Architecture, Scale, and Benchmarks

  • The model will ship in 8B and 70B parameter variants under the Apache 2.0 license; many commenters want concrete benchmark tables against LLaMA, DeepSeek, Teuken, EuroLLM, etc.
  • Someone involved in the project states:
    • It is trained from scratch with their own architecture, not a LLaMA finetune.
    • The main data source is FineWeb2, passed through compliance, toxicity, and quality filters (FineWeb2-HQ) while still retaining 1800+ language/script pairs (a loading sketch follows this list).
  • Some worry that 70B lags frontier mega-models (e.g. MoE designs with hundreds of billions of parameters), while others note that 70B is a sweet spot for strong capability plus on-premise usability (see the memory arithmetic below).
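
As a rough illustration of the stated data pipeline, the sketch below streams one language subset of FineWeb2 with the Hugging Face datasets library; the repository and config identifiers are assumptions taken from the public FineWeb2 / FineWeb2-HQ dataset cards, not details confirmed by the project.

```python
from datasets import load_dataset  # pip install datasets

# Assumed identifiers -- verify against the FineWeb2 / FineWeb2-HQ dataset cards.
REPO = "HuggingFaceFW/fineweb-2"   # the filtered variant is reportedly "epfml/FineWeb2-HQ"
CONFIG = "deu_Latn"                # one language/script pair, e.g. German in Latin script

# Stream rather than download: the full corpus is far too large for local disk.
ds = load_dataset(REPO, name=CONFIG, split="train", streaming=True)

for i, doc in enumerate(ds):
    print(doc["text"][:200].replace("\n", " "))  # peek at a few documents
    if i == 2:
        break
```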
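
To make the on-premise argument concrete, here is back-of-the-envelope arithmetic for weight memory only (ignoring KV cache and activations); the quantization levels are illustrative, not announced release formats.

```python
# Approximate memory needed just to hold the weights of a dense model.
def weight_memory_gib(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * 1e9 * bytes_per_param / 1024**3

for params in (8, 70):
    for label, bytes_pp in (("bf16", 2.0), ("int8", 1.0), ("int4", 0.5)):
        print(f"{params}B @ {label}: ~{weight_memory_gib(params, bytes_pp):.0f} GiB")

# 70B in bf16 is ~130 GiB (multi-GPU territory), while int4 is ~33 GiB --
# within reach of a single high-memory GPU, hence the "sweet spot" argument.
```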

Multilingual Modeling

  • Interest in coverage of the 24 official EU languages and in the impact of quality filtering on multilingual performance.
  • Tokenization challenges are noted: common approaches are biased toward English subwords, which can hurt other languages (illustrated after this list).
  • Preliminary cited research suggests quality filtering can partially mitigate the “curse of multilinguality,” but the effect at large scale is still “open.”
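
The English bias is easy to see by counting subword tokens for comparable sentences; the sketch below uses the English-centric GPT-2 BPE purely as an illustration, not the project's actual tokenizer.

```python
from transformers import AutoTokenizer  # pip install transformers

tok = AutoTokenizer.from_pretrained("gpt2")  # byte-level BPE with an English-heavy vocabulary

samples = {
    "English": "The weather is very nice today.",
    "German": "Das Wetter ist heute sehr schön.",
    "Finnish": "Sää on tänään todella kaunis.",
}

for lang, text in samples.items():
    n_tokens = len(tok.tokenize(text))
    # More tokens per sentence means a shorter effective context window and
    # more compute per word of that language during training and inference.
    print(f"{lang}: {n_tokens} tokens for {len(text.split())} words")
```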

Public Infrastructure, Motivation, and Critiques

  • Supporters frame this as:
    • Building sovereign, European, non-US/China, non‑“enshittified” AI infrastructure.
    • A high‑impact use of university supercomputers and a way to train the next generation of large‑scale ML researchers.
  • Critics compare it to designing an internal‑combustion car in the EV era, questioning:
    • What this adds over existing open‑weight models.
    • Whether such large-scale training is a “gross use” of public compute.
  • Proponents respond that:
    • Fully open models with open data, methods, and infrastructure experience are valuable in themselves.
    • Academic projects often have broader goals than short‑term capability: independence, transparency, and education.

Announcement Timing & Missing Details

  • Some question announcing before release; others note the timing aligns with an open-source LLM summit and likely helps funding and ecosystem building.
  • Open questions in the thread:
    • Exact context length.
    • Detailed benchmark results.
    • How well the model will perform on scientific/math/code tasks and sustainability-related applications.