OpenEuroLLM: Open LLMs for Transparent AI in Europe

Current State of OpenEuroLLM

  • The thread notes there is only a press release/front page so far; no models have been released yet.
  • The project points to prior “pilot LLMs” and large existing datasets from earlier EU projects, so it is not starting entirely from scratch, but concrete technical details remain unclear.

Budget, Compute and Feasibility

  • The official budget (~€37–52M, depending on source) is widely seen as an order of magnitude too small for frontier-scale efforts once hardware, energy, experimentation, and staff are counted.
  • Some argue EuroHPC supercomputers (Leonardo, LUMI, JUWELS, etc.) and upcoming AI clusters provide substantial “free” compute that effectively enlarges the budget.
  • Others counter that these clusters are modest by frontier-LLM standards, and that taking DeepSeek’s claimed low training costs at face value would be a mistake.
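The budget dispute above can be made concrete with a back-of-envelope sketch. All numbers below are illustrative assumptions, not figures from the project: the common C ≈ 6·N·D compute rule, an H100-class peak of ~989 TFLOP/s at 40% utilization, and €2.50 per GPU-hour.

```python
# Back-of-envelope LLM training-cost sketch. Every constant here is an
# assumption for illustration, not a figure from OpenEuroLLM.

def training_cost_eur(params, tokens, peak_flops=989e12, mfu=0.40,
                      eur_per_gpu_hour=2.5):
    """Estimate the cost of one training run via the C ~ 6*N*D rule."""
    total_flops = 6 * params * tokens              # compute for one full run
    gpu_seconds = total_flops / (peak_flops * mfu)  # effective throughput
    gpu_hours = gpu_seconds / 3600
    return gpu_hours * eur_per_gpu_hour

# Hypothetical Llama-3-class run: 70B parameters on 15T tokens.
single_run = training_cost_eur(70e9, 15e12)
print(f"single final run: ~EUR {single_run / 1e6:.0f}M")

# Labs typically spend several times the final-run compute on ablations
# and failed experiments; assume a 5x multiplier here.
print(f"with experimentation: ~EUR {5 * single_run / 1e6:.0f}M")
```

Under these assumptions a single final run costs on the order of €10M, but with a plausible experimentation multiplier the compute alone already exceeds the quoted €37–52M, before staff, hardware, and energy overheads, which is roughly the skeptics' point; conversely, "free" EuroHPC allocations would offset exactly this line item, which is the optimists' point.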

Regulation, “European values” and Data Legality

  • Strong skepticism that training only on “legally clean” data within the EU regulatory framework can yield competitive models, especially for smaller languages.
  • Counter-argument: good models can be trained on textbooks, legal ebooks, public-domain and free works, without scraping social media or pop culture.
  • Dispute over practicality and cost of licensing large book corpora, and over whether synthetic data from existing frontier models is legally and politically acceptable in an EU transparency-branded project.

Multilingual and Small-Language Performance

  • Mixed experiences reported: Mistral models praised for English, German, Dutch, Romanian, but seen as weaker in some Slavic languages; Gemma, Llama 3.1 and DeepSeek are cited as strong in niche languages like Finnish.
  • Consensus that truly high-quality models for small languages with limited corpora (hundreds of millions of tokens) will likely require synthetic data; without it, results are expected to be weak.
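The corpus-size claim can be sanity-checked with Chinchilla-style arithmetic. The sketch below assumes the widely cited compute-optimal ratio of roughly 20 training tokens per parameter (Hoffmann et al., 2022); the epoch counts are illustrative.

```python
# Rough Chinchilla-style sizing sketch. The 20 tokens/parameter ratio is
# the commonly cited compute-optimal rule of thumb; epoch counts are
# illustrative assumptions.

TOKENS_PER_PARAM = 20

def compute_optimal_params(corpus_tokens, epochs=1):
    """Largest model the corpus trains compute-optimally, allowing repeats."""
    return corpus_tokens * epochs / TOKENS_PER_PARAM

# A small-language corpus of ~500M tokens:
print(f"{compute_optimal_params(5e8) / 1e6:.0f}M params at 1 epoch")
print(f"{compute_optimal_params(5e8, epochs=4) / 1e6:.0f}M params at 4 epochs")
```

Even allowing several epochs of repetition, a few hundred million native tokens supports a model in the tens-to-hundreds of millions of parameters, far below the multi-billion scale of useful assistants, which is why the thread converges on synthetic or translated data for small languages.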

EU Strategy: Regulation vs Sovereignty

  • One camp: EU can safely “lead in legislation,” reuse open frontier models (DeepSeek, Llama), and focus on preventing abuses (social/credit scoring) rather than chasing the frontier.
  • Opposing camp: relying on US/Chinese models creates strategic and political dependence and embeds foreign biases; EU needs its own strong models and even chip autonomy.

Academia, Grants and “Death by Committee”

  • Many expect a typical EU pattern: large multi-party consortia, heavy bureaucracy, reports and conferences, weak incentives, and little usable output (“translation: a few fine-tunes of Llama plus travel grants”).
  • Others with experience of Horizon and other EU projects push back, describing strict milestones, audits, and some real successes (e.g. Firefox’s local translations, large scientific projects like CERN).
  • Concern that 20+ institutions and unclear commercial ownership will slow execution and hinder continuous improvement needed to compete in live markets.

Openness, Expectations and Usefulness

  • Promise that models, code, data and evaluation will be “fully open” is seen as the main differentiator if training data truly ships.
  • Some say a slightly worse-than-Llama, fully transparent EU model would still be valuable for public institutions and compliance-sensitive use.
  • Overall sentiment skews skeptical: optimism about the goal and more open models in Europe, but low confidence that this structure, budget and regulatory constraints will yield a model close to current frontier systems.