EuroLLM: LLM made in Europe built to support all 24 official EU languages
Linguistic Scope and Classification
- Thread starts by listing the 24 official EU languages and noting their families: mostly Indo‑European, with Maltese as Semitic (Afro‑Asiatic), and Finnish/Estonian/Hungarian as Uralic.
- Long side-thread on whether Baltic and Slavic should be grouped as “Balto‑Slavic” and how close various Slavic subgroups actually are in practice.
- Many comparisons of “language vs dialect” for German/Swiss German, Chinese varieties, Hindi/Urdu, Scots/English, Flemish/Dutch, etc., stressing that the boundary is largely political and social.
Maltese Focus
- Multiple questions to native speakers about Maltese: name (“Il‑Malti”), Arabic roots, loanwords from Italian/English, and how mutually intelligible it is with North African and Levantine Arabic.
- Experiences differ: some Arabic speakers report Maltese is “surprisingly easy to follow”; others say resemblance is deceptive and it’s not mutually intelligible after ~1000 years of divergence.
- Discussion of heavy code‑switching between Maltese and English, loanwords, and concerns about long‑term language vitality; locals say Maltese is still widely used at home and in media.
Non‑official and Regional Languages
- Debate on why Frisian, Basque, Catalan, Galician, etc. are not in the “24 languages” list: the explanation given is that EU official status follows the languages member states put forward (roughly one per state), while others fall under “regional/minority” charters.
- Speaker numbers for Irish and Frisian are compared; some argue historical suppression justifies stronger protection for Irish despite fewer native speakers.
- Ulster Scots, Flemish, and other regional varieties spark arguments about authenticity, politicization, and codification vs genuine community use.
Model Coverage, Quality and Benchmarks
- EuroLLM supports the 24 EU languages plus 11 extra (e.g. Russian, Arabic, Catalan, Norwegian, Ukrainian).
- Benchmarks on Hugging Face and the paper show the 9B model roughly comparable to 2024-era 9B models (e.g. Gemma‑2‑9B) but far from current frontier systems; MMLU‑Pro is only modestly above chance.
- Some users report it’s markedly better than other open models for small languages like Latvian, but overall “a bit dumb” for coding, tool use, and reasoning.
- Observed issues: confusion between very similar languages (e.g. Lithuanian vs Latvian), and generally weaker abilities than English‑centric frontier models.
Why a Dedicated European LLM?
- One side argues major US/Chinese models already cover all these languages, so this is redundant and worse-performing.
- Supporters counter that multilingual capability degrades sharply away from English, and that data balance/quality per language matters.
- Others emphasize legal, sovereignty, and cultural reasons: a model trained on “homegrown EU data,” aligned with EU laws and values, and not dependent on US platforms.
European AI Strategy and Funding
- EuroLLM is funded via Horizon 2020/Horizon Europe and trained on EuroHPC public supercomputers; some see this as modest, non‑commercial research, not a “frontier race”.
- Broader debate about Europe’s tech lag vs US/China: weaker capital markets, fragmented regulations, language and legal diversity, and limited scale compared to the US single market.
- Strong disagreement over regulation and grants: some say EU bureaucracy and compliance kill innovation; others argue VC is the real bottleneck and public research funding is essential and relatively well‑run.
Reception and Practicalities
- Mixed reactions: enthusiasm for multilingual, open European models; skepticism about real-world usefulness given middling benchmarks and a release that is already a year old.
- Some annoyance that downloading from Hugging Face requires sharing contact info, even though the model is released under Apache 2.0.
- A few users simply treat it as a valuable specialized translator/formatter for under‑resourced European languages, alongside more capable general models for reasoning and tools.