Norway's 2 petabytes of Huawei flash storage and LLM training
Hardware and Storage Setup
- 2 PB of Huawei flash is used as a fast preprocessing tier (dedupe, cleaning, normalization) for ~60 PB of raw data, not as total training data.
- Several commenters argue 2 PB is modest by modern standards; individuals can approach that in spinning disks. Others note cost and performance at this scale are still substantial.
- Some see the article as oddly focused on the storage brand and appliance rather than on overall system architecture.
- Concerns are raised about extreme density (multi-Tbit/s links into a 2U box) creating power/network “hotspots” and limited real-world throughput.
Feasibility of Training and Hardware Adequacy
- The Cray system (hundreds of GPUs, tens of thousands of CPU cores) is called “meager” by some, who doubt Norway can train a “fully fledged” frontier LLM and suspect waste.
- Others counter that comparable clusters have trained strong models, that most labs experiment on <500 GPUs, and that this is sufficient for national-scale work and institutional learning.
- Some see it as an impressive proof-of-concept rather than a final, globally competitive system.
Sovereign LLM vs Using Frontier Models
- Core argument for a sovereign Norwegian LLM:
- Reduce dependence on foreign, especially US, corporations.
- Capture local history, law, news, and culture, including dialects and older orthography.
- Avoid American-centric style, morals, and values being imposed even when chatting in Norwegian.
- Skeptics ask who will actually use a weaker domestic model, and argue better search/agents over the library corpus or sharing curated Norwegian datasets with major labs might yield more utility.
- Some frame the project as cultural preservation and “applied humanities,” not pure economic efficiency.
Language, Tokenization, and Cultural Representation
- Multiple comments note that current English-heavy tokenizers make Norwegian and similar languages 1.5–3x more token-expensive, affecting cost and sometimes quality.
- Frontier models often “think” and search in English, then translate, leading to style drift and cultural misalignment even when the output language is correct.
- Debate over whether English-trained models already “know Norwegian well enough” versus needing a dedicated model to capture nuance and non‑Anglo tone.
Data, Copyright, and Access
- Norway’s National Library holds a uniquely comprehensive, legally deposited digital collection; private companies lack equivalent access.
- Special agreements allow LLM training on copyrighted news and other material that cannot simply be released as an open dataset.
- Some would prefer more public access to out‑of‑copyright data; legal and institutional constraints limit that, making LLM training a workaround for exploiting the archive.