2026-05-25

Norway's 2 petabytes of Huawei flash storage and LLM training

Hardware and Storage Setup

2 PB of Huawei flash is used as a fast preprocessing tier (dedupe, cleaning, normalization) for ~60 PB of raw data, not as total training data.
Several commenters argue 2 PB is modest by modern standards; individuals can approach that in spinning disks. Others note cost and performance at this scale are still substantial.
Some see the article as oddly focused on the storage brand and appliance rather than on overall system architecture.
Concerns are raised about extreme density (multi-Tbit/s links into a 2U box) creating power/network “hotspots” and limited real-world throughput.

Feasibility of Training and Hardware Adequacy

The Cray system (hundreds of GPUs, tens of thousands of CPU cores) is called “meager” by some, who doubt Norway can train a “fully fledged” frontier LLM and suspect waste.
Others counter that comparable clusters have trained strong models, that most labs experiment on <500 GPUs, and that this is sufficient for national-scale work and institutional learning.
Some see it as an impressive proof-of-concept rather than a final, globally competitive system.

Sovereign LLM vs Using Frontier Models

Core argument for a sovereign Norwegian LLM:
- Reduce dependence on foreign, especially US, corporations.
- Capture local history, law, news, and culture, including dialects and older orthography.
- Avoid American-centric style, morals, and values being imposed even when chatting in Norwegian.
Skeptics ask who will actually use a weaker domestic model, and argue better search/agents over the library corpus or sharing curated Norwegian datasets with major labs might yield more utility.
Some frame the project as cultural preservation and “applied humanities,” not pure economic efficiency.

Language, Tokenization, and Cultural Representation

Multiple comments note that current English-heavy tokenizers make Norwegian and similar languages 1.5–3x more token-expensive, affecting cost and sometimes quality.
Frontier models often “think” and search in English, then translate, leading to style drift and cultural misalignment even when the output language is correct.
Debate over whether English-trained models already “know Norwegian well enough” versus needing a dedicated model to capture nuance and non‑Anglo tone.

Data, Copyright, and Access

Norway’s National Library holds a uniquely comprehensive, legally deposited digital collection; private companies lack equivalent access.
Special agreements allow LLM training on copyrighted news and other material that cannot simply be released as an open dataset.
Some would prefer more public access to out‑of‑copyright data; legal and institutional constraints limit that, making LLM training a workaround for exploiting the archive.

Related topics