Norway's 2 petabytes of Huawei flash storage and LLM training

Hardware and Storage Setup

  • 2 PB of Huawei flash is used as a fast preprocessing tier (dedupe, cleaning, normalization) for ~60 PB of raw data, not as total training data.
  • Several commenters argue 2 PB is modest by modern standards; individuals can approach that in spinning disks. Others note cost and performance at this scale are still substantial.
  • Some see the article as oddly focused on the storage brand and appliance rather than on overall system architecture.
  • Concerns are raised about extreme density (multi-Tbit/s links into a 2U box) creating power/network “hotspots” and limited real-world throughput.

Feasibility of Training and Hardware Adequacy

  • The Cray system (hundreds of GPUs, tens of thousands of CPU cores) is called “meager” by some, who doubt Norway can train a “fully fledged” frontier LLM and suspect waste.
  • Others counter that comparable clusters have trained strong models, that most labs experiment on <500 GPUs, and that this is sufficient for national-scale work and institutional learning.
  • Some see it as an impressive proof-of-concept rather than a final, globally competitive system.

Sovereign LLM vs Using Frontier Models

  • Core argument for a sovereign Norwegian LLM:
    • Reduce dependence on foreign, especially US, corporations.
    • Capture local history, law, news, and culture, including dialects and older orthography.
    • Avoid American-centric style, morals, and values being imposed even when chatting in Norwegian.
  • Skeptics ask who will actually use a weaker domestic model, and argue better search/agents over the library corpus or sharing curated Norwegian datasets with major labs might yield more utility.
  • Some frame the project as cultural preservation and “applied humanities,” not pure economic efficiency.

Language, Tokenization, and Cultural Representation

  • Multiple comments note that current English-heavy tokenizers make Norwegian and similar languages 1.5–3x more token-expensive, affecting cost and sometimes quality.
  • Frontier models often “think” and search in English, then translate, leading to style drift and cultural misalignment even when the output language is correct.
  • Debate over whether English-trained models already “know Norwegian well enough” versus needing a dedicated model to capture nuance and non‑Anglo tone.

Data, Copyright, and Access

  • Norway’s National Library holds a uniquely comprehensive, legally deposited digital collection; private companies lack equivalent access.
  • Special agreements allow LLM training on copyrighted news and other material that cannot simply be released as an open dataset.
  • Some would prefer more public access to out‑of‑copyright data; legal and institutional constraints limit that, making LLM training a workaround for exploiting the archive.