How CERN serves 1EB of data via FUSE [video]

Storage Scale, Cost, and Architecture

  • CERN stores ~1 EB, mostly on Ceph and a homegrown distributed filesystem (EOS) over commodity hardware; commercial systems are mainly for tape.
  • A cited cost of ~1 CHF/TB/month (10+2 erasure coding) is debated:
    • Some see it as expensive at this scale and want a breakdown (hardware, staff, DC, networking, tape, etc.).
    • Others call it very cheap compared to universities charging >100× more, especially given tape, networking, and availability.
  • Bandwidth is noted as a major cost factor, though less so when the data stays within a few data centers.
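
To make the cost debate concrete, here is a back-of-envelope sketch in Python that uses only the figures quoted in the thread (~1 EB, ~1 CHF/TB/month, 10+2 erasure coding). It assumes decimal units (1 EB = 1,000,000 TB) and that the quoted price applies to usable rather than raw capacity; the thread confirms neither, and the function names are illustrative.

```python
# Back-of-envelope storage cost from the figures quoted in the thread.
# Assumptions (not confirmed by the source): decimal units, and the
# ~1 CHF/TB/month price applies to usable capacity, not raw disk.

TB_PER_EB = 1_000_000


def erasure_overhead(data_chunks: int, parity_chunks: int) -> float:
    """Raw bytes stored per usable byte for a k+m erasure code."""
    return (data_chunks + parity_chunks) / data_chunks


def monthly_cost_chf(usable_tb: float, chf_per_tb_month: float) -> float:
    """Monthly bill for a given usable capacity and unit price."""
    return usable_tb * chf_per_tb_month


usable_tb = 1 * TB_PER_EB
overhead = erasure_overhead(10, 2)          # 10 data + 2 parity chunks
raw_tb = usable_tb * overhead               # disk actually consumed
monthly = monthly_cost_chf(usable_tb, 1.0)  # ~1 CHF/TB/month
yearly = 12 * monthly

print(f"raw/usable overhead: {overhead:.2f}x")
print(f"raw capacity needed: {raw_tb:,.0f} TB")
print(f"monthly: {monthly:,.0f} CHF, yearly: {yearly:,.0f} CHF")
```

Under these assumptions, 10+2 coding needs 1.2 bytes of raw disk per usable byte, and the quoted rate works out to roughly 12 million CHF per year for the full exabyte. What that figure actually covers (staff, power, networking, tape) is exactly the breakdown some commenters ask for.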

Data Ingestion, Reduction, and Backup

  • Experiments can generate ~1 PB/s of raw data; multi-stage trigger systems (including GPU-based software triggers) discard ~97% or more.
  • Backups rely heavily on tape and globally distributed replicas; not every tape is itself backed up, both because of cost and because a small statistical loss of data is acceptable.
  • Rucio is used to manage and replicate datasets across heterogeneous storage backends worldwide.
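
The reduction figures above can be turned into a rough data-rate estimate. The sketch below treats the trigger chain as a sequence of stages that each keep some fraction of their input. Only the ~1 PB/s input and the ~97% overall discard come from the thread; the two-stage split (keep 10%, then keep 30%) is a hypothetical example chosen to multiply out to 3%, not a description of any real trigger.

```python
# Toy model of a multi-stage trigger: each stage keeps a fraction of its input.
# The per-stage fractions below are hypothetical; only the ~1 PB/s input and
# the ~97% overall discard are taken from the thread.

def surviving_rate(raw_rate: float, stage_keep_fractions: list[float]) -> float:
    """Return the data rate left after applying every stage in order."""
    rate = raw_rate
    for keep in stage_keep_fractions:
        if not 0.0 <= keep <= 1.0:
            raise ValueError(f"keep fraction out of range: {keep}")
        rate *= keep
    return rate


RAW_TB_PER_S = 1000.0            # ~1 PB/s expressed in TB/s (decimal units)
HYPOTHETICAL_STAGES = [0.1, 0.3]  # keeps 3% overall, i.e. discards 97%

kept = surviving_rate(RAW_TB_PER_S, HYPOTHETICAL_STAGES)
print(f"kept after triggers: {kept:.1f} TB/s")
```

Because the thread says "97% or more", the result is an upper bound on what is kept; a more aggressive trigger would leave correspondingly less to store.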

FUSE and Filesystem Concerns

  • FUSE is central to the approach, but performance concerns remain:
    • Context switching and metadata-heavy workloads are pain points.
    • Read-ahead and new features like FUSE passthrough help; io_uring integration is still work-in-progress.
  • Some users report issues with inotify over SSHFS/FUSE in containerized setups.
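
To illustrate why read-ahead helps with the per-request overhead mentioned above, here is a toy model in plain Python. It does not use FUSE or any kernel interface: `CountingBackend` and `ReadAheadReader` are hypothetical names, and each `fetch` call simply stands in for one expensive round trip (for example, a request crossing from kernel to userspace and back). Real FUSE read-ahead is handled by the kernel and is more sophisticated than this.

```python
# Toy model of read-ahead: count how many backend round trips a sequential
# read needs with and without prefetching. Not real FUSE code; each fetch()
# stands in for one costly request.

class CountingBackend:
    """Holds the file contents and counts how often it is asked for data."""

    def __init__(self, data: bytes):
        self.data = data
        self.requests = 0

    def fetch(self, offset: int, size: int) -> bytes:
        self.requests += 1
        return self.data[offset:offset + size]


class ReadAheadReader:
    """Serves reads from a buffer, fetching `window` bytes on each miss."""

    def __init__(self, backend: CountingBackend, window: int):
        self.backend = backend
        self.window = window
        self.buf_start = 0
        self.buf = b""

    def read(self, offset: int, size: int) -> bytes:
        buf_end = self.buf_start + len(self.buf)
        if not (self.buf_start <= offset and offset + size <= buf_end):
            # Miss: fetch the requested range plus the read-ahead window.
            self.buf_start = offset
            self.buf = self.backend.fetch(offset, max(size, self.window))
        rel = offset - self.buf_start
        return self.buf[rel:rel + size]


def sequential_read(reader: ReadAheadReader, total: int, chunk: int) -> bytes:
    """Read `total` bytes front to back in `chunk`-sized pieces."""
    out = bytearray()
    for offset in range(0, total, chunk):
        out += reader.read(offset, chunk)
    return bytes(out)


data = bytes(range(256)) * 256  # 64 KiB of sample data

plain_backend = CountingBackend(data)
plain = sequential_read(ReadAheadReader(plain_backend, window=0), len(data), 4096)

ahead_backend = CountingBackend(data)
ahead = sequential_read(ReadAheadReader(ahead_backend, window=32768), len(data), 4096)

print("requests without read-ahead:", plain_backend.requests)
print("requests with 32 KiB read-ahead:", ahead_backend.requests)
```

The same bytes come back in both cases, but the prefetching reader issues far fewer backend requests for a sequential scan. Metadata-heavy or random-access workloads do not benefit in the same way, which is consistent with them being singled out as pain points.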

Budget, Resourcing, and Talent

  • CERN’s overall budget (1.4B EUR) and IT slice (50M EUR) are described as modest for the scale; rising energy costs even reduced accelerator run time.
  • Many argue “unlimited budget” is a myth; success comes from small, highly skilled, highly motivated teams and large in-kind contributions from member institutes.
  • Salaries and facilities are portrayed as unglamorous by Swiss standards, but the scientific mission attracts top talent.

Open Source vs Microsoft Ecosystem

  • There was a major initiative to move away from Microsoft products toward open source; commenters say this was later reversed under new leadership.
  • Some in the thread criticize a perceived “Microsoft push” (including partnerships/programs) and lament degraded user experience.
  • Others investigate leadership backgrounds and see this as part of broader institutional strategy rather than simple vendor capture.

Value of High-Energy Physics

  • A debate asks what practical social/economic benefits high-energy physics has produced.
  • Responses emphasize:
    • Basic research’s intrinsic value, not goal-driven utility.
    • “Side effects” such as advanced sensors, magnets, cryogenics, control and data systems.
    • Synchrotron light sources as a notable direct spin-off, heavily used in materials science and structural biology (e.g., early COVID-19 studies).

Reproducibility and Long-Term Data Retention

  • Experiments are, in principle, reproducible, but keeping historical data is crucial for accumulating enough statistics to separate a real signal from sensor noise.
  • Past “almost discoveries” that later resolved into noise underscore the need for long-term, large-scale datasets.
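
The link between dataset size and significance can be sketched with the common rough approximation Z ≈ s/√b, where s is the expected number of signal events and b the expected background. If both grow in proportion to the amount of data collected, Z grows only with the square root of the dataset size. The event counts below are hypothetical and are not taken from the thread or from any experiment.

```python
# Rough counting-experiment model: significance Z ~ s / sqrt(b).
# s and b per unit of data are hypothetical illustration values.
import math

SIGNAL_PER_UNIT = 30.0        # expected signal events per unit of data
BACKGROUND_PER_UNIT = 1000.0  # expected background events per unit of data


def significance(data_units: float) -> float:
    """Approximate significance after collecting `data_units` of data."""
    s = SIGNAL_PER_UNIT * data_units
    b = BACKGROUND_PER_UNIT * data_units
    return s / math.sqrt(b)


def data_needed(target_z: float) -> float:
    """Units of data needed to reach `target_z` under the same model."""
    return target_z ** 2 * BACKGROUND_PER_UNIT / SIGNAL_PER_UNIT ** 2


for units in (1, 4, 16):
    print(f"{units:>2} units of data -> Z = {significance(units):.2f}")

print(f"data needed for Z = 5: {data_needed(5.0):.1f} units")
```

In this model, quadrupling the data only doubles the significance, so a fluctuation seen in a small sample can fade once more data arrives, and confirming a genuine effect requires keeping and combining many years of recorded events.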

CERN as a Place and Culture

  • Commenters contrast the strong “mission” motivation at CERN with typical profit-driven tech work such as adtech and banking.
  • CERN’s museum, tours, and on-site exhibits (including old accelerators and historical hardware) are praised as uniquely good at explaining cyberinfrastructure and big science.