How CERN serves 1EB of data via FUSE [video]
Storage Scale, Cost, and Architecture
- CERN stores ~1 EB, mostly on Ceph and a homegrown distributed filesystem (EOS) over commodity hardware; commercial systems are mainly for tape.
- A cited cost of ~1 CHF/TB/month (10+2 erasure coding) is debated:
  - Some see it as expensive at this scale and want a breakdown (hardware, staff, DC, networking, tape, etc.).
  - Others call it very cheap compared to universities charging >100× more, especially given tape, networking, and availability.
- Bandwidth is noted as a major factor, but less so if data stays within a few data centers.
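To put the debated figure in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes decimal units (1 EB = 1,000,000 TB) and that the quoted ~1 CHF/TB/month applies to usable capacity rather than raw capacity; the thread does not say which, and the function names are purely illustrative.

```python
# Back-of-the-envelope sketch; all inputs are rough figures quoted in the thread.

def raw_capacity_tb(usable_tb, data_shards=10, parity_shards=2):
    """Raw disk capacity needed to hold usable_tb under k+m erasure coding."""
    return usable_tb * (data_shards + parity_shards) / data_shards

def monthly_cost_chf(usable_tb, chf_per_tb_month=1.0):
    """Monthly cost, ASSUMING the quoted price is per usable TB."""
    return usable_tb * chf_per_tb_month

USABLE_TB = 1_000_000  # ~1 EB in decimal units

if __name__ == "__main__":
    print("raw TB needed with 10+2:", raw_capacity_tb(USABLE_TB))
    print("CHF per month:", monthly_cost_chf(USABLE_TB))
    print("CHF per year:", 12 * monthly_cost_chf(USABLE_TB))
```

Under these assumptions, 10+2 erasure coding adds 20% raw capacity on top of the usable data, and the quoted rate works out to roughly 1M CHF per month, or about 12M CHF per year, for the full exabyte. If the price were instead per raw TB, the total would be about 20% higher.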
Data Ingestion, Reduction, and Backup
- Experiments can generate ~1 PB/s of raw data; multi-stage trigger systems (including GPU-based software triggers) discard ~97% or more.
- Backups rely heavily on tape and globally distributed replicas; not all tapes are themselves backed up due to cost and acceptable statistical loss.
- Rucio is used to manage and replicate datasets across heterogeneous storage backends worldwide.
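The reduction figures quoted at the start of this section can be sketched as simple arithmetic. This is only an illustration under the numbers given in the thread (~1 PB/s raw, ~97% discarded), using decimal units; the function name is hypothetical, and since the thread says "97% or more", the result is an upper bound on what is kept.

```python
# Illustrative arithmetic only; rates are the approximate figures from the thread.

def retained_rate_tb_s(raw_pb_s=1.0, discard_fraction=0.97):
    """Data rate (TB/s) surviving the triggers, given the fraction discarded."""
    if not 0.0 <= discard_fraction <= 1.0:
        raise ValueError("discard_fraction must be between 0 and 1")
    return raw_pb_s * 1000.0 * (1.0 - discard_fraction)

if __name__ == "__main__":
    # Roughly 30 TB/s would remain if exactly 97% were discarded.
    print(f"retained: ~{retained_rate_tb_s():.0f} TB/s")
```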
FUSE and Filesystem Concerns
- FUSE is central to the approach, but performance concerns remain:
  - Context switching and metadata-heavy workloads are pain points.
  - Read-ahead and new features like FUSE passthrough help; io_uring integration is still work-in-progress.
- Some users report issues with inotify over SSHFS/FUSE in containerized setups.
Budget, Resourcing, and Talent
- CERN’s overall budget (1.4B EUR) and IT slice (50M EUR) are described as modest for the scale; rising energy costs even reduced accelerator run time.
- Many argue “unlimited budget” is a myth; success comes from small, highly skilled, highly motivated teams and large in-kind contributions from member institutes.
- Salaries and facilities are portrayed as unglamorous by Swiss standards, but the scientific mission attracts top talent.
Open Source vs Microsoft Ecosystem
- There was a major initiative to move away from Microsoft products toward open source; commenters say this later reversed under new leadership.
- Some in the thread criticize a perceived “Microsoft push” (including partnerships/programs) and lament degraded user experience.
- Others investigate leadership backgrounds and see this as part of broader institutional strategy rather than simple vendor capture.
Value of High-Energy Physics
- A debate asks what practical social/economic benefits high-energy physics has produced.
- Responses emphasize:
  - Basic research’s intrinsic value, not goal-driven utility.
  - “Side effects” such as advanced sensors, magnets, cryogenics, control and data systems.
  - Synchrotron light sources as a notable direct spin-off, heavily used in materials science and structural biology (e.g., early COVID-19 studies).
Reproducibility and Long-Term Data Retention
- Experiments are, in principle, reproducible, but keeping historical data is crucial to achieve statistical significance against sensor noise.
- Past “almost discoveries” that later resolved into noise underscore the need for long-term, large-scale datasets.
CERN as a Place and Culture
- Commenters highlight the strong “mission” motivation versus typical profit-driven tech work and contrast it with adtech/banking.
- CERN’s museum, tours, and on-site exhibits (including old accelerators and historical hardware) are praised as uniquely good at explaining cyberinfrastructure and big science.