My 71 TiB ZFS NAS After 10 Years and Zero Drive Failures
Drive longevity & power‑cycling
- Thread debates whether powering disks off extends life or increases risk.
- Some argue continuous running avoids wear from start/stop cycles, stiction, bearing issues, and inrush current.
- Others note many consumer/NAS drives already spin down frequently and are rated for large load/unload counts; for homelabs electricity savings may outweigh marginal wear.
- Several anecdotes of:
  - Old “stiction” problems, and drives that die after sitting powered off for years.
  - Bearings failing more often on always‑on systems vs. rarely on systems that spin down.
- A statistical back‑of‑envelope estimate using Backblaze annualized failure rates (AFRs) suggests 24 drives lasting 10 years with zero failures is “lucky but not extraordinary,” especially once early infant‑mortality failures are past.
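The back‑of‑envelope above is easy to reproduce; the 1.4% AFR below is an assumed figure roughly in line with published Backblaze fleet averages, not a number from the thread:

```python
# Probability that an array of identical drives survives N years with zero
# failures, assuming a constant annualized failure rate (AFR) and
# independent failures. AFR of 1.4% is an assumed, Backblaze-like value.

def zero_failure_probability(drives: int, years: int, afr: float) -> float:
    """P(no failures) = (1 - AFR)^(drives * years)."""
    return (1.0 - afr) ** (drives * years)

p = zero_failure_probability(drives=24, years=10, afr=0.014)
print(f"P(24 drives, 10 years, zero failures) = {p:.1%}")  # ~3.4%
```

Note that real failures are not independent (the thread itself mentions correlated failures in same‑batch disks), so this simple model tends to be optimistic; it still shows that a zero‑failure decade is unusual but far from impossible.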
Use cases for large home storage
- Common uses: media libraries (Plex/Jellyfin), photography/video (terabytes per project), ML datasets and models, torrents, Docker, personal archiving of web content, social media art, and conference talks.
- Some systems are mostly cold storage: backups or archives powered on only for sync or access.
ZFS, data integrity & ECC
- Many emphasize ZFS scrubs with block‑level checksums as key for detecting bit rot; scrubs are easy to schedule.
- ZFS checksums are per record/block, not whole‑file cryptographic hashes; some commenters layer file‑level hashes on top for end‑to‑end verification.
- ECC RAM is repeatedly described as important for serious data integrity; others note ECC can be hard/expensive to deploy on consumer hardware.
- Some have personal horror stories of silent corruption on non‑checksummed filesystems, motivating ZFS.
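Layering file‑level hashes on top of ZFS block checksums, as some commenters describe, could be sketched like this (the manifest format and the choice of SHA‑256 are assumptions for illustration, not details from the thread):

```python
import hashlib
from pathlib import Path

def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large media files fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(root: Path) -> dict[str, str]:
    """Map relative file paths under `root` to their SHA-256 digests."""
    return {
        str(p.relative_to(root)): sha256_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }
```

Comparing manifests from two machines (e.g. the NAS and an off‑site copy) catches corruption that each side's block checksums cannot see across systems, which is the point of adding a file‑level layer.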
RAID levels, mirrors, and backups
- Strong reminder: RAID/ZFS ≠ backup. Still need offline/air‑gapped or off‑site copies to handle user error, ransomware, or catastrophic failures.
- Several argue parity RAID (RAID5/6, RAIDZ) is overused at home:
  - Slow, risky rebuilds on large drives; correlated failures among same‑batch disks.
  - Mirrored vdevs or simple volumes plus good backups are seen as simpler, safer, and easier to expand.
- Others defend RAID6/RAIDZ2 for larger arrays, but stress drive diversity and rotation.
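The mirror‑vs‑parity trade‑off reduces to simple arithmetic; the 24‑drive layouts and 4 TB drive size below are illustrative assumptions, not configurations from the article:

```python
# Usable capacity and per-vdev fault tolerance for two illustrative
# 24-drive ZFS layouts (drive count and size are assumed examples).

DRIVES = 24
SIZE_TB = 4  # assumed per-drive capacity

# Twelve 2-way mirror vdevs: half the raw capacity goes to redundancy,
# but each vdev only tolerates one failure.
mirror_usable = (DRIVES // 2) * SIZE_TB

# Two 12-drive RAIDZ2 vdevs: two parity drives per vdev, so more usable
# space and two-failure tolerance per vdev, at the cost of slower rebuilds.
raidz2_usable = 2 * (12 - 2) * SIZE_TB

print(f"mirrors: {mirror_usable} TB usable, 1 failure tolerated per vdev")
print(f"raidz2:  {raidz2_usable} TB usable, 2 failures tolerated per vdev")
```

The numbers make the debate concrete: mirrors give up a third of the usable space of this RAIDZ2 layout in exchange for much faster, lower‑risk resilvers and easy expansion two drives at a time.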
Power, noise, cooling, and UPS
- Power‑off strategy can save thousands in electricity over a decade for a 200W‑idle NAS, especially in high‑tariff regions.
- Large, slow fans and good fan control (PID loops) significantly reduce noise and fan power draw.
- UPSes are valued not just for enabling clean shutdowns but also for smoothing brownouts and voltage spikes; some consider skipping a UPS an unjustified risk, while others accept it for home use.
- Offline powered‑down backups are also used as ransomware protection.
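The “thousands over a decade” claim is easy to sanity‑check. The 200 W idle draw comes from the discussion; the 4 h/day duty cycle and €0.30/kWh tariff below are assumed example values:

```python
# Back-of-envelope electricity savings from powering a NAS down.
# 200 W idle is from the discussion; the duty cycle and tariff are
# assumed example values for a high-tariff region.

IDLE_KW = 0.200         # idle draw in kilowatts
HOURS_OFF_PER_DAY = 20  # assumed: NAS only runs ~4 h/day
TARIFF_EUR_KWH = 0.30   # assumed high-tariff electricity price
YEARS = 10

saved_kwh = IDLE_KW * HOURS_OFF_PER_DAY * 365 * YEARS
saved_eur = saved_kwh * TARIFF_EUR_KWH
print(f"~{saved_kwh:.0f} kWh avoided, ~EUR {saved_eur:.0f} over {YEARS} years")
```

Under these assumptions the power‑off strategy avoids roughly 14,600 kWh, on the order of €4,000 over the decade, consistent with the thread's claim.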
Filesystem alternatives & experimental tech
- btrfs: mixed reputation; some report past data loss, while others report years of stable use when avoiding its built‑in RAID layer and sticking to snapshots, compression, and checksums.
- bcachefs: seen as promising (checksums, flexible caching) but currently marked experimental; kernel maintainer concerns and early breakages make people cautious about production data.
- General sentiment: for long‑lived important data, ZFS (or at least a mature checksumming FS) on well‑understood hardware is still the conservative choice.