So you want to build your own data center
Scope of the project (colo vs “building a DC”)
- Many point out the article is about cage colocation, not constructing a full facility.
- Some see the title as misleading clickbait; others argue “your own data center” is commonly used to mean racking your own hardware in a colo.
- Multiple comments distinguish between a greenfield DC build (building + power + cooling), a cage in an existing facility, and simple rack colo.
Motivations & economics
- Strong interest in cost breakdown; Railway deferred detailed numbers to a future post.
- Several argue cloud egress is “extortionate” and makes bare metal + colo vastly cheaper for bandwidth-heavy workloads.
- Back-of-envelope comparisons put 100 Gbps of IXP/transit capacity at a few thousand dollars a month versus millions in hyperscaler egress for comparable volume, though exact AWS/GCP pricing numbers are disputed (an illustrative calculation follows this list).
- Some warn that while hardware and transit can be cheap, you must factor in spares, repairs, on-call, and operational complexity.
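
To make the scale of that gap concrete, here is a rough, illustrative calculation. The per-GB egress price, the utilization, and the monthly cost of a 100G port are assumptions chosen for the sketch, not figures quoted in the thread.

```python
# Back-of-envelope comparison of hyperscaler egress vs. a 100G port at an IXP.
# All prices below are illustrative assumptions, not quotes.

GBPS = 100                       # committed bandwidth
UTILIZATION = 0.5                # assume the link runs half full on average
SECONDS_PER_MONTH = 30 * 24 * 3600

# Volume pushed in a month, in gigabytes (1 Gbps = 0.125 GB/s).
monthly_gb = GBPS * 0.125 * UTILIZATION * SECONDS_PER_MONTH

cloud_egress_per_gb = 0.05       # assumed blended $/GB after volume discounts
cloud_cost = monthly_gb * cloud_egress_per_gb

colo_port_cost = 3_000           # assumed monthly cost of a 100G IXP/transit port

print(f"{monthly_gb / 1e6:.1f} PB/month")           # ~16.2 PB at 50% utilization
print(f"cloud egress: ${cloud_cost:,.0f}/month")     # ~$810,000
print(f"IXP/transit:  ${colo_port_cost:,.0f}/month")
```

Even with generous discount assumptions the gap is two to three orders of magnitude, which is the core of the “egress is extortionate” argument; the spares, on-call, and remote-hands costs from the previous point sit on the other side of the ledger.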
Hardware, networking & design choices
- Servers: Supermicro x86; ARM was considered, but the currently available SKUs were judged too old for the scale and risk involved.
- Networking: whitebox switches running SONiC + FRR, eBGP down to the hosts, BGP unnumbered; the design is inspired by the book “BGP in the Data Center” (a config sketch follows this list).
- Current gen uses 25G leaf and 100G spine; commenters suggest alternative topologies (collapsed 100G per rack, CR3-class switches) for better $/Gbps.
- PXE-based boot via pixiecore + Debian netboot; BMC/Redfish orchestration (see the Redfish sketch below); a custom host agent manages QEMU VMs and BGP advertisement.
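
As a rough illustration of the BGP-unnumbered design mentioned above, the sketch below renders a minimal FRR configuration for a host peering with its top-of-rack switches over unnumbered interfaces. The ASN, interface names, and advertised prefix are made up; this is not Railway's actual configuration.

```python
# Render a minimal FRR configuration for a host doing eBGP-unnumbered to its
# top-of-rack switches and advertising its own /32 service address.
# ASN, interface names, and prefix are hypothetical.

FRR_TEMPLATE = """\
router bgp {asn}
 bgp router-id {router_id}
{neighbors}
 address-family ipv4 unicast
  network {service_prefix}
 exit-address-family
"""

def render_frr_conf(asn: int, router_id: str, uplinks: list[str], service_prefix: str) -> str:
    # "neighbor <iface> interface remote-as external" is FRR's BGP-unnumbered form:
    # the session runs over the interface's IPv6 link-local address, so no
    # per-link IPv4 addressing plan is needed.
    neighbors = "\n".join(
        f" neighbor {iface} interface remote-as external" for iface in uplinks
    )
    return FRR_TEMPLATE.format(
        asn=asn, router_id=router_id, neighbors=neighbors, service_prefix=service_prefix
    )

print(render_frr_conf(65101, "10.0.0.11", ["eth0", "eth1"], "10.0.0.11/32"))
```

The appeal of the unnumbered form is that every host can use the same template: the eBGP sessions come up over link-local addresses, and only the router ID and advertised prefix differ per machine.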
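
On the provisioning side, a hedged sketch of what the BMC/Redfish step might look like: set a one-time PXE boot override and power-cycle the node so it boots into the netboot environment served by pixiecore. The BMC address, credentials, and system ID are placeholders; the endpoints and property names come from the standard Redfish ComputerSystem schema.

```python
# Ask a node's BMC (via Redfish) to PXE-boot once, then reboot it so it lands
# in the netboot environment (e.g. pixiecore serving a Debian installer).
# BMC address, credentials, and the system ID are placeholders.
import requests

BMC = "https://10.0.10.11"              # hypothetical BMC address
AUTH = ("admin", "changeme")            # placeholder credentials
SYSTEM = f"{BMC}/redfish/v1/Systems/1"  # system ID varies by vendor

session = requests.Session()
session.auth = AUTH
session.verify = False                  # many BMCs ship self-signed certs

# One-time boot override to PXE (standard ComputerSystem.Boot properties).
session.patch(
    SYSTEM,
    json={"Boot": {"BootSourceOverrideEnabled": "Once",
                   "BootSourceOverrideTarget": "Pxe"}},
    timeout=10,
).raise_for_status()

# Power-cycle so the override takes effect.
session.post(
    f"{SYSTEM}/Actions/ComputerSystem.Reset",
    json={"ResetType": "ForceRestart"},
    timeout=10,
).raise_for_status()
```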
Tooling & DCIM
- Railway built an internal DCIM/rack-modeling tool (“Railyard”) integrated with their orchestrator, rather than adopting NetBox or Nautobot.
- Supporters of the in-house approach say generic DCIMs force your infrastructure to fit their data model; others point to NetBox’s complexity and performance issues (a hypothetical data-model sketch follows this list).
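
Railyard itself is not public, so the following is a purely hypothetical sketch of the kind of narrow rack/slot model a purpose-built tool might start from, in contrast to a general-purpose DCIM's much broader schema.

```python
# A deliberately small rack-modeling core: racks, rack units, and devices,
# keyed the way an orchestrator wants to look things up.
# This is a hypothetical illustration, not Railway's actual "Railyard" schema.
from dataclasses import dataclass, field

@dataclass
class Device:
    hostname: str
    model: str           # e.g. a Supermicro SKU
    rack_unit: int       # bottom U position in the rack
    bmc_address: str

@dataclass
class Rack:
    name: str            # e.g. "dc1-row2-rack07"
    height_u: int = 42
    devices: dict[int, Device] = field(default_factory=dict)  # keyed by rack unit

    def place(self, device: Device) -> None:
        if device.rack_unit in self.devices:
            raise ValueError(f"U{device.rack_unit} in {self.name} is already occupied")
        self.devices[device.rack_unit] = device

rack = Rack("dc1-row2-rack07")
rack.place(Device("compute-0711", "SYS-121H", rack_unit=11, bmc_address="10.0.10.11"))
```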
Operations, reliability & process
- Redundancy: RAID, dual feeds, spares on-site, remote hands; drive failures handled without staff travel.
- Emphasis on standardized rack layouts, detailed cabling diagrams, and plans for LLDP-based validation of cabling (see the sketch after this list).
- Some urge extensive fault-injection and gray-failure testing; the thread also discusses future live migration and user-coordinated maintenance windows.
- Many share war stories about cooling failures, ad-hoc fans, animal-caused outages, and old-school DC chaos.
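
A sketch of the LLDP-based validation idea: compare the neighbors each switch actually reports against the cabling plan from the rack diagrams. How the observed neighbors are collected (lldpd, the switch NOS, SNMP) is left out, and the port and device names below are assumptions.

```python
# Validate physical cabling against the plan: for every local port, the LLDP
# neighbor we observe should match the (remote switch, remote port) that the
# cabling diagram says should be there. Input shapes are assumptions.

# cabling plan: local port -> (expected remote system name, expected remote port)
expected = {
    "Ethernet0": ("spine1", "Ethernet4"),
    "Ethernet4": ("spine2", "Ethernet4"),
}

# observed LLDP neighbors, e.g. parsed from lldpd or the switch NOS
observed = {
    "Ethernet0": ("spine1", "Ethernet4"),
    "Ethernet4": ("spine2", "Ethernet8"),   # mis-patched uplink
}

def check_cabling(expected: dict, observed: dict) -> list[str]:
    problems = []
    for port, want in expected.items():
        got = observed.get(port)
        if got is None:
            problems.append(f"{port}: no LLDP neighbor seen (expected {want})")
        elif got != want:
            problems.append(f"{port}: cabled to {got}, plan says {want}")
    for port in observed.keys() - expected.keys():
        problems.append(f"{port}: unexpected neighbor {observed[port]}")
    return problems

for problem in check_cabling(expected, observed):
    print(problem)
# -> Ethernet4: cabled to ('spine2', 'Ethernet8'), plan says ('spine2', 'Ethernet4')
```

Run after racking and cabling, a check like this turns the cabling diagrams into something executable rather than a document that drifts.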
Cloud provider experiences & positioning
- Railway cites poor support from a major cloud provider, despite multi-million-dollar spend, as a motivator.
- Commenters contrast AWS, GCP, and Azure support: experiences range from outstanding (AWS/GCP) to frustrating (Azure and some AWS cases).
- Debate over whether running your own metal is worth the long-term operational burden vs sticking with hyperscalers.
Alternatives & ecosystem
- Mentions of dedicated bare metal (Hetzner, Hivelocity), OpenStack, Oxide racks, and colocation resellers as intermediate options.
- Oxide is seen as technically attractive but too monolithic/early-stage for this use case.