So you want to build your own data center
Scope of the project (colo vs “building a DC”)
- Many point out the article is about cage colocation, not constructing a full facility.
- Some see the title as misleading clickbait; others argue “your own data center” is commonly used to mean racking your own hardware in a colo.
- Multiple comments distinguish between a greenfield DC build (building + power + cooling), a cage in an existing facility, and simple rack colo.
Motivations & economics
- Strong interest in cost breakdown; Railway deferred detailed numbers to a future post.
- Several argue cloud egress is “extortionate” and makes bare metal + colo vastly cheaper for bandwidth-heavy workloads.
- Back-of-envelope comparisons put 100 Gbps of IXP/transit capacity at a few thousand dollars a month versus millions in hyperscaler egress for comparable volume, though exact AWS/GCP pricing numbers are disputed (an illustrative calculation follows this list).
- Some warn that while hardware and transit can be cheap, you must factor in spares, repairs, on-call, and operational complexity.
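
To make the scale of that gap concrete, here is a rough, illustrative calculation. The per-GB egress price, the utilization, and the monthly cost of a 100G port are assumptions chosen for the sketch, not figures quoted in the thread.

```python
# Back-of-envelope comparison of hyperscaler egress vs. a 100G port at an IXP.
# All prices below are illustrative assumptions, not quotes.

GBPS = 100                       # committed bandwidth
UTILIZATION = 0.5                # assume the link runs half full on average
SECONDS_PER_MONTH = 30 * 24 * 3600

# Volume pushed in a month, in gigabytes (1 Gbps = 0.125 GB/s).
monthly_gb = GBPS * 0.125 * UTILIZATION * SECONDS_PER_MONTH

cloud_egress_per_gb = 0.05       # assumed blended $/GB after volume discounts
cloud_cost = monthly_gb * cloud_egress_per_gb

colo_port_cost = 3_000           # assumed monthly cost of a 100G IXP/transit port

print(f"{monthly_gb / 1e6:.1f} PB/month")           # ~16.2 PB at 50% utilization
print(f"cloud egress: ${cloud_cost:,.0f}/month")     # ~$810,000
print(f"IXP/transit:  ${colo_port_cost:,.0f}/month")
```

Even with generous discount assumptions the gap is two to three orders of magnitude, which is the core of the “egress is extortionate” argument; the spares, on-call, and remote-hands costs from the previous point sit on the other side of the ledger.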
Hardware, networking & design choices
- Servers: Supermicro x86; ARM was considered, but the currently available SKUs were judged too old for the scale and risk involved.
- Networking: whitebox switches running SONiC + FRR, eBGP down to the hosts, BGP unnumbered; the design is inspired by the book “BGP in the Data Center” (a config sketch follows this list).
- Current gen uses 25G leaf and 100G spine; commenters suggest alternative topologies (collapsed 100G per rack, CR3-class switches) for better $/Gbps.
- PXE-based boot via pixiecore + Debian netboot; BMC/Redfish orchestration (see the Redfish sketch below); a custom host agent manages QEMU VMs and BGP advertisement.
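
As a rough illustration of the BGP-unnumbered design mentioned above, the sketch below renders a minimal FRR configuration for a host peering with its top-of-rack switches over unnumbered interfaces. The ASN, interface names, and advertised prefix are made up; this is not Railway's actual configuration.

```python
# Render a minimal FRR configuration for a host doing eBGP-unnumbered to its
# top-of-rack switches and advertising its own /32 service address.
# ASN, interface names, and prefix are hypothetical.

FRR_TEMPLATE = """\
router bgp {asn}
 bgp router-id {router_id}
{neighbors}
 address-family ipv4 unicast
  network {service_prefix}
 exit-address-family
"""

def render_frr_conf(asn: int, router_id: str, uplinks: list[str], service_prefix: str) -> str:
    # "neighbor <iface> interface remote-as external" is FRR's BGP-unnumbered form:
    # the session runs over the interface's IPv6 link-local address, so no
    # per-link IPv4 addressing plan is needed.
    neighbors = "\n".join(
        f" neighbor {iface} interface remote-as external" for iface in uplinks
    )
    return FRR_TEMPLATE.format(
        asn=asn, router_id=router_id, neighbors=neighbors, service_prefix=service_prefix
    )

print(render_frr_conf(65101, "10.0.0.11", ["eth0", "eth1"], "10.0.0.11/32"))
```

The appeal of the unnumbered form is that every host can use the same template: the eBGP sessions come up over link-local addresses, and only the router ID and advertised prefix differ per machine.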
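
On the provisioning side, a hedged sketch of what the BMC/Redfish step might look like: set a one-time PXE boot override and power-cycle the node so it boots into the netboot environment served by pixiecore. The BMC address, credentials, and system ID are placeholders; the endpoints and property names come from the standard Redfish ComputerSystem schema.

```python
# Ask a node's BMC (via Redfish) to PXE-boot once, then reboot it so it lands
# in the netboot environment (e.g. pixiecore serving a Debian installer).
# BMC address, credentials, and the system ID are placeholders.
import requests

BMC = "https://10.0.10.11"              # hypothetical BMC address
AUTH = ("admin", "changeme")            # placeholder credentials
SYSTEM = f"{BMC}/redfish/v1/Systems/1"  # system ID varies by vendor

session = requests.Session()
session.auth = AUTH
session.verify = False                  # many BMCs ship self-signed certs

# One-time boot override to PXE (standard ComputerSystem.Boot properties).
session.patch(
    SYSTEM,
    json={"Boot": {"BootSourceOverrideEnabled": "Once",
                   "BootSourceOverrideTarget": "Pxe"}},
    timeout=10,
).raise_for_status()

# Power-cycle so the override takes effect.
session.post(
    f"{SYSTEM}/Actions/ComputerSystem.Reset",
    json={"ResetType": "ForceRestart"},
    timeout=10,
).raise_for_status()
```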
Tooling & DCIM
- Railway built an internal DCIM/rack-modeling tool (“Railyard”) integrated with their orchestrator, rather than adopting NetBox or Nautobot.
- Supporters of the in-house approach say generic DCIMs force your infrastructure to fit their data model; others point to NetBox’s complexity and performance issues (a hypothetical data-model sketch follows this list).
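
Railyard itself is not public, so the following is a purely hypothetical sketch of the kind of narrow rack/slot model a purpose-built tool might start from, in contrast to a general-purpose DCIM's much broader schema.

```python
# A deliberately small rack-modeling core: racks, rack units, and devices,
# keyed the way an orchestrator wants to look things up.
# This is a hypothetical illustration, not Railway's actual "Railyard" schema.
from dataclasses import dataclass, field

@dataclass
class Device:
    hostname: str
    model: str           # e.g. a Supermicro SKU
    rack_unit: int       # bottom U position in the rack
    bmc_address: str

@dataclass
class Rack:
    name: str            # e.g. "dc1-row2-rack07"
    height_u: int = 42
    devices: dict[int, Device] = field(default_factory=dict)  # keyed by rack unit

    def place(self, device: Device) -> None:
        if device.rack_unit in self.devices:
            raise ValueError(f"U{device.rack_unit} in {self.name} is already occupied")
        self.devices[device.rack_unit] = device

rack = Rack("dc1-row2-rack07")
rack.place(Device("compute-0711", "SYS-121H", rack_unit=11, bmc_address="10.0.10.11"))
```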
Operations, reliability & process
- Redundancy: RAID, dual feeds, spares on-site, remote hands; drive failures handled without staff travel.
- Emphasis on standardized rack layouts, detailed cabling diagrams, and plans for LLDP-based validation of cabling (see the sketch after this list).
- Some urge extensive fault-injection and gray-failure testing; the thread also discusses future live migration and user-coordinated maintenance windows.
- Many share war stories about cooling failures, ad-hoc fans, animal-caused outages, and old-school DC chaos.
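
A sketch of the LLDP-based validation idea: compare the neighbors each switch actually reports against the cabling plan from the rack diagrams. How the observed neighbors are collected (lldpd, the switch NOS, SNMP) is left out, and the port and device names below are assumptions.

```python
# Validate physical cabling against the plan: for every local port, the LLDP
# neighbor we observe should match the (remote switch, remote port) that the
# cabling diagram says should be there. Input shapes are assumptions.

# cabling plan: local port -> (expected remote system name, expected remote port)
expected = {
    "Ethernet0": ("spine1", "Ethernet4"),
    "Ethernet4": ("spine2", "Ethernet4"),
}

# observed LLDP neighbors, e.g. parsed from lldpd or the switch NOS
observed = {
    "Ethernet0": ("spine1", "Ethernet4"),
    "Ethernet4": ("spine2", "Ethernet8"),   # mis-patched uplink
}

def check_cabling(expected: dict, observed: dict) -> list[str]:
    problems = []
    for port, want in expected.items():
        got = observed.get(port)
        if got is None:
            problems.append(f"{port}: no LLDP neighbor seen (expected {want})")
        elif got != want:
            problems.append(f"{port}: cabled to {got}, plan says {want}")
    for port in observed.keys() - expected.keys():
        problems.append(f"{port}: unexpected neighbor {observed[port]}")
    return problems

for problem in check_cabling(expected, observed):
    print(problem)
# -> Ethernet4: cabled to ('spine2', 'Ethernet8'), plan says ('spine2', 'Ethernet4')
```

Run after racking and cabling, a check like this turns the cabling diagrams into something executable rather than a document that drifts.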
Cloud provider experiences & positioning
- Railway cites poor support from a major cloud provider, despite multi-million-dollar spend, as a motivator.
- Commenters contrast AWS, GCP, and Azure support: experiences range from outstanding (AWS/GCP) to frustrating (Azure and some AWS cases).
- Debate over whether running your own metal is worth the long-term operational burden vs sticking with hyperscalers.
Alternatives & ecosystem
- Mentions of dedicated bare metal (Hetzner, Hivelocity), OpenStack, Oxide racks, and colocation resellers as intermediate options.
- Oxide is seen as technically attractive but too monolithic/early-stage for this use case.