2025-01-03

The Evolution of SRE at Google

Definitions and Role Drift

Several commenters note that “SRE” and “DevOps” have become overloaded and blurred.
SRE is variously described as:
- Software engineers who write code to manage distributed systems.
- Modern sysadmins focusing on reliability and automation.
- People doing risk modeling and failure-mode analysis.
DevOps is seen as:
- Originally a culture/practices shift (shared ownership, automation).
- Commonly misapplied as a new name for an ops/sysadmin team that developers “throw things over the wall” to.

DevOps, Culture, and Organizational Dynamics

Some argue DevOps mainly means “you run what you write”; others say that’s only a small slice of a broader body of practices.
A recurring theme: the real problems are organizational (conflicting incentives, poor collaboration) rather than tooling.
Commenters disagree on whether DevOps principles can fix a culture or whether they require a good culture to begin with.
Management fads (DevOps, now “Head of AI”) are seen by some as a lever for change and by others as empty signaling.

Google as Example (or Not)

Mixed views on Google as a model:
- Many still see Google’s reliability and internal tech as top tier.
- Others see Google’s products, vision, and follow-through as having deteriorated, and regard ex-Googlers as prone to over-engineering for non-Google-scale problems.
Some split the difference: don’t copy Google’s product org, maybe copy parts of its reliability practices, but only if your scale and budget warrant it.

CAST/STPA and Incident Analysis

The article’s move toward CAST/STPA (Causal Analysis based on Systems Theory and System-Theoretic Process Analysis) is widely seen as its most meaningful content.
Supporters emphasize:
- Moving beyond single “root cause” to interacting causes.
- Blame-free analysis of systems, not individuals.
- Looking at unsafe control actions and bad inputs, not just code correctness.
Critiques: the writeup is verbose and light on concrete process details, and the approach is probably feasible only for large, well-funded orgs.
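As a rough illustration of what "looking at unsafe control actions" can mean in practice, here is a minimal sketch of an STPA-style worksheet. It is not taken from the article or the thread. It assumes the four standard STPA categories of unsafe control action, and the `ControlAction` type, the `uca_prompts` helper, and the autoscaler example are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class UCAType(Enum):
    """The four standard STPA categories of unsafe control action."""
    NOT_PROVIDED = "not provided when needed"
    PROVIDED = "provided when it causes a hazard"
    WRONG_TIMING = "provided too early, too late, or out of order"
    WRONG_DURATION = "stopped too soon or applied too long"


@dataclass(frozen=True)
class ControlAction:
    """Hypothetical record of one command in a control loop."""
    controller: str  # who or what issues the action
    action: str      # the command itself
    target: str      # the controlled process


def uca_prompts(ca: ControlAction) -> list[str]:
    """Return one review question per UCA category for a control action."""
    return [
        f"Is a hazard possible if '{ca.action}' from {ca.controller} "
        f"to {ca.target} is {uca.value}?"
        for uca in UCAType
    ]


# Hypothetical example: an autoscaler controlling a serving fleet.
scale_down = ControlAction("autoscaler", "scale down", "serving fleet")
for prompt in uca_prompts(scale_down):
    print(prompt)
```

The point of the sketch is that the questions are about the control action in context (when it is issued, withheld, or mistimed), not about whether the code that issues it is correct.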

Scale, Complexity, and Over-Engineering

Some argue architectures with 100+ nodes in a dataflow are a smell; the best mitigation is not building such complex systems.
Others note that companies often copy Google-scale tooling (e.g., Kubernetes) where simpler cloud services would suffice.
There is concern about SRE/DevOps teams developing “main character syndrome” and redesigning everything, rather than serving as pragmatic maintainability enforcers.

On-Call Ownership and Role Boundaries

Strong sentiment that engineers should own and be on call for the code they write.
Anti-pattern highlighted: SREs acting as babysitters/first-line support so product engineers avoid being paged.
Some large organizations reportedly still have SWEs on call without a dedicated SRE function.
