The Evolution of SRE at Google
Definitions and Role Drift
- Several commenters note that “SRE” and “DevOps” have become overloaded and blurred.
- SRE is variously described as:
  - Software engineers who write code to manage distributed systems.
  - Modern sysadmins focusing on reliability and automation.
  - People doing risk modeling and failure-mode analysis.
- DevOps is seen as:
  - Originally a culture/practices shift (shared ownership, automation).
  - Commonly misused as a renamed ops/sysadmin team that devs “throw things over the wall” to.
DevOps, Culture, and Organizational Dynamics
- Some argue DevOps mainly means “you run what you write”; others say that’s only a small slice of a broader body of practices.
- A recurring theme: the real problems are organizational (conflicting incentives, poor collaboration) rather than tooling.
- Commenters disagree on whether DevOps principles can fix a broken culture or whether a healthy culture has to be in place first.
- Management fads (DevOps, now “Head of AI”) are seen both as a lever for change and as empty signaling.
Google as Example (or Not)
- Mixed views on Google as a model:
  - Many still see Google’s reliability and internal tech as top tier.
  - Others see Google’s products, vision, and follow-through as deteriorated, and regard ex-Googlers as prone to over-engineering for non-Google-scale problems.
  - Some split the difference: don’t copy Google’s product org, maybe copy parts of its reliability practices, but only if your scale and budget warrant it.
CAST/STPA and Incident Analysis
- The article’s move toward CAST and STPA (Causal Analysis based on Systems Theory and System-Theoretic Process Analysis) is widely seen as its most meaningful content.
- Supporters emphasize:
  - Moving beyond a single “root cause” to interacting causes.
  - Blame-free analysis of systems, not individuals.
  - Looking at unsafe control actions and bad inputs, not just code correctness.
- Critics find the writeup verbose and light on concrete process details, and suspect the approach is feasible only for large, well-funded orgs.
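
Since one complaint is the lack of concrete process detail, here is a minimal sketch of what "looking at unsafe control actions" can mean in practice. It assumes the four standard STPA categories of unsafe control action and crosses them with control actions and operating contexts to produce review prompts. The controller, action, and context names are hypothetical illustrations, not taken from the article or the thread, and this is not a description of Google's actual process.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class UCAType(Enum):
    """The four ways STPA says a control action can be unsafe."""
    NOT_PROVIDED = "is not provided"
    PROVIDED = "is provided"
    WRONG_TIMING = "is provided too early, too late, or out of order"
    WRONG_DURATION = "is stopped too soon or applied too long"


@dataclass(frozen=True)
class Candidate:
    """One candidate unsafe control action for humans to review."""
    controller: str
    action: str
    uca_type: UCAType
    context: str

    def prompt(self) -> str:
        # The question is about the system's behavior, not about who was at fault.
        return (
            f"[{self.controller}] '{self.action}' {self.uca_type.value} "
            f"while {self.context}: can this lead to a hazard?"
        )


def enumerate_candidates(controller, actions, contexts):
    """Cross every action with every context and every UCA type."""
    return [
        Candidate(controller, action, uca_type, context)
        for action, context, uca_type in product(actions, contexts, UCAType)
    ]


if __name__ == "__main__":
    # Hypothetical example: a deploy pipeline acting as the controller.
    candidates = enumerate_candidates(
        controller="deploy pipeline",
        actions=["roll out config", "roll back"],
        contexts=["a canary is failing", "traffic is at peak"],
    )
    for candidate in candidates:
        print(candidate.prompt())
```

The enumeration only generates questions; deciding which candidates are actually hazardous, and why the controller might issue them, remains a human analysis step.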
Scale, Complexity, and Over-Engineering
- Some argue architectures with 100+ nodes in a dataflow are a smell; the best mitigation is not building such complex systems.
- Others note that companies often copy Google-scale tooling (e.g., Kubernetes) where simpler cloud services would suffice.
- There is concern about SRE/DevOps teams developing “main character syndrome” and redesigning everything, rather than serving as pragmatic enforcers of maintainability.
On-Call Ownership and Role Boundaries
- Strong sentiment that engineers should own and be on call for the code they write.
- Anti-pattern highlighted: SREs acting as babysitters/first-line support so product engineers avoid being paged.
- Some large organizations reportedly still have SWEs on call without a dedicated SRE function.