The Evolution of SRE at Google

Definitions and Role Drift

  • Several commenters note that “SRE” and “DevOps” have become overloaded and blurred.
  • SRE is variously described as:
    • Software engineers who write code to manage distributed systems.
    • Modern sysadmins focusing on reliability and automation.
    • People doing risk modeling and failure-mode analysis.
  • DevOps is seen as:
    • Originally a culture/practices shift (shared ownership, automation).
    • Commonly misused as a new name for an ops/sysadmin team that devs “throw things over the wall” to.

DevOps, Culture, and Organizational Dynamics

  • Some argue DevOps mainly means “you run what you write”; others say that’s only a small slice of a broader body of practices.
  • A recurring theme: the real problems are organizational (conflicting incentives, poor collaboration) rather than tooling.
  • Commenters disagree on whether DevOps principles can fix a culture or whether they only work where a good culture already exists.
  • Management fads (DevOps, now “Head of AI”) are seen both as a lever for change and as empty signaling.

Google as Example (or Not)

  • Mixed views on Google as a model:
    • Many still see Google’s reliability and internal tech as top tier.
    • Others see Google products, vision, and follow-through as deteriorated, and regard ex-Googlers as prone to over-engineering for non-Google-scale problems.
  • Some split the difference: don’t copy Google’s product org, maybe copy parts of its reliability practices, but only if your scale and budget warrant it.

CAST/STPA and Incident Analysis

  • The article’s move toward CAST (Causal Analysis based on System Theory) and STPA (System-Theoretic Process Analysis) is widely seen as its most meaningful content.
  • Supporters emphasize:
    • Moving beyond single “root cause” to interacting causes.
    • Blame-free analysis of systems, not individuals.
    • Looking at unsafe control actions and bad inputs, not just code correctness.
  • Critiques: the writeup is verbose, light on concrete process details, and probably feasible only for large, well-funded orgs.
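
To make the "unsafe control actions" point concrete, here is a minimal sketch of the four guide categories that STPA uses when reviewing a control action. It is not taken from the article or the thread; the ControlAction fields, the uca_prompts helper, and the example rollout scenario are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class UcaType(Enum):
    """The four ways STPA considers a control action potentially unsafe."""
    NOT_PROVIDED = "is not provided when needed"
    PROVIDED = "is provided in a context where it causes a hazard"
    WRONG_TIMING = "is provided too early, too late, or out of order"
    WRONG_DURATION = "is stopped too soon or applied too long"


@dataclass(frozen=True)
class ControlAction:
    # Hypothetical fields: who issues the action, what it is, what it acts on.
    controller: str
    action: str
    controlled_process: str


def uca_prompts(ca: ControlAction) -> list[str]:
    """Return one review question per UCA category for the given action."""
    return [
        f"{ca.controller}: '{ca.action}' on {ca.controlled_process} "
        f"{uca.value}. In what context would this be hazardous?"
        for uca in UcaType
    ]


if __name__ == "__main__":
    # Hypothetical example: an automated rollout system draining traffic.
    rollout = ControlAction(
        controller="rollout automation",
        action="drain traffic",
        controlled_process="serving cluster",
    )
    for prompt in uca_prompts(rollout):
        print(prompt)
```

The point of the sketch is that each question is asked about the control loop rather than about a person or a single line of code, which matches the blame-free, multi-cause framing that supporters describe.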

Scale, Complexity, and Over-Engineering

  • Some argue architectures with 100+ nodes in a dataflow are a smell; the best mitigation is not building such complex systems.
  • Others note that companies often copy Google-scale tooling (e.g., Kubernetes) where simpler cloud services would suffice.
  • There is concern about SRE/DevOps teams developing “main character syndrome” and redesigning everything, rather than serving as pragmatic enforcers of maintainability.

On-Call Ownership and Role Boundaries

  • Strong sentiment that engineers should own and be on call for the code they write.
  • Anti-pattern highlighted: SREs acting as babysitters/first-line support so product engineers avoid being paged.
  • Some large organizations reportedly still have SWEs on call without a dedicated SRE function.