SRE Capability Map v1.3

This map shows every capability that (as far as I know) fits the SRE domain. Use it to build and guide your org’s SRE function.

Observability

Understand observability across the entire system

Run tools for logging, metrics and tracing

Set up visual dashboards for observing service health

Review alert triggers when metrics exceed thresholds

Develop database artefacts for pulling observability data

Chaos Engineering

Understand the various kinds of chaos to unleash on systems

Set up tools like Gremlin, or configure Chaos Monkey, Simian Army or Chaos Toolkit

Run automated chaos testing in CI/CD environments

Open Source Management

Contribute to the selection of best fit open source tools

Contribute to effective integrations of OSS in pipelines

Run assurance processes for code breakage pre-production

Develop changes to open source libraries for a better fit

Incident Management

Respond to PagerDuty or other alert tools when on-call

Set up a robust ChatOps solution

Develop solutions for reducing issue noise and preventing pager fatigue (e.g. content)

Contribute to support team for engineer-level issues

Record incident reports

Review past incidents using Blameless methodology

Respond to tickets to support overloaded ops people

Develop runbooks to speed up response to common incidents

Develop tools that help reduce underlying causes of tickets

High Availability

Understand types of scaling including manual vs autoscaling

Contribute to effective Kubernetes strategy

Develop effective distributed databases

Run load balancers for effective traffic distribution

Contribute to the disaster recovery strategy, e.g. failovers

Develop scaling models for various demand scenarios

Develop redundancies within the system

Understand ancillary concepts like failure-driven design

Contribute to safe release practices using canarying etc.

System Enhancement

Contribute to ways to meet SLOs and SLAs with a strong view of SLIs

Develop risk reviews for systems, e.g. capacity, disaster recovery

Record anti-patterns with fallbacks for critical scenarios

Develop performance baselines required for production go-live

Record variables of extreme events occurring in the real world

Understand evolving properties of high-throughput systems

Develop the evolution of suitable services to a stateless form
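
The SLO, SLA and SLI card at the top of this group comes down to simple arithmetic, so a small illustration may help. Below is a minimal Python sketch, assuming a request-based availability SLI. The function name `error_budget`, its parameters and the sample numbers are hypothetical and not part of the map.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarise a request-based availability SLI against an SLO target.

    slo_target is a fraction, e.g. 0.999 for "99.9% of requests succeed".
    All names here are illustrative assumptions.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")

    # SLI: the measured fraction of successful requests.
    sli = (total_requests - failed_requests) / total_requests

    # Error budget: how many failures the SLO tolerates over this window.
    allowed_failures = (1 - slo_target) * total_requests

    return {
        "sli": sli,
        "allowed_failures": allowed_failures,
        "budget_consumed": failed_requests / allowed_failures,
        "slo_met": sli >= slo_target,
    }


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests, 250 failures, 99.9% SLO.
    report = error_budget(0.999, 1_000_000, 250)
    print(f"SLI: {report['sli']:.5f}")
    print(f"Budget consumed: {report['budget_consumed']:.0%}")
    print(f"SLO met: {report['slo_met']}")
```

A team might compare `budget_consumed` against a release policy, for example slowing feature rollouts once most of the budget is spent. That threshold is a local decision rather than anything this map prescribes.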

DevSecOps

Contribute to teaching stakeholders about DevSecOps

Respond to active security threats with security specialists

Develop build security tactics for the CI/CD environment

Develop pre-production security tactics and checklists

Run build and production security policies & tools

Review security issues from SIEM to plug holes within infra

Leadership

Understand the balance between urgent and long-term needs

Understand the value of user experience to operational goals

Develop standards and policies for better DevOps

Quality Assurance

Review source code for optimisation opportunities

Develop strategies for toil reduction and implement them

Review cloud billing to develop and run cost reduction tactics

Contribute to improvements for the CI/CD pipeline

Contribute to tooling strategy for better developer experience

Contribute to improving config & infrastructure-as-code

Understand effective project management skills

Review system designs, balancing cost, time and complexity trade-offs

Performance Engineering

Set up application performance management (APM) tools

Develop benchmarks to observe how the system copes at scale

Run improvements on network, CPU, storage & applications

Contribute to ideas for “could be better” issues in performance

Develop tools for investigating application performance

Review code and system (e.g. Linux kernel) for optimisation

Understand emerging practices in Linux, CPU and memory enhancement

Research Notes:

  • No individual SRE could (or should want to) fully cover the entire domain
  • Most SREs have common core capabilities (e.g. on-call incident response)
  • Many combine core activities with an area (or several) of expertise, e.g. Kubernetes