In this post, I will cover the following modes of system resilience:
- Adaptive Response
- Superior Monitoring
- Coordinated Resilience
- Heterogeneous Systems
- Dynamic Repositioning
- Requisite Availability
Before we explore these modes in greater depth, here is a definition of system resilience, so that we start from a reasonably shared understanding of the concept:
"System resilience is the ability of organizational, hardware and software systems to mitigate the severity and likelihood of failures or losses, to adapt to changing conditions, and to respond appropriately after the fact."
Source: Jackson, Scott. (2007). System Resilience: Capabilities, Culture and Infrastructure. INCOSE International Symposium.
Now, let’s explore the modes SREs toggle to drive greater system resilience:
Adaptive response is the ability to respond in a timely and appropriate manner. Adaptive SRE teams are not about speed alone, but neither do they navel-gaze while a problem escalates.
They will have pre-existing toolsets and processes to rapidly and accurately handle incidents. Runbooks are one way to achieve this. Of course, a runbook should be constantly updated based on new learnings.
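One way to keep a runbook current is to express it as code that can be reviewed and updated like any other change. The following is a minimal sketch, not a reference to any particular tool: `Step`, `run_runbook` and the escalation message are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One runbook step (hypothetical structure, for illustration only)."""
    description: str
    action: Callable[[], bool]  # returns True if the incident is resolved


def run_runbook(steps: List[Step]) -> List[str]:
    """Run steps in order and stop at the first one that resolves the incident."""
    log = []
    for step in steps:
        resolved = step.action()
        outcome = "resolved" if resolved else "not resolved"
        log.append(f"{step.description}: {outcome}")
        if resolved:
            break
    else:
        # No step resolved the incident, so hand over to a human.
        log.append("escalate to on-call lead")
    return log


# Example: the first remediation does not help, the second one does.
steps = [
    Step("restart worker", lambda: False),
    Step("roll back deploy", lambda: True),
]
for line in run_runbook(steps):
    print(line)
```

The returned log doubles as a record of what was tried, which can feed the post-incident review that updates the runbook.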
Superior monitoring means monitoring for and detecting adverse events in a timely manner. The key to reducing the severity of system failures is knowing when they are happening in the first place. You also need to know where, and to what extent, your systems are affected.
Effective SRE efforts combine the trifecta of logging, monitoring and tracing to achieve the holy grail known as observability. They monitor for system issues, trace origins of the problem and review logs to identify patterns.
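To make the three signals concrete, here is a small sketch that emits a log line, a latency metric and a trace identifier for one operation, using only the Python standard library. The `observed` helper, the in-memory `metrics` dictionary and the `checkout` logger name are assumptions made for this example. A real setup would ship these signals to dedicated logging, metrics and tracing backends.

```python
import logging
import time
import uuid
from contextlib import contextmanager

# Hypothetical in-memory metric store: metric name -> list of samples.
metrics = {}

logger = logging.getLogger("checkout")


@contextmanager
def observed(operation, trace_id=None):
    """Log, time and tag one operation so the three signals can be correlated."""
    trace_id = trace_id or uuid.uuid4().hex  # tracing: one id per request
    start = time.perf_counter()
    logger.info("start %s trace=%s", operation, trace_id)  # logging
    try:
        yield trace_id
    except Exception:
        logger.exception("failed %s trace=%s", operation, trace_id)
        metrics.setdefault(operation + ".errors", []).append(1)
        raise
    finally:
        # monitoring: record how long the operation took, success or not
        elapsed = time.perf_counter() - start
        metrics.setdefault(operation + ".seconds", []).append(elapsed)


with observed("pay") as trace_id:
    pass  # the real work would happen here, passing trace_id downstream
```

Because the same trace identifier appears in the log lines and travels with the request, an engineer can move from an alert on the metric to the logs for the affected requests.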
Coordinated resilience means implementing a resilience-in-depth strategy, so that a problem has to get past multiple obstacles before it causes a failure. SRE teams can layer several system-level protections, such as failover design, security integration and BDD/TDD.
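As an illustration of one such layer, failover design can be sketched as trying a list of backends in order until one answers. This is a simplified sketch under the assumption that each backend is a plain callable; `call_with_failover` is a hypothetical helper, not part of any library.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(backends: Sequence[Callable[[], T]]) -> T:
    """Return the result of the first backend that succeeds."""
    errors = []
    for backend in backends:
        try:
            return backend()
        except Exception as exc:  # a real system would catch narrower errors
            errors.append(exc)
    raise RuntimeError(f"all {len(errors)} backends failed")


def primary():
    raise ConnectionError("primary down")  # simulated outage


def secondary():
    return "served by secondary"


print(call_with_failover([primary, secondary]))
```

In a resilience-in-depth setup this would be only one obstacle among several, sitting alongside tests, security controls and monitoring.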
They can also coordinate their efforts with developers at a deeper level. Part of SRE work may include integrating into developer units. They may teach developers ways to achieve system resilience with measures like full stack tracing or the DevSecOps philosophy.
Building heterogeneous systems means using diversity to reduce common-mode failures. For context, the most blatant example of this kind of exposure would be serving your entire application over a single fibre cable.
SREs aim to reduce the likelihood of total system failure caused by a single component’s vulnerability. In the above situation, if the cable gets cut, the application goes down. If multiple, different systems are used in combination to serve up services, the risk is significantly lowered.
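A rough back-of-the-envelope calculation shows why. The sketch below assumes that each path fails independently of the others, which is exactly what heterogeneity is meant to make more plausible. Identical components sharing a flaw would break that assumption. The probabilities are made-up illustrative numbers, and `total_outage_probability` is a hypothetical helper.

```python
import math


def total_outage_probability(path_failure_probs):
    """Probability that every path fails at once, ASSUMING independent failures."""
    return math.prod(path_failure_probs)


# One cable with a 1% chance of being down (illustrative number).
single = total_outage_probability([0.01])

# Two independent, different paths, each with a 1% chance of being down.
dual = total_outage_probability([0.01, 0.01])

print(f"single path: {single:.4%}, two independent paths: {dual:.4%}")
```

Under these assumptions the second path cuts the chance of a total outage from about 1 in 100 to about 1 in 10,000. If both paths share a common weakness, the real figure sits much closer to the single-path case.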
Dynamic repositioning increases the ability to recover from an incident by distributing and diversifying where services run. Many SREs already have the benefit of working with distributed systems, as many applications now run on the cloud.
They can protect critical systems by implementing a combination of:
- Geographical repositioning – services are readily available from multiple zones
- Cloud repositioning – readiness for services to go private cloud or on-prem
- Dependency repositioning – altering connectivity between critical and non-critical systems so problems with the latter don’t affect the former
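Dependency repositioning, the last item above, can be sketched as a critical code path that treats a non-critical dependency as optional. The example below is illustrative only: `render_order_page` and `fetch_recommendations` are hypothetical names, and a production version would typically add timeouts and a circuit breaker.

```python
def render_order_page(order, fetch_recommendations):
    """Build the critical order page even if the non-critical call fails."""
    try:
        recommendations = fetch_recommendations(order["id"])
    except Exception:
        # The recommendations service is non-critical: degrade, do not fail.
        recommendations = []
    return {"order": order, "recommendations": recommendations}


def broken_recommendations(order_id):
    raise TimeoutError("recommendation service unavailable")  # simulated fault


page = render_order_page({"id": 42, "total": 19.99}, broken_recommendations)
print(page)
```

The order page still renders with an empty recommendations list, so a fault in the non-critical system does not reach the critical one.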
Requisite availability is a balancing act in making services and data available to users. SREs have to deal with varying priorities for different types of data and systems. Some systems must be available at all times, others not as much. Some data is privileged, other data is not.
They may employ techniques like redundancy to increase the availability of important systems. This is where multiple instances reside in multiple locations to ensure availability in case one instance fails. They may also make data non-persistent so that it is not open to corruption or compromise.
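Non-persistence can be illustrated with a store that forgets data after a fixed lifetime. This is one possible reading of the idea, sketched with assumptions: `EphemeralStore`, its `ttl_seconds` parameter and the injectable `clock` are hypothetical names for this example.

```python
import time


class EphemeralStore:
    """In-memory store whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._items = {}     # key -> (value, expiry time)

    def put(self, key, value):
        self._items[key] = (value, self._clock() + self._ttl)

    def get(self, key, default=None):
        entry = self._items.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._items[key]  # expired: drop it so it cannot be read later
            return default
        return value


# Usage with a fake clock so the example runs instantly.
now = [0.0]
store = EphemeralStore(ttl_seconds=30, clock=lambda: now[0])
store.put("session-token", "abc123")
print(store.get("session-token"))  # still within its lifetime
now[0] = 31.0
print(store.get("session-token"))  # lifetime exceeded, entry is gone
```

Data that only lives for a short window gives corruption or compromise less time to take hold, at the cost of having to recreate the data when it is needed again.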
We have now covered six modes for creating and running more resilient systems. Do you know more? Reach out if you do.