The Richter Scale of Reliability in Highly Scalable Infrastructure
Anyone who operates highly scalable infrastructure knows there is one maxim they must abide by:
Assume everything fails
It may seem like a rather morbid approach to designing and operating infrastructure, but it becomes obvious once you shift from an “aim for 100% uptime” mindset to the Site Reliability Engineering (SRE) approach.
By this shift in approach, I mean that we stop building on the architect’s classic assumption that we can design for total reliability. The SRE approach instead relies on every layer of your application infrastructure being able to fail and recover continuously.
Scale-Out Fails By Design
I once read something from Kelly Sommers (@kellabyte on Twitter) about how she operated database infrastructure at such a scale that about 10% of the nodes were down at any given time, due to the load put on them and other operational impacts.
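To make the arithmetic behind a figure like that concrete, here is a minimal sketch of the capacity math, assuming the failed share is a steady-state percentage and that every node carries equal load. The function name `nodes_to_provision` and its parameters are hypothetical and used only for illustration; they do not come from the original article.

```python
def nodes_to_provision(required_healthy: int, failed_percent: int) -> int:
    """Smallest cluster size that still leaves `required_healthy` nodes up
    when `failed_percent` percent of the cluster is down at any given time.

    Assumption: failures are a steady-state share of the cluster and all
    nodes are interchangeable.
    """
    if required_healthy < 0:
        raise ValueError("required_healthy must be non-negative")
    if not 0 <= failed_percent < 100:
        raise ValueError("failed_percent must be in the range 0..99")

    healthy_percent = 100 - failed_percent
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-required_healthy * 100 // healthy_percent)


if __name__ == "__main__":
    # If the workload needs 100 healthy nodes and 10% are always down,
    # the cluster has to be sized above 100 nodes from the start.
    print(nodes_to_provision(100, 10))
```

The point of the sketch is that failure becomes a sizing input rather than an exception: the cluster is planned around the nodes that are expected to be down, not around an assumption that all of them stay up.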
Read the entire article here: The Richter Scale of Reliability in Highly Scalable Infrastructure
via the fine folks at Turbonomic!