Anyone who operates highly scalable infrastructure knows that there is one maxim they must abide by:

Assume everything fails

It may seem like a rather morbid approach to the design and operation of infrastructure, but it becomes obvious once you shift from an “aim for 100% uptime” mindset to the Site Reliability Engineering (SRE) approach.

What I mean by the shift in approach is that we stop building on the architect’s classic assumption that we can design for total reliability. The SRE approach instead assumes that every layer of your application infrastructure is continuously failing and recovering, and relies on that ability to recover.

Scale-Out Fails By Design

I once read something from Kelly Sommers (@kellabyte on Twitter) about how she operated database infrastructure at such a scale that about 10% of the nodes were in a failed state at any given time, due to the load put on them and other operational impacts.
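
To make the idea concrete, here is a minimal sketch of what "assume everything fails" can look like from the caller's side. It is not taken from any real system: the names `NodeDown`, `make_node`, and `call_with_failover` are hypothetical, and the "nodes" are plain functions standing in for database servers. The caller simply expects some nodes to be down and moves on to the next one.

```python
class NodeDown(Exception):
    """Raised by a (hypothetical) node that is currently in a failed state."""


def make_node(name, is_down):
    """Build a stand-in node: a function that either answers or raises NodeDown."""
    def handle(request):
        if is_down:
            raise NodeDown(name)
        return f"{name} handled {request}"
    return handle


def call_with_failover(nodes, request):
    """Try each node in order and return the first successful answer.

    A failed node is treated as routine, not exceptional. Only when every
    node has failed does the caller see an error.
    """
    last_error = None
    for node in nodes:
        try:
            return node(request)
        except NodeDown as err:
            last_error = err  # remember the failure and try the next node
    raise RuntimeError("all nodes failed") from last_error


# One node out of three is down; the request still succeeds.
nodes = [make_node("db-1", True), make_node("db-2", False), make_node("db-3", False)]
print(call_with_failover(nodes, "read user 42"))
```

The point of the sketch is the design choice, not the code itself: failure handling sits in the normal request path rather than in a rarely exercised disaster-recovery procedure.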

