In a recent podcast episode, I was lucky enough to talk with Niall Murphy about the role of Site Reliability Engineering (SRE). Niall contributed to the seminal SRE book and has many years of experience in the field, so it was an honour to get the opportunity to chat with him.
Humans have been thinking about better ways to operate things for millennia, but despite all of this effort and thought, running enterprise software operations well remains elusive for many organisations.
The underlying incentives for Development and Operations can seem to be at odds with each other: one wishes to make changes and add new features (Dev), whilst the other wants to ensure the product or service does not break (Ops). The catch is that changing the product increases the possibility of something breaking.
As a result of this realisation, many forms of gatekeeping (launch reviews, deep dives and checklists) have been put in place to ‘help’ mitigate the friction between the two parties, but these by no means solve the problem. It was very interesting to hear Niall share his experience with these problems, and how the SRE role and the philosophy behind it help to remedy them. Through the episode we were able to delve into some of the key components that make up SRE, from the value of having an Error Budget to the realisation that striving for 100% uptime is actually detrimental to the product itself!
You can listen to the episode in its entirety below, or by subscribing to the podcast.