Writing software with good test coverage is not enough to ensure the reliability of your services. Once you hit production and get real traffic, magic things usually happen!
You find bugs and services go down. In the worst scenarios you hear about it from your users, and if that's not enough, you might spend hours debugging or trying to reproduce an issue.
In this talk I'll focus on MTTR, one of the DORA metrics: the time it takes to restore a service. In other words, how long does it take to discover that one of your services is down or not working as expected, and to get it working again?
I'll share real-world experience on how to bring your MTTR down to something that doesn't frustrate you or your team, how to go from zero to hero, and how to get your services into a state where deploying to prod no longer feels scary.