Incident Management (class SRE implements DevOps)

Published Oct 2, 2018

Channel: Google Cloud Platform

In the previous video, Liz and Seth discussed how to make systems observable and how observability helps diagnose failing systems, but they didn't cover what to do when an incident grows beyond what one person can handle. In this video, you learn about the most important part of the incident management process: humans.

In the stressful moments of a system failure, it is important to define clear, concise roles for everyone involved in an incident. With too few people, responders quickly become overloaded with work; with too many, work may be duplicated (that is, too many hands on the keyboard). Learn how SREs effectively manage incidents with clearly defined roles and responsibilities, such as the operations lead, planning lead, communications lead, logistics lead, and more. Seth and Liz also discuss techniques for managing long-running and exponentially complex incidents.

Reach out to Liz and Seth:
https://twitter.com/lizthegrey
https://twitter.com/sethvargo

Watch more episodes from the playlist here → http://bit.ly/2PPL6f0
Subscribe to the Google Cloud Platform channel for more Cloud content → http://bit.ly/GCloudPlatform