Incident response in tech, but from the eyes of the customer.

It’s 2022, and companies still aren’t equipped to handle incidents well.

Incidents can best be defined as the breakdown of a well-oiled machine running on code in a carefully spun environment with complex configurations. Folks in the industry, developers, and customers alike know them all too well. An undesirable phenomenon so common these days that even after years of development, we haven’t figured out how to deal with it adequately. This post is not about how to solve incidents faster, but about how to handle them well, at least publicly.

There is plenty written, said, and talked about when it comes to technical incidents. There are cookbooks, manuals, blogs, and guides covering the length and breadth of the subject. But a tweet by Gergely Orosz triggered an interesting conversation on how incidents affect us as customers and how things could be better handled publicly. After a brief chat on Twitter, I decided to document our conversation here in the hope of presenting some ideas and having a wider conversation with the community. Hopefully, this helps companies rethink their incident management, response (public and private), and post-mortems.


Lack of incident leadership

Incidents need a spearhead who can cut across the organizational hierarchy and rally the team toward a solution. An incident leader isn’t appointed or hired. Instead, they are cultivated within the team: someone who can take ownership and ensure all stakeholders are well informed. It’s a tough role that often isn’t compensated, but it’s a role that makes or breaks product growth. Cultivating the right support culture and sense of ownership within the team can help create incident leaders, and can probably distribute this role among more than a few members. It’s much better than leaving all your customers anxious and confused.

Disconnect between engineering & customer support

Often, you reach out to a company’s support channels and the agents have no updates, or even knowledge, of the situation. The engineering and support teams frequently don’t communicate directly or stay in sync, so it’s not surprising to see incidents where one hand doesn’t know what the other is doing. While this happens, customers are left in the dark, opening even more support tickets and drafting angry tweets. Not good. At the very least, the team should be ready with a well-updated status page where customers can track the progress of the outage.
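To make that concrete, here is a minimal sketch of what pushing incident updates to a shared status page could look like. The endpoint, token, and payload fields are hypothetical stand-ins rather than any particular status-page provider’s API; the point is simply that the same update reaches both customers and the support team.

```python
# Minimal sketch of publishing an incident update to a status page.
# The endpoint, token variable, and payload fields are hypothetical
# placeholders, not any specific status-page provider's API.
import os
import requests

STATUS_PAGE_URL = "https://status.example.com/api/incidents"  # hypothetical endpoint
API_TOKEN = os.environ["STATUS_PAGE_TOKEN"]                    # keep secrets out of code


def post_incident_update(incident_id: str, status: str, message: str) -> None:
    """Publish a customer-facing update so support and engineering share one source of truth."""
    payload = {
        "incident_id": incident_id,
        "status": status,    # e.g. "investigating", "identified", "monitoring", "resolved"
        "message": message,  # plain-language summary customers can act on
    }
    response = requests.post(
        STATUS_PAGE_URL,
        json=payload,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        timeout=10,
    )
    response.raise_for_status()


# Example: the same wording goes to the status page that support agents point customers to.
post_incident_update(
    incident_id="2022-06-outage-01",
    status="identified",
    message="Deploys via the Git integration are failing; a fix is being rolled out.",
)
```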

No comms due to fear of bad PR

We have seen this many times, from government departments to ride-hailing services to supermarket chains to fintech startups. Time and time again, companies have tried to cover up, downplay, refute, or even ignore incidents entirely due to the risk of bad press, loss of revenue, and a tarnished public image. When pressured or held to ransom, companies release information publicly, but often it is vague, false, or useless copy meant to persuade customers that everything is still fine. This should stop. If failing to acknowledge a security breach is bad, then lying about it is worse. Examples through the years [1], [2], [3].

No accountability to prevent user confusion during/after an incident

Recent incidents like Heroku’s are a clear example. While I empathized with the team during the outage that lasted almost two months, it felt like a slap in the face that companies don’t even need to acknowledge outages on the dashboard, in email communication, or on project statuses. Banners on the dashboard still recommended using a broken feature/pipeline. Errors during build time didn’t point to the real reason the feature wasn’t working.

Personally, it took me hours to figure out what was wrong with my project deployed on Heroku in the initial days of the outage. It turns out Heroku doesn’t display unauthorized deployment attempts in the deploy logs, which is fine. But it’s not fine when the unauthorized deployments are actually caused by the GitHub pipeline that Heroku itself broke, and I had no information about it. The same has been true for both the recent Atlassian outage and the three-week-long AWS SageMaker outage.

The lack of acceptance: We messed up

This brings me to my next critical point. We rarely see responses from companies with messages containing the general gist of “we messed up”. I personally feel the official response can be a touch more human to trigger empathy among users. When things are on fire, a touch of humanity might just go a long way toward not losing a customer; mistakes happen in this line of business, and the damage is already done. Statements like “We are deploying an army to fix this” should be avoided. They signal gross incompetence and misuse of resources. Firstly, incidents handled well don’t need the entire team to jump in for the rescue. Secondly, where was “the army” when the product was being built in the first place, if it takes months to fix when it goes wrong? That’s always the thought in customers’ heads.

Taking control of the narrative quickly

Getting ahead of the incident and relaying as much “real” information in the updates as possible works out in the company’s favor. If you don’t have an update yet, provide a holding response. I definitely feel better when the status page reports accurate statuses and the team is indeed working on it. It’s a sense of relief and, most importantly, reassurance. Taking control of the narrative provides that sense of security and helps customers through a bad time as they, in turn, deal with their own customers, bosses, and stakeholders.


Let’s talk post-incident resolution

While the points above cover improvements before and during an incident, here are some pointers for companies to manage their post-incident response.

Publishing informed post-mortems

Post-mortems aren’t anything new. What could be improved is how they explain what happened: present the facts, provide the information, and briefly go over your next steps, the improvements you are planning, and the fixes you intend to implement in the short and long term. Later, add a section on your plans to eliminate the problem for good. These documents are promises of transparency and assurance, built upon the bedrock of trust that customers already have. Resources from Stripe and Atlassian (the irony, I know).

Getting feedback from customers to close the loop

An incident is an excellent way to judge, assess, and improve emergency response. Getting feedback from customers on which aspects of your response could have been better goes a long way toward showing that customers are cared for, especially when you implement or act on their feedback. Instead of blame-and-forget, how about making the completion of that feedback loop a priority after an incident?

Running Mock Drills to prepare

Just like disaster management drills, mock drills for outages, incidents, and security events can be an ideal practice session for the team to work together and solve a problem, especially when the team is asynchronous. Questions like: do you have emergency contacts for folks on the team? How do you mitigate the situation? When and how do you inform customers about updates? There need to be playbooks in place, updated on a regular basis. Every bit of due process helps when things begin to break. At the very least, it gives you a chance to find steps that can be automated, areas that can be improved, and opportunities to do better next time; a rough sketch of such a drill checklist follows below. Resources on running incident response drills: [1] and [2].
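As an illustration of the playbook idea, here is a small Python sketch of a drill checklist a team could walk through asynchronously. The scenario, questions, and roles are illustrative assumptions, not a prescribed format.

```python
# A small sketch of an incident-drill checklist. The scenario, items, and
# owners are illustrative only; a real playbook would be tailored to the team.
from dataclasses import dataclass, field


@dataclass
class DrillItem:
    question: str
    owner: str          # who answers this during the drill
    passed: bool = False
    notes: str = ""


@dataclass
class MockDrill:
    scenario: str
    items: list[DrillItem] = field(default_factory=list)

    def report(self) -> str:
        """Summarise which steps of the playbook held up and which need work."""
        lines = [f"Drill scenario: {self.scenario}"]
        for item in self.items:
            status = "OK" if item.passed else "GAP"
            lines.append(f"[{status}] {item.question} (owner: {item.owner}) {item.notes}".rstrip())
        return "\n".join(lines)


drill = MockDrill(
    scenario="Primary database unreachable for 30 minutes",
    items=[
        DrillItem("Do we have current emergency contacts for everyone on call?", "incident lead"),
        DrillItem("How do we mitigate or fail over?", "on-call engineer"),
        DrillItem("When and how do we update the status page and support team?", "comms owner"),
    ],
)

# After walking through the scenario, mark what worked and print the gaps.
drill.items[0].passed = True
print(drill.report())
```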

And that’s about it. It was great talking to Gergely about this; he intends to do a follow-up newsletter on the topic, so be sure to check that out. Till then, live in the mix.
