Delayed reporting of Major Software Incidents

Process Problem

A few years ago, there was a period where I ended up on the calls that followed whenever a Major Incident (MI) was declared against our software. I was amazed at the clear failings in the process. The calls were full of all kinds of managers, maybe a few people from Deployment, but barely anyone from Development.

Then you’d have these managers just theorising about whether the Incident was caused by hardware or software, and if it was software, which enhancement or bug fix could have caused it.

It was pure guesswork of course, and occasionally they were right, but what is the point of theorising when you could just bring in the people who can actually investigate it?

Sometimes I would hear about a Major Incident and then see that it had actually been logged days earlier. I think part of the problem is all the process between Support and Development. Support log incidents in their own system and talk about “customer reference numbers”, but we need them in our system, which is currently Azure DevOps, where we talk about Work Item numbers.

So sometimes there is a delay even logging it in Support, and then a further delay before it actually gets transferred over to our system. Often we are on these calls and we ask for a Work Item number so we can read the information for ourselves rather than wait for someone to mention it on the call… only to be told that a Work Item hasn’t been created yet.
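For what it’s worth, that hand-off gap feels like something that could be scripted. Below is a rough sketch (not anything we actually have in place) using the Azure DevOps REST API to raise a Bug Work Item the moment Support log an incident, so there is a Work Item number from the start; the organisation, project, token and field choices are just placeholders.

```python
# Rough sketch only: raise an Azure DevOps Bug as soon as Support log an incident,
# so Development have a Work Item number from the start.
# The organisation, project, token and field values below are placeholders.
import requests

ORG = "my-org"                  # placeholder
PROJECT = "my-project"          # placeholder
PAT = "personal-access-token"   # placeholder

def create_bug_from_support_ticket(customer_ref: str, summary: str, details: str) -> int:
    """Create a Bug work item and return its Work Item number (ID)."""
    url = (f"https://dev.azure.com/{ORG}/{PROJECT}"
           "/_apis/wit/workitems/$Bug?api-version=7.0")
    # Work item creation uses a JSON Patch document of field operations.
    patch = [
        {"op": "add", "path": "/fields/System.Title",
         "value": f"[MI] {summary} (Support ref {customer_ref})"},
        {"op": "add", "path": "/fields/System.Description", "value": details},
        {"op": "add", "path": "/fields/System.Tags", "value": "Major Incident"},
    ]
    resp = requests.post(
        url,
        json=patch,
        headers={"Content-Type": "application/json-patch+json"},
        auth=("", PAT),  # basic auth with a blank username and a PAT
    )
    resp.raise_for_status()
    return resp.json()["id"]

# e.g. create_bug_from_support_ticket("CRN-12345", "Dialog slow to open", "Full details…")
```

Something along those lines, wired into Support’s system, would at least mean we never join a call without a Work Item number to look at.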

I remember a time when users complained to Support, Support logged a Bug straight away and did a really good investigation, and it would come to us with full details. Now, users complain for weeks, it gets escalated up to the Directors, then comes back down through the Development managers to Colin, and we have barely any information to look into because the users haven’t provided it, or it has turned into “Chinese Whispers” and the information gets lost along the way.

It’s so frustrating, because how can you attempt to diagnose anything without all the required information? The call is pointless, and once you do have the information you don’t want to be on a call anyway – you want to go and start looking into it, diagnose it, then fix it.

So let’s look at one of the more recent examples I was involved in…

Example

Colin messages me about a Major Incident.

“I need some investigation…”

Colin

Apparently 179 sites are complaining about this problem and threatening to leave. Despite them complaining since the Friday release, Colin only hears about it on the Wednesday. Classic. Not the first time this has happened.

So what can I investigate? What even is the problem? All I know is that there is a problem with a certain feature, but the thing is, we integrate with several third-party APIs and I haven’t been told which one is involved, and I don’t know what problem the users are actually seeing.

I ask Colin to elaborate:

“not sure which provider. Can we look into all of them? So one of them may be broken, but we need to look at them all and hope we find it”

Colin

Wut. You want me to test them all out and hope I notice a problem? But it could involve just one of them and require specific steps to reproduce.

When I finally got some information, it was pretty sparse.

“between 2 minutes and 12 minutes for the dialog to pop up” 

the problem

I told them I would have to see it happen, and see if there is anything different the user is doing to make it go from 2 minutes to 12 minutes. It sounds like a network issue, or at least not our fault.

“we will test it with a happy…well not a happy user, but an ‘engaged’ user”.

Manager’s suggestion of arranging a call with a user to witness the problem

I was then on a call with all kinds of random managers, all chipping in with speculative accusations. “It’s a problem with the deployment”, “it’s a problem with the amount of data in their Tasks module”, “it’s a problem with the number of user-created resources”.

Colleague Opinion

I was discussing this with one of my colleagues and he reiterated my views:

"they red flag anything nowadays - they just wave their flags around
It’s always the case that someone knew several days ago, but now it needs fixing immediately and it’s the first time any developer has heard about it

why can it not go:
Problem logged by customer -> Support discuss with Release Management -> Release Management arrange call with the correct Team Leader -> Team Leader prioritises the work with a developer and tester

I don't see how that's difficult

What actually happens:
Problem logged by customer -> sits in a list for a while
Problem management person looks at issue -> sits in list for a while
Someone kicks off and escalates the issue -> Director gets it in the neck -> Release Management notified
Release Management hold a call -> float many theories about what the issue is about
Developer randomly hears about issue -> mistakenly joins call mid-flow -> explains that it's probably related to the work item with the same keywords as the issue in the title
Developer now late with their own work
Release Management say they'll prioritise the issue, but ask if you could start working on it -> Developer says no because they're already working on 8 different things -> everyone in a huff
Team Leader nowhere to be seen -> likely watching the TV"

Apparent Analysis

A manager recently posted an update with facts that I was sceptical of:

“Analysis conducted into 2 years worth of Major Incidents

Some interesting trends have appeared and I think it’s worth discussing:

  • Data shows that most of the MIs are caused by either a 3rd party or tech debt/ops related issues. The number attributed to code change is pretty low.
  • We are lacking ownership around domains for existing products so not obvious which Engineering Manager is accountable.
  • We don’t have enough SQL skills to manage our complex DB
  • We rely upon a small number of people to investigate MIs.
  • MIs should be owned with the right teams.
  • Further investigation required on this and changes to ownership, structure, investment needs to happen.”

Release at Pace

Another interesting claim came from a different manager:

“There’s lots of interesting developments going on at the moment; we’re delivering more than ever before.”

Manager claim

I message the Software Delivery Manager to ask if there is any truth in that, because I thought we were releasing at our slowest pace.

“Who knows? I reckon it’s one of those blasé statements that people just say to make it sound good”

Software Delivery Manager

In a recent Development meeting, one person posed a question to the Head of Development:

“Deploying software at speed is ultimately what we want to do as a business. However, this can come at a cost. We have had more Incidents in Q1 caused by software defects than we did in the whole of last year! How do we ensure the deployed code is at the quality required if we are going to deploy at speed?” 

Staff member’s concern about software quality

If we deliver more changes, I suppose there is more risk of introducing more bugs, and if you are rushing, you are more likely to create Major Incidents. I suppose you could argue it is fine to have more bugs overall, as long as they are all low severity.

The Rapid Responders Group

In response to this, I think Colin came up with the idea of the “Rapid Responders Group”.

“We are deploying our software at a pace we have never done before, so we can miss scenarios that could happen in the production environments, as we have seen in the last two MIs. If we can get information straightaway, we can investigate the issue straightaway. We all have access to the Live environment. With our technical ability, we might be able to see things that other people might not, so this is why I have assembled this team.”

Colin on his new idea

However, it was quickly shot down by one of our architects and never mentioned again.

“With one of the MIs, there was absolutely nothing we could’ve done between the people in this channel to anticipate that as it relates to a combination of live configuration and state that can’t be replicated.

In general, the problems we have with the MI process are that making changes takes too long due to processes and permissions, noise from the non-experts present (it’s hard to get a word in edgeways with some people), and multiple people bugging those performing the fixes/changes for updates (when it should be one person who fronts the technical team working on sorting it out)”

Architect

“We had first-hand experience of this on the last Major Incident call. We could have fixed it within 30 minutes but were on a call discussing it for hours.”

Colin
Architect:
It's usually the fault of those running the call or lack of confidence in Dev.
I will say, once we have enough information to make a fix, that I'm departing the call to concentrate. As you say, it will otherwise go on for hours, with multiple people fronting opinions on the cause of the problem, unless someone identifies it with confidence and takes the initiative.

When an MI occurs, it's always the same group of people, which is often a pretty good mix, but first we need to identify the cause.
 
Usually it takes a long time for the details of the MI to be logged, particularly somewhere all of Development can access, such as a Work Item. If you miss the start of the call, it's difficult to know what the details of the original issue were: you can guarantee you missed something of importance.
 
I would suggest the following to help the diagnosis run more smoothly:
 
  1. When a release is made, it must identify a point of contact for each change, preferably the email address of the team making the change.
  2. When an MI is started, a brief description must be provided in written form and a source-of-truth record started (such as a Work Item). 
  3. The Release Point Of Contact list is emailed with details of the MI starting.
  4. They might not need to join the call unless there is something they can add, but at least if they feel their input is valuable to the diagnosis they can join or reply with details.
  5. After identifying the team/person working on a fix, they are protected by a single manager. The manager will be responsible for communicating with the techies and the outside world, so they can concentrate. They will update the Work Item and email those who need to know progress, what the fix entails etc. 
  6. Work out a way to expedite changes to live and retrospectively log change requests.
Principal Developer:
From the MIs I've been involved in, they mostly run smoothly - the only disruption is (as Mike said) people "sticking their oar in" and having a guess at what the problem might be - sometimes forcefully.

They are not always wrong, but sometimes it can distract from getting to the root of the issue; you need to have the confidence to talk over them, get the information you need, correct misunderstandings, and keep the discussion on track. A lot of the time the initial presentation from Support can be way off the mark, but that is true of most bugs we get, due to the lack of technical skill in Support. Only the developers on the call are going to be able to diagnose a fault caused by a code change.

The primary focus of an MI call is to restore service ASAP, and exploring workarounds is part of the developer's role on the call - can we change a stored proc / setting to improve things now? This helps users and gives development more time to work on a fix. Another role is to assist the wider business (especially those in the Safety and Governance roles) to understand the issue - is it as bad as initially reported, or is it worse!?

Once we know what the problem is, and how we will resolve it, then it is OK to drop off and work on the fix.

Conclusion

I think it is clear that too much time is wasted discussing the problem with the wrong people. Managers need to find the correct group of technical people, give them all the information they need, then give them the time, without hounding them, to actually fix it.
