The Outage

Managing the Public Relations

We recently featured in the national newspapers due to an outage of our software, and surprisingly our system was down for around 12 hours. Usually we respond to outages really quickly and get the system back online within a few hours at most.

I sent the article to a few of my colleagues and they were unaware of the problem, so internally we seemed to be keeping quiet about it.

We had put out a message to users, but from what I saw in the aftermath, we initially claimed that only a specific region was affected, before people on social media pointed out it was affecting users nationwide – much to the derision of our main rivals, who put out multiple posts mocking us.

They deliver complaints, weekend working, major incidents, stress, insomnia, and hypertension, to name a few. Do their rivals deliver?

Customer quote

All Hands

5 days later, the Directors arranged an “All Hands” meeting with only a few hours’ notice, which caused panic amongst many colleagues because of the implied urgency. It was led by the CTO, who announced changes to our deployment process – though I’m not sure who he consulted. You would have thought he’d learned from the last time he announced changes to our release process, when we all ripped his idea to shreds on the same call, citing several technical and some legal reasons why we couldn’t possibly work that way. A much better approach would be to state the problem and let the experts discuss how to solve it, but yet again he seems to have had a meeting with the other Directors and taken the attitude of “this is how it’s going to be from now on”. And yet again, we all pushed back and told him why it wasn’t possible.

I love it when situations like these occur. The drama makes you want to grab the popcorn, sit back, and enjoy the show. Hearing all the different points of view and trying to understand the complexities of why things are the way they are is super interesting. When you are told to follow a Process and you don’t know the rationale behind it, it feels like a chore. But focussing on a problem and trying to solve it by forming a new or modified process is something you can get behind.

So like I said, not learning his lesson from before, he stated what the new process is and only gave a wishy-washy explanation of the problem he was trying to solve. After we all bombarded him with questions, the truth of the matter came out, and some of his ideas wouldn’t even have prevented the particular problem that caused the outage. So it seemed he was adding unnecessary processes, slowing us down and causing more stress. We already release far too slowly, so us Software Developers end up just chilling because we can’t ship the things we do fix or implement.

He described this situation as a

“major incident that impacted the whole estate, attributed directly to a failed Change. We recognise that the change was not intended to have the adverse impact that it did, but sadly the consequences have been a major blow to Users and us. Therefore, we are seeking to create immediate stability across our estate, and are implementing several amendments to the way Technology Changes are approved and implemented, as follows:

1. A comprehensive change freeze for the month of June, with only changes meeting enhanced criteria being passed for implementation.

2. Pre-approved changes are suspended

3. Any changes submitted for approval will require TWO members of SLT. 

4. Implementation activity must be witnessed by two or more staff members. Screen sharing technology should be used to witness the change. No additional activities are carried out that are not explicitly in the documentation.

5. All changes must have a comprehensive rollback plan, with proof of testing. The rollback plan must be executable within 50% of the approved change window.

Backout plans will be analysed by the Head Of Development to add more technically-informed scrutiny and assess the risk. Further budget will be allocated to DataDog for more visibility into problems with our servers. A dedicated team will improve how we alert internal teams and customers in the event of an incident.

What does a “Failed Change” even mean? That we deployed a database change that failed to apply due to a key constraint violation? That a new feature or bug fix completely broke something? It’s very vague. Then what does “We recognise that the change was not intended to have the adverse impact that it did” mean? Is that just stating the same thing? You aren’t gonna intend to have an adverse impact. It sounds like the sort of nonsense you put out to non-technical users just to blag an apology. Why is he saying this to the Development department? How are we supposed to understand the problem and help come up with something better if he is being deceptive?

So as I previously stated: “After we all bombarded him with questions, the truth of the matter came out”, so I’ll try and explain by piecing together the information I did gather.

Years ago we attempted to replace a module, and the replacement was feature-flagged. It turned out not to be significantly better than the thing it replaced and has its own set of issues, so some users use the new version (V2) and others stayed on the old one (V1). There was an experimental feature which hoped to improve performance for both V1 and V2, but it was presented as a sub-feature of V2 in the configuration tool’s user interface. So you have V2 on/off, and if it is On, there appears to be an option for Experimental Fix on/off. When the Deployment team came to turn on the experimental feature, they saw that the main feature (V2) was set to Off for some of the users, so they set it to On. Since enabling V2 triggers a download of loads of data into the V2 format, this caused a massive impact on the network, and so it actually affected everyone who wasn’t directly touched by the configuration change.
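To make the footgun concrete, here’s a minimal sketch of the flag hierarchy described above. All names are invented for illustration; the point is that modelling the fix as a child of V2 in the UI implied the parent flag had to be on, when the fix was actually independent of V1/V2.

```python
# Hypothetical flag state for a user still on V1 (names invented).
flags = {
    "module_v2": False,          # user is on V1
    "experimental_fix": False,   # shown in the UI as a sub-option of module_v2
}

def enable_fix_ui_style(flags):
    """What the Deployment team effectively did: the UI implied the parent
    flag had to be On first, so V2 was switched on for V1 users as a side
    effect -- triggering the huge V2 data download."""
    flags["module_v2"] = True        # the accidental, high-impact change
    flags["experimental_fix"] = True
    return flags

def enable_fix_independently(flags):
    """A safer model: the fix applies to both V1 and V2, so flipping it
    never touches the parent flag (and never triggers the migration)."""
    flags["experimental_fix"] = True
    return flags
```

Had the configuration tool modelled the experimental fix as a top-level flag, the deployment would have been a one-key change with no data migration attached.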

So would the CTO’s new process stop this happening?

New Rules – My View

“There will be a comprehensive change freeze for the month of June, with only changes meeting enhanced criteria being passed for implementation.”

The size of the release wasn’t the problem, so cutting it down won’t solve anything. It might annoy the users even more if we then delay features that we announced.
“Pre-approved changes are suspended”

I’m not sure of the full scope of what is covered by “pre-approved changes”. But the idea of a “pre-approved” change is that it is something often run on the live servers to fix common issues and is low risk, hence it is pre-approved (e.g. the ability to restart a crashed server/service). Suspending them is just going to annoy staff members in Deployment. The CTO also remarks:

 “Preapproved changes are wonderful. They have been reviewed and tested to death. My goal is to increase the number of preapproved changes in the future. It’s just with the existing ones, we don’t know if they have been reviewed or not”. 

You don’t know if they have been “reviewed”, but they have been run hundreds of times and never caused an issue. So you are temporarily banning them on the grounds that they could cause an issue?

“Any changes submitted for approval will require TWO members of SLT.”

How many times has there been some kind of approval process where the people with authorisation are too busy or on annual leave? Why are we going from zero approvers to two? Would the managers understand a change to enable a feature for users belonging to companies A, B and C? Would they go “hang on, C don’t have the main feature! I’m rejecting this”? It’s going to be a box-ticking exercise.

We already have a problem when changes are code-reviewed by Developers – there aren’t enough “expert” people who can review them in the required level of detail. So how would a manager understand the change and its technical impact? It will be more like “does this make us money? Yes, we like money”; approved.

“Implementation activity must be witnessed by two or more staff members. Screen sharing technology should be used to witness the change. No additional activities are carried out that are not explicitly in the documentation.”

This might actually help, although it might be patronising for Deployment. The CTO made a comment on the call about having “competent” people involved in the deployment process. If a Developer has to watch a member of Deployment click a few buttons, it feels like babysitting and not respecting them as employees.

“All changes must have a comprehensive rollback plan, with proof of testing. The rollback plan must be executable within 50% of the approved change window.”

The rollback idea sounds logical and great in theory, but it is the biggest concern for the technical people in Development, as we will discuss below.
My opinion

The Developer Feedback

On our current change approval form there is already a section for a rollback plan, but it isn’t properly filled in and is often just “revert change”. So one developer said all the change really needs is:

“People should be ensuring they have documented their tested process for rollback.” He then added: “In the case of feature-toggling, what happens when you do that? Do the users have a cache? How do we clear the cached data for all users? Can you force everybody to turn that feature off by forcing them to log out? If the change causes a performance impact, we need to be sure that turning something off actually turns it off and stops the performance impact right there and then.”
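The caching concern above is worth spelling out. Here’s an illustrative sketch (invented class, not our real config client) of why flipping a flag off doesn’t necessarily stop the feature: a client-side cache with a TTL keeps serving the stale “enabled” value until it expires.

```python
import time

class CachedToggles:
    """Illustrative toggle client with a TTL cache. Turning a flag off at
    the source does NOT take effect for users until their cached copy
    expires -- exactly the rollback gap the developer describes."""

    def __init__(self, source, ttl_seconds=300):
        self.source = source          # dict standing in for the config service
        self.ttl = ttl_seconds
        self._cache = {}              # name -> (value, fetched_at)

    def is_enabled(self, name, now=None):
        now = time.time() if now is None else now
        hit = self._cache.get(name)
        if hit and now - hit[1] < self.ttl:
            return hit[0]             # stale value can keep a "disabled" feature alive
        value = self.source.get(name, False)
        self._cache[name] = (value, now)
        return value
```

So a believable rollback plan for a toggle has to state how the cache is invalidated (short TTL, push invalidation, or forcing users to log out), not just “turn the flag off”.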

One person pointed out that the majority of our developers are now in India, and due to the time zone difference they aren’t around for the current Change Approval Board meetings, or available when we deploy.

Like many companies, we love deploying on Friday nights because that is considered downtime for most of our users. The problem is that the staff who would be useful to have around don’t work weekends unless they are paid to be on-call. Getting people to deploy late at night probably increases the chance of mistakes. Maybe we should deploy earlier in the week where possible. At the very least, don’t deploy on a bank holiday weekend, because that makes it even harder for staff to fix any issues.

Deployment staff member

Our test databases hold probably less than 5% of the data that our live databases do, so testing SQL is always a problem for us. What runs in 1 second could easily run for a minute at live scale. So if we come up with a rollback plan, our timings won’t be accurate at all.

Database Expert

With our C# code, you can easily roll back the version by redeploying it. The problem is the corresponding database changes. Our database patches are designed for sequential application: one script can create a table, and a later version may add a new column. You can’t just redeploy the old version to get back to a state where the column doesn’t exist – you have to write code to do the reverse and delete the column. And what if data was added after the deployment and before you realised the feature doesn’t work (and you need to roll back)? Do you just lose the data and tell the users “tough luck”, or do you archive the data, apply the revert patch, later fix the issue, reapply the original patch, then run another script to restore the archived data? It’s much more feasible to just disable the feature, apply a fix, then turn it back on. [We recently employed a new guy with the job title “Technical Director” and he was adamant a rollback had to go back to the original state at all costs: “Fix forward is not a valid risk mitigation strategy.” He doesn’t understand how we work, yet he is making these technical decisions.]
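The asymmetry the developer describes can be sketched in a few lines. This is a toy model (invented table and patch names, plain Python standing in for SQL): the forward patch is easy, but the reverse has to be hand-written, and it must archive the column’s data or silently destroy everything entered since go-live.

```python
# Toy stand-ins for a schema_version table, an orders table, and an archive.
applied_patches = []
orders = [{"id": 1, "amount": 100}]
archive = {}

def patch_42_up():
    """Forward patch: add a column (think ALTER TABLE orders ADD discount_code)."""
    for row in orders:
        row["discount_code"] = None
    applied_patches.append(42)

def patch_42_down():
    """Reverse patch -- redeploying the old app binary does NOT do this for
    you; it must be written and tested by hand. Archive the column first,
    otherwise any codes entered after go-live are simply lost."""
    archive["orders.discount_code"] = {r["id"]: r.pop("discount_code") for r in orders}
    applied_patches.remove(42)
```

Multiply that by every schema change in a release and the “comprehensive, tested rollback plan within 50% of the change window” demand starts to look like a project in its own right.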

Another developer reiterated that point, telling the new Technical Director he was talking nonsense:

When a change is implemented Out of Hours for the user, any problem may only emerge when users log on in the morning. If we roll back at that point, any overnight transactions potentially will be lost. This includes transactions on inbound interfaces from 3rd party systems. Rolling back at that point may be even more disruptive than hot-fixing the problem. How are we to handle this, in terms of process management and policies?

We have many products, especially with all the mergers/acquisitions we have had over the years. Each with their own processes.

[The CTO and Technical Director were talking as if we were all using the same process and tools which wasn’t the case.]

One concern was that when it comes to our mobile apps, the likes of Apple and Google have strict update procedures, where you have to submit the update and it gets approved in their own time. I don’t know how rollbacks work in that process. I think the implication was that feature flags are often used, so the app stays on the same version and remote configuration switches the new behaviour on.

Mobile developer

The rollback idea seems to be that if the change took 20 minutes, then it should be possible to roll it back within 10. Why? I am not sure – maybe it is just a rough guide to be “fast”. But doesn’t this assume that any failure is identified within the 20-minute change window? It might take a while for users to run into the scenario that causes the failure. If we patch during the early hours (1AM–3AM) and the users work 9–5, there is a time delay. Our main patching process takes a long time (possibly an hour – large file downloads over a slow network), so I’d expect a rollback process to also take a long time. And if users are actively using the system and only a specific feature is the problem, you don’t want to take the entire system down to patch it. You want a zero-downtime fix.

We need to be specific about what indicates a “failure”. They were talking about using Dashboards to look for evidence of failures, but there are cases where a feature doesn’t work yet doesn’t log an error – server calls might still return success. Also, if there is a problem, which rollback plan do you execute? If there are several features/bug fixes in the release, do you roll it all back, or do you spend time investigating which feature is at fault and roll back only that one?

We have a tendency to release with features switched off and enable them later. Many times, implementing a new feature forces you to change existing code, which may then contain bugs. So even when the feature is off, the release can still cause bugs. In this case I think we need to go through the Change Approval Board for the initial release, then again to actually switch the feature on.
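A small sketch of why “the flag is off, so the release is safe” doesn’t hold (example entirely invented): the new feature sits behind a flag, but adding it meant refactoring a shared helper, and that helper runs for every user regardless of the flag.

```python
# Hypothetical dark-launched pricing feature (names invented).
FEATURE_NEW_PRICING = False   # shipped off, to be enabled later via config

def _base_total(quantity, unit_price):
    """Shared helper that was refactored while adding the new feature.
    Any bug introduced here executes for ALL users, flag on or off --
    which is why the flag-off release still needs full approval."""
    return quantity * unit_price

def price(quantity, unit_price):
    total = _base_total(quantity, unit_price)
    if FEATURE_NEW_PRICING:
        total *= 0.9          # new discount path, dormant until the flag flips
    return total
```

Hence the two trips through the Change Approval Board: one for the release (the shared-code changes are live immediately) and one for the flag flip (the new behaviour itself).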

As part of the rollback process, have teams considered the overall impact this will have on estimates of deliverables, overall development and testing time, and how we are going to manage customers’ expectations? Depending on the project/feature, this could add an additional 50% development time to create the rollback and complete the testing of it.

Tester

After we made all these points, the Technical Director just dismissed our concerns.

“I think I’ve been clear on what a rollback plan is and what the expectation is of it. The onus is on the change proposer to demonstrate they have a Rollback plan”

Technical Director

Well, you’ve been clear of your expectations. We have all been clear that it simply isn’t possible to implement.

Architect

I think we need to change our architecture to support such plans. They want instant process change, but that needs to be planned in.

I asked a Software Architect what he thought:

“I don’t like how we’ve had a bunch of clowns that know how nothing works come up with a plan which makes little sense, is impractical, and is opposed to much of the industry’s direction. Patch forward is the way with databases; supporting rollback of code changes with tech like containers is the way to go for code.”

Architect

The Next Release

Since this process affects the next release, the team involved in the biggest feature raised their concern. Their change is super risky, so they would need to spend weeks coming up with a back-out plan and would therefore probably miss the release. The Technical Director said that it is:

“going ahead – but with the additional scrutiny and bolstered processes we’ve detailed. This isn’t about stopping – it’s about wrapping our change activity in more certainty. I’d personally like to understand “high risk” here though.”

Technical Director

I think this illustrates my initial points. He will be one of the people signing off on changes, and here is a proposed change without a rollback plan that he is already stating he is willing to sign off. Then the Lead Developer had to convince him it is actually one of the riskiest changes we have made this year.

“I’m not sure how it isn’t a high risk change. It’s massive, wide-ranging, and there are so many things that can go wrong. We are working hard to mitigate the risks but the risks are still there.”

Lead Developer

Since we had a project going into the next release, we arranged a meeting with the Head of Development (in regards to the statement “Backout plans will be analysed by the Head Of Development to add a more technically-informed scrutiny, and assess the risk”).

When we explained the difficulty of coming up with a rollback plan, he said, “why don’t you just add a feature flag?”. We told him we had, but that had been deemed unacceptable – the Tech Director and CTO had made that clear. He said he was fine with it, because nothing else made sense. So the Head of Development who is supposed to be scrutinising these plans doesn’t think the rollback requirement makes sense either. Brilliant.

Feedback

At our next meeting with the CTO, he asked how we had been adapting to the new process.

One guy said that a server was down for 3 hours and all it needed was for the service to be turned back on. Because Pre-Approved changes had been taken away, they weren’t allowed to restart it until 2 Senior Managers approved. They could only contact 1, and when they asked him if they could go ahead and turn it on anyway, he refused, because they had strict instructions not to deviate from this new process. The CTO was outraged. He came out with this amazingly hypocritical speech:

It is essential that people “engage brain” before following process. The idea that we can deliberately leave customers without service because some process says so - It's crazy. And if anyone disagrees, come and argue with me. This is incredibly frustrating for us as an organisation where people follow processes to the detriment of providing good service to our customers. Now I understand that there's no malice or deliberate reason to break something by following that process, but as you explain it, I can't imagine anyone on this call saying “Oh yeah, that was the right thing to do”. If we have to write a process that means people don't have to think, then we've lost the battle already, and people need to be able to think. The challenge I have in front of me is our customers, and their customers are not able to be helped. I can press this button and it will fix things, or I can wait until some process completes and then I'll press the button. At the very least an emergency phone call should be made to someone senior enough to make the decision and that should have proceeded straight away. We have heard of other examples of this. I want you to see how serious I am about how upsetting that is. You know, we employ intelligent people with a brain. We can't get away with automatons that just follow process. If we do that, we'll just automate everything. We need people that are intelligent enough to make good decisions, and when they can see that what is happening is not right, they escalate it until they get an answer.

I thought he said it would be a sackable offence if we deviated even slightly? Even his own Senior Management strictly followed the CTO’s process and refused to authorise the fix when it was escalated. We told him in advance it was a bad idea to get rid of Pre-Approved changes. We told him it was a bad idea to require 2 approvals from Senior Managers. He didn’t listen. He refused to “engage brain”. He caused another outage. The fact that the CTO admits it has actually happened a few times is hilarious.

Another team said they weren’t invited to the first meeting where the change was announced, and then trying to follow the process delayed them by 6 hours. The CTO admitted that one was an oversight on his part.

Another team said they were trying to do a demo for a customer and needed a configuration change applied. Even though it was a “Test System”, it had a dependency on the live infrastructure and so had to go through the new change approval process. They had to delay their demo and looked a bit incompetent in front of the customer.

A Technical Manager reiterated the point about not being able to roll back database changes; his team had wasted hours in meetings discussing how it could even be remotely possible. The CTO then said we were getting “hung up on terminology”: we know the ultimate aim is to get back to the previous state, so it doesn’t matter whether we call it “rollback” or “fix forward”, it just needs to work. But we were telling him there is a massive difference. A rollback literally puts you back into the previous state. Fix-forward means you add extra fixes on top of the thing that was broken, so users keep the new feature. Rollback carries the concern of losing data that was added between upgrading and rolling back. It’s a completely different approach. Then the CTO and the Technical Director said we can “do what we want as long as it gets us in a working state”.

So we can fix forward then? And we can deviate from the rollback plan?

Another person asked whether we “will go back to normal mode after July”. I think they miss the point. Surely the process is there to try and prevent major incidents like this from happening again – you don’t just do it for a few months then stop.

Refinement to the process

After a few weeks, they realised that having a limited pool of staff who can approve changes was a bad idea, especially when most of them had a trip to India booked for the same week. So they increased the pool of approvers.

Another week went by and they realised they had way too many approvals to get through, so they knocked the required number of approvers down to 1.

Another week went by and they realised most of the approvals were formerly the Pre-approved changes, so those got reinstated.

We are almost back to our old process!

Conclusion

I think I have lost all confidence in the CTO, and I think the new Technical Director is just as bad. How can you make decisions without involving the people actually doing the work, and then reject their concerns when they tell you how flawed the plan is!? And when you then get evidence of how badly it went, you blame your staff, telling them they need to “engage brain” and not follow your process when it wasn’t going to get our systems back up and running. We have been in this industry for years and haven’t had many long-term major incidents. Under the new CTO’s leadership, the number of major incidents has increased dramatically. I don’t think it’s necessarily down to specific decisions he has made, but some of the ideas he has come up with can’t be helping.
