Failing in different ways

I occasionally meet up with my old university mates. One friend works for a contracting company. He is a really good developer on a juicy £100k wage, which is crazy, and it shows how much money companies waste on short software contracts. Often they only have small, permanent development teams and hire temporary staff for extra capacity, or they fully outsource their software requests entirely.

Since he works on short contracts, often 3-6 months (sometimes up to a year), he has seen first-hand how many different companies operate.

I love having discussions with him because he is incredibly knowledgeable and always keeps up with the latest jargon. He knows all the Cloud and web-development buzzwords, the popular software tools, and the Agile processes.

He came out with this statement:

“It’s reassuring that you get to work with different companies doing different things, but they are all terrible. You are often asked to help – not because it’s going well, but because it’s going wrong.”

What he was saying is that companies have different philosophies when it comes to software. It could be Cloud only, on-prem only, strict Agile development, different levels of automation, etc., but whatever they do, it doesn’t quite work, and they will make baffling decisions.

So when I write about the baffling decisions where I work, it’s just that we are failing in a different way to other companies.

Angry Unicorn

I was getting a 503 error when trying to use Git. After searching the internet for solutions, I came across this: “unicorns have taken over Error – 503”. I wondered what that was all about. Was it one of those stupid error pages some companies do as a joke?

After a bit more searching, I came across another one, titled “Intermittent angry unicorn”, and it’s clearer what they are on about:

“I’ve been subject to the angry unicorn for over a week now.”

That angry unicorn sure does look fierce.

Printing Licence Key Expiry

My employer made the news recently after the deployment team applied the wrong config to several organisations which led to a heavy spike in network traffic and caused issues nationwide.

Sadly, we made the news again.

“We couldn’t print our forms for three hours. Someone had goofed and did not keep up with their subscriptions. For a company of their size; that is embarrassing.”

Customer quote

When users reported the issue, there was a call between a few managers. I was invited to the call with zero context by someone I had never spoken to before, so I thought there was a good chance they invited the wrong person. 

The call was recorded so I just listened to it when I was free. They had invited me on the suggestion that I had worked on a feature in that area many years ago, and they had no other ideas. They also called someone else during the call, and he remarked how strange it was, since they had never spoken to him before. Why didn’t they learn and send some context? He didn’t join either.

Eventually, they found someone from the team that procures the licences. He explained that they purchase licence keys for this printing software, then send them to an internal support team, who update the licence keys in the database. His team receives automated emails reminding them to renew the licence keys 3 months prior to expiry, and acts on them quickly so as not to risk the keys expiring.

“It takes a while to go through the purchase process, so I usually do it early, but sometimes it can ‘fall through’.”

Procurement guy

After going through some emails, they found the key had been promptly purchased and the licence keys sent to support, but they were never applied.

Another guy joined the call and said a total of 43 organisations had reported errors, but those are only the ones we know about through direct complaints seen on Facebook.

“And then I need to understand exactly how this is happening, ‘cause this is the second time in two weeks that a licence key was sent to support and wasn’t applied.”

Angry manager

After the issue was resolved, another manager asked for a summary of the issue. One guy remarked

“just to say, the people invited to the call (16 invited in total), are not the ones that ended up being involved or resolving.”

Manager, reflecting on what a mess the meeting was

I don’t understand how all our departments are causing chaos all at once. We don’t seem to learn our lesson either; how can we make the same mistake twice in two weeks? It’s also been a problem for years that we struggle to invite the correct people to major incident calls, so issues take much longer to resolve.

Datadog – The Smooth Out

I recently wrote about Datadog, which allows you to create dashboards to monitor servers. It’s the “cool” thing to use at work and the CTO is heavily promoting its use.

I discussed how it’s quite confusing to use, and there are certain limitations you have to bear in mind. I also suspected people were creating dashboards and assuming they worked because they showed some data – but when you try to verify the data, it turns out some of the dashboards were showing absolute nonsense.

One guy, who had been working with Datadog for months, only just noticed a problem with his dashboard.

“In our team, we monitor the response time of our API calls, and over the last 3 months, we’ve seen a remarkable increase. In May, we were seeing around 140ms, but now we are seeing 550ms.”

So I loaded up his graph and my default view was “1 Hour”. I switched it to “past 1 Day” to zoom out, and the graph looked a bit different to what I expected. The first graph shows 11-12, so on the second graph, just look at the far right at 11-12.

The first graph shows a spike around 11am at roughly 1.6 seconds, then drops down to 0.2 and remains very consistent. Yet the second graph seems to spike up to 1.15 seconds, then drops down to around 0.4, then finally spikes up to 0.6 at the end.

As I switched between the views, the numbers seemed to differ by larger margins.

I then moved to the day he mentioned, 7th July, picked a particular time (12:05), and made a note of the value as I switched between the views (although that exact time didn’t exist on all graphs, and I couldn’t be bothered re-checking them for more accuracy).

View: Recorded Value
15 mins: 179.65
1 hour: 176.51
4 hours: 156.1
1 day: 372.89
2 days (12:00 used instead): 552.93
1 week (12:00 used instead): 554.93
1 month (11:00 used instead): 550
3 months (1:00 used instead): 559

Values recorded at 12:05, Friday 7th July, except where noted.

He didn’t seem to be using any weird formulas, so why is the data so wildly different? I didn’t report my exact findings from the table, but another developer chimed in with this comment:

“Datadog’s long-term reporting is poor when it comes to averages. I would take a 1-4 hour window as a good sample size. Anything 1 day plus, the results are clearly not accurate”

Datadog user

His statement seems consistent with my analysis. So why is Datadog so hyped up in the business, when the people who use it don’t even think it reports accurate figures? Additionally, it sounds like those who have noticed have kept it to themselves rather than sharing the knowledge.

He then theorised that Datadog aggregates an average of its datapoints over a period of time, e.g. 24 hours, then plots that. To me, that doesn’t make much sense: if it has a sample of the exact value for a given time, e.g. 12:00, why would it need to take the average from 12:00 that day back to 12:00 the previous day? Especially when you could be monitoring traffic that is time-sensitive, e.g. a spike in the morning that lowers in the afternoon.

After searching on the Datadog documentation, we found this:

“As Datadog stores data at a 1 second granularity, it cannot display all real data on graphs. See metric aggregation for more details.

For a graph on a 1-week time window, it would require sending hundreds of thousands of values to your browser—and besides, not all these points could be graphed on a widget occupying a small portion of your screen. For these reasons, Datadog is forced to proceed to data aggregation and to send a limited number of points to your browser to render a graph.

For instance, on a one-day view with the ’lines’ display, there is one datapoint every 5 minutes. The Datadog backend slices the 1-day interval into 288 buckets of 5 minutes. For each bucket, the backend rolls up all data into a single value. For instance, the datapoint rendered on your graph with timestamp 07:00 is actually an aggregate of all real datapoints submitted between 07:00:00 and 07:05:00 that day.”

https://docs.datadoghq.com/dashboards/guide/query-to-the-graph/

That explanation sounds fine in theory. If the graph is showing each hour, then each point can be an aggregate of the previous hour. But that should mean it is a smoothed value. So if you have “zoomed” into a minute-by-minute view and see a very squiggly line of rapid but small fluctuations, then zoom out to the hour, the line should look fairly straight and each value should be an average of that time period. I don’t think it explains how my first graph, which probably averages just over 0.2s, showed 0.4s on the zoomed-out view.
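To make the bucketing idea concrete, here is a minimal sketch (my own illustration, not Datadog’s code) of rolling per-second datapoints up into buckets of different widths. The value plotted at a given timestamp is the average of whichever bucket that timestamp falls into, so the number you read off for the same moment can legitimately change as you zoom out:

using System;
using System.Linq;

class RollupDemo
{
    static void Main()
    {
        var rng = new Random(1);

        // Fake data: one hour of per-second response times with a 5-minute
        // spike at ~1.6s, then a steady ~0.2s baseline.
        double[] perSecond = Enumerable.Range(0, 3600)
            .Select(i => i < 300 ? 1.6 : 0.2 + rng.NextDouble() * 0.05)
            .ToArray();

        foreach (int bucketSeconds in new[] { 60, 300, 3600 })
        {
            // One plotted point per bucket: the average of the raw points inside it.
            double[] plotted = perSecond
                .Select((value, second) => (value, bucket: second / bucketSeconds))
                .GroupBy(p => p.bucket)
                .Select(g => g.Average(p => p.value))
                .ToArray();

            Console.WriteLine($"{bucketSeconds,4}s buckets: " +
                $"value plotted for the start of the hour = {plotted[0]:F2}s");
        }
    }
}

With 60-second buckets the start of the hour reads ~1.6s; with a single one-hour bucket the very same moment reads ~0.34s, purely because of how wide the bucket is.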

There’s this useless video from Datadog which is supposed to show how your graphs are “smoothed out” when zooming out. It has no sound so you have to interpret what they are showing you with their mouse:

At 12:04, they show a value of 13.32% in the 15-minute view, and at 12:05 you see it drop down to 9.07%.

They switch to the 4-hour view.

I’d say the 12:04 value is now 15.8%, but they start hovering over times around 11:17.

When they switch to 1 day, they start looking at data around 6:45, so clearly they have no idea what they are even doing. In this view, the times around 12:00 are ~14%.

With such small timescales, surely there shouldn’t be that much variance in the data. It is important to get accurate figures when looking at the micro level in sub-hour views, and when you zoom out, the figures should stay consistent.

IT Tales

Here is a collection of a few fails by our IT Department.

PC shutdown & Usage Monitoring

Even though we work from home, we still have some PCs in our office that we remote onto. Certain systems only seem to work when on the physical network, so people often call this a “jump box”. Our IT department was planning on temporarily moving our PCs whilst some electrical work was being done in the office. I was invited into a Teams chat which was supposed to include everyone affected. After skimming the list, I spotted 3 people who were missing, and other colleagues spotted more – 9 people were missing in total! How do they not know who owns the PCs? They have been citing “increased security” in recent times; surely it’s a security risk if they don’t know who uses the PCs on the network.

More recently, I was contacted again via email asking “if you use this PC”. Again, why do they need to ask? Isn’t it a security concern if they don’t know? Surely they should, especially when they have installed extra network security tools recently. I thought they had said the software monitors network traffic and alerts on anything suspicious.

Upgrading Software

I was contacted by IT saying my SQL Server version was no longer supported by Microsoft and is considered insecure, so I needed to urgently upgrade it by the end of the week. They said if I wanted an installer, I should reply. I thought it would be easy enough to locate the installer myself, but Microsoft’s SQL Server download pages are very confusing. So I replied asking for the installer. They ignored me. I replied again; they ignored me. Months have gone by. So not that urgent then.

IT then announced that they are taking increased security measures and removing all admin rights from our PCs. Now we can only install software with their permission. They also said it ensures we can’t install unlicensed software, since it is easy for someone to install software that is free for personal use but paid for commercial use, leaving the business liable.

A week later, they then email us saying there is a known security vulnerability with our Visual Studio version so we need to update it. We can’t though, we need admin rights to keep our software updated and secure! So now we have to log tickets, then they remote on and type in the admin password to proceed. I bet they love that.

In a similar fashion, they are more fussy with USB devices. They sent one of my colleagues a new laptop but it rejects his smart-card reader which he needs for testing. Can’t be plugging in USB devices these days.

Saving Money

They also said they wanted to be more stringent when it comes to licence keys, as we seem notorious for purchasing more licence keys than we need, or continuing to pay for software we have stopped using. I was contacted in early July 2022, saying that I have had a Jira licence for the last year but have not been using it:

We currently purchase a licence for you to access Jira. We understand a lot of the users will have now migrated to Azure DevOps and as such, your access may no longer be required.

May I kindly ask you to respond to this email by 12pm Friday 8th July confirming whether or not you continue to require access?

IT Email

So I replied saying I wasn’t using it and didn’t think I had used it for 2 years. I then got contacted again in February 2023 saying the same thing. I confirmed that I don’t need it. I then got contacted again earlier this month asking the same question. So I’ve had a licence for 3 years now for a product I don’t use at all.

The Outage

Managing the Public Relations

We recently featured in the national newspapers due to an outage with our software, and it was surprising that our system was down for around 12 hours. Usually we respond to outages really quickly, and get the system back online after a few hours max.

I sent the article to a few of my colleagues and they were unaware of the problem, so internally we seemed to be keeping quiet about it.

We had put out a message to users, but from what I saw in the aftermath, we initially claimed that only a specific region was affected, before others on social media stated it was affecting people nationwide, much to the derision of our main rivals, who posted multiple social media posts mocking us.

They deliver complaints, weekend working, major incidents, stress, insomnia, and hypertension, to name a few. Do their rivals deliver?

Customer quote

All Hands

5 days later, the Directors arranged an “All Hands” meeting with only a few hours’ notice, which caused panic amongst many colleagues because of the implied urgency. It was led by the CTO, who was announcing changes to our deployment process, though I’m not sure who he consulted. I would have thought he would have learned from the last time he announced changes to our release process – when we all ripped his idea to shreds on the same call, stating several technical and some legal reasons why we couldn’t possibly work that way. A much better approach would be to state the problem and let the experts discuss how to solve it, but yet again, he seems to have had a meeting with the other Directors and taken the attitude of “this is how it’s going to be from now on”. And then we all pushed back and told him how it wasn’t possible.

I love it when situations like these occur. The drama makes you want to grab the popcorn, sit back, and enjoy the show. Hearing all the different points of view and trying to understand the complexities of why things are the way they are is super interesting. I think when you are told to follow a Process and you don’t know the rationale behind it, it feels like a chore. But focussing on a problem and trying to solve it by forming a new or modified process is something you can get behind.

So, like I said, not having learned his lesson from before, he stated what the new process was and gave only a wishy-washy explanation of the problem he was trying to solve. After we all bombarded him with questions, the truth came out, and some of his ideas wouldn’t even have solved the particular problem that caused the outage. So it seemed he was adding unnecessary process, slowing us down and causing more stress. We already release far too slowly, so us Software Developers end up just chilling because we can’t get out the things we do fix or implement.

He described this situation as a

“major incident that impacted the whole estate, attributed directly to a failed Change. We recognise that the change was not intended to have the adverse impact that it did, but sadly the consequences have been a major blow to Users and us. Therefore, we are seeking to create immediate stability across our estate, and are implementing several amendments to the way Technology Changes are approved and implemented, as follows;

1. A comprehensive change freeze for the month of June, with only changes meeting enhanced criteria being passed for implementation.

2. Pre-approved changes are suspended

3. Any changes submitted for approval will require TWO members of SLT. 

4. Implementation activity must be witnessed by two or more staff members. Screen sharing technology should be used to witness the change. No additional activities are carried out that are not explicitly in the documentation.

5. All changes must have a comprehensive rollback plan, with proof of testing. The rollback plan must be executable within 50% of the approved change window.

Backout plans will be analysed by the Head Of Development to add a more technically-informed scrutiny, and assess the risk. Further budget will be allocated to DataDog, for more visibility in problems with our servers. A dedicated team will improve how we alert internal teams and customers in the event of an incident.”

What does a “Failed Change” even mean? That we deployed a database change which failed to apply due to a key constraint violation? That a new feature or bug fix completely broke something? It’s very vague. Then what does “We recognise that the change was not intended to have the adverse impact that it did” mean? Is that just stating the same thing? You aren’t going to intend to have an adverse impact. It sounds like the sort of nonsense you put out to non-technical users just to blag an apology. Why is he saying this to the Development department? How are we supposed to understand the process and help come up with something better if he is being deceptive?

As I previously stated, “after we all bombarded him with questions, the truth of the matter came out”, so I’ll try to explain by piecing together the information I did gather.

Years ago we attempted to replace a module, and this was feature-flagged. It turned out not to be significantly better than the thing it was replacing and has its own set of issues, so some users use the new feature (V2) and others still use the old one (V1). There was an experimental feature which was hoped to improve performance for both V1 and V2, but it was placed as a sub-feature of V2 in the user interface of the configuration tool. So you have V2 on/off, and if it is On, there then appears to be an option for the Experimental Fix on/off. When the Deployment team came to turn on the experimental feature, they saw that the main feature (V2) was set to Off for some of the users, so they set it to On. Since enabling V2 then downloads loads of data to convert it to the V2 format, this caused a massive impact on the network, and so affected even the users who weren’t directly touched by the configuration change.
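A rough sketch of the flag hierarchy as I understood it (the names here are mine, not the real configuration tool’s):

// Hypothetical illustration only – invented names, not our real config schema.
public class ModuleConfig
{
    // The V1 -> V2 module replacement. Still not clearly better than V1,
    // so plenty of organisations have this Off.
    public bool UseV2 { get; set; }

    // Performance fix intended for both V1 and V2 users, but exposed in the
    // config tool as a sub-option of UseV2, so it looks like it needs V2 = On.
    public bool ExperimentalFix { get; set; }
}

// What effectively happened: to tick ExperimentalFix, Deployment also flipped
// UseV2 to On for organisations still on V1, which kicked off the V2 data
// download for all of them and swamped the network.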

So would the CTO’s new process stop this happening?

New Rules – My View

1. “There will be a comprehensive change freeze for the month of June, with only changes meeting enhanced criteria being passed for implementation.”

My view: The size of the release wasn’t the problem, so cutting it down won’t solve anything. It might annoy the users even more if we then delay features we have already announced.

2. “Pre-approved changes are suspended.”

My view: I’m not sure of the full scope of what counts as a “pre-approved change”, but the idea is that it is something often run on the live servers to fix common issues and is low risk, hence it is pre-approved (e.g. the ability to restart a crashed server/service). This is just going to annoy staff in Deployment. The CTO also remarked:

“Preapproved changes are wonderful. They have been reviewed and tested to death. My goal is to increase the number of preapproved changes in the future. It’s just with the existing ones, we don’t know if they have been reviewed or not.”

You don’t know if they have been “reviewed”, but they have been run hundreds of times and never caused an issue. So you are temporarily banning them on the grounds that they could cause an issue?

3. “Any changes submitted for approval will require TWO members of SLT.”

My view: How many times has there been some kind of approval process where the people with authorisation are too busy or on annual leave? Why are we going from 0 approvers to 2? Would the managers understand a change to enable a feature for users belonging to companies A, B and C? Would they go “hang on, C don’t have the main feature! I’m rejecting this”? It’s going to be a box-ticking exercise.

We already have a problem when changes are code reviewed by Developers – there aren’t enough “expert” people who can review them in the required level of detail. So how would a manager understand the change and its technical impact? It will be more like “does this make us money? Yes, we like money”; approved.

4. “Implementation activity must be witnessed by two or more staff members. Screen sharing technology should be used to witness the change. No additional activities are carried out that are not explicitly in the documentation.”

My view: This might actually help, although it might be patronising for Deployment. The CTO made a comment on the call about having “competent” people involved in the deployment process. So if a Developer has to watch a member of Deployment click a few buttons, it feels like babysitting and not respecting them as employees.

5. “All changes must have a comprehensive rollback plan, with proof of testing. The rollback plan must be executable within 50% of the approved change window.”

My view: The rollback plan is one of those ideas that sounds logical and great in theory, but it is the biggest concern for the technical people in Development, as we will discuss below.

The Developer Feedback

On our current change approval form, there is already a section for a rollback plan, but it isn’t properly filled in and is often just “revert change”. So one developer said all that section really needs is:

“People should be ensuring they have documented their tested process for rollback.” He then added: “In the case of feature-toggling, what happens when you do that? Do the users have a cache? How do we clear the cached data for all users? Can you force everybody to turn that feature off by forcing them to log out? If the change causes a performance impact, we need to be sure that turning something off actually turns it off and stops any performance impact right there and then.”

One person pointed out that the majority of developers are in India now, and due to the time zone difference, they aren’t around for the current Change Approval Board meetings, or available when we deploy.

Like many companies, we love deploying Friday nights because that is considered downtime for most of our users. The problem is, staff that would be useful to have around – don’t work weekends unless they are paid to be on-call. Getting people to deploy late probably increases the chance of mistakes. Maybe we should deploy earlier in the week if possible. At least don’t deploy on a bank holiday weekend because that is even worse for staff to fix any issues.

Deployment staff member

Our test databases have probably less than 5% of the data that our live databases have, so testing SQL is always a problem for us. What runs in 1 second could easily run for a minute with the scale of data. So if we come up with a rollback plan, our timings won’t be accurate at all.

Database Expert

With our C# code, you can easily roll back the version by redeploying the previous one. The problem is the corresponding database changes. Our database patches are designed for sequential application, e.g. one script creates a table, a later version adds a new column. You can’t just re-deploy the old version to get back to a state where the column doesn’t exist; you have to write code to do the reverse – delete the column. What if data has been added after the deployment and before you realised the feature doesn’t work (and you need to roll back)? Do you just lose the data and tell the users “tough luck”, or do you archive the data, apply the revert patch, later fix the issue, reapply the original patch, then run another script to restore the archived data? It’s much more feasible just to disable the feature, apply a fix, then turn it back on. [We recently employed a new guy with the job title “Technical Director” and he was adamant a rollback had to go back to the original state at all costs: “Fix forward is not a valid risk mitigation strategy.” He doesn’t understand how we work and is making these technical decisions.]

Another developer reiterates that point to tell the new Technical Director that he is talking nonsense:

When a change is implemented Out of Hours for the user, any problem may only emerge when users log on in the morning. If we roll back at that point, any overnight transactions potentially will be lost. This includes transactions on inbound interfaces from 3rd party systems. Rolling back at that point may be even more disruptive than hot-fixing the problem. How are we to handle this, in terms of process management and policies?

We have many products, especially with all the mergers/acquisitions we have had over the years, each with their own processes.

[The CTO and Technical Director were talking as if we were all using the same process and tools which wasn’t the case.]

One concern was that when it comes to our mobile apps, the likes of Apple and Google have strict update procedures, where you have to submit the update and it is approved in their own time. I don’t know how rollbacks work in this process. I think the implication was that feature flags are often used so the app stays on the same version, and remote configuration switches the new functionality on.

Mobile developer

The rollback idea seems to be that if the change took 20 minutes, then it should be possible to roll it back within 10 minutes. Why? I am not sure. Maybe it is just a rough guide to be “fast”. But doesn’t this assume that any failure has been identified within the 20-minute change window? It might take a while for users to run into the scenario that causes such a failure. If we patch during the early hours (1AM-3AM) and the users work 9AM-5PM, there is a time delay. Our main patching process takes a long time (possibly 1 hour, with large file downloads over a slow network), so I’d expect a rollback process to also take a long time. If users are actively using the system and only a specific feature is a problem, then you don’t want to take the entire system down to patch; you want a zero-downtime fix.

We need to be specific about what indicates a “failure”. They were talking about using dashboards to look for evidence of failures, but there can be cases where a feature doesn’t work yet doesn’t log an error, and the server calls still succeed. Also, if there is a problem, which rollback plan do you execute? If there are several features/bug fixes in the release, do you roll the whole thing back, or do you spend time investigating which feature caused it and roll back only that?

We have a tendency to release with features switched off and then enable them later. Many times, implementing a new feature requires changing existing code, which may then contain bugs – so even when the feature is off, it can still cause problems. In this case, I think we need to go through the Change Approval Board for the initial release, then again to actually switch the feature on.

As part of the rollback process, have teams considered the overall impact this will have on estimates of deliverables, overall development time and testing time, and how we are going to manage customers’ expectations? Depending on the project/feature, this could add an additional 50% development time to create the rollback and to complete the testing of the rollback.

Tester

After we made all these points, the Technical Director just dismissed all our concerns.

“I think I’ve been clear on what a rollback plan is and what the expectation is of it. The onus is on the change proposer to demonstrate they have a Rollback plan”

Technical Director

Well, you’ve been clear about your expectations. We have all been clear that it simply isn’t possible to implement.

Architect

I think we need to change our architecture to support such plans. They want instant process change. But this needs to be planned in.

I asked a Software Architect what he thought:

“I don’t like how we’ve had a bunch of clowns that know how nothing works come up with a plan which makes little sense and is impractical, and opposed to much of industry direction. Patch forwards is the way with databases, supporting rollback of code changes with tech like containers is the way to go with that”

Architect

The Next Release

Since this process affects the next release, the team involved in its biggest feature raised their concern. Their change is super risky, so they would need to spend weeks coming up with a back-out plan and would therefore probably miss the release. The Technical Director said that it is:

“going ahead – but with the additional scrutiny and bolstered processes we’ve detailed. This isn’t about stopping – it’s about wrapping our change activity in more certainty. I’d personally like to understand “high risk” here though.”

Technical Director

I think this illustrates my initial points. He will be one of the people involved in signing off on changes, and here is a proposed change without a rollback plan, and he is already stating he is willing to sign off on it. Then the Lead Developer has to convince him it is actually one of the riskiest changes we have made this year.

“I’m not sure how it isn’t a high risk change. It’s massive, wide-ranging, and there are so many things that can go wrong. We are working hard to mitigate the risks but the risks are still there.”

Lead Developer

Since we had a project going into the next release, we arranged a meeting with the Head of Development (in regards to the statement “Backout plans will be analysed by the Head Of Development to add a more technically-informed scrutiny, and assess the risk”).

When we explained the difficulty of coming up with a rollback plan, he said “why don’t you just add a feature flag?”. We told him we had, but that had been deemed unacceptable – the Tech Director and CTO had made that clear. He said he was fine with it, because it doesn’t make sense otherwise. So the Head of Development who is supposed to be scrutinising these plans even thinks this rollback requirement doesn’t make sense. Brilliant.

Feedback

At our next meeting with the CTO, he asked how we had been adapting to the new process.

One guy said that a server was down for 3 hours and all it needed was for the service to be turned back on. Because the Pre-Approved changes had been taken away, they weren’t allowed to turn it on until they got 2 Senior Managers to approve. They could only contact one, and when asked if they could go ahead and turn it on anyway, he refused because they had strict instructions not to deviate from the new process. The CTO was outraged, and came out with this amazing hypocritical speech.

It is essential that people “engage brain” before following process. The idea that we can deliberately leave customers without service because some process says so - It's crazy. And if anyone disagrees, come and argue with me. This is incredibly frustrating for us as an organisation where people follow processes to the detriment of providing good service to our customers. Now I understand that there's no malice or deliberate reason to break something by following that process, but as you explain it, I can't imagine anyone on this call saying “Oh yeah, that was the right thing to do”. If we have to write a process that means people don't have to think, then we've lost the battle already, and people need to be able to think. The challenge I have in front of me is our customers, and their customers are not able to be helped. I can press this button and it will fix things, or I can wait until some process completes and then I'll press the button. At the very least an emergency phone call should be made to someone senior enough to make the decision and that should have proceeded straight away. We have heard of other examples of this. I want you to see how serious I am about how upsetting that is. You know, we employ intelligent people with a brain. We can't get away with automatons that just follow process. If we do that, we'll just automate everything. We need people that are intelligent enough to make good decisions, and when they can see that what is happening is not right, they escalate it until they get an answer.

I thought he said it would be a sackable offence if we deviated even slightly? Even his Senior Management was strictly following the CTO’s process and refused to authorise it when escalated. We told him in advance it was a bad idea to get rid of Pre-Approved changes. We told him it was a bad idea to require 2 approvals from Senior Managers. He didn’t listen. He refused to “engage brain”. He caused another outage. The fact that the CTO admits it has actually happened a few times as well is hilarious.

Another team said they weren’t invited to the first meeting when the change was announced, and then trying to follow the process delayed them by 6 hours. The CTO admitted that one was an oversight on his part.

Another team said they were trying to do a demo for a customer but needed a configuration change applied. Even though it was a “Test System”, it had some dependency on the live infrastructure and so had to go through the new change approval process. They had to delay their demo and looked a bit incompetent in front of the customer.

A Technical Manager reiterated the point about not being able to roll back database changes; his team had wasted hours in meetings discussing how it could even be remotely possible. The CTO then said we were getting “hung up on terminology” and that the ultimate aim is to get back into the previous state, so it doesn’t matter whether we call it “rollback” or “fix forward”; it just needs to work. But we were telling him there is a massive difference. A rollback literally puts you back into the previous state. Fix-forward means you add extra fixes on top of the thing that was broken, so users are left with the new feature. Rollback carries the concern of losing data that was added between upgrading the version and rolling back. It’s a completely different approach. Then the CTO and the Technical Director said that we can “do what we want as long as it gets us in a working state”.

So we can fix forward then? And we can deviate from the rollback plan?

Another person asked “will we go back to normal mode after July?”. I think they miss the point. Surely the process is there to try and prevent major incidents like this from happening again; you don’t just do it for a few months then stop.

Refinement to the process

After a few weeks, they realised that having a limited number of staff that can approve changes was a bad idea, especially when most of them had a trip to India booked for the week. So they increased the pool of people that can approve.

Another week went by and they realised they had way too many approvals to go through, so knocked the number of approvers down to 1.

Another week went by and they realised most of the approvals were formerly the Pre-approved changes, so those got reinstated.

We are almost back to our old process!

Conclusion

I think I have lost all confidence in the CTO, and I think the new Technical Director is just as dumb. How can you make decisions without involving the people actually involved in the process, and then reject their concerns when they tell you how stupid it is? Then, when you get evidence of how stupid it was, you blame your staff and tell them they need to “engage brain” and not follow your process if it wasn’t going to solve the problem of getting our systems up and running. We have been in the industry for years and haven’t had many instances of long-term major incidents. Under the new CTO’s leadership, the number of major incidents has increased dramatically. I don’t think it’s necessarily down to decisions he has made, but some of the ideas he has come up with can’t be helping.

Commit Messages

As a developer, you need to “commit” your work into source control. When you do this, you provide a message that describes the change. A bug fix or enhancement can consist of a few commits. When you use a source control system like Git, you can even rewrite the history using squash/rebase commands.
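For example (standard Git usage, nothing specific to our setup), you can squash the last few work-in-progress commits into one tidy commit before sharing them:

# interactively rebase the last three commits; mark the later ones as "squash"
# and you'll be prompted to write a single, tidier commit message
git rebase -i HEAD~3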

When I am writing code for my job, I often put a bit of thought into the wording to make it clear and professional. If it is for a personal side-project, I sometimes write meaningful messages, but sometimes I group multiple fixes together and come up with something generic like “Bug fixes”, or commit more experimental work-in-progress under a name like “attempt” or “x feature part 1”.

It’s quite frustrating, though, to see colleagues write generic messages like “bug fix”, which don’t describe what is being fixed or how. Seeing messages littered with spelling mistakes is also annoying and unprofessional.

Examples include:

  • “EventNotificationBugFix After Resolving James’ Comment”
  • “bug resolved”
  • “Dev test changes from tester” (literally what does that mean?)
  • Updated the findCorrectCodeSp to findCurrectCode.
  • Taken off completly the fix Amanda done for the fix for 25477 and fixed. Included the fix for 8017 as well
  • Fix for SQLCrop issues (should be SQL Cop, our Code Analysis)
  • Fioxed further typos (ironically creating more typos when fixing typos)
  • fixed the failed unit testes (testes rather than tests. Brilliant)
  • “Post Christ’ comments re coding standards” (it’s the second coming of Christ to teach us coding standards! They meant to write Chris.)

There was a guy who worked in our short-lived Scotland office who sounded like an absolute nutcase; I have never seen someone take their job less seriously:

  • instructions unclear, got dick stuck in tfs 
  • what a waste of eyes
  • but fuck wasps
  • be nice to bees
  • what if there were landmines in dinosaur times

A colleague recently showed me this website: https://whatthecommit.com/. I don’t know if the messages are based on real ones, but it shows you a new one every time you refresh. Most are pretty basic, along the lines of “does this work”, but there are some more outlandish ones if you persevere and refresh many times.

One of my team members recently submitted a change labelled “Crash when cancelling out of the dialog”. That describes the bug he fixed, rather than what he changed. Another team member provided the following good advice:

The change looks good, but your commit message isn’t describing the change. Could you just reword it so that it completes the sentence “When applied, this commit will…” please. Another way of looking at it is that every commit should read like an instruction in a to-do list. I’d use something like “Return empty collection if user cancels out of the dialog”.

Senior Developer

A stricter idea of commit messages is described here: https://www.conventionalcommits.org/en/v1.0.0/

One of our Principal Developers loves linking people to this guide: https://cbea.ms/git-commit/. One interesting idea is to “Use the imperative mood in the subject line”, whereas most people write in the past tense.
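To illustrate (my own example, not taken from either guide), the dialog fix above, written in the imperative mood and Conventional Commits style, might look like:

fix(dialog): return empty collection when user cancels

Previously, cancelling out of the dialog caused a crash. Returning an empty
collection instead lets the calling code carry on safely.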

Having a clear history of file changes helps you when it comes to finding which change caused an issue, and also gives you a better understanding why it changed.

Pride At Work

During Pride month, there were a few Yammer (now known as Viva Engage) posts about LGBT issues. One guy wrote a blog post about how gay people were denied the opportunity to donate blood until recently. It was informative, but I did think it was a weird thing to post at work – given that the count of the word “sex” reached double figures and it contained the phrase “anal sex” along with other sexual references.

If you take that out of the context of “pride”, wouldn’t discussing or writing about sex at work result in you being on a call with a member of HR?

I discussed it with a few of my colleagues. One guy said he thought the poster “had crossed the line with his phrasing and could have easily worded it in a less explicit way”. Another colleague stated that “although I support Pride, I don’t feel I should be reading about it at work”. That is actually a good point: although there can be important social issues in the world, if it has nothing to do with work, why are we reading or talking about it when we should be working? I’m sure there was even some policy we had to agree to that said you couldn’t discuss religion or politics, because if someone had different beliefs to you, they may feel excluded.

It made me think that, because LGBT is the current hot topic, it trumps all existing work policies and you aren’t allowed to say anything against it. This is even more contentious when this particular topic could be against someone’s religious beliefs (we do employ a significant number of Muslims, and a certain number of colleagues could have opposing views regardless of religion).

To conclude Pride month, a member of HR posted the following:

“Lots of events take place throughout June every year to celebrate the LGBTQ+ community and all the progress that has been made across legislation, attitudes and behaviours.
Personally, one reason I find these events so wonderful is because they bring together people of all ages and I see so many families attending together with children – what better way to encourage change than to teach children about positive attitudes and behaviours and set a great example for them.”

HR staff member

I laughed out loud when I read that. I really wanted to respond, but thought I’d end up being unfairly sacked. So I wrote this blog instead.

Maybe the average person hasn’t heard about all the controversies this year, but recently I’ve spent a lot of time on Twitter and been watching a lot of Daily Wire content. I suppose the more of this stuff you view on Twitter, the more it recommends similar content, so if you have any hint of an opinion, it becomes stronger through “confirmation bias”. I’ve generally been interested in conspiracy theories and hot debates, so Twitter has pushed a lot of this content to my feed.

Don’t get me wrong here, I’m not against LGBT in general, but I am opposed to it being directed at kids (which a lot of people, the likes of Daily Wire among them, are making content about), and Twitter seemed to like showing me everything that Gays Against Groomers were Tweeting, and that’s their purpose.

[Image: “Oh, won’t somebody please think of the children” – The Simpsons meme]

So let’s go through some examples of what I am referring to. If I remember correctly, the first controversy was a “family-friendly” Pride event where gay people in fetish gear were being whipped on top of an open-top car. The next was a photograph of a curious girl of about 6 years old who had approached 2 guys wearing that dog-themed leather bondage gear. The point is that this content should only be encountered if you go out of your way to visit an 18-rated website. Instead, people were at a public event they knew kids would attend, dressing up and even simulating these acts.

I actually only came across that particular fetish because a colleague mentioned that a former male colleague had an OnlyFans with his boyfriend, and it was the company’s discovery of this that had forced him to leave the business. Given that the colleague telling me this had a reputation for exaggerating and lying, I asked him to prove it, and he linked me to the pages. He was telling the truth 😱😳

If my employer really is fine with this gay fetish aspect, then why was our former colleague sacked? Probably some hypocrisy there.

So I only learned about this fetish attire by going out of my way to the dark side of the internet, and yet here we have members of our HR department stating they “find these events so wonderful … because they bring together people of all ages and I see so many families attending together with children – what better way to encourage change than to teach children”. I find this sentiment being echoed among many who are presumably scared of being labelled a bigot for speaking out about it.

Some YouTubers stated that when they made content using such Pride footage, they were labelled as “adult content”. How can a “Family-friendly” event be adult content? Oh because it is adult content!

It’s considered a faux pas to criticise Pride, yet if the same thing happened outside the context of Pride, people would call those involved a “nonce”/“sex offender” and demand they be locked up for public indecency. This is what the group Gays Against Groomers stands for: they are against grooming kids and against exposing children to 18-rated content. Yet they posted videos of their van parked at a Pride event, and people were coming up to it and spitting on it. That’s right, people are openly fine with grooming kids these days. We used to want to protect kids at all costs, and we seem to have lost that over the last few years in pursuit of wokeness.

There was even the controversy with the Twitch streamer NickMercs, who Tweeted “They should leave little children alone. That’s the real issue” (it was in the context of a vote to celebrate Pride at a school), after which Activision removed his character “skin” from the game “Call of Duty: Modern Warfare II | Warzone”. This resulted in a minor boycott/review bomb, and people mocked Activision with the phrase “Call of Groomers”. How far has society fallen if stating “leave little children alone” is considered a controversial statement?

To go back to the first thing the HR staff member said: “Lots of events take place throughout June every year to celebrate the LGBTQ+ community and all the progress that has been made across legislation, attitudes and behaviours.” Progress made? In addition to the examples of Pride becoming fetishised, you also had the transwoman who exposed their breasts at the White House, puberty blockers being banned in the UK, the boycott of Bud Light in the US over the promotion with Dylan Mulvaney, the boycott of Target for stocking chest binders (which tanked their share price), the banning of Drag Queen events, men identifying as women to avoid men’s prison, and more people speaking out against transwomen in sports. So the Trans community has taken hits in their PR this Pride month.

There was also the incident with Billboard Chris, where he was speaking to someone about how it is wrong to give puberty blockers to children, when a transwoman began screaming obscenities repeatedly in his face. Chris did his best to ignore her, until he got punched in the face. Despite several police officers witnessing it and the event being caught on camera, the police refused to prosecute the assault and blamed Chris for being antagonistic. Pride Month, ain’t it – commit all the crimes you like.

So I’d say the LGBT movement had gained more and more support over time, but this year, it took a massive step back. I wouldn’t be surprised if further controversies were more widely publicised in future.

I think issues should be raised and discussed with logic, and not dealt with whilst being blinded by wokeness and hypocrisy. People need to take a step back, clear their minds and really decide what they actually believe in.

Assault is wrong. Grooming kids is wrong. Sex shouldn’t be discussed at work. I hope we can agree with that.

Errors logging errors

Today, I saw some code that looked like the classic “I can’t recreate this bug, so I’ll write some code to log extra information and hope it somehow helps me diagnose the issue”.

They had a list of data where the ID should be unique, but due to a bug one ID was in there twice. So it would look something like this (ID = 1 appears twice):

private List<Code> _codes = new List<Code>()
{
    new Code
    {
        IsSelectable = true,
        ID = 1
    },
    new Code
    {
        IsSelectable = true,
        ID = 1
    },
    new Code
    {
        IsSelectable = false,
        ID = 2
    },
    new Code
    {
        IsSelectable = true,
        ID = 3
    }
};

Then their method contained this code to validate the data:

private void EnsureValid(int idToFind)
{
    try
    {
        _codes.Single(s => s.ID == idToFind);
    }
    catch (InvalidOperationException)
    {
        Dictionary<string, object> customData = new Dictionary<string, object>
        {
            { "ID", _codes.ToDictionary(x => x.ID) }
        };

        LogToErrorDB(customData);
    }
}

The Single call will throw an exception because there is more than one matching element (Single only succeeds when there is exactly one match). In the catch block, they then convert the code list to a dictionary to log it in the error database. However, a dictionary requires its keys to be unique. Since there are duplicate IDs, this throws another exception, this time with:

“System.ArgumentException: ‘An item with the same key has already been added.'”

Error thrown in the catch block

So you get an error whilst trying to log your error. The extra information they intended to log is never logged.

I’ve seen this mistake a few times before, and you can usually catch it by artificially creating the bad data condition and testing – then they would realise their error logging doesn’t work.
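For what it’s worth, here is a minimal sketch (not the fix that actually went in) of how the logging could be written so that duplicate IDs can’t blow up the logging itself:

private void EnsureValid(int idToFind)
{
    var matches = _codes.Where(c => c.ID == idToFind).ToList();
    if (matches.Count == 1)
    {
        return; // data is valid
    }

    // Group by ID rather than assuming uniqueness, so building the log data cannot throw.
    var duplicateIds = _codes
        .GroupBy(c => c.ID)
        .Where(g => g.Count() > 1)
        .ToDictionary(g => g.Key, g => g.Count());

    var customData = new Dictionary<string, object>
    {
        { "IDToFind", idToFind },
        { "MatchCount", matches.Count },
        { "DuplicateIDs", duplicateIds }
    };

    LogToErrorDB(customData);
}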

Text Box Increase / Test Case Changed

Occasionally we may be asked to help our Software Testers run through their manual regression test cases.

When I was a tester, I found that even though writing test cases should be easy, they are tedious to write if you want to accurately describe every single step. Therefore, you may choose to be more concise with your wording, or make assumptions that the person running through the test will understand what to click.

Sometimes you think you have written a brilliant test, but when you come to run it again at a later point, you realise that it was ambiguous and then might end up looking at the code to work out how it was meant to work at the time.

If the test case is misleading, sometimes the tester will modify it to be “less ambiguous”/“correct”, but there are times where they have changed it incorrectly, causing further confusion.

I ran a test called “Enter 1020 characters into the Description Textbox ensuring to include numbers and special characters (namely ‘&’)”.

However, the expected result was “Textbox will only accept the first 260 characters”.

Why would we be entering 1020 characters if the textbox is gonna stop at 260? Clearly something is up with this test.

So I looked at the history to see if someone had changed it. It used to say “enter 260, but only 255 is accepted”, but then Sarah changed it to “enter 1020 and 260 is accepted”.

So I looked at the linked change to see what it should have been changed to (or whether it should have been changed at all). The work item was called “Extend description from 255 to 1023 characters”.

That seemed really random. Why 1023 characters? And why did the tester change the test case to 1020 (and 260), when that still isn’t enough?

Even more confusing, the developer didn’t even change it to 1023 – it was set to 1000 in the database.

\(〇_o)/

So we wanted 1023, the developer provided 1000, and the tester either tried 1020 or 260 and passed it.