Unity Runtime Fee

Intro

Unity have announced a new fee which they call the “Unity Runtime Fee” which is going to take effect in January. It affects all Unity developers, even people that have already released their game many years ago; which has caused mass outrage among the game development community.

I think the existing model states that once you reach a threshold of revenue, you have to pay a licence fee which works out around £1500 for the year. With the new model, once you reach a similar threshold, Unity is now going to charge a fee of 20 cents every time somebody instals your game on a new device for the first time.

The threshold is $200,000, which on the face of things, 20 cents per install doesn’t sound unreasonable when they have given you a great tool to help you create your game. They need to earn money as a business and deserve some kind of cut for their service/product. According to this tweet, it looks like they are burning through money so some drastic action is probably required

https://x.com/georgebsocial/status/1702696194558816751?s=20

There’s a fair few aspects of why this is new model is complicated, but I still feel some of the anger is misplaced.

I think the whole scenario is similar to what I have written about recently where the CEO demanded we release our software weekly instead of monthly and we told him several reasons why it is technically, and legally impossible. Then later he then demands all our changes have a well-documented rollback plan, and again, we told him loads of reasons why it wasn’t possible. He still insisted and looked like a fool when it backfired and caused a few problems he thought he was solving.

The main Problem

The core problem stems from the idea that it is based on installations and not based on Unit Sales or Revenue. For comparison, the main competitor, Epic Games’ Unreal Engine charges you five percent of your total revenues after you’ve earned at least a million dollars on your game. Now, that can work out to be a lot of money and especially in the long run if your game is successful, but the difference is that they’re taking a cut of your money that you’ve already earned. When Unity charges you for an installation you’re being charged whether or not you’ve earned any money, or at different period of time to where you earned the money. That could turn into a cash flow problem.

Collage of abuse:

https://x.com/LiamSorta/status/1702325745610338646?s=20

Tweets by FuckedByUnity

Theoretical scenarios

Once you’re over the threshold, if somebody bought your game a long time ago and they’ve now installed your game on their brand new computer, it’s going to cost you 20 cents. I suppose if it is an old game, you probably won’t be selling $200k over the last 12 months, so it’s probably not actually going to apply.

If you decide to port your game to a new platform which is often fairly easy in the Unity engine, then all those new installations are also going to be hit with fees. I suppose if you are re-selling the game then it’s not a major problem, but sometimes developers make a free-to-play mobile version. Then you make money later with microtransactions. Often these games have 90% of players not paying a penny, but then you make your money on the 10% who often spend big. In this case, you could end up losing money on the average player of your game.

People also raised the point of bundles like the Humble Bundle where people buy a bunch of games for a small price but some of the money goes to charity. You end up selling high volumes but gain very low revenue. If you hit the threshold, and you are more likely to with a sale like this, then you could be hit with a lot of fees. I think the interesting thing with this point which people don’t seem to be mentioning; is that people often buy these games then never actually play them. So you actually have a sale, but no install, so don’t pay the fee.

Fairly similar to a bundle is a service like Xbox Game Pass, where people could play your game with an overall payment to the provider, in this case Microsoft. I think Microsoft often pays a flat fee to the publishers to gain their games but I suppose contracts can vary. But the theory is, you could get a flat fee, then either get low instals so you’ve gained, or get a surprisingly large amount of instals if it is popular and it eats into your profits.

Piracy

People who pirate games don’t pay but do install your game. This means that every time your game is pirated you’re going to be slapped with a 20 cent fee. There can be other malicious ways you could be charged, if someone abuses Virtual Machines. There’s programs that will spin up large numbers of them, so you could “Install Bomb” quickly with virtual machines, hitting the developer with a 20 cent fee. It’s like when people “Review Bomb” where you leave loads of negative reviews on a game you don’t like in a coordinated way, but in this case you need fewer people, and they directly sap the revenues of the developer instead of just hurting their online presence.

Target Price

Unity has always positioned itself as being pro Indie. They want to help new aspiring Indies learn to program, break into the gaming market, and get their career started. New developers are also much more likely to sell their games for cheap. There’s a lot of games like this on Steam which are sold for £10 or loads for £5 or less, and that’s before you apply discounts. Steam is renowned for its high discounts in sales, and so these games are being sold for just a couple of pounds. They’re going to be disproportionately hit by having to pay Unity 20 cents every time the game is installed.

In the extreme case, imagine you’ve made a Steam game or a mobile game that sells for one dollar and then you pay a sales tax of 10-20%, then Steam takes 30%, then you know you’re left with around 50 cents. If you use a Publisher, then they will take their cut too. Then Unity takes 20 cents of it for an install and then maybe another 20 cents for another install, then you could be left with basically nothing. You could then lose money if it isn’t sold for full price.

Meanwhile, if you sell a premium game for £40+ then 20 cents is nothing. So it actually hits the indies harder. Unity have ways of getting the price per install down, but they look more aimed at larger companies who will want to pay the upfront fees to use the premium Unity features.

Patch Quest

Lychee Game Labs’ Patch Quest released on 2 March 2023 and so far has reached 182,594 total key activations on Steam (people who bought the game on Steam along with everyone who got the game elsewhere like in a bundle or a giveaway or for review purposes). So if the game keeps selling, or people install on more devices, then he will be taken over the threshold then would start being charged. He did remark that “for the sake of argument, every single person who already owns the game decided to install it on a second PC, I’d be hit with a charge of $36,400. Now it’s obviously not likely that this would happen” but it does make you think how Unity are gonna deal with these outliers.

Unity Response

Within the day of the announcement, there’s a lot of angry people, and Unity has tried to clarify the points raised. However, it’s not clear if it’s actually possible to do what they claim. They reckon they have some sophisticated fraud detection technology which can prevent the “install bombing”. Then they say that they will have a process for them to submit their concerns to our fraud compliance team. So from what I understand here it sounds like the onus will be on the developer to try and somehow keep track of how many of their instals are fraudulent and then if you have concerns, you contact the fraud compliance team, and then they will hopefully give you your money back. I think the majority of people don’t have a lot of faith in such a system when Unity have to put in some work to decide if they want less money from you.

https://x.com/thomasbrushdev/status/1702797688838775134?s=20

Unity have clarified that if you’re part of a bundle like Xbox game pass or you’re in a charity bundle then you’re not going to be charged for the install, although it’s not exactly clear how they’re going to know which instals come from charity bundles or game passes. They seemed to imply that for Game Pass, they would send the bill to Microsoft but I can’t imagine Microsoft will be too happy to have sprung upon them. It would probably have to be negotiated in future Game Pass deals and it might just be the case that Microsoft just doesn’t add any Unity-based games to their service.

Unity tried to justify this whole new fee structure by pointing to the thresholds and saying “if you don’t already earn loads of money on your game then you’re not gonna pay extra”. This is where I think a lot of developers are wrongfully attacking Unity, when they would never pay them anyway. I suppose in the Patch Quest example, I’m not aware of it being a major hit, and he has pretty much reached the payment threshold. But given that it’s been many months after release, you would imagine sales will now be low and he will only be liable for minor fees which he should be happy to pay.

Conclusion

There probably is a clause somewhere deep in Unity’s terms and conditions that says something like “we retain the right to change our terms and conditions”. Companies love to have that kind of future-proofing in their legal small print, but how many actually go through with major changes? It can be logistically difficult to implement drastic changes, and evidently a PR nightmare. However, despite that, many companies are against Unity for switching the Terms and Conditions with only a few months notice. When games can take years to make, you need that predictability to adequately budget, and if Unity can charge you more on a whim, then it’s unpredictable. People also wonder if they really can change the terms built on an older Unity software version as you essentially have an agreement at the time of release; but that needs to be left to the lawyers.

Switch Engines?

I think a key statement that many are using to justify their decision to abandon Unity at this time is “Is this the last time they’re gonna change their terms?

Jumping ship to another one might be possible when you’re just starting up on a new project but the deeper into development you get, the harder this becomes. Your game gradually ends up dependent on the engine it’s built in. Switching to Unreal Engine will require programming in C++ plus instead of C# which is a massive learning curve. Godot seems to be gaining popularity but people seem to say it specialises in 2D games at the moment. I think C# doesn’t have full support so their own GDScript is more popular.

https://x.com/TruantPixel/status/1702132911976194091

https://x.com/DarkestDungeon/status/1702378602895941837

References:

Unity Pricing Thoughts...

For context, we are a small studio (7 people) with a Steam game with 3M~ players.

I'm seeing many non-developers tell developers that this pricing change is not a big deal, here is why the entire community is lighting a fire:
• Massively disproportionately punishes indies
• Only three months notice
• Double dipping (Licence fee/ads cut)
• Dangerous precedence for charging "runtime"; you no longer fully own that exported build. If Unity continues to struggle, pricing could become more aggressive
Here are a few examples:
• Unity's own example on their site has a hypothetical scenario:
-- $2M USD Gross in 12 Months
-- 300k Users/month (200k Standard/100k "Emerging Market"), $23.5K USD/month
-- This means $282K/Year in fees, 14% of gross revenue, 3x Epic's 5%.
• F2P Games that are NOT excessively monetised are penalised:
10M Players:$1M USD
1M Players:$10M USD
The first case, with a vastly less predatory set of MTX is now punished significantly worse than one purposefully building money-extraction machines.
Our team has been hard at work for 2 years on a massive update to our game, with a F2P mobile ver coming next year. We built this from the ground up to be ethically monetised/for whaling to be impossible, so we are particularly unhappy with the news.
This affects developers everywhere, of all sizes. I am grossly disappointed by any industry figures brushing this off as "developers complaining." that do not understand the severe damage this can cause smaller studios.
Unity's trust within the games industry has been steadily eroding for years now, this latest change is a testimony to how horrendously mismanaged the board is. Personally dumped all of my Unity stock after this announcement was made.
I'd bet heavily on the people making these decisions have never even opened the editor, let alone released a game.

From <https://threadreaderapp.com/thread/1702189840383832408.html>

Problems With Hosted Services

Recently we have had several major incidents due to: software bugs, incorrect configuration being applied, not renewing licence keys, and migrating servers to the cloud and failing to check all services were correctly configured and running.

Our Hosted Services team gave a presentation of work in their department, and gave more insight to even more failings that have happened recently. As far as I am aware, Hosted deal with servers, data centres and networks.

Hosted explained that due to the decision to move all servers to the cloud, when their usual time came to replace old servers – they didn’t bother. But the migration has been a slow process and delayed which meant our software was running on inferior hardware for longer than anticipated.

“We don’t need to invest in the in the architecture that we’ve got, which was not the right decision in hindsight

We had a team of people who, in some cases, were the wrong people. They didn’t have the appetite to go and actively drive out issues and reduce the points of failure in our networks.”
Hosted Manager

He then goes on to say the change in strategy caused many of their long-term staff to leave. These people that really knew how the business worked.

“So we lost around about 90% of the team over a relatively short space of time and that put us into quite a challenging position to say the least. And needless to say, we were probably on the back foot in the first quarter of this year with having to recruit pretty much an entire new team.”
Hosted Manager

Then, because they were short staffed, their backlog of work was increasing, putting more stress on the people that remained:

“We had to stop doing some tasks, and some of our incident queues and ticketing queues were going north in terms of volumes, which was really not a good place to be.”
Hosted Manager

I’ve written about this situation in the past. It has happened in the Development department when a new CTO comes in, and says that manual software testing is archaic; so people have to learn automation or lose their jobs. Then a few months later, they realise their plan isn’t so feasible, yet have lost some good software testers to other companies, or allowed others to switch roles and aren’t interested in going back. Then the releases slow down because we can’t get the fixes tested fast enough due to the last of software testers.

They go on to say the Firewalls suffered 50 major incidents in Quarter 2, and now they have “procured new firewalls” to solve it. They have reduced bandwidth into the main data centre by routing certain traffic through an alternate link. The “core switches” at our offices and data centres are “End of Life” and will be upgraded to modern hardware (Cisco Nexus 9K).

So it sounds like they have a plan, or at least are doing the best with what they have. It sounds like all departments are currently shooting themselves in the foot at the moment.

Parkinson’s Law: Another Perspective

I recently wrote a blog on Parkinson’s Law, and I recently came across this tweet about Elon firing loads of staff when he took over Twitter:

Elon Musk fired 6,500 employees at Twitter.

A little birdie told me it's down to:

– 2 designers
– 6 iOS developers
– 20 web developers
– Around 1,400 sales and operations people

How is it possible that we are still using this website?

Two words:

Parkinson's Law.

Have you…
— Andrew Wilkinson (@awilkinson) June 25, 2023

Here is the tweet thread:

Elon Musk fired 6,500 employees at Twitter. A little birdie told me it’s down to:

– 2 designers

– 6 iOS developers

– 20 web developers

– Around 1,400 sales and operations people

How is it possible that we are still using this website? Two words: Parkinson’s Law.

Have you ever wondered why seemingly simple tech companies have tens of thousands of employees? Sometimes, it’s because they have huge sales forces or tech support/operations people. But often it’s also due to Parkinson’s Law.

Parkinson’s law is like lighter fluid for bureaucracy. It’s a business tapeworm that slowly eats away at companies, making them less and less efficient and innovative over time. Parkinson’s Law is the idea that the work will generally expand to the amount of time, budget, and number of people allocated to it, and no matter how many people you allocate to it, those people will feel busy. They’ll feel busy because, due to the excess time/slack in the system, they’ll start focusing on less and less important tasks.

Here’s how it manifests on an individual level: Let’s say you have a report due in a week. The report might only take you around five hours to finish if you really focus and work efficiently. However, because you know you have a week to complete it, you might find yourself spending a lot more time on it than you need to. You’ll be more prone to distractions, take longer breaks, or perhaps decide to add more details, tables, graphs, and so forth. Essentially, the task becomes more complex and time-consuming purely because you have more time in which to do it.

And here’s how it manifests across organizations: Imagine a big tech company. A social media company with various departments. Each department has tasks that it must complete to contribute to the overall productivity of the company. Now, suppose each department is given a budget and a set amount of time to complete its tasks for the year. According to Parkinson’s Law, each department will use its entire budget and the entire allotted time, even if the tasks could have been completed more efficiently. This is because as resources and time increase, departments tend to become more complex and less efficient. For example, a department might add more steps to its procedures, requiring more approvals and creating more paperwork, which slows down the process. Or it might use the full budget on additional personnel or equipment that doesn’t necessarily improve productivity. The department might also use the full budget to justify the same or larger budget for the next year, since budgets in many organizations are often determined based on the previous year’s spending. This is a phenomenon known as “budget padding” or “spend it or lose it” mentality.

Inefficiencies can also develop in staff allocation. If a department expands, it might add managerial positions that aren’t strictly necessary. More employees are hired to manage, creating layers of bureaucracy that may not contribute to productivity and can even slow decision-making. I have seen this occur over and over again in my career. The larger the team, the larger the budget, the longer the timeline, the less gets accomplished. I’m very curious to see how many more tech companies come to this realization.

Indian Expo

Recently, I blogged about how managers love any excuse to go to India to visit our office over there. Then they write a blog on their experience, stating how important it is for face-to-face collaboration in an office environment… before returning to the UK and telling us how working remotely from home is the modern way of working, and has no impact on efficiency.

They actually spend most of their blog writing about the local cuisine and the landmarks they saw; so it’s definitely a holiday and not a work trip at all.

I also wrote about The Expo, which is where the entire UK side of the company travelled to one location to watch many in-person presentations (which we could have just watched remotely like we normally do). Then when it is “business as usual“, managers are telling us to find ways to save money, and how we want to become a carbon-neutral business.

So after dumping loads of money into travel costs, hotel expenses, venue hire and catering for the Expo in the UK, they decide it would only be fair to host a similar thing in India… which means getting all the directors and senior managers to fly over there to do the presentations.

Obviously they used the opportunity to post a blog about the importance of face-to-face collaboration, Indian landmarks and cuisine.

Key phrases from their blog are as follows:

The India Office

“I am amazed at how much we were able to accomplish”
“India greeted us with its vibrant energy and diverse cultural heritage”
“The workspace was a fantastic environment, promoting team collaboration and productivity”
“Witnessing the teams working closely together was inspiring, and the entire place was abuzz with creativity and a real growth mindset”
“The office boasted excellent facilities, including communal work areas, private group session rooms, a gym, nap rooms, massage chairs, a food court, and garden”.

Expo Day:

“The Expo day itself was an exhilarating experience, with a buzzing atmosphere and a large number of attendees”

“Representing the team on the stands was a humbling experience, as engagement levels were high and the audience had a deep understanding of our work, asking probing questions around aspects of safety, governance and our products.”

Cultural Experiences:

Visiting the UNESCO heritage site at Mahabalipuram allowed us to witness the interplay between Hindu, Chinese, and Roman architectural styles in this historic trade centre.
Learning about the story of Draupadi and understanding the long history of international collaboration.
Our visit to DakshinaChitra cultural heritage site, highlighted the vastness of South India and its rich diversity.
Meeting the skilled craftsmen and hearing them describe their trades first-hand provided a deeper appreciation for the diversity of people and their skills across the country.
We learned about different rice and cooking methods for Biryani, and the amazing flavoursome vegetarian dish suggestions.

IT Tales

Here is a collection of a few fails by our IT Department.

PC shutdown & Usage Monitoring

Even though we work from home, we still have some PC’s in our office that we remote onto. There’s certain systems that only seem to work when on the physical network so people often call this a “jump box”. Our IT was planning on temporarily moving our PCs whilst some electric work was being done in the office. I was invited into a Teams chat which was supposed to be for everyone affected. After skimming the list, I spotted 3 people that were missing, and other colleagues spotted others. 9 people were missing in total! How do they not know who owns the PCs? They have been citing “increased security” in recent times. Surely a security risk if they don’t know who uses PCs on the network.

More recently, I was contacted again via email asking “if you use this PC”. Again, why do they need to ask if we use them? Isn’t that a security concern if not? Surely they know, especially when they have installed extra network security tools recently. I thought they had said software monitors network traffic and alerts for anything suspicious.

Upgrading Software

I was contacted by IT saying my SQL Server version was no longer supported by Microsoft, so I need to urgently upgrade it by the end of the week due to being considered insecure. They said if I want an installer, please reply. I thought it would be easy enough locating the installer, but it seems Microsoft’s SQL Server pages are very confusing. So I replied asking for the installer. They ignore me. I reply again, they ignore me. Months have gone by. So not that urgent then.

IT then announced that they are taking increased security measures and are removing all admin rights from our PC’s. Now we can only install software with their permission. They also said it makes sure we can’t install unlicensed software, since it is easy for someone to install software that is free for personal use, but is paid software for commercial use, and then the business can be liable.

A week later, they then email us saying there is a known security vulnerability with our Visual Studio version so we need to update it. We can’t though, we need admin rights to keep our software updated and secure! So now we have to log tickets, then they remote on and type in the admin password to proceed. I bet they love that.

In a similar fashion, they are more fussy with USB devices. They sent one of my colleagues a new laptop but it rejects his smart-card reader which he needs for testing. Can’t be plugging in USB devices these days.

Saving Money

They also said they wanted to be more stringent when it comes to licence keys, as we seem notorious for purchasing more licence keys than we need, then we might stop using software then still pay. I was contacted in early July 2022, saying that I have had a Jira licence for the last year but have not being using it:

We currently purchase a licence for you to access Jira. We understand a lot of the users will have now migrated to Azure DevOps and as such, your access may no longer be required.

May I kindly ask you to respond to this email by 12pm Friday 8th July confirming whether or not you continue to require access?
IT Email

So I reply saying I wasn’t using it and I don’t think I have used it for 2 years. I then got contacted again in February 2023 saying the same thing. I confirm that I don’t need it. I then got contacted earlier this month asking me again. So I’ve had a licence for 3 years now for a product I don’t use at all.

The Outage

Managing the Public Relations

We recently featured in the national newspapers due to an outage with our software, and it was surprising that our system was down for around 12 hours. Usually we respond to outages really quickly, and get the system back online after a few hours max.

I sent the article to a few of my colleagues and they were unaware of the problem, so internally we seemed to be keeping quiet about it.

We had put out a message to users, but from what I saw in the aftermath, we initially claimed it was just a specific region that was affected, before others via social media had stated it was affecting people nationwide, much to the derision from our main rivals, who posted multiple social media posts to insult us.

“They deliver complaints, weekend working, major incidents, stress, insomnia, and hypertension, to name a few. Do their rivals deliver?”
Customer quote

All Hands

5 days later, the Directors arranged an “All Hands” meeting with only a few hours notice. So that caused panic amongst many colleagues because of the implied urgency. It was led by the CTO who was announcing changes to our deployment process, but I’m not sure who he consulted with. I would have thought he would have learned from the last time he announced changes to our release process – when we all ripped his idea to shreds on the same call, stating several technical and some legal reasons why we can’t possibly work that way. A much better approach would be to state the problem and let the experts discuss how we can solve the problem, but yet again, he seems to have had a meeting with other Directors and took the attitude “this is how it’s going to be from now on”. But then we all pushed back and told him how it wasn’t possible.

I love it when situations like these occur. The drama makes you want to grab the popcorn, sit back, and enjoy the show. Hearing all the different points of view and trying to understand the complexities of why things are the way they are – is super interesting. I think when you are told to follow a Process and you don’t know the rationale behind it, it makes it feel like a chore. But focussing on a problem and trying to solve it by forming a new/modified process is something you can get behind.

So like I said, not learning his lesson from before, he states what the new process is, and only gives a wishy-washy explanation of the problem he was trying to solve. After we all bombarded him with questions, the truth of the matter came out, and some of his ideas wouldn’t solve this particular problem that caused the outage. So it seemed he was adding unnecessary processes, slowing us down and causing more stress. We are already releasing way too slow, so then us Software Developers are just chilling because we can’t get things out that we do fix/implement.

He described this situation as a

“major incident that impacted the whole estate, attributed directly to a failed Change. We recognise that the change was not intended to have the adverse impact that it did, but sadly the consequences have been a major blow to Users and us. Therefore, we are seeking to create immediate stability across our estate, and are implementing several amendments to the way Technology Changes are approved and implemented, as follows;

1. A comprehensive change freeze for the month of June, with only changes meeting enhanced criteria being passed for implementation.

2. Pre-approved changes are suspended

3. Any changes submitted for approval will require TWO members of SLT.

4. Implementation activity must be witnessed by two or more staff members. Screen sharing technology should be used to witness the change. No additional activities are carried out that are not explicitly in the documentation.

5. All changes must have a comprehensive rollback plan, with proof of testing. The rollback plan must be executable within 50% of the approved change window.

Backout plans will be analysed by the Head Of Development to add a more technically-informed scrutiny, and assess the risk. Further budget will be allocated to DataDog, for more visibility in problems with our servers. A dedicated team will improve how we alert internal teams and customers in the event of an incident.

What does a “Failed Change” even mean? We deployed a database change that failed to apply due to a key constraint violation? That a new feature or bug completely broke a feature? It’s very vague: Then what does “We recognise that the change was not intended to have the adverse impact that it did” mean? Is that just stating the same thing? You aren’t gonna intend to have an adverse impact. It sounds like the sort of nonsense you put out to non-technical users just to blag an apology. Why is he saying this to the Development department? How are we supposed to understand the process and help come up with something better if he is being deceptive?

So as I previously stated: “After we all bombarded him with questions, the truth of the matter came out”, so I’ll try and explain by piecing together the information I did gather.

Years ago we attempted to replace a module and this was feature flagged. But it turns out it wasn’t significantly better than the thing it was replacing and has its own set of issues, so some users are using this feature (V2) and others used the old one (V1). There was an experimental feature which hoped to improve performance for both V2 and V1, but it was put as a sub-feature for V2 in the user interface of the configuration tool. So you have V2 on/off, then if it is On, it looks like there’s then an option for Experimental Fix on/off. When the Deployment team came to turn on the experimental feature, they saw that the main feature (V2) was set to Off for some of the users, so they set it to On. Since enabling V2 would then download loads of data to be in the V2 format, this caused a massive impact on the network and so actually affected everyone that wasn’t impacted directly by the configuration change.

So would the CTO’s new process stop this happening?

New Rules – My View

“There will be a comprehensive change freeze for the month of June, with only changes meeting enhanced criteria being passed for implementation.”	The size of the release wasn’t the problem, so cutting it down won’t solve anything. It might annoy the users even more if we then delay features that we announced.
“Pre-approved changes are suspended”	Not sure of the full scope what is covered by the “pre-approved changes”. But the idea of a “pre-approved” change is that it is something that is often run on the live servers to fix common issues and is low risk, hence it is pre-approved (eg the ability to restart a crashed server/service.). This is just going to annoy staff members in Deployment. The CTO also remarks: “Preapproved changes are wonderful. They have been reviewed and tested to death. My goal is to increase the number of preapproved changes in the future. It’s just with the existing ones, we don’t know if they have been reviewed or not”. You don’t know if they have been “reviewed” but they have been run 100’s of times, and never caused an issue. So you are temporarily banning them on the grounds that they could cause an issue?
“Any changes submitted for approval will require TWO members of SLT. ”	How many times has there been some kind of approval process and the people with authorisation are too busy or on annual leave? Why are we going from 0 approvers to 2? Would the managers understand a change to enable a feature for users belonging to company A, B and C? Would they go “hang on, C don’t have the main feature! I’m rejecting this”? It’s going to be a box-ticking exercise. We already have a problem when changes are Code Reviewed by Developers – there’s not enough “expert” people that can review it in the required level of detail. So how would a manager understand the change and technical impact? It will be more like “does this make us money? Yes we like money”; approved.
“Implementation activity must be witnessed by two or more staff members. Screen sharing technology should be used to witness the change. No additional activities are carried out that are not explicitly in the documentation.”	This might actually help, although might be patronising for Deployment. The CTO made a comment on the call about having “Competent” people involved in the deployment process. So if a Developer has to watch a member of Deployment click a few buttons; it feels like babysitting and not respecting them as employees.
“All changes must have a comprehensive rollback plan, with proof of testing. The rollback plan must be executable within 50% of the approved change window.”	The rollback idea is one of these ideas that sounds logical and great in theory but this is the biggest concern for the technical people in Development as we will discuss below.

My opinion

The Developer Feedback

On our current change approval form, there is already a section for rollback plan, but this isn’t properly filled in and is often just “revert change”. So one developer said all the change needs to be is:

“people should be ensuring they have a documented their tested process for rollback”. Then also added “In the case of feature-toggling. What happens when you do that? Do the users have a cache? How do we clear the cached data for all users? Can you force everybody to turn that feature off by forcing them to log out? If the change causes a performance impact, we need to be sure that turning something off actually turns it off and stops any performance impact right there and then”.

One person pointed out that the majority of developers are in India now, and due to the time zone differences, they aren’t around for the current Change Approval Board meetings, or available during when we deploy.

Like many companies, we love deploying Friday nights because that is considered downtime for most of our users. The problem is, staff that would be useful to have around – don’t work weekends unless they are paid to be on-call. Getting people to deploy late probably increases the chance of mistakes. Maybe we should deploy earlier in the week if possible. At least don’t deploy on a bank holiday weekend because that is even worse for staff to fix any issues.
Deployment staff member

Our test databases have probably less than 5% of the data that our live databases have, so testing SQL is always a problem for us. What runs in 1 second could easily run for a minute with the scale of data. So if we come up with a rollback plan, our timings won’t be accurate at all.
Databas Expert

With our C# code, you can easily rollback the version by redeploying it. The problem is the corresponding database changes. Our database patches are designed for sequential application, e.g. one script can create a table, a later version may add a new column. You can’t just re-deploy the old version to get back to a state where the column doesn’t exist. You have to write code to do the reverse – delete the column. What if data has been added after it was deployed and before you realised the feature doesn’t work (and you need to rollback)? Do you just lose the data and tell the users that it is “tough luck”, or do you archive the data, apply the revert patch? Then later fix the issue, reapply the original patch, then run another script to apply the archived data? It’s much more feasible just to disable the feature, apply a fix then turn it back on. [We recently employed a new guy with the job title “Technical Director” and he was adamant a rollback had to go back to the original state at all cost “Fix forward is not a valid risk mitigation strategy.”. He doesn’t understand how we work and is making these technical decisions.]

Another developer reiterates that point to tell the new Technical Director that he is talking nonsense:

When a change is implemented Out of Hours for the user, any problem may only emerge when users log on in the morning. If we roll back at that point, any overnight transactions potentially will be lost. This includes transactions on inbound interfaces from 3rd party systems. Rolling back at that point may be even more disruptive than hot-fixing the problem. How are we to handle this, in terms of process management and policies?

We have many products, especially with all the mergers/acquisitions we have had over the years. Each with their own processes.

[The CTO and Technical Director were talking as if we were all using the same process and tools which wasn’t the case.]

One concern was that when it comes to our Mobile apps, the likes of Apple and Google have strict update procedures, where you often have to submit the update which is approved in their own time. I don’t know how rollbacks work in this process. I think the implication was that feature flags are often used so they stay on the same version, but remote configuration switches a new version on.
Mobile developer

The rollback idea seems to be that if the change took 20 minutes, then it should be able to rollback within 10 minutes. Why? I am not sure. Maybe it is just a rough guide to be “fast”. But doesn’t this assume that any failure has been identified within the change window of 20 minutes? It might take a while for the users to run into the scenario that causes such failure. If we patch out during the early hours (1AM-3AM) then the user works 9-5PM, then there is a time delay. Our main patching process takes a long time (possibly 1 hour, large file downloads over a slow network), so I’d expect a rollback process to also take a long time. If users are actively using the system and it’s only a specific feature that is a problem, then you don’t want to take the entire system down to patch. You want a Zero-downtime fix.

We need to be specific of what indicates a “failure”. They were talking about using Dashboards to look for evidence of failures but there can be cases where a feature doesn’t work but it doesn’t log an error. Successful server calls might work. Also, if there is a problem, which rollback plan do you execute? If you have several features/bug fixes in the release, do you rollback it all, or do you spend time investigating which feature it is, then roll that back only.

We have a tendency to release with features switched off and then enable them later. Many times, the implementation of new features causes you to change existing code, which then may contain bugs. So even when the feature is off, it can still cause bugs. So in this case I think we need to go through the Change Approval Board for the initial release, then again to actually switch it on.

As part of the rollback process, have teams considered the overall impact that this will have on estimates of deliverables, overall development time and testing time, and how we are going to manage customers expectations. Depending on the project/feature this could add an additional 50% development time to create the rollback and to complete the testing of the roll back.
Tester

After we made all these points the Technical Director just dismissed all our concerns.

“I think I’ve been clear on what a rollback plan is and what the expectation is of it. The onus is on the change proposer to demonstrate they have a Rollback plan”
Technical Director

Well, you’ve been clear of your expectations. We have all been clear that it simply isn’t possible to implement.

Architect

I think we need to change our architecture to support such plans. They want instant process change. But this needs to be planned in.

I asked a Software Architect what he thought:

“I don’t like how we’ve had a bunch of clowns that know how nothing works come up with a plan which makes little sense and is impractical, and opposed to much of industry direction. Patch forwards is the way with databases, supporting rollback of code changes with tech like containers is the way to go with that”
Architect

The Next Release

Since this process affects the next release, the team that is involved in the biggest feature raised their concern. Their change is super risky and therefore they will need to start spending weeks coming up with a back-out plan and therefore will probably miss the release. The Technical Director said that it is:

“going ahead – but with the additional scrutiny and bolstered processes we’ve detailed. This isn’t about stopping – it’s about wrapping our change activity in more certainty. I’d personally like to understand “high risk” here though.”
Technical Director

I think this illustrates my initial points. He will be one of the people involved in signing off on changes, and here is a proposed change without a rollback plan, and he is already stating he is willing to sign off on it. Then the Lead Developer has to convince him it is actually one of the riskiest changes we have made this year.

“I’m not sure how it isn’t a high risk change. It’s massive, wide-ranging, and there are so many things that can go wrong. We are working hard to mitigate the risks but the risks are still there.”
Lead Developer

Since we had a project going into the next release, we arranged a meeting with the Head of Development (in regards to the statement “Backout plans will be analysed by the Head Of Development to add a more technically-informed scrutiny, and assess the risk”).

When we explained the difficulty of coming up with a Rollback plan, he then says “why don’t you just add a feature flag“. So we told him we had but it was unacceptable. The Tech Director and CTO had made that clear. He said he was fine with it, because it doesn’t make sense otherwise. So the Head of Development that is supposed to be scrutinising these plans even thinks this rollback plan doesn’t make sense. Brilliant.

Feedback

On our next meeting with the CTO, he asked how we had been adapting to the new process.

One guy said that there was a server that was down for 3 hours and all it needed was the service to be turned back on. Due to the fact that they took away the Pre-Approved changes, they weren’t allowed to turn it on until they got 2 Senior Managers to approve. They only could contact 1, and after asking him if they could go ahead and turn it on anyway, he refused because they had strict instructions that they couldn’t deviate from this new process. The CTO was outraged. He came out with this amazing hypocritical speech.

It is essential that people “engage brain” before following process. The idea that we can deliberately leave customers without service because some process says so - It's crazy. And if anyone disagrees, come and argue with me. This is incredibly frustrating for us as an organisation where people follow processes to the detriment of providing good service to our customers. Now I understand that there's no malice or deliberate reason to break something by following that process, but as you explain it, I can't imagine anyone on this call saying “Oh yeah, that was the right thing to do”. If we have to write a process that means people don't have to think, then we've lost the battle already, and people need to be able to think. The challenge I have in front of me is our customers, and their customers are not able to be helped. I can press this button and it will fix things, or I can wait until some process completes and then I'll press the button. At the very least an emergency phone call should be made to someone senior enough to make the decision and that should have proceeded straight away. We have heard of other examples of this. I want you to see how serious I am about how upsetting that is. You know, we employ intelligent people with a brain. We can't get away with automatons that just follow process. If we do that, we'll just automate everything. We need people that are intelligent enough to make good decisions, and when they can see that what is happening is not right, they escalate it until they get an answer.

I thought he said it would be a sackable offence if we even slightly deviated? Even his Senior Management was strictly following the CTO’s process and refused to authorise it when escalated. We told him in advance it was a bad idea to get rid of Pre-Approved changes. We told him it was a bad idea to require 2 approvals from the Senior Managers. Didn’t listen. Refused to “engage brain”. Caused another outage. The fact that the CTO admits it’s actually happened a few times as well is hilarious.

Another team said they weren’t invited to the first meeting when the change got announced, then trying to follow the process delayed them 6 hours. The CTO admitted that one was an oversight on his part.

Another team said they were trying to do a demo to a customer but they needed a configuration change applied. Even though it was a “Test System”, it had some dependency on the live infrastructure and so had to go through this new change approval process. They delayed their demo and looked a bit incompetent in front of the customer.

A Technical Manager reiterated the point about not being able to rollback the database changes and his team had wasted hours in meetings discussing how it could even be remotely possible. The CTO then said we were being “hung up on terminology” and we know that the ultimate aim is to get into the previous state, so it doesn’t matter if we call it “rollback” or “fix forward”, it just needs to work. But we were telling him there is massive difference. A rollback literally puts you back into the previous state. Fix-forward means you add extra fixes on top of the thing that was broken so they are left with the new feature. Rollback has the concern about losing data that was added since you upgraded the version, then rolled back. It’s a completely different approach. Then the CTO and the Technical Director said that we can “do what we want as long as it gets us in a working state”.

So we can fix forward then? And we can deviate from the rollback plan?

Another person asked if we “will we go back to normal mode after July”. I think they miss the point. Surely the process is there to try and prevent major incidents like this from happening again. You don’t just do it for a few months then stop.

Refinement to the process

After a few weeks, they realised that having a limited number of staff that can approve changes was a bad idea, especially when most of them had a trip to India booked for the week. So they increased the pool of people that can approve.

Another week went by and they realised they had way too many approvals to go through, so knocked the number of approvers down to 1.

Another week went by and they realised most of the approvals were formally the Pre-approved changes, so they got reinstated.

We are almost back to our old process!

Conclusion

I think I have lost all confidence in the CTO, and I think the new Technical Director is just as dumb. How can you make decisions not involving the people actually involved in the process, and then reject their concerns when they tell you how stupid it is!? Then when you then get evidence how stupid it was, you then blame your staff and tell them they need to “engage brain” and to not follow your process if it wasn’t going to solve the problem of getting our systems up and running. We have been in the industry years, and haven’t had many instances of long term major incidents. I think under the new CTO’s leadership, the number of major incidents has increased dramatically. I don’t think it’s necessarily down to decisions he has made, but some of the ideas he has come up with can’t help.

Datadog

Introduction

In recent times, the likes of the CTO have stated that we need to use modernised technology and tools. One aspect that they love is software that produces statistics/metrics that we can then judge improvements over time.

When we buy software licences for such a tool, there is always hype among certain groups of people who will volunteer to take ownership and work on implementing such software (installation, training, creating a “best practices” process), and will take any opportunity to highlight it to the managers.

So the “soup of the day” is a tool called Datadog which seems like a very powerful tool and has all kinds of integrations. I found this “jack-of-all-trades” approach was difficult to really understand what Datadog was for, and why it was different from what we had before. I knew we had Dashboards that showed which servers were running, their processor/memory usage, and which versions of our software was installed, and more. Datadog is used for this purpose too.

https://twitter.com/_workchronicles/status/1509146599355781122?s=20&t=QxTz3UkI_BvJg3WdTXk12w

Jargon Sales Pitch

One reason why it is difficult to understand is that Datadog’s webpage spouts loads of jargon, but also internally, managers love spouting jargon too. Here is what one DevOps member said about Datadog (warning – these next paragraphs contain a lot of jargon):

“As our organisation continues to grow and evolve, it is essential that we have a comprehensive and centralised observability solution in place. Currently, we are using multiple disparate siloed monitoring tools, which not only is inefficient but also hinders our ability to identify and resolve issues promptly. This leads to decreased visibility and a lack of agility in our operations.

Datadog observability provides a unified platform that consolidates all our monitoring, logging and tracing tools into one solution. This not only reduces the complexity of our monitoring landscape but also gives us a single source of truth for all our operational data. By implementing Datadog observability, we will have the ability to quickly and easily identify and resolve issues across our entire infrastructure, reducing downtime and improving overall service levels.

Moreover, Datadog observability offers the ability to deploy configuration changes to the Datadog agent with agility, which is critical in a fast-paced and dynamic environment where changes to our infrastructure occur regularly. With Datadog observability, we will be able to quickly and easily make updates to our monitoring configuration, ensuring that our monitoring remains up-to-date and relevant at all times.

With a pre-approved change, it will be easier for us to leverage the 600+ integrations that we can configure to further enhance our current infrastructure observability, root cause analysis and incident mitigation. This will allow us to gain greater insights into our operations, improving our ability to identify and resolve issues before they become critical.

In conclusion, authorisation and creation of a Datadog pre-approved change will bring numerous benefits to our organisation, including increased visibility, improved agility, and reduced complexity. This solution will help us effectively monitor and manage our infrastructure, ensuring that our operations run smoothly and efficiently.”
DevOps Engineer

That really sounded like he was saying the same thing multiple times and was really emphasising the speed. I think a concise statement is that “Datadog is one software product for monitoring, and can replace many metric tools that we currently have”. So I would imagine it should be cheaper (paying one licence rather than several), and since it is all in one place – probably easier to create new dashboards.

Jargon From The Docs

On their page, Collect SQL Server Custom Metrics, they show how you can run a custom query involving a person’s age. Isn’t that a terrible example? This would run every minute (or whatever it is configured to do) and you will create graphs from this. Without good examples, it’s hard to understand how or why you would use this feature. Other problems are due to excessive jargon.

“In v6, DogStatsD is a Golang implementation of Etsy’s StatsD metric aggregation daemon. It is used to receive and roll up arbitrary metrics over UDP or Unix socket, thus allowing custom code to be instrumented without adding latency.”
Datadog

“Enabling JMX Checks forces the Agent to use more memory depending on the number of beans exposed by the monitored JVMs.”
Datadog

Official Training

Members of the Network team, DevOps, a few managers, and some volunteers (who want managers to look upon them favourably) – signed up to a session with official Datadog training staff. These sessions were recorded, and I watched these and made a few notes; although it was just riddled with jargon and hard to know what anyone was talking about.

“Datadog Expert Services, or DES for short, is a set of “guided hands-on keyboard” pair-programming sessions. These collections of sessions, collectively known as an engagement, are time boxed and specifically designed to enable you to get the most out of Datadog while adhering to best practices. In this session, our team will work you to configure and deploy the Datadog Agent. This includes deployment in a standard, or Kubernetes containerized environment.”
Datadog

There were 2 courses that it seemed that these people were enrolled on

Name	Time	Cost
QS-INF course	2 Weeks + 5 Sessions, Curated	$15k
QS-INF-LOG course	3 Weeks + 8 Sessions, Curated **	$25K

Training cost is bonkers isn’t it? Once you have paid all that, then it pushes you toward the sunk-cost fallacy.

One of the Instructors asked what our infrastructure was.

“we’ve got resources and infrastructure in Azure, with a bias towards AWS, then we have on-prem; most of it is Windows Server. A combination of 2012…and onwards. 2016, but mainly 2019 as well. They also run on Windows HyperVisor, and also VMware – so they are virtual machines. But actually, we also have physical servers as well.”
deployment dude

Basically, we just made it up as we went along and got all the things! It sounds like a similar thing was done with the monitoring, because the deployment dude said we have “16 or 17 on-prem monitoring tools, as well as custom Powershell scripts to generate some data to monitor”

The Datadog Instructor explains that we have to log tickets if it is outside our “engagement time”. They will reply when they can but there’s no set time-frame.

“That’s fine with us, we log enough tickets already, so that’s fine. I think we will welcome that.”
DevOps Engineer

It’s almost like we were taking any opportunity to slag our company off.

No Going Back

Good news everyone!

The DevOps engineers with support from the Architecture Team have levelled up our Live datacentres!

How? With estate wide deployment (completed Friday evening) of the incredible, uber-awesome full stack monitoring SaaS Datadog!

If you’re aware of Datadog’s capabilities, effortless integration and out-of-the-box features you’ll appreciate how monumental this is.

For the uninitiated, Datadog in a slick, AI driven, intuitive UX allows full stack monitoring of servers, databases, tools, services, containers, et al.

Effortlessly switch from viewing the entirety of all network traffic to drilling down into individual requests, logs, payloads, processes, you name it, in real-time.

Going forward we envisage significant improvements to our reaction and mitigation of all types of incidents, minor to major!

We are currently trialling access – To request access please join our Slack channel.

Stay tuned as we have more exciting stuff coming as a result of our DevOps strategy!

Watch this space!
DevOps Engineer

Web-based Demo

One team put together a small web-based app and presented a demo to the department to promote Datadog, and obviously, take the opportunity to look amazing in front of the management.

The team lead was trying to show a feature called “Cumulative Layout Shift” but didn’t explain it. He made out it could track how many parts of the website load – so you know how sometimes you load a webpage and might see some text, then an image suddenly pops on screen, then some adverts, and often it causes the layout to change, then some more adverts appear, possibly changing the layout once more? It’s not a smooth user experience and causes a lot of jerks if the user tries to navigate the page before it has fully loaded. So how does Datadog track that? What is tracking it? and wouldn’t that mean there are multiple server calls to Datadog to log it? The web page is already slow, so why would adding extra server calls back out make it better? I can’t see how that can be performant, especially when you have thousands of users. Isn’t this process logging an insane amount of trivial data over time? I think I was left with way more questions than answers.

He also said it can track time spent on a particular web page, view count, error count, action count, frustration count (he claims Datadog tracks clicks out of frustration. How?). When people are already worried about the amount of monitoring/tracking/surveillance with the likes of tracking cookies – and then websites can track you to this granular scale with Datadog; it is a bit worrying isn’t it!?

Everyone should use Datadog

In following department meetings we were told by the CTO that all teams would eventually use Datadog and we need to increase the amount of monitoring, and to do it quickly to take advantage of the benefits of the tool.

My manager wanted our team to create a Datadog dashboard. Even if it wasn’t that useful, she wanted to be among the initial users – probably to look good to her manager.

I asked one of the smartest developers if it was even suitable for my team. He was looking into creating a dashboard for his team, but his team had an API that third-parties could use and it was prime for this kind of monitoring.

He was a bit vague though:

“You could create a custom metric for it. But I wouldn’t be too sure. I’m probably going to use custom metrics for “#messages per APP per minute” sort of thing. But I can get all that from my Logs/Traces. You’d have to have something pulling that data from the main databases which would involved pushing it to Datadog.”
Principal Developer

I asked other people that were using it, and people just kept on saying they weren’t sure, or maybe others have done it.

“We’re making heavy use of Datadog for our new software and I believe it’s also being used in other areas too. It’s incredibly powerful and provides a huge amount of detail. Getting the most out of it is important and also building some standards seems like a good idea. Do we have any thoughts around how we ensure we lead the way with this and get some standard/learning/documentation in place?”
Developer

No one can give a straight answer when it comes to this stuff. People are like “EVERYONE is using Datadog“, then when you ask about it in more detail, they are like “well SOME teams are using Datadog“, then when you ask more people, they are like “there are some metrics but not quite the ones you want“

Performance Problems

I asked my Software Architect friend (who seems to know everything) if Datadog is as flawless as people were implying. My intuition was thinking it cannot have zero disadvantages.

Me
Won't Datadog just cause performance issues if we start monitoring everything?

Mark
yep, or run while patching is in progress and block access to the Database/tables, which has already happened. Running ad-hoc scripts is a fairly bad idea
Hosted had to run patching twice the other week, which pushed us out of our Service Level Agreement.

Me:
this juicy gossip keeps quiet doesn't it

Mark
yes because Datadog is a massive success and we paid lots of money for it

Technical Director

Recently we hired a “Technical Director”. He asked how Datadog was coming along and if we can highlight any issues so he can get involved. This prompted John to go on a rant. The TLDR of this section is that “Software Developers don’t know about infrastructure of the Live Production environment.”

I think one of the company’s biggest challenges is how many products we have, and how diverse they are. We have no real standardisation due to a number of different factors, not sun-setting old services, not tackling tech debt, products that were developed by other companies and came to us via acquisition etc..

As a result, I think it’s difficult for us to template things out such that it can work for multiple people.

Realistically, each team for each product needs to look at how their product works, how it’s used, what tech it’s built on, and build a solution that works for their product. And I think one of the biggest challenges at the company is the ‘DevOps wall of confusion’ isn’t just a normal wall, it’s a Trumpian 15 foot high one with razor wire. Lots of products have dev teams (assuming they have one at all!) with little to no exposure or knowledge of how production works and what it looks like. For so long dev teams were told they had no role in production, no need to access it and were kept locked away from it

For reference, I used to think like that. I’ve been here 15 years and I have been part of the mindset in the past. It’s changing, and I’m happy to be one of the people pushing for that change, breaking down that wall of confusion. But that’s one of your biggest hurdles – is that people don’t know what to monitor in production because they don’t know what it looks like, and trying to monitor it by just copying a template that worked for somebody else, but doesn’t work for their solution isn’t a way to solve it

The key to unlocking Datadog for me, is to get people to have visibility of production, to understand how it’s used and what it looks like, and then start to work out what metrics are important, and what “normal” looks like so we can alert when we deviate from that

I can talk for hours about this, my team has one of the best observabilities out there, and had it before Datadog came around. If you want to have a chat, happy to have a discussion about what we can do.

I may have painted a somewhat negative opinion above, and I agree that there are things that we can improve. But we can’t expect some pretty Datadog dashboard templates to solve the historical problems that have meant we have lots of live services in the business with nobody who understands where they are or how they work and crucially expect Operations 24/7 to be able to magically pick up the pieces and fix it by themselves when it falls apart.

Yes, the company has a long history of developing a solution, moving the team that developed it off onto a new project, and leaving that solution behind. Combine that with a massive wall of confusion between Dev and Hosted, you have hosted running a bunch of servers that they have no idea what they do.

Case in point right now, the “Login and Identity service” is in the main data-centre, and we also have one in the DMZ that was built for the Mobile app, but nobody is quite sure what the main one is for. I have some notes that indicate it was built for the Connect app, but Connect doesn’t use it. Yet still that production sits there unused with nobody sure why it’s there.

You’ll find a team that has maybe done work in the past on Appointments, maybe even recently. Are they currently working on Appointments? do they have any knowledge or visibility of production? is it even on their radar that they should be monitoring the performance of it?

This goes deeper than just dashboard templates, it’s a company culture problem
John

Anomaly detection works well if the metrics are predictable for different periods of a day. It’s not a “AI” as we thought when I tried it out, it’s more of a fancy algorithm than machine learning.

I found with XMPP that method would work OK for Mon-Fri, then the alert will trigger all weekend because traffic wasn’t as high those days.
Lee

Scheduler

I was added to a group chat where the team was discussing how to use Datadog on our “Scheduler”. It sounds like an easy task, but there’s way more nuance and technicality to it. The main problems we have with the scheduler is that:

some jobs fail and then wait to be reset,
some jobs run but get stuck (I think most cases the scheduler isn’t informed that the job has finished, so it fails to schedule the next run).

The TLDR of this section is that: there is a lot of junk data (and I mean A LOT) and reporting on these figures can be somewhat misleading because failed jobs for organisations that no longer exist aren’t a problem (although we should stop them from running since they are obsolete).

John
Surely we need something that runs that shows us:
a count of jobs in Error Status 	
A list of jobs with a status of Running, 	
Long Running Jobs


Matthew
We'll want to take into account the normal running time of a particular job. We don't want to be alerted about a job that usually takes 2 hours to run and it has only been 1 hour.
We'll get to ones that get stuck quicker if they usually take a minute to run

Dave
Someone should run some queries against live now, and get a picture that looks correct.

Matthew
We also want the data to be meaningful otherwise we'll be back to where we are now - where we don't know what's actually working and what isn't. There's a balance to be had here
Christian
Can we summarise the Key Performance Indicators that will cause an action that needs to be performed? These become multiple metrics IMO, that add together to give better context

John
1. Job queue building up
2. jobs failing and not being reset
3. jobs getting stuck

Matthew
• Large numbers of jobs not running when they should
• Jobs stuck in running beyond their normal running time
• Mass job failures
• Mass job queues (this has the potential to false flag when workload is high)

John
There's a bug / unexpected behaviour where the scheduler can fail to update the database with the result of a job and the table shows it in status Running. Scheduler logic queries the tables for what is / isn't running. Leaving it to make decisions that it can't do stuff because a job is "running" when it in fact isn't.

Matthew
If this is a bug, the smartest thing to do after the monitoring piece is to fix the piece of software causing it surely?

John
the secret to any good bug report is reproduction steps, and it's not an easy one to reproduce
You mentioned you'd had one get "stuck" recently. Do we know how to reproduce that behaviour again on demand?

Matthew
"Just let the scheduler do it's thing and wait" is the only way we know how to replicate these

John
hence why any developer would struggle to fix it because it's difficult to know where to look if you can't reproduce it

Christian
"Treasure what you measure" or "Measure what you treasure". Simple counts and alerts will likely get us to what we need very short term which is to prevent or proactively manage / reduce potential for a Major Incident.

Matthew
I've got some initial queries together for this that could be used to alert on if the numbers get too high. I'd appreciate someone who knows TSQL to have a look and to suggest any improvements to the data being returned.

John
the 3000 + jobs in error is scary
Do we need to filter that list by only jobs that we know get automatically reset by the Hosted DBA agent jobs?

Matthew
Maybe, I did think that but I also thought that we should really know how many jobs are actually in error

John
I know that list in itself is a problem. But I think all Domains are going to have a high failed count and it's difficult to know if there are important ones in that 3000 + count

Matthew
We shouldn't alert on that metric, hence the one to track how many in error for the last hour
The scheduler is a massive mess and that 3000+ count suggests we have a fair bit of clean-up to do.

John
the only suitable metric I can think of for "important' is ones that the Database Administrators already deemed as important and created automated resets for.

Matthew
I could add an additional "Important" row to the general stats that includes those (or excludes any that aren't those)
Need that info from the Database Administrators though

John
Do we maybe need a couple of groups rather than just 1 "important" group

Matthew
I'd rather split the jobs and call out their names though, rather than pile them into one huge count if we're doing that
Let's get the data in Datadog first and see what normal looks like and tune alerting accordingly
JobTypeIDs aren't consistent across the estate by the way, so you'll have to match on JobTypeName with a join to the JobType table
<Image of 2048 Ready jobs. 47 Error>

John
Interestingly those 47 jobs are haven't run in years. some of them last ran successfully in 2016
but we're resetting them every day to try and run and constantly failing (job run count of 271,280)

Matthew
Hence my comment about a lot of clean-up - I'm willing to bet these are trying to run for closed Orgs, or orgs that have moved endpoint

John
Each Domain will probably need work to get rid of all the false alarms
I know when I checked one domain there were 40 + jobs that had never ran and were just constantly being reset
Maybe an idea to simply disable these and change the script to add & Enabled = 1 to the filter so you count only enabled jobs?
That should help remove the false positives you know about - then you can actually alert if the value goes above 0 for jobs in error

Paul
We are assessing whether the best approach to reduce the number of scheduler incidents is to deliver the new scheduler with improved logic and Datadog integration which will take time. 
Or support the integration of Datadog with the current scheduler.

Matthew
If it's the former, should we still do the latter anyway until the new scheduler logic is in place?
I suppose what I'm trying to ask is will the time-frames for implementing the new logic be quick enough to satisfy the urgency of monitoring the scheduler?

Paul
Yes agreed, we have just reviewed the last 9 months of incidents and having Datadog reporting would have given us the insight to avoid a number of these.

John
As well as adding an "enabled=1" filter Matthew, do you think it's worth adding a runcount > 0 filter as well to avoid counting jobs that have never ran?
For the sample Domain I looked at, every priority job in error had a run count of 0 showing they've never worked. Adding this would bring that result down to 0 which makes it much easier to then set an alert if that goes above 0

Matthew
I thought about that, but that will mask errors with jobs that should run but haven't. We'll want to see those. New job types as well, for example

John
going to be a hell of a job tidying up all the crap scheduled jobs in the scheduler AND setting up and calibrating monitoring at the same time
My thoughts were to filter those out for now, then look at those crap jobs later

Matthew
Yep, it is, but we can't ignore the mess as it won't go away. A lot of work disabling jobs will be needed to get the overall stats query to show nice figures. We shouldn't shy away from them looking terrible though. I don't believe in fiddling figures to make things look nice when it comes to monitoring
The other queries that show failures and stuck/running jobs for over an hour will help with spotting immediate issues though
One particular situation to take into account is the longest-running job we have is 8 hours. We can take care of that in Datadog with trends and anomaly detection to tell us when more jobs than the expected ones are stuck in a status for longer than an hour.
Similarly, we can use that same alerting method to warn us when the numbers on the overall stats aren't within usual parameters. Change detection is also a good measurement to use here too. We don't necessarily have to use traditional methods of alerting as soon as a value is over X

John
that sounds to me like a case of another metric
count of scheduled jobs running more than 1 hour where job type is NOT “expect long-running jobs”

Performance Problems Part 2

Note: If the min_collection_interval is set to 30, it does not mean that the metric is collected every 30 seconds, but rather that it could be collected as often as every 30 seconds. The collector tries to run the check every 30 seconds but the check might need to wait in line, depending on how many integrations are enabled on the same Agent. Also if the check method takes more than 30 seconds to finish, the Agent skips execution until the next interval.
Datadog

It seems that for custom sql metrics, you can only specify a single time frequency to run ALL the queries. So if one team creates a query they want to run every minute, and another team wants to run every hour – you can’t.

One team wanted to run a long-running query, but since the first team had set the queries to run every 60 seconds, then this long-running query wasn’t possible.

In a similar fashion, we also anticipate problems if the total time of all queries exceeds this 60 second limit which we will soon do with only several queries.

Another problem that we found is that the time you set is just a guide. So Datadog could actually run it twice in the 60 second period. Then when it comes to creating the dashboard, you have to be careful that you don’t end up counting the data multiple times. Some teams were seeing decimal numbers on their charts when counting data with only whole numbers!

The possibly crazy workaround
John

I think a good workaround would be to have the data refreshed hourly, and placed in a separate database somewhere then have that separate database queried every 60 seconds by Datadog. If it’s separate to the Live estate – it should reduce the risk. Needs thought putting into how you would pull those stats into a database hourly however. Need a SQL agent job or similar that could collect them once an hour and push them to a central separate location.
John

key thing, would be to ensure we aren’t using the scheduler to kick off SQL that monitors the scheduler 🤣
Christian

Need More Tools?

We purchased Datadog to replace several metric tools. Now we have seen the SQL metrics are a bit basic, it seems like we are concluding we need an alternative tool. I wonder if we will keep buying different metric tools over time and end up in the same situation we were in before!

You get some basic SQL monitoring in the base DataDog install (or the APM one, not sure which). You can pay extra for “enhanced” SQL monitoring in the DBM module. It’s still very basic and about 3x the cost of “proper” SQL monitoring tools. I’m sure as the DBM module matures it will get closer to fit for purpose, but right now it’s an incredibly expensive SQL Server monitoring tool that’s feature poor. If it was near zero cost, I’d have it everywhere to give a wider audience better visibility of what the DB layer does in live, but the features don’t currently justify the cost of that particular extra module.
Database Administrator

manager: can you draw me a pretty graph

pretty graph:

Performance Tales: Tasks

Recently, a team was created in order to tackle major issues, often proactively. Some errors are logged without the user noticing anything, so by monitoring the logs, the team can diagnose and fix these “silent” errors. The other thing they are looking out for is performance problems. Some performance problems aren’t noticed when it’s more of a minor increase, inconsistent, or slowly gets worse over time. I would think some users don’t bother reporting slowness because it’s harder to quantify than the obvious crash.

However, one user had been seeing a recent drop in performance by not dealing with their tasks, but it had got to the point where they could no longer log in, as it took longer than 30 seconds to retrieve their tasks on login – so it timed-out (as in an error was meant to be thrown when the time to retrieve their tasks took 30 seconds).

“At the time of logging this bug, the user currently has 136,854 tasks in Tasks Management. The Program performance will start to be negatively affected after 4,000 tasks. I have extended the timeout of sql call for TasksManagement.GetUserTaskCountSummary to 60 seconds as this caused a login failure.”
Walter (Developer)

“let’s be honest, the program performance will start to be negatively affected after 1 task”
Mike (jestingly)

I think this is acceptable as a quick fix to allow the user to login again, but is it really acceptable for the login process to take more than 30 seconds? I’d imagine it would take around 40 seconds for this user.

That’s the problem with this team, they just look for quick wins, even if it isn’t the right solution long-term, and might even move the problem elsewhere.

What’s better than waiting 30 seconds? waiting 60 seconds? or threading it off to delay the loading? or Mark’s idea: no timeout.

What if the time taken still exceeds 60 seconds, assuming some other user has a large volume of data? Can you set the timeout as infinity?.
Mark

we had a customer with a very large count and it only ran for around 32/33 seconds and their counts were unusually large. We are going to send out communications for customers to keep these low by means of maintenance. The 60 seconds just allows an extra safety net if we get in this situation again. I don’t want to extend the timeout to be too long for this reason as it will unlikely (if ever) be needed to be longer than 60 seconds.
Walter

Why not a try catch / retry attempt for this? It should be a non-essential call to logging in, if it fails, you can catch, log, and show an error message. Should we not look at optimising this so that you can login quicker?
Maybe run this on a background thread too?
Lee

I discussed this with Johnny and making changes to this Stored Procedure could result in worse performance for smaller datasets and advised against making changes to it. We’re going to tackle this by means of communications to sites. I thought the simplest and safest approach is just to extend the timeout slightly so that the practice does not suffer a system down as a result of it – the timeout is only breached by a second or two.
Once the user logs in, they are displayed their task counts, so I think it might be deemed essential (rather than showing them a loading wheel until the data is returned). Currently, if we did this, when loading up Tasks Management it would just crash with an error.
Walter

It would still crash on logging in if it takes over 60 seconds.
Why not make it non-critical to logging in?

Log in

Status bar “Loading…”

Completes – OK.

Fails – Show error and retry link.

Lee

This was the worst site on the estate and was taking roughly 32 seconds. To take over 60 seconds the task count would probably be unheard of.
Each time I ran that stored procedure manually, the results were the same so I don’t think a retry is going to work here.
Even by changing to make it non critical to logging in, Tasks Management will still be busted when you try to load it. The timeout is on the SQL side so that is the area we need to resolve really.
Walter

However, Johnny did advise against alternative solutions such as:
1. fine tuning the stored proc
2. adding indexes
3. Remove tasks counts completely for some types of tasks

My View:

Walter seemed to have put more thought into it than I originally thought, but I still thought he was overlooking Lee’s suggestion. Yes, it would need more work to actually work (display loading text on the Tasks Count Bar, then loading screen when launching the Tasks Management page), but it would significantly speed up logging in. If this user could log in 32 seconds quicker, then what would the average user see?

If the other parts of the log-in process also take some time, then that’s a long time they are waiting in total. If taskcounts are the bulk of the time, then we can make it super fast if we take it out. I would have thought users would expect times of 5 seconds or less (might not be possible, but that’s the scale we need to aim for). Walter is talking like users are more than happy to wait 30 seconds or more just to get to the home page. A long wait is better than not being able to log-in at all, but surely it’s generally unacceptable to be more than several seconds in total. It’s one of the reasons why users have grown more discontent over time.

When doing some testing of smaller counts, for example 10k – the results are returned in a few seconds (2-3). This organisation had around 120k Appointments Tasks across all users plus all of their other tasks which resulted in a production duration of 32 seconds. The more they manage their tasks the quicker workflow will be, that’s always been the message we’ve tried to get across.
Walter

Project: Batch Approval

This long blog documents what I have been working on for the past year. I had made lots of notes with the aim of writing a blog, in addition to taking extra notes from chat logs.

We actually estimated the project would take around 5 months, but then an extra 2 months for testing and go through our slow rollout process. It actually took closer to a year. I’d say it was a combination of:

realising the feature was more complicated than anticipated
the UX team had little knowledge of the actual user experience
managers changing or trying to change team members
our slow release process
90/10 rule of project management

We were told the project was important, yet we were only assigned 2 developers (as in myself and one other). As the project came to a close, we were being integrated into our new team, therefore other developers could help out during the final stages.

Here is a list of all the people involved over the project’s lifetime:

Name (Core team in bold)	Role
Me	Developer (Team Lead)
Daniel	Developer
Dean	Developer (Temporary)
Dennis	Developer (Temporary)
Tina	Tester
Tim	Tester
Colin	Technical Manager
Mary	Technical Manager
Olivia	Product Owner
Owen	Product Owner
Carl	Customer Representative
Adam	Architect
Andy	Architect
Grace	Safety & Legal Governance
Ulrika	UX
Ursula	UX

I’ve made the names start with a letter to represent their job title, apart from Colin because he is a recurring person in my blogs. I’ll put reminders throughout the blog so it is easy to follow.

Current Software

To protect anonymity, I need to come up with a different theme for what the software is for. Let’s say customers request various restricted items of different severity. So a request could come in for a Hunting Rifle, and the user needs to know if they have the adequate licence to possess firearms and they are deemed medically safe in a recent time-frame. Possible warnings are shown which the user can dismiss/acknowledge e.g. “licence is up for renewal in the next 3 months”, “recent purchase of other firearms”. Standard users can create “Awaiting Approval” tasks and assign them to users with authority to approve. To approve them, the authorised users open the task list, view the details, then click approve. Many tasks have either no warnings, or low-severity warnings, so users often just glance at the info and click Approve. The system then sends the approved request to a central system, then loads up the next task. There’s a couple of seconds delay due to the “digital signing”, a couple of seconds for sending, then loading up the next record. To sign loads of tasks, it’s a very slow and laborious process. It’s a major source of complaints from our users.

Unsafe/Unofficial Automation

Carl [Customer Representative] sent a link to a video where someone was demoing a commercial automated tool that autocompletes the tasks. It waits for the system to load, clicks the approve button, then repeat. So you could set it running, then walk away from your desk.

I thought it seemed ridiculously irresponsible and would cause people to be sacked if they got caught using such a tool:

A) The program is now the one authorising the tasks, not the qualified user. What’s the point needing to have qualifications if you aren’t even going to read what is on-screen? If a task was wrongly approved, then the user would be accountable.

B) if you walk away from your desk, you are leaving your PC unlocked, along with your physical Security Key.

The creator had actually put a bit of thought into it though. If there are any Warnings that require another click to dismiss/override, then the automation is paused.

The video claimed that some users have up to 500 tasks to sign after a weekend. They charge a fixed yearly fee of £295, plus 7p per customer on the system per year.

“the robot does not get bored, does not make human errors, and crucially is a lot cheaper than the user’s hourly wage”
Promotional video for the Automation tool

Probably just makes robotic errors instead!

I said we should change the names of the buttons to try and screw them since it probably uses something like that to locate the button to click. It would be quite funny to make them dish out refunds.

The existence of the automation tool shows how much the users desire a better solution.

UX User Feedback

Given the existence of such an automated tool, it is no surprise that one frequently requested feature is Batch Approval. Our UX team put together some kind of interactive prototype and invited a few users to provide feedback on two designs. The alternative design was actually produced by Mary [Technical Manager] who has no UX qualifications. I’m not sure how that came about and why UX agreed to trial her design, but the feedback was actually extremely favourable to her design.

This caused her to be quite smug and maybe caused some animosity as we will see later. The ratings out of 5 were:

(Option A) 4.3 for Mary’s design

(Option B) 2.3 for UX Team’s design

For additional comments, one user commented:

“I prefer Option A by a country mile – Option B feels even worse than the existing system!”

Another commented:

“Option B feels more clunky, less user friendly than option A. A lot of clicking involved”

One even gave a threatening response:

“Option A or you’re gonna lose me and my franchise”

Shortly, there was a write-up from a conference where the feature was announced:

“This item is one that really did steal the show – this is something that our customers have been very eager to see us implement and are very excited to learn that we are busy developing this solution.”

“Busy developing this solution” made me laugh, because at the time, all I had was a dialog box with a couple of lines of text and a button.

Proposed Change

The general idea, is that the user is presented with key details from the tasks in a data grid.

They can click checkboxes to select which tasks they want to approve.
These are added in a queue to send in the background.
The user can continue working as they are sending.
The “digital signing” has to take place on the user’s computer so a large part is done client-side.
The user has to remain logged in until the process is finished.

This project had actually been discussed for years, but because there wasn’t much of a “commercial drive” for it – we would be giving users this feature for free – it was always low priority.

Product Owner: Owen

I think the initial planning was done by a different Product Owner but then when the project fully began, we were assigned a new Product Owner, Owen, who was new to the company, but he also gave me the impression that he was new to the role…but also didn’t seem very clever in general.

Here are some quotes that happened in various meetings (mainly Sprint Planning and Refinement).

Owen: "which work item is it?"
Me: “the one right at the top"
Owen: slowly scrolls...chooses 2nd item

Me: "it's not a Must, it is a Could"
Owen saves it with Must tag
Tim [Tester]: "No, Owen, you tagged it wrong, go back"
Owen: "Which WI is this?"

saves it with the Must tag again
Then goes back into the work item and gets confused
then goes back into it again. I think he needs rebooting

Me: "you need to set the state"
Owen clicks to close
Me: "you need to set the state, go back"
Owen is confused
Me: "left hand side Owen!"
Owen hovers over the right
Me: "left hand side Owen!"
Owen moves down

Me: "leave it as it is"
Owen "Which one shall I take out?"
I'm sure he is intentionally 30 seconds behind to wind us all up

Owen changes Story Points from 3 to a 5 without any discussion.
"shall we keep it at 5?"

For another item, I was talking about how the requirement is either obsolete, or needs a completely different approach from the initial proposal. 
Owen: "So how many points shall we add?"

"The system crashes when entering incorrect PIN and clicking 'OK' on error prompt"
Owen: "what was the behaviour before we fixed this?"
team: "It crashed"

We were discussing how we logged a bug a few months back but haven’t seen it occur since, so it will need some investigation to try work out what the recreation steps are.

“Assuming the bug still exists, how long will it take to fix it?”
Owen

Estimating software changes is hard, but I always think bugs are even harder to estimate. It’s only possible if there’s clear recreation steps, otherwise it is stupid to ask – we can’t fix it if we don’t know what the problem even is.

“depending on Grace’s [Safety & Legal Governance] feedback, do you know how long it would take to fix?”
Owen

Translation: can you predict what Grace would say, and given that she did say it, can you come up with an estimate for it?

I logged a bug about suggestions on how to improve a dialog. It would be up to Owen or UX to decide on the approach to fix it. Owen then asks questions along the lines of: “what do we need to do for this? do we need it?” I said it would be nice but it’s not my decision. Then he still asks “do we need it?” “can we close it?“

What’s the point asking me these questions, when I logged it with the aim of asking him to decide?

When the project deadline was looming, we ended up having multiple meetings to decide if there’s any features we could scrap, or defer to a later release. After the first meeting where we decided scope, he may as well have said “You know those items you said we need to do and couldn’t defer them, are you sure we can’t defer them”, because he was arranging subsequent meetings to go back over them. When we came up with estimates which showed that we would need at least another month, he was then arranging another meeting to re-estimate them.

The Architects

An important project started around the same time ours did. Our architect, Adam [Architect], was reassigned to the new project. Andy [Architect] joined our team as a replacement. He wasn’t completely new to the company but wasn’t familiar with this area of the system. Additionally, I don’t think he even looked at the software or even requested a demo.

Any question we asked him, he ended up making an excuse that he was busy and will get back to me later. Then when he did answer, I then sent a message to the original architect, Adam, and he said Andy had asked Adam about it and simply relayed the message back to us. So basically Andy wasn’t doing anything. We had him officially assigned, but it was Adam [Architect] that was answering the questions but via a middle-man.

The July Cancellation

There was a bit of disruption when our project looked to be cancelled, but there was apparently some mis-communication.

Hi All, a decision has been made by Directors to stop Batch Approval and to move resources across to pick up Project France instead. Therefore I will be cancelling the Batch Approval meetings.
Project Manager

1 day Later

The directors had decided to move you to the new project so I cancelled the meetings, but then I find that there wasn’t a firm decision from the Directors.
Project Manager

Brian has asked us to proceed with Batch Approval as originally planned. Sorry about the chaos dudes. They must be smoking some good drugs upstairs.
Olivia [Product Owner]

It was off the table, then someone put it back on the table, then someone else swept it off the table, then someone picked it up off the floor and put it back on the table.
Andy [Architect]

Coding Tales

Colin [Technical Manager]: "What sprint are you in?"
Me: "I dunno"
Colin [Technical Manager]: "you are the team lead, you should know"
Me: "No one in the team knows"

Put it in a new tab but make it behave like a dialog

The original UX designs looked like it fit nicely in the existing Task Framework. The requirements were that Batch Approval had:

Its own folder but is a sub-folder of Approvals
Opening a task opens it in a new tab

After looking at the code though, the framework didn’t actually support a sub-item. But we found a basic workaround to make it look like it did. However, there were quite a few features that we got “for free”, but we didn’t want them because they weren’t appropriate for a sub folder. So I had to disable the features by hacky code.

If you double click a task, then it opens in a new tab, which is what they wanted. However, they then didn’t want you to be able to navigate away into other parts of the system, and the Task Framework didn’t support that. With a bit of a workaround, I got that working, but the tab was designed to view one task only, and we are displaying a Batch of them. A few weeks went by and I managed to cobble something together, but the code was awful.

I took a step back and thought about it.

We have a tab that the users surely would expect to be able to move away from to view other tabs.
I’m using this “tab” which is designed for a single task, and I want multiple. So I had to make my own custom page.
We have hacked a sub folder and had to basically fight against the codebase to get it all working…

So why don’t we just have a button on the main folder, and it launches a modal dialog?

It would take a couple of days to get working,
the code would be neat,
and I think it’s what the user would expect.

After speaking to UX about it, they were happy with my proposal. I had wasted about 3 weeks trying to get it working like they previously wanted. Also, we are again telling UX what a good UX design is.

Scrollbar

The UX was also clear that we didn’t want a scrollbar to appear, and instead we use pagination. I didn’t see anything obvious in the standard DataGridView Winforms control, although I’m sure this is a common problem/requirement.

I ended up writing my own logic to add controls to the grid, keep track of the size, then stop adding when the size exceeds the height of the control. However, if there is only 1 very large task, we have no choice but to use a scrollbar.

The problem we encountered was that sometimes a scrollbar did appear when it shouldn’t. I made some tweaks to the calculation and it seemed to work fine. But then a Tester found a combination of task sizes where it still appeared. I couldn’t work out what I was missing in the calculations but it seemed about 4 pixels off, so I just added that into the calculation. Again, all seemed fine for a few days, but then the Tester found a combination of sizes where it still appeared.

Olivia [Product Owner] suggested that we detect when there is a scrollbar then disable the Approve button until the user scrolls down.

I said if we know when the scrollbar is there, why don’t we just remove the last task and check for the scrollbar again, repeat until the scrollbar has gone. I thought the code would be messy, and I’d end up writing a stupid code comment like “mate, something has gone wrong with the calculations here, so we’re gonna have to do some jiggery pokery to get out of this mess”.

Adam [Architect] did suggest some alternatives and they were just as wildly wrong.

Dean, a developer in another team agreed to help, and after a couple of days, he says “you can just set the vertical scrollbar to be disabled”.

But if the scrollbar is appearing so you have to scroll to view the content, then surely disabling the scrollbar will mean content is off the screen?

I tested his idea, and it worked fine! What must be happening is that the vertical scrollbar appears and takes some of the horizontal space… which causes the text to wrap and creates the need for more vertical space. Therefore the scrollbar is required and so remains. But if you tell the scrollbar it cannot appear, then the controls are added, and my calculations meant it fit perfectly in the grid.

It’s a self-fulfilling prophecy!

Olivia [Product Owner]: Do we have concerns about the unknowns?
Tim [Tester]: It's just the unknowns that we don't know about
I feel like you need to know the system inside and out to be able to safely implement this

Conflict With The UX Team

UX: “We want to minimise pop-ups”
Also UX: “Add a pop up after closing the dialog”

Ulrika [UX] had to take time off to deal with some personal problems. Ursula [UX] agreed to join the meeting we arranged on the Wednesday.

“I don’t work Thursday/Friday and have to leave early on a Wednesday to get the kids. I’ll get back to you next week”.
Ursula covers for Ulrika but then also has time off.

When she got back to us, she seemed to overlook how users access this restricted part of the system, and it turned out none of the UX team actually had this knowledge. So halfway through the project, we were discovering new requirements because they hadn’t designed the user flow.

Don’t Have Time

In early January, we were waiting for UX to give us some approved text but they seemed to be taking their time. I asked Olivia [Product Owner] what was going on, and she said that we don’t have time to make any more changes so they “needed to stop requesting changes”. Even though I pointed out that I was the one requesting changes, she said “we don’t have time to test” (even though it only involved quickly checking some text has changed on a message box). Nearly 2 months went by before we actually began to release.

After more protests from me, she says:

“The text is fine for now. We don’t have time to be changing it.”
Olivia [Product Owner]

When it came for the final review, reviewers questioned why we had dialogs with some ToDO comments on it saying “ToDo: Awaiting UX approval“. Even if you don’t have comments like that, I have seen developers question the user-facing messages if the grammar isn’t correct or sounds unclear. It definitely wasn’t clear because we just wrote the first thing that popped into our heads at the time; knowing the text would be replaced.

I think what had happened was that Mary [Technical Manager] and Olivia [Product Owner] had fallen out with Ulrika [UX], and then was refusing to authorise her changes. Remember, tensions will have been building since users criticised Ulrika’s design and wanted Mary’s design, and Mary’s arrogance about it wouldn’t have gone down well.

It’s just part of the process though – all text needs to be approved by the UX team; otherwise what is the point of their team?

Conflict With The Architect

When we implemented Adam [Architect]’s suggested invalidation logic, we thought the criteria was too restrictive. Adam was off on annual leave for a few weeks so we couldn’t consult him. So we made our own decision to change it, and got Carl [Customer Representative] and Grace [Safety & Legal Governance] in agreement. However, when the Architect saw it, he said it was unsafe. In many meetings, I got the impression Grace wasn’t really listening and she tended to agree with what we said. Not exactly great when your job involves telling the team what is safe and legal, and then get overruled by the Architect.

We came up with a compromise, and implemented it. Then when it came to the Code Review, Adam suggested removing one more of the sub-rules which I think would be perfect, but then Olivia [Product Owner] was reluctant for us to make more changes.

Then a week later, Olivia said she would arrange another meeting to discuss the rules because she felt it might be too restrictive. OMG. However, she then seemed to have personal grievances with Adam, so told me not to make the simple change, even though it would be what we want. She used the excuse of lack of Testing time.

Adam [Architect]

We shouldn’t be knowingly introducing bugs. Olivia [Product Owner] This is not a bug. It’s a change to the criteria and we are not going to change it a week before we finish. I am speaking to Carl [Customer Representative] about changing the criteria, and we’ll look at it then. Adam [Architect] A bug is any deviation from requirements. Why are you planning on changing it if it is not a bug? Olivia [Product Owner] That’s not a bug. You are right in the sense that we need to change it…we’re just not changing it now. I was happy to leave it as it was to get this out of the door. That’s my call to make. Mary [Technical Manager] There's a lot that's not right. But how long do we keep going until we give it to the customers?

A summary of how this situation appears to me:

There is a process, but if you declare you want to move the process to the next release, then it is fine.
It will take too long to change a few lines of code, so we ain’t doing it. Apart from when it is a comment on the Code Review, then we are doing it, apart from those that we aren’t.
It takes longer for Olivia [Product Owner] to argue against it than to fix it.

The CEO had recently posted:

“The most important thing we do every day is keep our users and their customers safe by managing risk effectively. I know you all know this, but it warrants repeating: safety is our number 1 priority all day, every day – regardless of anything else that is going on. It trumps everything. Please always remember that.”
CEO

Our Managers are like:

“Next release”

The Technical Manager change

Colin [Technical Manager] complains that Daniel [Developer] and I haven’t handled the project well – and it overran by over a month at that point. A week or so later, the team was on a call with other stakeholders and he said

“you guys have done a tremendous job”,
Colin

then said the delay “was caused purely by scope-creep and nothing to do with the developers at all”.

“Mary is in charge of the team since yesterday”
Colin [Technical Manager] with his timely announcement

I got the impression that Mary just wanted to get rid of the project, because it was dragging on for far too long.

The Testers had nothing to do since us Developers were working on the last few bug fixes. Tina [Tester] said she was just re-testing old features to pass the time, but also get extra confidence there are no remaining bugs. Mary [Technical Manager] replied:

“should we be doing testing when changes are ongoing?”
Mary

Well, in that case, this statement means testers should only be hired for a couple of weeks right at the end of a project – since changes are constantly ongoing. I think she might have intended it to mean like “you’d better not find more bugs!”, but if there are bugs, then you definitely want to find them before our users do.

On the last day of the Sprint, Tina [Tester] took annual leave. She had left her assigned items in the “To Test” column of the Kanban board. There was no evidence she had tested the item, so I don’t think it wasn’t a case of just forgetting to move to “PO Approval” column. Olivia [Product Owner] and Mary [Technical Manager] then decided to just close the items. No evidence, no demo – just close them so the Sprint looks good, and looks ready to release.

What annoys me is that Mary had criticised how we had run our team and suggested we don’t follow the process. She stated that she perfectly follows the process – which leads to her successful projects. Then I see her cutting corners like that.

Just like Colin, she criticises me to my face, but then when we are in a group she states:

“I think you’ve done a fantastic job given that there’s only 4 of you”
Mary

A few days later, I had finished what I was assigned, but there was a bug on the backlog which Mary [Technical Manager] seemed to want to defer (again, she just wanted to release the project as soon as possible). I thought it couldn’t be released without this fix. I stated that I would like to look at it and she said:

“don’t do any development work”
Mary

Seems I have the day off then. What is the point in me sat around doing nothing? If I fix it, we can decide if it goes straight in, or deferred for the next release. Or maybe I won’t even find a solution. She just seemed desperate to finish the project so wasn’t considering the seriousness of the bug, or thinking logically at all.

The Backstab

I didn’t actually sit around doing nothing. I worked hard and found a solution. I knew that there was no chance Mary would accept my changes, so I needed to come up with a way of convincing her. My plan was to get the testers to informally test it, then I can say that I have a fix, and the testers are happy that there’s low risk of introducing more issues – so she would be stupid to reject it.

Testers Tim and Tina were in agreement that the fix should definitely go out in the initial release, and they agreed Mary was making a bad decision to consider releasing without it.

Tim said he would “have to check with Mary if he was allowed to spend time testing it” since they got told not to test anything. I said “there is no way she would approve it, that’s why we are doing this informally/secretively”. If Tim and Tina test it and find a bug, my plan has failed and Mary never needs to know that I attempted it.

It’s a perfect plan, or it would have been, but Tim then goes and tells Mary that I asked them to test it.

“You gotta start being better with your comminications – it’s not just yours and Tim/Tina’s decision if something gets put into the release – it’s a whole team decision but ultimately mine and Olivia’s. You’ve messaged them directly asking if they can get it tested, and as much as they’ll also want to get it done, it then puts them under pressure. This is how you’ve all got to a stage of being all over the place and burning yourselves out, it’s got to stop please.”
Mary’s chastisement

I shouldn’t have to go behind people’s backs and make my own decisions, but the entire non-management side of the team thought it should go in, and only the managers thought it shouldn’t. As a team we care about quality, but managers are just focussed on deadlines.

I also didn’t appreciate that she is accusing my decision making of adding stress to my team.

80% coverage

As the project got towards completion, I recalled our stupid “Merge Ready” process that no one seems to care about other than the small team who came up with it. You have to justify the likes of Code Coverage, and ours was at a low figure like 10%.

I’ll write some blog posts about my reasoning on when tests are good or bad in the future. A simple explanation is that Units tests are good when covering requirements, but often developers write them to cover implementation i.e. verify a particular method is called; but not that the particular method actually works. When you switch implementation, you have to rewrite a new unit test, slowing you down. Unit Tests are supposed to help you refactor, but in this case, it is a hindrance to refactoring. We did a lot of prototyping early on, and knew there would be large re-writes, so Daniel [Developer] and I decided to worry about Unit Tests later on.

When I declared the low number of Unit Tests, Olivia ended up raising it to the Directors for some reason. Why was it their concern? Do they even know what Unit Tests are for, and what the coverage actually means?

It could jeopordise my chance of payrises (I was correct, I got 0% this year) and tarnishes my reputation.

When Mary joined the team, she berated me over this decision and made the dramatic statement:

“We can’t go on like this”
Mary

She then asked a couple of her favourite developers to write some Unit Tests for my project, completely undermining me.

The thing is, both Dean [Developer (Temporary)] and Dennis [Developer (Temporary)] spent way longer than they estimated, and they didn’t do as much as they hyped, then when it came to make the last few changes, it slowed us down.

We ended up around 22% in the end, and the managers decided that is fine.

That’s the problem with us though… Do 80% coverage because it’s important. But actually it’s not that important, so you don’t need 80%. But TRY get 80%, Why?, Dunno, but the Document says.
Tim [Tester]

On track

Dennis [Developer (Temporary)] was also asked to helping out address the Code Review comments. In some ways, this kinda slowed us down. I told him I had a branch with some changes in already and sent him a link to it so we can work together. When I caught up with him the next day, he said that he had been working on a few of the ones I already had done because he hadn’t looked at the link. What a waste of time.

When Mary asked for a progress report, Dennis reckoned it would take 1 day to go through 20 comments, but he had done 8 easy ones the day before, and we had the hard ones left. So I said it would be more like 4 days, but could take longer if they are surprisingly complicated. I was correct.

Manager: How's the project coming along?
Developer: It's on track.

The track: pic.twitter.com/kTWIE4CHZE
— Molly Struve 🦄 (@molly_struve) August 17, 2021

Final demo

On the final Project Demo, Carl [Customer Representative] was saying the sending process was far too slow. He had been on most of the demos from the start and saw the progress across the project.

The original version I showed him was incredibly slow, but I had managed to speed it up significantly. So despite him witnessing the project months ago, he said the performance was a concern and maybe users would think it wasn’t a significant improvement.

We had all kinds of people turn up to this final demo. People from support, training etc. We should have had those guys on the early meetings. They were prodding holes in the requirements and asking important questions. Although we gave good answers for most of them, I couldn’t help but think our changes might not be as useful as we thought.

If only we got more users involved throughout the project, rather than just some UX mock-ups before we started, and then a year later – give them the update and hope for the best.

I’d like to reiterate just how hard the team has worked. They have worked their little socks off
Olivia [Product Owner]

Conclusion

We were told the importance of the project, but because there wasn’t a direct commercial aspect to the project, I felt it wasn’t backed up by the number of developers assigned to the project. With only 2 developers, then key staff like Architects and Product Owners switching throughout the project; it just slowed us all down and made us all feel it was actually a low-priority project.

There were other morale-reducing aspects like when we were told the project was on hold, then Mary berating my decisions, and implying the failures were down to me.

There wasn’t a great understanding of the feature in many ways, illustrated by

how many requirements we discovered throughout the project,
the UX team being clueless about many aspects,
one Product Owner so clueless – it seemed he struggled to use a computer,
then switching to a clueless Architect that just went straight to the original architect.

Recruiting Graduates #6: The Disappointment

Last year, I wrote some blogs explaining how the hiring process for Software Developers is bad.

General Summary Of Why The Software Developer Hiring Process Is Bad

When I was involved in trying to come up with something better, I was frustrated because I didn’t have the answers, but I felt everyone else involved didn’t understand my concerns.

I hate the way Tech Interviews and tests currently are, but I think it’s hard to come up with something as a replacement. When researching example questions, I see lots of questions along the lines of “who invented Object Oriented Programming?“, and I think “I don’t care“.

Then there’s loads of aspects that you would just Google if you didn’t know it e.g. memory usage of each type.

Or you may get questions like “what is wrong with this code?“, but if you actually pasted that code in Visual Studio and try and build it, you would see the error. Or in the case it is valid but not advisable; Code Analysis would flag it. Some of these can be trick questions like mathematical logic where the order of execution is ambiguous, but Code Analysis would definitely flag to tell you to add brackets, so it’s silly having these as a question.

Then I hate all the generic Fizzbuzz stuff (check if a number is divisible by 3 or 5, or even both). These are just so cliché that people could just learn from memory anyway.

Also, what do these graduates we are targeting actually know? back in my day, I knew Java but I was terrible and didn’t know how to test properly, or even debug. So there’s no point testing them on that either.

Stupid Questions From The Internet

During our meetings, I asked the team how they were coming up with the interview questions, and they said they were just Googling for C# Interview questions. This is problematic because applicants may have read and learnt from these very sites, but also many sites had terrible questions, which were sometimes incorrect.

I came across this website:
https://www.includehelp.com/mcq/is-the-use-of-return-statement-necessary-in-every-function-in-csharp.aspx

 95. Is the use of return statement necessary in every function in C#?

 Yes No

 Answer: A) Yes

 Explanation: Yes, the use of return statement is necessary in every function in C#.

I think some programming languages may differentiate between “methods” and “functions” where “functions” always return something. But in C# you just call them “methods” which can either return something, or be defined as “void”. So this is just incorrect.

 99. Which is the first line of a C# program?

 using System;

 using system;

 using Namespace;

 namespace MyApplication

  

 Answer: A) using System;

If you create a blank class, “using System;” is the first line, but A) you could remove it B) the order of using statements matter C) in modern C# you could move it to a “global namespace”. I also asked the question of “who are we targeting?” because everyone was coming up with C# specific questions. My colleagues told me the universities we were targeting were primarily coding using Python! So we were writing the wrong test, or targetting the wrong people.

The Performance Review

In my recent performance review, my manager Colin stated:

“You were given an opportunity to lead on defining our interview process, but have not shown much interest in it. Ultimately someone else led it, creating the interview pack, redefining interview questions and leading live interviews, which I think is a missed opportunity to shine.”
Colin

I argued that: although I didn’t have a huge impact on the end result, it would have been a bigger mess had I not pushed back on some of the questions. Although it seems some of the interviewers still asked questions I said weren’t suitable.

We ended up coming up with a terrible process that some candidates dropped out of, and the ones that went through with the interview struggled and we were left with the decision to gamble based on what little we saw.

The Best Hire

One of the best hires we made was someone I didn’t interview, but I looked at the interviewer’s notes, and he didn’t exactly look like a strong hire.

Implemented the basic application to consume webapi and just printed the values based on some predefined values
No unit testcases were written
application is simple
Inheritance – simple explanation
polymorphism – no idea
interface – not good explanation
purpose of inheritance – No idea
Print the numbers and its occurrences – Not completed, but after showing the hints he completed
Print the number from 1 to 100 using Recursion – Completed
print the numbers based on divisible values – Completed
Logical, cut the cakes into 8 pieces with 3 cuts – Completed
Candidate lacks on syntactically at some areas, but has some logical solving skills and good attitude on approaching things

When I stated to Colin that it “wasn’t a great batch of hiring”, Colin said out of the people we did interview, we ended up hiring around a third of them – which he said was above the normal hiring rate. I said that it was only above the hiring rate because we hired people that should have failed. Then later, he quoted the numbers “15 interviewed. 4 hired” (more like a quarter then). But there were loads of applicants rejected before the interview stage, and others dropped out along the way.

Intermission: Bantz from Twitter

Why does no one like our 12 steps recruitment process? pic.twitter.com/ys4OLjqFry
— Vincent Déniel (@vincentdnl) February 22, 2023

Make me do 2 hours of live coding interviews for a job and then still reject me? Yeah I’m starting Monday. See you online.
— I Am Devloper (@iamdevloper) May 21, 2022

The Polymorphism question

The Polymorphism question that was asked on some of the interviews was basically “do you know the types of polymorphism?”

I didn’t even understand the question, and I probably have 10 years of experience.

I was only thinking about the usual polymorphism where you can refer to derived types as the base type:

 IAnimal animal = new Dog();
 animal= new Cat();

To check if I wasn’t going mad, I asked some of my colleagues. Here is how they responded:

“not sure i do, no. Inheritance and Composition, are they right?”
Dean (Senior Developer around 10 years experience)

Like covariance and contravariance? But I don’t think I could put it into words. I guess you have real polymorphism and fake…virtual functions have a virtual function table that map overridden functions from base classes. But a lot of polymorphism is just the compiler being intuitive and going “you’re not crazy” it can be treated as that type.
Adam (Software Architect)

Types? an interface can be implemented many times. A method can have the same sig but different parameter types etc
Rob (Senior Developer around 20 years experience)

According to https://www.bmc.com/blogs/polymorphism-programming/, there’s Runtime, Compile Time, Overloading, and Casting. So if very experienced developers can’t explain the concept, then why are we using the question to target Graduate developers?

Conclusion

It annoyed me that Colin stated “have not shown much interest in it”. I was trying to make sure we had well-defined objectives on the exact type of developer we were hiring. Everyone else just started copying questions for the internet without considering the knowledge and experience of the people we were targeting. I didn’t have the solution of how to make a good software development hiring process, but I really wanted to make sure we got close – but I just didn’t have any support from everyone else that was involved.

Intro

The main Problem

Collage of abuse:

Theoretical scenarios

Piracy

Target Price

Patch Quest

Unity Response

Conclusion

The India Office

Expo Day:

Cultural Experiences:

PC shutdown & Usage Monitoring

Upgrading Software

Saving Money

Managing the Public Relations

All Hands

New Rules – My View

The Developer Feedback

Architect

The Next Release

Feedback

Refinement to the process

Conclusion

Introduction

Jargon Sales Pitch

Jargon From The Docs

Official Training

No Going Back

Web-based Demo

Everyone should use Datadog

Performance Problems

Technical Director

Scheduler

Performance Problems Part 2

The possibly crazy workaroundJohn

Need More Tools?

My View:

Current Software

Unsafe/Unofficial Automation

UX User Feedback

Proposed Change

Product Owner: Owen

The Architects

The July Cancellation

Coding Tales

Put it in a new tab but make it behave like a dialog

Scrollbar

Conflict With The UX Team

Don’t Have Time

Conflict With The Architect

The Technical Manager change

The Backstab

80% coverage

On track

Final demo

Conclusion

General Summary Of Why The Software Developer Hiring Process Is Bad

Stupid Questions From The Internet

The Performance Review

The Best Hire

Intermission: Bantz from Twitter

The Polymorphism question

Conclusion

The possibly crazy workaround
John