When adding new features to software, you can add a Feature Flag. If it is set to true, the new feature is used; if false, it isn't. This allows a quick roll-back by tweaking the flag value rather than releasing a new software update. However, it makes the code more complicated due to the branching paths.
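As a minimal sketch of the idea (hypothetical names, not our actual code; in practice the flag value would come from configuration so it can be changed without a release):

using System.Collections.Generic;

public static class FeatureFlags
{
    // In reality this would be read from a config file or database rather than hard-coded.
    private static readonly Dictionary<string, bool> Flags = new Dictionary<string, bool>
    {
        ["UseNewPricingEngine"] = true
    };

    public static bool IsEnabled(string name) => Flags.TryGetValue(name, out var enabled) && enabled;
}

public static class Pricing
{
    public static decimal CalculateTotal(decimal subtotal)
    {
        // Flipping the flag back to false is the "quick roll-back": the old path runs again
        // without a new software release.
        return FeatureFlags.IsEnabled("UseNewPricingEngine")
            ? subtotal * 1.20m    // new behaviour
            : subtotal * 1.175m;  // old behaviour
    }
}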
Once all users are on the new feature, when do you remove the old code? Obviously it should be removed once everyone has switched over and is happy with the new functionality, but the work needs to be planned in, and what is the urgency? Project Managers want new projects that add value, not the deletion of redundant code.
One of our most experienced developers posted a rant about feature flags. He pointed out there was no guidance on when to use feature flags. Do all new features get feature flags? What if it depends on a feature that already has a feature flag? Do Software Testers test each combination to make sure all code paths are supported? Is it clear which configurations are deployed on live since this should have priority when it comes to testing? By default, our Test Environments should match the config of a typical Live Environment. However, we often find that the default is some configuration that is invalid/not used.
It’s not always possible to “roll back” by switching the feature flag off. This is because to implement the change, you may have needed to refactor the code, or add new database columns. Changing the feature flag back to “off/false” just stops some new code being called, but not all of the new changes (the refactored parts). So if the bug lies in the changes that apply even with the flag off, it is still a problem.
It was also discussed that some people used our Configuration Tool for actual configuration while others were using it for Feature Flags, and maybe we should have separate tools for Configuration and Features.
Feature flags cause maintenance problems. Each flag needs to be tested both on and off when implemented, and if you later want to remove it, that removal needs testing too. If you leave it in, there will always be questions over whether code in that area is used and needs testing. How do you prioritise removing the code? Does it belong with the team that initially created the feature? What if that team has moved on, or been split up?
Another developer brought up an example of how a bug existed in two places but the developer that fixed the issue was only aware of one path, and didn’t know about the other which required a feature flag to enable.
He also questioned whether it is more of a problem with our process. Other companies may have quicker releases and more flexibility to roll back, using ideas like Canary Deployment. Our process is slow and relies on “fix-forward” rather than rollback.
Things to consider:
What actually gets feature flagged?
When the conditional code is removed from the codebase
The effect of the “Cartesian Explosion” of flag combinations on unit tests and test environments
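On that last point, even a couple of flags multiplies the paths a test suite has to cover: n flags means 2^n combinations. A rough sketch of what that looks like with xUnit (hypothetical flags and logic):

using Xunit;

public class PricingTests
{
    // Two flags already mean four combinations to cover; every extra flag doubles it.
    [Theory]
    [InlineData(false, false)]
    [InlineData(false, true)]
    [InlineData(true, false)]
    [InlineData(true, true)]
    public void Total_is_positive_for_every_flag_combination(bool newPricing, bool newDiscounts)
    {
        var total = CalculateTotal(100m, newPricing, newDiscounts);
        Assert.True(total > 0);
    }

    // Stand-in for production code that branches on both flags.
    private static decimal CalculateTotal(decimal subtotal, bool newPricing, bool newDiscounts)
    {
        var total = newPricing ? subtotal * 1.20m : subtotal * 1.175m;
        return newDiscounts ? total * 0.95m : total;
    }
}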
SonarCloud is a static analysis tool. It runs over your codebase and points out where code deviates from “best practices”, flags possible bugs, and reports code coverage.
When my employer bought a SonarCloud licence, there was a big emphasis on fixing existing errors and allowing no errors in new code. I knew this emphasis would be a bad idea because of the classic “you get what you measure”. If you tell people the goal is to have no errors, they will do whatever they can to have no errors. The point of a tool like SonarCloud is to improve code quality, but since the metric is a number, the number becomes the goal rather than the quality itself. The real aim should be to improve the code, optimised for readability, maintainability, scalability, and code coverage.
As more people began working with SonarCloud, we saw more and more changes with titles along the lines of “Fix Sonarcloud issues“.
Human Analysis
SonarCloud is just a static analysis tool; it can only spot problems and suggest pre-defined improvements. These improvements are not always in the best interest of the codebase. You have to objectively ask yourself “Does the change I’m making make the code better or worse?” Before you can do that, you need to understand why Sonar thinks there is a problem with the code. Then you can decide whether Sonar is right or not. Some of the rules are not applicable everywhere.
I have seen a number of changes where the code is actually made worse by the changes to satisfy Sonar.
Example 1: 7 Argument Limit
A constructor or method that contains loads of parameters/arguments can be a sign of bad design, and maybe some of the parameters can be grouped together inside an object. Once you reach 8 arguments, Sonar will flag it. A simple fix is just to create a class and throw a couple of the parameters in there. That satisfies the rule, but it doesn’t make logical sense unless the parameters are in fact related. Adding a new class that is only used once can just make the codebase more cluttered and seemingly more complicated. I would rather see a constructor with 8 values than some complicated class with extra types in the way “Just because of Sonar”. Mark Seemann has a good blog post about Curbing code rot with thresholds.
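A quick sketch of the difference (hypothetical names, not our actual code):

using System;

public record Address(string Street, string City, string Postcode);

public class LetterService
{
    // Before: eight loosely related arguments trip Sonar's parameter limit.
    public void SendLetter(string firstName, string lastName, string street, string city,
                           string postcode, string templateId, bool urgent, DateTime sendAfter)
    {
        // ...
    }

    // After: grouping the values that genuinely belong together (an address) is a real improvement.
    // Inventing a grab-bag type purely to get under the limit would just add clutter.
    public void SendLetter(string firstName, string lastName, Address address,
                           string templateId, bool urgent, DateTime sendAfter)
    {
        // ...
    }
}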
Example 2: TryParse
Another example, I once wrote some code with the following:
var ok = int.TryParse(a, out var x) & int.TryParse(b, out var y);
//something with x and y
Sonar complained that the use of the bitwise & was confusing and suggested I use && instead. However, if I did that, then “y” wouldn’t always be definitely assigned because of the short-circuiting, and the code wouldn’t compile. I was about to reject the Sonar issue as “suggests code that doesn’t compile” and just keep my version.
Then I thought, “if Sonar cannot understand my code to make a decent suggestion, maybe other developers can’t either“. I was trying to be too clever with my code.
Instead I changed the code to:
var ok1 = int.TryParse(a, out var x);
var ok2 = int.TryParse(b, out var y);
var ok = ok1 && ok2;
//something with x and y
It wasn’t as terse as my original version, but it was certainly easier to read and understand, and Sonar didn’t have a problem with it any more either.
Example 3: Cyclomatic Complexity
When a method has loads of if statements, this creates loads of permutations that can be executed, which means that if you are aiming for true 100% test coverage, you have loads of tests to write. It can easily make the code hard to read and understand too. At a certain point, Sonar suggests breaking the method into smaller methods. I have seen people take this extremely literally, and you end up with a design like the following.
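A hypothetical sketch of the shape (my names, not the real code):

public record Item(bool IsNew, bool HasOwner, bool IsArchived, bool IsLocked, bool NeedsReview);

public class ItemProcessor
{
    public void Process(Item item)
    {
        ProcessPart1(item);
        ProcessPart2(item);
    }

    // The split lands wherever the complexity score dropped below the threshold,
    // not at any logical boundary.
    private void ProcessPart1(Item item)
    {
        if (item.IsNew) { /* ... */ }
        if (item.HasOwner) { /* ... */ }
        if (item.IsArchived) { /* ... */ }
    }

    private void ProcessPart2(Item item)
    {
        if (item.IsLocked) { /* ... */ }
        if (item.NeedsReview) { /* ... */ }
        // ...and whatever else didn't fit in Part1.
    }
}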
There is no logical grouping for what goes in Part 1 and Part 2; enough code gets placed in the first method to get under the threshold, then everything else goes in the second. Now the original single method is half the size for no logical reason other than to satisfy the Sonar rule.
Me (rhetorically): Are these fields logically grouped or is this just to satisfy sonar?
Brijesh: It is just to satisfy sonar
Example 4: Making The Code Worse
Nullable DateTime
A DateTime always has to have a value. However, you can declare a nullable DateTime in C# by appending a question mark, like DateTime?
The existing code was checking a standard DateTime for null, which can never happen.
if (startDate == null)
throw new ArgumentNullException("startDate");
The Code Analysis report was correctly flagging this code as completely unnecessary. Instead of removing the code, the developer then changed it to
if ((DateTime?)startDate == null)
throw new ArgumentNullException("startDate");
The method was still accepting startDate as a non-nullable DateTime, so it could never be null. But it was then being cast to a nullable DateTime, so the check against null is technically valid.
Me: Why are you casting to nullable datetime?
Chandeshwar: dateTime is value type , it can never be null value. if check dateTime == null, it's always return false .
Me: Yes, that is correct Chandeshwar. That’s why you can delete the code completely. Your code is always false too.
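For completeness, the two sensible resolutions would be something like this (my sketch, not the change that actually went in):

using System;

public class ScheduleService
{
    // Option 1: the parameter is a non-nullable value type, so simply delete the check.
    public void Book(DateTime startDate)
    {
        // ...no null check needed; startDate always has a value.
    }

    // Option 2: if "no date supplied" is genuinely meaningful, make the parameter nullable,
    // and then the null check is legitimate.
    public void Book(DateTime? startDate)
    {
        if (startDate == null)
            throw new ArgumentNullException(nameof(startDate));
        // ...
    }
}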
Sometimes this type of scenario leads to multiple attempts, either because:
they make some changes and still get the same Sonar error,
they introduce a new problem,
or the reviewer dislikes the changes and wants them changed back.
So there’s plenty of files where our Change History looks like this
Fix Code Smell Issue <- doing what they think Sonar wanted them to do
Additional Static Analysis Fixes <- they messed up, so tried again
Addressing code review comments <- I pointed out it still wasn’t correct
Example: Magic Numbers
Magic numbers should not be used
A magic number is a number that comes out of nowhere, and is directly used in a statement. Magic numbers are often used, for instance to limit the number of iterations of a loop, to test the value of a property, etc.
Using magic numbers may seem obvious and straightforward when you’re writing a piece of code, but they are much less obvious and straightforward at debugging time
That is why magic numbers must be demystified by first being assigned to clearly named variables before being used.
-1, 0 and 1 are not considered magic numbers.
Vignesh changed code like
dosagesCount == 1
to
dosagesCount == Constants.Single
Me: Constants are for a different purpose! They are not for replacing all the numbers in the codebase. It's a good idea to give numbers clear names, and if the same value is used in multiple places, then when you need to change the number, it updates everywhere.
Vignesh: This is the rule we follow (quotes the magic number description shown above)... and I got comments from my senior level.
WHAT DOES IT SAY, VIGNESH? “-1, 0 and 1 are not considered magic numbers”
And you have replaced 1 with the word “SINGLE”
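For contrast, here is the sort of constant that does earn its keep, next to the rename that doesn't (a sketch with made-up names and values):

public static class DosageRules
{
    // A named constant adds meaning, and if the value is shared across validation,
    // printing and the API, changing it here updates every caller.
    public const int MaxRepeatIssues = 6;

    public static bool CanIssueRepeat(int issueCount) => issueCount < MaxRepeatIssues;

    // Renaming 1 to "Single" adds nothing, and Sonar's own rule excludes -1, 0 and 1 anyway.
    public static bool IsSingleDosage(int dosagesCount) => dosagesCount == 1;
}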
Example: Lack of pragmatism
It frustrates me so much that so many developers don’t even agree with Sonar, or don’t understand the change, but still attempt to make it anyway. Their sole goal becomes removing the error, to the point that they lose sight of the true aim: writing great code that works.
Dave: OK, but why did it suggest it, and why does it make the code better? Did you understand the suggestion or just blindly do what it said?
Pavel: I think I understood it. The comment was: "Add the default parameter value defined in the overridden method". Default arguments are determined by the static type of the object. If a default argument is different for a parameter in an overriding method, the value used in the call will be different when calls are made via the base or derived object, which may be contrary to developer expectations. That's why it suggested it. But in my opinion this change is redundant and doesn't make the code better.
There was some code which had loads of if statements. It was checking the types of “nodes” in a tree structure, and so attempted to cast the type. If successful, it would go into that code block, else it would attempt to cast to a different type.
Even though the developer didn’t need to change this code, he changed it to try to resolve Sonar issues around the casting. However, he only updated some lines and not others, which left it inconsistent: the codebase is now more confusing and still contains Sonar errors. He also pulled some lines out of the if statements and performed all the casting at the top of the method. That actually made it less efficient, because once you have found a match, you do not need to keep attempting the other casts.
{
var creterionTag = row.Tag as LinkedCriterion;
var relationshipTag = row.Tag as Relationship;
var attributeTag = row.Tag as Attribute;
var resultSetRuleTag = row.Tag as ResultSetRule;
var conceptOrderingTag = row.Tag as ConceptOrdering;
Me: why change to "as casts" for most, but then not for Table, Concept and SharingDisplayValue? I'm not even sure if there is an advantage to doing it this way. We now have several variables where only 1 will be set, and the rest are null. Might be better spending more time refactoring it out to get rid of the branching. Pattern matching is probably the neatest way for now. https://docs.microsoft.com/en-us/dotnet/csharp/pattern-matching
Kalya: I just simply resolved some existing SonarQube issue which not raised because of our changes. It is a kind of help to resolving existing issues, It is very difficult to resolving all the issues as of now
So he tried to help, found it too difficult, gave up, but still decided to submit the changes for review, despite making the code worse.
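For reference, the pattern matching I had in mind would look something along these lines (a sketch that assumes handler methods like HandleCriterion exist; it is not the code that was submitted):

switch (row.Tag)
{
    case LinkedCriterion criterion:
        HandleCriterion(criterion);      // only the matching branch performs a cast
        break;
    case Relationship relationship:
        HandleRelationship(relationship);
        break;
    case Attribute attribute:
        HandleAttribute(attribute);
        break;
    case ResultSetRule rule:
        HandleResultSetRule(rule);
        break;
    case ConceptOrdering ordering:
        HandleConceptOrdering(ordering);
        break;
}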
Example: Just Suppress it!
Of course, you can still get rid of the error without actually fixing anything, and merely hide it from the managers who only look at the figures:
#pragma warning disable S1075 // URIs should not be hardcoded
public const string InfoUrl = " http://med.info";
#pragma warning restore S1075 // URIs should not be hardcoded
Conclusion
The goal isn’t to make Sonar happy; the goal is to write good, clean code. Sonar is a guide to help you do that, but it doesn’t guarantee success.
For the last several years, we have used Slack as our primary way of communicating in the Development department. However, company-wide we have Microsoft Office 365 licences, so other departments use Teams. I always thought it was a dumb decision to use Slack, since we were essentially paying twice for a communication tool. Slack isn’t that expensive on the lower tiers, but it adds up when you have a large number of staff. Plus, due to stricter security policies, we wanted to use single sign-on, so we had to upgrade to the Business+ licence, which didn’t seem to be worth the cost.
As time goes on, we keep “improving security”, which I often think is just an excuse to get rid of certain software. How do you really determine which software or companies are secure anyway? They could tell you they use certain security practices or hold some accreditation, but whether your data ends up exposed in a breach is another story.
“not sure what you can hack via Slack. Just over reacting like everything these days. 2FA all the things!”
me
On Slack’s Enterprise licence, they boast even more security features, and with our new strict security policies, management decided that we would either have to pay significantly more to keep using Slack, or just get rid of it. They decided to get rid of it.
To be fair, Teams has improved a bit over the years, and although I prefer the way Slack looks, and its excellent emoji support (you can add custom emojis!); I can’t justify the cost.
why is slack not secure as opposed to teams? probably just nonsense. Where does the data go when it is lost? surely doesn’t leak out onto the dark web!
Rob
We somehow had over 900 members according to Slack Analytics but I reckon that was every historic user since we started using it. Scrolling down the list and roughly estimating, we seemed to have around 300 which could reasonably be called “active”. Then looking at the Business+ costing, it should cost $45,000 per year. Enterprise is one of those tiers where it says “contact sales for a quote”. One manager reckoned it would cost $250k a year to use which doesn’t sound right. How can you justify such an expense for a chat application? Even if it did cost that much on paper, surely you can haggle that down significantly. I’m sure Slack won’t want to lose us. Surely charging $60k is good profit for them.
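(That figure works out as 300 users × $12.50 per user per month × 12 months = $45,000 a year.)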
I often think the way companies charge for software licences doesn’t make sense. They usually just charge “per user per month”, but there will be times when people don’t actively use the licence because of the work they are doing, or because they have annual leave to take. Then there are people who join temporarily, and people naturally join and leave the business over time. So who really tracks the accurate amount you need to pay? Companies just end up overpaying for licences they don’t need. Slack seems to suggest they charge just for active users. But what happens if you only send a few messages on one day in the month; is that an active user for the month? I often think the best approach would be to charge for a certain number of users, but then give out an extra 25% of keys for light usage.
One thing I found interesting when looking at Slack Analytics is that most people seemed to be sending as few as 20 messages per day. I think they are either super focussed and just work independently, or they are chilling out. It’s hard to believe that you can work well in a team, or even have a good relationship with them, if you only send 20 messages a day. Some people use instant messaging by sending a sentence per message, which inflates the message count and makes the numbers even more surprising. For example, they could send 4 messages for this interaction:
Hi
Are you free?
I was wondering if you can help me work out this error
I have just got the latest code but am unable to log in
The decision to remove Slack was disappointing for some, but the bizarre thing is that our manager told us on the Wednesday, it was formally announced on the Thursday, and it was gone by Friday 4pm. If you were on annual leave that week, you would have been confused when you could no longer access Slack the following Monday. There was some great information on there, and it was great to search for common errors and find solutions to them. We didn’t have enough warning to try to extract that information.
“Has the cost of the loss of productivity and collaboration been factored into the decision to remove slack?”
Sad developer
One developer had a crazy idea of developing our own solution:
“We are a software development company. If we’re that desperate, can’t we write our own messaging system, protected to the security standard we want?”
Ambitious developer
The thing is, we already made a chat application for our users. I never understood why users would want a native chat app when they could use something more widespread. But since we already have a chat app, it could actually make sense to add more features to it and then use it internally.
Making your own tools isn’t as cheap as you would think. If a developer’s wage is £35k, then paying just one developer to develop and maintain it costs £35k a year. You may as well just pay for Slack. But if we are using it ourselves and selling it to our users, then it does make more sense.
The weird thing is, for our upcoming software, we originally used Okta for the login functionality, but it was decided it was too expensive, so a few developers got together and made their own solution. That seems bonkers to me because authentication is all about security, so surely you should leave it to a company that specialises in security. But the fact that we do build custom authentication makes the idea of making a chat app even more realistic.
However one of the architects working on this upcoming software ironically replied:
“We need to move away from homegrown solutions, especially based on the presentation happening now from our Head of Security”
Hypocritical software architect
Another architect supported this claim:
“This is about minimising home grown solutions when an off-the-shelf solution would do just as well”
In my blog, The Outage, I described a Major Incident and a knee-jerk response from the CTO.
He described this situation as a
“major incident that impacted the whole estate, attributed directly to a failed Change. We recognise that the change was not intended to have the adverse impact that it did, but sadly the consequences have been a major blow to Users and us. Therefore, we are seeking to create immediate stability across our estate, and are implementing several amendments to the way Technology Changes are approved and implemented”
CTO
He came up with 5 changes, presumably with no consultation with others. I gave my view on them in that blog. After a few months of carnage, the CTO has put out some revisions to the process.
CTO = Chief Technology Officer
SLT = Senior Leadership Team.
ELT = Executive leadership team
BAU = Business as usual
For each of the changes below, I have listed the original suggestion from the CTO, my view at the time, and the CTO’s update.
CTO’s suggestion: “There will be a comprehensive change freeze for the month of June, with only changes meeting enhanced criteria being passed for implementation.”
My view at the time: The size of the release wasn’t the problem, so cutting it down won’t solve anything. It might annoy the users even more if we then delay features that we announced.
CTO’s update: “as a knock-on effect, we have also reduced our delivery capacity and timescales.”
CTO’s suggestion: “Pre-approved changes are suspended”
My view at the time: The idea of a “pre-approved” change is that it is something that is often run on the live servers to fix common issues and is low risk, hence it is pre-approved (eg the ability to restart a crashed server/service). This is just going to annoy staff members in Deployment. The CTO also remarks: “Preapproved changes are wonderful. They have been reviewed and tested to death. My goal is to increase the number of preapproved changes in the future. It’s just with the existing ones, we don’t know if they have been reviewed or not”. You don’t know if they have been “reviewed”, but they have been run hundreds of times and never caused an issue. So you are temporarily banning them on the grounds that they could cause an issue?
CTO’s update: “The door for pre-approved Standard Change has been re-opened. Standard Change templates can be proposed and put forward as before. As part of our continued governance and enhanced view of change taking place, we do ask for the following: Each Standard Change template requires approval from one SLT or ELT member. A full review of both the implementation and rollback steps needs to have been undertaken.”
CTO’s suggestion: “Any changes submitted for approval will require TWO members of SLT.”
My view at the time: How many times has there been some kind of approval process and the people with authorisation are too busy or on annual leave? Why are we going from 0 approvers to 2? Would the managers understand a change to enable a feature for users belonging to company A, B and C? Would they go “hang on, C don’t have the main feature! I’m rejecting this”? It’s going to be a box-ticking exercise. We already have a problem when changes are Code Reviewed by Developers – there aren’t enough “expert” people who can review them in the required level of detail. So how would a manager understand the change and its technical impact? It will be more like “does this make us money? Yes we like money”; approved.
CTO’s update: “A significant challenge impacting time to deliver has been the ‘two eyes on’ stipulation. We recognise that not every type of Change requires two sets of eyes and so are refining this slightly. Standard Changes will need to follow the above process. Where ‘two eyes on’ is not deemed necessary, two SLT approvers will need including in the template submission verifying that this is not required. Normal Changes will follow the BAU process. Where ‘two eyes on’ is not deemed necessary, two SLT approvers will need including in the submission verifying that this is not required.”
CTO’s suggestion: “Implementation activity must be witnessed by two or more staff members. Screen sharing technology should be used to witness the change. No additional activities are carried out that are not explicitly in the documentation.”
My view at the time: This might actually help, although it might be patronising for Deployment. The CTO made a comment on the call about having “Competent” people involved in the deployment process. So if a Developer has to watch a member of Deployment click a few buttons, it feels like babysitting and not respecting them as employees.
CTO’s update: no specific comment was made
CTO’s suggestion: “All changes must have a comprehensive rollback plan, with proof of testing. The rollback plan must be executable within 50% of the approved change window.”
My view at the time: The rollback idea sounds logical and great in theory, but it is the biggest concern for the technical people in Development.
On a code review, a Senior Developer, Lee, questioned why there were no database changes when the developer, Neil, had removed all the related C# server code. Neil replied that he “wasn’t sure how the patching process worked” (despite being here for years, and being in a team with experienced developers), and wasn’t sure if there were any backwards-compatibility issues to consider.
So what was his plan? Just hope it got past the code review stage unchallenged? Then we would have obsolete stored procedures and unused data lingering in the database for years.
I initially thought his backwards-compatibility claim was nonsensical, but from an architectural standpoint it makes some sense given how our system works. One version’s server code doesn’t call the other version’s server; it goes direct to the database. That means if the old version runs against the new database, it expects the old stored procedures and data to still exist. However, for this particular feature there were no cross-database calls at all.
I suppose being cautious and not deleting the data makes sense from a rollback point of view. It’s hard to restore the data if it is lost, but easy to restore the C# code. I have never seen us use this approach though.
The Senior Developer said:
This is very concerning to hear, can you please work with your team lead to understand how our versions are deployed, and if they are unable to answer all the questions, please reach out to someone. We do not support any version changes by default, though there are cases where we do have cross version server/database calls, but these are for specific cross organisation activities. You can safely remove these columns, update these stored procedures. There is no value in leaving something half in the system, if it is no longer needed, remove all references, database rows/columns/tables, class Properties, etc.
In my previous blog, I discussed Project vs Domain Teams. This is kinda linked, in the sense that specialising in a certain area of the system means you gain knowledge of the functionality and architecture of that area. There would be less chance of a scenario like this, where the developer is questioning whether there could be backwards-compatibility issues. However, he could also have found this information out by asking questions.
This example does cover many topics I have discussed on this blog:
Poor communication
Bad decisions
Funny quote from a senior developer ”This is very concerning to hear”
In the world of Software Development, there are often differing views on how to arrange teams. Regardless of the approach, people will join and leave over time, so team members need to be replaced and teams need to adapt.
There was a time when we were arranged into teams that were assigned to a Project, then moved onto a completely different one once complete. Any bugs introduced by the projects then get assigned to a “Service Improvement” team who only deal with bugs (and possibly ad-hoc user requests).
Then, after a few years, and maybe under a new Development manager, they would restructure into Domain teams, where you take ownership of a group of features and only projects related to those are assigned to your team. Any bugs introduced by the projects stay with the team, which gives you a greater incentive to fix them as early as possible. People build up knowledge of their areas and become experts.
Then a few years later, we will switch back to Project teams.
There are pros and cons to each structure, and there are always edge cases which pose a management problem. Even in a Domain Team, there will be certain features that don’t neatly fit into the groups you defined, or ones that apply to many modules, eg Printing.
Sometimes we have called a team that handles the miscellaneous features “Cross-Cutting”. Managers would sell it on being for features like Printing that really are used by many areas of the system, but we all know it becomes a team that gets miscellaneous and unrelated projects. They end up being like the “Service Improvement” team that deals with random bugs, and work no one else wants to do.
Cross-Cutting
There was a meeting where managers were announcing the new Domain Teams and I got assigned to Cross-Cutting. One developer was voicing his concerns about having a Cross-Cutting team. He wanted to point out that Domain Teams are supposed to have specialist knowledge on the Domains but most people that were assigned to their teams had little-to-no knowledge. For some reason he chose my name to make a point.
“What does TimeInInts know about Cross-Cutting?”
Which received a room full of laughter. I’m sure some were laughing at his point, some laughed at his emphasis and delivery, and others probably saw it as an attack on my knowledge. I was probably one of the best people for it really, given my experience in the previous Service Improvement teams.
The whole idea of keeping Domain knowledge in the team only works if there is a true commitment to keep the teams stable over years. However, people will leave the business, some will want to move to a different project to broaden their skills, or people could just fall out with their team members.
Another concern this developer had was with his own team. He was assigned to the Domain team he was the expert on, but he was used to working with a couple of developers in the UK. This new team had two Indian developers. Management had recently acknowledged that the distributed teams weren’t really working, so these new Domain teams were supposed to be co-located. But this setup seemed to signal that he was there merely to train the Indian developers up, so the Domain could then essentially be offshored. Since he was the expert and proud of it, he still wanted to work in that area. But he can’t work on the same software forever.
In the Cross-Cutting team, we had an open slot labelled “new starter” so we were going to get a new hire in. You have to start somewhere, but again, this doesn’t help the teams specialise if they don’t already start with the knowledge.
Colleagues’ Opinions:
Developer 1:
Me 13:39: what does a new starter know about Cross-Cutting?
Mark 13:39: sounds more like Cost Cutting!
Developer 2:
It’s infinitely harder to build something if you don’t understand the thing you’re building. Hard to catch issues and make sense of designs if you had no opportunity to learn the domain.
Developer 3:
isn’t one of our major issues is we’ve lost domain expertise for core/bread and butter modules. For any “module”, there’s a combination of what the requirements are/how it should work, and what the code is actually doing. Without “domain teams”/ownership – we’ve lost a large part of the puzzle (how module should work).
Developer 4:
our teams are completely ineffective, expertise has been spread too thin. We probably need to reorganise the department again with who is remaining.
Build stronger teams first that only have one junior-ish person, then have weaker teams helping out where possible. It will be very hard for the weaker teams, but unless we do this, we’ll lose the stronger people.
The weaker teams should be given appropriate projects with longer timescales, and given as much help as possible while ultimately having to struggle their own way, stronger people who put in the effort will begin to emerge from those teams.
I’ve read about, or watched videos on, computer game balance and find it such an interesting topic: how you can measure and perceive the strength of each character/unit, and how you attempt to rebalance the game when something is off.
Second Wind have made a video on Teamfight Tactics.
I’ve never played this game, or even similar games, but it has the same general problems to solve in its design that many games do.
So, taking the transcript and running it through AI, I’ve turned it into this blog post.
Teamfight Tactics
Teamfight Tactics (TFT) by Riot Games is a strategic auto-battler, inspired by the League of Legends universe and drawing elements from Dota Auto Chess. In this competitive online game, players are pitted against seven adversaries, each vying to construct a dominant team that outlasts the rest.
In a game like League of Legends, a single overpowered champion can only be selected by one player and would be banned in competitions once discovered. In TFT, all Champions and items are available to everyone at once, creating many possibilities for players to find exploits.
Balancing the dynamics of Teamfight Tactics (TFT) is a compelling challenge. Compared to card games like Hearthstone, where adjustments are made through a limited set of variables, TFT presents a stark contrast with its myriad factors such as health, armour, and animation speed, to name a few.
Initially, it might seem that having numerous variables at your disposal would simplify the balancing process. In reality, even minor adjustments can significantly influence the game’s equilibrium. For instance, a mere 0.25-second reduction in a character’s animation speed can transform an underperforming champion into an overwhelmingly dominant force.
The sensitivity of each variable is due to the intricate interconnections within the game. A single element that is either too weak or too strong, regardless of potential counters, can trigger a cascade of effects that alter the entire gameplay experience.
Consider the analogy of a card game where an overpowered card exists. In such a scenario, there are usually counters or alternative strategies to mitigate its impact. However, if a card is deemed too weak, it’s simply excluded from a player’s deck without much consequence. Contrast this with a game like Teamfight Tactics, where the strength of a champion is intrinsically linked to its traits and the overall synergy within a team composition. If a champion is underpowered, it doesn’t just affect the viability of that single unit; it extends to the entire trait group, potentially diminishing the strength of related champions. This interconnectedness makes balancing challenging, but it is manageable through data analysis. Player perceptions of balance are also shaped by this data.
Vladimir the Placebo, and Vayne the Unappreciated
The character Vladimir in League of Legends had become notably powerful, overshadowing others in the game’s “meta”. To address this, developers proposed minor tweaks to balance his abilities. However, when the update was released, Vladimir’s dedicated players were outraged, believing their favourite character had been weakened to the point of being nonviable. But, in an unexpected turn of events, the nerf was never actually implemented due to an oversight. The players’ reactions were solely based on the anticipated changes they read about, not on any real modification to Vladimir’s capabilities. This psychological effect influenced Vladimir users to play more cautiously, while their opponents became more bold, illustrating how perception can shape reality.
Data only reflects the current state, not the potential. Particularly in a strategy game like Team Fight Tactics, which is complex and “unsolved”, players’ understanding and use of characters can be heavily swayed by their perceptions. Perception often becomes the player’s reality.
In the game’s fifth set, there emerged a low-cost champion named Vayne. Initially, after the set’s release, the consensus was that Vayne was underperforming, deemed the least desirable champion in her tier. The development team had reservations; they believed she wasn’t as ineffective as portrayed. Consequently, a minor enhancement was scheduled for Vayne. However, before the update could go live, feedback from players in China indicated they had discovered a potent strategy for her. This revelation transformed her status drastically within three days, elevating her from the least favoured to potentially one of the most overpowering champions ever introduced.
This scenario underscores the limitations of relying solely on data, whether from players or developers, as it may not reveal the full picture. Balancing in gaming is often perceived in black and white terms by the player base—they view a character as either strong or weak, which leads to calls for nerfs or buffs. However, they frequently overlook the subtle intricacies and minute adjustments that can have significant impacts on gameplay.
Different Players
In competitive games like League of Legends, different balance parameters are set for various levels of play. A character might dominate in lower ranks but may not be as effective in higher tiers of play.
When it comes to balancing games like Teamfight Tactics, the developers have taken the approach of balancing the game as if computers were playing it. The game is designed to test strategic thinking rather than reflexes and mechanical skill.
If Army A fights Army B, the outcome is predetermined. However, that does not mean an army should be nerfed simply because it performs well at a lower skill level. Instead, it presents a learning opportunity for players to improve their skills.
Interestingly, perceived imbalances can serve as educational tools. As players engage with the game, they gain knowledge through experimentation. For example, if a player tries a certain composition with specific items and it fails, they can reflect on whether it was a misstep or an unforeseen event. Learning that a champion doesn’t synergize well with a particular item is valuable knowledge to carry into future games.
There are build combinations that could potentially disrupt the game’s balance if the perfect mix is achieved. This aspect works well in single-player modes like Roguelikes, where the aim is to become overwhelmingly powerful. However, the challenge arises in maintaining this sense of excitement while ensuring these powerful builds don’t lead to exploitation in a multiplayer setting.
Risks & Rewards
Balancing isn’t merely about pitting one army against another to see the outcome. It’s also about the risks involved in reaching that point. For instance, if there’s a build that appears once in every 10,000 games, requiring a perfect alignment of circumstances, it’s only fair that such a build is more potent than one that’s easily attainable in every game. Therefore, in games like TFT, balancing involves weighing the relative power against the rarity of acquisition, ensuring that when a player encounters a significantly rare build, it feels justified because of the risks taken or the innovative strategies employed.
TFT thrives on the abundance of possible outcomes, with a multitude of combinations and variables at play. It’s crucial for these games to offer not just a handful of ‘high roll’ moments but a wide array, potentially hundreds, allowing for diverse gameplay experiences. TFT reaches its pinnacle when players are presented with numerous potential strategies and must adapt their approach based on the augments, items, and champions they encounter in a given game, crafting their path to victory with the resources at hand.
New Content Updates
The allure of both playing and developing this game lies in its inherent unpredictability. Each session is a unique experience, a stark contrast to many Roguelike games that, despite their initial promise of variety, tend to become predictable after extensive play. Teamfight Tactics, however, stands out with its vast array of possible combinations. Just when you think you’ve seen it all, a new set is introduced, refreshing the game entirely. This happens every four months, an impressive feat that adds a fresh roster of champions, traits, and augments.
The question arises: how is it possible to introduce such a significant amount of content regularly while maintaining balance and preventing the randomness from skewing too far towards being either underwhelming or overpowering? The answer lies in ‘Randomness Distribution Systems’. These systems are designed to control the frequency and type of experiences players encounter. As a game designer, the instinct might be to embrace randomness in its purest form, but the key is to harness it. By setting minimum and maximum thresholds for experiences, we ensure that all elements of randomness fall within these bounds, creating a balanced and engaging game environment.
In Mario Party, have you ever noticed that you never seem to roll the same number on the dice four times consecutively? This isn’t a coincidence; it’s actually by design. Nintendo has implemented a system of controlled randomness to prevent such repetition, as it could lead to a frustrating gaming experience.
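A minimal sketch of how that sort of controlled randomness could be implemented (my guess at the general idea, not Nintendo’s or Riot’s actual code):

using System;

public class ControlledDie
{
    private readonly Random _random = new Random();
    private int _lastRoll;
    private int _repeatCount;

    // Roll 1-6, but never allow the same number four times in a row.
    public int Roll()
    {
        int roll = _random.Next(1, 7);
        while (roll == _lastRoll && _repeatCount >= 3)
        {
            roll = _random.Next(1, 7); // re-roll until the streak is broken
        }

        _repeatCount = roll == _lastRoll ? _repeatCount + 1 : 1;
        _lastRoll = roll;
        return roll;
    }
}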
This concept is akin to a crafted ‘Ludo-narrative’, where game designers aim to shape player experiences through seemingly random events, but with a controlled distribution to keep the gameplay enjoyable and engaging. The goal is to allow players to encounter extreme situations, but these are skewed towards positive outcomes rather than negative ones.
This scenario might distort the essence of randomness, but surprisingly, players may not voice their dissatisfaction. Despite the statistical improbability, with millions of players engaging in a game daily, someone is bound to encounter this experience. Even odds as low as 1 in 10,000 can impact thousands of players at scale, highlighting the importance of considering player frustration as a crucial aspect of the gaming experience.
Perfectly Balanced
When discussing game balance, it’s not just about whether a feature is frustrating; it’s about recognising that frustration indicates a flaw in the design that needs to be addressed and learned from. Game balance is a complex, ever-evolving challenge that developers continuously tweak, hoping to align with player expectations. However, there will always be criticism, no matter the adjustments made.
The perception of balance is significant, and within any gaming community, you’ll find voices claiming that perfectly balanced video games don’t exist. Some players set such lofty standards for balance that they seem nearly impossible to meet. The key is establishing a solid foundation that dictates how the game should unfold, ensuring that the core gameplay aligns with the intended player experience.
In Teamfight Tactics, the ideal duration for a round is targeted at between 18 and 25 seconds, which is considered the standard for a well-paced battle. By setting benchmarks like this, developers can align the game’s balance with the envisioned state, which is key to achieving a finely-tuned game.
Conclusion
It’s essential to have a clear, balanced vision for the game and to persistently follow through with it. Balancing a game is a complex and dynamic challenge, not merely a matter of adjusting to data but also managing player perceptions and their experiences of frustration. Navigating this ever-changing landscape is no easy feat, especially when the development team must juggle multiple roles at a rapid pace. However, it’s precisely this complexity that adds to the excitement and enjoyment of Teamfight Tactics.
Recently we have had another tech guy join the company who reports to the CTO. I find that people in these kinds of roles want to put their stamp on things by coming up with a new idea.
He presented his idea in our monthly Tech Meeting. He wants to address our performance problems by taking traffic away from our main on-premise databases. There have been some similar ideas recently, and although I’m not great when it comes to hardware, networks and general software/hardware architecture, I am sceptical that these ideas can work.
His idea is that we can replicate the database in the cloud (“the cloud” solves all problems, you see), and then the cloud database can be used for read access, whereas writes would still go to the main on-premise databases (and then be synced up to the cloud).
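As I understand it, that is the classic read-replica split, something like this sketch (hypothetical connection strings and class names, not our actual code):

using Microsoft.Data.SqlClient;

public class ConnectionFactory
{
    private const string OnPremPrimary = "Server=onprem-sql;Database=Main;Integrated Security=true";
    private const string CloudReadReplica = "Server=cloud-replica;Database=Main;Integrated Security=true";

    // Writes always go to the on-premise primary; it remains the source of truth.
    public SqlConnection OpenForWrite() => Open(OnPremPrimary);

    // Reads can be served by the replica, but the data may lag behind the primary,
    // which is exactly the concern with fast-moving data like appointment slots.
    public SqlConnection OpenForRead() => Open(CloudReadReplica);

    private static SqlConnection Open(string connectionString)
    {
        var connection = new SqlConnection(connectionString);
        connection.Open();
        return connection;
    }
}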
The Announcement
This programme of work is to move workload away from our primary systems to enable these systems to withstand expected load factors from upcoming initiatives as well as expected growth in usage on our APIs during Winter 2023.
The intent is to run focused cross functional teams in work-streams across the group to deliver this initiative. The approach taken here is to place multiple bets, across multiple teams. The expectation is that not all teams will deliver by September, but enough to bring in the headroom we need.
The programme is intending to free up at least 20% load across our core databases.
Upcoming aims:
• Strategic, move read-only workloads to Aurora.
• Redeploy APIs to AWS, Move to cloud technology, Containerise and Optimise Service
• Enable use of replica data when ready.
• Move Appointment Workload
• Mitigate 8am peak load.
• Use caching engine on AWS (Elasticache/Redis), mitigate 8.2% of PC DB Load
• Reduce load on the DB during day time.
• Reduce Datacentre and DB load and improve performance
• Mitigate 6.2% of DB load by optimising how we summarise task counts
• Proof of concept is Complete, expected to cost £2m a year.
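The Redis item presumably means the standard cache-aside pattern, along these lines (a sketch using StackExchange.Redis, with hypothetical key names and expiry, not the actual design):

using System;
using System.Threading.Tasks;
using StackExchange.Redis;

public class TaskCountCache
{
    private readonly IDatabase _cache;
    private readonly Func<string, Task<int>> _loadFromDatabase; // the expensive DB query

    public TaskCountCache(IConnectionMultiplexer redis, Func<string, Task<int>> loadFromDatabase)
    {
        _cache = redis.GetDatabase();
        _loadFromDatabase = loadFromDatabase;
    }

    // Cache-aside: serve from Redis when possible, only hit the database on a miss.
    public async Task<int> GetTaskCountAsync(string organisationId)
    {
        var key = $"task-count:{organisationId}";
        var cached = await _cache.StringGetAsync(key);
        if (cached.HasValue)
            return (int)cached;

        var count = await _loadFromDatabase(organisationId);
        await _cache.StringSetAsync(key, count, TimeSpan.FromMinutes(5)); // accept data up to 5 minutes stale
        return count;
    }
}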
My Conversation With Architect Mark
I think the reason for the replication (as opposed to just moving it all to the Cloud) is that you can’t fully commit to ideas like this. You have to have a rollback plan. So if we find it doesn’t work, is too expensive etc., we can just return to the old way without much inconvenience. I asked one of our Software Architects what he thought of the plan because it doesn’t sound right to me:
Me: doesn't sending data out to another database just increase traffic, and they wanted to reduce it?
Mark: Yes, it will also be delayed, and often broken
Me: no pain, no gain
Mark: they're replicating data, and it's unlikely it'll be used
Me: I don't see how you migrate things. You have to keep them both running until you are confident it works, then bin off the old database. But then in reality you just end up keeping them both for longer than expected
Mark: you then also need cross-database transactions or to be very careful with queries
Mark: yeah, that's basically it. Have the same API at both ends, some sort of replicate and transform on the data to ensure it's in both. Persist to both simultaneously, then when all works, turn off the old
Me: The CTO said that “some people say there is a delay, but it is only 5 minutes”. Does that address any of your concerns at all?
Mark: no, this is only the second time I've heard about this, and the first I laughed
Mark: I agree with the principle of strangler pattern for migrating, but this isn't migrating it's keeping multiple DBs 'in-sync'
Me: does that mean you can view an appointment book which is 5 mins out of date, and you try book an appointment, then it checks the real database and is like "no mate you cannot do that"
The conversation between architects
Mark then sent me a conversation he had with two other architects, Andrew and Jon. Mark already had concerns with the “appointment book” example.
Mark so when this replication system goes down for a few hours, what happens then? I guess the system tries to book appointments for slots already booked, put in requests for items already issued etc.? seems our business layer needs to be aware of how outdated the original information was, so it can compare something like a changelog number. Sounds like a big challenge to implement correctly
Andrew 11:10 Yes, any write operations will need logic to ensure that cannot happen Mark. John and I have already called out that Appointments and Orders will have significant challenges with this replication model and have suggested that the initial focus should be on User Profiles, and any historic data, etc.
Mark 11:13 User Profiles and historic data seem just as dangerous to be honest.
Jon 11:15 The idea I suggested these is that you would check the change log on the primary system before even considering going to the replica. If the User had had a recent change (what counts as "recent" is TBC, I suggested 30 minutes) you wouldn't even consider going to the replica.
Mark 11:15 can we implement the strangler pattern properly? set up proper Appointments APIs to use in our datacentre, and AWS. duplicate the data. then dual file everything against the APIs? if one fails to file, the other gets rolled back. we ensure consistency, we can transform the data, and we're using the pattern as intended
Mark 11:15 Jon, I agree your idea is the right way to do this sort of thing, but it will be adding logic and latency in a lot of places (as well as augmenting every one of our products to be aware of this), and not bringing us forward, but continuing to keep us in the primary data-store model
Jon 11:18 Honestly if the use case for customers looking at their data, then having it a touch out-of-date information isn't as critical as if our actual users sees an out of date view. As a hypothetical Customer who knows nothing about IT, if I viewed my record straight after a consultation and it wasn't there I would just assume that there was a delay and it would appear later. When it comes to actual Users viewing the record, it's absolutely critical that they see the up to date view. And when it comes to appointments that's also critical because appointment booking is fast moving, it'd be an awful experience for a User if every "free" slot they booked turned out to be booked minutes earlier.
Mark 11:19 depends, if you've just requested a particular item and the page doesn't update to indicate that, can you continue requesting it?
Jon 11:20 Many of our users (mine included) turned off online appointment booking entirely at the beginning of the pandemic and use a triage system now. You wouldn’t be able to successfully request duplicate items, because the write would take place conditionally, so if it had been requested already then it'd say no (if designed even vaguely competently).
Mark 11:22 the write wouldn't come through, but it'd be confusing for the User seeing the prescription still requestable, unless the application has its own datastore of state
Jon 11:22 Yes it would be far from ideal. But the CTO has some ideas about that (having a "recent changes" dataset in a cache that is updated live, and merged with the replica's data). feels like there's loads of little bits of logic that need 'tacking on' to resolve potentially quite serious incidents. When the correct use of the strangler pattern gets us away from on-premise as primary DB, and moving in the direction we want to go. Yeah, this isn't easy and requires careful consideration.
Andrew 11:30 You are absolutely right Mark - there are a heck of a lot of potential gotchas and ultimately the plan has to be to use the strangler pattern, but at the moment we are looking at a rescue plan to put out some existing fires in the data centre and to handle predicted significant increase in load that will hit us in the Autumn. Everything that you have flagged is being considered. The only fall-back plan that we currently have is to spend nearly £4m / year on additional SQL Server readable secondaries (on top of having to pay an additional 12% on our existing SQL Server licences thanks to MS hiking their prices) and nobody has the appetite for that.
Closing Thoughts
I don’t know what the Strangler Pattern is, so I’ll add that to my reading list. However, it seems that even with my limited knowledge of architecture, our Software Architects have similar concerns to mine. There have been plenty of ideas that the CTO (or managers at a similar level) have quickly backtracked on after failing to consult people who know whether the idea is actually logically sound. I’ll keep my eye on this one to see how it develops.
One of the problems we have where I work is not releasing fast enough. When you read about Software Development, you hear of companies that release minor updates every week. Maybe that is more of a Web Development thing than an Application Development one, but there are also contractual reasons why we cannot release faster.
However, over time, the release cycle has crept up to 4-6 weeks, which causes problems.
If there is more time between releases, the scope of the current release often increases further. For example, if there is a fix that needs to go out within 3 weeks to meet the SLA (Service Level Agreement) and the next release will go out in 4 weeks, then you have little choice but to get it into the current release. If you are checking it in close to the deadline, you might end up delaying the release in order to test it. The more you delay, the greater the chance of someone else needing to get a fix into the current release, and the scope grows further.
If there are many big projects targeting the same release, each team develops in its own code “branch”, then merges into the Main branch for release. Since it’s not really feasible for everyone to merge at the same time, you end up taking it in turns and resolving any conflicting changes. To be honest, it’s quite rare that we change the same files for the main feature changes, but there are certain files with a lot of churn, mainly ones containing XML. Merging in the projects usually takes a few days, then all the bug fixes on top. The Testers can’t really begin testing until it’s all merged, so it’s a lot of overhead to manage the release.
When the releases are large, the Testers insist on running more Regression Tests which increases the Testing phase and can cause further delays.
“I think we spent about 2 months on that regression. It was madness. It was a HUUUGE release”
Software Tester
So smaller releases are much more manageable, take much less time to test, incur less risk, and leave less room for scope creep.
Our Software Delivery team made an announcement about this (basically just saying the same things I have just discussed), and about their desire to plan in quarters but release every couple of weeks.
In the past, we would scope a single release looking at the features, fixes and minor enhancements we wished to deploy. We would follow a process of merging everything into our main release branch before undertaking testing. This was a two-phased testing approach, integration/functional testing each feature and fix, and then regression testing to ensure pre-existing functions continued to work as expected. We would then spend eight to ten weeks deploying the release through Controlled Roll Out and Customer User Acceptance Testing.
This approach brought with it a number of challenges. Merging everything in was time consuming, issues or blockers with one feature would slow down or block other features, regression testing was a challenge, and this also put pressure on the roll out and deployment through pushing out a number of changes in one go.
To try and mitigate some of these challenges, we are now adopting a strategy of breaking these large releases down into smaller updates.
Working in quarterly cycles we will scope what we wish to deliver over a 12 week period. Each feature will be analysed and risk assessed for size and complexity by our Engineering Leads, and have a business value determined by our Product Management and Commercial Teams.
Using this feedback we will then determine an order in which we wish to deliver each feature. We will then merge them into a release and test them one at a time (potentially two if both are small and low risk), before signing over to Release Management to commence deployment.
We will then deploy the full scope over a series of smaller releases rather than in one large release.
The last update in the cycle will be a maintenance release to address the backlog of prioritised service work.
The objective behind this approach is to have our users benefit by taking elements of the release scope earlier than they would have before, whilst also simplifying the testing approach and hopefully enabling us to push code out across the estate quicker.
Dave Plummer, who has the YouTube channel Dave’s Garage, announced on Twitter:
Big news! Someone finally noticed that if you hold down CTRL, the process list in Task Manager conveniently freezes so you can select rows without them jumping around. I did this so you could sort by CPU and other dynamic columns but then still be able to click stuff…
Dave Plummer
There have been plenty of occasions where Task Manager rows jump around, to my annoyance. Why wasn’t this a more discoverable feature? Frank Krueger (who appears on the Merge Conflict podcast) made the obvious point:
Don’t hide features under random key combos – undiscoverable and unmemorable UIs are user hostile. A little checkbox with the text “Pause Display” would be discoverable and you won’t have to wait 30 years for someone to find your feature.