The Feature Flag Rant

When adding new features to software, you can put them behind a Feature Flag: if the flag is set to true, the new feature is used; if false, it isn't. This allows a quick roll-back by tweaking that value rather than releasing a new software update. However, it makes the code more complicated due to the branching paths.
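As a minimal sketch of the idea (the flag name and the config lookup here are made up for illustration), the flag is just a value read at runtime that decides which code path runs, so rolling back is a config change rather than a release:

    import os

    def legacy_checkout(order):
        return f"processed {order} the old way"

    def new_checkout(order):
        return f"processed {order} the new way"

    def checkout(order):
        # Hypothetical flag read from the environment; a real system would read
        # it from a config tool, database or flag service instead.
        flag_on = os.environ.get("FEATURE_NEW_CHECKOUT", "false").lower() == "true"
        return new_checkout(order) if flag_on else legacy_checkout(order)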

Once all users are using the new feature, when do you remove the old conditional code? Obviously it should be removed once everyone has switched over and is happy with the new functionality, but the work needs to be planned in, and what is the urgency? Project Managers want new projects that add value, not the deletion of redundant code.

One of our most experienced developers posted a rant about feature flags. He pointed out there was no guidance on when to use them. Do all new features get feature flags? What if a feature depends on one that already has a feature flag? Do Software Testers test each combination to make sure all code paths are supported? Is it clear which configurations are deployed on Live, since those should get priority when it comes to testing? By default, our Test Environments should match the config of a typical Live Environment; however, we often find that the default is some configuration that is invalid or not used anywhere.

It’s not always possible to “roll back” by switching the feature flag off. To implement the change, you may have needed to refactor existing code or add new database columns. Changing the feature flag back to “off/false” just stops some of the new code being called, but not all of the new changes (the refactored parts). So if the bug is in code that runs even with the flag off, it is still a problem.
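A sketch of that point, with invented names: both the old and new paths now go through a helper that was refactored (and a schema change) made as part of the feature work, so a bug there still hits users when the flag is off:

    def load_customer(customer_id, include_loyalty_tier=False):
        # Refactored during the feature work, and assumes a new database column
        # (loyalty_tier) now exists. Turning the flag off undoes neither of these.
        record = {"id": customer_id, "loyalty_tier": None}
        if include_loyalty_tier:
            record["loyalty_tier"] = "gold"  # placeholder for the new behaviour
        return record

    def handle_request(customer_id, flag_on):
        if flag_on:
            return load_customer(customer_id, include_loyalty_tier=True)  # new feature
        return load_customer(customer_id)  # "old" path, but running refactored code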

It was also discussed that some people used our Configuration Tool for actual configuration while others were using it for feature flags, and maybe we should have separate tools for Configuration and Features.
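One way to picture the distinction (all names here are invented): configuration is a long-lived, typed setting, whereas a feature flag is a temporary boolean that ideally carries an owner and an intended removal date, which keeps the clean-up work visible:

    from dataclasses import dataclass
    from datetime import date

    @dataclass
    class ConfigSetting:
        name: str
        value: str           # e.g. an SMTP host; expected to live indefinitely

    @dataclass
    class FeatureFlag:
        name: str
        enabled: bool
        owner: str           # team responsible for eventually removing it
        remove_by: date      # prompts the "when do we delete this?" conversation

    smtp_host = ConfigSetting("smtp_host", "smtp.example.com")
    new_checkout = FeatureFlag("new_checkout", True, "Payments Team", date(2025, 1, 31))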

Feature flags cause maintenance problems. A flag needs to be tested both on and off when it is implemented, and if you later want to remove it, that removal needs testing too. If you leave it in, it will always be questioned whether the code in that area is used or needs testing. How do you prioritise removing the code? Does it belong with the team that initially created the feature? What if that team has moved on, or been split up?

Another developer brought up an example where a bug existed in two places, but the developer who fixed the issue was only aware of one path and didn’t know about the other, which required a feature flag to enable.

He also questioned whether it is more of a problem with our process. Other companies may have quicker releases and more flexibility to roll back, using ideas like Canary Deployment. Our process is slow and relies on “fix-forward” rather than rollback.
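As a rough sketch of the canary idea (this is not our process; the names and percentage are made up), a small, stable slice of users is routed to the new build, and “rolling back” just means dialling that slice down to zero rather than shipping another release:

    import hashlib

    CANARY_PERCENT = 5  # hypothetical rollout size

    def in_canary(user_id: str, percent: int = CANARY_PERCENT) -> bool:
        # Hash the user id so the same user consistently gets the same answer.
        bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
        return bucket < percent

    def route(user_id: str) -> str:
        return "new-build" if in_canary(user_id) else "stable-build"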

Things to consider:

  • What actually gets feature flagged?
  • When the conditional code is removed from the codebase
  • The effect of the “Cartesian Explosion” of flag combinations on unit tests and test environments (see the sketch below)
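To illustrate that explosion (the flag names are invented): with N independent boolean flags there are 2^N possible configurations, so exhaustively testing every combination stops being practical very quickly:

    from itertools import product

    flags = ["new_checkout", "new_search", "dark_mode", "beta_reports"]

    combinations = list(product([False, True], repeat=len(flags)))
    print(len(combinations))  # 16 combinations for just 4 flags; 10 flags would be 1024

    for combo in combinations:
        config = dict(zip(flags, combo))
        # each combination would, in principle, need its own test run / environment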

Crowdstrike Struck The World

I heard from a few security podcasts that Microsoft wanted exclusive rights to manage the security of the kernel on Windows machines. However, the EU’s competition laws frown on monopolies, so regulators want an open market for security software. In most cases competition is good, but this could actually be one area where you do want a closed system: the more companies that have control over something as fundamental as the kernel, the greater the risk of threats.

A kernel driver has very intimate access to the system’s innermost workings. If anything goes wrong with the kernel driver, the system must blue-screen to prevent further damage to the user’s settings, files, security and so on.

Crowdstrike released a faulty update to their software, which caused the infamous blue screen of death on many Windows systems across the globe. Microsoft must have been fuming, because they knew this wouldn’t have happened with a closed system, yet the media kept reporting on it as if it were a Windows problem. Sure, it only affected Windows PCs, but it had nothing to do with Microsoft.

If I understand correctly, the driver itself was signed off by Microsoft, but the update involved a “channel file” which just contained loads of zeros. When the driver tried to use it, the system had no choice but to blue-screen. It makes you wonder what kind of testing processes they have at Crowdstrike if they can release an update like that.

When I logged in at work, our Group IT announced that some colleagues would be affected by a Crowdstrike problem and that they would be acting quickly to get people back up and running. It was only a bit later, when someone sent me a screenshot of some of our users complaining on X, that I realised it wasn’t just an internal problem. When I went on X, I saw reports of the problem affecting banks, airlines, supermarkets and more, and the BBC had a live news page running. I still didn’t understand the severity of the problem until I saw that Troy Hunt had declared it one of the severest problems we have ever seen.

Despite Group IT making it sound easy to restore, from what I heard others say, it was fairly straightforward to revert the update on a single computer, but when you have hundreds of computers it becomes a real problem. In companies with only a few IT staff, it is crippling. You may think that people could fix the problem themselves, but many people aren’t tech-savvy, and besides, many companies lock down access so you don’t have advanced features like Administrator mode.

Furthermore, it sounded like servers “in the cloud” were even more difficult to restore, or at least more cumbersome.

Ironically, in recent years we have moved a lot of our live infrastructure from our own data centres into the cloud, citing benefits of reliability. However, this problem meant our users were impacted for a day or so, when we could have had them up and running within an hour or so if the servers were still internally hosted.

Crowdstrike released an update to prevent more machines from being brought down, and sent customers mitigation steps and tools to identify impacted hosts. The new update wouldn’t fix the already broken machines though; that required a manual fix involving booting into Safe Mode, locating the dodgy file, and removing it.

Companies purchase security software to prevent system outages, so causing a global system outage is a massive PR blunder for Crowdstrike and for security software in general. It’s going to be tough rebuilding trust, but many everyday people will probably blame Microsoft because that’s the name that was initially stated in the media.

It must have been brutal for the upper management, and a disaster when they turned up fatigued and under pressure on live TV.

Troy Hunt documented the story as he learned more: