I heard on a few security podcasts that Microsoft wanted exclusive rights to manage security at the kernel level on Windows machines. However, the EU's competition laws don't look kindly on monopolies, so regulators pushed for an open market of security software. In most cases competition is good, but this could actually be one area where you want a closed system: the more companies that have a foothold in something as fundamental as the kernel, the greater the security risk.
A kernel driver has very intimate access to the system's innermost workings. If anything goes wrong inside a kernel driver, the system must blue screen to prevent further damage to the user's settings, files, security and so on.
CrowdStrike released a faulty update to its software, which caused the infamous blue screen of death on Windows systems across the globe. Microsoft must have been fuming: they knew this wouldn't have happened with a closed system, yet the media kept reporting it as if it were a Windows problem. Sure, it only affected Windows PCs, but it had nothing to do with Microsoft.
If I understand correctly, the driver itself was signed off by Microsoft, but the update pushed a "channel file" that just contained loads of zeros. When the driver tried to use it, it crashed in kernel mode, and the system had no choice but to blue screen. It makes you wonder what kind of testing processes CrowdStrike has if an update like that can make it out the door.
When I logged in at work, our Group IT announced that some colleagues would be affected by a CrowdStrike problem and that they would be acting quickly to get people back up and running. It was only a bit later, when someone sent me a screenshot of some of our users complaining on X, that I realised it wasn't just an internal problem. When I went on X, I saw reports of the problem affecting banks, airlines, supermarkets and more, and the BBC had a live news page running. I still didn't grasp the severity of the problem until I saw that Troy Hunt had declared it one of the most severe incidents we have ever seen.
Group IT made the restore sound easy, and from what I heard from others, reverting the update on a single computer is fairly straightforward. But when you have hundreds of computers, it becomes a real problem, and in companies with only a few IT staff it is crippling. You might think people could fix the problem themselves, but many people aren't tech-savvy, and many companies lock down access anyway, so users don't have administrator rights.
Furthermore, it sounded like servers "in the cloud" were even more difficult to restore, or at least more cumbersome.
Ironically, in recent years we have moved a lot of our live infrastructure out of our own data centres and into the cloud, citing reliability as one of the benefits. Yet this problem left our users impacted for a day or so, when we could have had them up and running within an hour or so if the servers had still been hosted internally.
CrowdStrike released an update to prevent more machines from being brought down, and sent customers mitigation steps and tools to identify impacted hosts. The new update wouldn't fix machines that were already broken, though; that required a manual fix involving booting into Safe Mode, locating the dodgy channel file, and removing it.
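For a single machine, the workaround circulating at the time boiled down to something like the sketch below. It's purely illustrative: the directory and the C-00000291*.sys filename pattern are as reported in CrowdStrike's public remediation guidance at the time, and in practice you'd follow the official steps from Safe Mode or the recovery environment rather than a snippet from a blog.

```python
# Purely illustrative sketch of the manual workaround -- follow the official
# CrowdStrike/Microsoft guidance, not this. Intended to be run from Safe Mode
# or the Windows Recovery Environment with administrator rights.
from pathlib import Path

# Location and filename pattern of the affected channel file(s),
# as reported in CrowdStrike's public remediation notes at the time.
DRIVER_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")
PATTERN = "C-00000291*.sys"

def remove_bad_channel_files(dry_run: bool = True) -> None:
    """Find the faulty channel files and (optionally) delete them."""
    matches = sorted(DRIVER_DIR.glob(PATTERN))
    if not matches:
        print("No matching channel files found.")
        return
    for path in matches:
        if dry_run:
            print(f"Would delete {path}")
        else:
            path.unlink()
            print(f"Deleted {path}")

if __name__ == "__main__":
    # Dry run by default; switch to dry_run=False to actually delete,
    # then reboot normally.
    remove_bad_channel_files(dry_run=True)
```

Simple enough for one machine, which is exactly why the real pain was repeating it by hand across hundreds of them.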
Companies purchase security software to prevent outages, so causing a global outage is a massive PR blunder for CrowdStrike and for security software in general. It's going to be tough to rebuild trust, though many everyday people will probably blame Microsoft, because that's the name the media led with.
It must have been brutal for upper management, and a disaster when they turned up on live TV fatigued and under pressure.
Troy Hunt documented the story as he learned more: