Whew, what a disaster! I share my thoughts about the whole CrowdStrike situation and the fundamental problem that I think lies at the core of this. Let me know what you think about this in the comments.
✅ Get the FREE Software Architecture Checklist, a guide for building robust, scalable software systems: https://arjan.codes/checklist.
📨 The Friday Loop by ArjanCodes Newsletter: https://thefridayloop.com
💻 ArjanCodes Blog: https://www.arjancodes.com/blog
🎓 Courses:
The Software Designer Mindset: https://www.arjancodes.com/courses/tsdm
The Software Architect Mindset: https://www.arjancodes.com/courses/tsam
Next Level Python: Become a Python Expert: https://www.arjancodes.com/courses/nlp
The 30-Day Design Challenge: https://www.arjancodes.com/courses/30ddc
👍 If you enjoyed this content, give this video a like. If you want to watch more of my upcoming videos, consider subscribing to my channel!
Social channels:
💬 Discord: https://discord.arjan.codes
🐦 X: https://x.com/arjancodes
🌍 LinkedIn: https://www.linkedin.com/company/arjancodes
🕵 Facebook: https://www.facebook.com/arjancodes
📱 Instagram: https://www.instagram.com/arjancodes
♪ Tiktok: https://www.tiktok.com/@arjancodes
🛒 GEAR & RECOMMENDED BOOKS: https://kit.co/arjancodes
🔖 Chapters:
0:00 Intro
0:11 Who is CrowdStrike?
0:45 Recap of Friday outage
1:11 Rant time
2:03 How it could have been avoided
2:53 A fundamental dichotomy
3:47 Things will get worse
#arjancodes #softwaredesign #python
DISCLAIMER – The links in this description might be affiliate links. If you purchase a product or service through one of those links, I may receive a small commission. There is no additional charge to you. Thanks for supporting my channel so I can continue to provide you with free content each week!
CrowdStrike Terms & Conditions.
8.6 "…AND CROWDSTRIKE TOOLS ARE NOT FAULT-TOLERANT AND ARE NOT DESIGNED OR
INTENDED FOR USE IN ANY HAZARDOUS ENVIRONMENT REQUIRING FAIL-SAFE
PERFORMANCE OR OPERATION. NEITHER THE OFFERINGS NOR CROWDSTRIKE TOOLS
ARE FOR USE IN THE OPERATION OF AIRCRAFT NAVIGATION, NUCLEAR FACILITIES,
COMMUNICATION SYSTEMS, WEAPONS SYSTEMS, DIRECT OR INDIRECT LIFE-SUPPORT
SYSTEMS, AIR TRAFFIC CONTROL, OR ANY APPLICATION OR INSTALLATION WHERE
FAILURE COULD RESULT IN DEATH, SEVERE PHYSICAL INJURY, OR PROPERTY
DAMAGE. Customer agrees that it is Customer’s responsibility to ensure
safe use of an Offering and the CrowdStrike Tools in such applications
and installations…."
Who allowed the use of this Software in critical applications?
At my workplace we've ended up reverting from the DevOps-infra way of thinking: too many failures come out of the DevOps mindset. Better to develop less but more securely; at the end of the day it's the long haul that wins the race. Oh, and no one asked for the AI stuff, which is just an extra layer of complication.
As a software engineer myself:
– Too little time: to test, to build well/refactor, to rebuild legacy code. Deadlines push badly tested software into production.
– Too much stress because of constant firing, people leaving, and rehiring. And with that churn, the knowledge of parts of the software is gone.
– A lot of bad managers in the IT world who make the above happen.
– Companies see software development/IT as a cost instead of a win: for example, in some companies I worked at, salespeople get bonuses when they sell enough of the software, while software engineers get nothing.
– There is no easy way for managers, or anyone who can't read code, to see how well a software engineer has been working. Because of this, most companies only look at speed, not at how well the software is written.
This has been happening for a very long time in lots of companies, probably most. The result is legacy code that isn't workable anymore, is very hard to maintain, and should have been replaced years ago.
Using a rootkit, as they have done, to enhance security feels very counterintuitive. Especially since CrowdStrike was shown to have deep ties to the US security state during the Russiagate debacle. I would not touch their products.
🦀
Great video as always! Thanks, Arjan!
Completely off topic:
It would be very interesting to see your take on the game "The Farmer Was Replaced", where you have to write code to automate a farming drone.
Thanks!
Operating systems and applications are inherently unsafe and insecure because they weren't built from the ground up to be safe and secure. If they had been, you wouldn't need security software to protect them. A security-first pattern in software development should be front and center, not bolted on as an afterthought, especially now that lives may depend on it.
All parties need to fix this broken system:
– Security companies can never force-push without testing.
– OS vendors (especially MS) need to improve every aspect of this scenario, with well-documented automated testing/check tools for multiple steps in the process.
– Essential companies cannot blindly trust updates without basic checks, and MS should not be the only OS you run if you want to be sure you're online all the time.
We need software better built for failure, especially for essential companies that cannot stop. If companies don't fix this at all levels, it opens a new door for failure.
Humans tend to think they can sacrifice quality for speed, which works for some time and then fails miserably. It's a bit like the uncertainty principle: there is a fundamental limit that cannot be cheated.
This is nonsense; moving fast doesn't mean breaking everything! You always need basic checks and constant improvement.
The CEO of CrowdStrike, George Kurtz, was the Chief Technology Officer of McAfee in 2010, when a security update from the antivirus firm crashed tens of thousands of computers.
Slapping CrowdStrike on the wrist will do NOTHING to fix anything.
Windows has had nearly 40 years to work out how to deal with "legit" third-party software updates containing broken code (i.e. not malicious viruses) WITHOUT locking the user out of their PC.
This WILL happen again. Blaming CrowdStrike is a total waste of time. FIX the PROBLEM.
Why didn't my Win10 computers here crash? Everybody was working normally… because I don't let somebody I don't know update my computers just because some a**hole said I have to. I update my computers when I need to, and only after trying the update on a non-essential machine for at least a few days or a week 🤷‍♀. Today my Windows is running the same files it was running yesterday, and last week, and last month, with the same software I installed on it last year. It boils my piss that these imbeciles are constantly beta-testing their crap on the computers I use for work, as if this were a 1960s terminal instead of a PERSONAL computer 😡. I don't understand how you people tolerate this nonsense over there. Were medical centers affected by this problem over there? And probably nuclear plants too 😶… yes, because those things are connected to the internet sewer, right? The stupidity in this world is overwhelming, so yeah, probably 🙈.
Ah yes, the age-old question posed by Juvenal: Quis custodiet ipsos custodes? (Who watches the watchmen?)
There is (should be) a standard pre-release test even for time-critical security software: the target operating system must at least boot to a point where the updater can install new versions of the security software. The test should be run twice: once on an instance that keeps upgrading the software, and once on a freshly installed operating system. If just those tests are implemented then, to an extent, you can rush the rest, because fixes can be sent out to clean up errors. This test doesn't need rewriting for each release of the software (a sketch follows after this comment).
Other tests (does it block the malware, does it avoid interfering with critical applications, …) can run after launch, because errors can be cleaned up automatically. There is room for subtlety here: a customer might sign up for the pre-application-testing version or the post-application-testing version. Perhaps they do their own testing. Perhaps they have made a risk-balancing decision.
This sounds so obvious. Hindsight is a beautiful thing.
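A minimal sketch of that boot test in Python, under assumptions of my own: a hypothetical "vmctl" hypervisor CLI, two named VM snapshots, and a guest-side marker the updater sets once it can run (none of this is CrowdStrike's actual tooling):

    import subprocess
    import time

    VM = "win11-test"   # hypothetical test VM name
    BOOT_TIMEOUT = 300  # seconds to wait for the guest to come up

    def vm(*args: str) -> str:
        # Thin wrapper around the hypothetical "vmctl" hypervisor CLI.
        result = subprocess.run(["vmctl", *args], check=True,
                                capture_output=True, text=True)
        return result.stdout

    def booted_to_updater(vm_name: str, timeout: float) -> bool:
        # Poll until the guest reports that the OS booted far enough for
        # the security software's updater to run (assumed: the updater
        # sets a marker the hypervisor can query), or give up.
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            if "updater-ready" in vm("guest-status", vm_name):
                return True
            time.sleep(5)
        return False

    def boot_test(update_path: str) -> None:
        # Run the same check twice, as the comment suggests: once on an
        # instance that keeps upgrading, once on a fresh install.
        for snapshot in ("upgraded-many-times", "fresh-install"):
            vm("restore-snapshot", VM, snapshot)
            vm("push-update", VM, update_path)
            vm("reboot", VM)
            if not booted_to_updater(VM, BOOT_TIMEOUT):
                raise SystemExit(f"Boot test failed on {snapshot!r}: the OS "
                                 "never reached the updater. Do not ship.")
        print("Boot test passed on both snapshots.")

    if __name__ == "__main__":
        boot_test("new-channel-file.sys")

The point of the sketch: the gate is generic. It knows nothing about what the update does; it only refuses to ship anything that prevents the next fix from being installable.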
CrowdStrike hit Linux a few months ago too, but nobody said anything since the impact was smaller.
CrowdStrike was also able to force such upgrades. Plus, we can have both tests and velocity; Dave Farley, the main voice of Continuous Delivery, has said as much 🙂
I agree.
There is a lack of appreciation in the general 'software' industry of the shaky foundations of logic and perfection that coding is built upon, and of how those foundations have permeated many other parts of society.
Those who are reaching for blame should have a read of the years of study of human error, safety studies, and their ilk to see how these major failures continue to happen.
It's just another day in our interconnected world.
When the underlying unexpected factor(s) are finally identified, they'll likely be something tedious and boring from some disused cupboard that had been forgotten about (xkcd #2347).
WHQL certification for third-party drivers needs to be FASTER from Microsoft. Then you wouldn't need the workarounds that CrowdStrike uses to release quick updates via sys files…
Thanks for this!
From a retired Windows developer's perspective, this description contains some great insights and interesting speculation about how the CrowdStrike driver caused the problem. Normally a kernel driver, like CrowdStrike's, is carefully tested and certified. Perhaps they are meeting the need for rapid updates, which @ArjanCodes discusses, by running code that is downloaded dynamically like a virus definition file. This means that they can, in effect, run uncertified code in the kernel. Speculation, but informed and plausible.
See @DavesGarage https://www.youtube.com/watch?v=wAzEJxOo1ts
Never
Deploy
On
Friday
You can't have DRY code, or the software equivalent, without paying for it by ensuring it's right (tested/QA'd).
The promise of Continuous Delivery (as Dave Farley explains so often) is that you can release quickly and safely because you have lots of tests. You work in small steps to achieve that. There might be an imminent threat that forces a big change to your software. Then you are back at square one: how do you know your change actually deals with the threat? Oh, that's right: by testing. If you say you need to skip that phase, you don't believe in testing in the first place. If you skip that phase, you get something to market quicker. But will it help? Or are you pouring oil onto the fire? The practice of continuous delivery with TDD is the best insurance that your software stays flexible and easy to change, so you can deal with such problems quickly when they arise suddenly.
The two best ways to reduce the risk and impact: canary releases, and switching to Linux.
Is this the peak of AGILE?
Technician/programmer: "We need to do this, this, and that…" Manager: "Nope, you're fired." …After a while, big error… Manager: "It's the technician's/programmer's fault…"
This is called “Experience”
A system's singularity, monopoly, or homogeneity lets the same problem form a chain reaction, like a nuclear explosion. As computer systems multiply, they penetrate deeper and deeper into every aspect of human life, and they become more and more like an ecosystem: there are invasions, defenses, upgrades, mutations, and evolution. And the systems created by humans don't seem to last long. Perhaps we should learn from the Creator and take some inspiration from nature. We need diverse systems, just like biodiversity. At the same time, we need isolation between systems, just as species in the biological world are isolated by geography and climate. If an update can spread all over the world in an instant, it is like a virus spreading between organisms without hindrance. Sooner or later it will be a disaster, and next time it may be a nuclear facility. Whether the CrowdStrike incident caused problems in any nuclear facilities is still unknown.
Testing, testing, and more testing. The update wasn't tested properly.
I don't understand how this happens. Why doesn't CrowdStrike have a few VMs running Windows 7, 8.1, 10, and 11 to test on first? Push the update, reboot the VMs a few times, and make sure it's working. It's almost like we are testing their crap in production.
Also (4:45): it's all fun and games until the self-driving car runs over a young boy playing outside.
Agile > Fragile
I am glad other programmers, developers, and engineers are being vocal about this. I have been waiting for it, because I also think what happened exposes a major problem in IT. I personally don't think it was an update issue but rather a cyber attack, which the CEO cannot reveal, as that would undermine the entire point of his product. The bottom line is that we cannot rely on a single centralized agency for IT security, whether my theory is correct or not.
Maybe the kernel security layer should be virtualized, so that a corruption of the kernel can quickly be switched off.
Despite the claimed need for such deep access, if companies like CrowdStrike can corrupt the kernel, hackers (including nation-state actors) could do the same, or worse. At least the CrowdStrike bug just crashed the system; other bugs could subvert it.
That makes you think about a self-driving car with a bug in it. 🦅
Use AI to build more tests. 😉
One suggestion I heard: make EULAs illegal. There is no reason CrowdStrike should not face a giant class-action lawsuit over this failure. If companies are legally liable and face financial repercussions, they will implement the needed processes on their own. E.g., the Ford Pinto.
Engineering trade-offs, man; we can't get away from the fact that most (many? all?) decisions are trade-offs.
In the good old days we had big mainframes running code which took checkpoints and did automatic rollbacks upon failure. They were replaced by lots of networked Microsoft boxes.
Part of the solution is the adoption of safe languages like Rust for system-related components.
Most threats do not require an immediate response. For many, a canary release mechanism based on system criticality would work (see the sketch after this list):
1) Deploy to non-critical systems (grocery stores, small businesses, gas stations, government).
2) Wait 36 hours.
3) Deploy to mid-level critical systems (banks, financial institutions).
4) Wait 36 hours.
5) Deploy to critical systems (hospitals, pharmacies, airports).
For the highest threat-level scenarios, perhaps use the shotgun approach.
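A minimal sketch of that tiered rollout in Python; the tier names and the 36-hour soak come from the list above, while deploy_to and error_rate are hypothetical hooks into whatever fleet-management system is actually in place:

    import time

    SOAK_SECONDS = 36 * 60 * 60  # the 36-hour soak between tiers

    # Least critical first; a bad update caught early never reaches hospitals.
    TIERS = [
        ("non-critical", ["grocery stores", "small businesses", "gas stations", "government"]),
        ("mid-critical", ["banks", "financial institutions"]),
        ("critical",     ["hospitals", "pharmacies", "airports"]),
    ]

    def deploy_to(fleet: list[str], update: str) -> None:
        # Hypothetical hook: push the update to every host in the fleet.
        print(f"deploying {update} to {fleet}")

    def error_rate(fleet: list[str]) -> float:
        # Hypothetical hook: fraction of hosts reporting crashes or
        # failed boots since the last deploy.
        return 0.0

    def staged_rollout(update: str, abort_threshold: float = 0.01) -> None:
        for name, fleet in TIERS:
            deploy_to(fleet, update)
            time.sleep(SOAK_SECONDS)  # let problems surface before escalating
            if error_rate(fleet) > abort_threshold:
                raise SystemExit(f"Rollout halted at the {name} tier: "
                                 "error rate exceeded the threshold.")
        print(f"{update} rolled out to all tiers.")

The design choice here is that escalation is gated on observed failures in the less critical tier, so the blast radius of a bad update is bounded by whoever happened to be in the first wave.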
Perhaps Microsoft should make their products more secure instead of pushing us all into a daft business model. The whole paradigm is flawed: it is based on making as much money as possible and tying people into products, not on security, safety, and performance.
In this particular case, there's something more basic that would have caught the problem, and it requires no slowdown of the development process when new malware shows up. When they deploy code, they need their software to phone home once the OS boots up; that's it. If, when they deployed a new channel file and rebooted their test servers, they waited for the phone-home before moving the code down the line for further testing and eventual deployment, they would have caught this problem. The user-land code would never have phoned home because the OS was stuck in boot. This is really just basic smoke testing, and it shows how immature their deployment pipeline must be.
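A minimal sketch of that phone-home gate in Python, assuming a hypothetical "deployctl" CLI for pushing files and rebooting hosts, and a deployment server that records each agent heartbeat (all names are illustrative):

    import subprocess
    import time
    import urllib.request

    HEARTBEAT_URL = "https://deploy.example.com/heartbeat"  # illustrative endpoint
    PHONE_HOME_TIMEOUT = 600  # ten minutes for the test box to boot and report

    def last_heartbeat(host: str) -> float | None:
        # Ask the deployment server when the agent on `host` last phoned
        # home (assumed: it answers with an epoch timestamp, or nothing).
        with urllib.request.urlopen(f"{HEARTBEAT_URL}?host={host}") as resp:
            body = resp.read().decode().strip()
        return float(body) if body else None

    def wait_for_phone_home(host: str, deployed_at: float) -> bool:
        # The user-land agent can only phone home if the OS finished
        # booting, so no fresh heartbeat after the reboot means the
        # update left the machine stuck in boot.
        deadline = time.monotonic() + PHONE_HOME_TIMEOUT
        while time.monotonic() < deadline:
            beat = last_heartbeat(host)
            if beat is not None and beat > deployed_at:
                return True
            time.sleep(10)
        return False

    def smoke_test(host: str, channel_file: str) -> None:
        deployed_at = time.time()
        # "deployctl" is a stand-in for whatever pushes files and reboots hosts.
        subprocess.run(["deployctl", "push", host, channel_file], check=True)
        subprocess.run(["deployctl", "reboot", host], check=True)
        if not wait_for_phone_home(host, deployed_at):
            raise SystemExit("No phone-home after reboot: block the rollout.")
        print("Test server phoned home; safe to move the update down the line.")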
I don't know, maybe paying people well, retaining them so they develop expertise, and not rushing?