CrowdStrike’s IT outage makes it clear why cyber resilience matters

A misconfigured content update released by CrowdStrike late on Thursday inadvertently triggered worldwide outages across Microsoft Windows systems, taking many of the world’s most essential services offline.

CrowdStrike was attempting to update content that its Falcon Sensor uses for real-time threat detection and endpoint protection. The sensor monitors system activity to identify suspicious behavior and prevent cyberattacks. The content update contains logic designed to fine-tune the detection of malicious activities and is based on the latest threat intelligence CrowdStrike collects on a real-time, continuous basis.

“This was not a code update. This was actually an update to content. And what that means is there’s a single file that drives some additional logic on how we look for bad actors. And this logic was pushed out and caused an issue only in the Microsoft environment,” CrowdStrike CEO and founder George Kurtz told Jim Cramer during an interview on CNBC earlier today.

The outage was first spotted in Australia, with Windows machines crashing and displaying the Blue Screen of Death (BSOD). The faulty update triggered a Windows blackout worldwide, impacting dozens of airports, airlines, banking institutions, and service companies that all rely on Windows-based systems to operate their businesses. Hundreds of thousands of travelers are stranded in airports around the world. Approximately 2,600 U.S. flights had been canceled as of Friday afternoon, and more than 4,200 had been canceled globally, according to FlightAware data reported by the Wall Street Journal.

The effects of the IT outage also spread across the Microsoft Azure cloud platform. Azure customers complained that they were “experiencing unresponsiveness and startup failures on Windows machines using the CrowdStrike Falcon agent, affecting both on-premises and various cloud platforms.” Azure Health Status shows the outage still impacts Azure virtual machines across four regions: the Americas, Europe, Asia-Pacific, and the Middle East and Africa.

IT teams are in for a long weekend and a tough July, as many cloud-based configurations will require individual updates for each customer. Give IT teams a break and, if possible, postpone any large-scale projects until the misconfiguration is resolved.

Outage needs to be a call to action for greater cyber resilience

The more cyber resilient a business is, the greater its ability to anticipate, withstand, and recover from a wide variety of adverse conditions, including attacks, intrusions and compromises. It’s often on CISOs to get cyber resilience right as a core part of their roles in senior management and, increasingly, on boards.

“Ultimately, every enterprise has challenges around patching cadence. Today is CrowdStrike’s bad day, and it became a bad day for a lot of folks. The fact that CrowdStrike required their end customers to do the work to ameliorate created more time to respond and time to remediate,” Merritt Baer, CISO at Reco and advisor to Expanso, Andesite and EnkryptAI, told VentureBeat.

Trustwave CISO Kory Daniels recently said that “boards have begun asking the question: Is it important to have a formally titled chief resilience officer?” VentureBeat has learned that more boards of directors are adding cyber resilience to their broader risk management project teams. High-profile ransomware attacks that create chaos across supply chains are among the most costly for any business to withstand, as the United Healthcare breach makes clear.

Outages caused by misconfigurations highlight the need for a unique form of cyber resilience so actively pursued that it becomes a core part of a company’s DNA. Misconfigured updates will continue to cause global outages. That goes with the territory of an always-on, real-time world defined by intricate, integrated systems. “The scale is significant but the source is too: for example, Snowflake was due to SaaS misconfigurations, and SolarWinds was a Russian-backed supply chain attack. This is good old-fashioned security pain,” Baer said.

This week’s global outage is what a nation-state attack would look like if a nation’s cybersecurity were weak or nonexistent. To get a glimpse into what’s at stake when it comes to national cyber resilience and cyber defense, check out the recently released 2024 Annual Threat Assessment of the U.S. Intelligence Community.

A cyber-resilient response to a misconfiguration needs to quickly identify and define the issue, define a fix (ideally one that can be automated at scale), and over-communicate with every customer and person affected. Getting internal cyber resilience right needs to be supported with reporting that’s accurate, easily accessible to everyone, and as real-time as possible. The goal needs to be giving everyone involved in updates a chance to own the outcome and to know that regression testing and testing across partner platforms are complete.
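
One way to make "know that testing is complete" concrete is a release gate that refuses to ship an update until every required check has passed. What follows is a minimal Python sketch under stated assumptions: the check names (`regression`, `partner_platform_windows`), the `CheckResult` type and the `release_gate` function are all hypothetical illustrations, not CrowdStrike's process or any vendor's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    """Outcome of one pre-release check (hypothetical structure)."""
    name: str
    passed: bool


# Assumed list of required checks; a real team would define its own.
REQUIRED_CHECKS = ("regression", "partner_platform_windows")


def release_gate(results):
    """Return (ok, blockers); ok is True only if every required check passed."""
    passed = {r.name for r in results if r.passed}
    # A check that failed or was never reported counts as a blocker.
    blockers = [name for name in REQUIRED_CHECKS if name not in passed]
    return (not blockers, blockers)


if __name__ == "__main__":
    results = [
        CheckResult("regression", True),
        CheckResult("partner_platform_windows", False),
    ]
    ok, blockers = release_gate(results)
    print("release approved" if ok else f"release blocked by: {blockers}")
```

Treating a missing result the same as a failed one is a deliberate choice in this sketch: an update is held back unless there is positive evidence that each check completed, and the list of blockers gives everyone involved the same view of what is still outstanding.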

“Earlier today, CrowdStrike’s Falcon service suffered an unfortunate global outage that affected many customers using the software on Windows systems. CrowdStrike’s incident response team’s speedy action to determine the root cause and notify customers quickly is commendable, and their CEO’s blog was honest and clear,” Paul Davis, Field CISO at JFrog, told VentureBeat.

Kurtz continues to post updates on X and LinkedIn. In his most recent X post, he commits to providing a root cause analysis of how the outage happened.

“In the world of security, one must always be prepared for the unexpected and have an incident plan for those surprise events. There is no such thing as perfect software. After all, software is built by humans, and to err is human. It’s how quickly you identify and recover from the problem that matters most,” Davis told VentureBeat.

Recovering your system

Earlier today, CrowdStrike posted instructions on its site for recovering systems affected by the outage and for finding systems or hosts impacted by the misconfigured update.

You’ll need to start any affected machine in safe mode first. This step is necessary because the Falcon Sensor software, which needs updating, is embedded within a subdirectory of the Windows operating system. Booting into safe mode is essential to access this subdirectory and perform the necessary updates.

If the affected PC uses BitLocker or other full-disk encryption (FDE) software, you’ll need the recovery key for each machine. CrowdStrike’s blog post details the steps it recommends for recovering an affected machine.

Cyber resiliency is a proxy for customer trust

“Security vendors need to understand that they are holding customer outcomes in their hands. I imagine CrowdStrike won’t push updates in the same way in the future,” Baer told VentureBeat. The worldwide outage continues to disrupt hundreds of thousands of people’s lives and force businesses to a standstill. From the shop floors of designers who rely on cloud-based systems to connect with their customers to large-scale enterprises with thousands of colleagues unable to log in, today’s experiences make it clear that cyber resiliency is more than a security initiative. It needs to be a cornerstone of customer experience.

Earning and keeping the trust of customers hinges on making a business as cyber-resilient as possible. The outage is a compelling event that every business should treat as a crucible for evaluating how well prepared it is for a comparable event.

Given the complex integrations and connections between global systems, there will be future outages. Every business must take responsibility for cyber resilience and choose to excel at it now rather than later.

The post CrowdStrike’s IT outage makes it clear why cyber resilience matters appeared first on VentureBeat.