The events of Friday, July 19, 2024, had a profound impact on organizations around the globe. I suppose a global IT outage has a way of clarifying the mind, and IT leaders are recognizing that resilience is crucial for maintaining operations for their consumers, customers, staff, partners, and shareholders.
But much of the post-mortem analysis seems to misunderstand key lessons from the outage. Here are three misconceptions I noted in the press coverage of the recent CrowdStrike update outage.
Misconception #1: Companies “let this happen”
In the wake of the outage, ABC News published an article about the impact of the CrowdStrike event on Delta Air Lines’ operations. The article not only covered the impact but also questioned why it took so long to restore and recover operations. Henry Harteveldt, a travel industry analyst at Atmosphere Research Group, told ABC News, "It’s a surprise that a multi-billion-dollar corporation like Delta would allow this to happen."
It’s important to remember that Delta didn’t intentionally cause the problem. To fully understand how this occurred, one must consider CrowdStrike’s update methodology, pipeline testing, cohort rollout, and more.
Customer organizations have limited visibility into every piece of code updated by every vendor in their environment. This is true for any software company, not just CrowdStrike. For example, do you read the release notes for every iOS or Android update? Do you test updates on a secondary device before applying them to your primary device? How about updates for Chrome or Safari? With this in mind, it’s easy to see how a piece of software might unintentionally impact other software.
Misconception #2: Fixes needed to be implemented “by hand”
Next, Mark Lanterman, Chief Technology Officer at Computer Forensic Services, also offered ABC News his opinion on why it took so long to restore operations: it’s a labor issue.
"This isn’t a fix that could be done automatically; IT resources can’t just sit at a computer and push out an update and everything is fixed," Mark commented. "It took so long because Delta has a lot of computers and likely they have limited IT resources to go from computer to computer."
Manually implementing CrowdStrike’s corrective actions could take days or weeks, but today’s technology offers better solutions. Although Mark’s perspective is logical, especially from an Information Security standpoint, there are tools available for bulk operations that restore systems to a specific point in time. For instance, Rubrik’s “In Place Recovery” and “Orchestrated Recovery” functionalities allowed our customers to recover Windows Virtual Machines (the majority of machines affected) to a point before the CrowdStrike update. This bulk operation is quick and can roll back hundreds or even thousands of machines in minutes.
Additionally, other customers used our Threat Hunting technology to identify the specific file causing the issue. Once they understood the impact, they could determine which systems needed to be rolled back. Our telemetry data shows hundreds of our customers initiated bulk recovery actions on July 19th, proving that automated recovery is possible.
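To make the idea concrete, here is a minimal sketch of the selection logic behind such a bulk rollback: find the machines that contain the offending file, then pick each machine's most recent snapshot taken before the update landed. This is not Rubrik's API or implementation. The `Machine` record, the `plan_rollback` function, the file pattern, and the timestamps are all hypothetical, and a real tool would read this information from its own backup index.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from fnmatch import fnmatchcase
from typing import Optional


@dataclass
class Machine:
    """Hypothetical inventory record; not a real product data model."""
    name: str
    files: list[str]            # file names found on the machine
    snapshots: list[datetime]   # timestamps of available backups


def plan_rollback(
    machines: list[Machine],
    bad_file_pattern: str,
    update_time: datetime,
) -> dict[str, Optional[datetime]]:
    """Map each affected machine to its latest snapshot before update_time.

    Machines without a matching file are left out of the plan. An affected
    machine with no earlier snapshot maps to None so it can be handled by hand.
    """
    plan: dict[str, Optional[datetime]] = {}
    for machine in machines:
        affected = any(fnmatchcase(f, bad_file_pattern) for f in machine.files)
        if not affected:
            continue
        earlier = [s for s in machine.snapshots if s < update_time]
        plan[machine.name] = max(earlier) if earlier else None
    return plan


if __name__ == "__main__":
    utc = timezone.utc
    # Illustrative values only: placeholder file pattern and update time.
    update_time = datetime(2024, 7, 19, 4, 0, tzinfo=utc)
    fleet = [
        Machine("vm-a", ["bad-update-01.sys"],
                [datetime(2024, 7, 18, 22, 0, tzinfo=utc),
                 datetime(2024, 7, 19, 6, 0, tzinfo=utc)]),
        Machine("vm-b", ["other.sys"],
                [datetime(2024, 7, 18, 22, 0, tzinfo=utc)]),
    ]
    for name, snapshot in plan_rollback(fleet, "bad-update-*.sys", update_time).items():
        print(name, snapshot)
```

The point of the sketch is that the decision about which machines to roll back, and to when, reduces to a simple query once file-level and snapshot-level metadata are searchable. That is what turns recovery into a bulk operation instead of a machine-by-machine one.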
Misconception #3: IT and InfoSec operate in lockstep
I’ve observed a troubling trend across many organizations: Information Security and Information Technology teams operate separately (and not in service of the greater good).
While it’s important to have dedicated expertise in each area, the lack of ongoing collaboration between these teams and the silos of information kept close to the chest is concerning. Often, they don’t communicate or practice response actions together and have very little consensus on how the individual pieces of technology managed by each team can be used for a response. A yearly tabletop exercise is simply not enough.
Information Security teams focus on prevention, building strong defenses like the proverbial taller walls and wider moats to keep threats out and the data in. They are effective at this, handling thousands of attacks daily with few successful breaches. However, it only takes one successful attack to render these perimeter defenses useless.
Information Technology teams handle ongoing technical operations and disaster recovery (DR). Traditionally, DR addressed natural disasters like earthquakes and power outages, but these processes often fall short in systemic outages or cyberattacks. Legacy DR methods are not always equipped to handle modern cyber threats or large-scale outages.
So, how well do these teams collaborate during an outage or cyberattack? Not as well as they could. When an attack occurs, chaos ensues, and Information Security initiates its incident response plan. Eventually, IT gets involved in recovery but often lacks a full understanding of what has transpired and ultimately ends up recovering bad or incomplete data. Conversely, Information Security may not be aware of all the IT processes and capabilities that could aid in recovery, so it instead tries to repair or remediate the production systems, which is lengthy and often fraught with failure. Legacy technologies make this even worse: they often provide limited capabilities, making them less effective or completely ineffective against cyberattacks.
Where Do We Go From Here?
Rubrik aims to bridge this gap by offering true cyber resilience in the Data Security market. Our platform supports both IT and cybersecurity needs with easy-to-understand policies, data protection features, and the ability to recover to specific points in time. We also offer threat detection, anomaly detection, sensitive data discovery, and quarantine capabilities to help prevent reinfection. Our platform supports recovery across various environments (cloud, SaaS, on-premises) and automates recovery for thousands of machines or workloads. Additionally, Rubrik’s API-first architecture integrates seamlessly with other tools in your environment to enhance your data experience. While traditional information security can be thought of as a top-down approach, Rubrik presents capabilities at the point of data, or bottom-up.
Rubrik is a rare technology that enables closer collaboration between distinct teams, providing insights and capabilities for Information Security while supporting IT in protection and recovery.
Did Delta have access to similar tools? Was there a disconnect between Information Security and IT? Could better technology have helped Delta recover? These are all questions whose answers would be interesting to know. From a Rubrik perspective, we absolutely provided a technology that could assist with recovery, and our customers used it.