
The Day the Data Center Stood Still: A Tabletop DR Workshop


Regular DR testing is crucial to preparing a team for all the moving parts involved when some random day goes really wrong. The trouble is, it’s not always possible to scrape together the time and resources needed to test as often as you should. In many organizations, the first time a crisis team gets together is for the big one.

This blog series shows how you can use tabletop workshops to do some hypothetical training and strengthen your disaster recovery and response strategy. In my first post, I discussed why a tabletop exercise matters and how to prepare for one. This post walks through the setup and execution of the exercise. Let’s get started!

Tips for Running Your Tabletop Workshop

Each workshop has one facilitator to guide the exercise and 5-10 participants in the core and extended crisis team to roleplay the scenario. Note that participants will not necessarily be playing their actual role at the company, but working collectively as a group to address the disaster.

Group Discussions

One of the main goals of a tabletop workshop is to encourage group discussions that identify holes in your current DR strategy and strengthen your approach. The facilitator reads each segment, and then participants have 5 minutes for group discussion. Participants may want to briefly ask for more details during this time to better understand the situation. To supply the facilitator with additional background during group discussions, sections marked Facilitator’s Notes provide supporting information. These notes also help make the scenario interactive and slightly unpredictable. Feel free to embellish or reduce detail as you see fit.

After group discussions and before moving on to the next scenario, the team should assess their current status. It is recommended to print out the following status questions and give them to the group for reference:

  • What is your priority in this scenario? 
  • What additional information do you need? 
  • What is the escalation chain? 
  • What is the communications strategy?
  • Is this an emergency? Time to declare a disaster? Who needs to be informed at this point (e.g. CIO, CEO, PR, customers/suppliers, law enforcement, the media)?

During their discussion, the group should record the agreed-upon course of action for each of these points, even if the answer is N/A. They should then give this feedback to the facilitator, along with a justification of how and why they arrived at these decisions.

Workshop Begins: We Are Shipping for the Holidays!

Below is the information on the company’s current state when the disaster unfolds, as well as a timeline of events. Each time marks a new scenario for the group to discuss and provides questions and notes for the facilitator. Let the workshop begin!

Backstory

You’re a midsize enterprise that produces consumer electronic devices. The company has two sites, each with its own data center, using cross-site replication. IT, help desk, and many business functions are consolidated in the main site. In the past two years, three different people have held the security role. Because of this turnover, many security functions are outsourced to a Managed Security Service Provider (MSSP), including log analysis, endpoint security policy management, firewall change management, and incident response.

Several weeks ago, employees reported two separate rounds of suspicious emails. The first batch was phishing emails with instructions to visit a website. These were forwarded to the external MSSP response team, who reported that the website was indeed a conduit for drive-by malware. They sent a sample of the malware to the endpoint security vendor so that the signatures would be updated in case anyone had been infected. The second phishing incident was much more targeted. The response team visited the website and reported nothing unusual, but still sent a company-wide email urging employees to be vigilant.

The holiday season is coming up, and the new product release is scheduled to start shipping today. Pre-release devices have gotten favorable reviews from journalists, and orders are still coming in, with significant increases to existing orders. Marketing has already started an integrated Cyber Monday campaign for the online retailers, along with in-store materials and special incentives for the brick-and-mortar electronics retailers.

8:00 am

Folks are starting to come in and the first trucks will be showing up shortly. Susan from Help Desk calls IT to say, “The team down in shipping says their scanners are displaying error messages when they try to send or receive.”

Facilitator’s Notes: If and when the group asks you questions about the technical situation, do not answer with a blanket statement. Instead, use the information under Technical Background to answer these questions as they arise in this segment and all subsequent segments.

Technical Background at 8:00 am: Networks, firewalls, and routers are all functioning normally. Wireless access points in the logistics center are not showing any errors. Active Directory is down. DNS is resolving. Time servers are reachable. Internet access functions but is very slow. Applications monitored via SNMP are reporting responsive, and the servers are up and running in vSphere. ERP/SCM is reporting up. Investigation of the ERP/SCM database will show that it is mounted.

8:25 am

When the ERP/SCM team doesn’t answer the phone, Joe from the Ops Center tracks them down in person and finds them on hold with Help Desk. They report that they cannot log in to the ERP/SCM system with any credentials, not even the local administrator account. The I&O team reports that last night’s backup of the ERP/SCM system completed.

8:30 am

The first trucks begin to arrive. Joe from the Ops Center reports that the ERP/SCM team cannot log in. Members of the IT team are now discovering that their teammates are each dealing with what look like isolated failures. With so much going on, the IT team calls a huddle.

Facilitator’s Notes: Expect questions from the participants during the huddle, but keep in mind that the full extent of the disruption is not yet visible to them. You may wish to prepare problem tickets in sealed envelopes to simulate the time it would take to discover, read, and convey each problem to other team members.

8:45 – 9:00 am

Susan calls someone she knows on the IT team to ask if there is something wrong with Active Directory because nobody can log in to the domain this morning. She says the Help Desk phones are ringing off the hook.

The HR manager shows up immediately after to report that she and her team can’t log in to the domain, and the payroll drop is today. The manager of the logistics team shows up, surprised to see everyone standing around, and says there is a line of trucks forming. His team can’t pick and prepare shipments from the warehouse either. A sales manager, the marketing team, and the customer service team also call with similar issues.

Facilitator’s Notes: The most important thing right now is the ERP/SCM system, as it is tied to the core of the business: shipping product. Yes, it is problematic that nobody is going to get paid, sales cannot take new orders, and so forth, but it’s existential if those trucks don’t get loaded and on their way. If the team does not focus their attention on this, then note it for the follow-up/lessons learned. 

Technical Background at 9:00 am: AD DCs are in a reboot loop. Workstations and laptops can log in locally. Applications are failing at sign-on because they depend on SSO. The ERP/SCM database is mounted and the application seems to be running, but error messages appear for every action.

The ERP/SCM database has been hit with targeted ransomware. Without the ERP/SCM system online, shipping is completely down. The orders are all managed through this system, and very few paper parallels exist. If the team decides to restore a backup, then they will note that the backups have been completing and think they will just be losing a few hours’ worth of work that can be manually re-entered. This is not the case though. The database got encrypted weeks ago, and the encryption keys were removed this morning from the database server by the attackers. This made everything look normal up until now, so unless they think to examine the backup reports, they will not notice the anomalous increase in storage consumption a few weeks back.

Initiating a restore at this point will take several hours to complete, and it is not certain which day’s backup is good. Note this privately, and reveal it at the end. Failover to the other site will not fix this problem since the replication jobs have replicated the encrypted versions and snapshots are retained for a short time only. In other words, any attempt at remediation at this point is wasted time, though the team should not be told this.
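For the Lessons Learned discussion, it can help to show how the storage anomaly mentioned above might have been caught in the backup reports. The sketch below is illustrative only: it assumes you can export daily backup sizes as a plain list of numbers, and the function name, window, threshold, and sample data are all hypothetical rather than taken from any particular backup product. The idea is that encrypted data typically compresses and deduplicates poorly, so the day the database was encrypted tends to show up as a sudden jump in backup size.

```python
from statistics import median


def flag_storage_jumps(daily_sizes_gb, window=7, threshold=1.5):
    """Return the indices of days whose backup size exceeds
    `threshold` times the median of the preceding `window` days.

    All parameters are illustrative assumptions; tune them to your own data.
    """
    flagged = []
    for i in range(window, len(daily_sizes_gb)):
        baseline = median(daily_sizes_gb[i - window:i])
        if baseline > 0 and daily_sizes_gb[i] > threshold * baseline:
            flagged.append(i)
    return flagged


# Hypothetical daily backup sizes in GB (not real data).
sizes = [40, 42, 39, 41, 43, 40, 42, 41, 310, 45, 44]

# Day 8 stands out against the previous week's median of 41 GB.
print(flag_storage_jumps(sizes))
```

Backups taken before the first flagged day are the candidates for a clean restore point, though they would still need to be verified before anyone relies on them.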

9:15 am

Trucks are arriving for deliveries and pick-ups; the drivers are not able to pull into the parking lot without blocking the entrance due to trucks already queued up. They are pulling over to call dispatch, then parking on the side of the highway, partially blocking a lane.

Facilitator’s Notes: Typically, at this time of year, 8 trucks per hour are turned around from the 6 loading bay positions, with loading times averaging 45 minutes. Each truck can carry 30 pallets loaded with 108 product units each, or 3,240 units per truck. Each unit is worth $299 retail, which puts the value of each truck at over $968,000 retail. The cost of not shipping is therefore about $62M retail per day (a figure that implies an 8-hour shipping day), or roughly $31M wholesale.

It would be a good idea to send someone out to locate and talk to the drivers to explain the situation. There are six lots already labeled and ready to go. Shipping them without the ERP/SCM system is a manual process and will slow down shipping significantly.
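If participants question the cost figures in the notes above, the arithmetic can be reproduced with a short script. This is a minimal sketch using the numbers from the scenario. The 8-hour shipping day and the 50% wholesale ratio are not stated in the scenario; they are assumptions inferred from the $62M and ~$31M figures. All names are made up for illustration.

```python
# Figures from the Facilitator's Notes.
LOADING_BAYS = 6
LOADING_MINUTES_PER_TRUCK = 45
PALLETS_PER_TRUCK = 30
UNITS_PER_PALLET = 108
RETAIL_PRICE_USD = 299

# Assumptions inferred from the $62M/day and ~$31M wholesale figures.
SHIPPING_HOURS_PER_DAY = 8
WHOLESALE_RATIO = 0.5


def trucks_per_hour(bays, minutes_per_truck):
    """Trucks turned around per hour when every bay is kept busy."""
    return bays * 60 / minutes_per_truck


def truck_value(pallets, units_per_pallet, unit_price):
    """Retail value of one fully loaded truck."""
    return pallets * units_per_pallet * unit_price


rate = trucks_per_hour(LOADING_BAYS, LOADING_MINUTES_PER_TRUCK)
per_truck = truck_value(PALLETS_PER_TRUCK, UNITS_PER_PALLET, RETAIL_PRICE_USD)
daily_retail = rate * SHIPPING_HOURS_PER_DAY * per_truck
daily_wholesale = daily_retail * WHOLESALE_RATIO

print(f"Trucks per hour:       {rate:.0f}")
print(f"Retail value per truck: ${per_truck:,}")
print(f"Daily retail cost:      ${daily_retail:,.0f}")
print(f"Daily wholesale cost:   ${daily_wholesale:,.0f}")
```

Changing the constants lets the facilitator adapt the stakes to their own organization's throughput and pricing.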

9:30 am

Your MSSP reaches you on your mobile phone and says your office line is busy. Nobody seems to have noticed. They report difficulty reaching your network to investigate a few alerts because a medium-sized DDoS attack is adding enough latency to cause timeouts. They are working with your ISP to mitigate the attack.

Facilitator’s Notes: The MSSP missed the telltale signs during the phishing attacks and is unaware of the breach and the suspicious traffic that has been ongoing for the past several weeks. It wasn’t until the DDoS attack began that someone did some log analysis and saw that they had missed the alerts.

10:00 am

The DDoS attack ends. Ransom demand! When the AD servers reboot in safe mode, they display a ransom message that tells you to contact the attackers via ICQ chat for instructions. The ransom is 50 bitcoins. Meanwhile, a police officer shows up at reception investigating why there are trucks blocking the highway.

Facilitator’s Notes: At this point, exfiltration is still going on. The ransomware is a cover for another operation, but the damage it causes is enormous all the same. Give the team as long as they need to discuss and decide what is priority 1, priority 2, and so on. Effectively, the holiday shipment is delayed and in jeopardy. Even if they decide to pay the ransom, they will not get the encryption key. The goal of the attackers was destruction.

The ERP/SCM database should be identified as the top priority, but getting AD back online and recognizing that there is an ongoing attack should be in the top three. After that, priorities are subjective, and what matters is the process by which the team reaches its decisions. By now, the participants should also have decided whether PR, customers, press, and law enforcement should be involved.

Invoices and logistics data can be recovered from either the mail servers (once AD is back online) or from the sales team’s laptops. That out-of-the-box-thinking move will allow them to manually piece together the order trail and effectively win this scenario. There will still be lost orders and significant delays.

The Wrap Up 

Once the team has finished each scenario, wrap up the exercise with a Lessons Learned discussion. You can then take these lessons and apply them to your own DR strategy so you’re more prepared if a disaster actually strikes.
