
WWW.INFORMATIONWEEK.COM
Strategies for Navigating a Multiday Outage
IT outages are a nightmare scenario for a business. Operations grind to a halt. Internal teams and customers, possibly thousands of them, are thrown into confusion. Lost revenue piles up by the minute. Each year, businesses lose $400 billion to unplanned downtime, according to Oxford Economics. While enterprises can do their best to prevent this scenario, we have seen multiple examples of outages that stretch out over days. Businesses may not be able to control when an outage happens, but they can control how they respond.

What Causes Multiday Outages?

Outages can stem from all manner of causes. In 2023, Scattered Spider and ALPHV hit MGM Resorts International with a ransomware attack that caused widespread disruption at its hotels and casinos. Slot machines were down. Guests couldn’t use the digital keys for their rooms.

But malicious attacks aren’t the only causes behind outages. The culprit can be something as seemingly innocuous as an update. In July 2024, a faulty sensor software update caused the CrowdStrike outage, resulting in global disruption that lasted for days.

The ubiquitous reliance on third parties means that a company may not be directly responsible for the incident; it might suffer an outage due to an issue that originates with one of its vendors, like CrowdStrike. Last year, fast-food behemoth McDonald’s also had a global outage caused by a configuration change made by one of its third parties.

At the beginning of this year, Capital One and several other banks had to weather a multiday outage. In this case, the vendor Fidelity Information Services (FIS) experienced power loss and hardware failure that kicked off outages for its customers.

Regardless of the cause, enterprise teams need to know how to work through outages. “We all understand that it's not if a breach happens or an outage occurs, it's when that occurs. [It’s] how you respond. That's what everybody looks at,” says Eric Schmitt, global CISO at claims management company Sedgwick.
The right response can minimize the long-term damage and give a company the opportunity to rebuild trust in its brand.

How Can Companies Prepare for One?

A multiday outage is a scenario that should be thoroughly covered by incident response and business continuity planning. A business should know its risks and build a plan around them. And often, that means using your imagination for the worst-case scenarios.

“The black swan. It's the things that you don't think of. The things that you don't know can happen really, you have to plan for this,” says Sebastian Straub, principal solutions architect at N2WS, an AWS and Azure backup and recovery company.

Planning for those unforeseeable events is a multidisciplinary exercise. Different teams need to weigh in and participate in tabletop exercises to best prepare a company for the possibility of a lengthy outage. “It should never be a single team in a vacuum trying to identify all the risks that may impact the company,” says Schmitt.

What Happens During the Response?

So, an outage happens. What now? It is time to take that incident response plan off the shelf and put it into action. “There should be an incident commander or someone who's designated within the organization to take [the] lead in these types of incidents,” says Quentin Rhoads-Herrera, senior director of cybersecurity platforms at cybersecurity company Stratascale.

However the incident is discovered, employees need to be ready to alert the teams involved in incident response and all of the stakeholders affected by the outage. “You need to alert all of the different departments to the fact that, yes, we are experiencing an outage, and sometimes people are just too reluctant to do that,” says Straub.

Once the right people are alerted, they can work through remediation and attribution. Communication is one of the most important aspects of working through an outage that drags on, and it is one of the toughest pieces to get right.
“You see in many, many outages that communications are one of the weakest things,” says Schmitt. It is hard to strike a balance between transparency, accuracy, and risk management when information about an outage is flooding in and changing so quickly. “You don't want to pass along incorrect information, but being transparent and crisp in your communication outbound helps build trust with your end users, your investors, your clients, whoever it may be,” says Rhoads-Herrera.

Finding that balance is easier when you include your communications and legal teams in incident response planning, rather than waiting until you’re in the thick of a real-life incident. While a specific outage and the timeline for recovery are going to dictate what information a business is able to share, committing to a regular cadence of communication, whether every few hours or once a day, goes a long way. “Long-term, if you're providing quality services and you're not letting your customers or stakeholders down in your communications during the event, I think your brand can recover from that,” says Schmitt.

The pressure to get operations back up and running is immense. That goal is paramount, but it is important not to lose sight of the human element. People are going to be working long days, not only during the initial response but well beyond it. “These events are not eight hours and done. They're going to be multiday initial response, and the long-term remediation could stretch out over months or even years,” Schmitt points out.

People are going to be tired and stressed. Emotions are going to run high. If leaders don’t pay attention to their people, they risk more mistakes, as well as burnout that leads to employee churn in the long term.

One of the most important ways to safeguard the people responsible for working through a lengthy outage comes down to culture. People need to know that mistakes happen, and that it is OK to speak up so that everyone can get on the same page and work through recovery.
“[Make] sure people understand that you don't need to be updating your resume on one screen while you're responding to an event on the other,” says Schmitt.

It can be easy to get lost in the trenches of the response, so there should be a leader who keeps an eye on people and their hours worked. When someone is hitting 10- and 12-hour days, enforce breaks. “I saw a firm … put all of their employees up in very close hotel rooms. They made sure lunch, breakfast, and dinner was catered. They had rotating teams going in and out so that people had downtime. They had rest,” Rhoads-Herrera shares.

How Can Companies Learn from Experience?

An outage, like any other major incident, needs to undergo a thorough postmortem. What went well in the response? What didn’t? How should the incident response plan be updated? However tempting it may be to put an outage behind you, taking the time to answer these questions is valuable. “If you're trying to hide what the actual issue was, you're trying to downplay it, well then you're robbing yourself of the opportunity to grow and become stronger and more versatile,” says Straub.

Breaking down the cause of an outage and the enterprise’s response is constructive, but playing the blame game rarely is. “It's all about listing the facts and digging into what exactly happened, being open and transparent about it that leads to a better outcome versus passing blame or walking in trying to deflect,” says Rhoads-Herrera.

Are We Going to See More Multiday Outages?

Reliance on third parties is only growing, and the risk that comes with that interconnectedness is growing along with it. Cyberattacks are in no way slowing down. Natural disasters are happening more often and becoming more destructive. Any of these can cause outages, and it is certainly possible that we will see more multiday outages.
“The companies that are going to be most successful in the future are those that are looking at: what are my risks and making the investment to address those so that when the next event happens, regardless of root cause, they're able to quickly pivot and recover more quickly,” says Schmitt.