
2024: the year misconfigurations exposed digital vulnerabilities
www.computerweekly.com
Opinion

Small configuration errors cascaded into major outages during 2024. Mike Hicks, from Cisco ThousandEyes, propounds techniques to defend digital resilience against tales of the unexpected

By Mike Hicks
Published: 10 Mar 2025

Imagine the impact of a sudden service disruption on your business. Customers unable to access your platform, transactions put on hold, and your team racing against the clock to fix the issue. These aren't far-fetched scenarios; they're the kinds of challenges many organisations faced in 2024 when small configuration errors cascaded into major outages.

Our increasingly digital world has provided incredible opportunities for growth and efficiency, but it's also introduced new vulnerabilities. Configuration changes have always had the potential to take out services, but with more of the digital landscape managed and configured with code, the propensity for mistakes is now much higher. The missteps of 2024 were a stark reminder that even minor errors can disrupt operations, dent user trust, and create lasting challenges for businesses across all industries.

This makes digital resilience more than a best practice; it's a critical necessity. By examining the high-profile outages of 2024 and understanding their causes, businesses can take actionable steps to build stronger, more reliable systems and safeguard their digital experiences.

Identifying the route cause

When it comes to configuration-caused outages, businesses were challenged by two key trends over the last year that elevate the importance of digital resilience in the face of disruptions: continuous integration and continuous delivery (CI/CD), and the accelerated deployment of modern applications and cloud services.

The first trend, CI/CD, characterises modern software engineering best practices.
It allows product and engineering teams to make small modifications and improvements faster and with greater frequency, but on the flip side, the rapid pace shortens the time available for end-to-end testing. In addition, the ever-changing nature of application code makes its behaviour unpredictable, even on a day-to-day basis.

The second trend is the accelerated deployment of modern applications and cloud services, which are inherently distributed in design, including their underlying infrastructure. Digital applications comprise many components that are orchestrated together to deliver a single, seamless experience. These components are often developed by different agile teams and may reside on either owned or unowned (third-party) infrastructure. In these environments, we often observe a team making a change to improve its own portion of the application without complete visibility into the flow-on impact that change might have on the rest of the infrastructure.

While the resulting misconfigurations may be unintentional, software configuration outages can have an impact out of proportion to the size of the change. So, what does this look like in practice for organisations?

2024 - the year of outages

In the networking space, unintended misconfiguration of routing policies has been a recurring issue over many years. A service provider, for instance, may mistakenly insert itself into a traffic path by advertising a prefix it doesn't own or control, and then be unable to handle the sudden traffic influx, leading to timeouts and other connectivity-related failures for end users. One example took place in October last year, when a number of OVHcloud services were subject to a faulty configuration that impacted several regional telecom providers.

With accelerated cloud adoption, configuration errors have also become an increasingly common issue in the cloud, impacting security functionality, performance, and availability.
Last year, for example, Azure was impacted twice: once in January, when an erroneous configuration change triggered a dormant defect that resulted in a seven-hour degradation of the Azure Resource Manager; and once in July, when a configuration change impacted backend connections to compute and storage resources, ultimately impacting services such as Confluent, Elastic Cloud, and Microsoft 365. Later in the year, Salesforce suffered a similar incident that prevented global users from accessing the cloud service when critical information was left out of an updated configuration file.

It isn't just the network or cloud infrastructure where configuration errors occur. Problems also manifest within the applications themselves. Notably, in July last year, an issue with a single CrowdStrike configuration file resulted in system crashes and blue screens of death (BSOD) on affected Windows systems worldwide, but there were other incidents as well. A series of temporary issues with ChatGPT pointed to configuration changes and re-architecture work intended to improve the user experience. And Square merchants experienced payment problems when a new feature configuration could not be interpreted by Android devices.

Digital resilience in the face of disruption

In 2024, many configuration changes not only degraded digital experiences but also disrupted the delivery of the service completely. It's this subset of incidents that produced the biggest lessons of 2024, and the mistakes behind them shouldn't be repeated in 2025.

For product owners and operations teams, the drive to continuously improve remains as important as ever, but user experience needs a bigger focus. Automation and assurance technologies both have a role to play here. These solutions can compare ongoing patterns against known outage patterns, providing visibility and correlating signals to allow early detection of degradations or disruptions to an application or other IT asset.
In the case of a configuration change gone wrong, this could be the difference between a speedy rollback and a lengthy troubleshooting process.

Successfully implementing a configuration change on the first attempt is key for businesses across all industries. It indicates that the organisation has access to ample data and insights all the way from the end user to the cloud, allowing it to adequately assess the potential impact of changes made at any point in the end-to-end delivery chain.

Whether caused by a misconfiguration or something else, the outages of 2024 offer lessons, and minimising the occurrence and impact of any disruption will be core to achieving digital resilience in 2025.

Mike Hicks is a Principal Solutions Analyst at Cisco ThousandEyes