The Cloud Keeps Breaking

Four months ago, we wrote about the October 2025 AWS outage that brought down over 3,500 companies across 60 countries. Smart beds overheated. Delivery apps went silent. Entire business operations froze because one cloud provider had a bad morning.

We ended that piece with a set of lessons about resilience and preparedness. What we didn’t expect was how quickly those lessons would be tested again.

In January 2026, Microsoft 365 suffered a global outage lasting nearly nine hours. Exchange, Teams, Outlook, and SharePoint all went dark simultaneously. Automated failover systems, the very mechanisms designed to prevent exactly this kind of cascading failure, couldn’t stop it. Millions of users across thousands of organizations sat idle, unable to send an email or join a meeting.

Then in February, Cloudflare experienced a six-hour disruption when an internal maintenance task accidentally withdrew customer IP address routes from the internet. Services that relied on Cloudflare’s infrastructure simply vanished from the web.

These are not freak events anymore. They are the new normal.

Why the outages are getting worse

There is a structural reason behind the rising frequency and severity of cloud outages, and it has everything to do with artificial intelligence.

The major hyperscalers (AWS, Azure, and Google Cloud) are in an arms race to build AI infrastructure. GPU clusters, specialized cooling systems, massive power draws. The economics of this buildout are staggering. Every dollar poured into new AI data centers is a dollar not spent on maintaining and upgrading the legacy infrastructure that most enterprise workloads still run on.

Forrester made this explicit in their Predictions 2026 report, forecasting at least two major multi-day hyperscaler outages this year specifically because cloud providers are prioritizing AI infrastructure over legacy system maintenance. The aging x86 and ARM environments that power the majority of enterprise applications are being left to accumulate technical debt while the spotlight and the budget shift to GPU clusters.

The result is a growing reliability gap. The systems you depend on today are being maintained with yesterday’s budget.

The identity problem nobody talks about

When most people think about cloud outages, they picture websites going down or apps becoming unresponsive. But there is a deeper layer of risk that gets far less attention: identity.

Modern authentication systems are deeply woven into cloud infrastructure. Your Active Directory, your SSO provider, your OAuth tokens, and your certificate authorities all typically depend on the same cloud fabric that just went offline. When the cloud provider goes down, it is not just your applications that stop working. Your people literally cannot prove who they are to your own systems.

Think about what that means in practice. Even if you have a backup application running on a different provider, your employees may not be able to log in to use it because the identity layer that verifies their credentials is hosted on the provider that just went down.

This is the hidden single point of failure that most organizations only discover during an actual outage.
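
One way to surface this kind of hidden dependency before an outage does is to write down which provider hosts each service and what each service depends on, then ask what survives if one provider disappears. The following is a minimal sketch of that check, not a real inventory tool. The service names, the provider labels, and the survives_outage helper are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    provider: str                      # where the service itself is hosted
    depends_on: tuple[str, ...] = ()   # names of services it needs to function


def survives_outage(name, services, failed_provider, _seen=None):
    """Return True if `name` and everything it depends on avoid `failed_provider`."""
    seen = _seen if _seen is not None else set()
    if name in seen:
        # Already being checked higher up the chain; guards against cycles.
        return True
    seen.add(name)
    svc = services[name]
    if svc.provider == failed_provider:
        return False
    return all(
        survives_outage(dep, services, failed_provider, seen)
        for dep in svc.depends_on
    )


# Hypothetical inventory: the backup app lives on provider-b,
# but it still authenticates against SSO hosted on provider-a.
services = {
    "sso": Service("sso", "provider-a"),
    "primary-app": Service("primary-app", "provider-a", ("sso",)),
    "backup-app": Service("backup-app", "provider-b", ("sso",)),
    "status-page": Service("status-page", "provider-b"),
}

survivors = sorted(
    name for name in services
    if survives_outage(name, services, "provider-a")
)
print(survivors)
```

In this made-up inventory the backup application does not survive a provider-a outage even though it runs on provider-b, because its identity dependency does not. A real inventory would also need to cover DNS, certificate issuance, and other shared services.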

What the smart companies are doing differently

The response to this new reality is not panic. It is architecture.

Multi-cloud is no longer optional. The share of organizations adopting multi-cloud strategies jumped to 86% in 2025, and for good reason. Distributing critical workloads across providers means no single outage can take everything offline. But multi-cloud done poorly is just twice the complexity. The key is to be intentional about which workloads go where, rather than spreading everything everywhere and hoping for the best.

Identity resilience needs its own strategy. Smart organizations are decoupling their identity infrastructure from any single cloud provider. That means maintaining independent identity stores, implementing local authentication fallbacks, and testing whether employees can actually access critical systems when the primary identity provider is unreachable. If you have never tested this, you should assume it will not work when you need it.
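
To make the idea of a local authentication fallback concrete, here is a minimal sketch under stated assumptions. It assumes a small break-glass credential store kept on infrastructure you control, and a primary_verify callable that wraps whatever your real identity provider exposes. LocalCredentialStore, IdentityProviderUnavailable, and authenticate are hypothetical names, not part of any vendor SDK, and a production design would also need MFA, audit logging, and credential rotation.

```python
import hashlib
import hmac
import os


class IdentityProviderUnavailable(Exception):
    """Raised when the primary identity provider cannot be reached."""


class LocalCredentialStore:
    """Hypothetical break-glass store holding salted password hashes."""

    def __init__(self):
        self._records = {}

    def enroll(self, user, password):
        salt = os.urandom(16)
        self._records[user] = (salt, self._hash(password, salt))

    def verify(self, user, password):
        record = self._records.get(user)
        if record is None:
            return False
        salt, expected = record
        # Constant-time comparison to avoid leaking hash prefixes.
        return hmac.compare_digest(expected, self._hash(password, salt))

    @staticmethod
    def _hash(password, salt):
        return hashlib.pbkdf2_hmac(
            "sha256", password.encode("utf-8"), salt, 200_000
        )


def authenticate(user, password, primary_verify, fallback_store):
    """Try the primary provider first; use the local store only if it is unreachable.

    A rejection from the primary provider is final and never falls through.
    """
    try:
        return primary_verify(user, password), "primary"
    except IdentityProviderUnavailable:
        return fallback_store.verify(user, password), "fallback"


# Simulate the test the article recommends: the primary provider is down.
def primary_is_down(user, password):
    raise IdentityProviderUnavailable("primary identity provider unreachable")


store = LocalCredentialStore()
store.enroll("alice", "correct horse battery staple")

result = authenticate(
    "alice", "correct horse battery staple", primary_is_down, store
)
print(result)
```

The design choice worth noting is that the fallback is used only when the primary provider is unreachable, never when it explicitly rejects a login. Otherwise the local store would become a way around the primary provider's decisions.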

The “shadow cloud” approach is gaining traction too. Some companies are quietly maintaining relationships with smaller, regional cloud providers as backup infrastructure. These are not full production environments. They are lightweight standby systems that can keep essential operations running during a hyperscaler outage. Think of it as an insurance policy that costs far less than a day of downtime.

And then there is the business continuity conversation. Most existing BCP documents were written for a world where cloud outages lasted minutes, not days. If your plan does not account for a 48-hour loss of your primary cloud provider, it is outdated. Run the tabletop exercise. Identify which processes completely stop, which ones can limp along, and which ones need to keep running no matter what.
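
The sorting step of that tabletop exercise can be roughed out in a few lines. This is a sketch under loud assumptions: the two inputs (how many hours of downtime a process can tolerate, and whether a manual workaround exists) and the rule that maps them to the three buckets are our own simplification, and the process names and numbers are invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Impact(Enum):
    STOPS = "completely stops"
    LIMPS_ALONG = "can limp along"
    MUST_KEEP_RUNNING = "needs to keep running"


@dataclass(frozen=True)
class Process:
    name: str
    max_tolerable_downtime_hours: float
    has_manual_workaround: bool


def classify(process, outage_hours=48.0):
    """Bucket a process for an outage of `outage_hours` (assumed rule, see lead-in)."""
    if process.max_tolerable_downtime_hours < outage_hours:
        # The business cannot wait out the outage, so this needs standby capacity.
        return Impact.MUST_KEEP_RUNNING
    if process.has_manual_workaround:
        return Impact.LIMPS_ALONG
    return Impact.STOPS


# Invented examples; replace with the processes from your own BCP.
processes = [
    Process("order intake", max_tolerable_downtime_hours=4, has_manual_workaround=False),
    Process("customer support", max_tolerable_downtime_hours=96, has_manual_workaround=True),
    Process("monthly reporting", max_tolerable_downtime_hours=168, has_manual_workaround=False),
]

plan = {p.name: classify(p) for p in processes}
for name, impact in plan.items():
    print(f"{name}: {impact.value}")
```

The value of the exercise is less in the code than in forcing someone to write down a tolerable downtime figure for every process. Anything that lands in the last bucket is a candidate for the standby infrastructure described above.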

The uncomfortable question

Here is the thing that makes this conversation difficult for technology leaders: you cannot control your cloud provider’s investment priorities. You cannot force AWS to spend more on maintaining legacy infrastructure instead of building the next generation of AI compute. You cannot guarantee that Azure will never have another nine-hour outage.

What you can control is your own architecture, your own redundancy strategy, and your own recovery plans.

The cloud is not going away. It remains one of the most powerful platforms for running modern business operations. But the era of treating any single cloud provider as infallible infrastructure is over. The October 2025 AWS outage was a warning. The January 2026 Microsoft outage was confirmation. The Cloudflare incident was a reminder that this extends beyond the big three.

At Intworks, we help organizations design cloud architectures that are built for resilience, not just performance. If your current setup depends on a single provider for everything from compute to identity, we should talk. Reach out to our cloud strategy team to start the conversation.