Cloudflare Outage, June 2025: Lessons for Software Engineers
- Dependencies have dependencies: mapping the complete tree down 3-4 levels deep reveals hidden failure points that can cause catastrophic chain reactions.
- Circuit breakers must fail fast and safely, isolating problematic components before the blast radius spreads throughout the entire system.
- Graceful degradation with multiple fallback levels maintains core functionality even when critical services fail, prioritizing user experience.
- MTTR (recovery time) is more controllable than MTBF (time between failures): optimize for 2-minute detection, 5-minute response, and 30-minute mitigation.
- Blast radius indicators like error chain depth and dependency fan-out are critical for understanding whether an incident is getting better or worse.
It's Thursday. You grab your mug, open your laptop, and suddenly half the internet seems broken. Sites loading slowly, authentication failing, and your favorite apps showing mysterious error messages. Welcome to June 12, 2025 — the day Cloudflare learned a hard lesson about blast radius.
For 2 hours and 28 minutes, one of the internet's most reliable infrastructure providers experienced a chain reaction that spread to millions of applications around the world. This isn't just another post-mortem analysis; it's a story about how modern distributed systems can fail spectacularly and, more importantly, how we can build better systems.
The Beginning of the Domino Effect
It all started innocently. Somewhere in a data center, a third-party storage provider — one that most Cloudflare customers had never heard of — started having problems. It wasn't Cloudflare's fault, per se. They were simply a customer, just like you and me.
But here's where the story gets interesting.
This obscure storage provider turned out to be the backbone of Cloudflare's Workers Key-Value (KV) service — a critical piece of infrastructure that thousands of applications depended on for everything from user sessions to configuration data. When it stumbled, the dominoes began to fall:
- 91% of Workers KV requests started failing
- 100% failure rate on Access logins
- 90%+ error rate on Stream
- Workers AI, Images, Turnstile, and parts of Dashboard also affected
- Thousands of dependent applications around the globe started throwing errors
- Customer support channels lit up like Christmas trees
The Shock Wave Path
The Cloudflare incident is a masterclass in how modern systems fail. It wasn't a dramatic explosion — it was more like watching a carefully balanced house of cards collapse in slow motion.
The Failure Timeline
Ground Zero (T+0 minutes): A third-party storage provider experiences internal issues. Most of the world doesn't notice yet.
Primary Impact (T+5 minutes): Cloudflare's Workers KV service starts timing out. Alert dashboards begin showing yellow warnings that soon turn red.
Secondary Impact (T+15 minutes): Services that depend on Workers KV — Access for corporate authentication, Stream for video delivery, Workers AI for machine learning inference — start failing completely. These aren't graceful degradations; they're hard failures.
Tertiary Impact (T+30 minutes): Customer applications that relied on these services start experiencing outages. E-commerce sites can't authenticate users. Streaming platforms can't deliver content. AI-powered features simply disappear.
Ecosystem Impact (T+60 minutes): The blast radius has now extended to millions of end users who have no idea what "Workers KV" means. They just know their favorite apps aren't working.
This progression reveals something crucial about modern distributed systems: our dependencies have dependencies, and those dependencies have dependencies. We've built an intricate web where a single thread can bring down entire sections.
Hidden Dependencies
Here's what makes this story particularly compelling from an engineering standpoint: Cloudflare isn't a startup running systems with duct tape and prayer. They're one of the world's most sophisticated infrastructure companies, with brilliant engineers who understand distributed systems better than most.
Yet they were caught in this chain reaction.
Why? Because of what I call the "invisible dependencies problem." When you're building at scale, you tend to think about your immediate dependencies — the databases you talk to, the APIs you call, the services you integrate with. But you rarely map the dependency tree three or four levels deep.
Consider how most engineering teams think about dependencies:
Our App → Cloudflare Workers KV → ???
How we should be thinking:
Our App → Cloudflare Workers KV → Third-party Storage → Provider's Infrastructure → Their Suppliers → Their Dependencies...
Each arrow in that chain represents a potential point of failure. And here's the crucial point: the farther down the chain the failure occurs, the harder it is to predict, prevent, or even understand when things go wrong.
"Oh, Leo, you need to stop getting into these messes..."
But for companies like Cloudflare — whose entire business model depends on being the reliable infrastructure layer for millions of websites — this level of dependency analysis isn't optional. It's existential.
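In practice, mapping the tree can start embarrassingly small. Here's a minimal sketch, with made-up service names, of how a team might record dependencies explicitly and walk them a few levels deep so the "level three" suppliers stop being invisible:

// Illustrative only: a tiny, hand-maintained registry of who depends on whom.
// Real trees have cycles and dozens of nodes; this sketch assumes a small acyclic graph.
data class Dependency(
    val name: String,
    val dependsOn: List<Dependency> = emptyList()
)

// Print every dependency with its depth, so transitive suppliers become visible.
fun printTree(dep: Dependency, depth: Int = 0) {
    println("  ".repeat(depth) + "level $depth: ${dep.name}")
    dep.dependsOn.forEach { printTree(it, depth + 1) }
}

fun main() {
    val storage = Dependency(
        "Third-party Storage",
        listOf(Dependency("Provider's Infrastructure", listOf(Dependency("Their Suppliers"))))
    )
    val workersKV = Dependency("Cloudflare Workers KV", listOf(storage))
    printTree(Dependency("Our App", listOf(workersKV)))
}

Even a toy exercise like this tends to surface at least one dependency nobody on the team can explain.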
Anti-Fragile Systems
So how do we build systems that not only survive failures like this but actually get stronger from them? The answer lies in embracing what Nassim Taleb calls "anti-fragility" — systems that gain from chaos rather than just tolerating it.
"Systems that gain from chaos rather than just tolerating it."
Circuit Breaker
Think of a circuit breaker as a train's emergency stop button. When things start going wrong, it prevents the situation from becoming catastrophically worse.
Here's a simplified example of how you might implement one:
enum class CircuitState { CLOSED, OPEN, HALF_OPEN }

class CircuitBreaker(
    private val threshold: Int = 5,       // consecutive failures before the circuit opens
    private val timeoutMs: Long = 60_000  // how long to stay open before probing again
) {
    private var failureCount = 0
    private var lastFailure = 0L
    private var state = CircuitState.CLOSED

    suspend fun <T> call(operation: suspend () -> T): T {
        when (state) {
            CircuitState.OPEN -> {
                // After the cool-down period, let a single probe request through.
                if (System.currentTimeMillis() - lastFailure > timeoutMs) {
                    state = CircuitState.HALF_OPEN
                } else throw Exception("Circuit breaker is OPEN")
            }
            else -> { /* CLOSED or HALF_OPEN: proceed with the call */ }
        }
        return try {
            operation().also { onSuccess() }
        } catch (e: Exception) {
            onFailure()
            throw e
        }
    }

    private fun onSuccess() {
        failureCount = 0
        state = CircuitState.CLOSED
    }

    private fun onFailure() {
        failureCount++
        lastFailure = System.currentTimeMillis()
        if (failureCount >= threshold) state = CircuitState.OPEN
    }
}
The beauty of circuit breakers: They fail fast and fail safely. Instead of letting a problematic dependency drag your entire system down, you isolate the problem and give the failing component time to recover.
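To make that concrete, here's roughly how the breaker above could wrap a Workers KV read. The workersKV client is an assumed stand-in (the same one used in the fallback example later); only the wrapping pattern matters:

// One shared breaker per fragile dependency; repeated failures trip it open.
val kvBreaker = CircuitBreaker(threshold = 5, timeoutMs = 60_000L)

suspend fun readFromKV(key: String): String? =
    try {
        kvBreaker.call { workersKV.get(key) } // workersKV: assumed client returning a String payload
    } catch (e: Exception) {
        null // breaker is open or the call failed: fail fast and let the caller degrade
    }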
Graceful Degradation: The Art of Failing With Style
Is it over, Jessica?
One of the most elegant aspects of resilient systems is their ability to degrade gracefully — to continue providing value even when parts of the system are failing. Think of it like a car that can still get you home even when the air conditioning breaks.
Here's an example:
// workersKV and database are assumed clients for the KV store and the primary database.
data class UserProfile(
    val id: String,
    val name: String,
    val features: String,
    val message: String? = null
)

suspend fun getUserProfile(userId: String): UserProfile {
    return runCatching {
        // Primary: Fast Workers KV (50ms, full features)
        workersKV.get(userId)
    }.recoverCatching {
        println("Workers KV unavailable, falling back to database")
        // Secondary: Database + Redis (200ms, 90% features)
        database.getUser(userId)
    }.getOrElse {
        println("Database unavailable, using static profile")
        // Tertiary: Static defaults (5ms, 30% features)
        UserProfile(
            id = userId,
            name = "User",
            features = "limited",
            message = "Some features temporarily unavailable"
        )
    }
}
Notice what's happening here: each fallback maintains core functionality while reducing features. The user remains logged in and can continue using the app, even if they lose some personalization. This is infinitely better than a blank error page.
The Dependency Triage System
Not all dependencies are created equal. The Cloudflare incident shows us the critical importance of classifying our dependencies by their potential blast radius:
🔴 Critical Dependencies (Red Zone)
These are the ones that can bring down your entire system. In Cloudflare's case, Workers KV fell into this category because so many other services depended on it.
- Critical dependencies with chain reaction potential
- No graceful fallback possible
- Require immediate circuit breaker protection
- Should have multiple redundant providers
🟡 Important Dependencies (Yellow Zone)
These affect functionality but don't kill the system. Analytics services often fall here.
- Soft dependencies that can degrade gracefully
- Noticeable when they fail, but not catastrophic
- Good candidates for async processing and retry logic
🟢 Optional Dependencies (Green Zone)
Nice-to-have features that can fail without anyone noticing immediately.
- Edge CDN caches, recommendation engines, A/B testing platforms
- Can timeout or fail silently
- Often safe to disable during incidents
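One lightweight way to make this triage actionable is to write it down in code and let the tier drive timeouts, breaker usage, and what you're allowed to switch off during an incident. The names and numbers below are illustrative, not a prescription:

// Illustrative triage table: the tier decides how aggressively a dependency is protected.
enum class Tier { CRITICAL, IMPORTANT, OPTIONAL }

data class DependencyPolicy(
    val name: String,
    val tier: Tier,
    val timeoutMs: Long,
    val useCircuitBreaker: Boolean,
    val canFailSilently: Boolean
)

val dependencyPolicies = listOf(
    DependencyPolicy("workers-kv", Tier.CRITICAL, timeoutMs = 500L, useCircuitBreaker = true, canFailSilently = false),
    DependencyPolicy("analytics", Tier.IMPORTANT, timeoutMs = 1_000L, useCircuitBreaker = true, canFailSilently = false),
    DependencyPolicy("ab-testing", Tier.OPTIONAL, timeoutMs = 200L, useCircuitBreaker = false, canFailSilently = true)
)

// During an incident, the green zone is the first thing to switch off.
fun safeToDisable(): List<String> =
    dependencyPolicies.filter { it.canFailSilently }.map { it.name }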
The Human Element
Here's something that's often overlooked in technical post-mortems: resilient systems require resilient teams. The Cloudflare incident lasted approximately 2.5 hours not just because of technical failures, but because of human factors — detection time, escalation procedures, decision-making under pressure.
The Psychology of Incident Response
When systems start failing, engineers face what psychologists call "cognitive load overload." You're simultaneously trying to:
- Understand what's happening (diagnosis)
- Prevent it from getting worse (containment)
- Fix the immediate problem (mitigation)
- Communicate with stakeholders (coordination)
- Document everything for later (learning)
That's a lot to juggle when your phone is buzzing with alerts and your Slack is filling with "is it just me or..." messages.
The most resilient organizations build systems that reduce cognitive load during incidents, not increase it. They practice regularly, have clear escalation paths, and most importantly, embrace failure as a learning opportunity rather than a blame opportunity.
Defense in Depth
Think of resilience like a medieval castle's defense system. You don't rely on just one strong wall — you have moats, multiple walls, archers, and escape routes. Modern systems need the same layered approach:
- Application Layer: Circuit breakers, timeouts, graceful degradation
- Service Layer: Load balancing, health checks, auto-scaling
- Infrastructure Layer: Multi-region deployment, replication, CDN
- Organizational Layer: Incident response training, blameless post-mortems, chaos engineering practice
The Cloudflare incident shows what happens when one of these layers has a gap. The technical layers (application, service, infrastructure) held up reasonably well, but dependency management, which cuts across all of them, had a blind spot.
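Zooming back in on the application layer, those defenses compose around a single risky call: a timeout bounds how long you wait, the circuit breaker from earlier stops you hammering a failing dependency, and a local fallback keeps the user moving. A rough sketch, reusing the assumed workersKV client; cachedConfig is a hypothetical local fallback:

import kotlinx.coroutines.withTimeoutOrNull

val configBreaker = CircuitBreaker(threshold = 5, timeoutMs = 60_000L)

suspend fun getConfig(key: String): String {
    val fromKV = try {
        withTimeoutOrNull(500L) {                      // 1. bound how long we wait
            configBreaker.call { workersKV.get(key) }  // 2. stop hammering a failing dependency
        }
    } catch (e: Exception) {
        null                                           // breaker open or the call failed
    }
    return fromKV ?: cachedConfig(key)                 // 3. degrade to a local cached default (hypothetical helper)
}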
Monitoring That Actually Matters
Most monitoring systems are great at telling you that something is broken. What they're terrible at is telling you how broken things are and how far the damage is spreading.
After incidents like Cloudflare's, smart teams add "blast radius indicators" to their dashboards:
- Error chain depth: How many service layers are affected
- Customer impact percentage: Fraction of users experiencing issues
- Service dependency fan-out: Number of services that could be affected
- Recovery confidence level: How certain we are that our fix will work
These metrics help answer the critical question during an incident: "Is this getting better or worse?"
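None of these indicators require fancy tooling. Even a small snapshot, recomputed every minute from data you already collect, can answer that question; the fields and the "getting worse" heuristic below are invented for illustration:

// Illustrative blast-radius snapshot, recomputed on a fixed interval during an incident.
data class BlastRadius(
    val errorChainDepth: Int,        // how many service layers are currently failing
    val customerImpactPct: Double,   // fraction of users seeing errors, 0.0..100.0
    val dependencyFanOut: Int,       // services downstream of the failing component
    val recoveryConfidence: Double   // team's confidence in the current fix, 0.0..1.0
)

// Naive heuristic: are we trending better or worse than the previous snapshot?
fun isGettingWorse(previous: BlastRadius, current: BlastRadius): Boolean =
    current.errorChainDepth > previous.errorChainDepth ||
        current.customerImpactPct > previous.customerImpactPct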
Mindset Shift
Here's a counterintuitive insight from the Cloudflare incident: focusing on Mean Time to Recovery (MTTR) is more valuable than obsessively focusing on Mean Time Between Failures (MTBF).
Why? Because you can't control when third-party dependencies fail, but you can definitely control how quickly you detect, respond to, and recover from those failures.
Like a Formula 1 pit crew, the best teams optimize every stage of that response:
- Detection: Automated alerts that wake people up (< 2 minutes)
- Response: Clear runbooks that anyone can follow (< 5 minutes)
- Mitigation: Practiced procedures that work under pressure (< 30 minutes)
- Recovery: Gradual restoration with rollback capability (< 60 minutes)
Cloudflare took approximately 2.5 hours for full recovery, but started seeing recovery signs about 2 hours after incident start.
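If you want those stage budgets to be more than aspirational, record the milestone timestamps for every incident and check them afterwards. A rough sketch using the targets from the list above, with each stage measured from the previous milestone (one of several reasonable conventions):

import java.time.Duration
import java.time.Instant

// Milestones recorded during an incident; the budgets match the list above.
data class IncidentTimeline(
    val started: Instant,
    val detected: Instant,
    val responded: Instant,
    val mitigated: Instant,
    val recovered: Instant
)

fun missedTargets(t: IncidentTimeline): List<String> = buildList {
    if (Duration.between(t.started, t.detected) > Duration.ofMinutes(2)) add("detection > 2 min")
    if (Duration.between(t.detected, t.responded) > Duration.ofMinutes(5)) add("response > 5 min")
    if (Duration.between(t.responded, t.mitigated) > Duration.ofMinutes(30)) add("mitigation > 30 min")
    if (Duration.between(t.mitigated, t.recovered) > Duration.ofMinutes(60)) add("recovery > 60 min")
}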
Lessons Learned
1. Identify Blind Spots
Map your dependency tree three or four levels deep, not just your immediate integrations. The failures that hurt most come from transitive dependencies you didn't even know you had.
2. Implement "Graceful Degradation" From the Start
Instead of treating fallbacks as future nice-to-haves, prioritize them in your architecture. Every critical user flow should have at least one fallback mode, and for the most critical flows I'd recommend two.
3. Test Failures in a Controlled Environment
Start small: Disable a non-critical service for 10 minutes during low-traffic hours. Watch what breaks. Fix those things. Gradually work up to more critical components.
The goal isn't to break things for fun — it's to discover how your system behaves under stress before your users do.
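One controlled way to run that experiment in code is a tiny fault-injection wrapper, enabled by a flag for one dependency during a chosen window. The flag, wrapper, and recommendationClient below are illustrative, not a real chaos-engineering library:

// Illustrative fault injection: flip the flag for one non-critical dependency
// during a low-traffic window and watch what actually breaks.
object ChaosFlags {
    @Volatile var failRecommendations: Boolean = false
}

suspend fun <T> withChaos(enabled: Boolean, block: suspend () -> T): T {
    if (enabled) throw RuntimeException("Injected failure (chaos experiment)")
    return block()
}

// Callers of the (assumed) recommendations client go through the wrapper,
// which proves whether the empty-list fallback actually works.
suspend fun getRecommendations(userId: String): List<String> =
    try {
        withChaos(ChaosFlags.failRecommendations) { recommendationClient.fetch(userId) }
    } catch (e: Exception) {
        emptyList()
    }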
4. Build Incident Response Muscle Memory
The time to discover your incident response process isn't during a real incident. Run monthly "game days" where you simulate failures and practice your response.
5. Embrace Blameless Post-Mortem Writing
When things go wrong, resist the urge to find the "responsible person." Instead, ask: "What system conditions made this failure possible?" and "How can we make it impossible to repeat the same mistake?"
Complexity Requires Systems Thinking
The Cloudflare incident isn't really about Cloudflare. It's about the fundamental challenge of building reliable systems in an interconnected world where your system's reliability is only as strong as your weakest transitive dependency.
This reality requires a shift in how we think about system design:
- From preventing failures to containing blast radius
- From perfect uptime to graceful degradation
- From blame culture to learning culture
- From reactive monitoring to predictive observability
Conclusion
The story of Cloudflare's outage is a learning opportunity. It shows us that even the world's most sophisticated infrastructure companies can be brought down by failures they didn't anticipate, in dependencies they barely knew existed or simply trusted to stay up.
But here's the central point: the goal isn't to prevent every possible failure, which is impossible, but to build systems and teams flexible enough to absorb the failures that do happen. In the midst of chaos, the unexpected is the only certainty we have.
Next time you're designing a system, don't just ask "How do I prevent this from failing?" Ask instead:
- "When this fails (not if), how far will the damage spread?"
- "How quickly will we know something is wrong?"
- "What's the minimum viable functionality we need to maintain?"
- "How can we practice recovering from this failure?"
The internet is a complex, interconnected system built on trust, redundancy, and the assumption that most things work most of the time. The Cloudflare incident reminds us that this assumption, while generally true, is also occasionally catastrophically false.
The most resilient systems aren't the ones that never fail; they're the ones that fail in predictable, containable ways and recover quickly. Build for the failures you can imagine, and you'll be better prepared for the ones you can't.