The Invisible Dependencies Problem

Key Points
  • You probably won't achieve four nines 🤷🏾‍♂️
  • Many dependencies become invisible over time, increasing the risk of failures
  • External services' SLAs quietly cap your availability, often without you realizing it
  • The inherent complexity of how we build applications today has made high availability primarily a statistical battle
  • Every system has 100% uptime until you start measuring. Worse: even when you do measure, you're probably ignoring half the things that can bring down your service.

Developers love talking about "four nines" (99.99%) availability. Clients' eyes light up, and stakeholders love it. What few mention is that this means only 52 minutes and 35 seconds of downtime per year. If your payment partner goes down for an hour, that limit is already blown.

But the real problem isn't in what you can see and plan for. It's in the dependencies you don't even know exist, what I call "invisible dependencies."

Dev vs Probability… Fight!

Let's start with the dependencies you can see.

If your service depends on three components, each with 99.9% uptime, your maximum theoretical availability isn't 99.9%. It's 99.9% × 99.9% × 99.9% = 99.7%. You might even argue: "But it's only three dependencies!"

Let's list some then:

  • Your load balancer
  • Your application server
  • Your database
  • Your Redis cache
  • Your authentication service
  • Your CDN for assets
  • Your DNS provider
  • Your monitoring service
  • Your logging system
  • Your messaging service

That's already ten. With 99.9% each: 99.9%^10 = 99.0%. Lost a whole nine. And this assumes they all have three nines, which is quite optimistic to say the least.
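
If you want to sanity-check the math yourself, combined availability is just the product of each dependency's availability. A tiny sketch (the ten components at 99.9% are the same assumption as above):

// Combined availability is the product of per-dependency availabilities
fun combinedAvailability(availabilities: List<Double>): Double =
    availabilities.fold(1.0) { acc, a -> acc * a }

fun main() {
    val tenDependencies = List(10) { 0.999 }  // ten components at three nines each
    val combined = combinedAvailability(tenDependencies)
    val downtimeHoursPerYear = (1 - combined) * 365 * 24

    println("Combined availability: %.2f%%".format(combined * 100))            // ≈ 99.00%
    println("Expected downtime: %.0f hours/year".format(downtimeHoursPerYear)) // ≈ 87 hours
}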

Insight

"Each dependency multiplies your chances of failure."

What about invisible dependencies?

What makes everything worse is that the dependencies you see are just the tip of the iceberg. For every service you consciously add, there are dozens of invisible dependencies waiting to screw you over.

Invisible dependencies include transitive dependencies (dependencies of your dependencies) and other non-obvious dependencies like third-party CDNs, shared DNS services, or libraries that make external calls silently.

Consider a real case I experienced working on an e-commerce platform a few years ago: at 3 AM on a Tuesday, users couldn't use the "Complete Purchase" button. All because of an outage at a CDN that hosted icons used by a JavaScript library in the payment interface.

Nobody knew we depended on that CDN. It wasn't in any architecture diagram. It didn't appear in any runbook. It was invisible until the moment it became critical.

The incident timeline

  • 14:00 - Frontend team updates a component library to fix a visual bug on the payment page
  • 14:15 - Deploy passes beautifully in all tests, goes to production
  • 14:30 - Quick sanity check in production: all good, normal metrics, no alerts
  • 14:31 - The new library introduces a silent dependency: icons loaded from a third-party CDN
  • 02:45 - Third-party CDN enters a scheduled maintenance window (different timezone) and hits an SSL certificate problem
  • 02:46 - Interface icons don't load, "Complete Purchase" button appears visually broken
  • 02:55 - Users can't complete checkout, sales stop
  • 03:15 - On-call alert wakes us up
  • 05:30 - After 2 hours of investigation, someone discovers the invisible dependency

The worst part is that dependencies aren't born invisible. Entropy wins: they become invisible gradually, as they fade from everyone's memory.

It's not possible to completely beat entropy, but we can reduce it. The paradox is real: to monitor all your dependencies, you first need to know they exist. But the goal isn't perfect knowledge, it's to gradually reduce our blindness.

We invest millions in observability tools that show us in impressive detail the behavior of dependencies we already know about. Meanwhile, many dependencies that can bring down the system remain in our blind spot because we don't know we should look for them.

It's like having the best radar system in the world, pointed in the wrong direction.

How to detect these dependencies?

We need to look for some symptoms:

  • Latency increase "out of nowhere": Your service sometimes takes 5x longer to respond, but no internal metric explains why.
  • Correlated failures: Two "independent" services always go down together. There's a shared dependency nobody mapped.
  • Gradual degradation: Performance worsens X% per week. Some dependency is accumulating state or leaking resources.
  • Seasonal errors: Failures that always happen at the same time of day or day of the week. You're depending on some batch process or maintenance routine; a quick way to spot this is sketched below.
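
For the "seasonal errors" symptom you don't need fancy tooling to get a first signal: bucketing error timestamps by hour of day already shows whether failures cluster around someone else's batch job or maintenance window. A rough sketch with made-up timestamps:

import java.time.Instant
import java.time.ZoneOffset

// Count errors per hour of day; a spike in a single bucket usually points at
// a scheduled process you depend on without knowing it
fun errorsByHourOfDay(errorTimestamps: List<Instant>): Map<Int, Int> =
    errorTimestamps
        .groupingBy { it.atZone(ZoneOffset.UTC).hour }
        .eachCount()

fun main() {
    val timestamps = listOf(
        Instant.parse("2024-03-05T02:47:00Z"),
        Instant.parse("2024-03-12T02:51:00Z"),
        Instant.parse("2024-03-19T02:49:00Z"),
        Instant.parse("2024-03-20T14:02:00Z")
    )
    println(errorsByHourOfDay(timestamps)) // {2=3, 14=1} -> something happens around 02:00 UTC
}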

How to survive in the real world?

After years of getting beaten up by invisible dependencies, I've developed some survival strategies. They're not perfect, but they reduce the pain.

1. Assume everything will eventually fail

"Design for failure" isn't pessimism, it's simply being realistic. If a dependency can fail, it will fail. The question isn't "if," but "when" and "how your system will react."

suspend fun fetchUserPreferences(userId: String): UserPreferences {
    return try {
        // Try to fetch from main service
        preferenceService.get(userId)
    } catch (e: Exception) {
        when (e) {
            is TimeoutException, is ConnectException -> {
                try {
                    // Fallback to local cache
                    localCache.get("prefs:$userId")
                } catch (cacheError: Exception) {
                    // Fallback to defaults
                    getDefaultPreferences()
                }
            }
            else -> throw e
        }
    }
}

2. Aggressive timeouts on everything

The biggest cause of cascading failures is poorly configured or missing timeouts. A slow service is bad. A service that hangs waiting for a response is catastrophic.

# Bad
redis:
  timeout: 30s  # 30 seconds of blocked thread

# Better  
redis:
  connect_timeout: 100ms
  read_timeout: 250ms
  max_retries: 2
  circuit_breaker:
    failure_threshold: 5
    recovery_timeout: 30s
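
The same discipline applies inside application code, not just in config files. A minimal sketch using the JDK's built-in HttpClient; the 100 ms / 250 ms values mirror the config above, and the URL is obviously made up:

import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse
import java.time.Duration

// Connection timeout is set once on the client
val httpClient: HttpClient = HttpClient.newBuilder()
    .connectTimeout(Duration.ofMillis(100))
    .build()

fun fetchExchangeRates(): String {
    // Per-request timeout bounds how long we wait for a response
    // (send throws HttpTimeoutException instead of hanging forever)
    val request = HttpRequest.newBuilder()
        .uri(URI.create("https://rates.example.com/latest"))
        .timeout(Duration.ofMillis(250))
        .GET()
        .build()

    return httpClient.send(request, HttpResponse.BodyHandlers.ofString()).body()
}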

3. Active dependency mapping

Yes, there's irony here: I'm recommending more tools to solve problems caused by... having too many tools. But there's a difference between necessary and accidental complexity.

Don't trust documentation or diagrams. Use tools that discover dependencies automatically by analyzing real traffic:

  • Service mesh with distributed tracing
  • Network traffic analysis
  • Dependency scanning at build time
  • Chaos engineering to reveal hidden dependencies

The goal is to trade accidental complexity (unknown dependencies) for necessary complexity (observability tools).
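
If you go the distributed-tracing route, the important habit is wrapping every outbound call in an explicit client span, so the dependency shows up in your traces instead of hiding behind a library. A minimal sketch assuming the OpenTelemetry Java API is available (service, span, and URL names are illustrative):

import io.opentelemetry.api.GlobalOpenTelemetry
import io.opentelemetry.api.trace.SpanKind
import io.opentelemetry.api.trace.StatusCode
import java.net.URI

private val tracer = GlobalOpenTelemetry.getTracer("checkout-service")

fun loadPaymentIcons(url: String): ByteArray {
    // A CLIENT span makes the third-party CDN visible in every trace
    val span = tracer.spanBuilder("cdn.fetch-icons")
        .setSpanKind(SpanKind.CLIENT)
        .setAttribute("http.url", url)
        .startSpan()
    return try {
        URI.create(url).toURL().readBytes()
    } catch (e: Exception) {
        span.recordException(e)
        span.setStatus(StatusCode.ERROR)
        throw e
    } finally {
        span.end()
    }
}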

4. Graceful degradation by default

We need our applications to gracefully handle the invisible part of the iceberg. When submerged dependencies fail, the system should continue working without users noticing something went wrong underwater.

Your service should have levels of functionality, not be binary (works/doesn't work). When dependencies fail, degrade non-essential features first.

class RecommendationService(
    private val socialService: SocialService,
    private val browsingService: BrowsingService,
    private val locationService: LocationService
) {
    suspend fun getRecommendations(userId: String): Recommendations = coroutineScope {
        // Fetch essential data
        val essentials = getEssentialData(userId)
        
        // Try to enrich in parallel, but don't block if something fails or hangs:
        // the timeout lives inside each async, so a slow dependency gets cancelled
        // instead of keeping the whole coroutineScope waiting for it
        val enrichments = listOf(
            async { enrichOrNull { getSocialSignals(userId) } },
            async { enrichOrNull { getBrowsingHistory(userId) } },
            async { enrichOrNull { getLocationData(userId) } }
        ).awaitAll().filterNotNull()
        
        // Use what you got, ignore what failed
        computeRecommendations(essentials, enrichments)
    }
    
    // Returns null on timeout or failure instead of propagating the error
    private suspend fun <T> enrichOrNull(block: suspend () -> T): T? =
        runCatching { withTimeoutOrNull(500) { block() } }.getOrNull()
}

5. Feature flags as business circuit breakers

The most underestimated strategy: transform mandatory dependencies into optional ones through feature flags. When the currency API from the pricing example below fails, you can instantly disable currency conversion without a deployment.

class PricingService(
    private val currencyService: CurrencyService,
    private val featureFlags: FeatureFlags
) {
    suspend fun getPrice(productId: String, userCurrency: String): Price {
        val basePrice = getBasePriceFromDB(productId)
        
        return if (featureFlags.isEnabled("currency_conversion")) {
            try {
                val rate = currencyService.getExchangeRate("USD", userCurrency)
                basePrice.convert(rate, userCurrency)
            } catch (e: Exception) {
                // Fallback to default currency when conversion fails
                basePrice.asDefaultCurrency()
            }
        } else {
            // Feature disabled = zero dependency on external API
            basePrice.asDefaultCurrency()
        }
    }
}

Feature flags give you an "emergency button" for any problematic functionality. Slow recommendations API? Disable it and show popular products. Unstable reviews service? Hide reviews temporarily.

The best part: you transform invisible dependencies into conscious product decisions.

In the pricing endpoint example below, if we disable currency conversion via a feature flag during an emergency, availability jumps from 99.35% to 99.85% - reducing potential downtime from 2.4 to 0.5 days per year. An "emergency button" that instantly improves your availability.

But what about four nines?

Back to the famous 99.99%: in practice, it's almost impossible to achieve with modern distributed architectures. Not because we're incompetent (and most are, unfortunately), but because the math is against us.

Look at a real "simple" e-commerce endpoint (GET /api/products/{id}/price):

  • RDS PostgreSQL (99.95%)
  • Internal pricing microservice (99.9%)
  • External currency exchange API (99.5% - for currency conversion)

Combined endpoint availability: 99.35%. That's 2.4 days this specific endpoint could be unavailable per year. Far from the promised 52 minutes.

But if we implement caching with fallback for exchange rates (using the last known rate when the API fails), we can improve the currency lookup availability to 99.9%. New availability: 99.75% - reducing to 0.9 days of potential downtime.

And this assumes you're using RDS correctly, with multi-AZ and all the best practices. It doesn't account for configuration errors, bad deployments, or that developer who forgot to configure HTTP client timeouts.
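
Running the numbers for the three scenarios in this example (the inputs are the SLA figures listed above), a quick sketch:

fun availability(vararg components: Double): Double =
    components.fold(1.0) { acc, a -> acc * a }

fun main() {
    val rds = 0.9995
    val pricingService = 0.999
    val currencyApi = 0.995

    val scenarios = listOf(
        "external currency API" to availability(rds, pricingService, currencyApi), // ≈ 99.35%
        "cached exchange rates" to availability(rds, pricingService, 0.999),       // ≈ 99.75%
        "conversion flag off" to availability(rds, pricingService)                 // ≈ 99.85%
    )

    scenarios.forEach { (name, a) ->
        val downtimeDays = (1 - a) * 365
        println("%-22s %.2f%% (~%.1f days/year of potential downtime)".format(name, a * 100, downtimeDays))
    }
}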

The solution? Accept reality and design accordingly:

  • Active redundancy: Not just backups, but multiple active paths
  • Failure isolation: Limited blast radius when something fails
  • Aggressive caching: Better to serve slightly stale data than nothing (see the sketch after this list)
  • Business monitoring: Can the customer buy?
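
For the "aggressive caching" point, the pattern that makes the exchange-rate fallback above work is serve-stale-on-error: keep the last good value around and return it when the refresh fails. A minimal in-memory sketch (single value, no TTL, purely illustrative):

import java.util.concurrent.atomic.AtomicReference

class StaleTolerantRateCache(
    private val fetchFreshRate: suspend () -> Double
) {
    private val lastKnownRate = AtomicReference<Double?>(null)

    // Try the external API; if it fails, serve the last rate we saw
    // instead of failing the whole request
    suspend fun getRate(): Double {
        return try {
            fetchFreshRate().also { lastKnownRate.set(it) }
        } catch (e: Exception) {
            lastKnownRate.get()
                ?: throw IllegalStateException("No cached exchange rate available yet", e)
        }
    }
}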

Every time we add a new modern tool, a new library, a new service, we're trading availability for functionality or productivity. Sometimes it's worth it; often it's not.

Before: Kotlin/Spring Boot monolith and PostgreSQL

  • 2 points of failure
  • Simple debugging
  • Atomic deployment

After: microservices with Kubernetes

  • Many points of failure (easily 20-30+)
  • Debugging requires expertise in distributed systems
  • Deployment can leave system in inconsistent state

I'm not advocating going back to the past. But we need to be honest about the trade-offs. Complexity has a cost, and that cost is usually paid in availability.

If there's no way around it, then roll with it

The secret isn't to eliminate all invisible dependencies. That's impossible. The real secret is to build systems that survive them. That degrade gracefully. That fail fast and recover even faster. That assume the impossible will happen, because it will.

Because in the real world, the difference between three nines and four nines isn't in technical perfection. It's in understanding that what you see is only the tip of the iceberg. The submerged mass of invisible dependencies is always larger than we imagine.

