Measuring Fragility
Failure means failing to meet your SLO, which means exceeding your error budget. Fragility measurement exists for one reason: to keep bad minutes within the error budget. To avoid the avoidable.
The problem with monitoring
SLOs are essential for indicating the current health of a system, but they function as lagging indicators. They tell you when a system is broken, not when it is about to break. A system can appear perfectly healthy, meeting every SLO, while lacking the designed redundancy to withstand the next inevitable component failure. High utilization or transient issues can mask underlying weaknesses. The system is meeting its SLOs today, but the next failure is a service impact.
This gap between "currently working" and "resilient enough to keep working" leads to preventable outages, often caused by unidentified Single Points of Failure (SPOFs), components whose failure causes a disproportionately large service impact. The question fragility measurement answers is: how close is this system to exceeding its error budget, even though it looks healthy right now?
Defining fragility: the N+k model
Fragility measures the gap between a system's designed resilience and its current, operational resilience. The definition is anchored in the standard N+k redundancy model.
Base capacity
The minimum operational capacity required to meet current demand while satisfying all SLOs. What you need right now.
Designed redundancy
The intended safety margin, denoted m. How many concurrent failures the system should tolerate by design (N+1 means m=1, N+2 means m=2).
Observed redundancy
The actual, measured safety margin beyond the base capacity N, denoted k. What the system can really survive right now.
The fragility gap
The relationship between observed (k) and designed (m) redundancy classifies the system into four distinct states.
Design Compliant
The system has at least its designed redundancy (k >= m) and can withstand the failures it was engineered for.
Degraded
Some redundancy remains, but below the design target (0 < k < m); the safety margin is partially consumed.
Fragile
Effectively no redundancy remains (k = 0): the system is operational, but the next relevant failure is a service impact.
Broken
The system is already failing its SLOs.
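The four-state classification can be sketched as a small function, assuming the observed redundancy k, the designed redundancy m, and the current SLO status are already known from measurement:

```python
from enum import Enum


class ResilienceState(Enum):
    DESIGN_COMPLIANT = "Design Compliant"
    DEGRADED = "Degraded"
    FRAGILE = "Fragile"
    BROKEN = "Broken"


def classify(k: int, m: int, slo_met: bool) -> ResilienceState:
    """Classify a system from observed redundancy k, designed
    redundancy m, and whether it currently meets its SLOs."""
    if not slo_met:
        return ResilienceState.BROKEN       # already failing
    if k >= m:
        return ResilienceState.DESIGN_COMPLIANT
    if k >= 1:
        return ResilienceState.DEGRADED     # margin partially consumed
    return ResilienceState.FRAGILE          # next failure is an impact
```

Note that Broken is determined by SLO status alone: a system can be Broken even while some redundancy remains.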
Fragility measurement focuses on the Degraded and Fragile states. It is the early warning system that provides visibility before the Broken state is reached and customers are impacted.
How fragility is measured
Two complementary methodologies assess fragility: a deterministic approach focused on design compliance, and a probabilistic approach focused on risk over time. Both require a high-fidelity representation of the serving infrastructure, including current load, demand, and projected demand.
Deterministic: design compliance and SPOF identification
This method provides a direct snapshot: does the system meet its resilience target right now, even under the worst expected failures? It works by simulating architecturally defined failure scenarios against the current topology and capacity.
Single Point of Failure
Simulate the failure of each individual component. Identify the single failure that causes the largest capacity reduction. This component is the primary SPOF candidate.
Dual Point of Failure
Simulate concurrent failure of component pairs. Identify the pair that causes the most impact. Reveals shared risk groups and correlated failure modes beyond single redundancy.
Impact radius
For each identified SPOF, quantify the blast radius: which services, customers, and demands would be unmet if this component fails.
Classification
Compare observed resilience against the design target to classify the system as Design Compliant, Degraded, or Fragile. A concrete, actionable state.
The deterministic approach answers: what is the worst thing that can happen right now, and are we designed to survive it? It identifies the specific components whose failure matters most and quantifies their impact.
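A minimal sketch of the deterministic scan, under a strong simplifying assumption: each component contributes a known capacity, and a failure simply removes that capacity (a real topology needs a routing or dependency model). The component names and the `order` parameter are illustrative; `order=1` is the SPOF scan, `order=2` the dual-failure scan:

```python
from itertools import combinations


def surviving_capacity(capacities: dict[str, float], failed: set[str]) -> float:
    """Total capacity remaining after the given components fail."""
    return sum(c for name, c in capacities.items() if name not in failed)


def worst_failures(capacities: dict[str, float], demand: float, order: int = 1):
    """Simulate every failure set of the given size and return the set
    causing the largest capacity loss, the capacity that survives it,
    and whether demand is still met."""
    worst = None
    for failed in combinations(capacities, order):
        remaining = surviving_capacity(capacities, set(failed))
        if worst is None or remaining < worst[1]:
            worst = (failed, remaining)
    failed, remaining = worst
    return failed, remaining, remaining >= demand
```

For example, with capacities `{"a": 10, "b": 10, "c": 5}` and a demand of 15, the worst single failure leaves exactly 15 units (demand met), while the worst pair leaves only 5 (demand unmet), exposing a dual-failure risk that the single-failure scan cannot see.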
Probabilistic: risk assessment over time
The deterministic snapshot tells you if the system can survive a failure. The probabilistic approach asks: how likely is that failure, and what is the risk over the next quarter or year? It incorporates the statistical reality of component failures and repair times.
Using empirical Mean Time Between Failures (MTBF) and Mean Time To Repair (MTTR) data for each component type, Monte Carlo simulations model thousands of possible futures. Each simulation randomly introduces failures based on MTBF distributions (Weibull) and repairs based on MTTR distributions (Lognormal). At each step, the simulation evaluates whether the system can still meet demand given the currently failed components.
MTTR is particularly critical. Longer repair durations significantly increase the probability of overlapping failures overwhelming the system's redundancy. A system with excellent MTBF but poor MTTR can be more fragile than one with moderate MTBF and fast repair.
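The simulation loop described above can be sketched as follows. This is a toy model, not a production simulator: the Weibull shape (1.5) and the lognormal sigma (0.5) are assumed parameters rather than empirical fits, and a "breach" is simplified to more concurrent failures than the redundancy absorbs:

```python
import math
import random


def p_slo_breach(n_components: int, redundancy: int, mtbf_h: float,
                 mttr_h: float, horizon_h: float = 2160.0,
                 trials: int = 1000, seed: int = 7) -> float:
    """Fraction of simulated futures in which more components are down
    at once than the redundancy can absorb within the horizon."""
    shape = 1.5
    scale = mtbf_h / math.gamma(1 + 1 / shape)  # Weibull mean == mtbf_h
    sigma = 0.5
    mu = math.log(mttr_h) - sigma ** 2 / 2      # lognormal mean == mttr_h
    rng = random.Random(seed)
    breaches = 0
    for _ in range(trials):
        events = []  # (time, +1 on failure, -1 on repair completion)
        for _ in range(n_components):
            t = 0.0
            while True:
                t += rng.weibullvariate(scale, shape)   # next failure
                if t >= horizon_h:
                    break
                repair = rng.lognormvariate(mu, sigma)  # repair duration
                events.append((t, +1))
                events.append((t + repair, -1))
                t += repair
        down = 0
        for _, delta in sorted(events):
            down += delta
            if down > redundancy:   # redundancy exhausted
                breaches += 1
                break
    return breaches / trials
```

Even in this toy form the MTTR effect is visible: lengthening repairs widens each failure's window on the event timeline, making overlaps, and therefore breaches, more likely.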
Quantifying risk: probability of SLO breach
The Monte Carlo simulations produce the most direct measure of fragility: P(SLO breach), the probability that the system will exceed its error budget over a given period.
A system with P(SLO breach) = 0.02 has a 2% chance of missing its SLO this quarter — low risk, likely Design Compliant. A system at P(SLO breach) = 0.45 has nearly a coin-flip chance of breaching — that is a fragile system demanding immediate attention. The metric is intuitive, actionable, and needs no interpretation.
Because P(SLO breach) is normalized against each system's own SLO target, it is comparable across heterogeneous infrastructure. A network path at 12% breach probability and a power feed at 8% can be ranked on the same scale, prioritized by how likely each is to violate its own reliability contract.
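As an illustration of that common scale (the system names and probabilities below are invented):

```python
# Hypothetical breach probabilities, each normalized against the
# system's own SLO target, so unlike infrastructure lands on one scale.
p_breach = {
    "network-path-12": 0.12,
    "power-feed-8": 0.08,
    "compute-cluster-3": 0.31,
}

# Prioritize by likelihood of violating the reliability contract.
ranked = sorted(p_breach, key=p_breach.get, reverse=True)
# → ['compute-cluster-3', 'network-path-12', 'power-feed-8']
```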
The feedback loop
Measurement without action is just a dashboard. Fragility measurement closes the loop by feeding directly into the operational pipeline.
Repairing a component identified as a single point of failure, pushing a system from Fragile (k=0) back to Degraded (k=1) or significantly lowering P(SLO breach), offers more risk reduction than a repair on a non-SPOF component in an already Compliant system. The decision is based on marginal impact on system redundancy, impact radius, and overall risk. The queue is never empty because infrastructure is never static.
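One way the repair queue might be ranked, assuming each pending repair carries a before-and-after breach probability produced by the simulations; the component names and field names are hypothetical:

```python
# Hypothetical repair queue: each entry carries P(SLO breach) with the
# component down and the projected value once it is repaired.
queue = [
    {"component": "pdu-a-rack-3", "p_breach_now": 0.45, "p_breach_after": 0.05},
    {"component": "spine-7",      "p_breach_now": 0.12, "p_breach_after": 0.10},
    {"component": "fan-tray-2",   "p_breach_now": 0.03, "p_breach_after": 0.02},
]


def repair_priority(queue: list[dict]) -> list[dict]:
    """Rank repairs by marginal risk reduction: how much P(SLO breach)
    drops if this repair is done next."""
    return sorted(queue,
                  key=lambda r: r["p_breach_now"] - r["p_breach_after"],
                  reverse=True)
```

The ranking key is the marginal reduction, not the absolute probability: fixing the SPOF behind a 0.40 drop beats a repair that only shaves 0.02 off an already healthy system.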
Beyond reactive: what this enables
Prevent the preventable
Transform unscheduled emergency fixes into planned maintenance. Intervene while the system is Degraded, before it becomes Fragile.
Prioritize by risk
When multiple components await repair, fragility state and P(SLO breach) provide a quantitative, impact-driven ranking. Fix the highest-risk SPOF first.
Plan capacity with foresight
Run simulations against forecasted demand. Know whether current redundancy survives next quarter's growth before you get there.
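The capacity-planning check can be sketched as a toy forecast; the compound quarterly growth model and all numbers are illustrative, and the tolerance test here is a bare N+1 check (capacity minus the largest single unit must still cover demand):

```python
def survives_forecast(total_capacity: float, largest_unit: float,
                      demand_now: float, quarterly_growth: float,
                      quarters: int = 4) -> list[tuple[int, bool]]:
    """For each future quarter, check whether the system still tolerates
    its worst single failure once demand has grown."""
    results = []
    demand = demand_now
    for q in range(1, quarters + 1):
        demand *= 1 + quarterly_growth          # compound growth
        results.append((q, total_capacity - largest_unit >= demand))
    return results
```

For example, 25 units of capacity with a 10-unit largest component, serving 12 units of demand growing 10% per quarter, stays N+1 compliant for two more quarters and then silently loses its redundancy, the kind of foresight this check is meant to surface.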
Fragility measurement reframes reliability engineering from a reactive discipline (detect failure, respond, restore) to a proactive one (measure risk, prioritize, prevent). It turns preventive maintenance from an intuition-driven practice into a quantitative, continuous, closed-loop system.
The methodology is not specific to networks. Any infrastructure with a topology of dependencies, measurable redundancy, empirical failure data (MTBF/MTTR), and a concept of service impact can be assessed this way. Data center power, cooling, compute clusters, distributed services. Wherever a system can appear healthy while silently approaching its error budget, fragility can be measured.
Avoid the avoidable. Fragility measurement makes it possible to keep bad minutes within the error budget, not by reacting faster, but by preventing the conditions that consume it.