Technical Case Study

Topology-Aware Failure Detection

How a network's own topology model becomes the key to deterministic fault localization, replacing alarm noise with proof.

The noise problem

When a single link fails in a large network, the consequences are not confined to that link. Traffic reroutes. Monitoring probes that traversed the failed segment start reporting loss. Devices on both ends log events. Downstream systems see degraded performance and raise their own alarms. What was one failure becomes dozens, sometimes hundreds, of alerts.

Traditional monitoring systems treat each alarm independently. They apply thresholds, deduplicate, and aggregate, but they lack the one piece of context that would make the problem tractable: knowledge of the network's topology.

Without topology awareness, operators are left correlating alarms manually: reading through logs, tracing paths on diagrams, making educated guesses about root cause. The process is slow, error-prone, and scales poorly. At hyperscale, it becomes untenable.

The core problem

A single failure produces many symptoms. Monitoring sees the symptoms. Operators must reverse-engineer the cause. This inversion is where time, accuracy, and reliability are lost.


The insight: invert the problem

The key insight behind topology-aware failure detection is to invert the diagnostic process entirely. Instead of starting from observed alarms and reasoning backward to the cause, start from a hypothesized failure and reason forward to the expected symptoms.

If you have a model of the network topology, you know every device, every link, and every path between any two points. Given a failure hypothesis ("link between R3 and R4 is down"), you can deterministically compute which monitoring probes should be affected, which should show loss, and which should remain healthy.

The topology model transforms fault diagnosis from pattern recognition into deduction. You do not search for the cause. You predict the symptoms of every possible cause and match them against reality.

This approach is deterministic, not statistical. It does not rely on machine learning, historical baselines, or probabilistic inference. It relies on the structural fact that a network is a graph, and that failures propagate through that graph in predictable ways.
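The inversion can be sketched in a few lines (the probe definitions here are illustrative, not from a real deployment): a failure hypothesis maps deterministically, by forward computation, to the set of probes that should report loss.

```python
# Illustrative probe definitions: each probe is the ordered list of links
# its packets traverse. (Hypothetical data, for this sketch only.)
PROBES = {
    "probe_A": [("R1", "R2"), ("R2", "R3"), ("R3", "R4")],
    "probe_B": [("R5", "R6"), ("R6", "R7"), ("R7", "R8")],
}

def predict_symptoms(failed_link):
    """Forward deduction: given the hypothesis 'this link is down',
    return the set of probes that should report loss."""
    down = frozenset(failed_link)
    return {name for name, hops in PROBES.items()
            if any(frozenset(hop) == down for hop in hops)}

# The hypothesis "link R3-R4 is down" predicts exactly one failing probe.
print(predict_symptoms(("R3", "R4")))  # {'probe_A'}
```

Because links are treated as unordered pairs, the prediction is independent of probe direction; the computation consults only the graph, never historical alarm data.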


How it works

The system operates in three layers: a probing layer that continuously measures path health, a topology layer that encodes the structure of the network, and a correlation layer that matches observations against predictions.

01

Exhaustive path probing

Test packets are sent through pre-determined paths covering every forwarding segment. For each node, every ingress-to-egress combination is tested. These probes use high-priority queues to avoid false positives from traffic congestion.

02

Topology model

A structured model of the network encodes devices, interfaces, links, and the paths that traverse them. This is the same domain model that drives configuration and lifecycle automation: the single source of truth.
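A minimal sketch of such a model (the entity names and fields are assumptions for illustration; a production model would carry many more attributes and relationships):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Link:
    """An undirected adjacency between two devices. Hypothetical shape."""
    a: str
    b: str

    def ends(self):
        return frozenset((self.a, self.b))

@dataclass
class Topology:
    """Single source of truth: links and the probe paths that traverse them."""
    links: list = field(default_factory=list)
    paths: dict = field(default_factory=dict)  # path name -> device sequence

    def devices(self):
        return {d for link in self.links for d in (link.a, link.b)}

topo = Topology(
    links=[Link("R1", "R2"), Link("R2", "R3"), Link("R3", "R4")],
    paths={"A": ["R1", "R2", "R3", "R4"]},
)
print(sorted(topo.devices()))  # ['R1', 'R2', 'R3', 'R4']
```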

03

Failure propagation

For each possible failure (a link, an interface, a line card), the system computes the set of probes that would be affected. This is a graph traversal, not a guess. The result is a predicted alarm set for every failure hypothesis.

04

Correlation

The observed probe results are compared against every predicted alarm set. The hypothesis whose prediction best matches the observations is the root cause. Unexpected alarms signal additional or different failures.
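Steps 03 and 04 can be sketched together (path routes and probe names are illustrative): propagation precomputes a predicted alarm set per link, and correlation picks the hypothesis with the fewest mismatches against the observed failures.

```python
# Illustrative test paths through the topology.
PATHS = {
    "A": ["R1", "R2", "R3", "R4"],
    "B": ["R5", "R6", "R7", "R8"],
    "C": ["R1", "R2", "R3", "R7"],
}

def links_of(path):
    """The set of links a path traverses, as unordered pairs."""
    return {frozenset(pair) for pair in zip(path, path[1:])}

def predicted_alarm_sets(paths):
    """Failure propagation: for each link, the probes that would fail if
    that link went down. A graph traversal, not a guess."""
    table = {}
    for name, devices in paths.items():
        for link in links_of(devices):
            table.setdefault(link, set()).add(name)
    return table

def correlate(observed_failing, table):
    """Correlation: the hypothesis whose prediction best matches the
    observed failing probes (symmetric difference = mismatch count)."""
    return min(table.items(), key=lambda kv: len(kv[1] ^ observed_failing))

table = predicted_alarm_sets(PATHS)
link, predicted = correlate({"A"}, table)
print(sorted(link), predicted)  # ['R3', 'R4'] {'A'}
```

With only path A failing, the R3-R4 hypothesis scores zero mismatches while R2-R3 and R1-R2 each leave one healthy path unexplained, so the fault localizes to the link unique to A.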


Seeing it in the topology

Consider a simple network of eight routers. Multiple test paths traverse the topology, each following a specific route through the graph. In normal operation, every path reports healthy.

[Interactive diagram: eight routers, R1 through R8, arranged in two rows, with test paths A through D overlaid; step-through controls simulate a link failure.]

When the link between R3 and R4 fails, paths that traverse it (Path A, which runs R1 → R2 → R3 → R4) report 100% loss. Paths that do not cross the failed link (Path B through the bottom row, Path C via R2 → R6, Path D via R3 → R7) remain healthy. The correlation is immediate: only Path A is affected, and the only link unique to Path A that is not shared by healthy paths is R3 ↔ R4. The fault is localized.


The correlator

The correlation engine sits at the heart of the system. It performs four functions: finding the faulty component, minimizing false positives and false negatives, calculating the magnitude of the fault, and producing a time series of network health.

Finding the fault

For every link and every node in the topology, the correlator maintains a pre-computed set of probes that traverse it. When probes fail, the correlator intersects the "failing probe" set with each component's probe set. The component whose probe set best explains the observed failures, with the fewest unexplained results, is the root cause.
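That scoring can be sketched as follows (component names and probe sets are hypothetical): a hypothesis is penalized both for failing probes it cannot explain and for probes it predicts down that are in fact healthy.

```python
# Hypothetical precomputed index: component -> probes that traverse it.
PROBE_SETS = {
    "link:R2-R3": {"p1", "p2", "p3"},
    "link:R3-R4": {"p1", "p2"},
    "card:R4/1":  {"p2"},
}

def localize(failing, probe_sets):
    """Return the component whose probe set best explains the observed
    failures, with the fewest unexplained results on either side."""
    def mismatch(component):
        predicted = probe_sets[component]
        unexplained_failures = failing - predicted  # failed, but not predicted
        unexplained_health = predicted - failing    # predicted down, but healthy
        return len(unexplained_failures) + len(unexplained_health)
    return min(probe_sets, key=mismatch)

print(localize({"p1", "p2"}, PROBE_SETS))  # link:R3-R4
```

Here R2-R3 would explain the two failures but wrongly predicts p3 down, and the line card explains only one failure, so R3-R4 wins with zero mismatches.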

Minimizing false signals

Probes use dedicated high-priority queues to avoid false positives from traffic congestion. If a probe fails, it is because the forwarding path is broken, not because the link is overloaded. This is a critical design choice: the probing layer measures reachability and forwarding correctness, not traffic health.

Quantifying fault magnitude

Not all failures are binary. A logical link may consist of a bundle of physical links with hash-based load balancing. If one member of the bundle fails, some probes traversing the logical link may fail while others succeed, depending on which physical member they are hashed to. The same link appears "both good and bad" to different probes. The correlator accounts for this by computing the fraction of affected probes per component, yielding a magnitude score rather than a binary healthy/unhealthy verdict.
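One way to express that score (a sketch; probe names are illustrative): the fraction of a component's probes currently failing, so a dead bundle member under even hashing appears as partial magnitude rather than flapping between healthy and down.

```python
def magnitude(component_probes, failing):
    """Fraction of the probes traversing a component that are failing:
    1.0 is hard-down, values in between indicate partial failure."""
    if not component_probes:
        return 0.0
    return len(component_probes & failing) / len(component_probes)

# Hypothetical two-member bundle: 8 probes hash evenly across members,
# one member dies, so the probes hashed to it fail.
probes = {f"p{i}" for i in range(8)}
failing = {"p0", "p2", "p4", "p6"}
print(magnitude(probes, failing))  # 0.5
```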

The partial failure problem

A link that is "50% bad" is harder to diagnose than one that is fully down. Bundle members, ECMP hashing, and partial line-card failures all create scenarios where a single component exhibits mixed health. The correlator must handle this as a first-class concern, not an edge case.

Time series and trending

By running correlation continuously, the system produces a health time series for every component in the network. Intermittent faults, degrading optics, and flapping links all become visible as patterns over time, often long before they escalate into customer-impacting outages. This is where failure detection meets preventive maintenance: the same data that localizes a fault today can predict one tomorrow.
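A minimal sketch of the trending idea (the window and threshold are invented for illustration): keep per-component magnitude samples and flag components that stay degraded across consecutive correlation runs.

```python
from collections import defaultdict

history = defaultdict(list)  # component -> list of (timestamp, magnitude)

def record(component, t, mag):
    history[component].append((t, mag))

def trending_bad(component, window=3, threshold=0.2):
    """True when the last `window` samples all show degradation --
    a pattern worth preventive maintenance before it becomes an outage."""
    recent = history[component][-window:]
    return len(recent) == window and all(m >= threshold for _, m in recent)

# Hypothetical samples from successive correlation runs.
for t, mag in enumerate([0.0, 0.25, 0.25, 0.5]):
    record("link:R3-R4", t, mag)
print(trending_bad("link:R3-R4"))  # True
```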


Why the topology model matters

None of this works without the topology model. The model provides the structural context that makes correlation deterministic rather than probabilistic. Without it, the correlator would be reduced to statistical inference: "these alarms tend to fire together, so they are probably related." With it, the correlator can state with certainty: "these alarms fired because they share a physical dependency, and here is the component."

This is a direct application of the domain modeling philosophy described in Infrastructure as Code, Infrastructure as Data. The topology model, with its entities (devices, interfaces, links), relationships (adjacencies, path memberships, bundle compositions), and attributes (probe assignments, queue priorities), is the single source of truth that enables the entire system.

The model does not just describe the network. It encodes the causal structure of failure propagation. Every relationship in the model is a potential failure path, and every entity is a potential root cause.

The same model serves multiple consumers. Configuration management uses it to render device configs. Lifecycle automation uses it to plan workflows. The fragility index uses it to simulate failure scenarios. And topology-aware failure detection uses it to correlate alarms in real time. The model is written once and reasoned about many times.


From alarm storms to actionable signals

The practical impact of this approach is a fundamental shift in how operations teams experience failures. Instead of an alarm storm requiring manual triage, the system produces a single, high-confidence signal: "Link R3 ↔ R4 has failed, magnitude 100%, affecting 12 paths, first detected at 14:23:07 UTC."

Before

Alarm-driven operations

Hundreds of alerts fire. Operators triage, deduplicate, and correlate manually. Time to root cause: minutes to hours. Accuracy depends on individual expertise.

After

Topology-aware detection

One actionable signal identifies the faulty component, its magnitude, and its blast radius. Time to root cause: seconds. Accuracy is structural, not experiential.

Unexpected alarms, those that do not match any single-failure hypothesis, are equally valuable. They indicate either a multi-failure scenario or a topology model inconsistency. Both are worth investigating immediately.


Connecting the systems

Topology-aware failure detection does not exist in isolation. It is one stage in the observability pipeline: the correlation layer that transforms raw telemetry into assessed signals. Those signals feed into the closed-loop lifecycle system, triggering automated repair workflows when confidence is high enough.

The connection to the Network Fragility Index is equally direct. Every fault the correlator localizes updates the observed state of the network. Components that accumulate partial failures, intermittent faults, or trending degradation contribute to a rising fragility score, surfacing them for preventive action before they cause a hard failure.

The closed loop

Detect deterministically. Assess the blast radius. Feed the fragility model. Trigger preventive repair. The topology model ties it all together: one source of truth, many consumers, continuous convergence toward a healthier network.


Published foundations

Patent
Deterministic Network Failure Detection
Systems for sending packets through pre-determined paths to proactively monitor and diagnose network health in packet-switched networks.
Talk · NANOG
Presented at NANOG on leveraging network topology context to build smarter, more actionable monitoring systems that understand the blast radius of failures.