Turning Noisy Telemetry Into Stable Incidents (Without Losing Determinism)


By Dimitri Lafleur

Real-world telemetry is messy. Even when sensors, collectors, and networks are "fine," the stream can still be unreliable in ways that create operational drag. You get spikes that mean nothing, values that go stale without going fully missing, brief dropouts that self-heal, and timing irregularities that make triage harder than it needs to be.

This post describes a design pattern: convert scan-driven telemetry into stable, testable incident signals that are useful for humans and downstream systems, while staying deterministic and resource-bounded.

Implementation details are intentionally omitted, including thresholds, tuning strategy, edge-case handling, state transition rules, and grouping heuristics.

The failure mode: noise turns into work

Raw telemetry tends to produce two bad outcomes at the same time: alarm fatigue and missed problems.

A single point can flicker around a threshold. A source can drift or report intermittently. A collector can miss a few updates and then recover. If alert logic treats every fluctuation as meaningful, you get alarm fatigue. If you over-suppress, you miss legitimate problems.

The requirement becomes: stabilize meaning without smearing reality.

Constraints that drive the design

Deterministic replay

Given the same input stream, the pipeline must produce the same outputs every time. In practice that means no dependence on wall-clock time, no unseeded randomness, and no order-sensitive iteration over unordered collections.

Determinism enables regression testing, version comparison, and behavior audits after the fact.

Bounded memory and bounded state

Per-signal state must be small and predictable. Telemetry streams can be large, so the design assumes fixed-size rolling context per signal and compact incident lifecycle state.
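As a concrete sketch of what fixed-size rolling context can look like (the class, its names, and the window size are illustrative, not from the post), a `collections.deque` with `maxlen` keeps per-signal memory constant no matter how long the stream runs:

```python
from collections import deque

class SignalContext:
    """Hypothetical fixed-size rolling context for one signal.

    Memory is bounded regardless of stream length: the deque evicts
    the oldest sample once `window` samples are held.
    """
    def __init__(self, window: int = 4):
        self.recent = deque(maxlen=window)  # O(window) memory, always

    def update(self, value: float) -> None:
        self.recent.append(value)

    def span(self) -> float:
        """Max minus min over the bounded window: one cheap evidence signal."""
        return max(self.recent) - min(self.recent) if self.recent else 0.0

ctx = SignalContext(window=4)
for v in [1.0, 2.0, 3.0, 10.0, 10.5, 10.2]:
    ctx.update(v)
print(len(ctx.recent))  # 4 — older samples were evicted
print(ctx.span())       # 7.5 — computed over the last 4 samples only
```

The point is not the `span()` statistic itself but the shape: every per-signal structure has a hard size cap chosen up front.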

Scan-driven processing

This is not batch analytics. Inputs arrive as scans or updates, and the pipeline responds incrementally. That pushes you toward lifecycle thinking and deterministic update rules.

Architectural shape

The system is intentionally simple to describe. The complexity lives in how each block preserves determinism and stability.

1) Ingest (normalize, do not decide)

Input samples are normalized into a consistent internal representation (value, timestamp when available, and optional quality hints). Ingestion is normalization, not decision-making.
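A minimal sketch of "normalization, not decision-making", assuming a hypothetical raw-sample shape (`id`, `val`, `ts`, and `q` are invented field names): the ingest layer coerces types and carries quality hints through without judging them.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Sample:
    """Normalized internal representation (field names are illustrative)."""
    signal_id: str
    value: Optional[float]      # None when the source reported nothing usable
    timestamp: Optional[float]  # epoch seconds, only when the source provided one
    quality: Optional[str]      # optional quality hint, passed through verbatim

def normalize(raw: dict) -> Sample:
    """Normalize only; defer all judgment to later stages."""
    value = raw.get("val")
    return Sample(
        signal_id=str(raw["id"]),
        value=float(value) if value is not None else None,
        timestamp=raw.get("ts"),
        quality=raw.get("q"),
    )

s = normalize({"id": "pump-7", "val": "42.5", "ts": 1700000000.0})
print(s.value, s.quality)  # 42.5 None — no threshold check, no staleness check
```

Note what is absent: no thresholds, no staleness logic, no suppression. Those decisions belong to later stages that can be tested and replayed on their own.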

2) Per-signal evidence (bounded, incremental)

Each signal maintains lightweight rolling evidence about whether behavior is departing from expectations. Examples of evidence signals (non-exhaustive): deviation from a smoothed baseline, time since the last good update (staleness), and the length of the current dropout run.

One standard tool used in this space is exponential smoothing:

y[n] = α x[n] + (1 − α) y[n−1]

The important property is that it can be computed incrementally with bounded state.
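That property is easy to see in code. A minimal incremental implementation of the formula above, with state limited to a single float per signal (the class name and the `alpha` value are illustrative):

```python
class Ewma:
    """Incremental exponential smoothing: y[n] = a*x[n] + (1 - a)*y[n-1].

    State is one float, so memory stays bounded no matter how long the
    stream runs, and the update is deterministic given the input order.
    """
    def __init__(self, alpha: float):
        assert 0.0 < alpha <= 1.0
        self.alpha = alpha
        self.y = None  # no estimate until the first sample arrives

    def update(self, x: float) -> float:
        # Seed with the first observation, then apply the recurrence.
        self.y = x if self.y is None else self.alpha * x + (1 - self.alpha) * self.y
        return self.y

ewma = Ewma(alpha=0.5)
for x in [10.0, 10.0, 20.0]:
    ewma.update(x)
print(ewma.y)  # 15.0 — the spike to 20 is only half-absorbed
```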

3) Incidentization (evidence becomes stable incidents)

Instead of "alert on condition," treat conditions as inputs to an incident lifecycle. The lifecycle uses persistence and hysteresis to reduce alert chatter while staying deterministic.

The outcome is a stable object with a start, an active period, and a clean resolution. Not spam on every scan.
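The persistence-and-hysteresis idea can be sketched as a small state machine. All constants and transition rules below are illustrative assumptions; the post deliberately leaves the real ones unspecified:

```python
class IncidentLifecycle:
    """Sketch of persistence + hysteresis for one signal.

    A breach must persist for `raise_after` consecutive scans before an
    incident opens, and must clear for `clear_after` consecutive scans
    before it resolves. Single-scan flicker produces no events at all.
    """
    def __init__(self, raise_after: int = 3, clear_after: int = 3):
        self.raise_after = raise_after
        self.clear_after = clear_after
        self.active = False  # is an incident currently open?
        self.streak = 0      # consecutive scans disagreeing with `active`

    def step(self, breached: bool) -> str:
        """Feed one scan's condition; return 'open', 'close', or 'none'."""
        if breached == self.active:
            self.streak = 0  # condition agrees with state: nothing pending
            return "none"
        self.streak += 1
        needed = self.clear_after if self.active else self.raise_after
        if self.streak >= needed:
            self.active = not self.active
            self.streak = 0
            return "open" if self.active else "close"
        return "none"

lc = IncidentLifecycle()
conditions = [True, False, True, True, True, True, False, False, False]
print([lc.step(b) for b in conditions])
# ['none', 'none', 'none', 'none', 'open', 'none', 'none', 'none', 'close']
```

Note how the initial one-scan flicker (True then False) resets the streak and emits nothing, while a sustained breach produces exactly one "open" and, later, exactly one "close": a stable object rather than per-scan spam.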

4) Optional summarization

Once you can create stable incidents per signal, you can reduce alert surface area by summarizing related incidents into higher-level rollups using deterministic rules. The details are intentionally omitted.

5) Replay and evaluation

Because the pipeline is deterministic, you can run historical streams through it and compare outputs across versions. This enables regression tests, behavior audits, and safe iteration.

Testing discipline that actually works

Testing this kind of system is less about isolated unit tests and more about proving end-to-end behavior across time.

Golden outputs

Pick representative synthetic streams (or sanitized streams) and record the expected incident timeline output. When logic changes, you can see exactly what moved.
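A golden test can be as small as a synthetic stream paired with its recorded incident timeline. The detector here is a toy stand-in (the threshold and persistence values are invented), not the real pipeline:

```python
def detect(stream, threshold=5.0, persist=2):
    """Toy detector: emit ('open', scan_index) after `persist`
    consecutive samples above `threshold` (values are illustrative)."""
    events, streak = [], 0
    for i, x in enumerate(stream):
        streak = streak + 1 if x > threshold else 0
        if streak == persist:
            events.append(("open", i))
    return events

# The golden pair: a synthetic stream and its recorded expected timeline.
# When detection logic changes, the diff against `expected` shows exactly
# what moved — which scans opened earlier, later, or not at all.
GOLDEN = {
    "stream":   [1, 9, 9, 1, 9, 9, 9],
    "expected": [("open", 2), ("open", 5)],
}
assert detect(GOLDEN["stream"]) == GOLDEN["expected"]
```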

Deterministic ordering rules

If incidents can be emitted in different orders depending on timing, you do not have a testable system. Ordering is part of the contract.
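One way to make ordering part of the contract is an explicit, total sort key over emitted incidents (field names here are illustrative): ties on timestamp break deterministically on signal id, so the same incidents always serialize in the same order.

```python
# Hypothetical emitted incidents; two share a timestamp.
incidents = [
    {"ts": 100, "signal": "pump-7"},
    {"ts": 100, "signal": "fan-2"},
    {"ts": 90,  "signal": "pump-7"},
]

# Total order: timestamp first, then signal id as a deterministic tie-break.
ordered = sorted(incidents, key=lambda inc: (inc["ts"], inc["signal"]))
print([inc["signal"] for inc in ordered])  # ['pump-7', 'fan-2', 'pump-7']
```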

Portable test vectors

Test inputs should be small, explicit, and runnable anywhere.

If you cannot run the same test vector locally and in CI and get the same output, you are not deterministic.

Closing thought

Raw telemetry is data. Incidents are meaning. The point is to create meaning that is stable under noise, predictable under replay, and bounded under scale.

One unavoidable reality behind all of this is sampling. Telemetry is discrete:

fs = 1 / Δt

That fact alone explains a large fraction of the artifacts people see in real systems.
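A worked instance of that constraint, assuming a hypothetical 30-second scan interval: the stream can only faithfully represent oscillations slower than half the sampling rate, and anything faster shows up as an artifact.

```python
# fs = 1 / Δt, applied to an assumed 30-second scan interval.
dt = 30.0            # seconds between scans (illustrative)
fs = 1.0 / dt        # sampling rate: ~0.0333 Hz
nyquist = fs / 2.0   # fastest representable frequency: ~0.0167 Hz,
                     # i.e. one full cycle per minute
print(fs, nyquist)
```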
