How Machine Learning Builds a Segment-Level Risk Score for Water Mains

At some point in the evaluation of any predictive analytics platform, a utility's engineering leadership will be asked to defend the methodology in front of a city council or utilities board. The question is usually some version of: "Why should we trust this score?" It's a fair question. This post is an attempt to answer it honestly — explaining what the model does, how it's validated, and where its limits are.

We've deliberately written this for engineers who need to understand the methodology well enough to explain it and defend it, not for data scientists. The technical depth here is enough to evaluate the approach rigorously; it's not a derivation.

What the model is predicting

The Watsynq model predicts the probability that a specific pipe segment will experience a structural failure event — a break, joint separation, or leak requiring emergency excavation — within a defined forward window (typically 90 days or 12 months, depending on the output layer). The output is a 0–100 risk score, where higher scores indicate higher predicted failure probability in that window.

The model is not predicting the exact date or location of a break within a segment. It's not predicting failure caused by a specific external event (a contractor strike, an unusual frost event). It's predicting the statistical failure hazard of a segment relative to other segments in the same network, based on the features available at prediction time.

Feature engineering: what goes into the model

The model inputs are organized into four feature groups:

Pipe attribute features: Installation year (converted to age as of the prediction date); pipe material encoded categorically, with unlined cast iron, lined cast iron, ductile iron, asbestos cement (AC), steel, and PVC each treated as a distinct category; nominal diameter; segment length; a lining presence flag; and a binary flag for segments with unknown installation year. Material × age interactions are explicitly engineered: a 60-year-old unlined cast iron segment is treated differently from a 60-year-old ductile iron segment.

Hydraulic and operational features: These are derived from SCADA historian data via the segment-pressure association table (described in our SCADA-GIS integration post). Features include: rolling 30-day pressure variance at the associated measurement point; maximum transient amplitude in the prior 90-day window; transient event frequency per week; number of pressure exceedances above the segment's normal operating range; and a 12-month trend in pressure variance (increasing variance over time is a break precursor signal).

Soil and environmental features: SSURGO soil corrosivity class at the pipe location; SSURGO shrink-swell group (important for arid Southwest climates); depth to caliche (where available from soil boring data); a seasonal soil moisture index derived from NOAA and ADWR drought monitor data; and surface elevation (a proxy for depth-to-water-table in some geographic contexts).

Spatial and break history features: Break event count for the segment in the prior 5-year window; time since last break event; count of break events in a 300-meter spatial neighborhood in the prior 5 years (spatial autocorrelation in breaks is a real phenomenon — segments near recent breaks are at elevated risk regardless of their own break history); and a corridor risk index derived from historical break density along the same street or pipeline corridor.
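To make a few of these features concrete, here is a minimal sketch of three of them: a material × age interaction key, the rolling 30-day pressure variance, and the 300-meter neighborhood break count. The function names, the 20-year age bands, and the input shapes (projected coordinates in meters, one pressure reading per day) are assumptions made for illustration, not a description of the production pipeline.

```python
from math import hypot
from statistics import pvariance


def material_age_feature(material, install_year, as_of_year):
    """Pair material with an age band so each material gets its own age effect.

    The 20-year banding is an illustrative assumption.
    """
    if install_year is None:
        return (material, "unknown")
    age = as_of_year - install_year
    band = min(age // 20, 5) * 20  # bands of 0, 20, ..., 100 (100 means 100+)
    return (material, band)


def rolling_pressure_variance(daily_pressures, window=30):
    """Population variance of the most recent `window` daily readings."""
    recent = daily_pressures[-window:]
    if len(recent) < 2:
        return None  # not enough data to compute a variance
    return pvariance(recent)


def neighborhood_break_count(segment_xy, breaks, as_of_year,
                             radius_m=300.0, years=5):
    """Count breaks within `radius_m` of the segment in the prior `years`.

    `breaks` is a list of (x, y, year) tuples in the same projected
    coordinate system as `segment_xy` (meters). This simple version does
    not exclude breaks on the segment itself.
    """
    x0, y0 = segment_xy
    return sum(
        1
        for (x, y, year) in breaks
        if as_of_year - years < year <= as_of_year
        and hypot(x - x0, y - y0) <= radius_m
    )
```

With the interaction key, a 60-year-old unlined cast iron segment and a 60-year-old ductile iron segment land in different categories even though their ages match, which is the behavior the feature description calls for.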

Model architecture and training approach

The core model is a gradient-boosted decision tree ensemble (XGBoost), trained on historical break events as binary labels (break in the forward window = 1, no break = 0). The training set is constructed with temporal care: we use a sliding window approach where the model is trained on features as of a historical date and evaluated on outcomes in the forward period — never allowing future information to leak into the training features.
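The leakage rule can be sketched in a few lines. In this illustrative example, every feature for a training row is computed only from events on or before a cutoff date, and the label comes only from the forward window after it. The function name `build_training_rows`, the single `prior_break_count` feature, and the row layout are hypothetical.

```python
from datetime import date, timedelta


def build_training_rows(segment_ids, break_events, cutoff_dates,
                        horizon_days=365):
    """Build one labeled row per (segment, cutoff date).

    `break_events` is a list of (segment_id, break_date) tuples.
    Features use only events dated on or before the cutoff; the label
    uses only events in (cutoff, cutoff + horizon_days].
    """
    rows = []
    for cutoff in cutoff_dates:
        window_end = cutoff + timedelta(days=horizon_days)
        for seg in segment_ids:
            seg_breaks = [d for s, d in break_events if s == seg]
            prior = [d for d in seg_breaks if d <= cutoff]
            label = int(any(cutoff < d <= window_end for d in seg_breaks))
            rows.append({
                "segment_id": seg,
                "cutoff": cutoff,
                "prior_break_count": len(prior),  # past-only feature
                "label": label,                   # future-only outcome
            })
    return rows
```

Sliding the cutoff forward (for example, one cutoff per year) yields several rows per segment, each of which respects the same past/future boundary.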

The class imbalance problem is significant: even in a utility with a high break rate, only 2–8% of segments experience a break in any given 12-month window. We address this by oversampling the minority class (breaks) during training and then calibrating the predicted probability outputs, because oversampling on its own inflates the raw model probabilities. The final 0–100 score is a calibrated probability estimate: after calibration, roughly 75% of segments scored at 75 should, empirically, experience a break within the forward window.
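One way to check the calibration claim on held-out data is a reliability table: group segments into score bands and compare the mean predicted probability in each band with the observed break rate. The sketch below is a generic check, not the calibration procedure itself; the function name and the 10-point bands are assumptions.

```python
from collections import defaultdict


def reliability_table(scores, outcomes, bin_width=10):
    """Compare predicted and observed break rates per score band.

    `scores` are 0-100 risk scores; `outcomes` are 1 (break in the
    forward window) or 0 (no break), in the same order.
    """
    bins = defaultdict(lambda: [0, 0.0, 0])  # count, score sum, break count
    for score, broke in zip(scores, outcomes):
        # A score of exactly 100 is folded into the top band.
        lower = min(int(score // bin_width) * bin_width, 100 - bin_width)
        entry = bins[lower]
        entry[0] += 1
        entry[1] += score
        entry[2] += broke
    return [
        {
            "bin_lower": lower,
            "n": n,
            "mean_predicted": total / n / 100.0,
            "observed_rate": breaks / n,
        }
        for lower, (n, total, breaks) in sorted(bins.items())
    ]
```

If the scores are well calibrated, `mean_predicted` and `observed_rate` should be close in every band that contains enough segments to give a stable rate.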

The model is retrained monthly as new operational data accumulates. Each retraining incorporates the most recent break events as new labeled training examples, which means the model continuously updates its understanding of how the current feature distribution relates to actual outcomes in the specific utility's network.

Validation: how do you know the scores are right?

This is the critical question for utility leadership. The validation approach for a time-series predictive model requires temporal holdout — evaluating the model on a period that was not in the training data. For a utility with 10 years of break records, we typically train on years 1–8 and evaluate on years 9–10, then look at the rank-order correlation between predicted risk scores (as of the end of year 8) and actual breaks in years 9–10.

The metric we report is top-decile lift. We first measure the capture rate: what fraction of all breaks that occurred in years 9–10 fell within the top 10% of segments by risk score as of the prediction date? Lift is that capture rate divided by the 10% that a random selection of segments would be expected to capture. For networks with adequate data coverage, we typically see top-decile lift of 3.5–6x, meaning the top 10% of segments by risk score accounted for 35–60% of actual breaks in the validation period.
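The top-decile calculation is simple enough to reproduce independently from a table of scores and validation-period break counts. The sketch below assumes one score and one break count per segment; the function name is hypothetical.

```python
def top_decile_lift(scores, break_counts, fraction=0.10):
    """Return (capture_rate, lift) for the top `fraction` of segments.

    `scores[i]` is the risk score of segment i as of the prediction
    date; `break_counts[i]` is its number of breaks in the validation
    period.
    """
    n = len(scores)
    k = max(1, round(n * fraction))  # number of segments in the top slice
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    total = sum(break_counts)
    if total == 0:
        raise ValueError("no breaks in the validation period")
    captured = sum(break_counts[i] for i in ranked[:k])
    capture_rate = captured / total
    # Lift compares the capture rate with the k/n share that a random
    # selection of k segments would be expected to capture.
    lift = captured * n / (total * k)
    return capture_rate, lift
```

For example, with 20 segments where the two highest-scored segments account for 3 of 6 validation-period breaks, the capture rate is 50% and the lift is 5x.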

We share this validation analysis, specific to each utility's own data, as part of the 90-day pilot deliverable. You shouldn't accept a risk score as meaningful without seeing the validation analysis for your own network. Any vendor that can't or won't produce it should not be trusted with your capital replacement decisions.

What the model gets wrong — and why it matters to know

The model performs worst in three conditions. First, segments with sparse break history in a network that hasn't been operated long enough to generate adequate training data — usually newer networks or segments that have been extensively replaced. Second, segments where the failure mechanism is genuinely unusual — a contractor strike against an unlined cast iron main during road construction, for example, is not predictable from the operational features the model uses. Third, segments with poor data quality — unknown installation year, missing pressure coverage, GIS geometry that doesn't accurately reflect the field condition.
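A low-confidence flag for the data-driven conditions can be as simple as a list of reasons attached to each segment. The sketch below is purely illustrative: the field names (`install_year`, `has_pressure_coverage`, `history_years`, `gis_geometry_verified`) and the five-year history threshold are hypothetical, and this is not a description of how the Watsynq flag is computed.

```python
def low_confidence_reasons(segment, min_history_years=5):
    """List the data-quality reasons a segment's score deserves extra scrutiny.

    `segment` is a dict using hypothetical field names; an empty list
    means none of these checks fired.
    """
    reasons = []
    if segment.get("install_year") is None:
        reasons.append("unknown installation year")
    if not segment.get("has_pressure_coverage", False):
        reasons.append("missing pressure coverage")
    if segment.get("history_years", 0) < min_history_years:
        reasons.append("sparse break history")
    if segment.get("gis_geometry_verified") is False:
        reasons.append("unverified GIS geometry")
    return reasons
```

Unusual failure mechanisms such as contractor strikes cannot be flagged this way, since nothing in the segment's data distinguishes them before the event.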

These limitations are real and your council should know about them. The honest framing is this: the model improves on a random or age-only prioritization substantially in the segments where it has adequate data, and it flags segments where it has low confidence so that field judgment can fill the gap. It doesn't replace engineering judgment — it informs it with better data than was previously available.

Nadia Vasquez is Head of Data Science at Watsynq.