Your SCADA historian has years of pressure, flow, and chlorine residual readings indexed by measurement point ID. Your GIS has pipe geometry, installation records, material classifications, and break history indexed by asset ID. Neither system knows the other exists. The failure prediction signal you need lives at the join between them, and that join is where most utilities leave the most analytical value on the table.
This isn't a new problem. The siloed relationship between SCADA and GIS in water utilities is a legacy of procurement history: SCADA platforms were bought from one vendor, configured by OT specialists, and owned by operations. GIS was bought from another vendor, configured by GIS analysts, and owned by engineering or planning. The organizational boundaries reinforced the data boundaries, and now utilities that want to do serious analytics have to bridge a gap that neither original system was designed to accommodate.
What you need at the segment level — and why neither system has it alone
Meaningful pipe failure prediction requires, at minimum, associating each pipe segment with its time-series pressure history. A "pipe segment" in AWWA M36 terms is a discrete section of pipe between two fittings, tees, or valves — typically 50 to 400 meters in a municipal distribution main. A SCADA pressure sensor covers a node or zone, not a specific segment. The join problem is: which sensor reading most closely represents the hydraulic conditions experienced by which specific pipe segment?
This requires spatial reasoning: identifying the nearest upstream and downstream pressure measurement points for each segment, accounting for the network topology (which nodes are hydraulically connected, through what sequence of pipe, at what elevation), and weighting the pressure time series accordingly. That topology lives in GIS, not SCADA. You cannot do this join reliably using the sensor's physical location alone — two sensors that are geographically close may be hydraulically separated by a closed valve. The GIS model of the network, including valve status and zone boundaries, is the authoritative source.
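To make the closed-valve point concrete, here is a minimal sketch of a topology-aware sensor lookup. It assumes the GIS network has already been exported as a list of edges with a length and an open/closed flag; the function name, the edge tuple layout, and the node IDs are all hypothetical. It finds the nearest sensor by pipe length along open paths only, and it ignores elevation and any weighting between multiple sensors, both of which a real assignment would need.

```python
import heapq
from collections import defaultdict


def nearest_connected_sensor(start_node, edges, sensor_nodes):
    """Return (sensor_node, path_length_m) for the closest sensor reachable
    through open pipe, or None if no sensor is hydraulically connected.

    edges: iterable of (node_a, node_b, length_m, is_open) tuples (assumed layout).
    """
    graph = defaultdict(list)
    for a, b, length_m, is_open in edges:
        if not is_open:
            continue  # a closed valve breaks the hydraulic connection
        graph[a].append((b, length_m))
        graph[b].append((a, length_m))

    # Dijkstra's shortest-path search over the open network.
    best = {start_node: 0.0}
    queue = [(0.0, start_node)]
    while queue:
        dist, node = heapq.heappop(queue)
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry
        if node in sensor_nodes:
            return node, dist
        for neighbor, length_m in graph[node]:
            new_dist = dist + length_m
            if new_dist < best.get(neighbor, float("inf")):
                best[neighbor] = new_dist
                heapq.heappush(queue, (new_dist, neighbor))
    return None


edges = [
    ("J1", "J2", 120.0, True),
    ("J2", "PS_A", 300.0, True),   # sensor A: 420 m away along open pipe
    ("J1", "PS_B", 40.0, False),   # sensor B: 40 m away, behind a closed valve
]
result = nearest_connected_sensor("J1", edges, {"PS_A", "PS_B"})
print(result)
```

In this toy network the geographically closer sensor, PS_B, is never considered because the only pipe leading to it is closed, so the lookup returns PS_A.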
The pipe attributes needed for failure prediction — installation year, material, diameter, lining type, soil classification at the pipe location, proximity to previous breaks — are all in GIS. The dynamic operational signals — pressure transient frequency, sustained pressure deviation, flow anomalies that may indicate an active leak — are all in SCADA. A model that uses only one set of inputs will underperform one that fuses both.
The practical data integration architecture
The integration doesn't require replacing either system. The practical architecture has three layers:
Layer 1 — Segment feature table: A static table built from GIS, with one row per pipe segment, containing the pipe attributes needed for the model: segment ID, geometry (line WKT or centroid coordinates), length, installation year, material, diameter, lining (or "unlined"), soil shrink-swell classification (joined from SSURGO), break event count and dates, proximity to nearest previous break event. This table is updated when GIS data is refreshed — typically quarterly at minimum for a utility with an active capital program.
Layer 2 — Pressure zone assignment: A mapping table that assigns each pipe segment to its nearest hydraulically relevant pressure measurement point(s). This is built once using network topology analysis, validated against known pressure zone boundaries, and updated when zone configurations change (new pressure reducing valves, zone reconfigurations, etc.). Getting this right requires a water engineer who understands the system — it can't be fully automated.
Layer 3 — Time-series feature extraction: A pipeline that reads SCADA historian data, applies the zone assignment mapping to associate pressure signals with segments, and computes the engineered features needed for the model: rolling average pressure, pressure variance, transient event frequency, max transient amplitude, flow anomaly flags. These are computed on a rolling window (typically 7-day, 30-day, and 90-day) and joined to the segment feature table to produce the model input matrix.
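The Layer 3 step can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not a production pipeline: readings are assumed to arrive as evenly spaced lists per sensor, windows are expressed as sample counts rather than days, and a "transient" is defined here simply as a step between consecutive samples at or above a threshold. The function names, feature names, and threshold are hypothetical.

```python
from statistics import fmean, pvariance


def window_features(readings_kpa, window, transient_threshold_kpa):
    """Compute summary features over the trailing `window` samples."""
    recent = readings_kpa[-window:]
    # Step size between consecutive samples (assumed transient definition).
    steps = [abs(b - a) for a, b in zip(recent, recent[1:])]
    transients = [s for s in steps if s >= transient_threshold_kpa]
    return {
        "mean_kpa": fmean(recent),
        "variance_kpa2": pvariance(recent),
        "transient_count": len(transients),
        "max_transient_kpa": max(transients, default=0.0),
    }


def segment_features(segment_to_sensor, sensor_readings, windows, threshold_kpa):
    """Apply the zone assignment mapping, then compute features per window."""
    rows = {}
    for segment_id, sensor_id in segment_to_sensor.items():
        row = {}
        for label, n_samples in windows.items():
            stats = window_features(sensor_readings[sensor_id], n_samples, threshold_kpa)
            for name, value in stats.items():
                row[f"{name}_{label}"] = value
        rows[segment_id] = row
    return rows


# Toy data: two segments mapped to one sensor by the Layer 2 table.
segment_to_sensor = {"SEG-001": "PS_A", "SEG-002": "PS_A"}
sensor_readings = {"PS_A": [500, 502, 498, 560, 501, 500]}
features = segment_features(
    segment_to_sensor, sensor_readings, {"short": 4, "long": 6}, threshold_kpa=50
)
print(features["SEG-001"])
```

With 1-minute logging, the 7-day, 30-day, and 90-day windows would correspond to 10,080, 43,200, and 129,600 samples. The resulting rows are keyed by segment ID, so they join directly to the Layer 1 segment feature table.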
The actual data volumes are manageable. A utility with 1,000 pipe segments and 50 SCADA pressure points, logging at 1-minute intervals, generates roughly 26 million raw time-series readings per year. The engineered feature set that the model consumes at any one time is orders of magnitude smaller: a few dozen features per segment, refreshed daily, which works out to tens of thousands of values rather than tens of millions of readings.
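The arithmetic behind those figures is easy to check. In the sketch below, the count of 36 features per segment is an assumed stand-in for "a few dozen"; the other inputs come from the example above.

```python
sensors = 50
minutes_per_year = 60 * 24 * 365
raw_readings_per_year = sensors * minutes_per_year

segments = 1_000
features_per_segment = 36  # assumption: stand-in for "a few dozen"
daily_feature_values = segments * features_per_segment

print(f"raw readings per year:  {raw_readings_per_year:,}")
print(f"feature values per day: {daily_feature_values:,}")
print(f"ratio: {raw_readings_per_year / daily_feature_values:.0f}x")
```

Under these assumptions, the feature matrix the model reads on a given day is several hundred times smaller than one year of raw history.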
Common data quality failures and how to handle them
In practice, nearly every utility we've worked with has at least one of the following data quality issues that must be addressed before integration produces reliable results:
- GIS pipe records with unknown installation year: Anywhere from 5% to 25% of pipe records in aging systems have null or estimated installation dates. These records need explicit handling: impute the year from the vintage of surrounding segments, flag the record as high-uncertainty, or exclude it from the training set.
- SCADA sensor drift or calibration gaps: Pressure sensors that haven't been calibrated in 2–3 years may have systematic offsets. A stable offset can be absorbed by per-sensor feature normalization, but erratic readings around sensor failure events need to be identified and removed from the training window.
- Break records in CMMS that don't match GIS segment IDs: The work order system and GIS often use different asset identifiers. A join that fails silently — where a break event is recorded in CMMS against an asset ID that doesn't exist in GIS — produces a training set with artificially low break frequency for the affected segments. This is a model quality issue, not just a data cleanliness issue.
- Valve status records not current in GIS: If the hydraulic zone assignment is built from GIS valve status records that haven't been updated to reflect recent valve operations, the pressure zone assignments will be wrong for the affected segments. Any planned valve change should trigger a review of the zone assignments it touches.
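Of the issues above, the silent CMMS-to-GIS join failure is the easiest to guard against in code. The sketch below assumes break records are dictionaries with an `asset_id` field and that an optional crosswalk from CMMS identifiers to GIS segment IDs exists; the function name, field names, and IDs are hypothetical. The point is that unmatched records are returned explicitly instead of being dropped.

```python
def match_breaks_to_segments(break_records, gis_segment_ids, id_crosswalk=None):
    """Split break records into those that resolve to a GIS segment and
    those that do not, so join failures are visible rather than silent."""
    crosswalk = id_crosswalk or {}
    matched, unmatched = [], []
    for record in break_records:
        # Translate the CMMS identifier if a crosswalk entry exists.
        segment_id = crosswalk.get(record["asset_id"], record["asset_id"])
        if segment_id in gis_segment_ids:
            matched.append({**record, "segment_id": segment_id})
        else:
            unmatched.append(record)
    return matched, unmatched


breaks = [
    {"asset_id": "WM-001", "date": "2021-03-04"},
    {"asset_id": "CMMS-77", "date": "2022-01-15"},  # resolves via crosswalk
    {"asset_id": "WM-999", "date": "2022-07-30"},   # no such segment in GIS
]
gis_ids = {"WM-001", "WM-002"}
matched, unmatched = match_breaks_to_segments(
    breaks, gis_ids, id_crosswalk={"CMMS-77": "WM-002"}
)
print(f"matched: {len(matched)}, unmatched: {len(unmatched)}")
```

Tracking the unmatched count over time is a simple health check: if it rises, the training set is quietly losing break history.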
What the integrated system makes possible
Once the SCADA-GIS join is established and data quality issues are resolved, the analytical capabilities expand substantially. The immediate application is failure prediction — which segments to prioritize for inspection or replacement. But the integrated dataset also supports:
NRW (non-revenue water) attribution: Comparing expected flow at zone entry points against metered consumption, segmented by pressure zone and correlated with pipe age and condition in that zone, gives a more granular view of where losses are concentrated than a system-wide NRW calculation. A growing utility with a 15% apparent NRW rate needs to know whether that loss is concentrated in a few aging zones or distributed across the system — the integrated model answers that.
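A simplified sketch of the zone-level calculation follows. It treats NRW as zone inflow minus metered consumption, which ignores inter-zone transfers and unbilled authorized use; the zone names and volumes are invented for illustration, chosen so that the system-wide rate is 15%.

```python
def nrw_by_zone(zone_inflow_m3, zone_metered_m3):
    """Return loss volume and NRW fraction per zone (simplified definition)."""
    results = {}
    for zone, inflow in zone_inflow_m3.items():
        metered = zone_metered_m3.get(zone, 0.0)
        loss = inflow - metered
        results[zone] = {
            "loss_m3": loss,
            "nrw_fraction": loss / inflow if inflow else None,
        }
    return results


inflow = {"Z1": 100_000, "Z2": 100_000}   # illustrative annual volumes
metered = {"Z1": 95_000, "Z2": 75_000}
zones = nrw_by_zone(inflow, metered)

system_nrw = (sum(inflow.values()) - sum(metered.values())) / sum(inflow.values())
print(f"system-wide NRW: {system_nrw:.0%}")
for zone, stats in zones.items():
    print(f"{zone}: {stats['nrw_fraction']:.0%}")
```

Here the same 15% system-wide figure hides one zone at 5% and another at 25%, which is exactly the distinction a system-wide calculation cannot make.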
C-factor degradation tracking: Hydraulic modeling calibration requires knowing the actual Hazen-Williams C-factor for each pipe segment. As unlined cast iron tuberculates over time, the C-factor degrades from an initial value around 130 for new pipe to values below 80 for severely tuberculated segments. The SCADA flow data, when properly joined to GIS pipe geometry, supports C-factor estimation without requiring physical inspections of every main — a significant operational efficiency for utilities preparing hydraulic model updates.
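As an illustration of the estimation step, the Hazen-Williams equation in SI units, h_f = 10.67 · L · Q^1.852 / (C^1.852 · d^4.8704), can be rearranged to solve for C once flow and head loss along a segment are known. The sketch below assumes the head loss has already been derived from SCADA pressures at both ends of the segment and corrected for elevation; the function names and input values are hypothetical.

```python
def hazen_williams_head_loss(flow_m3s, c_factor, length_m, diameter_m):
    """Head loss in meters from the SI form of the Hazen-Williams equation."""
    return 10.67 * length_m * flow_m3s ** 1.852 / (
        c_factor ** 1.852 * diameter_m ** 4.8704
    )


def estimate_c_factor(flow_m3s, head_loss_m, length_m, diameter_m):
    """Invert Hazen-Williams to estimate C from observed flow and head loss."""
    if min(flow_m3s, head_loss_m, length_m, diameter_m) <= 0:
        raise ValueError("all inputs must be positive")
    ratio = 10.67 * length_m * flow_m3s ** 1.852 / (
        head_loss_m * diameter_m ** 4.8704
    )
    return ratio ** (1 / 1.852)


# Round-trip check: generate head loss for a known C, then recover it.
loss = hazen_williams_head_loss(
    flow_m3s=0.05, c_factor=100, length_m=300, diameter_m=0.3
)
c_est = estimate_c_factor(
    flow_m3s=0.05, head_loss_m=loss, length_m=300, diameter_m=0.3
)
print(f"head loss: {loss:.2f} m, recovered C: {c_est:.1f}")
```

In practice the estimate is only as good as the head-loss measurement, so short segments with small pressure differences will produce noisy C values.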
We're not suggesting that SCADA-GIS integration alone solves the asset management problem. It doesn't replace the engineering judgment required for capital planning, and it doesn't eliminate the need for physical condition assessment of high-risk segments. What it does is make the data that utilities are already collecting actually useful for decisions that currently rely on intuition, relationships, and accident: the kind of decisions that get made differently at 2 a.m. when a main has just broken and the crew needs to know where to dig.
Nadia Vasquez is Head of Data Science at Watsynq.