Every forecast the system makes is recorded and later graded against what the river actually did — this page is that report card. It currently scores 272 physics crest predictions, 484 empirical likelihood forecasts, and 226 recession countdowns (647 still maturing). Updated daily; treat it as experimental.
Predicts the crest a gauge will reach from rainfall + antecedent moisture. Scored on how close the predicted peak was to the actual peak.
| Predictor | Scored | Avg error (MAE) | Bias (mean / median) | Within ±20% |
|---|---|---|---|---|
| Cossatot River | 159 | 0.53 ft | 0.33 / 0.24 ft | 76.0% |
| Richland Creek | 70 | 0.91 ft | 0.43 / 0.26 ft | 53.0% |
| Hailstone (upper Buffalo) | 43 | 455.1 cfs | -377.4 / 14.9 cfs | 35.0% |
Bias is the mean / median signed error (predicted − actual). A large gap between them means a few outlier events — often a single flash-flood onset — are dragging the mean; the median is the typical miss.
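
As a rough illustration of how these columns relate, the sketch below computes MAE, mean and median bias, and the within ±20% share from (predicted, actual) pairs. The function name `score_crests` and the sample flows are hypothetical, not taken from the scorecard's code or data, and measuring the ±20% band relative to the actual peak is an assumption.

```python
from statistics import mean, median

def score_crests(pairs, tolerance=0.20):
    """Summarize (predicted, actual) crest pairs.

    Hypothetical helper: assumes the +/-20% band is measured
    relative to the actual peak, which the page does not state.
    """
    errors = [pred - act for pred, act in pairs]  # signed error: predicted - actual
    within = [abs(pred - act) <= tolerance * abs(act) for pred, act in pairs]
    return {
        "mae": mean(abs(e) for e in errors),
        "bias_mean": mean(errors),
        "bias_median": median(errors),
        "within_pct": 100.0 * sum(within) / len(pairs),
    }

# Made-up flows (cfs): three close calls plus one badly missed flash-flood onset.
sample = [(510, 500), (620, 600), (415, 400), (900, 3000)]
stats = score_crests(sample)
```

With these made-up numbers the mean bias is -513.75 cfs while the median is 12.5 cfs: one missed onset drags the mean far from the typical miss, the same pattern as the Hailstone row.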
[Charts: Cossatot River, Richland Creek, Hailstone (upper Buffalo)]
Predicts the likelihood of a rise to a given level from recent rainfall. Hit-rate = share of forecasts where the river rose to at least the called level; no-rise rate = share where the forecast rise never happened (lower is better).
| Basin | Forecasts | Hit-rate (rose to ≥ called level) | No-rise rate (rise never came) | Outcome detail |
|---|---|---|---|---|
| Big Piney Creek | 44 | 34.1% | 65.9% | 15 exact · 0 higher · 0 short · 29 none |
| Buffalo at Boxley | 106 | 21.7% | 78.3% | 11 exact · 12 higher · 0 short · 83 none |
| Cossatot River | 101 | 47.5% | 49.5% | 24 exact · 24 higher · 3 short · 50 none |
| Hailstone (upper Buffalo) | 62 | 29.0% | 71.0% | 10 exact · 8 higher · 0 short · 44 none |
| Mulberry River | 99 | 20.2% | 79.8% | 9 exact · 11 higher · 0 short · 79 none |
| Richland Creek | 72 | 12.5% | 87.5% | 2 exact · 7 higher · 0 short · 63 none |
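
The two rate columns can be reproduced from the outcome detail. The sketch below assumes, based only on the rows above and not on the engine's code, that a hit is an "exact" or "higher" outcome and that a "short" rise counts as neither a hit nor a no-rise. The function name `outcome_rates` is hypothetical.

```python
def outcome_rates(exact, higher, short, none):
    """Return (hit_rate, no_rise_rate) in percent, rounded to one decimal.

    Inferred from the table rows, not from the engine's code:
    hits are 'exact' + 'higher'; 'short' rises count as neither.
    """
    total = exact + higher + short + none
    hit_rate = 100.0 * (exact + higher) / total
    no_rise_rate = 100.0 * none / total
    return round(hit_rate, 1), round(no_rise_rate, 1)

# Cossatot River row: 24 exact, 24 higher, 3 short, 50 none
cossatot = outcome_rates(24, 24, 3, 50)
# Big Piney Creek row: 15 exact, 0 higher, 0 short, 29 none
big_piney = outcome_rates(15, 0, 0, 29)
```

Because "short" outcomes fall in neither bucket, the two rates only sum to 100% for basins with no short rises; Cossatot's 47.5% and 49.5% leave out its 3 short outcomes.
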
| Confidence | n | Hit-rate | No-rise rate |
|---|---|---|---|
| high | 134 | 51.5% | 48.5% |
| medium | 132 | 34.1% | 64.4% |
| low | 100 | 19.0% | 80.0% |
| Band | n | Hit-rate | No-rise rate |
|---|---|---|---|
| p25_to_p50 | 179 | 35.2% | 64.8% |
| p50_to_p75 | 118 | 50.8% | 47.5% |
| above_p75 | 69 | 14.5% | 84.1% |
A well-calibrated engine would show hit-rate rising with the band. Where it does not (here, above_p75 at 14.5%), the engine is over-warning, a known selection-bias issue tracked for recalibration.
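
The band check described above can be sketched as a simple monotonicity test. The hit-rates are copied from the band table; the function name `non_monotonic_bands` is hypothetical, and this is only one possible way to flag over-warning bands.

```python
# Hit-rates by percentile band, copied from the table above (lowest band first).
band_hit_rates = [
    ("p25_to_p50", 35.2),
    ("p50_to_p75", 50.8),
    ("above_p75", 14.5),
]

def non_monotonic_bands(rates):
    """Return bands whose hit-rate is lower than the band just below them."""
    flagged = []
    for (_, prev_rate), (band, rate) in zip(rates, rates[1:]):
        if rate < prev_rate:
            flagged.append(band)
    return flagged

flagged = non_monotonic_bands(band_hit_rates)
```

On the current numbers only `above_p75` is flagged.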
Predicts how long until a falling river drops to each threshold. HIT% = reached the level near the predicted time; MAE = average timing error (hours). Newly launched — most predictions are still maturing.
| Gauge | Target | Graded | HIT% | Never reached | Timing MAE | Bias |
|---|---|---|---|---|---|---|
| big_piney/above_longpool | low_floatable | 4 | 100.0% | 0.0% | 0.6h | 0.5h |
| big_piney/above_longpool | too_low | 23 | 48.0% | 52.0% | 24.0h | -24.0h |
| big_piney/below_longpool | low_floatable | 9 | 67.0% | 33.0% | 83.3h | -83.3h |
| big_piney/below_longpool | too_low | 9 | 0.0% | 100.0% | — | — |
| buffalo/ponca | low_floatable | 42 | 100.0% | 0.0% | 13.0h | -12.3h |
| cossatot/cossatot | low_floatable | 28 | 100.0% | 0.0% | 3.9h | -0.3h |
| cossatot/cossatot | too_low | 71 | 100.0% | 0.0% | 6.2h | -0.7h |
| mulberry/above_hwy_23 | low_floatable | 11 | 73.0% | 27.0% | 90.0h | -90.0h |
| mulberry/above_hwy_23 | too_low | 11 | 0.0% | 100.0% | — | — |
11 additional gauge/target combinations have fewer than 4 graded outcomes (mostly still-censored or single-sample) and are hidden until they accumulate enough data to be meaningful. Bias is predicted − actual, in hours; negative means the countdown expired early, before the river actually reached the level.
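
A minimal sketch of how the countdown columns could be derived from graded records. In the table above HIT% and Never reached sum to 100% in every row, so this sketch treats any reached target as a hit; the real grader may apply a timing tolerance. The function name `grade_countdowns`, the record layout, and the sample hours are all hypothetical.

```python
from statistics import mean

def grade_countdowns(records):
    """Grade (predicted_hours, actual_hours) pairs; actual is None if never reached."""
    reached = [(p, a) for p, a in records if a is not None]
    errors = [p - a for p, a in reached]  # signed: predicted - actual
    return {
        "hit_pct": 100.0 * len(reached) / len(records),
        "never_reached_pct": 100.0 * (len(records) - len(reached)) / len(records),
        "timing_mae_h": mean(abs(e) for e in errors) if errors else None,
        "bias_h": mean(errors) if errors else None,
    }

# Made-up countdowns: hours predicted vs. hours the river actually took.
sample = [(10.0, 16.0), (20.0, 22.0), (30.0, 28.0), (12.0, None)]
result = grade_countdowns(sample)
```

Here the bias is -2.0 h: on average the river took two hours longer to reach the level than the countdown predicted.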
Once rain arms the predictor, it forecasts how high the Ponca gauge will crest: a class (Fizzle/Moderate/High/Flood), a typical-peak band, and a flood-risk %. Each call is graded against Ponca's actual crest over the following 36 h.
| Calls graded | Events | Class exact | Class within 1 | Peak in IQR | Peak error (MAE) |
|---|---|---|---|---|---|
| 231 | 13 | 71% | 97% | 61% | 1149.0 cfs |
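
The class and band columns above can be sketched as follows, assuming the four classes are ordered Fizzle < Moderate < High < Flood and that "Peak in IQR" means the actual crest fell inside the predicted typical-peak band. The function name `grade_calls`, the record fields, and the sample calls (including their band limits) are invented for illustration; `ponca_analog_eval.py` may grade differently.

```python
CLASSES = ["Fizzle", "Moderate", "High", "Flood"]  # assumed order, smallest to largest

def grade_calls(calls):
    """Grade calls given as dicts with predicted/actual class, band limits, and actual peak."""
    n = len(calls)
    # Ordinal distance between predicted and actual class (0 = exact match).
    dist = [abs(CLASSES.index(c["predicted"]) - CLASSES.index(c["actual"])) for c in calls]
    in_band = [c["band_lo"] <= c["actual_peak"] <= c["band_hi"] for c in calls]
    return {
        "class_exact_pct": 100.0 * sum(d == 0 for d in dist) / n,
        "class_within_1_pct": 100.0 * sum(d <= 1 for d in dist) / n,
        "peak_in_band_pct": 100.0 * sum(in_band) / n,
    }

# Invented calls (flows in cfs); the band limits are not from the scorecard.
sample = [
    {"predicted": "Moderate", "actual": "Moderate", "band_lo": 400, "band_hi": 900, "actual_peak": 758},
    {"predicted": "Flood", "actual": "High", "band_lo": 2000, "band_hi": 6000, "actual_peak": 1020},
    {"predicted": "Flood", "actual": "Moderate", "band_lo": 2000, "band_hi": 6000, "actual_peak": 224},
    {"predicted": "Moderate", "actual": "Moderate", "band_lo": 300, "band_hi": 800, "actual_peak": 493},
]
grades = grade_calls(sample)
```
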
Flood-risk calibration (Brier score, lower is better; 193 calls since the raw analog was also logged): override floor 0.163 vs. raw k-NN 0.092. The raw analog is currently the better calibrated of the two, because the deterministic override floor stays high through the post-crest recession. Small sample, directional only.
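
For reference, the Brier score is the mean squared difference between the forecast probability and the 0/1 outcome. The sketch below uses made-up probabilities to show why a floor that stays high after the crest scores worse than an estimate that decays; none of these numbers come from the logged calls.

```python
def brier_score(probabilities, outcomes):
    """Mean squared difference between forecast probability and the 0/1 outcome."""
    if len(probabilities) != len(outcomes):
        raise ValueError("need one outcome per forecast")
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(outcomes)

# Made-up example: one flood (1) and three non-flood calls (0).
outcomes = [1, 0, 0, 0]
floor_probs = [0.85, 0.85, 0.85, 0.85]  # floor held high on every call
raw_probs = [0.85, 0.30, 0.20, 0.10]    # raw estimate that decays

floor_brier = brier_score(floor_probs, outcomes)
raw_brier = brier_score(raw_probs, outcomes)
```

In this toy case the held floor scores about 0.55 and the decaying estimate about 0.04, the same direction as the 0.163 vs. 0.092 gap reported above.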
| Event (UTC) | Calls | Predicted class | Max flood-risk | Actual crest | Actual class |
|---|---|---|---|---|---|
| 2026-06-29T13:02 | 1 | Flood | 85% | 224 cfs | Moderate |
| 2026-06-27T13:02 | 52 | Moderate | 55% | 758 cfs | Moderate |
| 2026-06-24T20:02 | 50 | Moderate | 6% | 493 cfs | Moderate |
| 2026-06-23T09:02 | 4 | Flood | 85% | 1020 cfs | High |
| 2026-06-23T08:02 | 4 | Flood | 85% | 1070 cfs | High |
| 2026-06-23T07:02 | 4 | Flood | 85% | 1110 cfs | High |
For each Buffalo mainstem gauge, predicts a coming rise (slight / moderate / large) from local rain + upstream propagation, with a timing window. Graded on whether the gauge actually rose and whether the rise came within the predicted window. Newly recording: predictions only fire during rain events, so this fills in over time.
| Gauge | Graded | Rise happened | On-time (share of rises) | No-rise count (false alarms) |
|---|---|---|---|---|
| boxley | 0 | — | — | 0 |
| ponca | 22 | 36% | 75% | 14 |
| pruitt | 31 | 45% | 57% | 17 |
| st_joe | 14 | 64% | 11% | 5 |
| harriet | 20 | 25% | 40% | 15 |
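
The percentages in this table are consistent with "Rise happened" being a share of graded calls and "On-time" being a share of the rises that happened. The sketch below reproduces two rows under that reading. The function name `rise_summary` is hypothetical, and the on-time counts are back-calculated from the published percentages, not read from the grader.

```python
def rise_summary(graded, no_rise, on_time):
    """Return (rise_pct, on_time_pct) as whole percents.

    Assumes on-time is a share of the rises that happened, not of all
    graded calls; this is inferred from the table, not from the grader.
    """
    if graded == 0:
        return None, None  # nothing graded yet, so neither rate is defined
    rose = graded - no_rise
    rise_pct = round(100 * rose / graded)
    on_time_pct = round(100 * on_time / rose) if rose else None
    return rise_pct, on_time_pct

# ponca row: 22 graded, 14 no-rise; 6 on-time is back-calculated from 75% of 8 rises.
ponca = rise_summary(22, 14, 6)
# st_joe row: 14 graded, 5 no-rise; 1 on-time is back-calculated from 11% of 9 rises.
st_joe = rise_summary(14, 5, 1)
```
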
Typical actual rise by predicted category: slight: ~483.5 cfs (n=12), moderate: ~410.0 cfs (n=24). (Categories come from rainfall, so this is how they map to real gauge rises — calibration that accrues over time.)
| When (UTC) | Gauge | Predicted | Window | Outcome | Actual rise |
|---|---|---|---|---|---|
| 2026-06-28T21:09 | harriet | moderate | 5.0-10.0h | rose | +410.0 cfs @ 8.3h |
| 2026-06-28T20:09 | harriet | moderate | 5.0-10.0h | rose | +400.0 cfs @ 9.3h |
| 2026-06-28T19:09 | harriet | moderate | 5.0-10.0h | rose | +410.0 cfs @ 10.3h |
| 2026-06-28T18:09 | harriet | moderate | 5.0-10.0h | rose | +400.0 cfs @ 11.3h |
| 2026-06-28T17:09 | harriet | moderate | 5.0-10.0h | rose | +400.0 cfs @ 12.3h |
| 2026-06-28T05:09 | st_joe | moderate | 8.0-16.0h | rose | +410.0 cfs @ 15.6h |
When Ponca rises, predicts whether and how big a bump reaches Pruitt and St. Joe, and how many hours after Ponca peaks. Graded on the bump/no-bump call, the crest size, and the timing. Newly recording — only logs during a Ponca rise.
| Reach | Graded | Bump call right | Crest size hit | Timing hit |
|---|---|---|---|---|
| pruitt | 3 | 33% | 0% | 0% |
| st_joe | 3 | 33% | 0% | 0% |
What the scorecard grades today, and what is still being wired into the loop:
| Predictor family | Status | Notes |
|---|---|---|
| Physics rise predictors | graded | Cossatot, Richland, Hailstone — predicted crest height/flow vs. the actual peak. |
| Empirical forecast engine | graded | 6 basins — 'likelihood of rise to tier X' vs. the tier the gauge actually reached. |
| Recession countdowns | graded | 9 gauges — maturing; the longest horizons settle ~7-10 days after they're issued. |
| Ponca AI Rainfall Event Analysis | graded | Each armed call graded vs Ponca's actual crest over the next 36 h (ponca_analog_eval.py). |
| Downstream propagation forecast | graded | Ponca -> Pruitt -> St. Joe; predictions now logged + graded on bump/magnitude/timing (propagation_eval.py). |
| Buffalo per-gauge rise predictions | graded | Per-gauge rise nowcasts recorded + graded vs the gauge's actual rise (buffalo_predictions_archive.py). |
| Buffalo flood_risk labels + propagation_alerts | not yet logged | Lower-priority companions in buffalo_output; still ungraded. |