What You Will Learn
A digital twin is a computational model that mirrors a physical asset's expected behavior. For a boiler feed pump (BFP), the OEM pump curve defines what head and flow the pump should produce at a given speed. By comparing actual performance against the pump curve in real time, you can quantify how much performance the pump has lost and anticipate when maintenance will be needed. In this guide you will:
- Load OEM pump curve data and plot head-capacity and efficiency curves
- Interpolate pump curves to compute expected performance at any operating point
- Compute performance deviation (actual vs. expected) for each hour
- Build a physics-informed regression model that respects pump affinity laws
- Monitor residuals over time to track efficiency decay
- Construct a composite Pump Health Index combining multiple indicators
- Correlate health index trends with maintenance events from the operator action log
SP&L Data You Will Use
- bfp_train_hourly.parquet — 8,784 rows x 88 columns of hourly BFP and system data
- pump_curves.csv — OEM head-capacity and efficiency curves at design speed
- design_parameters.json — BFP design-point specifications (rated speed, head, flow, efficiency)
- operator_actions.csv — timestamped maintenance and operator intervention records
- heat_balance.csv — thermodynamic design data for cross-referencing expected feedwater conditions
Before starting, verify that your environment is configured correctly. Run this cell first to confirm all dependencies are installed and data files are accessible.
try:
    import pandas as pd
    import numpy as np
    import json
    from scipy.interpolate import interp1d
    from sklearn.linear_model import LinearRegression

    bfp = pd.read_parquet(
        "sisyphean-power-and-light/generation/timeseries/bfp_train_hourly.parquet"
    )
    curves = pd.read_csv(
        "sisyphean-power-and-light/generation/reference/pump_curves.csv"
    )
    print(f"Setup OK! BFP: {len(bfp):,} rows, Pump curves: {len(curves)} points.")
except ModuleNotFoundError as e:
    print(f"Missing library: {e}")
    print("Run: pip install -r requirements.txt")
except FileNotFoundError:
    print("Data files not found. Run from the repo root:")
    print("  cd Dynamic-Network-Model && jupyter lab")
Setup OK! BFP: 8,784 rows, Pump curves: N points.
Working directory: All guides assume your working directory is the repository root (Dynamic-Network-Model/). Start Jupyter Lab from there: cd Dynamic-Network-Model && jupyter lab
import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from scipy.interpolate import interp1d
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error

bfp = pd.read_parquet(
    "sisyphean-power-and-light/generation/timeseries/bfp_train_hourly.parquet"
)
curves = pd.read_csv(
    "sisyphean-power-and-light/generation/reference/pump_curves.csv"
)
with open("sisyphean-power-and-light/generation/reference/design_parameters.json") as f:
    design = json.load(f)
ops = pd.read_csv(
    "sisyphean-power-and-light/generation/events/operator_actions.csv",
    parse_dates=["timestamp"]
)
print("Pump curve data:")
print(curves.head(10))
print("\nDesign parameters:")
for key, val in design.items():
    print(f"  {key}: {val}")
print(f"\nOperator actions: {len(ops):,} records")
The OEM pump curve defines the theoretical relationship between flow (capacity) and head (pressure rise) at design speed. The efficiency curve shows what percentage of input power converts to useful hydraulic work.
fig, ax1 = plt.subplots(figsize=(10, 6))
ax1.plot(curves["flow_gpm"], curves["head_ft"],
         color="#5FCCDB", linewidth=2.5, label="Head (ft)")
ax1.set_xlabel("Flow (GPM)", fontsize=12)
ax1.set_ylabel("Head (ft)", fontsize=12, color="#5FCCDB")
ax1.tick_params(axis="y", labelcolor="#5FCCDB")
ax2 = ax1.twinx()
ax2.plot(curves["flow_gpm"], curves["efficiency_pct"],
         color="#D69E2E", linewidth=2.5, linestyle="--", label="Efficiency (%)")
ax2.set_ylabel("Efficiency (%)", fontsize=12, color="#D69E2E")
ax2.tick_params(axis="y", labelcolor="#D69E2E")
if "bep_flow_gpm" in design:
    ax1.axvline(x=design["bep_flow_gpm"], color="gray",
                linestyle=":", alpha=0.7, label="BEP")
ax1.set_title("OEM Pump Curve: Head & Efficiency vs Flow", fontsize=13)
fig.legend(loc="upper right", bbox_to_anchor=(0.88, 0.88))
ax1.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
What is a pump curve? Every centrifugal pump has a characteristic curve published by the OEM (Original Equipment Manufacturer). The head-capacity curve shows that as flow increases, head decreases. The Best Efficiency Point (BEP) is where the pump converts the maximum percentage of input power to hydraulic work. Operating far from BEP causes extra vibration, heat, and wear.
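To make the BEP idea concrete, here is a minimal sketch with made-up curve samples (not the SP&L data) shaped like a typical OEM sheet: the BEP is simply the flow at which the sampled efficiency curve peaks.

```python
import numpy as np

# Made-up curve samples (not the SP&L data), shaped like a typical OEM sheet.
flow = np.array([0, 500, 1000, 1500, 2000, 2500])    # GPM
head = np.array([1200, 1180, 1120, 1020, 880, 700])  # ft: head falls as flow rises
eff = np.array([0.0, 45.0, 68.0, 82.0, 78.0, 60.0])  # percent: peaks, then drops

# The BEP is the flow at which efficiency peaks.
bep_idx = int(np.argmax(eff))
bep_flow, bep_eff = flow[bep_idx], eff[bep_idx]
print(f"BEP: {bep_flow} GPM at {bep_eff}% efficiency")  # -> BEP: 1500 GPM at 82.0% efficiency
```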
Using the pump affinity laws, we can scale the OEM curve to any speed and compute the expected head and efficiency at the pump's actual operating point each hour.
head_curve = interp1d(
    curves["flow_gpm"], curves["head_ft"],
    kind="cubic", fill_value="extrapolate"
)
eff_curve = interp1d(
    curves["flow_gpm"], curves["efficiency_pct"],
    kind="cubic", fill_value="extrapolate"
)
design_speed = design.get("rated_speed_rpm", 5500)
print(f"Design speed: {design_speed} RPM")
mw_col = [c for c in bfp.columns if "MW" in c or "LOAD" in c][0]
bfpa_cols = [c for c in bfp.columns if "BFPA" in c]
flow_col = [c for c in bfpa_cols if "FLOW" in c][0]
speed_col = [c for c in bfpa_cols if "SPD" in c or "RPM" in c][0]
head_col = [c for c in bfpa_cols if "HEAD" in c or ("DISCH" in c and "PRESS" in c)][0]
running = bfp[bfp[mw_col] > 100].copy()
speed_ratio = running[speed_col] / design_speed
running["flow_at_design_speed"] = running[flow_col] / speed_ratio
running["expected_head_design"] = running["flow_at_design_speed"].apply(
    lambda q: float(head_curve(np.clip(q, curves["flow_gpm"].min(),
                                       curves["flow_gpm"].max())))
)
running["expected_head"] = running["expected_head_design"] * speed_ratio**2
running["expected_eff"] = running["flow_at_design_speed"].apply(
    lambda q: float(eff_curve(np.clip(q, curves["flow_gpm"].min(),
                                      curves["flow_gpm"].max())))
)
print("Expected performance computed for all running hours.")
print(running[[flow_col, head_col, "expected_head", "expected_eff"]].describe().round(2))
Pump affinity laws: These fundamental relationships govern centrifugal pump behavior. Flow scales linearly with speed (Q ~ N). Head scales with the square of speed (H ~ N^2). Power scales with the cube of speed (P ~ N^3). By scaling actual measurements back to design speed, we can compare any operating point against the single OEM curve.
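The affinity laws are easy to sanity-check numerically. The sketch below uses hypothetical design-point numbers (not values from design_parameters.json) to show how flow, head, and power scale when the pump slows to 90% speed:

```python
# Hypothetical design point (not from design_parameters.json).
q_d, h_d, p_d = 2000.0, 1000.0, 600.0  # GPM, ft, hp at design speed

n = 0.90         # speed ratio N / N_design
q = q_d * n      # Q ~ N:   flow scales linearly with speed
h = h_d * n**2   # H ~ N^2: head scales with speed squared
p = p_d * n**3   # P ~ N^3: power scales with speed cubed
print(f"At 90% speed: Q = {q:.0f} GPM, H = {h:.1f} ft, P = {p:.1f} hp")

# Inverting the flow law recovers the design-speed flow -- exactly what the
# guide's flow_at_design_speed = flow / speed_ratio does.
assert abs(q / n - q_d) < 1e-9
```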
running["head_deviation"] = running[head_col] - running["expected_head"]
running["head_dev_pct"] = (running["head_deviation"] / running["expected_head"]) * 100
print("Head deviation statistics:")
print(running[["head_deviation", "head_dev_pct"]].describe().round(2))
fig, ax = plt.subplots(figsize=(10, 7))
ax.plot(curves["flow_gpm"], curves["head_ft"],
        color="#2D6A7A", linewidth=3, label="OEM Curve (design speed)", zorder=3)
scatter = ax.scatter(
    running["flow_at_design_speed"],
    running[head_col] / speed_ratio**2,
    c=running.index.to_julian_date(),
    cmap="viridis", s=3, alpha=0.4, label="Actual (scaled to design speed)"
)
cbar = plt.colorbar(scatter, ax=ax, label="Time")
cbar.set_ticks([])
cbar.set_label("Time (Jan -> Dec)")
ax.set_xlabel("Flow (GPM) at Design Speed", fontsize=12)
ax.set_ylabel("Head (ft) at Design Speed", fontsize=12)
ax.set_title("Actual Operating Points vs OEM Pump Curve")
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
fig, ax = plt.subplots(figsize=(14, 5))
ax.plot(running.index, running["head_dev_pct"],
        alpha=0.3, color="#718096", linewidth=0.5)
rolling_dev = running["head_dev_pct"].rolling(168).mean()
ax.plot(running.index, rolling_dev, color="#D69E2E",
        linewidth=2, label="7-day rolling mean")
ax.axhline(y=0, color="red", linestyle="--", alpha=0.7)
ax.set_ylabel("Head Deviation (%)")
ax.set_title("BFP A Head Deviation from OEM Curve Over Time")
ax.legend()
ax.xaxis.set_major_formatter(mdates.DateFormatter("%b"))
plt.tight_layout()
plt.show()
What deviation tells you: A healthy pump operates on or near the OEM curve. If actual points consistently fall below the curve (negative head deviation), the pump is producing less head than it should at that flow and speed. This typically indicates impeller wear, internal recirculation, or increased internal clearances. A 3-5% deviation is significant for a BFP.
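One way to act on the 3-5% guidance is to alert when the rolling-mean deviation drops past a threshold. A minimal sketch with a synthetic deviation series standing in for running["head_dev_pct"]:

```python
import numpy as np
import pandas as pd

# Synthetic hourly head-deviation series drifting from 0% toward -6%
# (stand-in for running["head_dev_pct"]).
idx = pd.date_range("2024-01-01", periods=24 * 60, freq="h")
dev = pd.Series(np.linspace(0.0, -6.0, len(idx)), index=idx)

# Alert when the 7-day (168-hour) rolling mean crosses -3%, mirroring the
# rolling window used in the time-series plot above.
rolling = dev.rolling(168).mean()
breach = rolling[rolling < -3.0]
if len(breach) > 0:
    print(f"First sustained breach of -3%: {breach.index[0].date()}")
```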
Instead of a purely data-driven regression, we build a physics-informed model that uses the affinity-law-scaled features. The model learns the residual deviation from the OEM curve as a function of operating conditions and time.
running["speed_ratio"] = running[speed_col] / design_speed
running["speed_ratio_sq"] = running["speed_ratio"]**2
running["flow_pct_bep"] = running[flow_col] / design.get("bep_flow_gpm", 1000)
running["days_since_start"] = (running.index - running.index[0]).total_seconds() / 86400
phys_features = ["speed_ratio", "speed_ratio_sq", "flow_pct_bep",
                 "days_since_start"]
X_phys = running[phys_features].dropna()
y_phys = running.loc[X_phys.index, "head_dev_pct"]
model_phys = LinearRegression()
model_phys.fit(X_phys, y_phys)
y_pred_phys = model_phys.predict(X_phys)
print("Physics-informed model coefficients:")
for feat, coef in zip(phys_features, model_phys.coef_):
    print(f"  {feat:25s}: {coef:+.4f}")
print(f"  {'intercept':25s}: {model_phys.intercept_:+.4f}")
print(f"\nR-squared: {r2_score(y_phys, y_pred_phys):.4f}")
deg_rate = model_phys.coef_[phys_features.index("days_since_start")]
print(f"\nEstimated degradation rate: {deg_rate:.4f}% head loss per day")
print(f"  {deg_rate * 365:.2f}% head loss per year")
Why "physics-informed"? A pure ML model treats all features as black-box inputs. By structuring our features around the pump affinity laws (speed ratio, speed ratio squared, flow relative to BEP), we embed physical knowledge into the model. The regression then only needs to learn the deviation from known physics, which requires less data and produces more interpretable coefficients. The days_since_start coefficient directly quantifies the degradation rate.
running.loc[X_phys.index, "phys_residual"] = y_phys - y_pred_phys
fig, axes = plt.subplots(2, 1, figsize=(14, 8), sharex=True)
axes[0].plot(running.index, running["head_dev_pct"],
             alpha=0.3, color="#718096", linewidth=0.5, label="Actual deviation")
axes[0].plot(X_phys.index, y_pred_phys,
             color="#D69E2E", linewidth=1.5, label="Physics model fit")
axes[0].set_ylabel("Head Deviation (%)")
axes[0].set_title("Physics-Informed Model: Actual vs Predicted Degradation")
axes[0].legend()
resid = running["phys_residual"].dropna()
resid_std = resid.std()
axes[1].plot(resid.index, resid, alpha=0.5, color="#5FCCDB", linewidth=0.5)
axes[1].axhline(y=0, color="gray", linestyle="-")
axes[1].axhline(y=3 * resid_std, color="red", linestyle="--",
                label=f"+3 sigma ({3*resid_std:.2f})")
axes[1].axhline(y=-3 * resid_std, color="red", linestyle="--",
                label=f"-3 sigma ({-3*resid_std:.2f})")
axes[1].set_ylabel("Model Residual")
axes[1].set_title("Physics Model Residuals (Non-Degradation Anomalies)")
axes[1].legend()
axes[1].xaxis.set_major_formatter(mdates.DateFormatter("%b"))
plt.tight_layout()
plt.show()
Separating degradation from anomalies: The physics model captures the expected degradation trend. Its residuals represent everything else: control system changes, unusual operating conditions, measurement noise, and genuine acute faults. Points beyond 3 sigma in the residual plot warrant investigation because the physics model cannot explain them through normal degradation.
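Flagging those points is a one-liner once the residual series exists. A sketch with synthetic residuals, where three injected spikes stand in for acute faults:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic residual series: Gaussian noise plus three injected spikes that
# stand in for acute faults the physics model cannot explain.
resid = pd.Series(rng.normal(0.0, 1.0, 1000))
resid.iloc[[100, 500, 900]] = [6.0, -7.0, 8.0]

# Anything beyond 3 sigma is an hour worth investigating.
sigma = resid.std()
anomalies = resid[resid.abs() > 3 * sigma]
print(f"{len(anomalies)} points beyond 3 sigma")
```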
A single composite health index combines multiple indicators into one number that operators and planners can track. We will combine head deviation, an efficiency score, bearing temperature trends, and vibration levels into a 0-100 score.
running["head_score"] = np.clip(
    100 + running["head_dev_pct"] * 20,  # a -5% head deviation maps to a score of 0
    0, 100
)
# Note: eff_score uses the expected efficiency at the operating point (closeness
# to BEP) as a proxy. If measured pump efficiency is available, substitute an
# actual-vs-expected efficiency deviation here.
running["eff_score"] = np.clip(running["expected_eff"], 0, 100)
bear_cols = [c for c in running.columns
             if "BFPA" in c and "BEAR" in c and "TEMP" in c]
max_bear_temp = running[bear_cols].max(axis=1)
running["temp_score"] = np.clip(
    100 - (max_bear_temp - 140) * 2.5,
    0, 100
)
vib_cols = [c for c in running.columns
            if "BFPA" in c and "VIB" in c
            and "roll" not in c and "delta" not in c]
max_vib = running[vib_cols].max(axis=1)
running["vib_score"] = np.clip(
    100 - max_vib * 10,
    0, 100
)
running["health_index"] = (
    0.35 * running["head_score"] +
    0.25 * running["eff_score"] +
    0.20 * running["temp_score"] +
    0.20 * running["vib_score"]
)
print("Pump Health Index statistics:")
print(running["health_index"].describe().round(1))
fig, axes = plt.subplots(5, 1, figsize=(14, 16), sharex=True)
components = [
    ("head_score", "Head Score", "#5FCCDB"),
    ("eff_score", "Efficiency Score", "#2D6A7A"),
    ("temp_score", "Bearing Temp Score", "#D69E2E"),
    ("vib_score", "Vibration Score", "#718096"),
    ("health_index", "Composite Health Index", "#E53E3E"),
]
for i, (col, title, color) in enumerate(components):
    daily = running[col].resample("D").mean()
    axes[i].fill_between(daily.index, daily, alpha=0.4, color=color)
    axes[i].plot(daily.index, daily, color=color, linewidth=1.5)
    axes[i].set_ylabel(title, fontsize=9)
    axes[i].set_ylim(0, 105)
    axes[i].axhline(y=70, color="orange", linestyle="--", alpha=0.5)
    axes[i].axhline(y=40, color="red", linestyle="--", alpha=0.5)
axes[0].set_title("BFP A Digital Twin: Health Index Components",
                  fontsize=14, fontweight="bold")
axes[4].xaxis.set_major_formatter(mdates.DateFormatter("%b"))
plt.tight_layout()
plt.show()
Health Index interpretation: Above 70 (green zone) = healthy operation. Between 40 and 70 (yellow zone) = degraded, plan maintenance. Below 40 (red zone) = critical, immediate action required. The weights (35% head, 25% efficiency, 20% temperature, 20% vibration) reflect the relative importance of each indicator for BFP health. Adjust these weights based on your plant's experience.
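The zone boundaries above translate directly into a small helper; health_zone here is an illustrative name, not part of the guide's pipeline:

```python
def health_zone(hi: float) -> str:
    """Map a 0-100 health index to the action zones described above."""
    if hi >= 70:
        return "green: healthy"
    if hi >= 40:
        return "yellow: plan maintenance"
    return "red: immediate action"

for score in (92.5, 55.0, 31.0):
    print(f"HI {score:5.1f} -> {health_zone(score)}")
```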
The final validation is correlating the health index trend with actual maintenance events. If the health index drops before a maintenance outage and recovers afterward, the digital twin is capturing real degradation.
bfp_maint = ops[ops["description"].str.contains("BFP|pump|bearing|vibr",
                                                case=False, na=False)]
print(f"BFP maintenance events: {len(bfp_maint)}")
if len(bfp_maint) > 0:
    print(bfp_maint[["timestamp", "description"]].to_string(index=False))
fig, ax = plt.subplots(figsize=(14, 6))
daily_hi = running["health_index"].resample("D").mean()
ax.fill_between(daily_hi.index, daily_hi, alpha=0.3, color="#5FCCDB")
ax.plot(daily_hi.index, daily_hi, color="#5FCCDB", linewidth=2,
        label="Health Index (daily avg)")
ax.axhspan(70, 105, alpha=0.05, color="green")
ax.axhspan(40, 70, alpha=0.05, color="orange")
ax.axhspan(0, 40, alpha=0.05, color="red")
for _, event in bfp_maint.iterrows():
    ax.axvline(x=event["timestamp"], color="purple",
               linestyle="-", alpha=0.6, linewidth=1.5)
    ax.annotate(event["description"][:30],
                xy=(event["timestamp"], 95),
                fontsize=7, rotation=45, ha="left", color="purple")
ax.set_ylabel("Health Index", fontsize=12)
ax.set_ylim(0, 105)
ax.set_title("BFP A Digital Twin: Health Index vs Maintenance Events",
             fontsize=13, fontweight="bold")
ax.legend()
ax.xaxis.set_major_formatter(mdates.DateFormatter("%b"))
plt.tight_layout()
plt.show()
WINDOW_DAYS = 7
for _, event in bfp_maint.iterrows():
    t = event["timestamp"]
    before = running[
        (running.index >= t - pd.Timedelta(days=WINDOW_DAYS)) &
        (running.index < t)
    ]["health_index"].mean()
    after = running[
        (running.index > t) &
        (running.index <= t + pd.Timedelta(days=WINDOW_DAYS))
    ]["health_index"].mean()
    print(f"{t.strftime('%Y-%m-%d')}: {event['description'][:40]}")
    print(f"  Before: {before:.1f}  After: {after:.1f}  Change: {after - before:+.1f}")
What You Learned
- Loaded OEM pump curves and design parameters for the BFP
- Plotted head-capacity and efficiency curves with the Best Efficiency Point
- Used pump affinity laws to compute expected performance at actual operating speed
- Quantified head and efficiency deviation from the OEM curve over time
- Built a physics-informed regression model that estimates the daily degradation rate
- Monitored model residuals with 3-sigma control limits for acute anomalies
- Constructed a composite Pump Health Index from four component scores
- Correlated health index trends with actual maintenance events
Ideas to Try Next
- BFP B digital twin: Repeat the entire pipeline for Pump B and compare degradation rates between pumps
- Predictive maintenance scheduling: Extrapolate the degradation trend to estimate when the health index will cross the 40-point threshold
- Nonlinear degradation model: Try polynomial or exponential degradation curves instead of linear
- Heat balance integration: Use heat_balance.csv to compute feedwater enthalpy rise and compare actual vs. design heat absorption
- Real-time dashboard: Build an interactive Plotly dashboard that updates as new data arrives
- Multi-pump fleet management: Extend to all rotating equipment in the plant and rank by health index for prioritized maintenance
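As a starting point for the predictive-maintenance idea above, a linear trend fitted to daily health-index values can be solved for the day it crosses the 40-point critical threshold. Synthetic data stands in for the real daily_hi series here:

```python
import numpy as np

# Synthetic daily health index: starts near 95 and loses ~0.12 points/day,
# standing in for the real daily health-index series.
days = np.arange(200)
rng = np.random.default_rng(0)
health = 95.0 - 0.12 * days + rng.normal(0.0, 1.0, len(days))

# Fit a linear trend and solve for the day it crosses the critical threshold.
slope, intercept = np.polyfit(days, health, 1)
if slope < 0:
    days_to_critical = (40.0 - intercept) / slope
    print(f"Trend {slope:.3f} HI/day; crosses 40 around day {days_to_critical:.0f}")
```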
Key Terms Glossary
- Digital twin — a computational model that mirrors a physical asset's expected behavior, updated with real-time data to track deviations
- Pump curve — the OEM-published relationship between flow, head, and efficiency for a centrifugal pump at rated speed
- Best Efficiency Point (BEP) — the flow rate where the pump achieves maximum hydraulic efficiency; the ideal operating point
- Pump affinity laws — Q ~ N (flow proportional to speed), H ~ N^2 (head proportional to speed squared), P ~ N^3 (power proportional to speed cubed)
- Physics-informed ML — embedding known physical equations or constraints into a machine learning model to improve accuracy and interpretability
- Head deviation — the difference between actual pump head and the head predicted by the OEM curve at the same flow and speed
- Health index — a composite score (0-100) combining multiple condition indicators into a single actionable metric
- Degradation rate — the rate at which equipment performance declines over time, typically expressed as percent loss per day or year
- Interpolation — estimating values between known data points; used here to read the OEM curve at any flow rate
Congratulations!
You have completed the full BFP analysis series. From basic health monitoring through load correlation, fault diagnosis, and digital twin modeling, you now have the tools to build comprehensive rotating equipment analytics for any generating unit.
Explore All ML Playground Guides →