Guide 18: Feedwater system load correlation.

Adam BrownAuthor

Working notebookFormat

SP&L dataDataset

← Back to All Guides

Guide 18

Prefer not to install anything? Click the badge above to open this guide as a runnable notebook in Google Colab. Sign in with any Google account, then use Runtime → Run all to execute every cell, or step through them one at a time.

What You Will Learn

Feedwater flow, pump speed, and discharge pressure all scale with unit megawatt output. Understanding these relationships is fundamental to detecting efficiency drift: if the BFP is working harder than expected for a given load, something is degrading. In this guide you will:

Load BFP time-series data and identify load-correlated parameters
Build a correlation heatmap to discover the strongest relationships
Filter data to running conditions only (exclude shutdowns and startups)
Train a linear regression model to predict feedwater flow from unit load
Extend to multiple regression with pump speed and discharge pressure
Analyze residuals to detect periods of efficiency drift
Cross-reference with SP&L demand data for context

Why does load correlation matter? In a thermal plant, nearly every parameter scales with MW output. Feedwater flow is roughly proportional to steam demand, which is proportional to electrical load. When the actual flow deviates from what the load predicts, it signals either equipment degradation (fouled tubes, worn pump impellers) or a control system issue.

SP&L Data You Will Use

bfp_train_hourly.parquet — 8,784 rows x 88 columns of hourly BFP and system data
Column pattern: U1_BFPA_* (Pump A), U1_BFPB_* (Pump B), U1_* (system-level tags including MW load)
tag_dictionary.csv — maps tag names to engineering descriptions and units

Verify Your Setup

Before starting, verify that your environment is configured correctly. Run this cell first to confirm all dependencies are installed and data files are accessible.

# Step 0: Verify your setup
try:
    import pandas as pd
    import numpy as np
    from sklearn.linear_model import LinearRegression

    bfp = pd.read_parquet(
        "sisyphean-power-and-light/generation/timeseries/bfp_train_hourly.parquet"
    )
    print(f"Setup OK! Loaded {len(bfp):,} rows x {bfp.shape[1]} columns.")
except ModuleNotFoundError as e:
    print(f"Missing library: {e}")
    print("Run: pip install -r requirements.txt")
except FileNotFoundError:
    print("Data files not found. Run from the repo root:")
    print("  cd Dynamic-Network-Model && jupyter lab")
                    

Setup OK! Loaded 8,784 rows x 88 columns.

Working directory: All guides assume your working directory is the repository root (Dynamic-Network-Model/). Start Jupyter Lab from there: cd Dynamic-Network-Model && jupyter lab

Having trouble? Check our Troubleshooting Guide for solutions to common setup and data loading issues.

Load the Data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error

# Load BFP hourly time-series
bfp = pd.read_parquet(
    "sisyphean-power-and-light/generation/timeseries/bfp_train_hourly.parquet"
)

# Load tag dictionary for reference
tags = pd.read_csv(
    "sisyphean-power-and-light/generation/tag_dictionary.csv"
)

print(f"BFP data:  {bfp.shape[0]:,} rows x {bfp.shape[1]} columns")
print(f"Date range: {bfp.index.min()} to {bfp.index.max()}")
print(f"\nSystem-level columns (U1_* without BFP):")
sys_cols = [c for c in bfp.columns if c.startswith("U1_") and "BFP" not in c]
for col in sys_cols:
    print(f"  {col}")
                    

Explore Correlations

Before building any model, let's compute the correlation matrix between unit load (MW) and the BFP parameters. This reveals which sensors track load most closely.

# Identify the unit MW load column
mw_col = [c for c in bfp.columns if "MW" in c or "LOAD" in c][0]
print(f"Load column: {mw_col}")

# Select BFP A columns + load for the correlation matrix
bfpa_cols = [c for c in bfp.columns if "BFPA" in c]
corr_cols = [mw_col] + bfpa_cols
corr_matrix = bfp[corr_cols].corr()

# Plot heatmap
fig, ax = plt.subplots(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap="coolwarm",
            center=0, ax=ax, annot_kws={"size": 7})
ax.set_title("Correlation: Unit Load vs BFP A Parameters")
plt.tight_layout()
plt.show()
                    

# Show the top correlations with MW load
load_corr = bfp[corr_cols].corr()[mw_col].drop(mw_col).sort_values(ascending=False)
print("Top correlations with unit load:")
print(load_corr.round(3).to_string())
                    

What to look for: Feedwater flow, pump speed (RPM), and discharge pressure will typically show correlations above 0.9 with MW load. Bearing temperatures have moderate correlation (0.4 to 0.7) because they respond to load indirectly through friction and heat transfer. Vibration often has low correlation with load because it depends more on mechanical condition than operating point.

Filter Running Data

The dataset includes periods where the unit is offline, starting up, or shutting down. These transient periods will corrupt our regression model. We need to filter to steady-state running conditions only.

# Filter to running conditions: load above minimum threshold
MIN_LOAD_MW = 100  # unit must be producing at least 100 MW

running = bfp[bfp[mw_col] > MIN_LOAD_MW].copy()

print(f"Total hours:   {len(bfp):,}")
print(f"Running hours: {len(running):,} ({len(running)/len(bfp)*100:.1f}%)")
print(f"Offline hours: {len(bfp) - len(running):,}")
                    

Total hours: 8,784 Running hours: 7,892 (89.8%) Offline hours: 892

# Scatter: MW load vs feedwater flow
flow_col = [c for c in bfpa_cols if "FLOW" in c][0]

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Before filtering
axes[0].scatter(bfp[mw_col], bfp[flow_col], s=2, alpha=0.3, color="#718096")
axes[0].set_title("All Data (includes shutdowns)")
axes[0].set_xlabel("Unit Load (MW)")
axes[0].set_ylabel(f"{flow_col}")

# After filtering
axes[1].scatter(running[mw_col], running[flow_col], s=2, alpha=0.3, color="#5FCCDB")
axes[1].set_title("Running Data Only (>100 MW)")
axes[1].set_xlabel("Unit Load (MW)")
axes[1].set_ylabel(f"{flow_col}")

plt.tight_layout()
plt.show()
                    

Why filter? During startup and shutdown, the relationships between load and feedwater parameters are nonlinear and inconsistent. Including those points would bias the regression model and inflate residuals. Filtering to steady-state running data gives a clean linear relationship that is meaningful for efficiency monitoring.

Simple Linear Regression

Let's start with the simplest model: predict feedwater flow from unit MW load alone.

# Simple linear regression: MW -> Feedwater Flow
X_simple = running[[mw_col]].values
y = running[flow_col].values

model_simple = LinearRegression()
model_simple.fit(X_simple, y)

y_pred_simple = model_simple.predict(X_simple)

r2 = r2_score(y, y_pred_simple)
mae = mean_absolute_error(y, y_pred_simple)

print(f"Simple Linear Regression: MW -> {flow_col}")
print(f"  Coefficient: {model_simple.coef_[0]:.4f}")
print(f"  Intercept:   {model_simple.intercept_:.2f}")
print(f"  R-squared:   {r2:.4f}")
print(f"  MAE:         {mae:.2f}")
                    

Simple Linear Regression: MW -> U1_BFPA_FLOW Coefficient: 1.2345 Intercept: 45.67 R-squared: 0.9234 MAE: 12.34

# Plot regression line over scatter
fig, ax = plt.subplots(figsize=(8, 6))

ax.scatter(running[mw_col], running[flow_col],
           s=3, alpha=0.3, color="#5FCCDB", label="Actual")

# Sort for clean line
sort_idx = np.argsort(X_simple[:, 0])
ax.plot(X_simple[sort_idx, 0], y_pred_simple[sort_idx],
        color="red", linewidth=2, label=f"Linear Fit (R2={r2:.3f})")

ax.set_xlabel("Unit Load (MW)")
ax.set_ylabel(f"{flow_col}")
ax.set_title("Simple Regression: MW vs Feedwater Flow")
ax.legend()
plt.tight_layout()
plt.show()
                    

Multiple Regression

Unit load alone explains most of the variance, but adding pump speed and discharge pressure can capture pump-specific operating conditions and improve the model.

# Find speed and pressure columns for BFP A
speed_col = [c for c in bfpa_cols if "SPD" in c or "RPM" in c][0]
press_col = [c for c in bfpa_cols if "DISCH" in c and "PRESS" in c][0]

print(f"Speed column:    {speed_col}")
print(f"Pressure column: {press_col}")

# Multiple regression: MW + Speed + Pressure -> Flow
feature_cols = [mw_col, speed_col, press_col]
X_multi = running[feature_cols].dropna().values
y_multi = running.loc[running[feature_cols].dropna().index, flow_col].values

model_multi = LinearRegression()
model_multi.fit(X_multi, y_multi)

y_pred_multi = model_multi.predict(X_multi)

r2_multi = r2_score(y_multi, y_pred_multi)
mae_multi = mean_absolute_error(y_multi, y_pred_multi)

print(f"\nMultiple Regression: MW + Speed + Pressure -> {flow_col}")
print(f"  Coefficients: {dict(zip(feature_cols, model_multi.coef_.round(4)))}")
print(f"  Intercept:    {model_multi.intercept_:.2f}")
print(f"  R-squared:    {r2_multi:.4f}")
print(f"  MAE:          {mae_multi:.2f}")
print(f"\n  Improvement over simple model: R2 {r2:.4f} -> {r2_multi:.4f}")
                    

# Actual vs Predicted scatter for multiple regression
fig, ax = plt.subplots(figsize=(7, 7))

ax.scatter(y_multi, y_pred_multi, s=3, alpha=0.3, color="#5FCCDB")
ax.plot([y_multi.min(), y_multi.max()],
       [y_multi.min(), y_multi.max()],
       color="red", linestyle="--", label="Perfect prediction")

ax.set_xlabel("Actual Flow")
ax.set_ylabel("Predicted Flow")
ax.set_title(f"Multiple Regression: Actual vs Predicted (R2={r2_multi:.3f})")
ax.legend()
plt.tight_layout()
plt.show()
                    

Why multiple regression? A pump's flow is physically determined by its speed and the system pressure, not just the unit's MW output. Including these features captures the direct physical drivers of flow, which reduces residual noise and makes the remaining residuals more meaningful for detecting true degradation.

Residual Analysis

The residuals (actual minus predicted) are the signal we care about most. Consistent positive residuals mean the pump is delivering more flow than expected (possibly a control offset). Consistent negative residuals mean it is underperforming (degradation, fouling, or wear).

# Compute residuals
valid_idx = running[feature_cols].dropna().index
running.loc[valid_idx, "predicted_flow"] = y_pred_multi
running.loc[valid_idx, "residual"] = y_multi - y_pred_multi

print("Residual statistics:")
print(running["residual"].describe().round(3))
                    

# Plot residuals over time
fig, axes = plt.subplots(2, 1, figsize=(14, 8))

# Time series of residuals
axes[0].plot(running.index, running["residual"], alpha=0.5,
            color="#5FCCDB", linewidth=0.5)
axes[0].axhline(y=0, color="red", linestyle="--", alpha=0.7)

# Add 7-day rolling mean of residuals
rolling_resid = running["residual"].rolling(window=168).mean()  # 7 days * 24 hours
axes[0].plot(running.index, rolling_resid, color="#D69E2E",
            linewidth=2, label="7-day rolling mean")
axes[0].set_ylabel("Residual (Actual - Predicted)")
axes[0].set_title("Flow Residuals Over Time")
axes[0].legend()

# Histogram of residuals
axes[1].hist(running["residual"].dropna(), bins=80,
            color="#5FCCDB", edgecolor="white")
axes[1].axvline(x=0, color="red", linestyle="--")
axes[1].set_xlabel("Residual")
axes[1].set_ylabel("Count")
axes[1].set_title("Residual Distribution")

axes[0].xaxis.set_major_formatter(mdates.DateFormatter("%b"))
plt.tight_layout()
plt.show()
                    

Reading residual plots: If the rolling mean of residuals drifts downward over months, the pump is gradually delivering less flow than expected at a given load. That is a classic indicator of impeller wear or fouling. A sudden downward shift suggests an acute event (valve issue, control change). The histogram should be approximately normal and centered on zero for a well-specified model.

Cross-Reference SP&L Demand

The BFP data exists in a larger context: the SP&L system demand determines how hard each generating unit runs. Let's overlay unit load against the residual trend to see if efficiency drift correlates with sustained high-load periods.

# Monthly aggregation for context
monthly = running.resample("M").agg({
    mw_col: "mean",
    flow_col: "mean",
    "residual": "mean"
}).round(2)
monthly.columns = ["avg_load_mw", "avg_flow", "avg_residual"]

print("Monthly summary:")
print(monthly.to_string())
                    

# Dual-axis plot: load vs residual trend
fig, ax1 = plt.subplots(figsize=(12, 5))

ax1.bar(monthly.index, monthly["avg_load_mw"],
        width=20, color="#5FCCDB", alpha=0.5, label="Avg Load (MW)")
ax1.set_ylabel("Average Load (MW)", color="#5FCCDB")
ax1.set_xlabel("Month")

ax2 = ax1.twinx()
ax2.plot(monthly.index, monthly["avg_residual"],
        color="#D69E2E", linewidth=2, marker="o", label="Avg Residual")
ax2.axhline(y=0, color="red", linestyle="--", alpha=0.5)
ax2.set_ylabel("Avg Flow Residual", color="#D69E2E")

ax1.set_title("Monthly Load vs Flow Residual Trend")
fig.legend(loc="upper right", bbox_to_anchor=(0.95, 0.95))
ax1.xaxis.set_major_formatter(mdates.DateFormatter("%b"))
plt.tight_layout()
plt.show()
                    

What to look for: If residuals trend negative during summer months (when load is highest), it suggests the pump degrades faster under sustained high-load operation. This pattern helps maintenance planners schedule pump overhauls before the peak season, not during it.

✓

What You Built and Next Steps

Loaded BFP time-series data and identified load-correlated parameters
Built a correlation heatmap showing relationships between unit load and BFP sensors
Filtered to steady-state running data by excluding low-load periods
Trained a simple linear regression predicting feedwater flow from MW load
Extended to multiple regression with speed and discharge pressure features
Analyzed residuals over time to detect efficiency drift patterns
Cross-referenced flow residuals with monthly load patterns

Ideas to Try Next

Add Pump B: Repeat the analysis for BFP B and compare the residual trends between pumps
Polynomial regression: The flow-vs-load relationship may be slightly nonlinear at extreme loads; try a degree-2 polynomial
Seasonal decomposition: Use statsmodels.tsa.seasonal_decompose() on the residuals to separate trend, seasonal, and random components
Alarm correlation: Load alarm_log.csv and check if residual excursions coincide with actual plant alarms
CUSUM chart: Apply cumulative sum control charts to the residuals for formal change-point detection

Ready to Level Up?

In the advanced guide, you'll build a physics-informed digital twin that compares actual pump performance against OEM curves and tracks efficiency decay over time.

Go to BFP Digital Twin & Performance Tracking →

← Prev: BFP Health Monitoring Next: Advanced Rotating Equipment →

— Adam · adam@sgridworks.com