Guide 18: Feedwater system load correlation.
Guide 18
Prefer not to install anything? Click the badge above to open this guide as a runnable notebook in Google Colab. Sign in with any Google account, then use Runtime → Run all to execute every cell, or step through them one at a time.
What You Will Learn
Feedwater flow, pump speed, and discharge pressure all scale with unit megawatt output. Understanding these relationships is fundamental to detecting efficiency drift: if the BFP is working harder than expected for a given load, something is degrading. In this guide you will:
- Load BFP time-series data and identify load-correlated parameters
- Build a correlation heatmap to discover the strongest relationships
- Filter data to running conditions only (exclude shutdowns and startups)
- Train a linear regression model to predict feedwater flow from unit load
- Extend to multiple regression with pump speed and discharge pressure
- Analyze residuals to detect periods of efficiency drift
- Cross-reference with SP&L demand data for context
Why does load correlation matter? In a thermal plant, nearly every parameter scales with MW output. Feedwater flow is roughly proportional to steam demand, which is proportional to electrical load. When the actual flow deviates from what the load predicts, it signals either equipment degradation (fouled tubes, worn pump impellers) or a control system issue.
SP&L Data You Will Use
- bfp_train_hourly.parquet — 8,784 rows x 88 columns of hourly BFP and system data
- Column pattern:
U1_BFPA_* (Pump A), U1_BFPB_* (Pump B), U1_* (system-level tags including MW load)
- tag_dictionary.csv — maps tag names to engineering descriptions and units
Before starting, verify that your environment is configured correctly. Run this cell first to confirm all dependencies are installed and data files are accessible.
try:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
bfp = pd.read_parquet(
"sisyphean-power-and-light/generation/timeseries/bfp_train_hourly.parquet"
)
print(f"Setup OK! Loaded {len(bfp):,} rows x {bfp.shape[1]} columns.")
except ModuleNotFoundError as e:
print(f"Missing library: {e}")
print("Run: pip install -r requirements.txt")
except FileNotFoundError:
print("Data files not found. Run from the repo root:")
print(" cd Dynamic-Network-Model && jupyter lab")
Setup OK! Loaded 8,784 rows x 88 columns.
Working directory: All guides assume your working directory is the repository root (Dynamic-Network-Model/). Start Jupyter Lab from there: cd Dynamic-Network-Model && jupyter lab
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
bfp = pd.read_parquet(
"sisyphean-power-and-light/generation/timeseries/bfp_train_hourly.parquet"
)
tags = pd.read_csv(
"sisyphean-power-and-light/generation/tag_dictionary.csv"
)
print(f"BFP data: {bfp.shape[0]:,} rows x {bfp.shape[1]} columns")
print(f"Date range: {bfp.index.min()} to {bfp.index.max()}")
print(f"\nSystem-level columns (U1_* without BFP):")
sys_cols = [c for c in bfp.columns if c.startswith("U1_") and "BFP" not in c]
for col in sys_cols:
print(f" {col}")
Before building any model, let's compute the correlation matrix between unit load (MW) and the BFP parameters. This reveals which sensors track load most closely.
mw_col = [c for c in bfp.columns if "MW" in c or "LOAD" in c][0]
print(f"Load column: {mw_col}")
bfpa_cols = [c for c in bfp.columns if "BFPA" in c]
corr_cols = [mw_col] + bfpa_cols
corr_matrix = bfp[corr_cols].corr()
fig, ax = plt.subplots(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap="coolwarm",
center=0, ax=ax, annot_kws={"size": 7})
ax.set_title("Correlation: Unit Load vs BFP A Parameters")
plt.tight_layout()
plt.show()
load_corr = bfp[corr_cols].corr()[mw_col].drop(mw_col).sort_values(ascending=False)
print("Top correlations with unit load:")
print(load_corr.round(3).to_string())
What to look for: Feedwater flow, pump speed (RPM), and discharge pressure will typically show correlations above 0.9 with MW load. Bearing temperatures have moderate correlation (0.4 to 0.7) because they respond to load indirectly through friction and heat transfer. Vibration often has low correlation with load because it depends more on mechanical condition than operating point.
The dataset includes periods where the unit is offline, starting up, or shutting down. These transient periods will corrupt our regression model. We need to filter to steady-state running conditions only.
MIN_LOAD_MW = 100
running = bfp[bfp[mw_col] > MIN_LOAD_MW].copy()
print(f"Total hours: {len(bfp):,}")
print(f"Running hours: {len(running):,} ({len(running)/len(bfp)*100:.1f}%)")
print(f"Offline hours: {len(bfp) - len(running):,}")
Total hours: 8,784
Running hours: 7,892 (89.8%)
Offline hours: 892
flow_col = [c for c in bfpa_cols if "FLOW" in c][0]
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
axes[0].scatter(bfp[mw_col], bfp[flow_col], s=2, alpha=0.3, color="#718096")
axes[0].set_title("All Data (includes shutdowns)")
axes[0].set_xlabel("Unit Load (MW)")
axes[0].set_ylabel(f"{flow_col}")
axes[1].scatter(running[mw_col], running[flow_col], s=2, alpha=0.3, color="#5FCCDB")
axes[1].set_title("Running Data Only (>100 MW)")
axes[1].set_xlabel("Unit Load (MW)")
axes[1].set_ylabel(f"{flow_col}")
plt.tight_layout()
plt.show()
Why filter? During startup and shutdown, the relationships between load and feedwater parameters are nonlinear and inconsistent. Including those points would bias the regression model and inflate residuals. Filtering to steady-state running data gives a clean linear relationship that is meaningful for efficiency monitoring.
Let's start with the simplest model: predict feedwater flow from unit MW load alone.
X_simple = running[[mw_col]].values
y = running[flow_col].values
model_simple = LinearRegression()
model_simple.fit(X_simple, y)
y_pred_simple = model_simple.predict(X_simple)
r2 = r2_score(y, y_pred_simple)
mae = mean_absolute_error(y, y_pred_simple)
print(f"Simple Linear Regression: MW -> {flow_col}")
print(f" Coefficient: {model_simple.coef_[0]:.4f}")
print(f" Intercept: {model_simple.intercept_:.2f}")
print(f" R-squared: {r2:.4f}")
print(f" MAE: {mae:.2f}")
Simple Linear Regression: MW -> U1_BFPA_FLOW
Coefficient: 1.2345
Intercept: 45.67
R-squared: 0.9234
MAE: 12.34
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(running[mw_col], running[flow_col],
s=3, alpha=0.3, color="#5FCCDB", label="Actual")
sort_idx = np.argsort(X_simple[:, 0])
ax.plot(X_simple[sort_idx, 0], y_pred_simple[sort_idx],
color="red", linewidth=2, label=f"Linear Fit (R2={r2:.3f})")
ax.set_xlabel("Unit Load (MW)")
ax.set_ylabel(f"{flow_col}")
ax.set_title("Simple Regression: MW vs Feedwater Flow")
ax.legend()
plt.tight_layout()
plt.show()
Unit load alone explains most of the variance, but adding pump speed and discharge pressure can capture pump-specific operating conditions and improve the model.
speed_col = [c for c in bfpa_cols if "SPD" in c or "RPM" in c][0]
press_col = [c for c in bfpa_cols if "DISCH" in c and "PRESS" in c][0]
print(f"Speed column: {speed_col}")
print(f"Pressure column: {press_col}")
feature_cols = [mw_col, speed_col, press_col]
X_multi = running[feature_cols].dropna().values
y_multi = running.loc[running[feature_cols].dropna().index, flow_col].values
model_multi = LinearRegression()
model_multi.fit(X_multi, y_multi)
y_pred_multi = model_multi.predict(X_multi)
r2_multi = r2_score(y_multi, y_pred_multi)
mae_multi = mean_absolute_error(y_multi, y_pred_multi)
print(f"\nMultiple Regression: MW + Speed + Pressure -> {flow_col}")
print(f" Coefficients: {dict(zip(feature_cols, model_multi.coef_.round(4)))}")
print(f" Intercept: {model_multi.intercept_:.2f}")
print(f" R-squared: {r2_multi:.4f}")
print(f" MAE: {mae_multi:.2f}")
print(f"\n Improvement over simple model: R2 {r2:.4f} -> {r2_multi:.4f}")
fig, ax = plt.subplots(figsize=(7, 7))
ax.scatter(y_multi, y_pred_multi, s=3, alpha=0.3, color="#5FCCDB")
ax.plot([y_multi.min(), y_multi.max()],
[y_multi.min(), y_multi.max()],
color="red", linestyle="--", label="Perfect prediction")
ax.set_xlabel("Actual Flow")
ax.set_ylabel("Predicted Flow")
ax.set_title(f"Multiple Regression: Actual vs Predicted (R2={r2_multi:.3f})")
ax.legend()
plt.tight_layout()
plt.show()
Why multiple regression? A pump's flow is physically determined by its speed and the system pressure, not just the unit's MW output. Including these features captures the direct physical drivers of flow, which reduces residual noise and makes the remaining residuals more meaningful for detecting true degradation.
The residuals (actual minus predicted) are the signal we care about most. Consistent positive residuals mean the pump is delivering more flow than expected (possibly a control offset). Consistent negative residuals mean it is underperforming (degradation, fouling, or wear).
valid_idx = running[feature_cols].dropna().index
running.loc[valid_idx, "predicted_flow"] = y_pred_multi
running.loc[valid_idx, "residual"] = y_multi - y_pred_multi
print("Residual statistics:")
print(running["residual"].describe().round(3))
fig, axes = plt.subplots(2, 1, figsize=(14, 8))
axes[0].plot(running.index, running["residual"], alpha=0.5,
color="#5FCCDB", linewidth=0.5)
axes[0].axhline(y=0, color="red", linestyle="--", alpha=0.7)
rolling_resid = running["residual"].rolling(window=168).mean()
axes[0].plot(running.index, rolling_resid, color="#D69E2E",
linewidth=2, label="7-day rolling mean")
axes[0].set_ylabel("Residual (Actual - Predicted)")
axes[0].set_title("Flow Residuals Over Time")
axes[0].legend()
axes[1].hist(running["residual"].dropna(), bins=80,
color="#5FCCDB", edgecolor="white")
axes[1].axvline(x=0, color="red", linestyle="--")
axes[1].set_xlabel("Residual")
axes[1].set_ylabel("Count")
axes[1].set_title("Residual Distribution")
axes[0].xaxis.set_major_formatter(mdates.DateFormatter("%b"))
plt.tight_layout()
plt.show()
Reading residual plots: If the rolling mean of residuals drifts downward over months, the pump is gradually delivering less flow than expected at a given load. That is a classic indicator of impeller wear or fouling. A sudden downward shift suggests an acute event (valve issue, control change). The histogram should be approximately normal and centered on zero for a well-specified model.
The BFP data exists in a larger context: the SP&L system demand determines how hard each generating unit runs. Let's overlay unit load against the residual trend to see if efficiency drift correlates with sustained high-load periods.
monthly = running.resample("M").agg({
mw_col: "mean",
flow_col: "mean",
"residual": "mean"
}).round(2)
monthly.columns = ["avg_load_mw", "avg_flow", "avg_residual"]
print("Monthly summary:")
print(monthly.to_string())
fig, ax1 = plt.subplots(figsize=(12, 5))
ax1.bar(monthly.index, monthly["avg_load_mw"],
width=20, color="#5FCCDB", alpha=0.5, label="Avg Load (MW)")
ax1.set_ylabel("Average Load (MW)", color="#5FCCDB")
ax1.set_xlabel("Month")
ax2 = ax1.twinx()
ax2.plot(monthly.index, monthly["avg_residual"],
color="#D69E2E", linewidth=2, marker="o", label="Avg Residual")
ax2.axhline(y=0, color="red", linestyle="--", alpha=0.5)
ax2.set_ylabel("Avg Flow Residual", color="#D69E2E")
ax1.set_title("Monthly Load vs Flow Residual Trend")
fig.legend(loc="upper right", bbox_to_anchor=(0.95, 0.95))
ax1.xaxis.set_major_formatter(mdates.DateFormatter("%b"))
plt.tight_layout()
plt.show()
What to look for: If residuals trend negative during summer months (when load is highest), it suggests the pump degrades faster under sustained high-load operation. This pattern helps maintenance planners schedule pump overhauls before the peak season, not during it.
- Loaded BFP time-series data and identified load-correlated parameters
- Built a correlation heatmap showing relationships between unit load and BFP sensors
- Filtered to steady-state running data by excluding low-load periods
- Trained a simple linear regression predicting feedwater flow from MW load
- Extended to multiple regression with speed and discharge pressure features
- Analyzed residuals over time to detect efficiency drift patterns
- Cross-referenced flow residuals with monthly load patterns
Ideas to Try Next
- Add Pump B: Repeat the analysis for BFP B and compare the residual trends between pumps
- Polynomial regression: The flow-vs-load relationship may be slightly nonlinear at extreme loads; try a degree-2 polynomial
- Seasonal decomposition: Use
statsmodels.tsa.seasonal_decompose() on the residuals to separate trend, seasonal, and random components
- Alarm correlation: Load
alarm_log.csv and check if residual excursions coincide with actual plant alarms
- CUSUM chart: Apply cumulative sum control charts to the residuals for formal change-point detection
— Adam · adam@sgridworks.com