← Back to All Guides
Guide 18

Feedwater System Load Correlation

Prefer not to install anything? Click the badge above to open this guide as a runnable notebook in Google Colab. Sign in with any Google account, then use Runtime → Run all to execute every cell, or step through them one at a time.

What You Will Learn

Feedwater flow, pump speed, and discharge pressure all scale with unit megawatt output. Understanding these relationships is fundamental to detecting efficiency drift: if the BFP is working harder than expected for a given load, something is degrading. In this guide you will:

  • Load BFP time-series data and identify load-correlated parameters
  • Build a correlation heatmap to discover the strongest relationships
  • Filter data to running conditions only (exclude shutdowns and startups)
  • Train a linear regression model to predict feedwater flow from unit load
  • Extend to multiple regression with pump speed and discharge pressure
  • Analyze residuals to detect periods of efficiency drift
  • Cross-reference with SP&L demand data for context

Why does load correlation matter? In a thermal plant, nearly every parameter scales with MW output. Feedwater flow is roughly proportional to steam demand, which is proportional to electrical load. When the actual flow deviates from what the load predicts, it signals either equipment degradation (fouled tubes, worn pump impellers) or a control system issue.

SP&L Data You Will Use

  • bfp_train_hourly.parquet — 8,784 rows x 88 columns of hourly BFP and system data
  • Column pattern: U1_BFPA_* (Pump A), U1_BFPB_* (Pump B), U1_* (system-level tags including MW load)
  • tag_dictionary.csv — maps tag names to engineering descriptions and units
0

Verify Your Setup

Before starting, verify that your environment is configured correctly. Run this cell first to confirm all dependencies are installed and data files are accessible.

# Step 0: Verify your setup try: import pandas as pd import numpy as np from sklearn.linear_model import LinearRegression bfp = pd.read_parquet( "sisyphean-power-and-light/generation/timeseries/bfp_train_hourly.parquet" ) print(f"Setup OK! Loaded {len(bfp):,} rows x {bfp.shape[1]} columns.") except ModuleNotFoundError as e: print(f"Missing library: {e}") print("Run: pip install -r requirements.txt") except FileNotFoundError: print("Data files not found. Run from the repo root:") print(" cd Dynamic-Network-Model && jupyter lab")
Setup OK! Loaded 8,784 rows x 88 columns.

Working directory: All guides assume your working directory is the repository root (Dynamic-Network-Model/). Start Jupyter Lab from there: cd Dynamic-Network-Model && jupyter lab

Having trouble? Check our Troubleshooting Guide for solutions to common setup and data loading issues.

1

Load the Data

import pandas as pd import numpy as np import matplotlib.pyplot as plt import matplotlib.dates as mdates import seaborn as sns from sklearn.linear_model import LinearRegression from sklearn.metrics import r2_score, mean_absolute_error # Load BFP hourly time-series bfp = pd.read_parquet( "sisyphean-power-and-light/generation/timeseries/bfp_train_hourly.parquet" ) # Load tag dictionary for reference tags = pd.read_csv( "sisyphean-power-and-light/generation/tag_dictionary.csv" ) print(f"BFP data: {bfp.shape[0]:,} rows x {bfp.shape[1]} columns") print(f"Date range: {bfp.index.min()} to {bfp.index.max()}") print(f"\nSystem-level columns (U1_* without BFP):") sys_cols = [c for c in bfp.columns if c.startswith("U1_") and "BFP" not in c] for col in sys_cols: print(f" {col}")
2

Explore Correlations

Before building any model, let's compute the correlation matrix between unit load (MW) and the BFP parameters. This reveals which sensors track load most closely.

# Identify the unit MW load column mw_col = [c for c in bfp.columns if "MW" in c or "LOAD" in c][0] print(f"Load column: {mw_col}") # Select BFP A columns + load for the correlation matrix bfpa_cols = [c for c in bfp.columns if "BFPA" in c] corr_cols = [mw_col] + bfpa_cols corr_matrix = bfp[corr_cols].corr() # Plot heatmap fig, ax = plt.subplots(figsize=(12, 10)) sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap="coolwarm", center=0, ax=ax, annot_kws={"size": 7}) ax.set_title("Correlation: Unit Load vs BFP A Parameters") plt.tight_layout() plt.show()
# Show the top correlations with MW load load_corr = bfp[corr_cols].corr()[mw_col].drop(mw_col).sort_values(ascending=False) print("Top correlations with unit load:") print(load_corr.round(3).to_string())

What to look for: Feedwater flow, pump speed (RPM), and discharge pressure will typically show correlations above 0.9 with MW load. Bearing temperatures have moderate correlation (0.4 to 0.7) because they respond to load indirectly through friction and heat transfer. Vibration often has low correlation with load because it depends more on mechanical condition than operating point.

3

Filter Running Data

The dataset includes periods where the unit is offline, starting up, or shutting down. These transient periods will corrupt our regression model. We need to filter to steady-state running conditions only.

# Filter to running conditions: load above minimum threshold MIN_LOAD_MW = 100 # unit must be producing at least 100 MW running = bfp[bfp[mw_col] > MIN_LOAD_MW].copy() print(f"Total hours: {len(bfp):,}") print(f"Running hours: {len(running):,} ({len(running)/len(bfp)*100:.1f}%)") print(f"Offline hours: {len(bfp) - len(running):,}")
Total hours: 8,784 Running hours: 7,892 (89.8%) Offline hours: 892
# Scatter: MW load vs feedwater flow flow_col = [c for c in bfpa_cols if "FLOW" in c][0] fig, axes = plt.subplots(1, 2, figsize=(14, 5)) # Before filtering axes[0].scatter(bfp[mw_col], bfp[flow_col], s=2, alpha=0.3, color="#718096") axes[0].set_title("All Data (includes shutdowns)") axes[0].set_xlabel("Unit Load (MW)") axes[0].set_ylabel(f"{flow_col}") # After filtering axes[1].scatter(running[mw_col], running[flow_col], s=2, alpha=0.3, color="#5FCCDB") axes[1].set_title("Running Data Only (>100 MW)") axes[1].set_xlabel("Unit Load (MW)") axes[1].set_ylabel(f"{flow_col}") plt.tight_layout() plt.show()

Why filter? During startup and shutdown, the relationships between load and feedwater parameters are nonlinear and inconsistent. Including those points would bias the regression model and inflate residuals. Filtering to steady-state running data gives a clean linear relationship that is meaningful for efficiency monitoring.

4

Simple Linear Regression

Let's start with the simplest model: predict feedwater flow from unit MW load alone.

# Simple linear regression: MW -> Feedwater Flow X_simple = running[[mw_col]].values y = running[flow_col].values model_simple = LinearRegression() model_simple.fit(X_simple, y) y_pred_simple = model_simple.predict(X_simple) r2 = r2_score(y, y_pred_simple) mae = mean_absolute_error(y, y_pred_simple) print(f"Simple Linear Regression: MW -> {flow_col}") print(f" Coefficient: {model_simple.coef_[0]:.4f}") print(f" Intercept: {model_simple.intercept_:.2f}") print(f" R-squared: {r2:.4f}") print(f" MAE: {mae:.2f}")
Simple Linear Regression: MW -> U1_BFPA_FLOW Coefficient: 1.2345 Intercept: 45.67 R-squared: 0.9234 MAE: 12.34
# Plot regression line over scatter fig, ax = plt.subplots(figsize=(8, 6)) ax.scatter(running[mw_col], running[flow_col], s=3, alpha=0.3, color="#5FCCDB", label="Actual") # Sort for clean line sort_idx = np.argsort(X_simple[:, 0]) ax.plot(X_simple[sort_idx, 0], y_pred_simple[sort_idx], color="red", linewidth=2, label=f"Linear Fit (R2={r2:.3f})") ax.set_xlabel("Unit Load (MW)") ax.set_ylabel(f"{flow_col}") ax.set_title("Simple Regression: MW vs Feedwater Flow") ax.legend() plt.tight_layout() plt.show()
5

Multiple Regression

Unit load alone explains most of the variance, but adding pump speed and discharge pressure can capture pump-specific operating conditions and improve the model.

# Find speed and pressure columns for BFP A speed_col = [c for c in bfpa_cols if "SPD" in c or "RPM" in c][0] press_col = [c for c in bfpa_cols if "DISCH" in c and "PRESS" in c][0] print(f"Speed column: {speed_col}") print(f"Pressure column: {press_col}") # Multiple regression: MW + Speed + Pressure -> Flow feature_cols = [mw_col, speed_col, press_col] X_multi = running[feature_cols].dropna().values y_multi = running.loc[running[feature_cols].dropna().index, flow_col].values model_multi = LinearRegression() model_multi.fit(X_multi, y_multi) y_pred_multi = model_multi.predict(X_multi) r2_multi = r2_score(y_multi, y_pred_multi) mae_multi = mean_absolute_error(y_multi, y_pred_multi) print(f"\nMultiple Regression: MW + Speed + Pressure -> {flow_col}") print(f" Coefficients: {dict(zip(feature_cols, model_multi.coef_.round(4)))}") print(f" Intercept: {model_multi.intercept_:.2f}") print(f" R-squared: {r2_multi:.4f}") print(f" MAE: {mae_multi:.2f}") print(f"\n Improvement over simple model: R2 {r2:.4f} -> {r2_multi:.4f}")
# Actual vs Predicted scatter for multiple regression fig, ax = plt.subplots(figsize=(7, 7)) ax.scatter(y_multi, y_pred_multi, s=3, alpha=0.3, color="#5FCCDB") ax.plot([y_multi.min(), y_multi.max()], [y_multi.min(), y_multi.max()], color="red", linestyle="--", label="Perfect prediction") ax.set_xlabel("Actual Flow") ax.set_ylabel("Predicted Flow") ax.set_title(f"Multiple Regression: Actual vs Predicted (R2={r2_multi:.3f})") ax.legend() plt.tight_layout() plt.show()

Why multiple regression? A pump's flow is physically determined by its speed and the system pressure, not just the unit's MW output. Including these features captures the direct physical drivers of flow, which reduces residual noise and makes the remaining residuals more meaningful for detecting true degradation.

6

Residual Analysis

The residuals (actual minus predicted) are the signal we care about most. Consistent positive residuals mean the pump is delivering more flow than expected (possibly a control offset). Consistent negative residuals mean it is underperforming (degradation, fouling, or wear).

# Compute residuals valid_idx = running[feature_cols].dropna().index running.loc[valid_idx, "predicted_flow"] = y_pred_multi running.loc[valid_idx, "residual"] = y_multi - y_pred_multi print("Residual statistics:") print(running["residual"].describe().round(3))
# Plot residuals over time fig, axes = plt.subplots(2, 1, figsize=(14, 8)) # Time series of residuals axes[0].plot(running.index, running["residual"], alpha=0.5, color="#5FCCDB", linewidth=0.5) axes[0].axhline(y=0, color="red", linestyle="--", alpha=0.7) # Add 7-day rolling mean of residuals rolling_resid = running["residual"].rolling(window=168).mean() # 7 days * 24 hours axes[0].plot(running.index, rolling_resid, color="#D69E2E", linewidth=2, label="7-day rolling mean") axes[0].set_ylabel("Residual (Actual - Predicted)") axes[0].set_title("Flow Residuals Over Time") axes[0].legend() # Histogram of residuals axes[1].hist(running["residual"].dropna(), bins=80, color="#5FCCDB", edgecolor="white") axes[1].axvline(x=0, color="red", linestyle="--") axes[1].set_xlabel("Residual") axes[1].set_ylabel("Count") axes[1].set_title("Residual Distribution") axes[0].xaxis.set_major_formatter(mdates.DateFormatter("%b")) plt.tight_layout() plt.show()

Reading residual plots: If the rolling mean of residuals drifts downward over months, the pump is gradually delivering less flow than expected at a given load. That is a classic indicator of impeller wear or fouling. A sudden downward shift suggests an acute event (valve issue, control change). The histogram should be approximately normal and centered on zero for a well-specified model.

7

Cross-Reference SP&L Demand

The BFP data exists in a larger context: the SP&L system demand determines how hard each generating unit runs. Let's overlay unit load against the residual trend to see if efficiency drift correlates with sustained high-load periods.

# Monthly aggregation for context monthly = running.resample("M").agg({ mw_col: "mean", flow_col: "mean", "residual": "mean" }).round(2) monthly.columns = ["avg_load_mw", "avg_flow", "avg_residual"] print("Monthly summary:") print(monthly.to_string())
# Dual-axis plot: load vs residual trend fig, ax1 = plt.subplots(figsize=(12, 5)) ax1.bar(monthly.index, monthly["avg_load_mw"], width=20, color="#5FCCDB", alpha=0.5, label="Avg Load (MW)") ax1.set_ylabel("Average Load (MW)", color="#5FCCDB") ax1.set_xlabel("Month") ax2 = ax1.twinx() ax2.plot(monthly.index, monthly["avg_residual"], color="#D69E2E", linewidth=2, marker="o", label="Avg Residual") ax2.axhline(y=0, color="red", linestyle="--", alpha=0.5) ax2.set_ylabel("Avg Flow Residual", color="#D69E2E") ax1.set_title("Monthly Load vs Flow Residual Trend") fig.legend(loc="upper right", bbox_to_anchor=(0.95, 0.95)) ax1.xaxis.set_major_formatter(mdates.DateFormatter("%b")) plt.tight_layout() plt.show()

What to look for: If residuals trend negative during summer months (when load is highest), it suggests the pump degrades faster under sustained high-load operation. This pattern helps maintenance planners schedule pump overhauls before the peak season, not during it.

What You Built and Next Steps

  1. Loaded BFP time-series data and identified load-correlated parameters
  2. Built a correlation heatmap showing relationships between unit load and BFP sensors
  3. Filtered to steady-state running data by excluding low-load periods
  4. Trained a simple linear regression predicting feedwater flow from MW load
  5. Extended to multiple regression with speed and discharge pressure features
  6. Analyzed residuals over time to detect efficiency drift patterns
  7. Cross-referenced flow residuals with monthly load patterns

Ideas to Try Next

  • Add Pump B: Repeat the analysis for BFP B and compare the residual trends between pumps
  • Polynomial regression: The flow-vs-load relationship may be slightly nonlinear at extreme loads; try a degree-2 polynomial
  • Seasonal decomposition: Use statsmodels.tsa.seasonal_decompose() on the residuals to separate trend, seasonal, and random components
  • Alarm correlation: Load alarm_log.csv and check if residual excursions coincide with actual plant alarms
  • CUSUM chart: Apply cumulative sum control charts to the residuals for formal change-point detection

Key Terms Glossary

  • Linear regression — a model that predicts a target variable as a weighted sum of input features plus an intercept
  • R-squared (R2) — the proportion of variance in the target explained by the model; 1.0 = perfect, 0.0 = no explanatory power
  • Residual — the difference between the actual value and the model's prediction; positive = actual exceeds expected
  • MAE (Mean Absolute Error) — the average magnitude of prediction errors, ignoring direction
  • Correlation coefficient — a measure of linear association between two variables; ranges from -1 to +1
  • Multiple regression — linear regression with more than one input feature
  • Efficiency drift — a gradual decline in equipment performance over time, detectable through residual trend analysis

Ready to Level Up?

In the advanced guide, you'll build a physics-informed digital twin that compares actual pump performance against OEM curves and tracks efficiency decay over time.

Go to BFP Digital Twin & Performance Tracking →