What You Will Learn
Feedwater flow, pump speed, and discharge pressure all scale with unit megawatt output. Understanding these relationships is fundamental to detecting efficiency drift: if the BFP is working harder than expected for a given load, something is degrading. In this guide you will:
- Load BFP time-series data and identify load-correlated parameters
- Build a correlation heatmap to discover the strongest relationships
- Filter data to running conditions only (exclude shutdowns and startups)
- Train a linear regression model to predict feedwater flow from unit load
- Extend to multiple regression with pump speed and discharge pressure
- Analyze residuals to detect periods of efficiency drift
- Cross-reference with SP&L demand data for context
Why does load correlation matter? In a thermal plant, nearly every parameter scales with MW output. Feedwater flow is roughly proportional to steam demand, which is proportional to electrical load. When the actual flow deviates from what the load predicts, it signals either equipment degradation (fouled tubes, worn pump impellers) or a control system issue.
SP&L Data You Will Use
- bfp_train_hourly.parquet — 8,784 rows x 88 columns of hourly BFP and system data
- Column pattern: U1_BFPA_* (Pump A), U1_BFPB_* (Pump B), U1_* (system-level tags including MW load)
- tag_dictionary.csv — maps tag names to engineering descriptions and units
Verify Your Setup
Before starting, verify that your environment is configured correctly. Run this cell first to confirm all dependencies are installed and data files are accessible.
Working directory: All guides assume your working directory is the repository root (Dynamic-Network-Model/). Start Jupyter Lab from there: cd Dynamic-Network-Model && jupyter lab
Having trouble? Check our Troubleshooting Guide for solutions to common setup and data loading issues.
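A setup check along these lines can confirm the basics before you go further. This is a minimal sketch: the package list and the data/ directory are assumptions, not taken from the repository, so adjust both to match your checkout.

```python
import importlib.util
from pathlib import Path

# Assumed package list; edit to match the repository's requirements.
REQUIRED_PACKAGES = ["numpy", "pandas", "pyarrow", "sklearn", "matplotlib"]
# Hypothetical data directory, relative to the repository root.
DATA_DIR = Path("data")
REQUIRED_FILES = ["bfp_train_hourly.parquet", "tag_dictionary.csv"]


def check_setup(packages=REQUIRED_PACKAGES, data_dir=DATA_DIR, files=REQUIRED_FILES):
    """Return the packages that cannot be found and the data files that are missing."""
    missing_packages = [p for p in packages if importlib.util.find_spec(p) is None]
    missing_files = [f for f in files if not (Path(data_dir) / f).exists()]
    return {"missing_packages": missing_packages, "missing_files": missing_files}


report = check_setup()
for kind, items in report.items():
    print(f"{kind}: {items if items else 'none'}")
```

If either list is non-empty, fix that first: install the missing packages or correct the working directory.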
Load the Data
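A loading step might look like the sketch below. The file location is hypothetical, and the grouping helper simply applies the column pattern listed under SP&L Data You Will Use; it is an illustration, not the guide's own loader.

```python
from pathlib import Path

import pandas as pd

# Hypothetical location; point this at wherever the file lives in your checkout.
DATA_PATH = Path("data") / "bfp_train_hourly.parquet"


def group_columns(columns):
    """Split tag names into Pump A, Pump B, and system-level groups by prefix."""
    groups = {"pump_a": [], "pump_b": [], "system": []}
    for name in columns:
        if name.startswith("U1_BFPA_"):
            groups["pump_a"].append(name)
        elif name.startswith("U1_BFPB_"):
            groups["pump_b"].append(name)
        elif name.startswith("U1_"):
            groups["system"].append(name)
    return groups


if DATA_PATH.exists():
    df = pd.read_parquet(DATA_PATH)
    print(f"Loaded {df.shape[0]:,} rows x {df.shape[1]} columns")
    for group, names in group_columns(df.columns).items():
        print(f"{group}: {len(names)} tags")
else:
    print(f"{DATA_PATH} not found; check your working directory")
```

The pump prefixes are tested before the bare U1_ prefix because every pump tag also starts with U1_.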
Explore Correlations
Before building any model, let's compute the correlation matrix between unit load (MW) and the BFP parameters. This reveals which sensors track load most closely.
What to look for: Feedwater flow, pump speed (RPM), and discharge pressure will typically show correlations above 0.9 with MW load. Bearing temperatures have moderate correlation (0.4 to 0.7) because they respond to load indirectly through friction and heat transfer. Vibration often has low correlation with load because it depends more on mechanical condition than operating point.
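The ranking described above can be sketched on synthetic data. Every tag name and coefficient below is hypothetical, chosen only so the three correlation bands (high, moderate, low) are visible; substitute the real columns from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the BFP table; every tag name below is hypothetical.
rng = np.random.default_rng(0)
n = 2000
mw = rng.uniform(200, 500, n)
df = pd.DataFrame({
    "U1_MW": mw,
    "U1_BFPA_FLOW": 3.0 * mw + rng.normal(0, 20, n),
    "U1_BFPA_SPEED": 2000 + 6.0 * mw + rng.normal(0, 50, n),
    "U1_BFPA_BRG_TEMP": 50 + 0.05 * mw + rng.normal(0, 5, n),
    "U1_BFPA_VIB": rng.normal(2.0, 0.3, n),
})

# Pearson correlation matrix; this is the table a heatmap would colour.
corr = df.corr()

# Rank every sensor by the strength of its linear association with load.
load_corr = (
    corr["U1_MW"]
    .drop("U1_MW")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(load_corr.round(2))
```

Sorting by absolute value keeps strongly negative correlations near the top, where they belong. The `corr` frame can be passed directly to a plotting call such as seaborn's `heatmap` to draw the matrix.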
Filter Running Data
The dataset includes periods where the unit is offline, starting up, or shutting down. These offline and transient periods do not follow the steady-state load relationship and would distort our regression model, so we need to filter to steady-state running conditions only.
Why filter? During startup and shutdown, the relationships between load and feedwater parameters are nonlinear and inconsistent. Including those points would bias the regression model and inflate residuals. Filtering to steady-state running data gives a clean linear relationship that is meaningful for efficiency monitoring.
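One possible filter combines a minimum-load threshold with a limit on the hour-to-hour load change. Both threshold values and the load tag name are assumptions for illustration; choose values that match the unit's actual minimum stable load and ramp behaviour.

```python
import numpy as np
import pandas as pd

# Synthetic load trace: offline, ramp up, steady running, ramp down, offline.
rng = np.random.default_rng(1)
mw = np.concatenate([
    np.zeros(24),
    np.linspace(0, 400, 8),
    400 + rng.normal(0, 2, 100),
    np.linspace(400, 0, 8),
    np.zeros(24),
])
df = pd.DataFrame({"U1_MW": mw})  # hypothetical load tag

MIN_LOAD_MW = 150.0        # assumed minimum stable load
MAX_RAMP_MW_PER_H = 20.0   # assumed limit on hour-to-hour load change

above_min_load = df["U1_MW"] >= MIN_LOAD_MW
# The first row has no previous sample, so its ramp is NaN and it is excluded.
ramp = df["U1_MW"].diff().abs()
steady = ramp <= MAX_RAMP_MW_PER_H

running = df[above_min_load & steady]
print(f"Kept {len(running)} of {len(df)} rows ({len(running) / len(df):.0%})")
```

The load threshold alone removes offline hours but keeps the upper part of each ramp; the ramp-rate condition is what drops those transient points.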
Simple Linear Regression
Let's start with the simplest model: predict feedwater flow from unit MW load alone.
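As a minimal sketch, the fit can be done with NumPy's least-squares polynomial fit; this is a swapped-in technique, and a library estimator such as scikit-learn's `LinearRegression` produces the same line. The data is synthetic and the coefficients are made up.

```python
import numpy as np

# Synthetic running data; in the guide these would be the filtered load and
# flow columns from the dataset.
rng = np.random.default_rng(2)
mw = rng.uniform(250, 500, 1000)
flow = 3.0 * mw + 40.0 + rng.normal(0, 15, 1000)

# Ordinary least squares for flow = slope * mw + intercept.
# np.polyfit returns coefficients from the highest power down.
slope, intercept = np.polyfit(mw, flow, deg=1)
predicted = slope * mw + intercept

# Residual = actual minus predicted, the same sign convention as the glossary.
residuals = flow - predicted
r2 = 1.0 - np.sum(residuals**2) / np.sum((flow - flow.mean()) ** 2)
mae = np.mean(np.abs(residuals))

print(f"flow = {slope:.3f} * MW + {intercept:.1f}")
print(f"R2 = {r2:.3f}, MAE = {mae:.1f}")
```

R2 tells you how much of the flow variance load explains; MAE tells you the typical miss in flow units, which is the easier number to compare against a maintenance threshold.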
Multiple Regression
Unit load alone explains most of the variance, but adding pump speed and discharge pressure can capture pump-specific operating conditions and improve the model.
Why multiple regression? A pump's flow is physically determined by its speed and the system pressure, not just the unit's MW output. Including these features captures the direct physical drivers of flow, which reduces residual noise and makes the remaining residuals more meaningful for detecting true degradation.
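The comparison between the one-feature and three-feature models can be sketched with a plain least-squares solve. The data below is synthetic, with flow deliberately made to depend on speed and pressure as well as load; the coefficients are illustrative, not plant values.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
mw = rng.uniform(250, 500, n)
speed = 2000 + 6.0 * mw + rng.normal(0, 100, n)
pressure = 100 + 0.2 * mw + rng.normal(0, 5, n)
flow = 1.5 * mw + 0.25 * speed - 0.5 * pressure + rng.normal(0, 10, n)


def fit_ols(features, target):
    """Least-squares fit with an intercept; returns coefficients and predictions."""
    X = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coef, X @ coef


def r_squared(target, predicted):
    """Proportion of target variance explained by the predictions."""
    ss_res = np.sum((target - predicted) ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


_, pred_simple = fit_ols(mw, flow)
coef_multi, pred_multi = fit_ols(np.column_stack([mw, speed, pressure]), flow)

r2_simple, r2_multi = r_squared(flow, pred_simple), r_squared(flow, pred_multi)
mae_simple = np.mean(np.abs(flow - pred_simple))
mae_multi = np.mean(np.abs(flow - pred_multi))
print(f"load only:   R2 = {r2_simple:.4f}, MAE = {mae_simple:.1f}")
print(f"three terms: R2 = {r2_multi:.4f}, MAE = {mae_multi:.1f}")
```

Because speed and pressure both track load closely, the individual coefficients are hard to interpret on their own; judge the model by its residuals rather than by any single weight.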
Residual Analysis
The residuals (actual minus predicted) are the signal we care about most. Consistent positive residuals mean the pump is delivering more flow than expected (possibly a control offset). Consistent negative residuals mean it is underperforming (degradation, fouling, or wear).
Reading residual plots: If the rolling mean of residuals drifts downward over months, the pump is gradually delivering less flow than expected at a given load. That is a classic indicator of impeller wear or fouling. A sudden downward shift suggests an acute event (valve issue, control change). The histogram should be approximately normal and centered on zero for a well-specified model.
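The drift pattern can be reproduced on synthetic data with a slow loss of flow injected on purpose. Fitting on the first 30 days as a healthy baseline and using a 7-day rolling window are both assumptions of this sketch, not settings taken from the guide.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
n = 24 * 200  # 200 days of hourly samples
index = pd.date_range("2024-01-01", periods=n, freq="h")
mw = rng.uniform(250, 500, n)
drift = np.linspace(0.0, -30.0, n)  # injected slow loss of flow
flow = 3.0 * mw + drift + rng.normal(0, 10, n)

# Fit on an assumed healthy baseline (first 30 days), then score every hour.
baseline = slice(0, 24 * 30)
slope, intercept = np.polyfit(mw[baseline], flow[baseline], deg=1)
residuals = pd.Series(flow - (slope * mw + intercept), index=index)

# A 7-day rolling mean smooths hourly noise so the slow trend is visible.
rolling_mean = residuals.rolling(window=24 * 7).mean()

print(f"baseline residual mean: {residuals.iloc[baseline].mean():.2f}")
print(f"final 7-day rolling mean: {rolling_mean.iloc[-1]:.2f}")
```

Fitting on a baseline period keeps later degradation out of the model. If the model is fit on the whole record instead, part of the drift is absorbed into the intercept and the trend is centred on zero, though its downward slope remains.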
Cross-Reference SP&L Demand
The BFP data exists in a larger context: the SP&L system demand determines how hard each generating unit runs. Let's overlay unit load against the residual trend to see if efficiency drift correlates with sustained high-load periods.
What to look for: If residuals trend negative during summer months (when load is highest), it suggests the pump degrades faster under sustained high-load operation. This pattern helps maintenance planners schedule pump overhauls before the peak season, not during it.
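A monthly roll-up is one way to make this comparison. The sketch below uses synthetic data with a summer load peak and a residual that worsens with load built in, so the pattern is visible by construction; the calendar year and column names are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
# One leap year of hourly samples (8,784 rows); the year itself is assumed.
index = pd.date_range("2024-01-01", periods=8784, freq="h")
day = index.dayofyear.to_numpy()

# Load peaks around day 200 (mid-July); residual worsens as load rises.
load = 350 + 80 * np.cos(2 * np.pi * (day - 200) / 366) + rng.normal(0, 20, len(index))
residual = -0.05 * (load - 350) + rng.normal(0, 3, len(index))

df = pd.DataFrame({"load_mw": load, "flow_residual": residual}, index=index)

# Average both signals by calendar month.
monthly = df.groupby(df.index.month).mean()
monthly.index.name = "month"
print(monthly.round(2))

# A strongly negative value means high-load months show the worst residuals.
monthly_corr = monthly["load_mw"].corr(monthly["flow_residual"])
print(f"monthly load vs residual correlation: {monthly_corr:.2f}")
```

With only twelve monthly points, treat the correlation as a prompt for closer inspection rather than as proof of load-driven degradation.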
What You Built and Next Steps
- Loaded BFP time-series data and identified load-correlated parameters
- Built a correlation heatmap showing relationships between unit load and BFP sensors
- Filtered to steady-state running data by excluding low-load periods
- Trained a simple linear regression predicting feedwater flow from MW load
- Extended to multiple regression with speed and discharge pressure features
- Analyzed residuals over time to detect efficiency drift patterns
- Cross-referenced flow residuals with monthly load patterns
Ideas to Try Next
- Add Pump B: Repeat the analysis for BFP B and compare the residual trends between pumps
- Polynomial regression: The flow-vs-load relationship may be slightly nonlinear at extreme loads; try a degree-2 polynomial
- Seasonal decomposition: Use statsmodels.tsa.seasonal_decompose() on the residuals to separate trend, seasonal, and random components
- Alarm correlation: Load alarm_log.csv and check if residual excursions coincide with actual plant alarms
- CUSUM chart: Apply cumulative sum control charts to the residuals for formal change-point detection
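As a starting point for the CUSUM idea, here is a minimal sketch of a two-sided tabular CUSUM. It assumes the residuals have already been standardized to sigma units. The defaults k = 0.5 and h = 5 are commonly used starting values, not tuned for this plant, and the input is a deterministic toy series rather than real residuals.

```python
import numpy as np


def tabular_cusum(z, k=0.5, h=5.0):
    """Two-sided tabular CUSUM on standardized values z.

    k is the allowance and h the decision threshold, both in sigma units.
    Returns the upper sums, lower sums, and a boolean alarm array.
    """
    upper = np.zeros(len(z))
    lower = np.zeros(len(z))
    s_hi = s_lo = 0.0
    for i, value in enumerate(z):
        s_hi = max(0.0, s_hi + value - k)  # accumulates upward shifts
        s_lo = max(0.0, s_lo - value - k)  # accumulates downward shifts
        upper[i], lower[i] = s_hi, s_lo
    return upper, lower, (upper > h) | (lower > h)


# Deterministic toy residuals in sigma units: small alternating noise,
# then a sustained -2 sigma shift starting at sample 300.
i = np.arange(400)
z = np.where(i % 2 == 0, 0.5, -0.5)
z = z + np.where(i >= 300, -2.0, 0.0)

upper, lower, alarm = tabular_cusum(z)
first_alarm = int(np.argmax(alarm)) if alarm.any() else None
print(f"first alarm at sample {first_alarm}")
```

The lower sum is the one to watch for efficiency drift, since a pump that is losing flow produces a run of negative residuals.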
Key Terms Glossary
- Linear regression — a model that predicts a target variable as a weighted sum of input features plus an intercept
- R-squared (R²) — the proportion of variance in the target explained by the model; 1.0 = perfect, 0.0 = no explanatory power
- Residual — the difference between the actual value and the model's prediction; positive = actual exceeds expected
- MAE (Mean Absolute Error) — the average magnitude of prediction errors, ignoring direction
- Correlation coefficient — a measure of linear association between two variables; ranges from -1 to +1
- Multiple regression — linear regression with more than one input feature
- Efficiency drift — a gradual decline in equipment performance over time, detectable through residual trend analysis
Ready to Level Up?
In the advanced guide, you'll build a physics-informed digital twin that compares actual pump performance against OEM curves and tracks efficiency decay over time.
Go to BFP Digital Twin & Performance Tracking →