Guide 19: Advanced rotating equipment fault diagnosis.

Adam BrownAuthor

Working notebookFormat

SP&L dataDataset

← Back to All Guides

Guide 19

Prefer not to install anything? Click the badge above to open this guide as a runnable notebook in Google Colab. Sign in with any Google account, then use Runtime → Run all to execute every cell, or step through them one at a time.

What You Will Learn

In the beginner guide (Guide 17) you detected anomalies in BFP sensor data. Now you will go further: classifying what type of fault is occurring by fusing multiple sensor signals, training a Random Forest multi-class classifier, and explaining its decisions with SHAP. In this guide you will:

Load and filter BFP time-series data to running conditions
Engineer features from raw sensor channels (rolling stats, ratios, deltas)
Label fault windows using alarm and trip log data
Train a Random Forest classifier for multi-class fault diagnosis
Evaluate with confusion matrix, precision, recall, and F1
Use SHAP to explain which sensors drive each fault prediction
Build a sliding-window real-time fault detector

Prerequisites: This guide builds on concepts from Guide 17: BFP Health Monitoring. Complete that guide first if you are new to the BFP dataset.

SP&L Data You Will Use

bfp_train_hourly.parquet — 8,784 rows x 88 columns of hourly BFP and system data
alarm_log.csv — timestamped alarm events with tag, severity, and description
trip_log.csv — unit trip events with cause codes and affected equipment
tag_dictionary.csv — maps tag names to engineering descriptions and units

Additional Libraries

pip install shap

Which terminal should I use? On Windows, open Anaconda Prompt from the Start Menu (or PowerShell / Command Prompt if Python is already in your PATH). On macOS, open Terminal from Applications → Utilities. On Linux, open your default terminal. All pip install commands work the same across platforms.

Verify Your Setup

Before starting, verify that your environment is configured correctly. Run this cell first to confirm all dependencies are installed and data files are accessible.

# Step 0: Verify your setup
try:
    import pandas as pd
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    import shap

    bfp = pd.read_parquet(
        "sisyphean-power-and-light/generation/timeseries/bfp_train_hourly.parquet"
    )
    alarms = pd.read_csv(
        "sisyphean-power-and-light/generation/events/alarm_log.csv"
    )
    print(f"Setup OK! BFP: {len(bfp):,} rows, Alarms: {len(alarms):,} events.")
except ModuleNotFoundError as e:
    print(f"Missing library: {e}")
    print("Run: pip install -r requirements.txt shap")
except FileNotFoundError:
    print("Data files not found. Run from the repo root:")
    print("  cd Dynamic-Network-Model && jupyter lab")
                    

Setup OK! BFP: 8,784 rows, Alarms: N events.

Working directory: All guides assume your working directory is the repository root (Dynamic-Network-Model/). Start Jupyter Lab from there: cd Dynamic-Network-Model && jupyter lab

Extra dependency: pip install shap

Having trouble? Check our Troubleshooting Guide for solutions to common setup and data loading issues.

Load & Filter Data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    classification_report, confusion_matrix, ConfusionMatrixDisplay
)
import shap

# Load all data sources
bfp = pd.read_parquet(
    "sisyphean-power-and-light/generation/timeseries/bfp_train_hourly.parquet"
)
alarms = pd.read_csv(
    "sisyphean-power-and-light/generation/events/alarm_log.csv",
    parse_dates=["timestamp"]
)
trips = pd.read_csv(
    "sisyphean-power-and-light/generation/events/trip_log.csv",
    parse_dates=["timestamp"]
)

# Filter to running conditions (load > 100 MW)
mw_col = [c for c in bfp.columns if "MW" in c or "LOAD" in c][0]
running = bfp[bfp[mw_col] > 100].copy()

print(f"BFP data:    {bfp.shape}")
print(f"Running:     {running.shape}")
print(f"Alarm events: {len(alarms):,}")
print(f"Trip events:  {len(trips):,}")
                    

Feature Engineering

Raw sensor values alone are not sufficient for fault classification. We need to engineer features that capture the dynamic behavior of the equipment: rolling statistics, cross-sensor ratios, and rate-of-change indicators.

# Identify sensor groups
bfpa_bear = [c for c in running.columns if "BFPA" in c and "BEAR" in c and "TEMP" in c]
bfpa_vib  = [c for c in running.columns if "BFPA" in c and "VIB" in c]

# Rolling statistics (6-hour window)
for col in bfpa_bear + bfpa_vib:
    running[f"{col}_roll6_mean"] = running[col].rolling(6).mean()
    running[f"{col}_roll6_std"]  = running[col].rolling(6).std()

# Rate of change (hourly delta)
for col in bfpa_bear + bfpa_vib:
    running[f"{col}_delta"] = running[col].diff()

# Cross-sensor ratios
if len(bfpa_bear) >= 2:
    running["bear_temp_ratio"] = running[bfpa_bear[0]] / running[bfpa_bear[1]].replace(0, np.nan)

if len(bfpa_vib) >= 2:
    running["vib_ratio"] = running[bfpa_vib[0]] / running[bfpa_vib[1]].replace(0, np.nan)

# Max vibration across all BFP A sensors
running["max_vib_a"] = running[bfpa_vib].max(axis=1)

print(f"Features engineered. New shape: {running.shape}")
print(f"New columns: {running.shape[1] - bfp.shape[1]} added")
                    

Why these features? Rolling standard deviation captures volatility that precedes failures. Rate-of-change (delta) detects rapid deterioration. Cross-sensor ratios reveal imbalances: a healthy pump has roughly equal DE and NDE bearing temperatures, so a ratio drifting away from 1.0 signals uneven wear or lubrication issues. These engineered features give the classifier far more diagnostic power than raw values alone.

Label Fault Windows

We need labeled data to train a classifier. The alarm and trip logs provide timestamps of known fault events. We will label each hour in the time series as one of several fault categories based on proximity to alarm events.

# Filter alarms to BFP-related events
bfp_alarms = alarms[alarms["tag"].str.contains("BFP", na=False)].copy()

# Define fault categories based on alarm descriptions
def classify_alarm(desc):
    desc_lower = str(desc).lower()
    if "bearing" in desc_lower or "temp" in desc_lower:
        return "bearing_fault"
    elif "vibration" in desc_lower or "vib" in desc_lower:
        return "vibration_fault"
    elif "seal" in desc_lower or "leak" in desc_lower:
        return "seal_fault"
    elif "cavit" in desc_lower or "suction" in desc_lower:
        return "cavitation"
    else:
        return "other_fault"

bfp_alarms["fault_type"] = bfp_alarms["description"].apply(classify_alarm)
print("Fault distribution:")
print(bfp_alarms["fault_type"].value_counts())
                    

# Label each hour in the time series
# A fault window is +/- 4 hours around each alarm event
FAULT_WINDOW_HOURS = 4

running["fault_label"] = "normal"

for _, alarm in bfp_alarms.iterrows():
    t = alarm["timestamp"]
    mask = (running.index >= t - pd.Timedelta(hours=FAULT_WINDOW_HOURS)) & \
           (running.index <= t + pd.Timedelta(hours=FAULT_WINDOW_HOURS))
    running.loc[mask, "fault_label"] = alarm["fault_type"]

print("\nLabel distribution:")
print(running["fault_label"].value_counts())
                    

Fault window size: A 4-hour window captures the lead-up to and aftermath of each fault event. Plant data typically shows sensor drift in the hours before an alarm fires. Wider windows capture more precursor behavior but risk labeling normal operation as faulty. Start with 4 hours and adjust based on your confusion matrix results.

Train the Classifier

# Select feature columns (all engineered features)
feature_cols = [c for c in running.columns
                if "roll6" in c or "delta" in c
                or "ratio" in c or "max_vib" in c
                or ("BFPA" in c and ("BEAR" in c or "VIB" in c))]

# Drop NaN rows from rolling/diff operations
model_data = running[feature_cols + ["fault_label"]].dropna()

X = model_data[feature_cols]
y = model_data["fault_label"]

print(f"Feature matrix: {X.shape}")
print(f"Classes: {y.nunique()} ({y.value_counts().to_dict()})")

# Train/test split (stratified to preserve class balance)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

print(f"\nTraining: {len(X_train):,} samples")
print(f"Test:     {len(X_test):,} samples")
                    

# Train Random Forest with class weight balancing
rf = RandomForestClassifier(
    n_estimators=300,
    max_depth=12,
    min_samples_leaf=5,
    class_weight="balanced",
    random_state=42,
    n_jobs=-1
)

rf.fit(X_train, y_train)
print("Random Forest training complete.")
print(f"OOB-like train accuracy: {rf.score(X_train, y_train):.3f}")
print(f"Test accuracy:           {rf.score(X_test, y_test):.3f}")
                    

Evaluate with Confusion Matrix

# Classification report
y_pred = rf.predict(X_test)

print(classification_report(y_test, y_pred))
                    

# Confusion matrix visualization
fig, ax = plt.subplots(figsize=(8, 7))

cm = confusion_matrix(y_test, y_pred, labels=rf.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=rf.classes_)
disp.plot(cmap="Blues", ax=ax, values_format="d")

ax.set_title("Fault Classification Confusion Matrix")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()
                    

Reading the confusion matrix: Each row is the true fault type, each column is the predicted type. The diagonal shows correct predictions. Off-diagonal entries show misclassifications. For rotating equipment, confusing "vibration_fault" with "bearing_fault" is common because high vibration often causes bearing damage. The classification report shows precision (of all predicted bearing faults, how many were correct?) and recall (of all actual bearing faults, how many did the model find?).

SHAP Analysis

SHAP (SHapley Additive exPlanations) provides a mathematically rigorous way to explain why the model made each prediction. For each sample, SHAP assigns a contribution value to every feature, showing which sensors pushed the prediction toward each fault class.

# Compute SHAP values (use a sample for speed)
X_sample = X_test.sample(n=min(500, len(X_test)), random_state=42)

explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_sample)

print(f"SHAP values shape: {np.array(shap_values).shape}")
print(f"Classes: {rf.classes_.tolist()}")
                    

# SHAP summary plot for the most important fault class
# Find the index of the most interesting fault (not "normal")
fault_classes = [c for c in rf.classes_ if c != "normal"]
fault_idx = list(rf.classes_).index(fault_classes[0])

plt.figure(figsize=(10, 8))
shap.summary_plot(
    shap_values[fault_idx],
    X_sample,
    max_display=15,
    show=False
)
plt.title(f"SHAP Feature Importance: {fault_classes[0]}")
plt.tight_layout()
plt.show()
                    

# SHAP bar plot comparing feature importance across all fault types
fig, axes = plt.subplots(1, len(fault_classes),
                         figsize=(5 * len(fault_classes), 6))

for i, fault in enumerate(fault_classes):
    idx = list(rf.classes_).index(fault)
    mean_shap = np.abs(shap_values[idx]).mean(axis=0)
    top_features = pd.Series(mean_shap, index=feature_cols).nlargest(10)

    axes[i].barh(range(len(top_features)),
                top_features.values, color="#5FCCDB")
    axes[i].set_yticks(range(len(top_features)))
    axes[i].set_yticklabels(top_features.index, fontsize=8)
    axes[i].set_title(fault, fontsize=10)
    axes[i].set_xlabel("Mean |SHAP|")

plt.suptitle("Top Features by Fault Type", fontsize=13)
plt.tight_layout()
plt.show()
                    

What SHAP tells you: If the bearing temperature rolling standard deviation is the top SHAP feature for "bearing_fault", it means the model learned that volatile bearing temperatures are the strongest indicator of bearing problems. This aligns with engineering knowledge: a failing bearing exhibits erratic temperature swings before a sustained rise. SHAP converts a black-box model into an interpretable diagnostic tool that plant engineers can trust.

Sliding Window Detection

In a real plant, data arrives hour by hour. We simulate a real-time fault detector that processes each hour as it arrives, maintains a sliding window of recent predictions, and raises an alarm when a fault type is consistently predicted.

# Sliding window fault detector
WINDOW_SIZE = 6       # hours of recent predictions to consider
ALARM_THRESHOLD = 4   # if a fault type appears 4+ times in 6 hours, alarm

# Predict all hours
all_predictions = rf.predict(model_data[feature_cols])

# Simulate streaming detection
detector_alarms = []
for i in range(WINDOW_SIZE, len(all_predictions)):
    window = all_predictions[i - WINDOW_SIZE:i]
    for fault_type in fault_classes:
        count = sum(1 for p in window if p == fault_type)
        if count >= ALARM_THRESHOLD:
            detector_alarms.append({
                "timestamp": model_data.index[i],
                "fault_type": fault_type,
                "confidence": count / WINDOW_SIZE
            })

alarm_df = pd.DataFrame(detector_alarms)
print(f"Sliding window detector raised {len(alarm_df)} alarms")
if len(alarm_df) > 0:
    print("\nAlarm breakdown:")
    print(alarm_df["fault_type"].value_counts())
    print("\nFirst 10 alarms:")
    print(alarm_df.head(10).to_string(index=False))
                    

# Visualize detector alarms on timeline
fig, axes = plt.subplots(2, 1, figsize=(14, 8), sharex=True)

# Panel 1: Sensor data
temp_col = [c for c in bfpa_bear][0]
axes[0].plot(model_data.index, model_data[temp_col],
            alpha=0.6, color="#5FCCDB", linewidth=0.5)
axes[0].set_ylabel("Bearing Temp (F)")
axes[0].set_title("Sliding Window Fault Detection")

# Overlay detector alarms
if len(alarm_df) > 0:
    for fault in fault_classes:
        fault_alarms = alarm_df[alarm_df["fault_type"] == fault]
        if len(fault_alarms) > 0:
            axes[0].scatter(fault_alarms["timestamp"],
                [model_data.loc[t, temp_col] if t in model_data.index else np.nan
                 for t in fault_alarms["timestamp"]],
                s=20, marker="x", label=fault, zorder=5)
    axes[0].legend(fontsize=8)

# Panel 2: Prediction confidence over time
pred_series = pd.Series(all_predictions, index=model_data.index)
is_fault = (pred_series != "normal").astype(int)
fault_rate = is_fault.rolling(WINDOW_SIZE).mean()

axes[1].fill_between(model_data.index, fault_rate,
                      alpha=0.5, color="#D69E2E")
axes[1].axhline(y=ALARM_THRESHOLD / WINDOW_SIZE, color="red",
                linestyle="--", label="Alarm threshold")
axes[1].set_ylabel("Fault Rate (6h window)")
axes[1].set_xlabel("Date")
axes[1].legend()

plt.tight_layout()
plt.show()
                    

✓

What You Built and Next Steps

Loaded BFP time-series, alarm, and trip data from the SP&L generation dataset
Engineered features from rolling statistics, rate-of-change, and cross-sensor ratios
Labeled fault windows from alarm events with automatic fault-type classification
Trained a Random Forest multi-class classifier with class-weight balancing
Evaluated with confusion matrix and per-class precision/recall/F1
Used SHAP to explain which sensors drive each fault type's prediction
Built a sliding-window real-time fault detector with configurable alarm thresholds

Ideas to Try Next

XGBoost comparison: Replace Random Forest with XGBoost and compare accuracy and SHAP explanations
Hyperparameter tuning: Use GridSearchCV to optimize tree depth, number of estimators, and leaf size
Temporal features: Add hour-of-day and day-of-week features to capture operational cycle effects
BFP B fusion: Include BFP B sensors as additional features for cross-pump fault detection
LIME comparison: Use lime for local explanations and compare with SHAP

Ready to Level Up?

In the next guide, you'll build a physics-informed digital twin that compares actual BFP performance against OEM pump curves.

Go to BFP Digital Twin & Performance Tracking →

← Prev: Feedwater Load Correlation Next: BFP Digital Twin →

— Adam · adam@sgridworks.com