Guide 19: Advanced rotating equipment fault diagnosis.
Guide 19
Prefer not to install anything? Click the badge above to open this guide as a runnable notebook in Google Colab. Sign in with any Google account, then use Runtime → Run all to execute every cell, or step through them one at a time.
What You Will Learn
In the beginner guide (Guide 17) you detected anomalies in BFP sensor data. Now you will go further: classifying what type of fault is occurring by fusing multiple sensor signals, training a Random Forest multi-class classifier, and explaining its decisions with SHAP. In this guide you will:
- Load and filter BFP time-series data to running conditions
- Engineer features from raw sensor channels (rolling stats, ratios, deltas)
- Label fault windows using alarm and trip log data
- Train a Random Forest classifier for multi-class fault diagnosis
- Evaluate with confusion matrix, precision, recall, and F1
- Use SHAP to explain which sensors drive each fault prediction
- Build a sliding-window real-time fault detector
SP&L Data You Will Use
- bfp_train_hourly.parquet — 8,784 rows x 88 columns of hourly BFP and system data
- alarm_log.csv — timestamped alarm events with tag, severity, and description
- trip_log.csv — unit trip events with cause codes and affected equipment
- tag_dictionary.csv — maps tag names to engineering descriptions and units
Additional Libraries
pip install shap
Which terminal should I use? On Windows, open Anaconda Prompt from the Start Menu (or PowerShell / Command Prompt if Python is already in your PATH). On macOS, open Terminal from Applications → Utilities. On Linux, open your default terminal. All pip install commands work the same across platforms.
Before starting, verify that your environment is configured correctly. Run this cell first to confirm all dependencies are installed and data files are accessible.
try:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
import shap
bfp = pd.read_parquet(
"sisyphean-power-and-light/generation/timeseries/bfp_train_hourly.parquet"
)
alarms = pd.read_csv(
"sisyphean-power-and-light/generation/events/alarm_log.csv"
)
print(f"Setup OK! BFP: {len(bfp):,} rows, Alarms: {len(alarms):,} events.")
except ModuleNotFoundError as e:
print(f"Missing library: {e}")
print("Run: pip install -r requirements.txt shap")
except FileNotFoundError:
print("Data files not found. Run from the repo root:")
print(" cd Dynamic-Network-Model && jupyter lab")
Setup OK! BFP: 8,784 rows, Alarms: N events.
Working directory: All guides assume your working directory is the repository root (Dynamic-Network-Model/). Start Jupyter Lab from there: cd Dynamic-Network-Model && jupyter lab
Extra dependency: pip install shap
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
classification_report, confusion_matrix, ConfusionMatrixDisplay
)
import shap
bfp = pd.read_parquet(
"sisyphean-power-and-light/generation/timeseries/bfp_train_hourly.parquet"
)
alarms = pd.read_csv(
"sisyphean-power-and-light/generation/events/alarm_log.csv",
parse_dates=["timestamp"]
)
trips = pd.read_csv(
"sisyphean-power-and-light/generation/events/trip_log.csv",
parse_dates=["timestamp"]
)
mw_col = [c for c in bfp.columns if "MW" in c or "LOAD" in c][0]
running = bfp[bfp[mw_col] > 100].copy()
print(f"BFP data: {bfp.shape}")
print(f"Running: {running.shape}")
print(f"Alarm events: {len(alarms):,}")
print(f"Trip events: {len(trips):,}")
Raw sensor values alone are not sufficient for fault classification. We need to engineer features that capture the dynamic behavior of the equipment: rolling statistics, cross-sensor ratios, and rate-of-change indicators.
bfpa_bear = [c for c in running.columns if "BFPA" in c and "BEAR" in c and "TEMP" in c]
bfpa_vib = [c for c in running.columns if "BFPA" in c and "VIB" in c]
for col in bfpa_bear + bfpa_vib:
running[f"{col}_roll6_mean"] = running[col].rolling(6).mean()
running[f"{col}_roll6_std"] = running[col].rolling(6).std()
for col in bfpa_bear + bfpa_vib:
running[f"{col}_delta"] = running[col].diff()
if len(bfpa_bear) >= 2:
running["bear_temp_ratio"] = running[bfpa_bear[0]] / running[bfpa_bear[1]].replace(0, np.nan)
if len(bfpa_vib) >= 2:
running["vib_ratio"] = running[bfpa_vib[0]] / running[bfpa_vib[1]].replace(0, np.nan)
running["max_vib_a"] = running[bfpa_vib].max(axis=1)
print(f"Features engineered. New shape: {running.shape}")
print(f"New columns: {running.shape[1] - bfp.shape[1]} added")
Why these features? Rolling standard deviation captures volatility that precedes failures. Rate-of-change (delta) detects rapid deterioration. Cross-sensor ratios reveal imbalances: a healthy pump has roughly equal DE and NDE bearing temperatures, so a ratio drifting away from 1.0 signals uneven wear or lubrication issues. These engineered features give the classifier far more diagnostic power than raw values alone.
We need labeled data to train a classifier. The alarm and trip logs provide timestamps of known fault events. We will label each hour in the time series as one of several fault categories based on proximity to alarm events.
bfp_alarms = alarms[alarms["tag"].str.contains("BFP", na=False)].copy()
def classify_alarm(desc):
desc_lower = str(desc).lower()
if "bearing" in desc_lower or "temp" in desc_lower:
return "bearing_fault"
elif "vibration" in desc_lower or "vib" in desc_lower:
return "vibration_fault"
elif "seal" in desc_lower or "leak" in desc_lower:
return "seal_fault"
elif "cavit" in desc_lower or "suction" in desc_lower:
return "cavitation"
else:
return "other_fault"
bfp_alarms["fault_type"] = bfp_alarms["description"].apply(classify_alarm)
print("Fault distribution:")
print(bfp_alarms["fault_type"].value_counts())
FAULT_WINDOW_HOURS = 4
running["fault_label"] = "normal"
for _, alarm in bfp_alarms.iterrows():
t = alarm["timestamp"]
mask = (running.index >= t - pd.Timedelta(hours=FAULT_WINDOW_HOURS)) & \
(running.index <= t + pd.Timedelta(hours=FAULT_WINDOW_HOURS))
running.loc[mask, "fault_label"] = alarm["fault_type"]
print("\nLabel distribution:")
print(running["fault_label"].value_counts())
Fault window size: A 4-hour window captures the lead-up to and aftermath of each fault event. Plant data typically shows sensor drift in the hours before an alarm fires. Wider windows capture more precursor behavior but risk labeling normal operation as faulty. Start with 4 hours and adjust based on your confusion matrix results.
feature_cols = [c for c in running.columns
if "roll6" in c or "delta" in c
or "ratio" in c or "max_vib" in c
or ("BFPA" in c and ("BEAR" in c or "VIB" in c))]
model_data = running[feature_cols + ["fault_label"]].dropna()
X = model_data[feature_cols]
y = model_data["fault_label"]
print(f"Feature matrix: {X.shape}")
print(f"Classes: {y.nunique()} ({y.value_counts().to_dict()})")
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.25, random_state=42, stratify=y
)
print(f"\nTraining: {len(X_train):,} samples")
print(f"Test: {len(X_test):,} samples")
rf = RandomForestClassifier(
n_estimators=300,
max_depth=12,
min_samples_leaf=5,
class_weight="balanced",
random_state=42,
n_jobs=-1
)
rf.fit(X_train, y_train)
print("Random Forest training complete.")
print(f"OOB-like train accuracy: {rf.score(X_train, y_train):.3f}")
print(f"Test accuracy: {rf.score(X_test, y_test):.3f}")
y_pred = rf.predict(X_test)
print(classification_report(y_test, y_pred))
fig, ax = plt.subplots(figsize=(8, 7))
cm = confusion_matrix(y_test, y_pred, labels=rf.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=rf.classes_)
disp.plot(cmap="Blues", ax=ax, values_format="d")
ax.set_title("Fault Classification Confusion Matrix")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()
Reading the confusion matrix: Each row is the true fault type, each column is the predicted type. The diagonal shows correct predictions. Off-diagonal entries show misclassifications. For rotating equipment, confusing "vibration_fault" with "bearing_fault" is common because high vibration often causes bearing damage. The classification report shows precision (of all predicted bearing faults, how many were correct?) and recall (of all actual bearing faults, how many did the model find?).
SHAP (SHapley Additive exPlanations) provides a mathematically rigorous way to explain why the model made each prediction. For each sample, SHAP assigns a contribution value to every feature, showing which sensors pushed the prediction toward each fault class.
X_sample = X_test.sample(n=min(500, len(X_test)), random_state=42)
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_sample)
print(f"SHAP values shape: {np.array(shap_values).shape}")
print(f"Classes: {rf.classes_.tolist()}")
fault_classes = [c for c in rf.classes_ if c != "normal"]
fault_idx = list(rf.classes_).index(fault_classes[0])
plt.figure(figsize=(10, 8))
shap.summary_plot(
shap_values[fault_idx],
X_sample,
max_display=15,
show=False
)
plt.title(f"SHAP Feature Importance: {fault_classes[0]}")
plt.tight_layout()
plt.show()
fig, axes = plt.subplots(1, len(fault_classes),
figsize=(5 * len(fault_classes), 6))
for i, fault in enumerate(fault_classes):
idx = list(rf.classes_).index(fault)
mean_shap = np.abs(shap_values[idx]).mean(axis=0)
top_features = pd.Series(mean_shap, index=feature_cols).nlargest(10)
axes[i].barh(range(len(top_features)),
top_features.values, color="#5FCCDB")
axes[i].set_yticks(range(len(top_features)))
axes[i].set_yticklabels(top_features.index, fontsize=8)
axes[i].set_title(fault, fontsize=10)
axes[i].set_xlabel("Mean |SHAP|")
plt.suptitle("Top Features by Fault Type", fontsize=13)
plt.tight_layout()
plt.show()
What SHAP tells you: If the bearing temperature rolling standard deviation is the top SHAP feature for "bearing_fault", it means the model learned that volatile bearing temperatures are the strongest indicator of bearing problems. This aligns with engineering knowledge: a failing bearing exhibits erratic temperature swings before a sustained rise. SHAP converts a black-box model into an interpretable diagnostic tool that plant engineers can trust.
In a real plant, data arrives hour by hour. We simulate a real-time fault detector that processes each hour as it arrives, maintains a sliding window of recent predictions, and raises an alarm when a fault type is consistently predicted.
WINDOW_SIZE = 6
ALARM_THRESHOLD = 4
all_predictions = rf.predict(model_data[feature_cols])
detector_alarms = []
for i in range(WINDOW_SIZE, len(all_predictions)):
window = all_predictions[i - WINDOW_SIZE:i]
for fault_type in fault_classes:
count = sum(1 for p in window if p == fault_type)
if count >= ALARM_THRESHOLD:
detector_alarms.append({
"timestamp": model_data.index[i],
"fault_type": fault_type,
"confidence": count / WINDOW_SIZE
})
alarm_df = pd.DataFrame(detector_alarms)
print(f"Sliding window detector raised {len(alarm_df)} alarms")
if len(alarm_df) > 0:
print("\nAlarm breakdown:")
print(alarm_df["fault_type"].value_counts())
print("\nFirst 10 alarms:")
print(alarm_df.head(10).to_string(index=False))
fig, axes = plt.subplots(2, 1, figsize=(14, 8), sharex=True)
temp_col = [c for c in bfpa_bear][0]
axes[0].plot(model_data.index, model_data[temp_col],
alpha=0.6, color="#5FCCDB", linewidth=0.5)
axes[0].set_ylabel("Bearing Temp (F)")
axes[0].set_title("Sliding Window Fault Detection")
if len(alarm_df) > 0:
for fault in fault_classes:
fault_alarms = alarm_df[alarm_df["fault_type"] == fault]
if len(fault_alarms) > 0:
axes[0].scatter(fault_alarms["timestamp"],
[model_data.loc[t, temp_col] if t in model_data.index else np.nan
for t in fault_alarms["timestamp"]],
s=20, marker="x", label=fault, zorder=5)
axes[0].legend(fontsize=8)
pred_series = pd.Series(all_predictions, index=model_data.index)
is_fault = (pred_series != "normal").astype(int)
fault_rate = is_fault.rolling(WINDOW_SIZE).mean()
axes[1].fill_between(model_data.index, fault_rate,
alpha=0.5, color="#D69E2E")
axes[1].axhline(y=ALARM_THRESHOLD / WINDOW_SIZE, color="red",
linestyle="--", label="Alarm threshold")
axes[1].set_ylabel("Fault Rate (6h window)")
axes[1].set_xlabel("Date")
axes[1].legend()
plt.tight_layout()
plt.show()
- Loaded BFP time-series, alarm, and trip data from the SP&L generation dataset
- Engineered features from rolling statistics, rate-of-change, and cross-sensor ratios
- Labeled fault windows from alarm events with automatic fault-type classification
- Trained a Random Forest multi-class classifier with class-weight balancing
- Evaluated with confusion matrix and per-class precision/recall/F1
- Used SHAP to explain which sensors drive each fault type's prediction
- Built a sliding-window real-time fault detector with configurable alarm thresholds
Ideas to Try Next
- XGBoost comparison: Replace Random Forest with XGBoost and compare accuracy and SHAP explanations
- Hyperparameter tuning: Use
GridSearchCV to optimize tree depth, number of estimators, and leaf size
- Temporal features: Add hour-of-day and day-of-week features to capture operational cycle effects
- BFP B fusion: Include BFP B sensors as additional features for cross-pump fault detection
- LIME comparison: Use
lime for local explanations and compare with SHAP
— Adam · adam@sgridworks.com