Guide 19

Advanced Rotating Equipment Fault Diagnosis

Prefer not to install anything? Click the badge above to open this guide as a runnable notebook in Google Colab. Sign in with any Google account, then use Runtime → Run all to execute every cell, or step through them one at a time.

What You Will Learn

In the beginner guide (Guide 17) you detected anomalies in BFP sensor data. Now you will go further: classifying what type of fault is occurring by fusing multiple sensor signals, training a Random Forest multi-class classifier, and explaining its decisions with SHAP. In this guide you will:

  • Load and filter BFP time-series data to running conditions
  • Engineer features from raw sensor channels (rolling stats, ratios, deltas)
  • Label fault windows using alarm and trip log data
  • Train a Random Forest classifier for multi-class fault diagnosis
  • Evaluate with confusion matrix, precision, recall, and F1
  • Use SHAP to explain which sensors drive each fault prediction
  • Build a sliding-window real-time fault detector

Prerequisites: This guide builds on concepts from Guide 17: BFP Health Monitoring. Complete that guide first if you are new to the BFP dataset.

SP&L Data You Will Use

  • bfp_train_hourly.parquet — 8,784 rows x 88 columns of hourly BFP and system data
  • alarm_log.csv — timestamped alarm events with tag, severity, and description
  • trip_log.csv — unit trip events with cause codes and affected equipment
  • tag_dictionary.csv — maps tag names to engineering descriptions and units

Additional Libraries

pip install shap

Which terminal should I use? On Windows, open Anaconda Prompt from the Start Menu (or PowerShell / Command Prompt if Python is already in your PATH). On macOS, open Terminal from Applications → Utilities. On Linux, open your default terminal. All pip install commands work the same across platforms.

Step 0: Verify Your Setup

Before starting, verify that your environment is configured correctly. Run this cell first to confirm all dependencies are installed and data files are accessible.

# Step 0: Verify your setup
try:
    import pandas as pd
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    import shap

    bfp = pd.read_parquet(
        "sisyphean-power-and-light/generation/timeseries/bfp_train_hourly.parquet"
    )
    alarms = pd.read_csv(
        "sisyphean-power-and-light/generation/events/alarm_log.csv"
    )
    print(f"Setup OK! BFP: {len(bfp):,} rows, Alarms: {len(alarms):,} events.")
except ModuleNotFoundError as e:
    print(f"Missing library: {e}")
    print("Run: pip install -r requirements.txt shap")
except FileNotFoundError:
    print("Data files not found. Run from the repo root:")
    print("  cd Dynamic-Network-Model && jupyter lab")
Setup OK! BFP: 8,784 rows, Alarms: N events.

Working directory: All guides assume your working directory is the repository root (Dynamic-Network-Model/). Start Jupyter Lab from there: cd Dynamic-Network-Model && jupyter lab

Extra dependency: pip install shap

Having trouble? Check our Troubleshooting Guide for solutions to common setup and data loading issues.

Step 1: Load & Filter Data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    classification_report, confusion_matrix, ConfusionMatrixDisplay
)
import shap

# Load all data sources
bfp = pd.read_parquet(
    "sisyphean-power-and-light/generation/timeseries/bfp_train_hourly.parquet"
)
alarms = pd.read_csv(
    "sisyphean-power-and-light/generation/events/alarm_log.csv",
    parse_dates=["timestamp"]
)
trips = pd.read_csv(
    "sisyphean-power-and-light/generation/events/trip_log.csv",
    parse_dates=["timestamp"]
)

# Filter to running conditions (load > 100 MW)
mw_col = [c for c in bfp.columns if "MW" in c or "LOAD" in c][0]
running = bfp[bfp[mw_col] > 100].copy()

print(f"BFP data: {bfp.shape}")
print(f"Running: {running.shape}")
print(f"Alarm events: {len(alarms):,}")
print(f"Trip events: {len(trips):,}")
Step 2: Feature Engineering

Raw sensor values alone are not sufficient for fault classification. We need to engineer features that capture the dynamic behavior of the equipment: rolling statistics, cross-sensor ratios, and rate-of-change indicators.

# Identify sensor groups
bfpa_bear = [c for c in running.columns if "BFPA" in c and "BEAR" in c and "TEMP" in c]
bfpa_vib = [c for c in running.columns if "BFPA" in c and "VIB" in c]

# Rolling statistics (6-hour window)
for col in bfpa_bear + bfpa_vib:
    running[f"{col}_roll6_mean"] = running[col].rolling(6).mean()
    running[f"{col}_roll6_std"] = running[col].rolling(6).std()

# Rate of change (hourly delta)
for col in bfpa_bear + bfpa_vib:
    running[f"{col}_delta"] = running[col].diff()

# Cross-sensor ratios
if len(bfpa_bear) >= 2:
    running["bear_temp_ratio"] = running[bfpa_bear[0]] / running[bfpa_bear[1]].replace(0, np.nan)
if len(bfpa_vib) >= 2:
    running["vib_ratio"] = running[bfpa_vib[0]] / running[bfpa_vib[1]].replace(0, np.nan)

# Max vibration across all BFP A sensors
running["max_vib_a"] = running[bfpa_vib].max(axis=1)

print(f"Features engineered. New shape: {running.shape}")
print(f"New columns: {running.shape[1] - bfp.shape[1]} added")

Why these features? Rolling standard deviation captures volatility that precedes failures. Rate-of-change (delta) detects rapid deterioration. Cross-sensor ratios reveal imbalances: a healthy pump has roughly equal drive-end (DE) and non-drive-end (NDE) bearing temperatures, so a ratio drifting away from 1.0 signals uneven wear or lubrication issues. These engineered features give the classifier far more diagnostic power than raw values alone.
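The ratio idea is easy to see on toy numbers. The sketch below uses hypothetical DE/NDE bearing temperatures (invented for illustration, not from the SP&L dataset) and the same zero-guard pattern as the feature-engineering cell:

```python
import numpy as np
import pandas as pd

# Hypothetical DE/NDE bearing temperatures (deg F); not from the SP&L dataset
readings = {
    "healthy": pd.DataFrame({"de": [150, 151, 150, 152], "nde": [149, 150, 151, 151]}),
    "worn":    pd.DataFrame({"de": [150, 156, 163, 171], "nde": [149, 150, 150, 151]}),
}

means = {}
for name, df in readings.items():
    ratio = df["de"] / df["nde"].replace(0, np.nan)  # same zero-guard as the guide
    means[name] = ratio.mean()
    print(f"{name}: mean DE/NDE ratio = {means[name]:.3f}")
```

The healthy pump's ratio hovers near 1.0 while the worn pump's DE temperature climbs away from its NDE partner, which is exactly the drift the `bear_temp_ratio` feature hands to the classifier.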

Step 3: Label Fault Windows

We need labeled data to train a classifier. The alarm and trip logs provide timestamps of known fault events. We will label each hour in the time series as one of several fault categories based on proximity to alarm events.

# Filter alarms to BFP-related events
bfp_alarms = alarms[alarms["tag"].str.contains("BFP", na=False)].copy()

# Define fault categories based on alarm descriptions
def classify_alarm(desc):
    desc_lower = str(desc).lower()
    if "bearing" in desc_lower or "temp" in desc_lower:
        return "bearing_fault"
    elif "vibration" in desc_lower or "vib" in desc_lower:
        return "vibration_fault"
    elif "seal" in desc_lower or "leak" in desc_lower:
        return "seal_fault"
    elif "cavit" in desc_lower or "suction" in desc_lower:
        return "cavitation"
    else:
        return "other_fault"

bfp_alarms["fault_type"] = bfp_alarms["description"].apply(classify_alarm)

print("Fault distribution:")
print(bfp_alarms["fault_type"].value_counts())
# Label each hour in the time series
# A fault window is +/- 4 hours around each alarm event
FAULT_WINDOW_HOURS = 4
running["fault_label"] = "normal"

for _, alarm in bfp_alarms.iterrows():
    t = alarm["timestamp"]
    mask = (running.index >= t - pd.Timedelta(hours=FAULT_WINDOW_HOURS)) & \
           (running.index <= t + pd.Timedelta(hours=FAULT_WINDOW_HOURS))
    running.loc[mask, "fault_label"] = alarm["fault_type"]

print("\nLabel distribution:")
print(running["fault_label"].value_counts())

Fault window size: A 4-hour window captures the lead-up to and aftermath of each fault event. Plant data typically shows sensor drift in the hours before an alarm fires. Wider windows capture more precursor behavior but risk labeling normal operation as faulty. Start with 4 hours and adjust based on your confusion matrix results.
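To get a feel for how the window size trades precursor coverage against over-labeling, you can sweep it on a small synthetic series before touching the real data. The index and alarm timestamps below are stand-ins invented for this sketch:

```python
import pandas as pd

# Synthetic stand-ins: an hourly index and two alarm timestamps (hypothetical)
idx = pd.date_range("2024-01-01", periods=48, freq="h")
alarm_times = [pd.Timestamp("2024-01-01 10:00"), pd.Timestamp("2024-01-02 03:00")]

counts = {}
for window_h in [2, 4, 8]:
    labels = pd.Series("normal", index=idx)
    for t in alarm_times:
        mask = (idx >= t - pd.Timedelta(hours=window_h)) & \
               (idx <= t + pd.Timedelta(hours=window_h))
        labels[mask] = "fault"
    counts[window_h] = int((labels == "fault").sum())
    print(f"+/-{window_h}h window -> {counts[window_h]} fault-labeled hours "
          f"({counts[window_h] / len(idx):.0%} of the series)")
```

Doubling the window roughly doubles the share of hours labeled as faulty; on real plant data, watch the "normal" row of the confusion matrix as you widen it.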

Step 4: Train the Classifier

# Select feature columns (all engineered features)
feature_cols = [c for c in running.columns
                if "roll6" in c or "delta" in c or "ratio" in c or "max_vib" in c
                or ("BFPA" in c and ("BEAR" in c or "VIB" in c))]

# Drop NaN rows from rolling/diff operations
model_data = running[feature_cols + ["fault_label"]].dropna()
X = model_data[feature_cols]
y = model_data["fault_label"]

print(f"Feature matrix: {X.shape}")
print(f"Classes: {y.nunique()} ({y.value_counts().to_dict()})")

# Train/test split (stratified to preserve class balance)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
print(f"\nTraining: {len(X_train):,} samples")
print(f"Test: {len(X_test):,} samples")
# Train Random Forest with class weight balancing
rf = RandomForestClassifier(
    n_estimators=300,
    max_depth=12,
    min_samples_leaf=5,
    class_weight="balanced",
    random_state=42,
    n_jobs=-1
)
rf.fit(X_train, y_train)

print("Random Forest training complete.")
print(f"Train accuracy: {rf.score(X_train, y_train):.3f}")
print(f"Test accuracy: {rf.score(X_test, y_test):.3f}")
Step 5: Evaluate with Confusion Matrix

# Classification report
y_pred = rf.predict(X_test)
print(classification_report(y_test, y_pred))
# Confusion matrix visualization
fig, ax = plt.subplots(figsize=(8, 7))
cm = confusion_matrix(y_test, y_pred, labels=rf.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=rf.classes_)
disp.plot(cmap="Blues", ax=ax, values_format="d")
ax.set_title("Fault Classification Confusion Matrix")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()

Reading the confusion matrix: Each row is the true fault type, each column is the predicted type. The diagonal shows correct predictions. Off-diagonal entries show misclassifications. For rotating equipment, confusing "vibration_fault" with "bearing_fault" is common because high vibration often causes bearing damage. The classification report shows precision (of all predicted bearing faults, how many were correct?) and recall (of all actual bearing faults, how many did the model find?).
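Precision, recall, and F1 fall straight out of the matrix counts. The sketch below works through a hypothetical two-class slice (the counts are invented for illustration, not taken from the model above):

```python
import numpy as np

# Hypothetical two-class slice of a confusion matrix (rows = true, cols = predicted)
#                 predicted: bearing  vibration
cm = np.array([[40, 10],    # true bearing_fault
               [ 5, 25]])   # true vibration_fault

precision = cm[0, 0] / cm[:, 0].sum()  # 40 correct of 45 predicted bearing faults
recall    = cm[0, 0] / cm[0, :].sum()  # 40 found of 50 actual bearing faults
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
print(f"bearing_fault: precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Column sums drive precision (how trustworthy an alarm is) and row sums drive recall (how many real faults get caught); `classification_report` computes exactly these per class.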

Step 6: SHAP Analysis

SHAP (SHapley Additive exPlanations) provides a mathematically rigorous way to explain why the model made each prediction. For each sample, SHAP assigns a contribution value to every feature, showing which sensors pushed the prediction toward each fault class.

# Compute SHAP values (use a sample for speed)
X_sample = X_test.sample(n=min(500, len(X_test)), random_state=42)
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_sample)

# Older shap versions return a list of per-class arrays; newer versions return
# one 3-D array of shape (samples, features, classes). Normalize to a list
# so the per-class indexing below works either way.
if isinstance(shap_values, np.ndarray) and shap_values.ndim == 3:
    shap_values = [shap_values[:, :, i] for i in range(shap_values.shape[2])]

print(f"SHAP values shape: {np.array(shap_values).shape}")
print(f"Classes: {rf.classes_.tolist()}")
# SHAP summary plot for the most important fault class
# Find the index of the most interesting fault (not "normal")
fault_classes = [c for c in rf.classes_ if c != "normal"]
fault_idx = list(rf.classes_).index(fault_classes[0])

plt.figure(figsize=(10, 8))
shap.summary_plot(
    shap_values[fault_idx], X_sample, max_display=15, show=False
)
plt.title(f"SHAP Feature Importance: {fault_classes[0]}")
plt.tight_layout()
plt.show()
# SHAP bar plot comparing feature importance across all fault types
fig, axes = plt.subplots(1, len(fault_classes), figsize=(5 * len(fault_classes), 6))
axes = np.atleast_1d(axes)  # plt.subplots returns a bare Axes when there is only one class

for i, fault in enumerate(fault_classes):
    idx = list(rf.classes_).index(fault)
    mean_shap = np.abs(shap_values[idx]).mean(axis=0)
    top_features = pd.Series(mean_shap, index=feature_cols).nlargest(10)
    axes[i].barh(range(len(top_features)), top_features.values, color="#5FCCDB")
    axes[i].set_yticks(range(len(top_features)))
    axes[i].set_yticklabels(top_features.index, fontsize=8)
    axes[i].set_title(fault, fontsize=10)
    axes[i].set_xlabel("Mean |SHAP|")

plt.suptitle("Top Features by Fault Type", fontsize=13)
plt.tight_layout()
plt.show()

What SHAP tells you: If the bearing temperature rolling standard deviation is the top SHAP feature for "bearing_fault", it means the model learned that volatile bearing temperatures are the strongest indicator of bearing problems. This aligns with engineering knowledge: a failing bearing exhibits erratic temperature swings before a sustained rise. SHAP converts a black-box model into an interpretable diagnostic tool that plant engineers can trust.

Step 7: Sliding Window Detection

In a real plant, data arrives hour by hour. We simulate a real-time fault detector that processes each hour as it arrives, maintains a sliding window of recent predictions, and raises an alarm when a fault type is consistently predicted.

# Sliding window fault detector
WINDOW_SIZE = 6       # hours of recent predictions to consider
ALARM_THRESHOLD = 4   # if a fault type appears 4+ times in 6 hours, alarm

# Predict all hours
all_predictions = rf.predict(model_data[feature_cols])

# Simulate streaming detection
detector_alarms = []
for i in range(WINDOW_SIZE, len(all_predictions)):
    window = all_predictions[i - WINDOW_SIZE:i]
    for fault_type in fault_classes:
        count = sum(1 for p in window if p == fault_type)
        if count >= ALARM_THRESHOLD:
            detector_alarms.append({
                "timestamp": model_data.index[i],
                "fault_type": fault_type,
                "confidence": count / WINDOW_SIZE
            })

alarm_df = pd.DataFrame(detector_alarms)
print(f"Sliding window detector raised {len(alarm_df)} alarms")
if len(alarm_df) > 0:
    print("\nAlarm breakdown:")
    print(alarm_df["fault_type"].value_counts())
    print("\nFirst 10 alarms:")
    print(alarm_df.head(10).to_string(index=False))
# Visualize detector alarms on timeline
fig, axes = plt.subplots(2, 1, figsize=(14, 8), sharex=True)

# Panel 1: Sensor data
temp_col = bfpa_bear[0]
axes[0].plot(model_data.index, model_data[temp_col],
             alpha=0.6, color="#5FCCDB", linewidth=0.5)
axes[0].set_ylabel("Bearing Temp (F)")
axes[0].set_title("Sliding Window Fault Detection")

# Overlay detector alarms
if len(alarm_df) > 0:
    for fault in fault_classes:
        fault_alarms = alarm_df[alarm_df["fault_type"] == fault]
        if len(fault_alarms) > 0:
            axes[0].scatter(
                fault_alarms["timestamp"],
                [model_data.loc[t, temp_col] if t in model_data.index else np.nan
                 for t in fault_alarms["timestamp"]],
                s=20, marker="x", label=fault, zorder=5
            )
    axes[0].legend(fontsize=8)

# Panel 2: Prediction confidence over time
pred_series = pd.Series(all_predictions, index=model_data.index)
is_fault = (pred_series != "normal").astype(int)
fault_rate = is_fault.rolling(WINDOW_SIZE).mean()
axes[1].fill_between(model_data.index, fault_rate, alpha=0.5, color="#D69E2E")
axes[1].axhline(y=ALARM_THRESHOLD / WINDOW_SIZE, color="red", linestyle="--",
                label="Alarm threshold")
axes[1].set_ylabel("Fault Rate (6h window)")
axes[1].set_xlabel("Date")
axes[1].legend()
plt.tight_layout()
plt.show()

What You Built and Next Steps

  1. Loaded BFP time-series, alarm, and trip data from the SP&L generation dataset
  2. Engineered features from rolling statistics, rate-of-change, and cross-sensor ratios
  3. Labeled fault windows from alarm events with automatic fault-type classification
  4. Trained a Random Forest multi-class classifier with class-weight balancing
  5. Evaluated with confusion matrix and per-class precision/recall/F1
  6. Used SHAP to explain which sensors drive each fault type's prediction
  7. Built a sliding-window real-time fault detector with configurable alarm thresholds

Ideas to Try Next

  • XGBoost comparison: Replace Random Forest with XGBoost and compare accuracy and SHAP explanations
  • Hyperparameter tuning: Use GridSearchCV to optimize tree depth, number of estimators, and leaf size
  • Temporal features: Add hour-of-day and day-of-week features to capture operational cycle effects
  • BFP B fusion: Include BFP B sensors as additional features for cross-pump fault detection
  • LIME comparison: Use lime for local explanations and compare with SHAP
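The "Temporal features" idea above can be sketched quickly. This assumes an hourly DatetimeIndex like `running.index`; the frame below is a synthetic stand-in so the snippet runs on its own:

```python
import numpy as np
import pandas as pd

# Synthetic hourly frame standing in for the BFP data
idx = pd.date_range("2024-01-01", periods=72, freq="h")
df = pd.DataFrame({"load_mw": np.linspace(100, 300, 72)}, index=idx)

# Raw calendar features
df["hour"] = df.index.hour
df["dayofweek"] = df.index.dayofweek

# Cyclical encoding keeps hour 23 adjacent to hour 0, which raw integers do not
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)
print(df[["hour", "dayofweek", "hour_sin", "hour_cos"]].head(3))
```

Append these columns to `feature_cols` and retrain; if fault alarms cluster around load ramps at shift changes, the classifier can now pick that pattern up.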

Key Terms Glossary

  • Random Forest — an ensemble of decision trees that votes on the final prediction; robust to overfitting and handles class imbalance
  • Multi-class classification — predicting one of several discrete categories (not just binary yes/no)
  • SHAP (SHapley Additive exPlanations) — a game-theory-based method for explaining individual model predictions by assigning contribution values to each feature
  • Confusion matrix — a table showing how many samples of each true class were predicted as each class; the diagonal represents correct predictions
  • Feature engineering — creating new input variables from raw data to capture patterns that improve model performance
  • Sliding window — a fixed-size buffer of recent observations that advances as new data arrives; used for real-time monitoring
  • Class weight balancing — adjusting the loss function to penalize misclassification of rare classes more heavily
  • Sensor fusion — combining data from multiple sensors to achieve more reliable and informative measurements than any single sensor provides

Ready to Level Up?

In the next guide, you'll build a physics-informed digital twin that compares actual BFP performance against OEM pump curves.

Go to BFP Digital Twin & Performance Tracking →