What You Will Learn
AMI (Advanced Metering Infrastructure) meters generate millions of data points every day. Hidden in that data are anomalies: unusual voltage readings that signal equipment problems, meter tampering, or phase imbalances. In this guide you will:
- Load 15-minute AMI voltage data from the SP&L dataset
- Explore what "normal" voltage patterns look like
- Train an Isolation Forest model to detect anomalies without labeled data
- Build a simple autoencoder in PyTorch that learns to reconstruct normal patterns
- Flag anomalies based on reconstruction error and evaluate both approaches
What is unsupervised anomaly detection? In Guides 01 and 04, we had labels—we knew which events were outages or failures. But anomaly detection often works without labels. The model learns what "normal" looks like and flags anything that deviates significantly. This is powerful because you don't need to have seen every type of anomaly before—the model catches anything unusual.
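To make that idea concrete before touching the real data, here is a tiny self-contained sketch on made-up numbers (not the SP&L dataset): estimate what "normal" looks like from the data itself, then flag anything that strays too far. The Isolation Forest and autoencoder below are much more capable versions of this same pattern.
import numpy as np
rng = np.random.default_rng(0)
# Toy data: 200 "normal" readings near 120 V, plus two injected oddballs
readings = np.concatenate([rng.normal(120.0, 0.3, 200), [131.5, 108.7]])
# "Learn normal" = estimate the typical value and spread from the data itself
mean, std = readings.mean(), readings.std()
# Flag anything far from typical -- no labels required
print(readings[np.abs(readings - mean) > 3 * std])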
SP&L Data You Will Use
- timeseries/ami_15min_sample.parquet — 15-minute voltage and consumption readings from 2,400 service points (includes realistic noise, gaps, and meter errors)
- weather/hourly_observations.csv — temperature and weather conditions for context
Additional Libraries
pip install torch pyarrow
torch (PyTorch) is used for the autoencoder in the second half. You can complete the Isolation Forest section without it.
Which terminal should I use? On Windows, open Anaconda Prompt from the Start Menu (or PowerShell / Command Prompt if Python is already in your PATH). On macOS, open Terminal from Applications → Utilities. On Linux, open your default terminal. All pip install commands work the same across platforms.
PyTorch on Windows: The command pip install torch installs the CPU-only version, which is all you need for this guide. If you have an NVIDIA GPU and want GPU acceleration, visit pytorch.org/get-started for the platform-specific install command with CUDA support. The CPU version works identically on Windows, macOS, and Linux.
Part A: Isolation Forest
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
DATA_DIR = "sisyphean-power-and-light/"
ami = pd.read_parquet(DATA_DIR + "timeseries/ami_15min_sample.parquet")
print(f"AMI records: {len(ami):,}")
print(f"Columns: {list(ami.columns)}")
print(f"Meters: {ami['meter_id'].nunique()}")
print(ami.head())
meter = ami[ami["meter_id"] == ami["meter_id"].unique()[0]].copy()
meter["timestamp"] = pd.to_datetime(meter["timestamp"])
meter = meter.sort_values("timestamp")
one_month = meter[(meter["timestamp"] >= "2024-06-01") &
(meter["timestamp"] < "2024-07-01")]
fig, ax = plt.subplots(figsize=(14, 4))
ax.plot(one_month["timestamp"], one_month["voltage"], linewidth=0.5, color="#2D6A7A")
ax.axhline(y=126, color="red", linestyle="--", alpha=0.5, label="ANSI upper (126V)")
ax.axhline(y=114, color="red", linestyle="--", alpha=0.5, label="ANSI lower (114V)")
ax.set_title("AMI Voltage Readings — June 2024")
ax.set_ylabel("Voltage (V)")
ax.legend()
plt.tight_layout()
plt.show()
You should see voltage oscillating in a daily pattern. Occasional spikes or dips are the anomalies we want to detect.
ami["timestamp"] = pd.to_datetime(ami["timestamp"])
ami["hour"] = ami["timestamp"].dt.hour
hourly = ami.groupby(["meter_id", ami["timestamp"].dt.floor("h")]).agg(
voltage_mean=("voltage", "mean"),
voltage_std=("voltage", "std"),
voltage_min=("voltage", "min"),
voltage_max=("voltage", "max"),
consumption_kwh=("consumption_kwh", "sum"),
).reset_index()
hourly["voltage_range"] = hourly["voltage_max"] - hourly["voltage_min"]
hourly = hourly.fillna(0)
print(f"Hourly feature rows: {len(hourly):,}")
print(hourly.describe())
feature_cols = ["voltage_mean", "voltage_std", "voltage_range",
"voltage_min", "voltage_max", "consumption_kwh"]
X = hourly[feature_cols]
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
iso_forest = IsolationForest(
n_estimators=200,
contamination=0.01,
random_state=42
)
iso_forest.fit(X_scaled)
print("Isolation Forest training complete.")
hourly["anomaly"] = iso_forest.predict(X_scaled)
hourly["anomaly_score"] = iso_forest.decision_function(X_scaled)
n_anomalies = (hourly["anomaly"] == -1).sum()
print(f"Anomalies detected: {n_anomalies} ({n_anomalies/len(hourly)*100:.2f}%)")
How does Isolation Forest work? It builds random decision trees that try to isolate each data point. Normal points are similar to many others and take many splits to isolate. Anomalies are rare and different, so they get isolated quickly with fewer splits. The "anomaly score" reflects how easy a point was to isolate.
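In scikit-learn's implementation, predict() returns -1 for anomalies and 1 for normal points, and lower decision_function values mean "easier to isolate," i.e. more anomalous. Here is a quick standalone sketch on toy data (separate from the SP&L pipeline, values made up) that makes this visible:
from sklearn.ensemble import IsolationForest  # already imported above
import numpy as np
rng = np.random.default_rng(42)
# 300 points in a tight 2-D cluster, plus one point far away
X_toy = np.vstack([rng.normal(0, 1, size=(300, 2)), [[8.0, 8.0]]])
toy_forest = IsolationForest(n_estimators=100, random_state=42).fit(X_toy)
scores = toy_forest.decision_function(X_toy)
# The isolated point takes far fewer random splits to separate,
# so its score is much lower than the cluster's
print(f"Median score of cluster points: {np.median(scores[:-1]):.3f}")
print(f"Score of the far-away point: {scores[-1]:.3f}")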
anomalies = hourly[hourly["anomaly"] == -1]
normal = hourly[hourly["anomaly"] == 1]
fig, ax = plt.subplots(figsize=(14, 5))
ax.scatter(normal["voltage_mean"], normal["voltage_std"],
c="#5FCCDB", s=5, alpha=0.3, label="Normal")
ax.scatter(anomalies["voltage_mean"], anomalies["voltage_std"],
c="red", s=30, marker="x", label="Anomaly")
ax.set_xlabel("Mean Voltage (V)")
ax.set_ylabel("Voltage Std Dev")
ax.set_title("Isolation Forest: Anomaly Detection in AMI Voltage Data")
ax.legend()
plt.tight_layout()
plt.show()
Part B: Autoencoder (PyTorch)
An autoencoder is a neural network that learns to compress data into a small representation and then reconstruct it. If the network is trained on normal data, it will reconstruct normal patterns well but struggle with anomalies—producing high reconstruction error.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
class VoltageAutoencoder(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 16),
            nn.ReLU(),
            nn.Linear(16, 8),
            nn.ReLU(),
            nn.Linear(8, 3),
        )
        self.decoder = nn.Sequential(
            nn.Linear(3, 8),
            nn.ReLU(),
            nn.Linear(8, 16),
            nn.ReLU(),
            nn.Linear(16, input_dim),
        )

    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded
input_dim = len(feature_cols)
model = VoltageAutoencoder(input_dim)
print(model)
What is a bottleneck? The bottleneck layer (size 3) forces the network to compress the 6 input features into just 3 numbers, so it has to learn the most important patterns in the data rather than memorizing every value. When an anomaly comes through, it doesn't fit the learned compression, and the reconstruction will be poor.
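You can see the compression directly by running a few rows through just the encoder half of the (still untrained) model. The input below is random placeholder data; only the shapes matter:
# Each row of 6 features is squeezed down to 3 numbers at the bottleneck
sample = torch.randn(4, input_dim)
with torch.no_grad():
    print(model.encoder(sample).shape)  # torch.Size([4, 3])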
# Train the autoencoder only on rows Isolation Forest labeled normal,
# re-fitting the scaler on those rows so "normal" defines the scale
normal_data = hourly[hourly["anomaly"] == 1][feature_cols]
normal_scaled = scaler.fit_transform(normal_data)
split = int(len(normal_scaled) * 0.8)
train_data = torch.FloatTensor(normal_scaled[:split])
val_data = torch.FloatTensor(normal_scaled[split:])
train_loader = DataLoader(TensorDataset(train_data, train_data),
batch_size=64, shuffle=True)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
losses = []
for epoch in range(50):
    model.train()
    epoch_loss = 0
    for batch_x, batch_y in train_loader:
        output = model(batch_x)
        loss = criterion(output, batch_y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    avg_loss = epoch_loss / len(train_loader)
    losses.append(avg_loss)
    if (epoch + 1) % 10 == 0:
        print(f"Epoch {epoch+1:>3}/50 Loss: {avg_loss:.6f}")
plt.figure(figsize=(8, 4))
plt.plot(losses, color="#5FCCDB")
plt.title("Autoencoder Training Loss")
plt.xlabel("Epoch")
plt.ylabel("MSE Loss")
plt.tight_layout()
plt.show()
model.eval()
all_scaled = scaler.transform(hourly[feature_cols])
all_tensor = torch.FloatTensor(all_scaled)
with torch.no_grad():
    reconstructed = model(all_tensor)
    recon_error = torch.mean((all_tensor - reconstructed) ** 2, dim=1)
hourly["recon_error"] = recon_error.numpy()
threshold = hourly["recon_error"].quantile(0.99)
hourly["ae_anomaly"] = (hourly["recon_error"] > threshold).astype(int)
print(f"Reconstruction error threshold: {threshold:.4f}")
print(f"Autoencoder anomalies: {hourly['ae_anomaly'].sum()}")
hourly["iso_anomaly"] = (hourly["anomaly"] == -1).astype(int)
both = (hourly["iso_anomaly"] & hourly["ae_anomaly"]).sum()
iso_only = (hourly["iso_anomaly"] & ~hourly["ae_anomaly"]).sum()
ae_only = (~hourly["iso_anomaly"] & hourly["ae_anomaly"]).sum()
print(f"Flagged by both methods: {both}")
print(f"Isolation Forest only: {iso_only}")
print(f"Autoencoder only: {ae_only}")
fig, ax = plt.subplots(figsize=(10, 5))
ax.hist(hourly["recon_error"], bins=100, color="#5FCCDB", edgecolor="white")
ax.axvline(x=threshold, color="red", linestyle="--",
label=f"Threshold ({threshold:.4f})")
ax.set_xlabel("Reconstruction Error")
ax.set_ylabel("Frequency")
ax.set_title("Autoencoder Reconstruction Error Distribution")
ax.set_yscale("log")
ax.legend()
plt.tight_layout()
plt.show()
high_confidence = hourly[(hourly["iso_anomaly"] == 1) & (hourly["ae_anomaly"] == 1)]
high_confidence = high_confidence.sort_values("recon_error", ascending=False)
print("Top 10 highest-confidence anomalies:\n")
print(high_confidence[["meter_id", "timestamp", "voltage_mean",
"voltage_std", "voltage_range", "recon_error"]].head(10).to_string(index=False))
Anomalies flagged by both methods are the most trustworthy. Look for patterns: are they clustered on specific meters (possible equipment issue), specific times (possible load event), or specific voltages (possible tap changer malfunction)?
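A quick way to check for that kind of clustering, using only columns already in the table:
# Do high-confidence anomalies concentrate on a few meters or a few hours of the day?
print(high_confidence["meter_id"].value_counts().head(10))
print(high_confidence["timestamp"].dt.hour.value_counts().sort_index())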
What You Did
- Loaded and explored 15-minute AMI voltage data from 2,400 meters
- Engineered hourly statistical features from raw voltage readings
- Trained an Isolation Forest for unsupervised anomaly detection
- Built and trained a PyTorch autoencoder on normal voltage patterns
- Flagged anomalies using reconstruction error thresholds
- Compared both methods and investigated high-confidence detections
Ideas to Try Next
- Correlate with outages: Check whether detected anomalies preceded actual outage events in outages/outage_events.csv
- Meter tampering detection: Look for meters with sudden consumption drops but normal voltage (possible bypass); a starter sketch follows this list
- Phase imbalance detection: Compare voltage patterns across phases to detect phase-level issues
- Real-time sliding window: Implement a streaming version that processes data in 1-hour windows
- Variational autoencoder: Replace the basic autoencoder with a VAE for probabilistic anomaly scoring
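As a rough starting point for the meter tampering idea, the sketch below flags meters whose average consumption over the most recent week fell well below their earlier baseline while their voltage stayed inside the ANSI band. The 7-day window and the 50% drop threshold are arbitrary choices for illustration, not values from the dataset documentation:
# Compare each meter's recent consumption to its earlier baseline
cutoff = hourly["timestamp"].max() - pd.Timedelta(days=7)
recent = hourly[hourly["timestamp"] >= cutoff].groupby("meter_id")["consumption_kwh"].mean()
baseline = hourly[hourly["timestamp"] < cutoff].groupby("meter_id")["consumption_kwh"].mean()
voltage_ok = hourly.groupby("meter_id")["voltage_mean"].mean().between(114, 126)
drop_ratio = recent / baseline
# Flag meters now using less than half their baseline but still seeing normal voltage
suspects = drop_ratio[(drop_ratio < 0.5) & voltage_ok].sort_values()
print(suspects.head(10))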
Key Terms Glossary
- Anomaly detection — identifying data points that deviate significantly from normal patterns
- Isolation Forest — an algorithm that detects anomalies by how easily data points can be isolated with random splits
- Autoencoder — a neural network that compresses data and reconstructs it; anomalies have high reconstruction error
- Reconstruction error — the difference between the input and the autoencoder's output; higher = more anomalous
- Unsupervised learning — learning from data without labels; the model discovers structure on its own
- AMI — Advanced Metering Infrastructure; smart meters that report voltage and consumption at 15-minute intervals
- Bottleneck layer — the smallest hidden layer in an autoencoder, forcing information compression
Ready to Level Up?
In the advanced guide, you'll build a Variational Autoencoder for probabilistic anomaly scoring and implement real-time streaming detection.
Go to Advanced Anomaly Detection →