Prerequisite: Complete Guide 02: Load Forecasting first. This guide replaces the single-step Gradient Boosting model with an LSTM neural network that forecasts 24 hours ahead in a single pass.
What You Will Learn
In Guide 02, you built a Gradient Boosting model that predicts the next hour's load from hand-crafted features. That approach works well for single-step forecasts, but utilities need to plan ahead—scheduling generation, staging crews, and managing battery storage requires knowing the full load curve for the next day. In this guide you will:
- Build sliding-window sequences from 15-minute feeder load profiles spanning representative seasonal weeks
- Construct an LSTM neural network in PyTorch that learns temporal patterns directly from raw sequences
- Forecast 24 hours ahead in a single forward pass (multi-step forecasting)
- Use a proper chronological train/validation/test split—no data leakage
- Incorporate weather as exogenous features alongside the load sequence
- Compare LSTM performance against the Gradient Boosting baseline from Guide 02
Why LSTMs for load forecasting? Long Short-Term Memory networks are a type of recurrent neural network designed to learn long-range dependencies in sequential data. Unlike Gradient Boosting, which requires you to manually engineer lag features, an LSTM can learn which parts of the historical sequence matter most. It processes the input one timestep at a time, maintaining a hidden state that acts as a "memory" of what it has seen so far.
SP&L Data You Will Use
- load_profiles.csv (load_load_profiles()) — feeder-level 15-minute load profiles covering representative seasonal weeks for all 65 feeders
- weather_data.csv (load_weather_data()) — 52,608 hourly weather records with temperature, humidity, wind speed, and solar irradiance
- solar_profiles.csv (load_solar_profiles()) — representative hourly solar generation patterns by month
Additional Libraries
Extra dependency: pip install torch
Verify Your Setup
Before starting, verify that your environment is configured correctly. Run this cell first to confirm all dependencies are installed and data files are accessible.
Working directory: All guides assume your working directory is the repository root (Dynamic-Network-Model/). Start Jupyter Lab from there: cd Dynamic-Network-Model && jupyter lab
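Below is a minimal sketch of such a verification cell. The import path for the SP&L loaders (dnm.data) is a placeholder assumption; point it at wherever the repository actually exposes load_load_profiles() and load_weather_data().

```python
# Minimal environment check: core libraries plus the extra torch dependency.
import importlib.util

for pkg in ["pandas", "numpy", "sklearn", "matplotlib", "torch"]:
    assert importlib.util.find_spec(pkg) is not None, f"Missing dependency: {pkg}"

import torch

print(f"PyTorch {torch.__version__} | CUDA available: {torch.cuda.is_available()}")

# Hypothetical import path for the SP&L data loaders; adjust to your repo layout.
from dnm.data import load_load_profiles, load_weather_data

profiles = load_load_profiles()
weather = load_weather_data()
print(f"Load profiles: {profiles.shape} | Weather: {weather.shape}")
```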
Having trouble? Check our Troubleshooting Guide for solutions to common setup and data loading issues.
Load and Prepare the Data
We start with the same SP&L load profile and weather data from Guide 02, but this time we will keep the raw 15-minute series intact rather than flattening it into tabular features.
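A sketch of the loading step, assuming the hypothetical dnm.data import path from the verification cell. Focusing on a single feeder and the merge/resampling choices below are illustrative; the guide's original cell may differ in detail.

```python
import pandas as pd

from dnm.data import load_load_profiles, load_weather_data  # hypothetical path

profiles = load_load_profiles()
weather = load_weather_data()
profiles["timestamp"] = pd.to_datetime(profiles["timestamp"])
weather["timestamp"] = pd.to_datetime(weather["timestamp"])

# Pick one feeder and attach the most recent hourly weather reading to each
# 15-minute load row (merge_asof matches on the nearest earlier timestamp).
feeder = profiles[profiles["feeder_id"] == profiles["feeder_id"].iloc[0]]
df = pd.merge_asof(
    feeder.sort_values("timestamp"),
    weather.sort_values("timestamp"),
    on="timestamp",
)

# Aggregate to hourly resolution for the 168h-in / 24h-out windows used below.
hourly = (
    df.set_index("timestamp")
    .resample("1h")
    .mean(numeric_only=True)
    .dropna()
    .reset_index()
)

print("Date range:", df["timestamp"].min(), "to", df["timestamp"].max())
print("Columns:", list(df.columns))
```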
Date range: 2024-01-15 00:00:00 to 2024-10-21 23:45:00
Columns: ['feeder_id', 'substation_id', 'timestamp', 'load_mw', 'load_mvar', 'voltage_pu', 'power_factor', 'temperature', 'humidity', 'wind_speed']
Explore Temporal Patterns
Before building any model, it is critical to understand the patterns your LSTM needs to capture. Electricity load exhibits strong daily, weekly, and seasonal cycles that a well-trained LSTM should learn to reproduce.
You should observe three key patterns. First, a strong daily cycle with load peaking in the afternoon (hours 14–18) and dipping overnight. Second, weekday load is noticeably higher than weekend load due to commercial and industrial demand. Third, a U-shaped relationship between temperature and load—both extreme cold and extreme heat drive up demand from heating and cooling systems respectively. The LSTM will need to learn all three of these patterns from the raw sequence.
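A quick way to check all three patterns, assuming the hourly frame built above (plot styling is minimal and illustrative):

```python
import matplotlib.pyplot as plt

hourly["hour"] = hourly["timestamp"].dt.hour
hourly["is_weekend"] = hourly["timestamp"].dt.dayofweek >= 5

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
hourly.groupby("hour")["load_mw"].mean().plot(ax=axes[0], title="Mean load by hour of day")
hourly.groupby("is_weekend")["load_mw"].mean().plot(kind="bar", ax=axes[1], title="Weekday vs. weekend")
axes[2].scatter(hourly["temperature"], hourly["load_mw"], s=4, alpha=0.3)
axes[2].set(title="Load vs. temperature", xlabel="Temperature", ylabel="Load (MW)")
plt.tight_layout()
plt.show()
```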
Create Sliding Window Sequences
LSTMs consume fixed-length sequences. We use a sliding window: given the past 168 hours (1 week) of load data, predict the next 24 hours of load. Each window slides forward by one hour to create overlapping training samples.
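A minimal windowing sketch. Note that the scaler is fit on the first 70% of the series only (the training window), as discussed in the pitfalls section at the end; the helper name make_windows is illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

INPUT_HOURS, OUTPUT_HOURS = 168, 24
series = hourly["load_mw"].to_numpy().reshape(-1, 1)

# Fit the scaler on the first 70% of the series (the training window),
# then transform everything; see the pitfalls section at the end.
split = int(len(series) * 0.70)
scaler = StandardScaler().fit(series[:split])
scaled = scaler.transform(series).ravel()
print(f"Load std: {scaler.scale_[0]:.3f} MW")

def make_windows(values, n_in=INPUT_HOURS, n_out=OUTPUT_HOURS):
    """Slide a (n_in -> n_out) window forward one hour at a time."""
    X, y = [], []
    for i in range(len(values) - n_in - n_out + 1):
        X.append(values[i : i + n_in])
        y.append(values[i + n_in : i + n_in + n_out])
    return np.array(X), np.array(y)

X, y = make_windows(scaled)
print(f"Total sequences: {len(X):,}")
print(f"Input shape: {X.shape} (samples, {INPUT_HOURS} timesteps)")
print(f"Output shape: {y.shape} (samples, {OUTPUT_HOURS} timesteps)")
```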
Load std: 1.287 MW
Total sequences: 2,497
Input shape: (2497, 168) (samples, 168 timesteps)
Output shape: (2497, 24) (samples, 24 timesteps)
Why 168 hours? One week of history captures a full weekly cycle—the model sees both weekday and weekend patterns in every training sample. Shorter windows (e.g., 24 hours) miss the weekly pattern, while longer windows (e.g., 720 hours / 1 month) add computational cost without proportional benefit. 168 hours is a common choice in load forecasting literature.
Always normalize your data before feeding it to a neural network. LSTMs use sigmoid and tanh activation functions internally, which saturate (produce near-zero gradients) for very large or very small inputs. StandardScaler transforms the data to zero mean and unit variance, keeping values in a range where these activations work well.
Chronological Train/Validation/Test Split
Time-series data must be split chronologically. We use the first 70% of sequences for training, the next 15% for validation (hyperparameter tuning), and the final 15% for testing. No shuffling—the model never sees the future during training.
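Because the windows are already in chronological order, the split is simple array slicing:

```python
# Chronological split: no shuffling across the boundaries.
n = len(X)
i_train, i_val = int(n * 0.70), int(n * 0.85)

X_train, y_train = X[:i_train], y[:i_train]
X_val, y_val = X[i_train:i_val], y[i_train:i_val]
X_test, y_test = X[i_val:], y[i_val:]

print(f"Train: {len(X_train):,} sequences (first 70%)")
print(f"Val: {len(X_val)} sequences (next 15%)")
print(f"Test: {len(X_test)} sequences (final 15%)")
```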
Train: 1,747 sequences (first 70%)
Val: 375 sequences (next 15%)
Test: 375 sequences (final 15%)
Never shuffle time-series data. Random shuffling allows the model to "see" future data during training, producing unrealistically optimistic results. In production, your model will always predict the future from past data only. A chronological split simulates this honestly.
Build the LSTM Model in PyTorch
Now we define the LSTM architecture. The model processes the 168-hour input sequence one timestep at a time, building up a hidden state that summarizes the history. The final hidden state is then passed through fully connected layers to produce the 24-hour forecast.
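The class below reproduces the architecture printed further down: a two-layer LSTM encoder whose last hidden state feeds a small fully connected head that emits all 24 forecast values at once.

```python
import torch
import torch.nn as nn

class LoadForecaster(nn.Module):
    """Two-layer LSTM encoder followed by a fully connected forecast head."""

    def __init__(self, n_features=1, hidden_size=128, horizon=24):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size=n_features,
            hidden_size=hidden_size,
            num_layers=2,
            batch_first=True,  # inputs arrive as (batch, timesteps, features)
            dropout=0.2,
        )
        self.fc = nn.Sequential(
            nn.Linear(hidden_size, 64),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(64, horizon),
        )

    def forward(self, x):
        out, _ = self.lstm(x)          # out: (batch, timesteps, hidden_size)
        return self.fc(out[:, -1, :])  # last hidden state -> 24-hour forecast

model = LoadForecaster()
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")
print(model)
```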
Why shuffle=True for the train loader? We already split chronologically, so the train/val/test boundaries are respected. Within the training set, shuffling the order in which the model sees batches helps with gradient descent convergence. This is different from shuffling individual timesteps within a sequence—we never do that.
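A sketch of the loaders; batch_size=32 is an assumption, not a value fixed by the guide.

```python
from torch.utils.data import DataLoader, TensorDataset

def to_loader(X, y, shuffle):
    # Add a trailing feature dimension: (samples, 168) -> (samples, 168, 1).
    ds = TensorDataset(
        torch.tensor(X, dtype=torch.float32).unsqueeze(-1),
        torch.tensor(y, dtype=torch.float32),
    )
    return DataLoader(ds, batch_size=32, shuffle=shuffle)

train_loader = to_loader(X_train, y_train, shuffle=True)  # shuffles whole sequences
val_loader = to_loader(X_val, y_val, shuffle=False)
test_loader = to_loader(X_test, y_test, shuffle=False)
```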
Model parameters: 208,984
LoadForecaster(
(lstm): LSTM(1, 128, num_layers=2, batch_first=True, dropout=0.2)
(fc): Sequential(
(0): Linear(in_features=128, out_features=64, bias=True)
(1): ReLU()
(2): Dropout(p=0.2, inplace=False)
(3): Linear(in_features=64, out_features=24, bias=True)
)
)
Inside the LSTM cell: At each timestep, the LSTM receives the current input and the previous hidden state. It uses three "gates" to decide what to do: the forget gate decides what to discard from memory, the input gate decides what new information to store, and the output gate decides what to pass forward. This gating mechanism is what allows LSTMs to remember patterns from hundreds of timesteps ago—like the load from the same hour last week.
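For reference, the standard LSTM cell equations, where $\sigma$ is the logistic sigmoid, $\odot$ is elementwise multiplication, $x_t$ is the current input, and $h_{t-1}$ is the previous hidden state:

$$
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{forget gate} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{input gate} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{output gate} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{candidate memory} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{cell state update} \\
h_t &= o_t \odot \tanh(c_t) && \text{new hidden state}
\end{aligned}
$$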
Train the LSTM
We train using MSE loss with the Adam optimizer and an exponential learning rate scheduler. Early stopping on validation loss prevents overfitting.
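A condensed training loop consistent with the log below. The learning rate (1e-3), decay factor (gamma=0.95), and patience of 5 epochs are assumptions chosen to match the early-stopping behavior shown, not values fixed by the guide.

```python
import copy

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

best_val, best_state, patience, bad_epochs = float("inf"), None, 5, 0

for epoch in range(1, 31):
    model.train()
    train_loss = 0.0
    for xb, yb in train_loader:
        xb, yb = xb.to(device), yb.to(device)
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        # Cap gradient magnitude at 1.0 to avoid exploding gradients.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        train_loss += loss.item() * len(xb)
    scheduler.step()

    model.eval()
    with torch.no_grad():
        val_loss = sum(
            criterion(model(xb.to(device)), yb.to(device)).item() * len(xb)
            for xb, yb in val_loader
        ) / len(val_loader.dataset)
    train_loss /= len(train_loader.dataset)
    print(f"Epoch {epoch}/30 | Train Loss: {train_loss:.6f} | Val Loss: {val_loss:.6f}")

    # Early stopping: keep the best weights, stop after `patience` bad epochs.
    if val_loss < best_val:
        best_val, best_state, bad_epochs = val_loss, copy.deepcopy(model.state_dict()), 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            print(f"Early stopping at epoch {epoch}")
            break

model.load_state_dict(best_state)
print(f"Best validation loss: {best_val:.6f}")
```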
Epoch 2/30 | Train Loss: 0.078612 | Val Loss: 0.067841
Epoch 3/30 | Train Loss: 0.059304 | Val Loss: 0.054219
...
Epoch 18/30 | Train Loss: 0.021847 | Val Loss: 0.028163
Epoch 19/30 | Train Loss: 0.020592 | Val Loss: 0.028701
Epoch 20/30 | Train Loss: 0.019483 | Val Loss: 0.029115
...
Epoch 23/30 | Train Loss: 0.017921 | Val Loss: 0.030284
Early stopping at epoch 23
Best validation loss: 0.028163
Gradient clipping: The line clip_grad_norm_(model.parameters(), max_norm=1.0) prevents the "exploding gradient" problem. LSTMs process long sequences, and during backpropagation the gradients can grow exponentially as they flow through many timesteps. Clipping caps the gradient magnitude at 1.0, keeping training stable.
Multi-Step Evaluation: MAPE, RMSE, and Visualization
Now we evaluate the trained LSTM on the held-out test set (final 15%). Since the model outputs 24 values at once, we can assess accuracy at each forecast horizon (1 hour ahead, 2 hours ahead, ... 24 hours ahead).
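A sketch of the evaluation, inverting the scaler by hand so errors are reported in MW:

```python
# Predict on the held-out test set.
model.eval()
with torch.no_grad():
    preds = torch.cat([model(xb.to(device)).cpu() for xb, _ in test_loader]).numpy()

# Undo the StandardScaler manually (it was fit on a single column).
preds_mw = preds * scaler.scale_[0] + scaler.mean_[0]      # (n_test, 24)
actual_mw = y_test * scaler.scale_[0] + scaler.mean_[0]

abs_err = np.abs(preds_mw - actual_mw)
mae = abs_err.mean()
rmse = np.sqrt(((preds_mw - actual_mw) ** 2).mean())
mape = (abs_err / np.abs(actual_mw)).mean() * 100
print(f"MAE: {mae:.4f} MW")
print(f"RMSE: {rmse:.4f} MW")
print(f"MAPE: {mape:.2f}%")

# Per-horizon error: column h holds the (h+1)-hour-ahead predictions.
horizon_mae = abs_err.mean(axis=0)
print(f"12-hour ahead MAE: {horizon_mae[11]:.4f} MW")
print(f"24-hour ahead MAE: {horizon_mae[23]:.4f} MW")
```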
MAE: 0.1847 MW
RMSE: 0.2531 MW
MAPE: 4.31%
12-hour ahead MAE: 0.1894 MW
24-hour ahead MAE: 0.2317 MW
As expected, the model is most accurate at short horizons, and error grows with forecast distance. Averaged across all 24 horizons, the MAE (0.1847 MW) comes in below the Gradient Boosting baseline from Guide 02, and even the 24-hour-ahead error remains close to the baseline's single-step error. Let's visualize a sample forecast.
Add Weather Features as Exogenous Inputs
So far our LSTM only sees historical load. But temperature is the single biggest driver of demand. Let's add weather data as additional input features alongside load. This transforms the LSTM from a univariate to a multivariate model.
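A sketch of the multivariate windowing, assuming the three weather columns merged earlier (temperature, humidity, wind_speed) alongside load; the helper name is illustrative.

```python
# Stack load with the three merged weather channels: (samples, 168, 4).
FEATURES = ["load_mw", "temperature", "humidity", "wind_speed"]
raw = hourly[FEATURES].to_numpy()

# One scaler over all four channels, again fit on the training window only.
feat_scaler = StandardScaler().fit(raw[:split])
scaled_feats = feat_scaler.transform(raw)

def make_multivariate_windows(values, target_col=0, n_in=168, n_out=24):
    X, y = [], []
    for i in range(len(values) - n_in - n_out + 1):
        X.append(values[i : i + n_in, :])                          # all features in
        y.append(values[i + n_in : i + n_in + n_out, target_col])  # load only out
    return np.array(X), np.array(y)

X_mv, y_mv = make_multivariate_windows(scaled_feats)
print(f"Input shape: {X_mv.shape} (samples, 168 timesteps, 4 features)")
print(f"Output shape: {y_mv.shape}")

# Same architecture, wider input; the DataLoaders skip the unsqueeze step
# because these windows are already 3-dimensional.
mv_model = LoadForecaster(n_features=4)
```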
Input shape: (2497, 168, 4) (samples, 168 timesteps, 4 features)
Output shape: (2497, 24)
Epoch 10 | Val Loss: 0.025417
Epoch 15 | Val Loss: 0.021893
Epoch 20 | Val Loss: 0.022641
Early stopping at epoch 22
Multivariate LSTM (load + weather):
MAE: 0.1623 MW
RMSE: 0.2218 MW
MAPE: 3.79%
Why does weather help? In the univariate model, the LSTM can only extrapolate from historical load patterns. If tomorrow is unseasonably hot, the univariate model has no way to know that. The multivariate model sees the temperature trend in its input window and can adjust the forecast upward. This is especially valuable during heat waves and cold snaps that deviate from normal seasonal patterns.
Compare LSTM to Gradient Boosting Baseline
In Guide 02, you built a Gradient Boosting model that predicted one hour ahead with hand-crafted features. Let's rebuild that baseline and compare it to our LSTM models. Note that this is a “direct multi-output LSTM vs. single-step GB” comparison—the GB model produces one-step-ahead predictions using known lag features, while the LSTM forecasts all 24 hours simultaneously without iterative re-feeding. These represent different forecasting paradigms, and the MAPE numbers are not directly apples-to-apples. The GB number represents the best-case scenario for a one-step model; in an autoregressive 24-step rollout (where each prediction feeds into the next), the GB’s error would compound and grow substantially.
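A representative sketch of the baseline rebuild; the exact feature set from Guide 02 may differ, and the lag/calendar columns here are illustrative.

```python
from sklearn.ensemble import GradientBoostingRegressor

# Guide 02-style features: calendar fields plus lagged load values.
gb = hourly.copy()
gb["hour"] = gb["timestamp"].dt.hour
gb["dayofweek"] = gb["timestamp"].dt.dayofweek
for lag in (1, 24, 168):
    gb[f"load_lag_{lag}"] = gb["load_mw"].shift(lag)
gb["target"] = gb["load_mw"].shift(-1)  # next hour's load
gb = gb.dropna()

feat_cols = ["hour", "dayofweek", "temperature",
             "load_lag_1", "load_lag_24", "load_lag_168"]
cut = int(len(gb) * 0.85)  # evaluate on the same final-15% window
gb_model = GradientBoostingRegressor(random_state=42)
gb_model.fit(gb[feat_cols].iloc[:cut], gb["target"].iloc[:cut])

gb_pred = gb_model.predict(gb[feat_cols].iloc[cut:])
gb_mae = np.abs(gb_pred - gb["target"].iloc[cut:].to_numpy()).mean()
print(f"GB (1-step) MAE: {gb_mae:.4f} MW")
```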
Model Comparison (Test Set: final 15% of 2024)

| Model | MAE (MW) | RMSE (MW) | MAPE |
| --- | --- | --- | --- |
| GB (1-step, Guide 02) | 0.2134 | 0.2987 | 4.97% |
| LSTM (univariate, 24-step) | 0.1847 | 0.2531 | 4.31% |
| LSTM + weather (24-step) | 0.1623 | 0.2218 | 3.79% |
The LSTM with weather inputs tracks the actual load curve more closely, especially during temperature-driven deviations from the normal daily pattern. The biggest improvements tend to appear during heat waves, cold fronts, and holiday periods where the Gradient Boosting model's lag features point to the wrong historical pattern.
Common Mistakes in Time-Series ML
Before wrapping up, let's review several common pitfalls that can quietly undermine your results. These mistakes are especially prevalent in time-series forecasting and are worth internalizing before moving to production.
1. Fitting the scaler on the entire dataset. A surprisingly common error is calling scaler.fit_transform(all_data) before splitting into train/val/test. This lets the scaler “see” the mean and variance of future data, leaking information into training. Always call scaler.fit() on training data only, then scaler.transform() on validation and test sets. In this guide, we fit on the first 70% of the data (training window) and transform the full series. The resulting statistics will be slightly different from those computed on all data, but the model evaluation will be honest.
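The contrast in one short sketch, reusing the series and split index from earlier:

```python
# WRONG: the scaler sees the mean/variance of the validation and test windows.
scaler_bad = StandardScaler().fit(series)

# RIGHT: fit on the training window only, then transform the full series.
scaler_ok = StandardScaler().fit(series[:split])
scaled_ok = scaler_ok.transform(series)
```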
2. Overfitting with large models on small datasets. Our LSTM has 208,984 parameters. With the SP&L representative seasonal weeks dataset, we have roughly 2,500 sequences and ~1,750 for training—this is already on the smaller side for this architecture. Watch for a widening gap between training and validation loss, and consider reducing hidden_size (e.g., from 128 to 32 or 64), increasing dropout, or using fewer LSTM layers. With a larger production dataset spanning multiple years of continuous data, the same architecture would have much more room to learn. The right model size depends on your dataset size.
3. Confusing sequence shuffling with timestep shuffling. Setting shuffle=True in the training DataLoader is safe and beneficial—it shuffles the order in which complete sequences are presented to the model, improving gradient descent convergence. This is fundamentally different from shuffling individual timesteps within a sequence (which would destroy temporal patterns) or shuffling before the train/val/test split (which would leak future data). The chronological split must happen first; shuffling happens only within the training set at the batch level.
Model Persistence and Hyperparameter Notes
Why these hyperparameters? hidden_size=128 balances model capacity with the available training data (~1,750 sequences). With the SP&L demo dataset, a larger hidden_size (256+) would risk overfitting. The 2-layer LSTM captures both short-term patterns (daily cycles) and longer-term dependencies (weekly patterns).
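For persistence, a minimal sketch of saving and restoring the model with torch.save/torch.load; the checkpoint filename is illustrative. Storing the scaler constants alongside the weights lets you reproduce preprocessing at inference time.

```python
# Save the weights plus the preprocessing constants needed at inference time.
checkpoint = {
    "state_dict": model.state_dict(),
    "scaler_mean": float(scaler.mean_[0]),
    "scaler_scale": float(scaler.scale_[0]),
    "input_hours": 168,
    "horizon": 24,
}
torch.save(checkpoint, "lstm_load_forecaster.pt")

# Later: rebuild the architecture, then restore the weights.
restored = LoadForecaster()
ckpt = torch.load("lstm_load_forecaster.pt")
restored.load_state_dict(ckpt["state_dict"])
restored.eval()
```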
Wrap-Up and Next Steps
You built a multi-step LSTM load forecasting system that predicts an entire 24-hour load curve in a single forward pass. Here's what you accomplished:
- Loaded and explored 15-minute feeder load profiles (representative seasonal weeks) with daily, weekly, and seasonal patterns
- Created sliding-window sequences (168h input, 24h output) for LSTM training
- Built and trained an LSTM neural network in PyTorch with proper early stopping
- Evaluated multi-step forecasts using MAE, RMSE, and MAPE across all 24 horizons
- Added weather as exogenous inputs, reducing MAPE from 4.31% to 3.79%
- Compared LSTM against the Gradient Boosting baseline, demonstrating improvement on multi-step forecasting
Ideas to Try Next
- Transformer models: Replace the LSTM with a Temporal Fusion Transformer (TFT) for even better multi-horizon forecasting with built-in attention-based interpretability
- Hierarchical forecasting: Train models for all 65 feeders and reconcile their forecasts with the total substation load using top-down or middle-out approaches
- Net load forecasting: Subtract solar PV generation (load_solar_profiles()) from gross load to forecast net load—critical for feeders with high rooftop solar penetration
- Probabilistic forecasts: Replace point predictions with quantile regression to output prediction intervals (e.g., "there is a 90% chance load will be between 3.8 and 5.2 MW")
- Attention mechanism: Add an attention layer after the LSTM to let the model "look back" at specific timesteps rather than compressing everything into a single hidden state
Key Terms Glossary
- LSTM (Long Short-Term Memory) — a recurrent neural network architecture with gates that control information flow, enabling it to learn long-range dependencies in sequences
- Hidden state — the LSTM's internal memory vector, updated at each timestep, that summarizes everything the network has seen so far
- Multi-step forecasting — predicting multiple future timesteps at once (e.g., 24 hours ahead) rather than just the next single step
- Sliding window — a fixed-size window that moves across the time series to create overlapping input/output training pairs
- Exogenous variables — external inputs (like weather) that influence the target but are not predicted by the model
- MAPE (Mean Absolute Percentage Error) — forecast error expressed as a percentage of actual values; useful for comparing across different load magnitudes
- RMSE (Root Mean Squared Error) — like MAE but penalizes large errors more heavily; sensitive to outlier forecast misses
- Early stopping — halting training when validation loss stops improving to prevent overfitting
- Gradient clipping — capping gradient magnitudes during backpropagation to prevent the exploding gradient problem in RNNs
- StandardScaler — transforms data to zero mean and unit variance; essential for neural network inputs