Prerequisite: Complete Guide 02: Load Forecasting first. This guide replaces the single-step Gradient Boosting model with an LSTM neural network that forecasts 24 hours ahead in a single pass.
What You Will Learn
In Guide 02, you built a Gradient Boosting model that predicts the next hour's load from hand-crafted features. That approach works well for single-step forecasts, but utilities need to plan ahead—scheduling generation, staging crews, and managing battery storage requires knowing the full load curve for the next day. In this guide you will:
- Build sliding-window sequences from 15-minute feeder load profiles spanning representative seasonal weeks
- Construct an LSTM neural network in PyTorch that learns temporal patterns directly from raw sequences
- Forecast 24 hours ahead in a single forward pass (multi-step forecasting)
- Use a proper chronological train/validation/test split—no data leakage
- Incorporate weather as exogenous features alongside the load sequence
- Compare LSTM performance against the Gradient Boosting baseline from Guide 02
Why LSTMs for load forecasting? Long Short-Term Memory networks are a type of recurrent neural network designed to learn long-range dependencies in sequential data. Unlike Gradient Boosting, which requires you to manually engineer lag features, an LSTM can learn which parts of the historical sequence matter most. It processes the input one timestep at a time, maintaining a hidden state that acts as a "memory" of what it has seen so far.
SP&L Data You Will Use
- load_profiles.csv (load_load_profiles()) — feeder-level 15-minute load profiles covering representative seasonal weeks for all 65 feeders
- weather_data.csv (load_weather_data()) — 52,608 hourly weather records with temperature, humidity, wind speed, and solar irradiance
- solar_profiles.csv (load_solar_profiles()) — representative hourly solar generation patterns by month
Additional Libraries
Extra dependency: pip install torch
Verify Your Setup
Before starting, verify that your environment is configured correctly. Run this cell first to confirm all dependencies are installed and data files are accessible.
Working directory: All guides assume your working directory is the repository root (Dynamic-Network-Model/). Start Jupyter Lab from there: cd Dynamic-Network-Model && jupyter lab
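Below is a minimal sketch of such a verification cell. The import path for the SP&L loaders (dnm.data) is a placeholder assumption; point it at wherever the repository actually exposes load_load_profiles() and load_weather_data().

```python
# Minimal environment check: core libraries plus the extra torch dependency.
import importlib.util

for pkg in ["pandas", "numpy", "sklearn", "matplotlib", "torch"]:
    assert importlib.util.find_spec(pkg) is not None, f"Missing dependency: {pkg}"

import torch

print(f"PyTorch {torch.__version__} | CUDA available: {torch.cuda.is_available()}")

# Hypothetical import path for the SP&L data loaders; adjust to your repo layout.
from dnm.data import load_load_profiles, load_weather_data

profiles = load_load_profiles()
weather = load_weather_data()
print(f"Load profiles: {profiles.shape} | Weather: {weather.shape}")
```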
Having trouble? Check our Troubleshooting Guide for solutions to common setup and data loading issues.
Load and Prepare the Data
We start with the same SP&L load profile and weather data from Guide 02, but this time we will keep the raw 15-minute series intact rather than flattening it into tabular features.
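A sketch of the loading step, assuming the hypothetical dnm.data import path from the verification cell. Focusing on a single feeder and the merge/resampling choices below are illustrative; the guide's original cell may differ in detail.

```python
import pandas as pd

from dnm.data import load_load_profiles, load_weather_data  # hypothetical path

profiles = load_load_profiles()
weather = load_weather_data()
profiles["timestamp"] = pd.to_datetime(profiles["timestamp"])
weather["timestamp"] = pd.to_datetime(weather["timestamp"])

# Pick one feeder and attach the most recent hourly weather reading to each
# 15-minute load row (merge_asof matches on the nearest earlier timestamp).
feeder = profiles[profiles["feeder_id"] == profiles["feeder_id"].iloc[0]]
df = pd.merge_asof(
    feeder.sort_values("timestamp"),
    weather.sort_values("timestamp"),
    on="timestamp",
)

# Aggregate to hourly resolution for the 168h-in / 24h-out windows used below.
hourly = (
    df.set_index("timestamp")
    .resample("1h")
    .mean(numeric_only=True)
    .dropna()
    .reset_index()
)

print("Date range:", df["timestamp"].min(), "to", df["timestamp"].max())
print("Columns:", list(df.columns))
```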
Date range: 2024-01-15 00:00:00 to 2024-10-21 23:45:00
Columns: ['feeder_id', 'substation_id', 'timestamp', 'load_mw', 'load_mvar', 'voltage_pu', 'power_factor', 'temperature', 'humidity', 'wind_speed']
Explore Temporal Patterns
Before building any model, it is critical to understand the patterns your LSTM needs to capture. Electricity load exhibits strong daily, weekly, and seasonal cycles that a well-trained LSTM should learn to reproduce.
You should observe three key patterns. First, a strong daily cycle with load peaking in the afternoon (hours 14–18) and dipping overnight. Second, weekday load is noticeably higher than weekend load due to commercial and industrial demand. Third, a U-shaped relationship between temperature and load—both extreme cold and extreme heat drive up demand from heating and cooling systems respectively. The LSTM will need to learn all three of these patterns from the raw sequence.
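A quick way to check all three patterns, assuming the hourly frame built above (plot styling is minimal and illustrative):

```python
import matplotlib.pyplot as plt

hourly["hour"] = hourly["timestamp"].dt.hour
hourly["is_weekend"] = hourly["timestamp"].dt.dayofweek >= 5

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
hourly.groupby("hour")["load_mw"].mean().plot(ax=axes[0], title="Mean load by hour of day")
hourly.groupby("is_weekend")["load_mw"].mean().plot(kind="bar", ax=axes[1], title="Weekday vs. weekend")
axes[2].scatter(hourly["temperature"], hourly["load_mw"], s=4, alpha=0.3)
axes[2].set(title="Load vs. temperature", xlabel="Temperature", ylabel="Load (MW)")
plt.tight_layout()
plt.show()
```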
Create Sliding Window Sequences
LSTMs consume fixed-length sequences. We use a sliding window: given the past 168 hours (1 week) of load data, predict the next 24 hours of load. Each window slides forward by one hour to create overlapping training samples.
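A minimal windowing sketch. Note that the scaler is fit on the first 70% of the series only (the training window), as discussed in the pitfalls section at the end; the helper name make_windows is illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

INPUT_HOURS, OUTPUT_HOURS = 168, 24
series = hourly["load_mw"].to_numpy().reshape(-1, 1)

# Fit the scaler on the first 70% of the series (the training window),
# then transform everything; see the pitfalls section at the end.
split = int(len(series) * 0.70)
scaler = StandardScaler().fit(series[:split])
scaled = scaler.transform(series).ravel()
print(f"Load std: {scaler.scale_[0]:.3f} MW")

def make_windows(values, n_in=INPUT_HOURS, n_out=OUTPUT_HOURS):
    """Slide a (n_in -> n_out) window forward one hour at a time."""
    X, y = [], []
    for i in range(len(values) - n_in - n_out + 1):
        X.append(values[i : i + n_in])
        y.append(values[i + n_in : i + n_in + n_out])
    return np.array(X), np.array(y)

X, y = make_windows(scaled)
print(f"Total sequences: {len(X):,}")
print(f"Input shape: {X.shape} (samples, {INPUT_HOURS} timesteps)")
print(f"Output shape: {y.shape} (samples, {OUTPUT_HOURS} timesteps)")
```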
Load std: 1.287 MW
Total sequences: 2,497
Input shape: (2497, 168) (samples, 168 timesteps)
Output shape: (2497, 24) (samples, 24 timesteps)
Why 168 hours? One week of history captures a full weekly cycle—the model sees both weekday and weekend patterns in every training sample. Shorter windows (e.g., 24 hours) miss the weekly pattern, while longer windows (e.g., 720 hours / 1 month) add computational cost without proportional benefit. 168 hours is a common choice in load forecasting literature.
Always normalize your data before feeding it to a neural network. LSTMs use sigmoid and tanh activation functions internally, which saturate (produce near-zero gradients) for very large or very small inputs. StandardScaler transforms the data to zero mean and unit variance, keeping values in a range where these activations work well.
Chronological Train/Validation/Test Split
Time-series data must be split chronologically. We use the first 70% of sequences for training, the next 15% for validation (hyperparameter tuning), and the final 15% for testing. No shuffling—the model never sees the future during training.
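Because the windows are already in chronological order, the split is simple array slicing:

```python
# Chronological split: no shuffling across the boundaries.
n = len(X)
i_train, i_val = int(n * 0.70), int(n * 0.85)

X_train, y_train = X[:i_train], y[:i_train]
X_val, y_val = X[i_train:i_val], y[i_train:i_val]
X_test, y_test = X[i_val:], y[i_val:]

print(f"Train: {len(X_train):,} sequences (first 70%)")
print(f"Val: {len(X_val)} sequences (next 15%)")
print(f"Test: {len(X_test)} sequences (final 15%)")
```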
Train: 1,747 sequences (first 70%)
Val: 375 sequences (next 15%)
Test: 375 sequences (final 15%)
Never shuffle time-series data. Random shuffling allows the model to "see" future data during training, producing unrealistically optimistic results. In production, your model will always predict the future from past data only. A chronological split simulates this honestly.
Build the LSTM Model in PyTorch
Now we define the LSTM architecture. The model processes the 168-hour input sequence one timestep at a time, building up a hidden state that summarizes the history. The final hidden state is then passed through fully connected layers to produce the 24-hour forecast.
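The class below reproduces the architecture printed further down: a two-layer LSTM encoder whose last hidden state feeds a small fully connected head that emits all 24 forecast values at once.

```python
import torch
import torch.nn as nn

class LoadForecaster(nn.Module):
    """Two-layer LSTM encoder followed by a fully connected forecast head."""

    def __init__(self, n_features=1, hidden_size=128, horizon=24):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size=n_features,
            hidden_size=hidden_size,
            num_layers=2,
            batch_first=True,  # inputs arrive as (batch, timesteps, features)
            dropout=0.2,
        )
        self.fc = nn.Sequential(
            nn.Linear(hidden_size, 64),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(64, horizon),
        )

    def forward(self, x):
        out, _ = self.lstm(x)          # out: (batch, timesteps, hidden_size)
        return self.fc(out[:, -1, :])  # last hidden state -> 24-hour forecast

model = LoadForecaster()
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")
print(model)
```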
Why shuffle=True for the train loader? We already split chronologically, so the train/val/test boundaries are respected. Within the training set, shuffling the order in which the model sees batches helps with gradient descent convergence. This is different from shuffling individual timesteps within a sequence—we never do that.
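A sketch of the loaders; batch_size=32 is an assumption, not a value fixed by the guide.

```python
from torch.utils.data import DataLoader, TensorDataset

def to_loader(X, y, shuffle):
    # Add a trailing feature dimension: (samples, 168) -> (samples, 168, 1).
    ds = TensorDataset(
        torch.tensor(X, dtype=torch.float32).unsqueeze(-1),
        torch.tensor(y, dtype=torch.float32),
    )
    return DataLoader(ds, batch_size=32, shuffle=shuffle)

train_loader = to_loader(X_train, y_train, shuffle=True)  # shuffles whole sequences
val_loader = to_loader(X_val, y_val, shuffle=False)
test_loader = to_loader(X_test, y_test, shuffle=False)
```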
Model parameters: 208,984
LoadForecaster(
(lstm): LSTM(1, 128, num_layers=2, batch_first=True, dropout=0.2)
(fc): Sequential(
(0): Linear(in_features=128, out_features=64, bias=True)
(1): ReLU()
(2): Dropout(p=0.2, inplace=False)
(3): Linear(in_features=64, out_features=24, bias=True)
)
)
Inside the LSTM cell: At each timestep, the LSTM receives the current input and the previous hidden state. It uses three "gates" to decide what to do: the forget gate decides what to discard from memory, the input gate decides what new information to store, and the output gate decides what to pass forward. This gating mechanism is what allows LSTMs to remember patterns from hundreds of timesteps ago—like the load from the same hour last week.
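For reference, the standard LSTM cell equations, where $\sigma$ is the logistic sigmoid, $\odot$ is elementwise multiplication, $x_t$ is the current input, and $h_{t-1}$ is the previous hidden state:

$$
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{forget gate} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{input gate} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{output gate} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{candidate memory} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{cell state update} \\
h_t &= o_t \odot \tanh(c_t) && \text{new hidden state}
\end{aligned}
$$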
Train the LSTM
We train using MSE loss with the Adam optimizer and an exponential learning rate scheduler. Early stopping on validation loss prevents overfitting.
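A condensed training loop consistent with the log below. The learning rate (1e-3), decay factor (gamma=0.95), and patience of 5 epochs are assumptions chosen to match the early-stopping behavior shown, not values fixed by the guide.

```python
import copy

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

best_val, best_state, patience, bad_epochs = float("inf"), None, 5, 0

for epoch in range(1, 31):
    model.train()
    train_loss = 0.0
    for xb, yb in train_loader:
        xb, yb = xb.to(device), yb.to(device)
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        # Cap gradient magnitude at 1.0 to avoid exploding gradients.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        train_loss += loss.item() * len(xb)
    scheduler.step()

    model.eval()
    with torch.no_grad():
        val_loss = sum(
            criterion(model(xb.to(device)), yb.to(device)).item() * len(xb)
            for xb, yb in val_loader
        ) / len(val_loader.dataset)
    train_loss /= len(train_loader.dataset)
    print(f"Epoch {epoch}/30 | Train Loss: {train_loss:.6f} | Val Loss: {val_loss:.6f}")

    # Early stopping: keep the best weights, stop after `patience` bad epochs.
    if val_loss < best_val:
        best_val, best_state, bad_epochs = val_loss, copy.deepcopy(model.state_dict()), 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            print(f"Early stopping at epoch {epoch}")
            break

model.load_state_dict(best_state)
print(f"Best validation loss: {best_val:.6f}")
```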
Epoch 2/30 | Train Loss: 0.078612 | Val Loss: 0.067841
Epoch 3/30 | Train Loss: 0.059304 | Val Loss: 0.054219
...
Epoch 18/30 | Train Loss: 0.021847 | Val Loss: 0.028163
Epoch 19/30 | Train Loss: 0.020592 | Val Loss: 0.028701
Epoch 20/30 | Train Loss: 0.019483 | Val Loss: 0.029115
...
Epoch 23/30 | Train Loss: 0.017921 | Val Loss: 0.030284
Early stopping at epoch 23
Best validation loss: 0.028163
Gradient clipping: The line clip_grad_norm_(model.parameters(), max_norm=1.0) prevents the "exploding gradient" problem. LSTMs process long sequences, and during backpropagation the gradients can grow exponentially as they flow through many timesteps. Clipping caps the gradient magnitude at 1.0, keeping training stable.
Multi-Step Evaluation: MAPE, RMSE, and Visualization
Now we evaluate the trained LSTM on the held-out test set (final 15%). Since the model outputs 24 values at once, we can assess accuracy at each forecast horizon (1 hour ahead, 2 hours ahead, ... 24 hours ahead).
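A sketch of the evaluation, inverting the scaler by hand so errors are reported in MW:

```python
# Predict on the held-out test set.
model.eval()
with torch.no_grad():
    preds = torch.cat([model(xb.to(device)).cpu() for xb, _ in test_loader]).numpy()

# Undo the StandardScaler manually (it was fit on a single column).
preds_mw = preds * scaler.scale_[0] + scaler.mean_[0]      # (n_test, 24)
actual_mw = y_test * scaler.scale_[0] + scaler.mean_[0]

abs_err = np.abs(preds_mw - actual_mw)
mae = abs_err.mean()
rmse = np.sqrt(((preds_mw - actual_mw) ** 2).mean())
mape = (abs_err / np.abs(actual_mw)).mean() * 100
print(f"MAE: {mae:.4f} MW")
print(f"RMSE: {rmse:.4f} MW")
print(f"MAPE: {mape:.2f}%")

# Per-horizon error: column h holds the (h+1)-hour-ahead predictions.
horizon_mae = abs_err.mean(axis=0)
print(f"12-hour ahead MAE: {horizon_mae[11]:.4f} MW")
print(f"24-hour ahead MAE: {horizon_mae[23]:.4f} MW")
```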
MAE: 0.1847 MW
RMSE: 0.2531 MW
MAPE: 4.31%
12-hour ahead MAE: 0.1894 MW
24-hour ahead MAE: 0.2317 MW
As expected, the model is most accurate at short horizons, and error grows with forecast distance. Averaged across all 24 horizons, the MAE (0.1847 MW) comes in below the Gradient Boosting baseline from Guide 02, and even the 24-hour-ahead error remains close to the baseline's single-step error. Let's visualize a sample forecast.
Add Weather Features as Exogenous Inputs
So far our LSTM only sees historical load. But temperature is the single biggest driver of demand. Let's add weather data as additional input features alongside load. This transforms the LSTM from a univariate to a multivariate model.
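A sketch of the multivariate windowing, assuming the three weather columns merged earlier (temperature, humidity, wind_speed) alongside load; the helper name is illustrative.

```python
# Stack load with the three merged weather channels: (samples, 168, 4).
FEATURES = ["load_mw", "temperature", "humidity", "wind_speed"]
raw = hourly[FEATURES].to_numpy()

# One scaler over all four channels, again fit on the training window only.
feat_scaler = StandardScaler().fit(raw[:split])
scaled_feats = feat_scaler.transform(raw)

def make_multivariate_windows(values, target_col=0, n_in=168, n_out=24):
    X, y = [], []
    for i in range(len(values) - n_in - n_out + 1):
        X.append(values[i : i + n_in, :])                          # all features in
        y.append(values[i + n_in : i + n_in + n_out, target_col])  # load only out
    return np.array(X), np.array(y)

X_mv, y_mv = make_multivariate_windows(scaled_feats)
print(f"Input shape: {X_mv.shape} (samples, 168 timesteps, 4 features)")
print(f"Output shape: {y_mv.shape}")

# Same architecture, wider input; the DataLoaders skip the unsqueeze step
# because these windows are already 3-dimensional.
mv_model = LoadForecaster(n_features=4)
```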
Input shape: (2497, 168, 4) (samples, 168 timesteps, 4 features)
Output shape: (2497, 24)
Epoch 10 | Val Loss: 0.025417
Epoch 15 | Val Loss: 0.021893
Epoch 20 | Val Loss: 0.022641
Early stopping at epoch 22
Multivariate LSTM (load + weather):
MAE: 0.1623 MW
RMSE: 0.2218 MW
MAPE: 3.79%
Why does weather help? In the univariate model, the LSTM can only extrapolate from historical load patterns. If tomorrow is unseasonably hot, the univariate model has no way to know that. The multivariate model sees the temperature trend in its input window and can adjust the forecast upward. This is especially valuable during heat waves and cold snaps that deviate from normal seasonal patterns.
Compare LSTM to Gradient Boosting Baseline
In Guide 02, you built a Gradient Boosting model that predicted one hour ahead with hand-crafted features. Let's rebuild that baseline and compare it to our LSTM models. Note that this is a “direct multi-output LSTM vs. single-step GB” comparison—the GB model produces one-step-ahead predictions using known lag features, while the LSTM forecasts all 24 hours simultaneously without iterative re-feeding. These represent different forecasting paradigms, and the MAPE numbers are not directly apples-to-apples. The GB number represents the best-case scenario for a one-step model; in an autoregressive 24-step rollout (where each prediction feeds into the next), the GB’s error would compound and grow substantially.
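A representative sketch of the baseline rebuild; the exact feature set from Guide 02 may differ, and the lag/calendar columns here are illustrative.

```python
from sklearn.ensemble import GradientBoostingRegressor

# Guide 02-style features: calendar fields plus lagged load values.
gb = hourly.copy()
gb["hour"] = gb["timestamp"].dt.hour
gb["dayofweek"] = gb["timestamp"].dt.dayofweek
for lag in (1, 24, 168):
    gb[f"load_lag_{lag}"] = gb["load_mw"].shift(lag)
gb["target"] = gb["load_mw"].shift(-1)  # next hour's load
gb = gb.dropna()

feat_cols = ["hour", "dayofweek", "temperature",
             "load_lag_1", "load_lag_24", "load_lag_168"]
cut = int(len(gb) * 0.85)  # evaluate on the same final-15% window
gb_model = GradientBoostingRegressor(random_state=42)
gb_model.fit(gb[feat_cols].iloc[:cut], gb["target"].iloc[:cut])

gb_pred = gb_model.predict(gb[feat_cols].iloc[cut:])
gb_mae = np.abs(gb_pred - gb["target"].iloc[cut:].to_numpy()).mean()
print(f"GB (1-step) MAE: {gb_mae:.4f} MW")
```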
Model Comparison (Test Set: final 15% of 2024)

| Model | MAE (MW) | RMSE (MW) | MAPE |
| --- | --- | --- | --- |
| GB (1-step, Guide 02) | 0.2134 | 0.2987 | 4.97% |
| LSTM (univariate, 24-step) | 0.1847 | 0.2531 | 4.31% |
| LSTM + weather (24-step) | 0.1623 | 0.2218 | 3.79% |
The LSTM with weather inputs tracks the actual load curve more closely, especially during temperature-driven deviations from the normal daily pattern. The biggest improvements tend to appear during heat waves, cold fronts, and holiday periods where the Gradient Boosting model's lag features point to the wrong historical pattern.
Common Mistakes in Time-Series ML
Before wrapping up, let's review several common pitfalls that can quietly undermine your results. These mistakes are especially prevalent in time-series forecasting and are worth internalizing before moving to production.
1. Fitting the scaler on the entire dataset. A surprisingly common error is calling scaler.fit_transform(all_data) before splitting into train/val/test. This lets the scaler “see” the mean and variance of future data, leaking information into training. Always call scaler.fit() on training data only, then scaler.transform() on validation and test sets. In this guide, we fit on the first 70% of the data (training window) and transform the full series. The resulting statistics will be slightly different from those computed on all data, but the model evaluation will be honest.
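The contrast in one short sketch, reusing the series and split index from earlier:

```python
# WRONG: the scaler sees the mean/variance of the validation and test windows.
scaler_bad = StandardScaler().fit(series)

# RIGHT: fit on the training window only, then transform the full series.
scaler_ok = StandardScaler().fit(series[:split])
scaled_ok = scaler_ok.transform(series)
```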
2. Overfitting with large models on small datasets. Our LSTM has 208,984 parameters. With the SP&L representative seasonal weeks dataset, we have roughly 2,500 sequences and ~1,750 for training—this is already on the smaller side for this architecture. Watch for a widening gap between training and validation loss, and consider reducing hidden_size (e.g., from 128 to 32 or 64), increasing dropout, or using fewer LSTM layers. With a larger production dataset spanning multiple years of continuous data, the same architecture would have much more room to learn. The right model size depends on your dataset size.
3. Confusing sequence shuffling with timestep shuffling. Setting shuffle=True in the training DataLoader is safe and beneficial—it shuffles the order in which complete sequences are presented to the model, improving gradient descent convergence. This is fundamentally different from shuffling individual timesteps within a sequence (which would destroy temporal patterns) or shuffling before the train/val/test split (which would leak future data). The chronological split must happen first; shuffling happens only within the training set at the batch level.
Model Persistence and Hyperparameter Notes
Why these hyperparameters? hidden_size=128 balances model capacity with the available training data (~1,750 sequences). With the SP&L demo dataset, a larger hidden_size (256+) would risk overfitting. The 2-layer LSTM captures both short-term patterns (daily cycles) and longer-term dependencies (weekly patterns).
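For persistence, a minimal sketch of saving and restoring the model with torch.save/torch.load; the checkpoint filename is illustrative. Storing the scaler constants alongside the weights lets you reproduce preprocessing at inference time.

```python
# Save the weights plus the preprocessing constants needed at inference time.
checkpoint = {
    "state_dict": model.state_dict(),
    "scaler_mean": float(scaler.mean_[0]),
    "scaler_scale": float(scaler.scale_[0]),
    "input_hours": 168,
    "horizon": 24,
}
torch.save(checkpoint, "lstm_load_forecaster.pt")

# Later: rebuild the architecture, then restore the weights.
restored = LoadForecaster()
ckpt = torch.load("lstm_load_forecaster.pt")
restored.load_state_dict(ckpt["state_dict"])
restored.eval()
```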
Wrap-Up and Next Steps
You built a multi-step LSTM load forecasting system that predicts an entire 24-hour load curve in a single forward pass. Here's what you accomplished:
- Loaded and explored 15-minute feeder load profiles (representative seasonal weeks) with daily, weekly, and seasonal patterns
- Created sliding-window sequences (168h input, 24h output) for LSTM training
- Built and trained an LSTM neural network in PyTorch with proper early stopping
- Evaluated multi-step forecasts using MAE, RMSE, and MAPE across all 24 horizons
- Added weather as exogenous inputs, reducing MAPE from 4.31% to 3.79%
- Compared LSTM against the Gradient Boosting baseline, demonstrating improvement on multi-step forecasting
Ideas to Try Next
- Transformer models: Replace the LSTM with a Temporal Fusion Transformer (TFT) for even better multi-horizon forecasting with built-in attention-based interpretability
- Hierarchical forecasting: Train models for all 65 feeders and reconcile their forecasts with the total substation load using top-down or middle-out approaches
- Net load forecasting: Subtract solar PV generation (load_solar_profiles()) from gross load to forecast net load—critical for feeders with high rooftop solar penetration
- Probabilistic forecasts: Replace point predictions with quantile regression to output prediction intervals (e.g., "there is a 90% chance load will be between 3.8 and 5.2 MW")
- Attention mechanism: Add an attention layer after the LSTM to let the model "look back" at specific timesteps rather than compressing everything into a single hidden state
Key Terms Glossary
- LSTM (Long Short-Term Memory) — a recurrent neural network architecture with gates that control information flow, enabling it to learn long-range dependencies in sequences
- Hidden state — the LSTM's internal memory vector, updated at each timestep, that summarizes everything the network has seen so far
- Multi-step forecasting — predicting multiple future timesteps at once (e.g., 24 hours ahead) rather than just the next single step
- Sliding window — a fixed-size window that moves across the time series to create overlapping input/output training pairs
- Exogenous variables — external inputs (like weather) that influence the target but are not predicted by the model
- MAPE (Mean Absolute Percentage Error) — forecast error expressed as a percentage of actual values; useful for comparing across different load magnitudes
- RMSE (Root Mean Squared Error) — like MAE but penalizes large errors more heavily; sensitive to outlier forecast misses
- Early stopping — halting training when validation loss stops improving to prevent overfitting
- Gradient clipping — capping gradient magnitudes during backpropagation to prevent the exploding gradient problem in RNNs
- StandardScaler — transforms data to zero mean and unit variance; essential for neural network inputs