What You Will Learn
Utilities need to know how much electricity their customers will use tomorrow so they can schedule generation, manage equipment, and avoid overloads. In this guide you will:
- Load 15-minute feeder load profiles from the SP&L dataset
- Visualize load patterns by hour, day, and season
- Build a simple "persistence" baseline forecast
- Train a Gradient Boosting regression model that beats the baseline
- Evaluate forecast accuracy using standard error metrics
What is Gradient Boosting? Gradient Boosting builds many small decision trees one at a time, where each new tree tries to correct the mistakes of the previous ones. It is one of the most popular algorithms in applied machine learning because it handles tabular data extremely well and requires minimal tuning to produce good results.
SP&L Data You Will Use
- load_profiles.csv (load_load_profiles()) — feeder-level 15-minute load profiles with representative seasonal weeks (~2,688 intervals per feeder)
- weather_data.csv (load_weather_data()) — 52,608 hourly records (6 years) with temperature, humidity, wind, and storm flags
Load profiles are 15-minute intervals (not hourly).
Beyond the base prerequisites, this guide needs nothing extra.
Verify Your Setup
Before starting, verify that your environment is configured correctly. Run this cell first to confirm all dependencies are installed and data files are accessible.
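A minimal check might look like the following. The data/ paths below are an assumption about where your copy of the dataset lives; adjust them to match your repository layout.

```python
# Environment check: confirm the core dependencies import, and report whether
# the SP&L data files are where we expect them (assumed paths).
import importlib.util
from pathlib import Path

required = ["pandas", "numpy", "sklearn", "matplotlib"]
missing = [pkg for pkg in required if importlib.util.find_spec(pkg) is None]
print("Missing packages:", missing or "none")

for name in ["data/load_profiles.csv", "data/weather_data.csv"]:  # assumed paths
    status = "found" if Path(name).exists() else "NOT FOUND"
    print(f"{name}: {status}")
```

If anything is reported missing, install it (or fix the paths) before continuing.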
Working directory: All guides assume your working directory is the repository root (Dynamic-Network-Model/). Start Jupyter Lab from there: cd Dynamic-Network-Model && jupyter lab
Having trouble? Check our Troubleshooting Guide for solutions to common setup and data loading issues.
Load the Data
Weather rows: 52,608
Load columns: ['feeder_id', 'substation_id', 'timestamp', 'load_mw', 'load_mvar', 'voltage_pu', 'power_factor']
Pick a Feeder and Explore
The SP&L dataset contains 65 feeders. To keep things simple, pick one feeder and work with it throughout this guide. You can repeat the process for other feeders later.
You should see a clear daily cycle: load dips at night and peaks in the afternoon, especially on hot days. This pattern is the foundation of our forecast.
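One quick way to see that cycle is to average the load by hour of day. This sketch uses a synthetic stand-in for one feeder's 15-minute profile (the real data comes from load_load_profiles()); the groupby line is the part to reuse.

```python
import numpy as np
import pandas as pd

# Two synthetic days of 15-minute load with an afternoon peak around 3 PM.
idx = pd.date_range("2024-07-01", periods=192, freq="15min")
load = pd.DataFrame({
    "timestamp": idx,
    "load_mw": 3.0 + 1.5 * np.sin((idx.hour + idx.minute / 60 - 9) * np.pi / 12),
})

# Average load by hour of day reveals the daily shape: nighttime dip,
# afternoon peak.
hourly_shape = load.groupby(load["timestamp"].dt.hour)["load_mw"].mean()
print("Peak hour:", hourly_shape.idxmax())
```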
Build Time Features
The load pattern depends heavily on the time of day, day of week, and season. Let's extract those from the timestamp.
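A sketch of the feature extraction, shown on a toy frame; apply the same lines to your real load table:

```python
import pandas as pd

# Extract calendar features from the timestamp column.
df = pd.DataFrame({"timestamp": pd.date_range("2024-01-05", periods=4, freq="15min")})
df["hour"] = df["timestamp"].dt.hour
df["day_of_week"] = df["timestamp"].dt.dayofweek  # Monday=0 ... Sunday=6
df["month"] = df["timestamp"].dt.month
df["is_weekend"] = (df["day_of_week"] >= 5).astype(int)
print(df)
```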
Merge Weather Data
Temperature is the single biggest driver of electricity demand. On hot days, air conditioners run at full blast. On cold days, electric heating spikes. Let's join weather data to our load table.
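Because the weather is hourly and the load is 15-minute, a plain join won't line up. One approach is pandas merge_asof, which assigns each 15-minute row the most recent hourly reading. The column names below mirror the SP&L files; the data here is synthetic.

```python
import pandas as pd

# Hourly weather joined onto 15-minute load rows.
load = pd.DataFrame({
    "timestamp": pd.date_range("2024-07-01", periods=8, freq="15min"),
    "load_mw": [3.1, 3.0, 3.2, 3.3, 3.5, 3.6, 3.8, 3.9],
})
weather = pd.DataFrame({
    "timestamp": pd.date_range("2024-07-01", periods=2, freq="h"),
    "temperature_f": [78.0, 81.0],
})

# direction="backward": each load row takes the last weather reading at or
# before its timestamp. Both frames must be sorted on the join key.
merged = pd.merge_asof(load.sort_values("timestamp"),
                       weather.sort_values("timestamp"),
                       on="timestamp", direction="backward")
print(merged[["timestamp", "load_mw", "temperature_f"]])
```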
Add Lag Features
What was the load 24 hours ago? That is often the best predictor of what load will be now. These "lag" features give the model a sense of recent history.
What is a lag feature? A lag feature is simply a past value of the target variable, shifted forward in time. load_lag_24h is "what was the load exactly 24 hours ago." This helps the model because electricity demand is strongly autocorrelated—today's pattern usually looks a lot like yesterday's.
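At 15-minute resolution, 24 hours is 96 intervals and 7 days is 672, so the lags are just shifts by those step counts. A sketch on a synthetic series:

```python
import numpy as np
import pandas as pd

# Eight synthetic days of 15-minute load (a simple ramp, for illustration).
df = pd.DataFrame({
    "timestamp": pd.date_range("2024-07-01", periods=8 * 96, freq="15min"),
    "load_mw": np.arange(8 * 96, dtype=float),
})

df["load_lag_24h"] = df["load_mw"].shift(96)    # 24 h = 96 intervals
df["load_lag_7d"] = df["load_mw"].shift(672)    # 7 d = 672 intervals
df["load_roll_24h"] = df["load_mw"].rolling(96).mean()  # trailing 24 h average

# The first 7 days have no 7-day lag, so drop those rows before training.
df = df.dropna()
print(len(df), "usable rows")
```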
Build a Baseline Forecast
Before training an ML model, build a simple baseline. A "persistence" forecast says: "Tomorrow's load at 2 PM will be the same as today's load at 2 PM." This gives you a bar to beat.
What is MAE? Mean Absolute Error is the average of the absolute differences between predicted and actual values. If MAE = 0.5 MW, it means the forecast is off by 0.5 MW on average. Lower is better. Every ML model should beat the baseline MAE to be considered useful.
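The baseline and its MAE fit in a few lines. Here a short synthetic series stands in for the feeder load; the shift-by-96 is the persistence forecast itself.

```python
import numpy as np
import pandas as pd

# Three synthetic days of 15-minute load around 3 MW.
rng = np.random.default_rng(0)
s = pd.Series(3.0 + rng.normal(0, 0.3, 3 * 96))

# Persistence: predict each interval with the value 24 hours (96 steps) earlier.
persistence = s.shift(96)

# MAE over the rows where a prediction exists.
mask = persistence.notna()
baseline_mae = (s[mask] - persistence[mask]).abs().mean()
print(f"Persistence MAE: {baseline_mae:.3f} MW")
```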
Train the Gradient Boosting Model
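A minimal training-and-evaluation sketch, using scikit-learn's GradientBoostingRegressor on synthetic features. In the real workflow, X would hold the time, weather, and lag columns built above, and the train/test split must be chronological (no shuffling) so the model never sees the future.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

# Synthetic feature table: hour, temperature, and a 24-hour lag that
# (by construction) carries most of the signal, as in real load data.
rng = np.random.default_rng(42)
n = 2000
X = pd.DataFrame({
    "hour": rng.integers(0, 24, n),
    "temperature_f": rng.uniform(40, 100, n),
})
X["load_lag_24h"] = 2.0 + 0.03 * X["temperature_f"] + rng.normal(0, 0.1, n)
y = X["load_lag_24h"] + 0.01 * X["hour"] + rng.normal(0, 0.05, n)

# Chronological split: first 80% for training, last 20% held out.
split = int(n * 0.8)
X_train, X_test = X.iloc[:split], X.iloc[split:]
y_train, y_test = y.iloc[:split], y.iloc[split:]

model = GradientBoostingRegressor(n_estimators=200, max_depth=3,
                                  learning_rate=0.05, random_state=0)
model.fit(X_train, y_train)
gb_mae = mean_absolute_error(y_test, model.predict(X_test))
print(f"Gradient Boosting MAE: {gb_mae:.4f} MW")
```

The hyperparameters here (200 shallow trees, learning rate 0.05) are sensible defaults, not tuned values; the exact numbers you see on the SP&L data will differ.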
Test and Compare
Gradient Boosting MAE: 0.2134 MW
Gradient Boosting RMSE: 0.2987 MW
Improvement over baseline: 55.7%
Visualize the Forecast
Let's plot one week of predictions against actual load to see how the model performs visually.
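A plotting sketch with synthetic series standing in for y_test and the model's predictions; in the notebook, pass your real arrays and drop the Agg backend line.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# One synthetic week of actual vs. predicted 15-minute load.
idx = pd.date_range("2024-07-01", periods=7 * 96, freq="15min")
actual = 3.0 + 1.5 * np.sin((idx.hour - 9) * np.pi / 12)
predicted = actual + np.random.default_rng(1).normal(0, 0.1, len(idx))

fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(idx, actual, label="Actual")
ax.plot(idx, predicted, label="Predicted", alpha=0.7)
ax.set_ylabel("Load (MW)")
ax.set_title("Day-ahead forecast vs. actual load (one week)")
ax.legend()
fig.tight_layout()
```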
Feature Importance
You will likely see that load_lag_24h and temperature_f dominate, followed by hour. This makes intuitive sense: yesterday's load at the same hour is the best starting point, adjusted for today's weather.
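Tree ensembles expose this ranking directly via feature_importances_. A sketch on synthetic data where load_lag_24h is constructed to dominate the target, so it should rank first:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic features; the target is mostly the 24-hour lag plus a small
# temperature effect and noise.
rng = np.random.default_rng(7)
X = pd.DataFrame({
    "load_lag_24h": rng.uniform(2, 6, 1000),
    "temperature_f": rng.uniform(40, 100, 1000),
    "hour": rng.integers(0, 24, 1000),
})
y = X["load_lag_24h"] + 0.005 * X["temperature_f"] + rng.normal(0, 0.05, 1000)

model = GradientBoostingRegressor(random_state=0).fit(X, y)
ranked = pd.Series(model.feature_importances_,
                   index=X.columns).sort_values(ascending=False)
print(ranked)
```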
What You Built and Next Steps
You just built a day-ahead load forecasting model that beat a persistence baseline by over 50%. Here's what you did:
- Loaded 15-minute feeder load profiles and weather data from the SP&L dataset
- Explored daily and seasonal load patterns
- Engineered time features (hour, day, month, weekend flag)
- Added lag features (24-hour, 7-day, rolling average)
- Built a simple persistence baseline and measured its error
- Trained a Gradient Boosting model that significantly outperformed the baseline
- Visualized actual vs. predicted load and identified the most important features
Ideas to Try Next
- Forecast all 65 feeders: Wrap your code in a loop and build a separate model for each feeder
- Add AMI data: Use the customer interval data (load_customer_interval_data()) for finer-grained per-customer forecasts
- Try an LSTM: Replace Gradient Boosting with a recurrent neural network using PyTorch or TensorFlow
- Incorporate solar generation: Subtract solar generation from load_solar_profiles() to forecast net load
- Evaluate peak accuracy: Utilities care most about peak-hour accuracy—filter to hours 14–18 and measure error separately
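As a starting point for the last idea, the peak-hour metric is just the same MAE restricted to an hour window. A sketch on synthetic data, where "predicted" is deliberately flat so the peak error stands out:

```python
import numpy as np
import pandas as pd

# Two synthetic days; the flat prediction misses the afternoon peak badly.
idx = pd.date_range("2024-07-01", periods=2 * 96, freq="15min")
df = pd.DataFrame({
    "actual": 3.0 + 1.5 * np.sin((idx.hour - 9) * np.pi / 12),
    "predicted": 3.0,
}, index=idx)

# Restrict the evaluation to the 14:00-18:00 window, then score it.
peak = df[(df.index.hour >= 14) & (df.index.hour <= 18)]
peak_mae = (peak["actual"] - peak["predicted"]).abs().mean()
print(f"Peak-hour MAE: {peak_mae:.3f} MW")
```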
Key Terms Glossary
- Gradient Boosting — builds trees sequentially; each new tree corrects errors from the previous ones
- Regression — predicting a continuous number (load in MW) rather than a category
- MAE (Mean Absolute Error) — average of |predicted − actual|; lower is better
- RMSE (Root Mean Squared Error) — like MAE but penalizes large errors more heavily
- Lag feature — a past value of the target shifted forward in time
- Persistence forecast — the simplest baseline: "tomorrow = today"
Ready to Level Up?
In the advanced guide, you'll build an LSTM neural network in PyTorch for multi-step ahead load forecasting.
Go to Advanced Load Forecasting →