What You Will Learn
Utilities need to know how much electricity their customers will use tomorrow so they can schedule generation, manage equipment, and avoid overloads. In this guide you will:
- Load 5 years of hourly substation data from the SP&L dataset
- Visualize load patterns by hour, day, and season
- Build a simple "persistence" baseline forecast
- Train a Gradient Boosting regression model that beats the baseline
- Evaluate forecast accuracy using standard error metrics
What is Gradient Boosting? Gradient Boosting builds many small decision trees one at a time, where each new tree tries to correct the mistakes of the previous ones. It is one of the most popular algorithms in applied machine learning because it handles tabular data extremely well and requires minimal tuning to produce good results.
SP&L Data You Will Use
- timeseries/substation_load_hourly.parquet — hourly load (MW) for all 12 feeders from 2020–2025, decomposed by customer class
- weather/hourly_observations.csv — hourly temperature, humidity, wind speed, and precipitation
Additional Libraries
Which terminal should I use? On Windows, open Anaconda Prompt from the Start Menu (or PowerShell / Command Prompt if Python is already in your PATH). On macOS, open Terminal from Applications → Utilities. On Linux, open your default terminal. All pip install commands work the same across platforms.
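The sketches in this guide use a standard scientific Python stack: pandas for data handling, pyarrow for Parquet support, scikit-learn for the Gradient Boosting model, and matplotlib for plots. If you don't already have them, one install command covers everything (package choices here are an assumption based on what the guide uses, not an official requirements list):

```bash
pip install pandas pyarrow scikit-learn matplotlib
```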
Load the Data
What is a Parquet file? Parquet is a columnar file format designed for big data. It loads much faster than CSV for large datasets and takes up less disk space. Pandas reads it the same way as CSV—you just use read_parquet() instead of read_csv().
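Here is a minimal loading sketch. It assumes the file paths listed above, and it assumes the load table uses column names like timestamp, feeder_id, customer_class, and load_mw; check the printed headers and adjust if your copy of the dataset names them differently.

```python
import pandas as pd

# Load the hourly substation load table (Parquet) and the weather observations (CSV).
# Column names used later (timestamp, feeder_id, customer_class, load_mw) are assumptions --
# inspect df_load.columns and df_weather.columns and adjust if needed.
df_load = pd.read_parquet("timeseries/substation_load_hourly.parquet")
df_weather = pd.read_csv("weather/hourly_observations.csv", parse_dates=["timestamp"])

print(df_load.shape)
print(df_load.head())
print(df_weather.head())
```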
Pick a Feeder and Explore
The SP&L dataset contains 12 feeders. To keep things simple, pick one feeder and work with it throughout this guide. You can repeat the process for other feeders later.
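A sketch of filtering to one feeder and plotting a sample week is below. The feeder id "F01" is a placeholder, so substitute any value from df_load["feeder_id"].unique(); because the load is decomposed by customer class, the sketch sums across classes to get one MW value per hour.

```python
import matplotlib.pyplot as plt

# Pick one feeder and sum load across customer classes so we get
# a single MW value per hourly timestamp.
feeder_id = "F01"  # placeholder -- use a real id from your data
feeder = (
    df_load[df_load["feeder_id"] == feeder_id]
    .groupby("timestamp", as_index=False)["load_mw"]
    .sum()
    .sort_values("timestamp")
)

# Plot one summer week to see the daily cycle (the week chosen is arbitrary).
week = feeder[(feeder["timestamp"] >= "2022-07-01") & (feeder["timestamp"] < "2022-07-08")]
plt.plot(week["timestamp"], week["load_mw"])
plt.xlabel("Time")
plt.ylabel("Load (MW)")
plt.title(f"Feeder {feeder_id}: one week of hourly load")
plt.tight_layout()
plt.show()
```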
You should see a clear daily cycle: load dips at night and peaks in the afternoon, especially on hot days. This pattern is the foundation of our forecast.
Build Time Features
The load pattern depends heavily on the time of day, day of week, and season. Let's extract those from the timestamp.
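A sketch using pandas' .dt accessor on the timestamp column, continuing from the feeder DataFrame built above:

```python
# Calendar features: hour of day, day of week, month, and a weekend flag.
feeder["hour"] = feeder["timestamp"].dt.hour
feeder["day_of_week"] = feeder["timestamp"].dt.dayofweek  # Monday = 0
feeder["month"] = feeder["timestamp"].dt.month
feeder["is_weekend"] = (feeder["day_of_week"] >= 5).astype(int)
```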
Merge Weather Data
Temperature is the single biggest driver of electricity demand. On hot days, air conditioners run at full blast. On cold days, electric heating spikes. Let's join weather data to our load table.
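A sketch of the join, assuming the weather file uses column names timestamp, temperature, humidity, wind_speed, and precipitation (adjust to the actual headers in your copy):

```python
# Join hourly weather observations onto the load table by timestamp.
feeder = feeder.merge(
    df_weather[["timestamp", "temperature", "humidity", "wind_speed", "precipitation"]],
    on="timestamp",
    how="left",
)
```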
Add Lag Features
What was the load 24 hours ago? That is often the single best predictor of the load right now. These "lag" features give the model a sense of recent history.
What is a lag feature? A lag feature is simply a past value of the target variable, shifted forward in time. load_lag_24h is "what was the load exactly 24 hours ago." This helps the model because electricity demand is strongly autocorrelated—today's pattern usually looks a lot like yesterday's.
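A sketch of the three lag features summarized at the end of this guide (24-hour lag, 7-day lag, and a rolling average), built with pandas shift() and rolling():

```python
# Past values of the target: 24 hours ago, the same hour one week ago,
# and a rolling 24-hour mean shifted by one hour so it only uses past data.
feeder["load_lag_24h"] = feeder["load_mw"].shift(24)
feeder["load_lag_7d"] = feeder["load_mw"].shift(24 * 7)
feeder["load_roll_24h_mean"] = feeder["load_mw"].shift(1).rolling(24).mean()

# The first week of rows now has missing lag values -- drop them.
feeder = feeder.dropna().reset_index(drop=True)
```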
Build a Baseline Forecast
Before training an ML model, build a simple baseline. A "persistence" forecast says: "Tomorrow's load at 2 PM will be the same as today's load at 2 PM." This gives you a bar to beat.
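A sketch of the persistence baseline using a simple chronological split. The split date is an assumption (any held-out period at the end of the series works), and the 24-hour lag column built above doubles as the persistence prediction.

```python
from sklearn.metrics import mean_absolute_error

# Chronological split: train on everything before 2024, test on 2024 onward.
train = feeder[feeder["timestamp"] < "2024-01-01"]
test = feeder[feeder["timestamp"] >= "2024-01-01"]

# Persistence forecast: the load equals the load 24 hours earlier.
baseline_pred = test["load_lag_24h"]
mae_baseline = mean_absolute_error(test["load_mw"], baseline_pred)
print(f"Persistence baseline MAE: {mae_baseline:.4f} MW")
```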
What is MAE? Mean Absolute Error is the average of the absolute differences between predicted and actual values. If MAE = 0.5 MW, it means the forecast is off by 0.5 MW on average. Lower is better. Every ML model should beat the baseline MAE to be considered useful.
Train the Gradient Boosting Model
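A sketch of training scikit-learn's GradientBoostingRegressor on the features built so far. The hyperparameters shown are reasonable starting points for hourly load data, not tuned values.

```python
from sklearn.ensemble import GradientBoostingRegressor

features = [
    "hour", "day_of_week", "month", "is_weekend",
    "temperature", "humidity", "wind_speed", "precipitation",
    "load_lag_24h", "load_lag_7d", "load_roll_24h_mean",
]

model = GradientBoostingRegressor(
    n_estimators=300,    # number of trees added sequentially
    learning_rate=0.05,  # how much each new tree contributes
    max_depth=3,         # keep individual trees small
    random_state=42,
)
model.fit(train[features], train["load_mw"])
```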
Test and Compare
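A sketch of scoring the model on the held-out period and comparing it against the baseline MAE from the previous step:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

pred = model.predict(test[features])

mae_gb = mean_absolute_error(test["load_mw"], pred)
rmse_gb = np.sqrt(mean_squared_error(test["load_mw"], pred))
improvement = 100 * (1 - mae_gb / mae_baseline)

print(f"Gradient Boosting MAE: {mae_gb:.4f} MW")
print(f"Gradient Boosting RMSE: {rmse_gb:.4f} MW")
print(f"Improvement over baseline: {improvement:.1f}%")
```

Exact numbers depend on the feeder and the split you chose, but the output will look something like this: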
```
Gradient Boosting MAE: 0.2134 MW
Gradient Boosting RMSE: 0.2987 MW
Improvement over baseline: 55.7%
```
Visualize the Forecast
Let's plot one week of predictions against actual load to see how the model performs visually.
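A plotting sketch that reuses the test DataFrame and predictions from the previous step; the week shown is an arbitrary choice.

```python
# Plot one test week of actual vs. predicted load.
plot_df = test.copy()
plot_df["predicted_mw"] = pred
week = plot_df[(plot_df["timestamp"] >= "2024-07-01") & (plot_df["timestamp"] < "2024-07-08")]

plt.figure(figsize=(12, 4))
plt.plot(week["timestamp"], week["load_mw"], label="Actual")
plt.plot(week["timestamp"], week["predicted_mw"], label="Predicted", linestyle="--")
plt.xlabel("Time")
plt.ylabel("Load (MW)")
plt.legend()
plt.title("Day-ahead forecast vs. actual load (one test week)")
plt.tight_layout()
plt.show()
```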
Feature Importance
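A sketch using the model's built-in feature_importances_ attribute to rank the inputs:

```python
# Rank features by the model's impurity-based importance scores.
importance = (
    pd.Series(model.feature_importances_, index=features)
    .sort_values(ascending=False)
)
print(importance)

importance.plot(kind="barh")
plt.xlabel("Importance")
plt.title("Gradient Boosting feature importance")
plt.tight_layout()
plt.show()
```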
You will likely see that load_lag_24h and temperature dominate, followed by hour. This makes intuitive sense: yesterday's load at the same hour is the best starting point, adjusted for today's weather.
What You Built and Next Steps
You just built a day-ahead load forecasting model that beat a persistence baseline by over 50%. Here's what you did:
- Loaded hourly substation load and weather data from the SP&L repository
- Explored daily and seasonal load patterns
- Engineered time features (hour, day, month, weekend flag)
- Added lag features (24-hour, 7-day, rolling average)
- Built a simple persistence baseline and measured its error
- Trained a Gradient Boosting model that significantly outperformed the baseline
- Visualized actual vs. predicted load and identified the most important features
Ideas to Try Next
- Forecast all 12 feeders: Wrap your code in a loop and build a separate model for each feeder
- Add AMI data: Use the 15-minute AMI data in timeseries/ami_15min_sample.parquet for finer-grained forecasts
- Try an LSTM: Replace Gradient Boosting with a recurrent neural network using PyTorch or TensorFlow
- Incorporate solar generation: Subtract PV generation from timeseries/pv_generation.parquet to forecast net load
- Evaluate peak accuracy: Utilities care most about peak-hour accuracy—filter to hours 14–18 and measure error separately
Key Terms Glossary
- Gradient Boosting — builds trees sequentially; each new tree corrects errors from the previous ones
- Regression — predicting a continuous number (load in MW) rather than a category
- MAE (Mean Absolute Error) — average of |predicted − actual|; lower is better
- RMSE (Root Mean Squared Error) — like MAE but penalizes large errors more heavily
- Lag feature — a past value of the target shifted forward in time
- Persistence forecast — the simplest baseline: "tomorrow = today"
- Parquet — a columnar data format optimized for analytics workloads
Ready to Level Up?
In the advanced guide, you'll build an LSTM neural network in PyTorch for multi-step ahead load forecasting.
Go to Advanced Load Forecasting →