What You Will Learn
Utilities need to know how much electricity their customers will use tomorrow so they can schedule generation, manage equipment, and avoid overloads. In this guide you will:
- Load 5 years of hourly substation data from the SP&L dataset
- Visualize load patterns by hour, day, and season
- Build a simple "persistence" baseline forecast
- Train a Gradient Boosting regression model that beats the baseline
- Evaluate forecast accuracy using standard error metrics
What is Gradient Boosting? Gradient Boosting builds many small decision trees one at a time, where each new tree tries to correct the mistakes of the previous ones. It is one of the most popular algorithms in applied machine learning because it handles tabular data extremely well and requires minimal tuning to produce good results.
SP&L Data You Will Use
- timeseries/substation_load_hourly.parquet — hourly load (MW) for all 12 feeders from 2020–2025, decomposed by customer class
- weather/hourly_observations.csv — hourly temperature, humidity, wind speed, and precipitation
Additional Libraries
Which terminal should I use? On Windows, open Anaconda Prompt from the Start Menu (or PowerShell / Command Prompt if Python is already in your PATH). On macOS, open Terminal from Applications → Utilities. On Linux, open your default terminal. All pip install commands work the same across platforms.
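The sketches in this guide use a standard scientific Python stack: pandas for data handling, pyarrow for Parquet support, scikit-learn for the Gradient Boosting model, and matplotlib for plots. If you don't already have them, one install command covers everything (package choices here are an assumption based on what the guide uses, not an official requirements list):

```bash
pip install pandas pyarrow scikit-learn matplotlib
```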
Load the Data
What is a Parquet file? Parquet is a columnar file format designed for big data. It loads much faster than CSV for large datasets and takes up less disk space. Pandas reads it the same way as CSV—you just use read_parquet() instead of read_csv().
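Here is a minimal loading sketch. It assumes the file paths listed above, and it assumes the load table uses column names like timestamp, feeder_id, customer_class, and load_mw; check the printed headers and adjust if your copy of the dataset names them differently.

```python
import pandas as pd

# Load the hourly substation load table (Parquet) and the weather observations (CSV).
# Column names used later (timestamp, feeder_id, customer_class, load_mw) are assumptions --
# inspect df_load.columns and df_weather.columns and adjust if needed.
df_load = pd.read_parquet("timeseries/substation_load_hourly.parquet")
df_weather = pd.read_csv("weather/hourly_observations.csv", parse_dates=["timestamp"])

print(df_load.shape)
print(df_load.head())
print(df_weather.head())
```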
Pick a Feeder and Explore
The SP&L dataset contains 12 feeders. To keep things simple, pick one feeder and work with it throughout this guide. You can repeat the process for other feeders later.
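A sketch of filtering to one feeder and plotting a sample week is below. The feeder id "F01" is a placeholder, so substitute any value from df_load["feeder_id"].unique(); because the load is decomposed by customer class, the sketch sums across classes to get one MW value per hour.

```python
import matplotlib.pyplot as plt

# Pick one feeder and sum load across customer classes so we get
# a single MW value per hourly timestamp.
feeder_id = "F01"  # placeholder -- use a real id from your data
feeder = (
    df_load[df_load["feeder_id"] == feeder_id]
    .groupby("timestamp", as_index=False)["load_mw"]
    .sum()
    .sort_values("timestamp")
)

# Plot one summer week to see the daily cycle (the week chosen is arbitrary).
week = feeder[(feeder["timestamp"] >= "2022-07-01") & (feeder["timestamp"] < "2022-07-08")]
plt.plot(week["timestamp"], week["load_mw"])
plt.xlabel("Time")
plt.ylabel("Load (MW)")
plt.title(f"Feeder {feeder_id}: one week of hourly load")
plt.tight_layout()
plt.show()
```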
You should see a clear daily cycle: load dips at night and peaks in the afternoon, especially on hot days. This pattern is the foundation of our forecast.
Build Time Features
The load pattern depends heavily on the time of day, day of week, and season. Let's extract those from the timestamp.
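A sketch using pandas' .dt accessor on the timestamp column, continuing from the feeder DataFrame built above:

```python
# Calendar features: hour of day, day of week, month, and a weekend flag.
feeder["hour"] = feeder["timestamp"].dt.hour
feeder["day_of_week"] = feeder["timestamp"].dt.dayofweek  # Monday = 0
feeder["month"] = feeder["timestamp"].dt.month
feeder["is_weekend"] = (feeder["day_of_week"] >= 5).astype(int)
```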
Merge Weather Data
Temperature is the single biggest driver of electricity demand. On hot days, air conditioners run at full blast. On cold days, electric heating spikes. Let's join weather data to our load table.
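A sketch of the join, assuming the weather file uses column names timestamp, temperature, humidity, wind_speed, and precipitation (adjust to the actual headers in your copy):

```python
# Join hourly weather observations onto the load table by timestamp.
feeder = feeder.merge(
    df_weather[["timestamp", "temperature", "humidity", "wind_speed", "precipitation"]],
    on="timestamp",
    how="left",
)
```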
Add Lag Features
What was the load 24 hours ago? That is often the single best predictor of the load right now. These "lag" features give the model a sense of recent history.
What is a lag feature? A lag feature is simply a past value of the target variable, shifted forward in time. load_lag_24h is "what was the load exactly 24 hours ago." This helps the model because electricity demand is strongly autocorrelated—today's pattern usually looks a lot like yesterday's.
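A sketch of the three lag features summarized at the end of this guide (24-hour lag, 7-day lag, and a rolling average), built with pandas shift() and rolling():

```python
# Past values of the target: 24 hours ago, the same hour one week ago,
# and a rolling 24-hour mean shifted by one hour so it only uses past data.
feeder["load_lag_24h"] = feeder["load_mw"].shift(24)
feeder["load_lag_7d"] = feeder["load_mw"].shift(24 * 7)
feeder["load_roll_24h_mean"] = feeder["load_mw"].shift(1).rolling(24).mean()

# The first week of rows now has missing lag values -- drop them.
feeder = feeder.dropna().reset_index(drop=True)
```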
Build a Baseline Forecast
Before training an ML model, build a simple baseline. A "persistence" forecast says: "Tomorrow's load at 2 PM will be the same as today's load at 2 PM." This gives you a bar to beat.
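A sketch of the persistence baseline using a simple chronological split. The split date is an assumption (any held-out period at the end of the series works), and the 24-hour lag column built above doubles as the persistence prediction.

```python
from sklearn.metrics import mean_absolute_error

# Chronological split: train on everything before 2024, test on 2024 onward.
train = feeder[feeder["timestamp"] < "2024-01-01"]
test = feeder[feeder["timestamp"] >= "2024-01-01"]

# Persistence forecast: the load equals the load 24 hours earlier.
baseline_pred = test["load_lag_24h"]
mae_baseline = mean_absolute_error(test["load_mw"], baseline_pred)
print(f"Persistence baseline MAE: {mae_baseline:.4f} MW")
```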
What is MAE? Mean Absolute Error is the average of the absolute differences between predicted and actual values. If MAE = 0.5 MW, it means the forecast is off by 0.5 MW on average. Lower is better. Every ML model should beat the baseline MAE to be considered useful.
Train the Gradient Boosting Model
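A sketch of training scikit-learn's GradientBoostingRegressor on the features built so far. The hyperparameters shown are reasonable starting points for hourly load data, not tuned values.

```python
from sklearn.ensemble import GradientBoostingRegressor

features = [
    "hour", "day_of_week", "month", "is_weekend",
    "temperature", "humidity", "wind_speed", "precipitation",
    "load_lag_24h", "load_lag_7d", "load_roll_24h_mean",
]

model = GradientBoostingRegressor(
    n_estimators=300,    # number of trees added sequentially
    learning_rate=0.05,  # how much each new tree contributes
    max_depth=3,         # keep individual trees small
    random_state=42,
)
model.fit(train[features], train["load_mw"])
```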
Test and Compare
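A sketch of scoring the model on the held-out period and comparing it against the baseline MAE from the previous step:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

pred = model.predict(test[features])

mae_gb = mean_absolute_error(test["load_mw"], pred)
rmse_gb = np.sqrt(mean_squared_error(test["load_mw"], pred))
improvement = 100 * (1 - mae_gb / mae_baseline)

print(f"Gradient Boosting MAE: {mae_gb:.4f} MW")
print(f"Gradient Boosting RMSE: {rmse_gb:.4f} MW")
print(f"Improvement over baseline: {improvement:.1f}%")
```

Exact numbers depend on the feeder and the split you chose, but the output will look something like this: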
```
Gradient Boosting MAE: 0.2134 MW
Gradient Boosting RMSE: 0.2987 MW
Improvement over baseline: 55.7%
```
Visualize the Forecast
Let's plot one week of predictions against actual load to see how the model performs visually.
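A plotting sketch that reuses the test DataFrame and predictions from the previous step; the week shown is an arbitrary choice.

```python
# Plot one test week of actual vs. predicted load.
plot_df = test.copy()
plot_df["predicted_mw"] = pred
week = plot_df[(plot_df["timestamp"] >= "2024-07-01") & (plot_df["timestamp"] < "2024-07-08")]

plt.figure(figsize=(12, 4))
plt.plot(week["timestamp"], week["load_mw"], label="Actual")
plt.plot(week["timestamp"], week["predicted_mw"], label="Predicted", linestyle="--")
plt.xlabel("Time")
plt.ylabel("Load (MW)")
plt.legend()
plt.title("Day-ahead forecast vs. actual load (one test week)")
plt.tight_layout()
plt.show()
```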
Feature Importance
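A sketch using the model's built-in feature_importances_ attribute to rank the inputs:

```python
# Rank features by the model's impurity-based importance scores.
importance = (
    pd.Series(model.feature_importances_, index=features)
    .sort_values(ascending=False)
)
print(importance)

importance.plot(kind="barh")
plt.xlabel("Importance")
plt.title("Gradient Boosting feature importance")
plt.tight_layout()
plt.show()
```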
You will likely see that load_lag_24h and temperature dominate, followed by hour. This makes intuitive sense: yesterday's load at the same hour is the best starting point, adjusted for today's weather.
What You Built and Next Steps
You just built a day-ahead load forecasting model that beat a persistence baseline by over 50%. Here's what you did:
- Loaded hourly substation load and weather data from the SP&L repository
- Explored daily and seasonal load patterns
- Engineered time features (hour, day, month, weekend flag)
- Added lag features (24-hour, 7-day, rolling average)
- Built a simple persistence baseline and measured its error
- Trained a Gradient Boosting model that significantly outperformed the baseline
- Visualized actual vs. predicted load and identified the most important features
Ideas to Try Next
- Forecast all 12 feeders: Wrap your code in a loop and build a separate model for each feeder
- Add AMI data: Use the 15-minute AMI data in timeseries/ami_15min_sample.parquet for finer-grained forecasts
- Try an LSTM: Replace Gradient Boosting with a recurrent neural network using PyTorch or TensorFlow
- Incorporate solar generation: Subtract PV generation from timeseries/pv_generation.parquet to forecast net load
- Evaluate peak accuracy: Utilities care most about peak-hour accuracy—filter to hours 14–18 and measure error separately
Key Terms Glossary
- Gradient Boosting — builds trees sequentially; each new tree corrects errors from the previous ones
- Regression — predicting a continuous number (load in MW) rather than a category
- MAE (Mean Absolute Error) — average of |predicted − actual|; lower is better
- RMSE (Root Mean Squared Error) — like MAE but penalizes large errors more heavily
- Lag feature — a past value of the target shifted forward in time
- Persistence forecast — the simplest baseline: "tomorrow = today"
- Parquet — a columnar data format optimized for analytics workloads
Ready to Level Up?
In the advanced guide, you'll build an LSTM neural network in PyTorch for multi-step ahead load forecasting.
Go to Advanced Load Forecasting →