What You Will Learn
In this guide, you will build a machine learning model that predicts whether an outage is likely to occur on a given day, based on weather conditions and calendar patterns, with asset data loaded so you can extend the model later. By the end, you will have:
- Loaded and explored the SP&L outage and weather datasets
- Combined multiple data sources into a single training table
- Trained a Random Forest classifier to predict outages
- Evaluated your model's accuracy on held-out test data
- Identified which features matter most for outage prediction
What is a Random Forest? A Random Forest is a collection of decision trees. Each tree looks at a random subset of your data and features, then "votes" on the answer. The final prediction is whichever answer gets the most votes. It works well for classification tasks and handles messy, real-world data gracefully.
SP&L Data You Will Use
This guide uses three files from the SP&L repository:
- outages/outage_events.csv — 3,200+ historical outage records with cause codes, timestamps, affected customers, and feeder IDs
- weather/hourly_observations.csv — hourly temperature, wind speed, precipitation, and humidity for 2020–2025
- assets/transformers.csv — transformer age, condition scores, and kVA ratings
Additional Libraries
Beyond the base prerequisites, this guide needs nothing extra. You already have everything: pandas, numpy, scikit-learn, matplotlib, and seaborn.
Which terminal should I use? On Windows, open Anaconda Prompt from the Start Menu (or PowerShell/Command Prompt if Python is in your PATH). On macOS, open Terminal from Applications → Utilities. On Linux, open your default terminal. All pip install commands work the same across platforms.
Load the Data
Open a new Jupyter notebook and run the following cell to import your libraries and load the three SP&L files. Update the DATA_DIR path to wherever you cloned the repository.
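A loading cell along these lines should work. The date column names passed to parse_dates (start_time and timestamp here) are assumptions, so check the CSV headers in your clone and adjust if they differ.

```python
from pathlib import Path
import pandas as pd

# Update this to wherever you cloned the SP&L repository
DATA_DIR = Path("spl-data")

# Date column names are assumptions; adjust to match the actual CSV headers
outages = pd.read_csv(DATA_DIR / "outages" / "outage_events.csv",
                      parse_dates=["start_time"])
weather = pd.read_csv(DATA_DIR / "weather" / "hourly_observations.csv",
                      parse_dates=["timestamp"])
transformers = pd.read_csv(DATA_DIR / "assets" / "transformers.csv")

print(f"Weather rows loaded: {len(weather):,}")
print(f"Transformers loaded: {len(transformers):,}")
```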
Weather rows loaded: 43,824
Transformers loaded: 86
What just happened? You used pandas.read_csv() to load each CSV file into a DataFrame—think of it as a spreadsheet inside Python. The parse_dates argument tells pandas to interpret certain columns as dates rather than plain text.
Explore the Data
Before building a model, look at what you have. Run each line below in its own cell so you can see the output.
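A few quick inspection commands, assuming the cause-code column is named cause_code (adjust if yours differs):

```python
# Run each of these in its own notebook cell so each output displays
outages.head()
outages["cause_code"].value_counts()   # assumed column name for the cause code
weather.describe()
transformers.info()
```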
You should see that outages have cause codes such as vegetation, equipment_failure, animal_contact, weather, and overload. The weather table includes temperature, wind speed, humidity, and precipitation measured every hour.
Build Daily Features
Outages happen on a specific day. Weather is recorded every hour. To combine them, we need to summarize weather into daily statistics (max wind, max temperature, total rainfall, etc.).
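One way to do this is a pandas groupby, assuming the hourly columns are named temperature, wind_speed, precipitation, and humidity (rename to match your files):

```python
# Collapse 24 hourly rows per day into one row of daily summary statistics
weather["date"] = weather["timestamp"].dt.date

daily_weather = (
    weather.groupby("date")
    .agg(
        temp_max=("temperature", "max"),
        temp_min=("temperature", "min"),
        temp_mean=("temperature", "mean"),
        wind_max=("wind_speed", "max"),
        wind_mean=("wind_speed", "mean"),
        precip_total=("precipitation", "sum"),
        humidity_mean=("humidity", "mean"),
    )
    .reset_index()
)
daily_weather.head()
```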
What is feature engineering? Raw data rarely comes in the shape a model needs. Feature engineering is the process of transforming raw columns into useful inputs. Here, we turned 24 hourly weather readings per day into 7 summary numbers (max temp, min temp, mean temp, etc.).
Create the Target Variable
A classification model needs a target: the thing you are predicting. Our target is "Did at least one outage happen on this day?" (yes = 1, no = 0).
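A sketch of one way to build the flag, reusing the daily_weather table from the previous step and assuming the outage timestamp column is start_time:

```python
# Count outage events per calendar day
outages["date"] = outages["start_time"].dt.date
outage_counts = outages.groupby("date").size().rename("outage_count").reset_index()

# Join onto the daily weather table; days with no recorded outage get a count of 0
daily = daily_weather.merge(outage_counts, on="date", how="left")
daily["outage_count"] = daily["outage_count"].fillna(0)

# Binary target: 1 if at least one outage happened that day, 0 otherwise
daily["outage_flag"] = (daily["outage_count"] > 0).astype(int)

print(f"Days with outages: {daily['outage_flag'].sum():,}")
print(f"Days without outages: {(daily['outage_flag'] == 0).sum():,}")
```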
Days with outages: 1,412
Days without outages: 414
Add Time-Based Features
Outages follow seasonal patterns. Let's add month-of-year and day-of-week as features so the model can learn these cycles.
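With the pandas .dt accessor this only takes a couple of lines:

```python
# Make sure the date column is a datetime so we can extract calendar attributes
daily["date"] = pd.to_datetime(daily["date"])

daily["month"] = daily["date"].dt.month            # 1–12, captures seasonality
daily["day_of_week"] = daily["date"].dt.dayofweek  # 0 = Monday ... 6 = Sunday
```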
Split into Training and Test Sets
We need to hold back some data the model has never seen, so we can honestly evaluate it later. The standard practice is an 80/20 split: 80% for training, 20% for testing.
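A typical split cell, using the daily table built above:

```python
from sklearn.model_selection import train_test_split

# Everything except the date, the target, and the raw outage count is a model input
feature_cols = [c for c in daily.columns
                if c not in ("date", "outage_flag", "outage_count")]
X = daily[feature_cols]
y = daily["outage_flag"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```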
Why stratify? The stratify=y argument ensures the train and test sets have the same proportion of outage/no-outage days. Without this, random chance could put most of the no-outage days in one set, giving you misleading results.
Train the Random Forest
Now the exciting part. We create a Random Forest classifier and fit it on the training data. "Fitting" means the model examines all the training rows and learns patterns that connect weather features to outage outcomes.
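A training cell along these lines matches the settings discussed below; the printed feature count will depend on how many daily aggregates you created earlier.

```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=200,         # number of trees in the forest
    class_weight="balanced",  # pay more attention to the rarer class
    random_state=42,
    n_jobs=-1,                # use all available CPU cores
)
model.fit(X_train, y_train)

print(f"Number of trees: {model.n_estimators}")
print(f"Features used: {model.n_features_in_}")
```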
Number of trees: 200
Features used: 10
What does class_weight="balanced" do? Since we have more outage days than non-outage days, the model could cheat by always predicting "outage" and still get high accuracy. The balanced setting tells the model to pay more attention to the rarer class so it actually learns to distinguish the two.
Test the Model
Now we use the held-out test data—data the model has never seen—to see how well it performs in the real world.
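For example, using scikit-learn's classification_report:

```python
from sklearn.metrics import classification_report

# Predict on the test set and summarize precision, recall, and F1 per class
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred,
                            target_names=["No Outage", "Outage"]))
```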
| | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| No Outage | 0.45 | 0.52 | 0.48 | 83 |
| Outage | 0.84 | 0.80 | 0.82 | 283 |
| accuracy | | | 0.73 | 366 |
| macro avg | 0.65 | 0.66 | 0.65 | 366 |
| weighted avg | 0.75 | 0.73 | 0.74 | 366 |
Let's also visualize the confusion matrix to see exactly where the model gets things right and wrong.
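One way to draw it with seaborn:

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=["No Outage", "Outage"],
            yticklabels=["No Outage", "Outage"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix: Daily Outage Prediction")
plt.show()
```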
Reading the results: Precision answers "When the model predicted an outage, how often was it right?" Recall answers "Of all the actual outages, how many did the model catch?" The F1-score is the balance between the two. For utility reliability teams, recall is often more important—you'd rather have a false alarm than miss a real outage.
Understand Feature Importance
One of the best things about Random Forests: they tell you which features matter most. This is valuable for utility engineers because it shows which weather variables drive outage risk.
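A quick way to inspect and plot the importances, reusing feature_cols from the split step:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Pair each feature name with its importance score and sort descending
importances = (
    pd.Series(model.feature_importances_, index=feature_cols)
    .sort_values(ascending=False)
)
print(importances)

importances.plot(kind="barh")
plt.gca().invert_yaxis()   # most important feature at the top
plt.xlabel("Importance")
plt.title("Random Forest feature importances")
plt.show()
```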
You will likely see that wind_max, precip_total, and temp_max rank highest—which aligns with utility operational experience. Storms with high wind and heavy rainfall are the primary outage drivers.
What You Built and Next Steps
Congratulations. You just:
- Loaded real-world-style utility data from the SP&L repository
- Engineered daily features from hourly weather records
- Created a binary classification target (outage yes/no)
- Trained a Random Forest classifier on 80% of the data
- Tested it on the remaining 20% and evaluated performance
- Identified which weather features drive outage risk
Ideas to Try Next
- Add asset features: Merge transformer age and condition scores by feeder to improve predictions
- Predict outage cause: Instead of binary (outage/no outage), predict the cause code (vegetation, weather, equipment) using a multi-class classifier
- Try XGBoost: Replace RandomForestClassifier with XGBClassifier from the xgboost library for potentially better results
- Benchmark against SAIFI: Compare your model's predictions to SP&L's annual SAIFI metrics in outages/reliability_metrics.csv
- Time-aware split: Instead of random splitting, train on 2020–2023 and test on 2024–2025 to simulate how the model would perform on future data (see the sketch after this list)
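For the time-aware split idea, a minimal sketch, assuming the daily table and feature_cols built earlier in this guide:

```python
# Train on 2020–2023, test on 2024–2025, so the test set is strictly "in the future"
train_mask = daily["date"] < "2024-01-01"

X_train_t = daily.loc[train_mask, feature_cols]
y_train_t = daily.loc[train_mask, "outage_flag"]
X_test_t = daily.loc[~train_mask, feature_cols]
y_test_t = daily.loc[~train_mask, "outage_flag"]
```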
Key Terms Glossary
- Random Forest — an ensemble of decision trees that vote on the prediction
- Classification — predicting a category (outage / no outage)
- Feature — an input variable the model uses (e.g., wind speed)
- Target — the variable you are trying to predict (outage_flag)
- Training set — data the model learns from
- Test set — data held back to evaluate the model honestly
- Precision — of all positive predictions, how many were correct
- Recall — of all actual positives, how many were detected
- SAIFI — System Average Interruption Frequency Index, a standard reliability metric
Ready to Level Up?
In the advanced guide, you'll build a multi-class XGBoost classifier with SHAP explainability and time-aware validation.
Go to Advanced Outage Prediction →