Prerequisite: Complete Guide 03: Hosting Capacity Analysis first. This guide replaces the iterative capacity analysis with a trained ML surrogate model that predicts hosting capacity in milliseconds.
What You Will Learn
In Guide 03 you computed hosting capacity for each transformer by comparing rated capacity against existing solar and peak load. That works—but it does not scale well to scenario analysis. The entire SP&L service territory has 65 feeders and thousands of nodes. Utilities evaluating thousands of interconnection requests per year need something faster and more flexible. In this guide you will:
- Compute hosting capacity values from SP&L transformer, solar, and load data as training labels for an ML model
- Engineer feeder and transformer-level features from network topology, transformer ratings, and load data
- Train a LightGBM regression model to predict hosting capacity from network features alone
- Evaluate the surrogate model with R², MAE, and predicted-vs-actual scatter plots
- Run sensitivity analysis on temperature, load growth, and inverter settings
- Build probabilistic hosting capacity estimates using quantile regression
- Map hosting capacity results spatially across the feeder
- Benchmark ML screening speed against full recomputation from source data
What is a surrogate model? A surrogate model is a fast approximation of a detailed computation. You run the full hosting capacity analysis (transformer capacity minus existing solar and peak load, with voltage drop estimates) enough times to build a training dataset, then train an ML model on that data. Once trained, the ML model produces predictions in milliseconds—without recomputing from scratch. The key insight: computed results become ML training labels.
SP&L Data You Will Use
- network_nodes.csv (load_network_nodes()) — ~44,000 nodes with latitude/longitude, equipment class, and rated capacity
- network_edges.csv (load_network_edges()) — ~44,000 conductor segments with impedance (R, X), rated amps, length, and conductor type
- transformers.csv (load_transformers()) — ~21,000 transformers with kva_rating, age_years, and location
- solar_installations.csv (load_solar_installations()) — ~17,000 solar installations with capacity_kw per transformer
- load_profiles.csv (load_load_profiles()) — 15-minute feeder load profiles with load_mw for peak load calculations
Additional Libraries
LightGBM is Microsoft's gradient boosting framework. It is often faster than XGBoost on large datasets, supports quantile regression natively, and handles categorical features without one-hot encoding.
Having trouble? Check our Troubleshooting Guide for solutions to common setup and data loading issues.
Load SP&L Data and Compute Hosting Capacity
We load the SP&L network, transformer, solar, and load profile data using the data loader API. Then we compute hosting capacity per transformer as a simplified estimate: rated kVA minus existing solar capacity minus peak load. These computed values become the labels for supervised learning.
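The label computation can be sketched as follows. This is a minimal stand-in: the three small DataFrames below substitute for the real load_transformers() / load_solar_installations() output and a per-transformer peak-load allocation, and the column names for the peak-load table are hypothetical.

```python
import pandas as pd

# Synthetic stand-ins for the SP&L loader outputs (real data has ~21,000 transformers).
transformers = pd.DataFrame({
    "transformer_id": ["T1", "T2", "T3"],
    "kva_rating": [500.0, 250.0, 100.0],
})
solar = pd.DataFrame({
    "transformer_id": ["T1", "T1", "T3"],
    "capacity_kw": [40.0, 60.0, 30.0],
})
peak_load = pd.DataFrame({  # hypothetical per-transformer peak-load allocation
    "transformer_id": ["T1", "T2", "T3"],
    "peak_load_kw": [200.0, 150.0, 90.0],
})

# Aggregate existing solar per transformer, then apply the simplified estimate:
# hosting capacity = rated kVA - existing solar - allocated peak load, floored at 0.
existing = solar.groupby("transformer_id")["capacity_kw"].sum().rename("existing_solar_kw")
df = transformers.merge(existing, on="transformer_id", how="left")
df = df.fillna({"existing_solar_kw": 0.0}).merge(peak_load, on="transformer_id", how="left")
df["hosting_capacity_kw"] = (
    df["kva_rating"] - df["existing_solar_kw"] - df["peak_load_kw"]
).clip(lower=0.0)
print(df[["transformer_id", "hosting_capacity_kw"]])
```

The clip at zero reflects that a fully subscribed transformer has no remaining headroom, not negative capacity.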
Network edges: 44,119
Transformers: 21,197
Solar installations: 17,042
Load profile rows: 174,720
Hosting capacity computed for 21,197 transformers
Across 65 feeders
Hosting capacity summary (kW):
count 21197.0
mean 152.8
std 119.4
min 0.0
25% 62.3
50% 128.7
75% 215.0
max 500.0
Computed results as labels: This is the key concept of surrogate modeling. The hosting capacity values we computed (rated kVA minus existing solar minus peak load) are not features—they are the target variable. The ML model learns to predict what the full computation would produce based on network characteristics alone. This approach generalizes to any detailed analysis you want to accelerate with ML.
Build Transformer and Network Features
A good surrogate model needs features that capture the physical factors driving hosting capacity: how far the transformer is from the substation, how much load is nearby, how stiff the local network is. We engineer six groups of features from the SP&L data.
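Two of those feature groups can be sketched like this, using tiny synthetic stand-ins for the node and edge tables (the helper and output column names here are illustrative, not the guide's exact implementation):

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for load_network_nodes() / load_network_edges() output.
nodes = pd.DataFrame({
    "node_id": ["sub", "A", "B"],
    "feeder_id": ["F1", "F1", "F1"],
    "latitude": [33.00, 33.01, 33.02],
    "longitude": [-112.00, -112.00, -112.01],
})
edges = pd.DataFrame({
    "feeder_id": ["F1", "F1"],
    "impedance_r_ohm_per_mile": [0.3, 0.5],
    "rated_amps": [400, 200],
    "length_miles": [1.2, 0.8],
})

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance in km between (lat1, lon1) and (lat2, lon2).
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

# Electrical-distance proxy: straight-line distance from the substation node.
sub = nodes.iloc[0]
nodes["dist_from_sub_km"] = haversine_km(
    sub.latitude, sub.longitude, nodes["latitude"], nodes["longitude"]
)

# Feeder-level edge aggregates: mean resistance, weakest conductor, total line miles.
feeder_feats = edges.groupby("feeder_id").agg(
    avg_r_ohm_per_mi=("impedance_r_ohm_per_mile", "mean"),
    min_rated_amps=("rated_amps", "min"),
    total_line_mi=("length_miles", "sum"),
)
print(feeder_feats)
```

Straight-line distance is only a proxy for electrical distance along the feeder, but for screening purposes it correlates well with cumulative impedance.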
Transformers with distance: 21,197
Feeder edge features: 65 feeders
Final dataset: 21,197 rows x 20 columns
Features available: 13
Missing values in tree-based models: LightGBM (and XGBoost) can handle missing values natively—during tree construction, they learn which direction to route NaN values at each split. In this guide we use fillna(0) because for transformers without edge data, zero impedance is a reasonable default. But if your data has missing values for a different reason (e.g., a sensor failure, not the absence of equipment), keeping them as NaN and letting LightGBM learn the optimal routing is often a better approach.
Prepare Training Labels and Split
The target variable is hosting_capacity_kw—the estimated remaining capacity at each transformer after accounting for existing solar and allocated peak load. This is a regression problem: we predict a continuous value, not a category.
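The feeder-stratified split can be sketched with pure pandas on a synthetic stand-in table (scikit-learn's train_test_split with stratify=data["feeder_id"] achieves the same thing):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: 5 feeders x 40 transformers each.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "feeder_id": np.repeat([f"F{i}" for i in range(5)], 40),
    "kva_rating": rng.choice([100.0, 250.0, 500.0], size=200),
    "hosting_capacity_kw": rng.uniform(0.0, 500.0, size=200),
})

# Hold out 20% of transformers *within each feeder*, so every feeder
# appears in both the train and test sets.
test = data.groupby("feeder_id", group_keys=False).sample(frac=0.2, random_state=42)
train = data.drop(test.index)

assert set(train["feeder_id"]) == set(test["feeder_id"])
print(f"Train: {len(train)}, Test: {len(test)}")
```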
Test set: 4,240 transformers
Target distribution (training):
Mean: 153 kW
Std: 119 kW
Min: 0 kW
Max: 500 kW
Why not a time-aware split here? Unlike outage prediction (Guide 09) where events happen over time, hosting capacity is a property of the physical network at a point in time. We stratify by feeder instead, ensuring every feeder appears in both train and test sets so the model sees the diversity of network topologies across all 65 SP&L feeders.
Train a LightGBM Regression Model
LightGBM uses gradient-boosted decision trees with histogram-based splitting for speed. For hosting capacity prediction, regression is the right objective: we want to predict a continuous kW value, not a category.
[200] train's mae: 12.7 test's mae: 18.3
[300] train's mae: 10.2 test's mae: 16.8
Early stopping at iteration 342
Best iteration: 292
Best test MAE: 16.2 kW
Interpreting the MAE in context: The hosting capacity in the SP&L dataset ranges from 0 kW to 500 kW with a median around 129 kW. An MAE of ~16 kW means the model is off by about 10–12% relative to the typical transformer. For transformers with ample remaining capacity (>300 kW), this error is just 3–5%—excellent for screening. But for constrained transformers at the low end (HC ~0–50 kW), the same 16 kW absolute error becomes a larger relative error. In practice, these constrained locations are the ones that matter most for interconnection decisions. Use the quantile regression intervals (Step 7) to flag uncertain cases for detailed follow-up.
Evaluate with R², MAE, and Scatter Plot
A good surrogate model should show predictions tightly clustered around the 45-degree line (predicted = actual). We also check feature importance to validate the model learned physically meaningful patterns.
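Both metrics are straightforward to compute from predicted and actual arrays. A toy sketch (the arrays stand in for real model output):

```python
import numpy as np

# Toy stand-ins for actual and predicted hosting capacity (kW).
actual = np.array([100.0, 200.0, 300.0, 400.0])
pred = np.array([110.0, 190.0, 320.0, 380.0])

# MAE: average absolute error.
mae = np.abs(pred - actual).mean()

# R^2: 1 minus residual variance over total variance.
ss_res = ((actual - pred) ** 2).sum()
ss_tot = ((actual - actual.mean()) ** 2).sum()
r2 = 1 - ss_res / ss_tot

print(f"MAE: {mae:.1f} kW, R2: {r2:.3f}")
```

In practice you would use sklearn.metrics.mean_absolute_error and r2_score, which implement exactly these formulas.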
MAE: 16.2 kW
The top features should align with power systems physics. Transformer kva_rating typically dominates because it directly sets the local capacity ceiling. Distance from the substation matters because voltage drop and rise scale with impedance, which increases with distance. Existing solar capacity matters because it reduces remaining headroom.
Sanity check: If a feature like age_years appeared as the top predictor, that would be suspicious—transformer age does not directly determine hosting capacity. High importance of physically meaningful features (kva_rating, distance, existing_solar_kw) tells you the model learned real patterns, not spurious correlations.
Sensitivity Analysis
A surrogate model lets you explore "what-if" scenarios instantly. We vary temperature (which affects conductor ratings), load growth, and inverter power factor to see how hosting capacity shifts across the network.
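The scenario pattern is simple: copy the feature table, scale a few columns, and re-predict. In this sketch, predict() is a toy linear stand-in for the trained LightGBM surrogate so the example is self-contained:

```python
import pandas as pd

# Synthetic stand-in for the transformer feature table.
features = pd.DataFrame({
    "peak_load_kw": [200.0, 150.0],
    "existing_solar_kw": [50.0, 20.0],
    "kva_rating": [500.0, 250.0],
})

def predict(df):
    # Toy stand-in surrogate; in the guide this is model.predict(df[feature_cols]).
    return (df["kva_rating"] - df["peak_load_kw"] - df["existing_solar_kw"]).clip(lower=0)

# Each scenario is the baseline table with one or two columns perturbed.
scenarios = {
    "baseline": features,
    "20% load growth": features.assign(peak_load_kw=features["peak_load_kw"] * 1.2),
    "2x PV penetration": features.assign(existing_solar_kw=features["existing_solar_kw"] * 2),
}
for name, df in scenarios.items():
    print(f"{name}: median {predict(df).median():.0f} kW")
```

Because each scenario is a single vectorized predict call, adding more scenarios costs milliseconds, not a full recomputation.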
Baseline: 128 kW (+0.0%)
Summer derating (-15% ampacity): 119 kW (-7.0%)
20% load growth: 108 kW (-15.6%)
2x existing PV penetration: 94 kW (-26.6%)
Why this matters: Recomputing these four scenarios from scratch would require recalculating hosting capacity for 4 × 21,197 transformers with updated load and solar values. With the surrogate model, all four scenarios complete in under one second. This is the power of ML screening: rapid scenario exploration without recomputing from the raw data.
Extrapolation warning: These sensitivity scenarios modify feature values outside the training distribution. Tree-based models like LightGBM cannot extrapolate—they can only predict values within the range of their training data. For a feature pushed beyond the training range, the model will clamp to the nearest leaf value rather than projecting a trend. This means the sensitivity analysis is directionally useful (load growth reduces HC, more PV reduces HC) but the specific magnitudes should not be trusted for extreme scenarios far from the training data. For high-stakes planning decisions under extreme conditions, always validate by recomputing from the source data.
Probabilistic Hosting Capacity with Quantile Regression
Point estimates are useful, but planners need ranges. "The hosting capacity is between 80 and 180 kW with 80% confidence" is more actionable than "the hosting capacity is 130 kW." LightGBM supports quantile regression natively.
Quantile 50% model trained (314 rounds)
Quantile 90% model trained (261 rounds)
80% interval coverage: 82.7% (target: 80%)
Average interval width: 98 kW
Coverage calibration: If the 80% interval covers significantly more or fewer than 80% of actuals, the uncertainty estimates are miscalibrated. Coverage above 80% means the intervals are conservative (wider than needed). Below 80% means the model is overconfident. Ideally, calibrate on a held-out validation set separate from the test set.
Map Hosting Capacity Spatially
Utility planners think spatially. A hosting capacity "heat map" shows at a glance where the grid can absorb more solar and where it cannot. We use the latitude and longitude from the SP&L transformer data to plot predicted hosting capacity geographically.
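A minimal version of the map is a colored scatter over transformer coordinates. This sketch uses random points and random predictions as stand-ins for the real lat/lon and model output, and renders headlessly to a PNG:

```python
import matplotlib
matplotlib.use("Agg")  # headless rendering, no display needed
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for transformer coordinates and predicted hosting capacity.
rng = np.random.default_rng(2)
lat = 33.0 + rng.uniform(0, 0.2, 300)
lon = -112.0 + rng.uniform(0, 0.2, 300)
hc_kw = rng.uniform(0, 500, 300)

fig, ax = plt.subplots(figsize=(7, 6))
# Red-to-green colormap: red = constrained, green = ample headroom.
sc = ax.scatter(lon, lat, c=hc_kw, cmap="RdYlGn", s=12, vmin=0, vmax=500)
fig.colorbar(sc, ax=ax, label="Predicted hosting capacity (kW)")
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
ax.set_title("Predicted hosting capacity by transformer location")
fig.savefig("hosting_capacity_map.png", dpi=100)
print("saved hosting_capacity_map.png")
```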
Green dots indicate transformers with high hosting capacity (safe for new solar). Red dots indicate constrained locations where interconnection requests should trigger a detailed engineering study. The spatial pattern typically shows hosting capacity decreasing with distance from the substation and along heavily loaded laterals.
Benchmark: ML Screening vs Full Recomputation
The whole point of a surrogate model is speed. Let us quantify how much faster ML screening is compared to recomputing hosting capacity from scratch for every transformer. The ML surrogate skips the per-transformer aggregation of solar, load allocation, and capacity arithmetic. For the specific task of screening hosting capacity values, the speed advantage is enormous.
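A simple timing harness for this kind of benchmark uses time.perf_counter averaged over repeated passes. This sketch times only the pandas recomputation path on synthetic data; timing the ML side would follow the same pattern around model.predict:

```python
import time
import numpy as np
import pandas as pd

# Synthetic stand-in at roughly the SP&L scale (~21,000 transformers).
n = 21_000
rng = np.random.default_rng(3)
df = pd.DataFrame({
    "kva_rating": rng.uniform(50, 500, n),
    "existing_solar_kw": rng.uniform(0, 100, n),
    "peak_load_kw": rng.uniform(0, 300, n),
})

def recompute(d):
    # Vectorized hosting-capacity recomputation.
    return (d["kva_rating"] - d["existing_solar_kw"] - d["peak_load_kw"]).clip(lower=0)

# Average over repeated passes to smooth out timer jitter.
t0 = time.perf_counter()
for _ in range(10):
    recompute(df)
elapsed_ms = (time.perf_counter() - t0) / 10 * 1000
print(f"pandas recomputation: {elapsed_ms:.2f} ms per pass")
```

Absolute numbers depend on hardware; the ratio between the two paths is the meaningful result.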
ML surrogate (all 21,197 transformers): 4.8 ms
Pandas recomputation: 312 ms
Speedup: 65x
Transformers Full Recompute ML Model Speedup
----------------------------------------------------------
1,000 15 ms 0.2 ms 65x
5,000 74 ms 1.1 ms 65x
21,000 312 ms 4.8 ms 65x
50,000 742 ms 11.3 ms 65x
100,000 1.5 sec 22.7 ms 65x
When to use which: Use the ML surrogate for initial screening of interconnection queues, scenario planning, and real-time web applications. Use full recomputation from source data for final engineering studies, regulatory filings, and cases where the surrogate's uncertainty interval is too wide. The two approaches are complementary, not competing.
Model Persistence and Feature Engineering Notes
Feature engineering rationale: The 13 features were specifically chosen to represent the three key physical drivers of hosting capacity: electrical distance (dist_from_sub_km, avg_r_ohm_per_mi), equipment limits (kva_rating, min_rated_amps, total_line_mi), and loading conditions (peak_load_kw, load_density_kw_per_xfmr, existing_solar_kw, pv_penetration_pct). Each feature has a clear physical interpretation, which makes the model's predictions auditable and trustworthy for engineering decisions.
What You Built and Next Steps
- Computed hosting capacity from SP&L transformer, solar, and load profile data and framed the results as ML training labels
- Engineered 13 features from network topology, transformer ratings, conductor impedance, and load data
- Trained a LightGBM regression model achieving R² = 0.92 and MAE of ~16 kW
- Ran sensitivity analysis across temperature, load growth, and DER penetration scenarios in under one second
- Built probabilistic hosting capacity estimates using quantile regression with calibrated 80% intervals
- Mapped hosting capacity spatially across the SP&L service territory using transformer lat/lon coordinates
- Demonstrated significant speedup over full recomputation for the specific task of hosting capacity screening (milliseconds vs. hundreds of milliseconds for pandas)
Ideas to Try Next
- Graph neural networks: Use load_network_nodes() and load_network_edges() to encode the feeder topology as a graph and capture adjacency effects between transformers
- Active learning: Use the surrogate model to identify transformers where it is least confident, then run targeted detailed analysis only at those locations
- Time-series hosting capacity: Train on seasonal load profiles from load_load_profiles() to predict how hosting capacity varies by time of day and season
- Transfer learning: Train on one feeder and fine-tune on a new feeder with limited data
- Voltage drop features: Compute estimated voltage drop from edge impedance data (impedance_r_ohm_per_mile, length_miles) as additional input features
Key Terms Glossary
- Surrogate model — a fast ML approximation of a detailed computation; trained on computed outputs to predict results without rerunning the full analysis
- Hosting capacity — maximum DER generation a feeder bus can accept without voltage or thermal violations
- LightGBM — Light Gradient Boosting Machine; a histogram-based gradient boosting framework optimized for speed
- Quantile regression — predicting specific percentiles (e.g., 10th, 90th) instead of the mean, yielding prediction intervals
- R² (coefficient of determination) — fraction of target variance explained by the model; 1.0 is perfect
- MAE (mean absolute error) — average absolute difference between predicted and actual values
- Ampacity — the rated current-carrying capacity of a conductor, in amps
- Per-unit (p.u.) — voltage expressed as a fraction of nominal; ANSI C84.1 range is 0.95–1.05
- DER penetration — ratio of distributed generation capacity to peak load on a feeder