Prerequisite: Complete Guide 03: Hosting Capacity Analysis first. This guide replaces the iterative OpenDSS power flow loop with a trained ML surrogate model that predicts hosting capacity in milliseconds instead of minutes.
What You Will Learn
In Guide 03 you ran OpenDSS power flow hundreds of times to find the hosting capacity at each bus. That works—but it is slow. A single feeder takes minutes; the entire SP&L service territory with 12 feeders and hundreds of buses could take hours. Utilities evaluating thousands of interconnection requests per year need something faster. In this guide you will:
- Use the power flow results from Guide 03 as training labels for an ML model
- Engineer feeder and bus-level features from network topology, transformer ratings, and load data
- Train a LightGBM regression model to predict hosting capacity without running any power flow
- Evaluate the surrogate model with R², MAE, and predicted-vs-actual scatter plots
- Run sensitivity analysis on temperature, load growth, and inverter settings
- Build probabilistic hosting capacity estimates using quantile regression
- Map hosting capacity results spatially across the feeder
- Benchmark ML screening speed against full power flow simulation
What is a surrogate model? A surrogate model is a fast approximation of a slow simulation. You run the expensive simulation (OpenDSS power flow) enough times to build a training dataset, then train an ML model on that data. Once trained, the ML model produces predictions in milliseconds—without touching the power flow engine. The key insight: simulation outputs become ML training labels.
SP&L Data You Will Use
- network/coordinates.csv — bus XY locations for spatial features and mapping
- assets/transformers.csv — transformer kVA ratings, impedance, and install year
- assets/conductors.csv — conductor ampacity, length, and resistance per mile
- timeseries/substation_load_hourly.parquet — hourly substation load profiles
- timeseries/pv_generation.parquet — solar generation profiles for existing DER penetration
Additional Libraries
lightgbm is Microsoft's gradient boosting framework. It is faster than XGBoost on large datasets, supports quantile regression natively, and handles categorical features without one-hot encoding.
Load Power Flow Results and Network Data
The hosting capacity values you computed in Guide 03 (kW per bus before voltage or thermal violation) become the labels for supervised learning. We also load the network topology and asset data that will become our features.
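The loading step reduces to a merge of the power flow results onto the network tables by bus id. A minimal sketch of that join pattern, using tiny inline stand-in frames (the real guide reads network/coordinates.csv and Guide 03's hca_results.csv; the values below are illustrative):

```python
import pandas as pd

# Synthetic stand-ins for the SP&L files -- values are illustrative only
coords = pd.DataFrame({"bus_id": ["b1", "b2", "b3"],
                       "x": [0.0, 1.2, 2.4], "y": [0.0, 0.8, 0.3]})
hca = pd.DataFrame({"bus_id": ["b1", "b2", "b3"],
                    "hosting_capacity_kw": [1200.0, 700.0, 350.0]})

# Join the power-flow results onto the network data by bus id --
# the simulation outputs become the supervised-learning labels
df = coords.merge(hca, on="bus_id", how="inner")
print(f"Buses with labels: {len(df)}")
print(df["hosting_capacity_kw"].describe())
```

An inner join drops any bus that is missing either coordinates or a simulated label, which is usually what you want before training.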
Coordinates: 487 buses
Transformers: 312 units
Conductors: 474 segments
Hosting capacity summary (kW):
count 487.0
mean 742.3
std 418.6
min 50.0
25% 400.0
50% 700.0
75% 1050.0
max 2000.0
Simulation as labels: This is the key concept of surrogate modeling. The OpenDSS power flow results are not features—they are the target variable. The model learns to predict what the simulation would have said based on network characteristics alone. If you have not generated hca_results.csv yet, go back to Guide 03 and run the analysis across all feeders and buses.
Build Feeder and Bus Features
A good surrogate model needs features that capture the physical factors driving hosting capacity: how far the bus is from the substation, how much load is nearby, how stiff the local network is. We engineer six groups of features from the raw data.
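As a sketch, here are two of the feature groups (electrical distance and equipment limits) computed on synthetic stand-in data; the column names follow the guide's conventions, but the exact recipe is an assumption:

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins; real data comes from coordinates.csv and transformers.csv
coords = pd.DataFrame({"bus_id": ["sub", "b1", "b2"],
                       "x_km": [0.0, 1.0, 3.0], "y_km": [0.0, 0.5, 1.0]})
xfmrs = pd.DataFrame({"bus_id": ["b1", "b1", "b2"],
                      "kva": [500.0, 250.0, 167.0]})

sub = coords.loc[coords.bus_id == "sub", ["x_km", "y_km"]].iloc[0]
feats = coords[coords.bus_id != "sub"].copy()
# Straight-line distance from the substation as a proxy for electrical distance
feats["dist_from_sub_km"] = np.hypot(feats.x_km - sub.x_km,
                                     feats.y_km - sub.y_km)
# Total transformer kVA at each bus; buses without a transformer get 0,
# which is physically correct for "no equipment present"
kva = xfmrs.groupby("bus_id")["kva"].sum().rename("total_kva")
feats = feats.merge(kva, on="bus_id", how="left").fillna({"total_kva": 0.0})
print(feats[["bus_id", "dist_from_sub_km", "total_kva"]])
```

The same groupby-then-merge pattern extends to the conductor and load features.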
Features per bus: 13
Missing values in tree-based models: LightGBM (and XGBoost) can handle missing values natively—during tree construction, they learn which direction to route NaN values at each split. In this guide we use fillna(0) because for buses without a transformer, zero kVA and zero impedance are physically correct. But if your data has missing values for a different reason (e.g., a sensor failure, not the absence of equipment), keeping them as NaN and letting LightGBM learn the optimal routing is often a better approach.
Prepare Training Labels and Split
The target variable is hosting_capacity_kw—the maximum kW of solar a bus can accept before hitting a voltage (>1.05 p.u.) or thermal (>100% loading) violation. This is a regression problem: we predict a continuous value, not a category.
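A minimal sketch of the feeder-stratified split, on synthetic stand-in data (the feeder_id column and the 80/20 ratio are assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 12 feeders x 40 buses, matching the SP&L scale
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "feeder_id": np.repeat([f"F{i:02d}" for i in range(12)], 40),
    "dist_from_sub_km": rng.uniform(0, 8, 480),
    "hosting_capacity_kw": rng.uniform(50, 2000, 480),
})
# stratify= guarantees every feeder appears in both train and test
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42,
                                     stratify=df["feeder_id"])
print(f"Test set: {len(test_df)} buses")
print(f"Feeders in both splits: "
      f"{set(train_df.feeder_id) == set(test_df.feeder_id)}")
```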
Test set: 98 buses
Target distribution (training):
Mean: 738 kW
Std: 421 kW
Min: 50 kW
Max: 2000 kW
Why not a time-aware split here? Unlike outage prediction (Guide 09) where events happen over time, hosting capacity is a property of the physical network at a point in time. We stratify by feeder instead, ensuring every feeder appears in both train and test sets so the model sees the diversity of network topologies.
Train a LightGBM Regression Model
LightGBM uses gradient-boosted decision trees with histogram-based splitting for speed. For hosting capacity prediction, regression is the right objective: we want to predict a continuous kW value, not a category.
[200] train's mae: 38.1 test's mae: 59.4
[300] train's mae: 31.6 test's mae: 55.8
Early stopping at iteration 342
Best iteration: 292
Best test MAE: 54.3 kW
Interpreting the MAE in context: The hosting capacity in the SP&L dataset ranges from 50 kW to 2,000 kW with a median of 700 kW and interquartile range of 400–1,050 kW. An MAE of ~54 kW means the model is off by about 7–8% relative to the typical bus. For well-served buses near the substation (HC > 1,000 kW), this error is just 3–5%—excellent for screening. But for constrained buses at the low end (HC ~50–200 kW), the same 54 kW absolute error becomes a 27–100% relative error. In practice, these constrained buses are the ones that matter most for interconnection decisions. Use the quantile regression intervals (Step 7) to flag uncertain cases for full power flow follow-up.
Evaluate with R², MAE, and Scatter Plot
A good surrogate model should show predictions tightly clustered around the 45-degree line (predicted = actual). We also check feature importance to validate the model learned physically meaningful patterns.
MAE: 54.3 kW
R²: 0.92
The top features should align with power systems physics. Distance from the substation typically dominates because voltage drop and rise scale with impedance, which increases with distance. Conductor ampacity matters because it sets the thermal limit. Transformer kVA determines the local capacity ceiling.
Sanity check: If a feature like xfmr_age_years appeared as the top predictor, that would be suspicious—transformer age does not directly determine hosting capacity. High importance of physically meaningful features (distance, impedance, ampacity) tells you the model learned real patterns, not spurious correlations.
Sensitivity Analysis
A surrogate model lets you explore "what-if" scenarios instantly. We vary temperature (which affects conductor ratings), load growth, and inverter power factor to see how hosting capacity shifts across the network.
Baseline: 712 kW (+0.0%)
Summer derating (-15% ampacity): 638 kW (-10.4%)
20% load growth: 589 kW (-17.3%)
2x existing PV penetration: 521 kW (-26.8%)
Why this matters: Running these four scenarios with OpenDSS would require 4 × 487 buses × 40 PV steps = ~78,000 power flow solves. With the surrogate model, all four scenarios complete in under one second. This is the power of ML screening: rapid scenario exploration without the computational cost of full simulation.
Extrapolation warning: These sensitivity scenarios modify feature values outside the training distribution. Tree-based models like LightGBM cannot extrapolate—they can only predict values within the range of their training data. For a feature pushed beyond the training range, the model will clamp to the nearest leaf value rather than projecting a trend. This means the sensitivity analysis is directionally useful (load growth reduces HC, more PV reduces HC) but the specific magnitudes should not be trusted for extreme scenarios far from the training data. For high-stakes planning decisions under extreme conditions, always validate with full power flow simulation.
Probabilistic Hosting Capacity with Quantile Regression
Point estimates are useful, but planners need ranges. "The hosting capacity is between 400 and 900 kW with 80% confidence" is more actionable than "the hosting capacity is 650 kW." LightGBM supports quantile regression natively.
Quantile 10% model trained (314 rounds)
Quantile 90% model trained (261 rounds)
80% interval coverage: 82.7% (target: 80%)
Average interval width: 347 kW
Coverage calibration: If the 80% interval covers significantly more or fewer than 80% of actuals, the uncertainty estimates are miscalibrated. Coverage above 80% means the intervals are conservative (wider than needed). Below 80% means the model is overconfident. Ideally, calibrate on a held-out validation set separate from the test set.
Map Hosting Capacity Spatially
Utility planners think spatially. A hosting capacity "heat map" shows at a glance where the grid can absorb more solar and where it cannot. We use bus coordinates from the SP&L network model to plot predicted hosting capacity geographically.
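A hypothetical plotting sketch using synthetic coordinates and predictions as stand-ins for the SP&L data (the colormap and figure settings are assumptions):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in: 487 buses, HC falling with distance from a central substation
rng = np.random.default_rng(4)
x, y_coord = rng.uniform(0, 10, 487), rng.uniform(0, 10, 487)
hc_pred = np.clip(2000 - 180 * np.hypot(x - 5, y_coord - 5)
                  + rng.normal(0, 100, 487), 50, 2000)

fig, ax = plt.subplots(figsize=(8, 6))
# Red-to-green colormap: red = constrained bus, green = high hosting capacity
sc = ax.scatter(x, y_coord, c=hc_pred, cmap="RdYlGn", s=18,
                vmin=50, vmax=2000)
fig.colorbar(sc, ax=ax, label="Predicted hosting capacity (kW)")
ax.set_xlabel("X (km)"); ax.set_ylabel("Y (km)")
ax.set_title("Predicted hosting capacity by bus")
fig.savefig("hc_map.png", dpi=150)
```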
Green dots indicate buses with high hosting capacity (safe for new solar). Red dots indicate constrained buses where interconnection requests should trigger a full engineering study. The spatial pattern typically shows hosting capacity decreasing with distance from the substation and along heavily loaded laterals.
Benchmark: ML Screening vs Full Power Flow
The whole point of a surrogate model is speed. Let's quantify how much faster ML screening is compared to running OpenDSS for every bus. Note that this comparison is for the specific task of screening hosting capacity values—the ML surrogate does not reproduce the full richness of a power flow solution (voltages at every bus, line loadings, loss calculations). For that specific screening task, however, the speed advantage is enormous.
ML surrogate (all 487 buses): 2.3 ms
OpenDSS power flow (estimated): 9740 seconds (162.3 minutes)
Speedup: 4,234,783x
Buses OpenDSS ML Model Speedup
------------------------------------------------
100 33.3 min 0.5 ms 4,234,783x
500 2.8 hrs 2.4 ms 4,234,783x
1,000 5.6 hrs 4.7 ms 4,234,783x
5,000 27.8 hrs 23.7 ms 4,234,783x
10,000 55.6 hrs 47.3 ms 4,234,783x
When to use which: Use the ML surrogate for initial screening of interconnection queues, scenario planning, and real-time web applications. Use full OpenDSS power flow for final engineering studies, regulatory filings, and cases where the surrogate's uncertainty interval is too wide. The two approaches are complementary, not competing.
Model Persistence and Feature Engineering Notes
Feature engineering rationale: The 13 features were specifically chosen to represent the three key physical drivers of hosting capacity: electrical distance (dist_from_sub_km, avg_impedance_pct, cumulative_r_ohm), equipment limits (total_kva, min_ampacity_a, total_line_mi), and loading conditions (peak_load_kw, load_density_kw_per_bus, existing_pv_peak_kw). Each feature has a clear physical interpretation, which makes the model's predictions auditable and trustworthy for engineering decisions.
What You Built and Next Steps
- Loaded power flow simulation results from Guide 03 and framed them as ML training labels
- Engineered 13 features from network topology, transformer ratings, conductor properties, and load data
- Trained a LightGBM regression model achieving R² = 0.92 and MAE of ~54 kW
- Ran sensitivity analysis across temperature, load growth, and DER penetration scenarios in under one second
- Built probabilistic hosting capacity estimates using quantile regression with calibrated 80% intervals
- Mapped hosting capacity spatially across the SP&L service territory
- Demonstrated massive speedup over full power flow simulation for the specific task of hosting capacity screening (~20 seconds/bus for OpenDSS vs. microseconds/bus for ML)
Ideas to Try Next
- Graph neural networks: Encode the feeder topology as a graph to capture adjacency effects between buses
- Active learning: Use the surrogate model to identify buses where it is least confident, then run targeted power flow only at those locations
- Time-series hosting capacity: Train on hourly power flow results to predict how hosting capacity varies by time of day and season
- Transfer learning: Train on one feeder and fine-tune on a new feeder with limited power flow data
- Voltage sensitivity coefficients: Add dV/dP and dV/dQ sensitivity factors from the network Jacobian as input features
Key Terms Glossary
- Surrogate model — a fast ML approximation of a slow physics simulation; trained on simulation outputs
- Hosting capacity — maximum DER generation a feeder bus can accept without voltage or thermal violations
- LightGBM — Light Gradient Boosting Machine; a histogram-based gradient boosting framework optimized for speed
- Quantile regression — predicting specific percentiles (e.g., 10th, 90th) instead of the mean, yielding prediction intervals
- R² (coefficient of determination) — fraction of target variance explained by the model; 1.0 is perfect
- MAE (mean absolute error) — average absolute difference between predicted and actual values
- Ampacity — the rated current-carrying capacity of a conductor, in amps
- Per-unit (p.u.) — voltage expressed as a fraction of nominal; ANSI C84.1 range is 0.95–1.05
- DER penetration — ratio of distributed generation capacity to peak load on a feeder