Prerequisite: Complete Guide 03: Hosting Capacity Analysis first. This guide replaces the iterative capacity analysis with a trained ML surrogate model that predicts hosting capacity in milliseconds.
What You Will Learn
In Guide 03 you computed hosting capacity for each transformer by comparing rated capacity against existing solar and peak load. That works—but it does not scale well to scenario analysis. The entire SP&L service territory has 65 feeders and thousands of nodes. Utilities evaluating thousands of interconnection requests per year need something faster and more flexible. In this guide you will:
- Compute hosting capacity values from SP&L transformer, solar, and load data as training labels for an ML model
- Engineer feeder and transformer-level features from network topology, transformer ratings, and load data
- Train a LightGBM regression model to predict hosting capacity from network features alone
- Evaluate the surrogate model with R², MAE, and predicted-vs-actual scatter plots
- Run sensitivity analysis on temperature, load growth, and inverter settings
- Build probabilistic hosting capacity estimates using quantile regression
- Map hosting capacity results spatially across the feeder
- Benchmark ML screening speed against full recomputation from source data
What is a surrogate model? A surrogate model is a fast approximation of a detailed computation. You run the full hosting capacity analysis (transformer capacity minus existing solar and peak load, with voltage drop estimates) enough times to build a training dataset, then train an ML model on that data. Once trained, the ML model produces predictions in milliseconds—without recomputing from scratch. The key insight: computed results become ML training labels.
SP&L Data You Will Use
- network_nodes.csv (load_network_nodes()) — ~44,000 nodes with latitude/longitude, equipment class, and rated capacity
- network_edges.csv (load_network_edges()) — ~44,000 conductor segments with impedance (R, X), rated amps, length, and conductor type
- transformers.csv (load_transformers()) — ~21,000 transformers with kva_rating, age_years, and location
- solar_installations.csv (load_solar_installations()) — ~17,000 solar installations with capacity_kw per transformer
- load_profiles.csv (load_load_profiles()) — 15-minute feeder load profiles with load_mw for peak load calculations
Additional Libraries
LightGBM is Microsoft's gradient boosting framework. It is often faster than XGBoost on large datasets, supports quantile regression natively, and handles categorical features without one-hot encoding.
Having trouble? Check our Troubleshooting Guide for solutions to common setup and data loading issues.
Load SP&L Data and Compute Hosting Capacity
We load the SP&L network, transformer, solar, and load profile data using the data loader API. Then we compute hosting capacity per transformer as a simplified estimate: rated kVA minus existing solar capacity minus peak load. These computed values become the labels for supervised learning.
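The label computation can be sketched as follows. This is a minimal stand-in: the three small DataFrames below substitute for the real load_transformers() / load_solar_installations() output and a per-transformer peak-load allocation, and the column names for the peak-load table are hypothetical.

```python
import pandas as pd

# Synthetic stand-ins for the SP&L loader outputs (real data has ~21,000 transformers).
transformers = pd.DataFrame({
    "transformer_id": ["T1", "T2", "T3"],
    "kva_rating": [500.0, 250.0, 100.0],
})
solar = pd.DataFrame({
    "transformer_id": ["T1", "T1", "T3"],
    "capacity_kw": [40.0, 60.0, 30.0],
})
peak_load = pd.DataFrame({  # hypothetical per-transformer peak-load allocation
    "transformer_id": ["T1", "T2", "T3"],
    "peak_load_kw": [200.0, 150.0, 90.0],
})

# Aggregate existing solar per transformer, then apply the simplified estimate:
# hosting capacity = rated kVA - existing solar - allocated peak load, floored at 0.
existing = solar.groupby("transformer_id")["capacity_kw"].sum().rename("existing_solar_kw")
df = transformers.merge(existing, on="transformer_id", how="left")
df = df.fillna({"existing_solar_kw": 0.0}).merge(peak_load, on="transformer_id", how="left")
df["hosting_capacity_kw"] = (
    df["kva_rating"] - df["existing_solar_kw"] - df["peak_load_kw"]
).clip(lower=0.0)
print(df[["transformer_id", "hosting_capacity_kw"]])
```

The clip at zero reflects that a fully subscribed transformer has no remaining headroom, not negative capacity.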
Network edges: 44,119
Transformers: 21,197
Solar installations: 17,042
Load profile rows: 174,720
Hosting capacity computed for 21,197 transformers
Across 65 feeders
Hosting capacity summary (kW):
count 21197.0
mean 152.8
std 119.4
min 0.0
25% 62.3
50% 128.7
75% 215.0
max 500.0
Computed results as labels: This is the key concept of surrogate modeling. The hosting capacity values we computed (rated kVA minus existing solar minus peak load) are not features—they are the target variable. The ML model learns to predict what the full computation would produce based on network characteristics alone. This approach generalizes to any detailed analysis you want to accelerate with ML.
Build Transformer and Network Features
A good surrogate model needs features that capture the physical factors driving hosting capacity: how far the transformer is from the substation, how much load is nearby, how stiff the local network is. We engineer six groups of features from the SP&L data.
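Two of those feature groups can be sketched like this, using tiny synthetic stand-ins for the node and edge tables (the helper and output column names here are illustrative, not the guide's exact implementation):

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for load_network_nodes() / load_network_edges() output.
nodes = pd.DataFrame({
    "node_id": ["sub", "A", "B"],
    "feeder_id": ["F1", "F1", "F1"],
    "latitude": [33.00, 33.01, 33.02],
    "longitude": [-112.00, -112.00, -112.01],
})
edges = pd.DataFrame({
    "feeder_id": ["F1", "F1"],
    "impedance_r_ohm_per_mile": [0.3, 0.5],
    "rated_amps": [400, 200],
    "length_miles": [1.2, 0.8],
})

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance in km between (lat1, lon1) and (lat2, lon2).
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

# Electrical-distance proxy: straight-line distance from the substation node.
sub = nodes.iloc[0]
nodes["dist_from_sub_km"] = haversine_km(
    sub.latitude, sub.longitude, nodes["latitude"], nodes["longitude"]
)

# Feeder-level edge aggregates: mean resistance, weakest conductor, total line miles.
feeder_feats = edges.groupby("feeder_id").agg(
    avg_r_ohm_per_mi=("impedance_r_ohm_per_mile", "mean"),
    min_rated_amps=("rated_amps", "min"),
    total_line_mi=("length_miles", "sum"),
)
print(feeder_feats)
```

Straight-line distance is only a proxy for electrical distance along the feeder, but for screening purposes it correlates well with cumulative impedance.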
Transformers with distance: 21,197
Feeder edge features: 65 feeders
Final dataset: 21,197 rows x 20 columns
Features available: 13
Missing values in tree-based models: LightGBM (and XGBoost) can handle missing values natively—during tree construction, they learn which direction to route NaN values at each split. In this guide we use fillna(0) because for transformers without edge data, zero impedance is a reasonable default. But if your data has missing values for a different reason (e.g., a sensor failure, not the absence of equipment), keeping them as NaN and letting LightGBM learn the optimal routing is often a better approach.
Prepare Training Labels and Split
The target variable is hosting_capacity_kw—the estimated remaining capacity at each transformer after accounting for existing solar and allocated peak load. This is a regression problem: we predict a continuous value, not a category.
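The feeder-stratified split can be sketched with pure pandas on a synthetic stand-in table (scikit-learn's train_test_split with stratify=data["feeder_id"] achieves the same thing):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: 5 feeders x 40 transformers each.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "feeder_id": np.repeat([f"F{i}" for i in range(5)], 40),
    "kva_rating": rng.choice([100.0, 250.0, 500.0], size=200),
    "hosting_capacity_kw": rng.uniform(0.0, 500.0, size=200),
})

# Hold out 20% of transformers *within each feeder*, so every feeder
# appears in both the train and test sets.
test = data.groupby("feeder_id", group_keys=False).sample(frac=0.2, random_state=42)
train = data.drop(test.index)

assert set(train["feeder_id"]) == set(test["feeder_id"])
print(f"Train: {len(train)}, Test: {len(test)}")
```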
Test set: 4,240 transformers
Target distribution (training):
Mean: 153 kW
Std: 119 kW
Min: 0 kW
Max: 500 kW
Why not a time-aware split here? Unlike outage prediction (Guide 09) where events happen over time, hosting capacity is a property of the physical network at a point in time. We stratify by feeder instead, ensuring every feeder appears in both train and test sets so the model sees the diversity of network topologies across all 65 SP&L feeders.
Train a LightGBM Regression Model
LightGBM uses gradient-boosted decision trees with histogram-based splitting for speed. For hosting capacity prediction, regression is the right objective: we want to predict a continuous kW value, not a category.
[200] train's mae: 12.7 test's mae: 18.3
[300] train's mae: 10.2 test's mae: 16.8
Early stopping at iteration 342
Best iteration: 292
Best test MAE: 16.2 kW
Interpreting the MAE in context: The hosting capacity in the SP&L dataset ranges from 0 kW to 500 kW with a median around 129 kW. An MAE of ~16 kW means the model is off by about 10–12% relative to the typical transformer. For transformers with ample remaining capacity (>300 kW), this error is just 3–5%—excellent for screening. But for constrained transformers at the low end (HC ~0–50 kW), the same 16 kW absolute error becomes a larger relative error. In practice, these constrained locations are the ones that matter most for interconnection decisions. Use the quantile regression intervals (Step 7) to flag uncertain cases for detailed follow-up.
Evaluate with R², MAE, and Scatter Plot
A good surrogate model should show predictions tightly clustered around the 45-degree line (predicted = actual). We also check feature importance to validate the model learned physically meaningful patterns.
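Both metrics are straightforward to compute from predicted and actual arrays. A toy sketch (the arrays stand in for real model output):

```python
import numpy as np

# Toy stand-ins for actual and predicted hosting capacity (kW).
actual = np.array([100.0, 200.0, 300.0, 400.0])
pred = np.array([110.0, 190.0, 320.0, 380.0])

# MAE: average absolute error.
mae = np.abs(pred - actual).mean()

# R^2: 1 minus residual variance over total variance.
ss_res = ((actual - pred) ** 2).sum()
ss_tot = ((actual - actual.mean()) ** 2).sum()
r2 = 1 - ss_res / ss_tot

print(f"MAE: {mae:.1f} kW, R2: {r2:.3f}")
```

In practice you would use sklearn.metrics.mean_absolute_error and r2_score, which implement exactly these formulas.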
MAE: 16.2 kW
The top features should align with power systems physics. Transformer kva_rating typically dominates because it directly sets the local capacity ceiling. Distance from the substation matters because voltage drop and rise scale with impedance, which increases with distance. Existing solar capacity matters because it reduces remaining headroom.
Sanity check: If a feature like age_years appeared as the top predictor, that would be suspicious—transformer age does not directly determine hosting capacity. High importance of physically meaningful features (kva_rating, distance, existing_solar_kw) tells you the model learned real patterns, not spurious correlations.
Sensitivity Analysis
A surrogate model lets you explore "what-if" scenarios instantly. We vary temperature (which affects conductor ratings), load growth, and inverter power factor to see how hosting capacity shifts across the network.
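The scenario pattern is simple: copy the feature table, scale a few columns, and re-predict. In this sketch, predict() is a toy linear stand-in for the trained LightGBM surrogate so the example is self-contained:

```python
import pandas as pd

# Synthetic stand-in for the transformer feature table.
features = pd.DataFrame({
    "peak_load_kw": [200.0, 150.0],
    "existing_solar_kw": [50.0, 20.0],
    "kva_rating": [500.0, 250.0],
})

def predict(df):
    # Toy stand-in surrogate; in the guide this is model.predict(df[feature_cols]).
    return (df["kva_rating"] - df["peak_load_kw"] - df["existing_solar_kw"]).clip(lower=0)

# Each scenario is the baseline table with one or two columns perturbed.
scenarios = {
    "baseline": features,
    "20% load growth": features.assign(peak_load_kw=features["peak_load_kw"] * 1.2),
    "2x PV penetration": features.assign(existing_solar_kw=features["existing_solar_kw"] * 2),
}
for name, df in scenarios.items():
    print(f"{name}: median {predict(df).median():.0f} kW")
```

Because each scenario is a single vectorized predict call, adding more scenarios costs milliseconds, not a full recomputation.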
Baseline: 128 kW (+0.0%)
Summer derating (-15% ampacity): 119 kW (-7.0%)
20% load growth: 108 kW (-15.6%)
2x existing PV penetration: 94 kW (-26.6%)
Why this matters: Recomputing these four scenarios from scratch would require recalculating hosting capacity for 4 × 21,197 transformers with updated load and solar values. With the surrogate model, all four scenarios complete in under one second. This is the power of ML screening: rapid scenario exploration without recomputing from the raw data.
Extrapolation warning: These sensitivity scenarios modify feature values outside the training distribution. Tree-based models like LightGBM cannot extrapolate—they can only predict values within the range of their training data. For a feature pushed beyond the training range, the model will clamp to the nearest leaf value rather than projecting a trend. This means the sensitivity analysis is directionally useful (load growth reduces HC, more PV reduces HC) but the specific magnitudes should not be trusted for extreme scenarios far from the training data. For high-stakes planning decisions under extreme conditions, always validate by recomputing from the source data.
Probabilistic Hosting Capacity with Quantile Regression
Point estimates are useful, but planners need ranges. "The hosting capacity is between 80 and 180 kW with 80% confidence" is more actionable than "the hosting capacity is 130 kW." LightGBM supports quantile regression natively.
Quantile 50% model trained (314 rounds)
Quantile 90% model trained (261 rounds)
80% interval coverage: 82.7% (target: 80%)
Average interval width: 98 kW
Coverage calibration: If the 80% interval covers significantly more or fewer than 80% of actuals, the uncertainty estimates are miscalibrated. Coverage above 80% means the intervals are conservative (wider than needed). Below 80% means the model is overconfident. Ideally, calibrate on a held-out validation set separate from the test set.
Map Hosting Capacity Spatially
Utility planners think spatially. A hosting capacity "heat map" shows at a glance where the grid can absorb more solar and where it cannot. We use the latitude and longitude from the SP&L transformer data to plot predicted hosting capacity geographically.
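A minimal version of the map is a colored scatter over transformer coordinates. This sketch uses random points and random predictions as stand-ins for the real lat/lon and model output, and renders headlessly to a PNG:

```python
import matplotlib
matplotlib.use("Agg")  # headless rendering, no display needed
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for transformer coordinates and predicted hosting capacity.
rng = np.random.default_rng(2)
lat = 33.0 + rng.uniform(0, 0.2, 300)
lon = -112.0 + rng.uniform(0, 0.2, 300)
hc_kw = rng.uniform(0, 500, 300)

fig, ax = plt.subplots(figsize=(7, 6))
# Red-to-green colormap: red = constrained, green = ample headroom.
sc = ax.scatter(lon, lat, c=hc_kw, cmap="RdYlGn", s=12, vmin=0, vmax=500)
fig.colorbar(sc, ax=ax, label="Predicted hosting capacity (kW)")
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
ax.set_title("Predicted hosting capacity by transformer location")
fig.savefig("hosting_capacity_map.png", dpi=100)
print("saved hosting_capacity_map.png")
```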
Green dots indicate transformers with high hosting capacity (safe for new solar). Red dots indicate constrained locations where interconnection requests should trigger a detailed engineering study. The spatial pattern typically shows hosting capacity decreasing with distance from the substation and along heavily loaded laterals.
Benchmark: ML Screening vs Full Recomputation
The whole point of a surrogate model is speed. Let us quantify how much faster ML screening is compared to recomputing hosting capacity from scratch for every transformer. The ML surrogate skips the per-transformer aggregation of solar, load allocation, and capacity arithmetic. For the specific task of screening hosting capacity values, the speed advantage is enormous.
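A simple timing harness for this kind of benchmark uses time.perf_counter averaged over repeated passes. This sketch times only the pandas recomputation path on synthetic data; timing the ML side would follow the same pattern around model.predict:

```python
import time
import numpy as np
import pandas as pd

# Synthetic stand-in at roughly the SP&L scale (~21,000 transformers).
n = 21_000
rng = np.random.default_rng(3)
df = pd.DataFrame({
    "kva_rating": rng.uniform(50, 500, n),
    "existing_solar_kw": rng.uniform(0, 100, n),
    "peak_load_kw": rng.uniform(0, 300, n),
})

def recompute(d):
    # Vectorized hosting-capacity recomputation.
    return (d["kva_rating"] - d["existing_solar_kw"] - d["peak_load_kw"]).clip(lower=0)

# Average over repeated passes to smooth out timer jitter.
t0 = time.perf_counter()
for _ in range(10):
    recompute(df)
elapsed_ms = (time.perf_counter() - t0) / 10 * 1000
print(f"pandas recomputation: {elapsed_ms:.2f} ms per pass")
```

Absolute numbers depend on hardware; the ratio between the two paths is the meaningful result.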
ML surrogate (all 21,197 transformers): 4.8 ms
Pandas recomputation: 312 ms
Speedup: 65x
Transformers Full Recompute ML Model Speedup
----------------------------------------------------------
1,000 15 ms 0.2 ms 65x
5,000 74 ms 1.1 ms 65x
21,000 312 ms 4.8 ms 65x
50,000 742 ms 11.3 ms 65x
100,000 1.5 sec 22.7 ms 65x
When to use which: Use the ML surrogate for initial screening of interconnection queues, scenario planning, and real-time web applications. Use full recomputation from source data for final engineering studies, regulatory filings, and cases where the surrogate's uncertainty interval is too wide. The two approaches are complementary, not competing.
Model Persistence and Feature Engineering Notes
Feature engineering rationale: The 13 features were specifically chosen to represent the three key physical drivers of hosting capacity: electrical distance (dist_from_sub_km, avg_r_ohm_per_mi), equipment limits (kva_rating, min_rated_amps, total_line_mi), and loading conditions (peak_load_kw, load_density_kw_per_xfmr, existing_solar_kw, pv_penetration_pct). Each feature has a clear physical interpretation, which makes the model's predictions auditable and trustworthy for engineering decisions.
What You Built and Next Steps
- Computed hosting capacity from SP&L transformer, solar, and load profile data and framed the results as ML training labels
- Engineered 13 features from network topology, transformer ratings, conductor impedance, and load data
- Trained a LightGBM regression model achieving R² = 0.92 and MAE of ~16 kW
- Ran sensitivity analysis across temperature, load growth, and DER penetration scenarios in under one second
- Built probabilistic hosting capacity estimates using quantile regression with calibrated 80% intervals
- Mapped hosting capacity spatially across the SP&L service territory using transformer lat/lon coordinates
- Demonstrated significant speedup over full recomputation for the specific task of hosting capacity screening (milliseconds vs. hundreds of milliseconds for pandas)
Ideas to Try Next
- Graph neural networks: Use load_network_nodes() and load_network_edges() to encode the feeder topology as a graph and capture adjacency effects between transformers
- Active learning: Use the surrogate model to identify transformers where it is least confident, then run targeted detailed analysis only at those locations
- Time-series hosting capacity: Train on seasonal load profiles from load_load_profiles() to predict how hosting capacity varies by time of day and season
- Transfer learning: Train on one feeder and fine-tune on a new feeder with limited data
- Voltage drop features: Compute estimated voltage drop from edge impedance data (impedance_r_ohm_per_mile, length_miles) as additional input features
Key Terms Glossary
- Surrogate model — a fast ML approximation of a detailed computation; trained on computed outputs to predict results without rerunning the full analysis
- Hosting capacity — maximum DER generation a feeder bus can accept without voltage or thermal violations
- LightGBM — Light Gradient Boosting Machine; a histogram-based gradient boosting framework optimized for speed
- Quantile regression — predicting specific percentiles (e.g., 10th, 90th) instead of the mean, yielding prediction intervals
- R² (coefficient of determination) — fraction of target variance explained by the model; 1.0 is perfect
- MAE (mean absolute error) — average absolute difference between predicted and actual values
- Ampacity — the rated current-carrying capacity of a conductor, in amps
- Per-unit (p.u.) — voltage expressed as a fraction of nominal; ANSI C84.1 range is 0.95–1.05
- DER penetration — ratio of distributed generation capacity to peak load on a feeder