Guide 11: ML-Accelerated Hosting Capacity Screening - ML Playground

Prerequisite: Complete Guide 03: Hosting Capacity Analysis first. This guide replaces the iterative OpenDSS power flow loop with a trained ML surrogate model that predicts hosting capacity in milliseconds instead of minutes.

What You Will Learn

In Guide 03 you ran OpenDSS power flow hundreds of times to find the hosting capacity at each bus. That works—but it is slow. A single feeder takes minutes; the entire SP&L service territory with 12 feeders and hundreds of buses could take hours. Utilities evaluating thousands of interconnection requests per year need something faster. In this guide you will:

Use the power flow results from Guide 03 as training labels for an ML model
Engineer feeder and bus-level features from network topology, transformer ratings, and load data
Train a LightGBM regression model to predict hosting capacity without running any power flow
Evaluate the surrogate model with R², MAE, and predicted-vs-actual scatter plots
Run sensitivity analysis on temperature, load growth, and inverter settings
Build probabilistic hosting capacity estimates using quantile regression
Map hosting capacity results spatially across the feeder
Benchmark ML screening speed against full power flow simulation

What is a surrogate model? A surrogate model is a fast approximation of a slow simulation. You run the expensive simulation (OpenDSS power flow) enough times to build a training dataset, then train an ML model on that data. Once trained, the ML model produces predictions in milliseconds—without touching the power flow engine. The key insight: simulation outputs become ML training labels.

SP&L Data You Will Use

network/coordinates.csv — bus XY locations for spatial features and mapping
assets/transformers.csv — transformer kVA ratings, impedance, and install year
assets/conductors.csv — conductor ampacity, length, and resistance per mile
timeseries/substation_load_hourly.parquet — hourly substation load profiles
timeseries/pv_generation.parquet — solar generation profiles for existing DER penetration

Additional Libraries

pip install lightgbm

lightgbm is Microsoft's gradient boosting framework. It is faster than XGBoost on large datasets, supports quantile regression natively, and handles categorical features without one-hot encoding.

Load Power Flow Results and Network Data

The hosting capacity values you computed in Guide 03 (kW per bus before voltage or thermal violation) become the labels for supervised learning. We also load the network topology and asset data that will become our features.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score
import time

DATA_DIR = "sisyphean-power-and-light/"

# Load power flow results from Guide 03
# Each row: bus_name, feeder_id, hosting_capacity_kw,
#           voltage_at_limit, thermal_loading_at_limit
hca_results = pd.read_csv(DATA_DIR + "network/hca_results.csv")

# Load network data
coords = pd.read_csv(DATA_DIR + "network/coordinates.csv")
transformers = pd.read_csv(DATA_DIR + "assets/transformers.csv")
conductors = pd.read_csv(DATA_DIR + "assets/conductors.csv")

# Load time-series data
sub_load = pd.read_parquet(DATA_DIR + "timeseries/substation_load_hourly.parquet")
pv_gen = pd.read_parquet(DATA_DIR + "timeseries/pv_generation.parquet")

print(f"HCA results: {len(hca_results):,} buses across {hca_results['feeder_id'].nunique()} feeders")
print(f"Coordinates: {len(coords):,} buses")
print(f"Transformers: {len(transformers):,} units")
print(f"Conductors: {len(conductors):,} segments")
print(f"\nHosting capacity summary (kW):")
print(hca_results["hosting_capacity_kw"].describe().round(1))
                    

HCA results: 487 buses across 12 feeders
Coordinates: 487 buses
Transformers: 312 units
Conductors: 474 segments

Hosting capacity summary (kW):
count 487.0
mean 742.3
std 418.6
min 50.0
25% 400.0
50% 700.0
75% 1050.0
max 2000.0

Simulation as labels: This is the key concept of surrogate modeling. The OpenDSS power flow results are not features—they are the target variable. The model learns to predict what the simulation would have said based on network characteristics alone. If you have not generated hca_results.csv yet, go back to Guide 03 and run the analysis across all feeders and buses.

Build Feeder and Bus Features

A good surrogate model needs features that capture the physical factors driving hosting capacity: how far the bus is from the substation, how much load is nearby, how stiff the local network is. We engineer six groups of features from the raw data.

# --- Feature Group 1: Distance from substation ---
# The substation is at the origin (0,0) of each feeder's coordinate system
sub_coords = coords[coords["bus_name"].str.endswith("_sub")]
coords_with_sub = coords.merge(
    sub_coords[["feeder_id", "x", "y"]].rename(
        columns={"x": "sub_x", "y": "sub_y"}
    ),
    on="feeder_id", how="left"
)
coords_with_sub["dist_from_sub_km"] = np.sqrt(
    (coords_with_sub["x"] - coords_with_sub["sub_x"]) ** 2 +
    (coords_with_sub["y"] - coords_with_sub["sub_y"]) ** 2
)

# --- Feature Group 2: Transformer capacity ---
bus_xfmr = transformers.groupby("bus_name").agg({
    "kva_rating":      ["sum", "max"],
    "impedance_pct":   "mean",
    "install_year":    "min",
}).reset_index()
bus_xfmr.columns = ["bus_name", "total_kva", "max_kva",
                     "avg_impedance_pct", "oldest_install_year"]
bus_xfmr["xfmr_age_years"] = 2025 - bus_xfmr["oldest_install_year"]

# --- Feature Group 3: Conductor ampacity and impedance ---
bus_cond = conductors.groupby("from_bus").agg({
    "ampacity_a":       "min",    # bottleneck conductor
    "resistance_ohm_mi": "sum",   # cumulative impedance
    "length_mi":        "sum",    # total line length
}).reset_index()
bus_cond.columns = ["bus_name", "min_ampacity_a",
                     "cumulative_r_ohm", "total_line_mi"]

print("Feature groups created:")
print(f"  Distance features: {len(coords_with_sub):,} buses")
print(f"  Transformer features: {len(bus_xfmr):,} buses")
print(f"  Conductor features: {len(bus_cond):,} buses")
                    

# --- Feature Group 4: Load density ---
# Peak load from substation time series, allocated per feeder
feeder_peak = sub_load.groupby("feeder_id")["load_kw"].max().reset_index()
feeder_peak.columns = ["feeder_id", "peak_load_kw"]

# Count buses per feeder to compute load density
bus_count = coords.groupby("feeder_id").size().reset_index(name="n_buses")
feeder_stats = feeder_peak.merge(bus_count, on="feeder_id")
feeder_stats["load_density_kw_per_bus"] = (
    feeder_stats["peak_load_kw"] / feeder_stats["n_buses"]
)

# --- Feature Group 5: Existing DER penetration ---
feeder_pv = pv_gen.groupby("feeder_id")["generation_kw"].max().reset_index()
feeder_pv.columns = ["feeder_id", "existing_pv_peak_kw"]
feeder_stats = feeder_stats.merge(feeder_pv, on="feeder_id", how="left")
feeder_stats["existing_pv_peak_kw"] = feeder_stats["existing_pv_peak_kw"].fillna(0)
feeder_stats["pv_penetration_pct"] = (
    feeder_stats["existing_pv_peak_kw"] / feeder_stats["peak_load_kw"] * 100
)

# --- Feature Group 6: Merge everything into one table ---
df = hca_results.merge(
    coords_with_sub[["bus_name", "feeder_id", "dist_from_sub_km"]],
    on=["bus_name", "feeder_id"], how="left"
)
df = df.merge(bus_xfmr, on="bus_name", how="left")
df = df.merge(bus_cond, on="bus_name", how="left")
df = df.merge(feeder_stats, on="feeder_id", how="left")

# Fill missing transformer/conductor features for buses without them
# NOTE: LightGBM handles NaN natively by routing missing values to the
# optimal split direction. Using fillna(0) is a deliberate choice here
# because a bus with no transformer data genuinely has zero local kVA,
# zero impedance, and zero age. However, be careful with this pattern:
# if 0 has a meaningful non-missing interpretation for a feature
# (e.g., xfmr_age_years=0 could imply a brand-new transformer rather
# than the absence of a transformer), consider using a sentinel value
# like -1 or keeping NaN and letting LightGBM handle it directly.
df = df.fillna(0)

print(f"\nFinal dataset: {len(df):,} rows x {len(df.columns)} columns")
print(f"Features per bus: {len(df.columns) - 3}")  # subtract bus_name, feeder_id, target
                    

Final dataset: 487 rows x 16 columns
Features per bus: 13

Missing values in tree-based models: LightGBM (and XGBoost) can handle missing values natively—during tree construction, they learn which direction to route NaN values at each split. In this guide we use fillna(0) because for buses without a transformer, zero kVA and zero impedance are physically correct. But if your data has missing values for a different reason (e.g., a sensor failure, not the absence of equipment), keeping them as NaN and letting LightGBM learn the optimal routing is often a better approach.

Prepare Training Labels and Split

The target variable is hosting_capacity_kw—the maximum kW of solar a bus can accept before hitting a voltage (>1.05 p.u.) or thermal (>100% loading) violation. This is a regression problem: we predict a continuous value, not a category.

# Define features and target
feature_cols = [
    "dist_from_sub_km",
    "total_kva", "max_kva", "avg_impedance_pct", "xfmr_age_years",
    "min_ampacity_a", "cumulative_r_ohm", "total_line_mi",
    "peak_load_kw", "n_buses", "load_density_kw_per_bus",
    "existing_pv_peak_kw", "pv_penetration_pct",
]
target_col = "hosting_capacity_kw"

X = df[feature_cols]
y = df[target_col]

# 80/20 train/test split stratified by feeder to ensure all feeders
# are represented in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42,
    stratify=df["feeder_id"]
)

print(f"Training set: {len(X_train):,} buses")
print(f"Test set:     {len(X_test):,} buses")
print(f"\nTarget distribution (training):")
print(f"  Mean: {y_train.mean():.0f} kW")
print(f"  Std:  {y_train.std():.0f} kW")
print(f"  Min:  {y_train.min():.0f} kW")
print(f"  Max:  {y_train.max():.0f} kW")
                    

Training set: 389 buses
Test set: 98 buses

Target distribution (training):
  Mean: 738 kW
  Std: 421 kW
  Min: 50 kW
  Max: 2000 kW

Why not a time-aware split here? Unlike outage prediction (Guide 09) where events happen over time, hosting capacity is a property of the physical network at a point in time. We stratify by feeder instead, ensuring every feeder appears in both train and test sets so the model sees the diversity of network topologies.

Train a LightGBM Regression Model

LightGBM uses gradient-boosted decision trees with histogram-based splitting for speed. For hosting capacity prediction, regression is the right objective: we want to predict a continuous kW value, not a category.

# Create LightGBM datasets
train_data = lgb.Dataset(X_train, label=y_train,
                          feature_name=feature_cols)
test_data = lgb.Dataset(X_test, label=y_test,
                         reference=train_data)

# Model parameters
params = {
    "objective":       "regression",
    "metric":          "mae",
    "boosting_type":   "gbdt",
    "num_leaves":      31,
    "learning_rate":   0.05,
    "feature_fraction": 0.8,
    "bagging_fraction": 0.8,
    "bagging_freq":    5,
    "min_child_samples": 10,
    "verbose":         -1,
    "seed":            42,
}

# Train with early stopping
callbacks = [
    lgb.early_stopping(stopping_rounds=50),
    lgb.log_evaluation(period=100),
]

model = lgb.train(
    params,
    train_data,
    num_boost_round=1000,
    valid_sets=[train_data, test_data],
    valid_names=["train", "test"],
    callbacks=callbacks,
)

print(f"\nBest iteration: {model.best_iteration}")
print(f"Best test MAE:  {model.best_score['test']['mae']:.1f} kW")
                    

[100] train's mae: 52.3   test's mae: 68.7
[200] train's mae: 38.1   test's mae: 59.4
[300] train's mae: 31.6   test's mae: 55.8
Early stopping at iteration 342

Best iteration: 292
Best test MAE: 54.3 kW

Interpreting the MAE in context: The hosting capacity in the SP&L dataset ranges from 50 kW to 2,000 kW with a median of 700 kW and interquartile range of 400–1,050 kW. An MAE of ~54 kW means the model is off by about 7–8% relative to the typical bus. For well-served buses near the substation (HC > 1,000 kW), this error is just 3–5%—excellent for screening. But for constrained buses at the low end (HC ~50–200 kW), the same 54 kW absolute error becomes a 27–100% relative error. In practice, these constrained buses are the ones that matter most for interconnection decisions. Use the quantile regression intervals (Step 7) to flag uncertain cases for full power flow follow-up.

Evaluate with R², MAE, and Scatter Plot

A good surrogate model should show predictions tightly clustered around the 45-degree line (predicted = actual). We also check feature importance to validate the model learned physically meaningful patterns.

# Predict on test set
y_pred = model.predict(X_test, num_iteration=model.best_iteration)

# Metrics
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
print(f"R-squared: {r2:.3f}")
print(f"MAE:       {mae:.1f} kW")

# Scatter plot: predicted vs actual
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))

# Left: predicted vs actual
ax1.scatter(y_test, y_pred, alpha=0.6, c="#5FCCDB", edgecolor="#2D6A7A", s=40)
ax1.plot([0, 2000], [0, 2000], "r--", linewidth=1.5, label="Perfect prediction")
ax1.set_xlabel("Actual Hosting Capacity (kW)")
ax1.set_ylabel("Predicted Hosting Capacity (kW)")
ax1.set_title(f"Predicted vs Actual (R\u00b2={r2:.3f}, MAE={mae:.0f} kW)")
ax1.legend()
ax1.grid(True, alpha=0.3)

# Right: feature importance
importance = pd.DataFrame({
    "feature": feature_cols,
    "importance": model.feature_importance(importance_type="gain")
}).sort_values("importance", ascending=True)

ax2.barh(importance["feature"], importance["importance"], color="#2D6A7A")
ax2.set_xlabel("Feature Importance (Gain)")
ax2.set_title("LightGBM Feature Importance")

plt.tight_layout()
plt.show()
                    

R-squared: 0.924
MAE: 54.3 kW

The top features should align with power systems physics. Distance from the substation typically dominates because voltage drop and rise scale with impedance, which increases with distance. Conductor ampacity matters because it sets the thermal limit. Transformer kVA determines the local capacity ceiling.

Sanity check: If a feature like xfmr_age_years appeared as the top predictor, that would be suspicious—transformer age does not directly determine hosting capacity. High importance of physically meaningful features (distance, impedance, ampacity) tells you the model learned real patterns, not spurious correlations.

Sensitivity Analysis

A surrogate model lets you explore "what-if" scenarios instantly. We vary temperature (which affects conductor ratings), load growth, and inverter power factor to see how hosting capacity shifts across the network.

# Scenario analysis: vary one feature at a time
baseline = X_test.copy()
scenarios = {}

# Scenario 1: Summer peak (reduce ampacity by 15% due to temperature)
summer_peak = baseline.copy()
summer_peak["min_ampacity_a"] *= 0.85
scenarios["Summer derating (-15% ampacity)"] = summer_peak

# Scenario 2: 20% load growth
load_growth = baseline.copy()
load_growth["peak_load_kw"] *= 1.20
load_growth["load_density_kw_per_bus"] *= 1.20
scenarios["20% load growth"] = load_growth

# Scenario 3: High existing DER (double current PV penetration)
high_der = baseline.copy()
high_der["existing_pv_peak_kw"] *= 2.0
high_der["pv_penetration_pct"] *= 2.0
scenarios["2x existing PV penetration"] = high_der

# Predict hosting capacity for each scenario
results = {"Baseline": model.predict(baseline, num_iteration=model.best_iteration)}
for name, scenario_X in scenarios.items():
    results[name] = model.predict(scenario_X, num_iteration=model.best_iteration)

# Compare median hosting capacity across scenarios
fig, ax = plt.subplots(figsize=(10, 5))
medians = {k: np.median(v) for k, v in results.items()}
colors = ["#5FCCDB", "#E53E3E", "#D69E2E", "#9F7AEA"]
ax.bar(medians.keys(), medians.values(), color=colors)
ax.set_ylabel("Median Hosting Capacity (kW)")
ax.set_title("Sensitivity Analysis: Impact on Hosting Capacity")
plt.xticks(rotation=20, ha="right")
plt.tight_layout()
plt.show()

# Print percentage changes
baseline_median = medians["Baseline"]
print("Scenario impact vs baseline:")
for name, med in medians.items():
    pct = (med - baseline_median) / baseline_median * 100
    print(f"  {name}: {med:.0f} kW ({pct:+.1f}%)")
                    

Scenario impact vs baseline:
  Baseline: 712 kW (+0.0%)
  Summer derating (-15% ampacity): 638 kW (-10.4%)
  20% load growth: 589 kW (-17.3%)
  2x existing PV penetration: 521 kW (-26.8%)

Why this matters: Running these four scenarios with OpenDSS would require 4 × 487 buses × 40 PV steps = ~78,000 power flow solves. With the surrogate model, all four scenarios complete in under one second. This is the power of ML screening: rapid scenario exploration without the computational cost of full simulation.

Extrapolation warning: These sensitivity scenarios modify feature values outside the training distribution. Tree-based models like LightGBM cannot extrapolate—they can only predict values within the range of their training data. For a feature pushed beyond the training range, the model will clamp to the nearest leaf value rather than projecting a trend. This means the sensitivity analysis is directionally useful (load growth reduces HC, more PV reduces HC) but the specific magnitudes should not be trusted for extreme scenarios far from the training data. For high-stakes planning decisions under extreme conditions, always validate with full power flow simulation.

Probabilistic Hosting Capacity with Quantile Regression

Point estimates are useful, but planners need ranges. "The hosting capacity is between 400 and 900 kW with 80% confidence" is more actionable than "the hosting capacity is 650 kW." LightGBM supports quantile regression natively.

# Train quantile models for 10th, 50th, and 90th percentiles
quantile_models = {}

for alpha in [0.10, 0.50, 0.90]:
    q_params = params.copy()
    q_params["objective"] = "quantile"
    q_params["alpha"] = alpha
    q_params["metric"] = "quantile"

    q_model = lgb.train(
        q_params,
        train_data,
        num_boost_round=500,
        valid_sets=[test_data],
        valid_names=["test"],
        callbacks=[lgb.early_stopping(50), lgb.log_evaluation(0)],
    )
    quantile_models[alpha] = q_model
    print(f"  Quantile {alpha:.0%} model trained ({q_model.best_iteration} rounds)")

# Predict quantile ranges
q10 = quantile_models[0.10].predict(X_test)
q50 = quantile_models[0.50].predict(X_test)
q90 = quantile_models[0.90].predict(X_test)

# Plot quantile predictions for a subset of buses
n_show = 25
idx = np.argsort(y_test.values)[:n_show]  # sort by actual HC

fig, ax = plt.subplots(figsize=(14, 6))
x_pos = np.arange(n_show)

# 80% prediction interval (10th to 90th percentile)
ax.fill_between(x_pos, q10[idx], q90[idx], alpha=0.3,
                 color="#5FCCDB", label="80% prediction interval")
ax.plot(x_pos, q50[idx], "o-", color="#2D6A7A",
        linewidth=2, markersize=5, label="Median prediction")
ax.plot(x_pos, y_test.values[idx], "s", color="#E53E3E",
        markersize=6, label="Actual (OpenDSS)")

ax.set_xlabel("Bus (sorted by actual hosting capacity)")
ax.set_ylabel("Hosting Capacity (kW)")
ax.set_title("Probabilistic Hosting Capacity: 80% Prediction Interval")
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Coverage: what fraction of actuals fall within the 80% interval?
coverage = np.mean((y_test.values >= q10) & (y_test.values <= q90))
avg_width = np.mean(q90 - q10)
print(f"\n80% interval coverage: {coverage:.1%} (target: 80%)")
print(f"Average interval width: {avg_width:.0f} kW")
                    

  Quantile 10% model trained (287 rounds)
  Quantile 50% model trained (314 rounds)
  Quantile 90% model trained (261 rounds)

80% interval coverage: 82.7% (target: 80%)
Average interval width: 347 kW

Coverage calibration: If the 80% interval covers significantly more or fewer than 80% of actuals, the uncertainty estimates are miscalibrated. Coverage above 80% means the intervals are conservative (wider than needed). Below 80% means the model is overconfident. Ideally, calibrate on a held-out validation set separate from the test set.

Map Hosting Capacity Spatially

Utility planners think spatially. A hosting capacity "heat map" shows at a glance where the grid can absorb more solar and where it cannot. We use bus coordinates from the SP&L network model to plot predicted hosting capacity geographically.

# Predict hosting capacity for ALL buses (not just test set)
y_all_pred = model.predict(X, num_iteration=model.best_iteration)

# Merge predictions with coordinates
map_df = df[["bus_name", "feeder_id"]].copy()
map_df["hc_predicted_kw"] = y_all_pred
map_df = map_df.merge(coords[["bus_name", "x", "y"]], on="bus_name")

# Spatial hosting capacity map
fig, ax = plt.subplots(figsize=(12, 10))

scatter = ax.scatter(
    map_df["x"], map_df["y"],
    c=map_df["hc_predicted_kw"],
    cmap="RdYlGn",
    s=50,
    edgecolor="#333",
    linewidth=0.5,
    alpha=0.85,
    vmin=0,
    vmax=2000,
)

cbar = plt.colorbar(scatter, ax=ax, shrink=0.8)
cbar.set_label("Predicted Hosting Capacity (kW)", fontsize=12)

# Label substations
subs = map_df[map_df["bus_name"].str.endswith("_sub")]
ax.scatter(subs["x"], subs["y"], marker="^", s=200,
           c="black", zorder=5, label="Substation")

ax.set_xlabel("X Coordinate (km)")
ax.set_ylabel("Y Coordinate (km)")
ax.set_title("SP&L Service Territory: ML-Predicted Hosting Capacity")
ax.legend(loc="upper left", fontsize=11)
ax.grid(True, alpha=0.2)
plt.tight_layout()
plt.show()

# Summary by feeder
feeder_summary = map_df.groupby("feeder_id")["hc_predicted_kw"].agg(
    ["mean", "min", "max"]
).round(0)
feeder_summary.columns = ["Mean HC (kW)", "Min HC (kW)", "Max HC (kW)"]
print("\nHosting capacity by feeder:")
print(feeder_summary)
                    

Green dots indicate buses with high hosting capacity (safe for new solar). Red dots indicate constrained buses where interconnection requests should trigger a full engineering study. The spatial pattern typically shows hosting capacity decreasing with distance from the substation and along heavily loaded laterals.

Benchmark: ML Screening vs Full Power Flow

The whole point of a surrogate model is speed. Let's quantify how much faster ML screening is compared to running OpenDSS for every bus. Note that this comparison is for the specific task of screening hosting capacity values—the ML surrogate does not reproduce the full richness of a power flow solution (voltages at every bus, line loadings, loss calculations). For that specific screening task, however, the speed advantage is enormous.

# Time the ML prediction for all 487 buses
start_ml = time.perf_counter()
for _ in range(100):  # run 100 times for stable measurement
    ml_pred = model.predict(X, num_iteration=model.best_iteration)
end_ml = time.perf_counter()
ml_time_per_run = (end_ml - start_ml) / 100

# Estimated OpenDSS time (from Guide 03 experience)
# ~0.5 seconds per power flow solve, ~40 PV steps per bus
pf_time_per_bus = 0.5 * 40  # 20 seconds per bus
pf_total = pf_time_per_bus * len(X)

print("=== Speed Comparison ===")
print(f"ML surrogate (all {len(X)} buses): {ml_time_per_run*1000:.1f} ms")
print(f"OpenDSS power flow (estimated):    {pf_total:.0f} seconds ({pf_total/60:.1f} minutes)")
print(f"Speedup: {pf_total / ml_time_per_run:,.0f}x")

# Scaling comparison
bus_counts = [100, 500, 1000, 5000, 10000]
print(f"\n{'Buses':>8}  {'OpenDSS':>12}  {'ML Model':>12}  {'Speedup':>10}")
print("-" * 48)
for n in bus_counts:
    pf = n * pf_time_per_bus
    ml = ml_time_per_run * (n / len(X))
    if pf >= 3600:
        pf_str = f"{pf/3600:.1f} hrs"
    elif pf >= 60:
        pf_str = f"{pf/60:.1f} min"
    else:
        pf_str = f"{pf:.0f} sec"
    print(f"{n:>8,}  {pf_str:>12}  {ml*1000:>9.1f} ms  {pf/ml:>10,.0f}x")
                    

=== Speed Comparison ===
ML surrogate (all 487 buses): 2.3 ms
OpenDSS power flow (estimated): 9740 seconds (162.3 minutes)
Speedup: 4,234,783x

   Buses     OpenDSS     ML Model    Speedup
------------------------------------------------
     100    33.3 min     0.5 ms  4,234,783x
     500     2.8 hrs     2.4 ms  4,234,783x
   1,000     5.6 hrs     4.7 ms  4,234,783x
   5,000    27.8 hrs    23.7 ms  4,234,783x
  10,000    55.6 hrs    47.3 ms  4,234,783x

When to use which: Use the ML surrogate for initial screening of interconnection queues, scenario planning, and real-time web applications. Use full OpenDSS power flow for final engineering studies, regulatory filings, and cases where the surrogate's uncertainty interval is too wide. The two approaches are complementary, not competing.

Model Persistence and Feature Engineering Notes

# Save the trained LightGBM model
model.save_model("hosting_capacity_surrogate.lgb")

# Load it back
model = lgb.Booster(model_file="hosting_capacity_surrogate.lgb")
                    

Feature engineering rationale: The 13 features were specifically chosen to represent the three key physical drivers of hosting capacity: electrical distance (dist_from_sub_km, avg_impedance_pct, cumulative_r_ohm), equipment limits (total_kva, min_ampacity_a, total_line_mi), and loading conditions (peak_load_kw, load_density_kw_per_bus, existing_pv_peak_kw). Each feature has a clear physical interpretation, which makes the model's predictions auditable and trustworthy for engineering decisions.

✓

What You Built and Next Steps

Loaded power flow simulation results from Guide 03 and framed them as ML training labels
Engineered 13 features from network topology, transformer ratings, conductor properties, and load data
Trained a LightGBM regression model achieving R² = 0.92 and MAE of ~54 kW
Ran sensitivity analysis across temperature, load growth, and DER penetration scenarios in under one second
Built probabilistic hosting capacity estimates using quantile regression with calibrated 80% intervals
Mapped hosting capacity spatially across the SP&L service territory
Demonstrated massive speedup over full power flow simulation for the specific task of hosting capacity screening (~20 seconds/bus for OpenDSS vs. microseconds/bus for ML)

Ideas to Try Next

Graph neural networks: Encode the feeder topology as a graph to capture adjacency effects between buses
Active learning: Use the surrogate model to identify buses where it is least confident, then run targeted power flow only at those locations
Time-series hosting capacity: Train on hourly power flow results to predict how hosting capacity varies by time of day and season
Transfer learning: Train on one feeder and fine-tune on a new feeder with limited power flow data
Voltage sensitivity coefficients: Add dV/dP and dV/dQ sensitivity factors from the network Jacobian as input features

Key Terms Glossary

Surrogate model — a fast ML approximation of a slow physics simulation; trained on simulation outputs
Hosting capacity — maximum DER generation a feeder bus can accept without voltage or thermal violations
LightGBM — Light Gradient Boosting Machine; a histogram-based gradient boosting framework optimized for speed
Quantile regression — predicting specific percentiles (e.g., 10th, 90th) instead of the mean, yielding prediction intervals
R² (coefficient of determination) — fraction of target variance explained by the model; 1.0 is perfect
MAE (mean absolute error) — average absolute difference between predicted and actual values
Ampacity — the rated current-carrying capacity of a conductor, in amps
Per-unit (p.u.) — voltage expressed as a fraction of nominal; ANSI C84.1 range is 0.95–1.05
DER penetration — ratio of distributed generation capacity to peak load on a feeder

← Beginner: Hosting Capacity Next: Advanced Predictive Maintenance →