Lesson 4

Introduction to Model Validation

You've learned to fit models, compare them statistically, and check diagnostic assumptions. Now comes the ultimate test: How well do your models predict new, unseen data? This lesson teaches validation techniques that separate truly useful models from overfitted ones.

Model validation answers the crucial question: "Will this model work in the real world?" Diagnostic plots tell you if assumptions are met, but only testing on new data reveals true predictive performance.

What You'll Learn

  • Implement leave-one-out and k-fold cross-validation
  • Assess prediction accuracy with multiple metrics
  • Use bootstrap methods for uncertainty quantification
  • Understand the bias-variance trade-off in model complexity
  • Distinguish between confidence and prediction intervals
  • Apply learning curves to assess data sufficiency
  • Make informed decisions about model deployment
Before You Begin:
  • Completed modeling, comparison, and diagnostic tutorials
  • Understanding of overfitting and generalization
  • Same packages: tidyverse, glmmTMB, broom.mixed
  • Basic knowledge of sampling and resampling concepts
The Central Challenge: More complex models fit training data better but may perform worse on new data. Validation helps find the sweet spot between underfitting (high bias) and overfitting (high variance).
Step 1

Cross-Validation: The Gold Standard

Cross-validation tests model performance by repeatedly fitting on subsets of data and predicting on held-out observations. This simulates real-world deployment where you must predict new cases.

# Implement Leave-One-Out Cross-Validation
library(tidyverse)
library(glmmTMB)
library(broom.mixed)

data(mtcars)

# Define a leave-one-out cross-validation helper
loo_cv <- function(model_formula, data) {
  n <- nrow(data)
  predictions <- numeric(n)
  
  for(i in 1:n) {
    # Fit model without observation i
    train_data <- data[-i, ]
    test_data <- data[i, ]
    
    # Fit model and predict held-out observation
    temp_model <- glmmTMB(model_formula, data = train_data)
    predictions[i] <- predict(temp_model, newdata = test_data)
  }
  
  return(predictions)
}

# Test all our competing models
cv_formulas <- list(
  simple = mpg ~ wt,
  additive = mpg ~ wt + hp,
  interaction = mpg ~ wt * hp,
  complex = mpg ~ wt * hp + cyl
)

# Fit all models to the full dataset
# Note: these fitted models are used for the AIC comparison (Step 4) and the interval calculations (Step 5)
models <- list(
  simple = glmmTMB(cv_formulas$simple, data = mtcars),
  additive = glmmTMB(cv_formulas$additive, data = mtcars),
  interaction = glmmTMB(cv_formulas$interaction, data = mtcars),
  complex = glmmTMB(cv_formulas$complex, data = mtcars)
)

# Perform cross-validation for each model
cv_results <- map_dfr(names(cv_formulas), function(model_name) {
  predictions <- loo_cv(cv_formulas[[model_name]], mtcars)
  
  data.frame(
    model = model_name,
    observed = mtcars$mpg,
    predicted = predictions,
    residual = mtcars$mpg - predictions,
    car = rownames(mtcars)
  )
})

# Calculate performance metrics
cv_metrics <- cv_results %>%
  group_by(model) %>%
  summarise(
    rmse = sqrt(mean(residual^2)),
    mae = mean(abs(residual)),
    r_squared = cor(observed, predicted)^2,
    .groups = 'drop'
  )

print(cv_metrics)
Model                     RMSE   MAE    R-squared
Simple (wt only)          3.20   2.52   0.710
Additive (wt + hp)        2.78   2.12   0.783
Interaction (wt × hp)     2.34   2.03   0.845
Complex (wt × hp + cyl)   2.38   2.06   0.840
The interaction model achieves the best cross-validation performance, confirming our previous analysis. Importantly, adding cylinders to the interaction model actually hurts predictive performance - a sign of overfitting that in-sample fit statistics can miss.
[Figure: Cross-validation RMSE comparison across models]

Understanding the Results

  • Simple model (RMSE: 3.20): Underfit - misses important relationships
  • Additive model (RMSE: 2.78): Better but still missing interaction effects
  • Interaction model (RMSE: 2.34): Best balance of complexity and generalization
  • Complex model (RMSE: 2.38): Slight overfitting - worse than the simpler interaction model
# Visualize prediction accuracy for best model
best_model_cv <- cv_results %>% filter(model == "interaction")

ggplot(best_model_cv, aes(x = observed, y = predicted)) +
  geom_point(size = 3, alpha = 0.7, color = "#2563eb") +
  geom_smooth(method = "lm", se = TRUE, color = "#dc2626", linewidth = 1.2) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "gray50") +
  geom_text(aes(label = car), size = 2.5, nudge_y = 0.8, alpha = 0.7) +
  labs(
    title = "Cross-Validation: Predicted vs Observed MPG",
    subtitle = "Interaction model - Points closer to diagonal indicate better predictions",
    x = "Observed MPG",
    y = "Predicted MPG"
  ) +
  theme_minimal() +
  coord_equal()
[Figure: Cross-validation predicted vs observed plot]
Points close to the diagonal line indicate accurate predictions. The strong linear relationship (nearly overlapping the dashed 1:1 line) confirms excellent predictive performance for the interaction model.
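
Leave-one-out CV refits the model once per observation, which is quick for 32 cars but becomes costly on larger datasets. The same idea generalizes to k-fold cross-validation: split the data into k folds and hold out one fold at a time. Below is a minimal sketch under the same setup; the kfold_cv helper and the choice of k = 5 are illustrative, not code from earlier lessons.

# Sketch: k-fold cross-validation RMSE for one model formula (illustrative helper)
kfold_cv <- function(model_formula, data, k = 5) {
  # Randomly assign each row to one of k folds
  folds <- sample(rep(1:k, length.out = nrow(data)))
  predictions <- numeric(nrow(data))
  
  for (fold in 1:k) {
    train_data <- data[folds != fold, ]
    test_data <- data[folds == fold, ]
    
    # Fit on the other k-1 folds, predict the held-out fold
    fit <- glmmTMB(model_formula, data = train_data)
    predictions[folds == fold] <- predict(fit, newdata = test_data)
  }
  
  sqrt(mean((data$mpg - predictions)^2))  # k-fold RMSE
}

kfold_cv(cv_formulas$interaction, mtcars)

With k = 5 on 32 cars each fold holds out 6 or 7 observations, so the estimate is a little noisier than leave-one-out but requires only five model fits.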
Step 2

Prediction Error Analysis

Beyond overall metrics, examining the distribution of prediction errors reveals which types of observations are difficult to predict and whether errors are systematic.

# Analyze prediction error patterns
ggplot(cv_results, aes(x = model, y = abs(residual), fill = model)) +
  geom_boxplot(alpha = 0.7, outlier.alpha = 0.6) +
  geom_jitter(width = 0.2, alpha = 0.5, size = 1.5) +
  scale_fill_viridis_d(option = "plasma") +
  labs(
    title = "Cross-Validation Residual Magnitude by Model",
    subtitle = "Distribution of absolute prediction errors",
    x = "Model Complexity",
    y = "Absolute Residual (MPG)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")
[Figure: Box plots of cross-validation residuals by model]

Error Pattern Analysis

  • Simple model: Largest errors and widest spread - consistent underfitting
  • Additive model: Reduced error spread but still some large residuals
  • Interaction model: Smallest median error and tightest distribution
  • Complex model: Slightly worse than interaction - overfitting penalty

Identifying Difficult-to-Predict Cases

# Find cars with consistently large prediction errors
error_analysis <- cv_results %>%
  group_by(car) %>%
  summarise(
    mean_abs_error = mean(abs(residual)),
    worst_model = model[which.max(abs(residual))],
    best_model = model[which.min(abs(residual))],
    .groups = 'drop'
  ) %>%
  arrange(desc(mean_abs_error))

# Top 5 hardest to predict cars
print(head(error_analysis, 5))

# Cars that benefit most from complex models
complexity_benefit <- cv_results %>%
  select(car, model, residual) %>%
  pivot_wider(names_from = model, values_from = residual) %>%
  mutate(
    simple_to_interaction = abs(simple) - abs(interaction),
    interaction_benefit = simple_to_interaction > 1  # More than 1 MPG improvement
  )

print(filter(complexity_benefit, interaction_benefit))
Key findings: Economy cars (Toyota Corolla, Honda Civic) and heavy luxury vehicles (Chrysler Imperial) show the largest prediction errors across models. These represent extreme cases where weight-horsepower relationships differ from typical patterns.
Step 3

Bootstrap Uncertainty Quantification

Bootstrap resampling provides another perspective on model uncertainty by repeatedly fitting models to random samples of your data, revealing how stable your predictions are.

# Bootstrap confidence intervals for predictions
bootstrap_predictions <- function(model_formula, newdata, data, n_boot = 100) {
  boot_preds <- replicate(n_boot, {
    # Resample rows of the original data with replacement
    boot_indices <- sample(nrow(data), replace = TRUE)
    boot_data <- data[boot_indices, ]
    
    # Refit the model to the bootstrap sample
    boot_model <- glmmTMB(model_formula, data = boot_data)
    
    # Predict on the new data grid
    predict(boot_model, newdata = newdata)
  })
  
  # Percentile confidence intervals across bootstrap replicates
  data.frame(
    predicted = rowMeans(boot_preds),
    lower = apply(boot_preds, 1, quantile, 0.025),
    upper = apply(boot_preds, 1, quantile, 0.975)
  )
}

# Create prediction grid
pred_grid <- expand_grid(
  wt = seq(2, 5, by = 0.5),
  hp = c(100, 200, 300)
)

# Get bootstrap predictions for the interaction model
boot_results <- bootstrap_predictions(cv_formulas$interaction, pred_grid, mtcars)
pred_grid <- bind_cols(pred_grid, boot_results)

# Visualize bootstrap uncertainty
ggplot(pred_grid, aes(x = wt, y = predicted, color = factor(hp))) +
  geom_ribbon(aes(ymin = lower, ymax = upper, fill = factor(hp)), 
              alpha = 0.2, color = NA) +
  geom_line(linewidth = 1.2) +
  geom_point(data = mtcars, aes(x = wt, y = mpg), 
             color = "black", size = 2, alpha = 0.7, inherit.aes = FALSE) +
  scale_color_viridis_d(name = "Horsepower", option = "plasma") +
  scale_fill_viridis_d(name = "Horsepower", option = "plasma") +
  labs(
    title = "Bootstrap Confidence Intervals for Predictions",
    subtitle = "Model uncertainty quantified through resampling",
    x = "Weight (1000 lbs)",
    y = "Predicted MPG"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")
[Figure: Bootstrap confidence intervals for model predictions]
Bootstrap confidence bands show where model predictions are most uncertain. Notice wider bands at the extremes (very light or very heavy cars) where we have less training data. This uncertainty quantification is crucial for responsible deployment.
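The bands above bootstrap fitted values, but the same resampling loop also quantifies uncertainty in individual coefficients. Here is a minimal sketch for the wt:hp interaction term; the boot_coef object, the seed, and the 200 replicates are illustrative choices, not part of the lesson's main analysis.

# Sketch: percentile bootstrap interval for the wt:hp coefficient (illustrative)
set.seed(123)
boot_coef <- replicate(200, {
  # Resample the data and refit the interaction model
  boot_data <- mtcars[sample(nrow(mtcars), replace = TRUE), ]
  fixef(glmmTMB(mpg ~ wt * hp, data = boot_data))$cond["wt:hp"]
})

quantile(boot_coef, c(0.025, 0.975))  # compare with confint(models$interaction)

If the bootstrap interval and the model-based confidence interval roughly agree, that is reassuring; large discrepancies suggest the parametric assumptions deserve a closer look.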
Bootstrap vs Cross-Validation:
  • Cross-validation: Tests predictive performance on truly unseen data
  • Bootstrap: Quantifies uncertainty in parameter estimates and predictions
  • Complementary: Use both for comprehensive model assessment
Step 4

Model Complexity and Learning Curves

Understanding how model performance changes with complexity and sample size helps optimize both model choice and data collection strategies.

# Analyze complexity vs performance trade-off
# summarise() returns cv_metrics in alphabetical model order, so re-align it first
model_order <- c("simple", "additive", "interaction", "complex")
cv_ordered <- cv_metrics[match(model_order, cv_metrics$model), ]

complexity_data <- data.frame(
  model = c("Simple", "Additive", "Interaction", "Complex"),
  parameters = c(3, 4, 5, 6),  # Fixed effects (including intercept) plus residual variance
  cv_rmse = cv_ordered$rmse,
  cv_r2 = cv_ordered$r_squared,
  aic = c(AIC(models$simple), AIC(models$additive), 
          AIC(models$interaction), AIC(models$complex))
)

ggplot(complexity_data, aes(x = parameters, y = cv_rmse)) +
  geom_line(color = "#dc2626", linewidth = 1.2) +
  geom_point(size = 4, color = "#2563eb") +
  geom_text(aes(label = model), vjust = -1, size = 3.5, fontface = "bold") +
  labs(
    title = "Model Complexity vs Cross-Validation Performance",
    subtitle = "Bias-variance trade-off in action",
    x = "Number of Parameters",
    y = "Cross-Validation RMSE"
  ) +
  theme_minimal()
[Figure: Model complexity vs cross-validation performance]
The Sweet Spot: The interaction model (5 parameters) achieves optimal performance. Beyond this point, additional complexity hurts generalization - a classic demonstration of the bias-variance trade-off.

Learning Curves: Sample Size Effects

# Generate learning curve for interaction model
# Performance is scored on held-out rows so the curve reflects prediction accuracy
create_learning_curve <- function(formula, data, sizes = seq(10, nrow(data) - 4, by = 2)) {
  map_dfr(sizes, function(n) {
    # Repeat the random split several times for each training size
    replications <- 10
    
    metrics <- map_dfr(1:replications, function(rep) {
      # Random training sample of size n; remaining rows form the test set
      train_indices <- sample(nrow(data), n)
      train_data <- data[train_indices, ]
      test_data <- data[-train_indices, ]
      
      # Fit to the training sample and score on the held-out rows
      model <- glmmTMB(formula, data = train_data)
      preds <- predict(model, newdata = test_data)
      
      data.frame(
        n = n,
        rep = rep,
        rmse = sqrt(mean((test_data$mpg - preds)^2))
      )
    })
    
    # Average across replications
    metrics %>%
      group_by(n) %>%
      summarise(
        rmse_mean = mean(rmse),
        rmse_se = sd(rmse) / sqrt(n()),
        .groups = 'drop'
      )
  })
}

learning_data <- create_learning_curve(mpg ~ wt * hp, mtcars)

ggplot(learning_data, aes(x = n)) +
  geom_ribbon(aes(ymin = rmse_mean - rmse_se, ymax = rmse_mean + rmse_se), 
              alpha = 0.3, fill = "#2563eb") +
  geom_line(aes(y = rmse_mean), color = "#2563eb", linewidth = 1.2) +
  geom_point(aes(y = rmse_mean), color = "#2563eb", size = 2) +
  labs(
    title = "Learning Curve: Model Performance vs Sample Size",
    subtitle = "How prediction accuracy improves with more data",
    x = "Sample Size",
    y = "RMSE (miles per gallon)"
  ) +
  theme_minimal()
[Figure: Learning curve showing performance vs sample size]
The learning curve shows diminishing returns: performance improves rapidly with initial data points but plateaus around 20-25 observations. For this model complexity, our dataset size (32 cars) appears adequate.
Step 5

Confidence vs Prediction Intervals

Different types of uncertainty require different interval types. Understanding when to use confidence intervals vs prediction intervals is crucial for proper uncertainty communication.

# Compare confidence and prediction intervals
pred_grid_simple <- data.frame(wt = seq(2, 5, 0.1), hp = 200)

# Confidence intervals (for mean response)
conf_preds <- predict(models$interaction, newdata = pred_grid_simple, se.fit = TRUE)
pred_grid_simple$conf_fit <- conf_preds$fit
pred_grid_simple$conf_se <- conf_preds$se.fit
pred_grid_simple$conf_lower <- pred_grid_simple$conf_fit - 1.96 * pred_grid_simple$conf_se
pred_grid_simple$conf_upper <- pred_grid_simple$conf_fit + 1.96 * pred_grid_simple$conf_se

# Prediction intervals (for individual observations)
resid_sd <- sigma(models$interaction)
pred_grid_simple$pred_lower <- pred_grid_simple$conf_fit - 1.96 * sqrt(pred_grid_simple$conf_se^2 + resid_sd^2)
pred_grid_simple$pred_upper <- pred_grid_simple$conf_fit + 1.96 * sqrt(pred_grid_simple$conf_se^2 + resid_sd^2)

ggplot(pred_grid_simple, aes(x = wt)) +
  geom_ribbon(aes(ymin = pred_lower, ymax = pred_upper), 
              alpha = 0.2, fill = "#dc2626") +
  geom_ribbon(aes(ymin = conf_lower, ymax = conf_upper), 
              alpha = 0.4, fill = "#2563eb") +
  geom_line(aes(y = conf_fit), color = "black", linewidth = 1.2) +
  geom_point(data = filter(mtcars, hp >= 180 & hp <= 220), 
             aes(x = wt, y = mpg), size = 2, alpha = 0.8) +
  labs(
    title = "Confidence vs Prediction Intervals",
    subtitle = "Blue: confidence interval for mean | Red: prediction interval for individuals",
    x = "Weight (1000 lbs)",
    y = "MPG",
    caption = "Shown for cars with ~200 hp"
  ) +
  theme_minimal()
[Figure: Comparison of confidence and prediction intervals]

When to Use Each Interval Type

  • Confidence intervals (blue): Uncertainty about the average MPG for cars with specific weight/HP combinations
  • Prediction intervals (red): Expected range for an individual car's MPG
  • Key difference: Prediction intervals account for both model uncertainty AND individual variation
  • Practical use: Use prediction intervals when forecasting individual cases
Notice how prediction intervals are much wider than confidence intervals. This reflects the reality that predicting individual observations is much harder than estimating population averages.
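The gap follows directly from the normal approximation used in the code above: the confidence interval is fit ± 1.96 × se(fit), while the prediction interval widens to fit ± 1.96 × sqrt(se(fit)² + σ²). To see the difference concretely, a quick check on the grid computed in this step (this snippet only summarizes objects created above):

# Compare average interval widths from the prediction grid above
pred_grid_simple %>%
  summarise(
    mean_ci_width = mean(conf_upper - conf_lower),
    mean_pi_width = mean(pred_upper - pred_lower)
  )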
Step 6

Validation Best Practices & Summary

Complete Validation Workflow

[Figure: Comprehensive validation dashboard (four-panel summary)]
Our final assessment: The weight-horsepower interaction model demonstrates excellent validation performance across multiple criteria:
  • Best cross-validation RMSE: 2.34 MPG
  • Strong prediction accuracy: R² = 0.845 on unseen data
  • Optimal complexity: Best bias-variance trade-off
  • Reliable uncertainty: Bootstrap intervals give a realistic picture of prediction variability

Validation Checklist for Any Model

Validation Step              Purpose                         Key Metric
Cross-validation             Test predictive performance     RMSE, MAE, R²
Predicted vs Observed plot   Visualize prediction accuracy   Points near diagonal
Residual analysis            Check systematic errors         Random scatter
Bootstrap intervals          Quantify uncertainty            Reasonable band width
Complexity analysis          Avoid overfitting               Optimal parameter count
Learning curves              Assess data sufficiency         Performance plateau

Key Validation Principles

  • Never tune on test data: Use separate validation sets or cross-validation for model selection (see the sketch after this list)
  • Multiple metrics matter: RMSE, MAE, and R² provide different perspectives on performance
  • Visualize predictions: Plots reveal patterns that summary statistics miss
  • Quantify uncertainty: Provide intervals, not just point predictions
  • Consider complexity costs: Simpler models often generalize better
  • Validate assumptions: Combine validation with diagnostic checking
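Putting the first principle into practice, here is a minimal sketch of a held-out test split; the seed and the roughly 25% split size are illustrative choices, not part of the lesson's analysis.

# Sketch: reserve a test set before any model selection (illustrative)
set.seed(42)
test_idx <- sample(nrow(mtcars), size = 8)   # ~25% of the data held out

train_data <- mtcars[-test_idx, ]
test_data <- mtcars[test_idx, ]

# Select and tune models using train_data only (e.g., via cross-validation),
# then score the chosen model on the untouched test set exactly once
final_model <- glmmTMB(mpg ~ wt * hp, data = train_data)
sqrt(mean((test_data$mpg - predict(final_model, newdata = test_data))^2))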
Your Complete Modeling Toolkit: You now have the full pipeline for responsible statistical modeling: fitting → comparison → diagnostics → validation. These skills ensure your models are not just statistically sound but practically useful.

What You've Mastered

✅ Advanced Validation Skills

  • Implementing cross-validation for unbiased performance assessment
  • Using multiple metrics to evaluate prediction quality
  • Applying bootstrap methods for uncertainty quantification
  • Understanding and managing the bias-variance trade-off
  • Creating learning curves to guide data collection
  • Distinguishing confidence from prediction intervals
  • Making evidence-based decisions about model deployment

Advanced Topics for Continued Learning

  • Nested cross-validation: Proper model selection with unbiased performance estimates
  • Time series validation: Forward-chaining and walk-forward analysis
  • Stratified sampling: Ensuring representative validation splits
  • Bayesian model averaging: Combining multiple models for better predictions
  • Out-of-distribution detection: Identifying when new data differs from training
  • Calibration analysis: Ensuring predicted probabilities match observed frequencies