Step 1

Introduction to Statistical Modeling

Statistical modeling is the process of using mathematical relationships to understand and predict real-world phenomena. In this tutorial, we'll explore how car weight affects fuel efficiency using the classic mtcars dataset.

What You'll Learn

  • Understand the basic workflow of statistical modeling in R
  • Fit a simple linear model using glmmTMB
  • Tidy and interpret model results with broom.mixed
  • Visualize model predictions and make new predictions
  • Appreciate model assumptions and advanced modeling potential
Before You Begin:
  • Basic R knowledge (data frames, functions)
  • R packages: tidyverse, glmmTMB, broom.mixed
  • Understanding of correlation concepts
We'll predict miles per gallon (mpg) from car weight (wt) using a linear relationship. This is a classic problem that demonstrates core statistical concepts while remaining intuitive.
Step 2

Data Exploration

Before modeling, we need to understand our data. The mtcars dataset contains information about 32 car models from the 1970s.

# Load required libraries
library(tidyverse)
library(glmmTMB)
library(broom.mixed)

# Load and examine the data
data(mtcars)
head(mtcars)
str(mtcars)

# Focus on our variables of interest
summary(mtcars[c('mpg', 'wt')])
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

      mpg              wt       
 Min.   :10.40   Min.   :1.513  
 1st Qu.:15.43   1st Qu.:2.581  
 Median :19.20   Median :3.325  
 Mean   :20.09   Mean   :3.217  
 3rd Qu.:22.80   3rd Qu.:3.610  
 Max.   :33.90   Max.   :5.424

Key Observations

  • MPG range: 10.4 to 33.9 miles per gallon
  • Weight range: 1.5 to 5.4 (thousands of pounds)
  • Sample size: 32 observations
  • Expectation: Heavier cars should have lower fuel efficiency
Quick Check: Look at the data summary above. Do you notice any patterns? What would you expect the relationship between weight and mpg to be?
Step 3

Model Fitting with glmmTMB

Now we'll fit a linear model to quantify the relationship between car weight and fuel efficiency. We use glmmTMB because it provides a unified interface for many types of models.

# Fit a simple linear model
model1 <- glmmTMB(mpg ~ wt, data = mtcars)

# View model summary
summary(model1)
 Family: gaussian  ( identity )
Formula:          mpg ~ wt
Data: mtcars

      AIC       BIC    logLik -2*log(L)  df.resid 
    166.0     170.4     -80.0     160.0        29 

Conditional model:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  37.2851     1.8180  20.509   <2e-16 ***
wt           -5.3445     0.5413  -9.873   <2e-16 ***

Understanding the Model

Our model equation is: mpg = 37.29 - 5.34 ร— weight

The negative coefficient (-5.34) confirms our intuition: for every 1000 pounds of additional weight, fuel efficiency decreases by about 5.3 miles per gallon. The tiny p-values (< 2e-16) indicate this relationship is statistically significant.
Step 4

Tidy Model Output with broom.mixed

The broom.mixed package provides clean, consistent output for statistical models, making them easier to work with in data analysis pipelines.

# Get tidy coefficient estimates
tidy_coefs <- tidy(model1)
print(tidy_coefs)

# Add predictions and residuals to original data
model_data <- augment(model1)
print(model_data)

# Get overall model statistics
model_stats <- glance(model1)
print(model_stats)
# Coefficients
# A tibble: 2 ร— 7
  effect component term        estimate std.error statistic  p.value
                                 
1 fixed  cond      (Intercept)    37.3      1.82      20.5  1.80e-93
2 fixed  cond      wt             -5.34     0.541     -9.87 5.48e-23

# Augmented data (first 10 rows)
# A tibble: 32 ร— 6
   .rownames           mpg    wt .fitted .se.fit .resid
                         
 1 Mazda RX4          21    2.62    23.3   0.613 -2.28 
 2 Mazda RX4 Wag      21    2.88    21.9   0.553 -0.920
 3 Datsun 710         22.8  2.32    24.9   0.713 -2.09 

# Model statistics
# A tibble: 1 ร— 7
   nobs sigma logLik   AIC   BIC deviance df.residual
                  
1    32  2.95  -80.0  166.  170.     278.          29

Key Components Explained

  • .fitted: Model predictions for each car
  • .resid: Difference between observed and predicted mpg
  • sigma: Standard deviation of residuals (2.95 mpg)
  • AIC: Model fit statistic (lower is better for model comparison)
Step 5

Visualization

Visualizing our model helps us understand the relationship and assess model quality.

# Create a visualization of the model
library(ggplot2)

# Plot observed data and model predictions
ggplot(model_data, aes(x = wt)) +
  geom_point(aes(y = mpg), color = 'steelblue', size = 2, alpha = 0.7) +
  geom_line(aes(y = .fitted), color = 'red', linewidth = 1) +
  labs(
    title = 'Relationship between Car Weight and Fuel Efficiency',
    subtitle = 'Observed data (blue points) and model predictions (red line)',
    x = 'Weight (1000 lbs)',
    y = 'Miles per Gallon (mpg)'
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = 'bold'),
    plot.subtitle = element_text(size = 12, color = 'gray60')
  )
The visualization should show a clear downward trend. Points close to the red line indicate good predictions, while points far from the line show where our simple model doesn't capture all the complexity.
Interpretation Challenge: Look for patterns in the residuals. Are there cars that consistently over-perform or under-perform relative to their weight? This might suggest other important factors we haven't included.
Step 6

Making Predictions

One of the primary uses of statistical models is making predictions for new data points.

# Create new data for predictions
new_cars <- data.frame(wt = c(2.5, 3.0, 3.5, 4.0))

# Generate predictions
predictions <- predict(model1, newdata = new_cars)

# Combine with original data for easy viewing
prediction_results <- data.frame(
  weight_1000lbs = new_cars$wt,
  predicted_mpg = round(predictions, 1)
)

print(prediction_results)
  weight_1000lbs predicted_mpg
1            2.5          23.9
2            3.0          21.3
3            3.5          18.6
4            4.0          15.9

Real-World Application

These predictions tell us:

  • A 2,500 lb car should get about 24 mpg
  • A 4,000 lb car should get about 16 mpg
  • Each additional 500 lbs reduces efficiency by ~2.7 mpg
Remember: these predictions assume the relationship we observed in 1970s cars still holds. Modern cars have different engines, aerodynamics, and materials that might change this relationship.
Step 7

Interpreting Results

Model Coefficients

  • Intercept (37.29): Theoretical mpg for a weightless car (not realistic, but mathematically necessary)
  • Weight coefficient (-5.34): Each 1000 lb increase decreases mpg by 5.34
  • Statistical significance: P-values < 0.001 indicate very strong evidence for these relationships

Model Quality Assessment

# Calculate R-squared manually
total_variance <- var(mtcars$mpg)
residual_variance <- var(model_data$.resid)
r_squared <- 1 - (residual_variance / total_variance)

cat("R-squared:", round(r_squared, 3), "\n")
cat("This means", round(r_squared * 100, 1), "% of mpg variation is explained by weight")
R-squared: 0.753
This means 75.3% of mpg variation is explained by weight

Practical Implications

Our model suggests that car weight is a strong predictor of fuel efficiency, explaining about 75% of the variation. However, 25% remains unexplained, indicating other important factors like:

  • Engine type and displacement
  • Number of cylinders
  • Transmission type
  • Aerodynamics and design
Step 8

Model Assumptions and Limitations

Every statistical model makes assumptions. Understanding these helps us know when our model is reliable and when it might fail.

Key Assumptions of Linear Models

Linearity: We assume the relationship between weight and mpg is truly linear. Our visualization suggests this is reasonable for this data range.
# Quick diagnostic: plot residuals vs fitted values
ggplot(model_data, aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.7) +
  geom_hline(yintercept = 0, color = 'red', linetype = 'dashed') +
  labs(
    title = 'Residuals vs Fitted Values',
    subtitle = 'Random scatter around zero suggests good model fit',
    x = 'Fitted Values (Predicted MPG)',
    y = 'Residuals'
  ) +
  theme_minimal()

Why glmmTMB?

We used glmmTMB instead of basic lm() because it:

  • Handles more complex models (mixed effects, non-normal distributions)
  • Provides consistent syntax across model types
  • Integrates well with tidyverse workflows
  • Prepares you for advanced modeling techniques
Step 9

Summary and Next Steps

What You've Accomplished

โœ… Completed Learning Objectives

  • Fitted your first statistical model using modern R tools
  • Learned the data โ†’ model โ†’ tidy results โ†’ visualization workflow
  • Made predictions and interpreted coefficients
  • Understood basic model assumptions and diagnostics
  • Gained foundation for more complex modeling

Immediate Next Steps

Practice Exercises:
  1. Try predicting mpg using horsepower (hp) instead of weight
  2. Fit a model with both weight and horsepower: mpg ~ wt + hp
  3. Compare model performance using AIC values
  4. Explore the relationship between engine displacement (disp) and mpg

Advanced Topics to Explore

  • Multiple regression: Including multiple predictors
  • Interaction effects: When predictors affect each other
  • Mixed-effects models: Handling grouped data
  • Non-linear relationships: Polynomial and spline models
  • Model validation: Cross-validation and bootstrap methods

Additional Resources