9: Model Fit, Matrix Form, and Multicollinearity.

class: center, middle, inverse, title-slide

.title[
# 9: Model Fit, Matrix Form, and Multicollinearity.
]
.subtitle[
## Linear Models
]
.author[
### <large>Jaye Seawright</large>
]
.institute[
### <small>Northwestern Political Science</small>
]
.date[
### Feb. 4, 2026
]

---

class: center, middle

<style>
pre[class] {
  max-height: 200px;
}
</style>

Our linear estimates are going to "miss" in various ways at various levels.

---

At the population level, the linear prediction won't exactly equal the CEF at every value of `$X$`. That is why we include `$\epsilon$`, often called the modeling error or the error term.

---

At the level of estimation, the fitted regression won't exactly equal the observed value `$Y_i$` for every data point `$i$`. That is why we include `$e_i$`, the difference between the observed value and the regression's fitted value, often called the residual.

---

The residual is not the same thing as the modeling error!
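
A small simulation can make the distinction concrete. The sketch below assumes a made-up population model (the coefficients, sample size, and seed are illustrative choices, not taken from the `hibbs` data), so the modeling errors are known and can be set beside the residuals from the fitted line.

``` r
# Hypothetical population model: Y = 2 + 3X + epsilon (values chosen for illustration)
set.seed(9)
n <- 50
x <- rnorm(n)
epsilon <- rnorm(n, sd = 2)   # modeling error: known only because we simulated it
y <- 2 + 3 * x + epsilon

sim.lm <- lm(y ~ x)
e <- resid(sim.lm)            # residuals: computed from the estimated line

# The two are closely related but not identical
head(cbind(epsilon, e))
cor(epsilon, e)

# Residuals sum to zero by construction; the simulated errors generally do not
sum(e)
sum(epsilon)
```

With real data we only ever see `e`; `epsilon` is observable here only because the data were simulated.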

---

``` r
library(tidyverse)
```

```
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.1     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.2.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ nlme::collapse()         masks dplyr::collapse()
## ✖ mice::filter()           masks dplyr::filter(), stats::filter()
## ✖ kableExtra::group_rows() masks dplyr::group_rows()
## ✖ dplyr::lag()             masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
```

``` r
library(rosdata)
data("hibbs")
```

---

``` r
econvote.lm <- lm(vote ~ growth, data = hibbs)
summary(econvote.lm)
```

```
## 
## Call:
## lm(formula = vote ~ growth, data = hibbs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.9929 -0.6674  0.2556  2.3225  5.3094 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.2476     1.6219  28.514 8.41e-14 ***
## growth        3.0605     0.6963   4.396  0.00061 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.763 on 14 degrees of freedom
## Multiple R-squared:  0.5798,	Adjusted R-squared:  0.5498 
## F-statistic: 19.32 on 1 and 14 DF,  p-value: 0.00061
```

---

``` r
econvoteplot.line <- ggplot(hibbs, aes(x = growth, y = vote, label = year)) +
     geom_text(size = 3) +
     scale_x_continuous(labels = function(x) paste0(x, "%")) +
     scale_y_continuous(labels = function(x) paste0(x, "%")) +
     labs(title = "Forecasting the Election from the Economy",
          x = "Average recent growth in personal income",
          y = "Incumbent party's vote share") +
     geom_smooth(method = "lm") +
     theme_minimal()
```

---

```
## `geom_smooth()` using formula = 'y ~ x'
```

---

``` r
econvoteplot.resids <- ggplot(hibbs, aes(x = growth, y = vote, label = year)) +
    geom_smooth(method = "lm", se = TRUE) +
    # Add vertical residual lines
    geom_segment(aes(xend = growth, yend = predict(lm(vote ~ growth, data = hibbs))), 
                 color = "maroon", alpha = 1, linetype = "dashed", linewidth=1) +
    geom_text(size = 3) +
    scale_x_continuous(labels = function(x) paste0(x, "%")) +
    scale_y_continuous(labels = function(x) paste0(x, "%")) +
    labs(title = "Forecasting the Election from the Economy",
         x = "Average recent growth in personal income",
         y = "Incumbent party's vote share") +
    theme_minimal()
```

---

```
## `geom_smooth()` using formula = 'y ~ x'
```

---

If we add up all the residuals from a regression, is that a useful measure of the model's fit?

---

``` r
# resid() extracts the residuals without relying on partial matching of $resid
sum(resid(econvote.lm))
```

```
## [1] -8.437695e-15
```
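
The sum is zero (up to rounding) for any OLS fit with an intercept, because the first-order conditions force the residuals to be orthogonal to every column of the design matrix, including the column of ones. A minimal check, using R's built-in `mtcars` data as a stand-in (an assumption for illustration, not the `hibbs` data):

``` r
# Stand-in data: built-in mtcars, used only for illustration
fit <- lm(mpg ~ wt, data = mtcars)
X <- model.matrix(fit)        # columns: (Intercept), wt
e <- resid(fit)

# Inner product of the residuals with each column of X
crossprod(X, e)               # both entries are zero up to rounding error

# Without an intercept, the residuals need not sum to zero
fit.noint <- lm(mpg ~ 0 + wt, data = mtcars)
sum(resid(fit.noint))
```

So the raw sum of residuals says nothing about how well the model fits.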

---

``` r
sum(resid(econvote.lm)^2)
```

```
## [1] 198.2727
```

But what does this number *mean*?

---

Consider a regression with no explanatory variables.

---

``` r
econvoteplot.constant <- ggplot(hibbs, aes(x = growth, y = vote, label = year)) +
    geom_smooth(method = "lm", formula = y ~ 1, se = TRUE) +
    # Add vertical residual lines
    geom_segment(aes(xend = growth, yend = predict(lm(vote ~ 1, data = hibbs))), 
                 color = "navy", alpha = 1, linetype = "dashed", linewidth=1) +
    geom_text(size = 3) +
    scale_x_continuous(labels = function(x) paste0(x, "%")) +
    scale_y_continuous(labels = function(x) paste0(x, "%")) +
    labs(title = "Forecasting the Election from the Economy",
         x = "Average recent growth in personal income",
         y = "Incumbent party's vote share") +
    theme_minimal()
```

---

``` r
# Intercept-only model: the fitted value is the mean of vote for every election
econvoteconstant.lm <- lm(vote ~ 1, data = hibbs)
sum(resid(econvoteconstant.lm)^2)
```

```
## [1] 471.905
```

---

``` r
sum(resid(econvote.lm)^2) / sum(resid(econvoteconstant.lm)^2)
```

```
## [1] 0.4201538
```

---

``` r
1 - sum(resid(econvote.lm)^2) / sum(resid(econvoteconstant.lm)^2)
```

```
## [1] 0.5798462
```

---

``` r
summary(econvote.lm)
```

---

`$$R^2 = 1 - \frac{\text{RSS}}{\text{TSS}}$$`
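
Here RSS is the residual sum of squares from the fitted model and TSS is the total sum of squares around the mean of the outcome, which is also the RSS of the intercept-only model. A minimal sketch of the calculation, again using the built-in `mtcars` data as an illustrative stand-in:

``` r
# Stand-in data: built-in mtcars, used only for illustration
fit <- lm(mpg ~ wt, data = mtcars)

rss <- sum(resid(fit)^2)                          # residual sum of squares
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)     # total sum of squares

# TSS equals the RSS of a regression with no explanatory variables
fit.constant <- lm(mpg ~ 1, data = mtcars)
all.equal(tss, sum(resid(fit.constant)^2))

# R-squared by hand matches what summary() reports
r2 <- 1 - rss / tss
all.equal(r2, summary(fit)$r.squared)
```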

---

Recall the matrix form of the regression model:

`$$Y = \mathbb{X} \beta + \epsilon$$`

---

`$$Y = \begin{pmatrix}
Y_1 \\ Y_2 \\ \vdots \\ Y_N
\end{pmatrix}$$`

---

`$$\mathbb{X} = \begin{pmatrix}
1 & X_{11} & X_{12} & \cdots & X_{1K} \\ 1 & X_{21} & X_{22} & \cdots & X_{2K} \\ \vdots & \vdots & \vdots & \vdots & \vdots  \\ 1 & X_{N1} & X_{N2} & \cdots & X_{NK}
\end{pmatrix}$$`

---

`$$\beta = \begin{pmatrix}
\text{Intercept} \\ \beta_1 \\ \vdots \\ \beta_{K}
\end{pmatrix}$$`

---

`$$e = \begin{pmatrix}
e_1 \\ e_2 \\ \vdots \\ e_N
\end{pmatrix}$$`

---

`$$\hat\beta_{OLS} = (\mathbb{X}^T\mathbb{X})^{-1}\mathbb{X}^TY$$`

---

`$$\hat Y = \mathbb{X} \hat\beta_{OLS}$$`

---

`$$Y - \hat Y = e$$`
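
These three formulas can be sketched directly in R. The example below is an illustration on the built-in `mtcars` data (a stand-in, not the `hibbs` data); it solves the normal equations rather than forming the inverse explicitly, which is a common numerical choice and not necessarily what `lm()` does internally.

``` r
# Stand-in data: built-in mtcars, used only for illustration
fit <- lm(mpg ~ wt + hp, data = mtcars)

X <- model.matrix(fit)        # N x (K + 1) matrix with a leading column of ones
Y <- mtcars$mpg

# Solve (X'X) b = X'Y for b, which is equivalent to (X'X)^{-1} X'Y
beta.hat <- solve(crossprod(X), crossprod(X, Y))
Y.hat <- X %*% beta.hat       # fitted values
e <- Y - Y.hat                # residuals

# Both agree with lm() up to numerical tolerance
all.equal(unname(drop(beta.hat)), unname(coef(fit)))
all.equal(unname(drop(e)), unname(resid(fit)))
```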

---

In order for `$\hat\beta_{OLS}$` to be defined, we need to be able to compute `$(\mathbb{X}^T\mathbb{X})^{-1}$`. The matrix `$\mathbb{X}^T\mathbb{X}$` can be inverted when it is *positive definite*, which is a property from matrix algebra.

`$\mathbb{X}^T\mathbb{X}$` is positive definite exactly when the columns of `$\mathbb{X}$` are *linearly independent*.

---

If we can multiply one or more of the columns of `$\mathbb{X}$` (including the column of ones for the intercept) by constants and add them together so that the result exactly equals another column, then the columns are linearly dependent. In that case `$\mathbb{X}^T\mathbb{X}$` cannot be inverted and `$\hat\beta_{OLS}$` is not uniquely defined.
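
One way to detect this before fitting is to compare the rank of the design matrix with its number of columns. The sketch below builds a hypothetical doubled column from the built-in `mtcars` data (the variable `doublewt` is made up for illustration).

``` r
# Stand-in data: built-in mtcars; doublewt is a made-up redundant column
d <- mtcars
d$doublewt <- 2 * d$wt
X <- cbind(1, d$wt, d$doublewt)

# The rank is smaller than the number of columns
qr(X)$rank
ncol(X)

# Trying to invert X'X raises an error, which we capture as a message
inverse <- tryCatch(solve(crossprod(X)),
                    error = function(err) conditionMessage(err))
inverse
```

`lm()` does not stop with an error in this situation; it drops the redundant column and reports `NA` for its coefficient.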

---

``` r
hibbs$doublegrowth <- hibbs$growth * 2
econvote.fail <- lm(vote ~ growth + doublegrowth, data = hibbs)
summary(econvote.fail)
```

```
## 
## Call:
## lm(formula = vote ~ growth + doublegrowth, data = hibbs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.9929 -0.6674  0.2556  2.3225  5.3094 
## 
## Coefficients: (1 not defined because of singularities)
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   46.2476     1.6219  28.514 8.41e-14 ***
## growth         3.0605     0.6963   4.396  0.00061 ***
## doublegrowth       NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.763 on 14 degrees of freedom
## Multiple R-squared:  0.5798,	Adjusted R-squared:  0.5498 
## F-statistic: 19.32 on 1 and 14 DF,  p-value: 0.00061
```

---

``` r
# earlyyear and lateyear are complements, so together they add up to the intercept column
hibbs$earlyyear <- hibbs$year < 1980
hibbs$lateyear <- hibbs$year >= 1980
econvote.fail <- lm(vote ~ growth + earlyyear + lateyear, data = hibbs)
summary(econvote.fail)
```

```
## 
## Call:
## lm(formula = vote ~ growth + earlyyear + lateyear, data = hibbs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.2182 -0.9894 -0.0878  2.4180  4.9154 
## 
## Coefficients: (1 not defined because of singularities)
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    46.4518     1.6703  27.810 5.74e-13 ***
## growth          3.3250     0.7906   4.206  0.00103 ** 
## earlyyearTRUE  -1.6135     2.1534  -0.749  0.46702    
## lateyearTRUE        NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.824 on 13 degrees of freedom
## Multiple R-squared:  0.5972,	Adjusted R-squared:  0.5353 
## F-statistic: 9.639 on 2 and 13 DF,  p-value: 0.002709
```
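
The usual remedy is to leave out one of the indicators, so that the omitted category becomes the baseline absorbed by the intercept. A minimal sketch of the same pattern on the built-in `mtcars` data (the `manual` and `automatic` indicators are constructed here for illustration):

``` r
# Stand-in data: built-in mtcars; manual and automatic are complementary indicators
d <- mtcars
d$manual <- d$am == 1
d$automatic <- d$am == 0

# manual + automatic equals the intercept column, so one coefficient is NA
trap.lm <- lm(mpg ~ wt + manual + automatic, data = d)
coef(trap.lm)

# Dropping one indicator makes the remaining category the baseline
ok.lm <- lm(mpg ~ wt + manual, data = d)
coef(ok.lm)

# Both fits produce the same predictions
all.equal(fitted(trap.lm), fitted(ok.lm))
```

The choice of which indicator to drop changes how the coefficients are read, not how well the model fits.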

---

![Multicollinearity 1](Images/Folke1.png)

---

![Multicollinearity 2](Images/Folke2.png)

---

![Multicollinearity 3](Images/Folke3.png)