class: center, middle, inverse, title-slide .title[ # 9: Model Fit, Matrix Form, and Multicollinearity. ] .subtitle[ ## Linear Models ] .author[ ###
Jaye Seawright
] .institute[ ###
Northwestern Political Science
] .date[ ### Feb. 4, 2026 ]

---
class: center, middle

<style type="text/css">
pre {
  max-height: 400px;
  overflow-y: auto;
}
pre[class] {
  max-height: 200px;
}
</style>

Our linear estimates are going to "miss" in various ways at various levels.

---

At the population level, the linear prediction won't exactly equal the CEF for every value of `\(X\)`. That is why we include `\(\epsilon\)`, often called the modeling error or the error term.

---

At the level of estimation, the fitted regression won't exactly equal the observed value `\(Y_i\)` for every data point `\(i\)`. That is why we include `\(e_i\)`, the difference between the data and our fitted regression, often called the residual.

---

The residual is not the same thing as the modeling error!

---

``` r
library(tidyverse)
```

```
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.1     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.2.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ nlme::collapse()         masks dplyr::collapse()
## ✖ mice::filter()           masks dplyr::filter(), stats::filter()
## ✖ kableExtra::group_rows() masks dplyr::group_rows()
## ✖ dplyr::lag()             masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
```

``` r
library(rosdata)
data("hibbs")
```

---

``` r
econvote.lm <- lm(vote ~ growth, data = hibbs)
summary(econvote.lm)
```

```
##
## Call:
## lm(formula = vote ~ growth, data = hibbs)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -8.9929 -0.6674  0.2556  2.3225  5.3094
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  46.2476     1.6219  28.514 8.41e-14 ***
## growth        3.0605     0.6963   4.396  0.00061 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.'
0.1 ' ' 1
##
## Residual standard error: 3.763 on 14 degrees of freedom
## Multiple R-squared:  0.5798, Adjusted R-squared:  0.5498
## F-statistic: 19.32 on 1 and 14 DF,  p-value: 0.00061
```

---

``` r
econvoteplot.line <- ggplot(hibbs, aes(x = growth, y = vote, label = year)) +
  geom_text(size = 3) +
  scale_x_continuous(labels = function(x) paste0(x, "%")) +
  scale_y_continuous(labels = function(x) paste0(x, "%")) +
  labs(title = "Forecasting the Election from the Economy",
       x = "Average recent growth in personal income",
       y = "Incumbent party's vote share") +
  geom_smooth(method = "lm") +
  theme_minimal()
```

---

```
## `geom_smooth()` using formula = 'y ~ x'
```

<img src="ModelFit_files/figure-html/unnamed-chunk-5-1.png" width="70%" style="display: block; margin: auto;" />

---

``` r
econvoteplot.resids <- ggplot(hibbs, aes(x = growth, y = vote, label = year)) +
  geom_smooth(method = "lm", se = TRUE) +
  # Add vertical residual lines
  geom_segment(aes(xend = growth,
                   yend = predict(lm(vote ~ growth, data = hibbs))),
               color = "maroon", alpha = 1,
               linetype = "dashed", linewidth = 1) +
  geom_text(size = 3) +
  scale_x_continuous(labels = function(x) paste0(x, "%")) +
  scale_y_continuous(labels = function(x) paste0(x, "%")) +
  labs(title = "Forecasting the Election from the Economy",
       x = "Average recent growth in personal income",
       y = "Incumbent party's vote share") +
  theme_minimal()
```

---

```
## `geom_smooth()` using formula = 'y ~ x'
```

<img src="ModelFit_files/figure-html/unnamed-chunk-7-1.png" width="70%" style="display: block; margin: auto;" />

---

If we add up all the residuals from a regression, is that a useful measure of the model's fit?

---

``` r
sum(econvote.lm$resid)
```

```
## [1] -8.437695e-15
```

---

``` r
sum(econvote.lm$resid^2)
```

```
## [1] 198.2727
```

But what does this number *mean*?

---

Consider a regression with no explanatory variables.
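---

A hedged sketch of what "no explanatory variables" implies: with only an intercept, least squares should fit the sample mean of the outcome, so the sum of squared residuals is just the squared spread of the outcome around its own mean. The check below runs on R's built-in `cars` data, used here only as a stand-in so the example works without `rosdata`. The names `constant.lm`, `b0`, `rss.constant`, and `spread.byhand` are made up for illustration.

```r
# Stand-in outcome: stopping distance from R's built-in `cars` data.
# (An assumption for illustration; this is not the hibbs data.)
y <- cars$dist

# A regression with no explanatory variables estimates only an intercept.
constant.lm <- lm(y ~ 1)

# The single fitted value should be the sample mean of the outcome.
b0 <- unname(coef(constant.lm)[1])
all.equal(b0, mean(y))

# So the sum of squared residuals from this model should match the
# squared deviations of the outcome around its mean, computed by hand.
rss.constant  <- sum(resid(constant.lm)^2)
spread.byhand <- sum((y - mean(y))^2)
all.equal(rss.constant, spread.byhand)
```

Swapping `hibbs$vote` in for `y` should show the same agreement.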
---

``` r
econvoteplot.constant <- ggplot(hibbs, aes(x = growth, y = vote, label = year)) +
  geom_smooth(method = "lm", formula = y ~ 1, se = TRUE) +
  # Add vertical residual lines
  geom_segment(aes(xend = growth,
                   yend = predict(lm(vote ~ 1, data = hibbs))),
               color = "navy", alpha = 1,
               linetype = "dashed", linewidth = 1) +
  geom_text(size = 3) +
  scale_x_continuous(labels = function(x) paste0(x, "%")) +
  scale_y_continuous(labels = function(x) paste0(x, "%")) +
  labs(title = "Forecasting the Election from the Economy",
       x = "Average recent growth in personal income",
       y = "Incumbent party's vote share") +
  theme_minimal()
```

---

<img src="ModelFit_files/figure-html/unnamed-chunk-11-1.png" width="70%" style="display: block; margin: auto;" />

---

``` r
econvoteconstant.lm <- lm(vote ~ 1, data = hibbs)
sum(econvoteconstant.lm$resid^2)
```

```
## [1] 471.905
```

---

``` r
sum(econvote.lm$resid^2)/sum(econvoteconstant.lm$resid^2)
```

```
## [1] 0.4201538
```

---

``` r
1 - sum(econvote.lm$resid^2)/sum(econvoteconstant.lm$resid^2)
```

```
## [1] 0.5798462
```

---

``` r
summary(econvote.lm)
```

```
##
## Call:
## lm(formula = vote ~ growth, data = hibbs)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -8.9929 -0.6674  0.2556  2.3225  5.3094
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  46.2476     1.6219  28.514 8.41e-14 ***
## growth        3.0605     0.6963   4.396  0.00061 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.'
0.1 ' ' 1
##
## Residual standard error: 3.763 on 14 degrees of freedom
## Multiple R-squared:  0.5798, Adjusted R-squared:  0.5498
## F-statistic: 19.32 on 1 and 14 DF,  p-value: 0.00061
```

---

`$$R^2 = 1 - \frac{\text{RSS}}{\text{TSS}}$$`

Here RSS is the sum of squared residuals from the fitted regression, and TSS is the sum of squared residuals from the constant-only regression.

---

Recall the matrix form of the regression model:

`$$Y = \mathbb{X} \beta + \epsilon$$`

---

`$$Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_N \end{pmatrix}$$`

---

`$$\mathbb{X} = \begin{pmatrix} 1 & X_{11} & X_{12} & \cdots & X_{1K} \\ 1 & X_{21} & X_{22} & \cdots & X_{2K} \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & X_{N1} & X_{N2} & \cdots & X_{NK} \end{pmatrix}$$`

---

`$$\beta = \begin{pmatrix} \text{Intercept} \\ \beta_1 \\ \vdots \\ \beta_{K} \end{pmatrix}$$`

---

`$$e = \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_N \end{pmatrix}$$`

---

`$$\hat\beta_{OLS} = (\mathbb{X}^T\mathbb{X})^{-1}\mathbb{X}^TY$$`

---

`$$\hat Y = \mathbb{X} \hat\beta_{OLS}$$`

---

`$$Y - \hat Y = e$$`

---

In order for `\(\hat\beta_{OLS}\)` to be defined, we need to be able to compute `\((\mathbb{X}^T\mathbb{X})^{-1}\)`. The matrix `\(\mathbb{X}^T\mathbb{X}\)` can be inverted exactly when it is *positive definite*, which is a property from matrix algebra. That holds if and only if the columns of `\(\mathbb{X}\)` are *linearly independent*.

---

If we can multiply one or more of the columns of `\(\mathbb{X}\)` (including the column of ones for the intercept) by constants and add them together, and the result exactly equals another column, then the columns are linearly dependent. In that case `\(\hat\beta_{OLS}\)` is not uniquely defined, and regression will fail.

---

``` r
hibbs$doublegrowth <- hibbs$growth * 2
econvote.fail <- lm(vote ~ growth + doublegrowth, data = hibbs)
summary(econvote.fail)
```

```
##
## Call:
## lm(formula = vote ~ growth + doublegrowth, data = hibbs)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -8.9929 -0.6674  0.2556  2.3225  5.3094
##
## Coefficients: (1 not defined because of singularities)
##              Estimate Std.
Error t value Pr(>|t|)
## (Intercept)   46.2476     1.6219  28.514 8.41e-14 ***
## growth         3.0605     0.6963   4.396  0.00061 ***
## doublegrowth       NA         NA      NA       NA
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.763 on 14 degrees of freedom
## Multiple R-squared:  0.5798, Adjusted R-squared:  0.5498
## F-statistic: 19.32 on 1 and 14 DF,  p-value: 0.00061
```

---

``` r
hibbs$earlyyear <- hibbs$year < 1980
hibbs$lateyear <- hibbs$year >= 1980
econvote.fail <- lm(vote ~ growth + earlyyear + lateyear, data = hibbs)
summary(econvote.fail)
```

```
##
## Call:
## lm(formula = vote ~ growth + earlyyear + lateyear, data = hibbs)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -8.2182 -0.9894 -0.0878  2.4180  4.9154
##
## Coefficients: (1 not defined because of singularities)
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)    46.4518     1.6703  27.810 5.74e-13 ***
## growth          3.3250     0.7906   4.206  0.00103 **
## earlyyearTRUE  -1.6135     2.1534  -0.749  0.46702
## lateyearTRUE        NA         NA      NA       NA
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.824 on 13 degrees of freedom
## Multiple R-squared:  0.5972, Adjusted R-squared:  0.5353
## F-statistic: 9.639 on 2 and 13 DF,  p-value: 0.002709
```
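---

A minimal sketch of the matrix formula and of how linear dependence shows up in practice. It uses simulated data rather than `hibbs` so that it is self-contained; the seed, the coefficients used to simulate `y`, and all variable names (`x1`, `x2`, `x1.doubled`, `beta.hat`, `X.bad`, `fit.bad`) are illustrative assumptions. The estimate is computed by solving the normal equations `\(\mathbb{X}^T\mathbb{X}\,b = \mathbb{X}^TY\)`, which is equivalent to applying `\((\mathbb{X}^T\mathbb{X})^{-1}\mathbb{X}^TY\)` when the inverse exists.

```r
set.seed(1)   # arbitrary seed, only so the simulated data are reproducible
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)   # made-up data-generating process

# Design matrix with a leading column of ones for the intercept.
X <- cbind(1, x1, x2)

# OLS via the normal equations: solve (X'X) b = X'y for b.
beta.hat <- solve(crossprod(X), crossprod(X, y))

# This should agree with lm() up to floating-point error.
fit <- lm(y ~ x1 + x2)
all.equal(as.numeric(beta.hat), unname(coef(fit)))

# Add a column that is an exact multiple of x1. The columns are now
# linearly dependent, so the rank is smaller than the number of columns.
x1.doubled <- 2 * x1
X.bad <- cbind(X, x1.doubled)
qr(X.bad)$rank < ncol(X.bad)

# lm() detects the dependence and reports NA for the redundant coefficient.
fit.bad <- lm(y ~ x1 + x2 + x1.doubled)
coef(fit.bad)
```

Comparing `qr(X)$rank` with `ncol(X)` before fitting is one way to spot an exact dependence like the `doublegrowth` or `earlyyear`/`lateyear` cases before `lm()` reports it.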