class: center, middle, inverse, title-slide .title[ # 10: Projection, Outliers, and Influence. ] .subtitle[ ## Linear Models ] .author[ ###
Jaye Seawright
] .institute[ ###
Northwestern Political Science
] .date[ ### Feb. 9, 2026 ] --- class: center, middle <style type="text/css"> pre { max-height: 400px; overflow-y: auto; } pre[class] { max-height: 200px; } </style> ---
---

`$$\hat Y = \mathbb{X} \hat\beta$$`

`$$\hat\beta = (\mathbb{X}^T\mathbb{X})^{-1}\mathbb{X}^TY$$`

`$$\hat Y = \mathbb{X} (\mathbb{X}^T\mathbb{X})^{-1}\mathbb{X}^TY$$`

We call `\(\mathbb{X} (\mathbb{X}^T\mathbb{X})^{-1}\mathbb{X}^T\)` the *projection matrix* (`\(\mathbb{P_X}\)`) or the *hat matrix* (`\(\mathbb{H}\)`), because it maps the observed `\(Y\)` onto the fitted values `\(\hat Y\)`.

---

One common concern in regression is that one or another observation may be throwing the results off (due to measurement error or other idiosyncrasies). We can operationalize this concern by looking at how much the regression results change if we leave out one observation.

---

Define `\(\hat \beta_{-i}\)` to be the regression estimate computed omitting observation `\(i\)`. Then we are interested in cases with a large `\(|\hat \beta_{-i} - \hat \beta|\)`.

---

When will cases have large values of `\(|\hat \beta_{-i} - \hat \beta|\)`?

---

1\. Cases with `\(\mathbf{X}_i\)` near `\(\bar{\mathbf{X}}\)` have low *leverage*, because regression with an intercept minimizes squared residuals subject to passing through `\((\bar{\mathbf{X}}, \bar{y})\)`. Imagine a seesaw pivoted at its center: weights near the pivot have less effect on its tilt than weights at the ends. Similarly, observations near the multidimensional mean provide less "leverage" to change the regression coefficients.

---

2\. The size of the residual in the full model is the other key component of an observation's influence on the regression. Observations with large residuals are often called *outliers*.

---

We can measure the leverage of observation `\(i\)` as `\(h_{ii}\)`, the `\(i\)`th diagonal entry of the projection matrix `\(\mathbb{P_X}\)`.

---

We could simply use raw residuals as measures of outlyingness, but that won't get us very far, because the model is built to minimize them.
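---

The hat matrix is easy to compute directly. As a minimal sketch (with simulated data and illustrative names, not the turnout data used below), we can build `\(\mathbb{P_X}\)` from the formula and confirm that its diagonal matches R's built-in `hatvalues()`:

``` r
set.seed(10)
x <- rnorm(20)
y <- 1 + 2 * x + rnorm(20)
X <- cbind(1, x)                       # design matrix with an intercept column
P <- X %*% solve(t(X) %*% X) %*% t(X)  # projection (hat) matrix
h_manual <- diag(P)                    # leverages h_ii
fit <- lm(y ~ x)
all.equal(h_manual, unname(hatvalues(fit)))  # TRUE
```

Because `\(\mathbb{P_X}\)` is a projection, its diagonal entries sum to the number of columns of `\(\mathbb{X}\)` (here 2), so leverages average `\(k/n\)`.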
---

Instead, we can calculate the *leave-one-out prediction error* `\(\tilde e_{i} = \frac{\hat e_{i}}{1 - h_{ii}}\)`: the error we make in predicting case `\(i\)` from a regression fit without case `\(i\)`. Conveniently, the leave-one-out coefficients have a closed form, so no refitting is required:

`$$\hat \beta_{-i} = \hat \beta - (\mathbb{X}^T\mathbb{X})^{-1}\mathbf{X}_i \tilde e_{i}$$`

---

A common question is how much the regression coefficients change when we drop an observation, which we measure with `\(DFBETA_i\)`:

`$$\hat \beta - \hat \beta_{-i} = (\mathbb{X}^T\mathbb{X})^{-1}\mathbf{X}_i \tilde e_{i}$$`

---

``` r
turnouttrim <- na.omit(turnout)
turnoutlm <- lm(Turnout ~ Temperature + GDP, data = turnouttrim)
dfbeta(turnoutlm)
```

```
##      (Intercept)   Temperature           GDP
## 1  -0.0354547495  8.320240e-04 -6.045881e-10
## 2   0.3517720750 -6.738982e-03  9.300914e-10
## 3   0.0276483992 -4.287551e-04 -2.640793e-10
## 4  -0.2156248840  4.334440e-03 -1.250548e-09
## 5   0.0043337671  5.194103e-06 -2.831327e-10
## 6  -0.0418004002  9.347027e-04 -5.578496e-10
## 7  -0.2189355904  4.358779e-03 -1.128435e-09
## 8  -0.0103370360  2.206132e-04 -1.026639e-10
## 9   0.0232199873 -4.334897e-04  2.369789e-11
## 10 -0.0082551280  1.379988e-04  4.510268e-11
## 11 -0.0003286742  4.309212e-06  5.724999e-12
## 12  0.0034821259 -1.618035e-04  3.225179e-10
## 13  0.0225963609 -5.354617e-04  3.984230e-10
## 14 -0.0211259560  3.755740e-04  3.963523e-11
## 15 -0.2243833412  4.310089e-03 -6.338325e-10
## 16  0.0364781532 -7.510860e-04  2.694571e-10
## 17  0.0702585219 -1.399843e-03  3.643730e-10
## 18 -0.1278084284  2.429909e-03 -2.838752e-10
## 19 -0.0506879751  9.187643e-04  3.003035e-11
## 20  0.0012316429 -1.845522e-05 -1.277669e-11
## 21  0.0026707660 -6.736015e-05  5.737264e-11
## 22 -0.0070742344  1.505187e-04 -6.505698e-11
## 23  0.0293962186 -5.650572e-04  8.765838e-11
## 24  0.0086766318 -1.621923e-04  1.345977e-11
## 25  0.0292205969 -6.037926e-04  1.992400e-10
## 26  0.0001733888 -3.937453e-05  7.905849e-11
## 27  0.0475260177 -9.670696e-04  2.407704e-10
## 28 -0.0421146396  7.899745e-04 -1.486486e-10
## 29  0.0771053712 -1.555237e-03  2.651505e-10
## 30  0.0352213437 -7.071941e-04  8.765526e-11
## 31  0.1100542420 -2.187261e-03  2.250519e-10
## 32  0.0268843315 -5.453733e-04 -1.200903e-10
## 33 -0.0138853673  2.685754e-04 -1.106134e-10
## 34  0.0143877381 -2.804290e-04  1.091595e-10
## 35  0.0393107276 -7.680867e-04 -1.339142e-11
## 36  0.0028499286 -5.536782e-05 -1.330099e-11
## 37 -0.0120599270  2.128907e-04  8.826479e-10
## 38  0.0825497800 -1.657823e-03  1.467265e-09
```

---

``` r
summary(turnoutlm)
```

```
## 
## Call:
## lm(formula = Turnout ~ Temperature + GDP, data = turnouttrim)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.13323 -0.05916 -0.01721  0.02748  0.19650 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  5.751e-01  5.458e-01   1.054    0.299
## Temperature  9.409e-04  1.066e-02   0.088    0.930
## GDP         -3.525e-09  3.006e-09  -1.173    0.249
## 
## Residual standard error: 0.09201 on 35 degrees of freedom
## Multiple R-squared:  0.06614, Adjusted R-squared:  0.01278 
## F-statistic: 1.239 on 2 and 35 DF,  p-value: 0.3019
```

---

``` r
dfbetas_vals <- data.frame(dfbetas(turnoutlm))

# Plot
library(ggplot2)
dfbetasplot <- ggplot(dfbetas_vals) +
  geom_point(aes(x = 1:nrow(dfbetas_vals), y = Temperature)) +
  geom_segment(aes(x = 1:nrow(dfbetas_vals), xend = 1:nrow(dfbetas_vals),
                   y = 0, yend = Temperature),
               color = 'cornflowerblue') +
  geom_hline(yintercept = c(2, -2) / sqrt(nrow(dfbetas_vals)), color = 'salmon') +
  labs(x = 'Observation index', y = 'DFBETAS',
       title = paste0('DFBETAS values for coefficient of ',
                      colnames(dfbetas_vals)[2]),
       subtitle = 'Thresholds are at \u00B1(2\u00F7\u221An)') +
  geom_text(aes(x = 1:nrow(dfbetas_vals),
                y = ifelse(Temperature > 0, Temperature + .05, Temperature - .05),
                label = ifelse(abs(Temperature) > (2 / sqrt(nrow(dfbetas_vals))),
                               paste0(round(Temperature, digits = 2),
                                      ' (', 1:nrow(dfbetas_vals), ')'),
                               '')))
```

---

<img src="ProjectionOutliersInfluence_files/figure-html/unnamed-chunk-7-1.png" width="70%" style="display: block; margin: auto;" />

---

``` r
print(turnouttrim, n = 15)
```

```
## # A tibble: 38 × 4
##     Year Turnout Temperature   GDP
##    <dbl>   <dbl>       <dbl> <dbl>
##  1  1876    0.82        51.5  7910
##  2  1880    0.79        48.6 10510
##  3  1884    0.78        51   10970
##  4  1888    0.79        53   12420
##  5  1892    0.75        51.2 14300
##  6  1896    0.79        51.6 13300
##  7  1900    0.73        53.9 18700
##  8  1904    0.65        51.8 22900
##  9  1908    0.65        50.1 27700
## 10  1912    0.59        50.9 39400
## 11  1916    0.62        51.1 48300
## 12  1920    0.49        51.3 91500
## 13  1924    0.49        51.5 84700
## 14  1928    0.57        50.7 97000
## 15  1932    0.53        48.2 58500
## # ℹ 23 more rows
```
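---

The DFBETA formula is just an algebraic shortcut: it gives exactly what we would get by actually refitting the model without observation `\(i\)`. A minimal check with simulated data (the variable names here are illustrative):

``` r
set.seed(11)
x <- rnorm(25)
y <- 1 + 0.5 * x + rnorm(25)
fit <- lm(y ~ x)

X <- model.matrix(fit)
h <- hatvalues(fit)  # leverages h_ii
e <- resid(fit)      # full-sample residuals

# Shortcut: beta_hat - beta_hat_{-i} = (X'X)^{-1} X_i * e_i / (1 - h_ii)
i <- 7
shortcut <- as.vector(solve(t(X) %*% X) %*% X[i, ]) * e[i] / (1 - h[i])

# Brute force: refit the model without observation i
brute <- coef(fit) - coef(lm(y[-i] ~ x[-i]))

all.equal(unname(shortcut), unname(brute))  # TRUE
```

This identity is what lets `dfbeta()` report the change for every observation at once without refitting the model `\(n\)` times.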