class: center, middle, inverse, title-slide .title[ # 3 and 4: Regression as a Model ] .subtitle[ ## Linear Models ] .author[ ###
Jaye Seawright
] .institute[ ###
Northwestern Political Science
] .date[ ### Jan. 12 and 14, 2026 ] --- class: center, middle <style type="text/css"> pre { max-height: 400px; overflow-y: auto; } pre[class] { max-height: 200px; } </style> A good goal in data analysis is summarizing the average value we would expect an outcome variable to take on given the values of a set of predictor variables. --- Let's go back to the election and economy example from last time. We might ask: what is the average vote share for different levels of economic growth? --- ``` ## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ── ## ✔ forcats 1.0.1 ✔ stringr 1.6.0 ## ✔ lubridate 1.9.4 ✔ tibble 3.3.0 ## ✔ purrr 1.2.0 ✔ tidyr 1.3.1 ## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ── ## ✖ nlme::collapse() masks dplyr::collapse() ## ✖ mice::filter() masks dplyr::filter(), stats::filter() ## ✖ kableExtra::group_rows() masks dplyr::group_rows() ## ✖ dplyr::lag() masks stats::lag() ## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors ``` ``` r econvoteplot <- ggplot(hibbs, aes(x = growth, y = vote, label = year)) + geom_text(size = 3) + scale_x_continuous(labels = function(x) paste0(x, "%")) + scale_y_continuous(labels = function(x) paste0(x, "%")) + labs(title = "Forecasting the Election from the Economy", x = "Average recent growth in personal income", y = "Incumbent party's vote share") + geom_vline(xintercept = 1, color = "gray50", linetype = "dashed", alpha = 0.7) + geom_vline(xintercept = 2, color = "gray50", linetype = "dashed", alpha = 0.7) + geom_vline(xintercept = 3, color = "gray50", linetype = "dashed", alpha = 0.7) + theme_minimal() ``` --- <img src="RegressionModel_files/figure-html/unnamed-chunk-4-1.png" width="80%" style="display: block; margin: auto;" /> --- Imagine that we have an unlimited number of elections, such that we can make the vertical lines closer and closer together. 
Eventually, we can generate a smooth function relating the growth in income to an average vote share. --- ##Conditional Expectation Function (CEF) `$$\mu(x_{i}) = E(y|x_{i})$$` --- We don't know the CEF relating economic growth to incumbent vote share, because there aren't infinite past US elections. --- ##Approximating the CEF 1. Stepwise means --- ``` r means <- c( mean(hibbs$vote[hibbs$growth <= 1], na.rm = TRUE), mean(hibbs$vote[hibbs$growth > 1 & hibbs$growth <= 2], na.rm = TRUE), mean(hibbs$vote[hibbs$growth > 2 & hibbs$growth <= 3], na.rm = TRUE), mean(hibbs$vote[hibbs$growth > 3], na.rm = TRUE) ) ``` --- ``` r econvoteplot <- ggplot(hibbs, aes(x = growth, y = vote, label = year)) + geom_text(size = 3) + scale_x_continuous(labels = function(x) paste0(x, "%")) + scale_y_continuous(labels = function(x) paste0(x, "%")) + labs(title = "Forecasting the Election from the Economy", x = "Average recent growth in personal income", y = "Incumbent party's vote share") + geom_vline(xintercept = 1, color = "gray50", linetype = "dashed", alpha = 0.7) + geom_vline(xintercept = 2, color = "gray50", linetype = "dashed", alpha = 0.7) + geom_vline(xintercept = 3, color = "gray50", linetype = "dashed", alpha = 0.7) + geom_segment(aes(x = c(-Inf), xend = c(1), y = means[1], yend = means[1]), color = "blue", alpha = 0.7)+ geom_segment(aes(x = c(1), xend = c(2), y = means[2], yend = means[2]), color = "blue", alpha = 0.7)+ geom_segment(aes(x = c(2), xend = c(3), y = means[3], yend = means[3]), color = "blue", alpha = 0.7)+ geom_segment(aes(x = c(3), xend = c(Inf), y = means[4], yend = means[4]), color = "blue", alpha = 0.7)+ theme_minimal() ``` --- <img src="RegressionModel_files/figure-html/unnamed-chunk-7-1.png" width="80%" style="display: block; margin: auto;" /> --- ``` r means ``` ``` ## [1] 47.89600 51.64333 51.64250 57.97500 ``` --- ##Approximating the CEF 1. Stepwise means 2. 
Assumptions --- Suppose that we somehow knew: `$$\mu(x_{i}) = E(y|x_{i}) = a + b x_{i}$$` --- What's a good way to figure out the line that we should use to fill in the two unknowns? --- ``` r econvoteplot <- ggplot(hibbs, aes(x = growth, y = vote, label = year)) + geom_text(size = 3) + scale_x_continuous(labels = function(x) paste0(x, "%")) + scale_y_continuous(labels = function(x) paste0(x, "%")) + labs(title = "Forecasting the Election from the Economy", x = "Average recent growth in personal income", y = "Incumbent party's vote share") + geom_abline(intercept = 58, slope = -5, color = "purple") + theme_minimal() ``` --- <img src="RegressionModel_files/figure-html/unnamed-chunk-10-1.png" width="80%" style="display: block; margin: auto;" /> --- ``` r econvoteplot <- ggplot(hibbs, aes(x = growth, y = vote, label = year)) + geom_text(size = 3) + scale_x_continuous(labels = function(x) paste0(x, "%")) + scale_y_continuous(labels = function(x) paste0(x, "%")) + labs(title = "Forecasting the Election from the Economy", x = "Average recent growth in personal income", y = "Incumbent party's vote share") + geom_abline(intercept = 46, slope = 4, color = "salmon") + theme_minimal() ``` --- <img src="RegressionModel_files/figure-html/unnamed-chunk-12-1.png" width="80%" style="display: block; margin: auto;" /> --- We don't want to draw lines arbitrarily. We want the *best linear approximation of the CEF.* --- `$$m(x_{i}) = \beta_{0} + \beta_{1} x_{i}$$` where: `$$(\beta_{0}, \beta_{1}) = \text{arg min}_{\beta_{0}, \beta_{1}\in \mathbb{R}^{2}}E\left[(y_{i} - \beta_{0} - \beta_{1} x_{i})^{2}\right]$$` --- The version of `\(m(x_{i})\)` from the last slide is often called the *best linear predictor (BLP) of Y on X.* --- If the CEF is linear, then the BLP will equal the CEF. If not, they will be different, sometimes very different.
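---

In the bivariate case, the minimization has a closed-form solution: `\(\beta_{1} = \text{Cov}(x, y)/\text{Var}(x)\)` and `\(\beta_{0} = E[y] - \beta_{1}E[x]\)`. A quick sketch with simulated data (the variable names and numbers here are illustrative, not from the election data):

``` r
# Closed-form BLP coefficients versus lm(), on simulated data
set.seed(42)
x <- rnorm(1000)
y <- 50 + 3 * x + rnorm(1000, sd = 5)

# Plug-in (sample) versions of the population formulas
beta1 <- cov(x, y) / var(x)
beta0 <- mean(y) - beta1 * mean(x)

# lm() minimizes the same squared-error criterion in-sample,
# so its coefficients match the plug-in formulas exactly
fit <- lm(y ~ x)
c(beta0, beta1)
coef(fit)
```

The `(n - 1)` terms in the sample covariance and variance cancel, so the ratio is exactly the OLS slope.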
--- ``` r #devtools::install_github("ropengov/rqog") library(rqog) qogts <- read_qog(which_data = "standard", data_type = "time-series") ``` ``` ## Local file not found. ## Downloading QoG qog_std_ts_jan23.csv data ## from http://www.qogdata.pol.gu.se/data/qog_std_ts_jan23.csv ## in file: C:\Users\jws780\AppData\Local\Temp\RtmpcfQ0gp/rqog/qog_std_ts_jan23.csv ``` ``` ## Reading cache file C:\Users\jws780\AppData\Local\Temp\RtmpcfQ0gp/rqog/qog_std_ts_jan23.csv ``` --- ``` r demeconplot <- ggplot(qogts, aes(x = wdi_gdpcappppcon2017, y = vdem_libdem)) + geom_point() + scale_x_continuous() + scale_y_continuous() + labs(title = "Modernization Theory and Democracy", x = "Per Capita GDP in Constant 2017 Dollars", y = "VDem Liberal Democracy Score") + theme_minimal() ``` --- <img src="RegressionModel_files/figure-html/unnamed-chunk-15-1.png" width="80%" style="display: block; margin: auto;" /> --- ``` r demeconplot.shapes <- ggplot(qogts, aes(x = wdi_gdpcappppcon2017, y = vdem_libdem)) + geom_point() + scale_x_continuous() + scale_y_continuous() + labs(title = "Modernization Theory and Democracy", x = "Per Capita GDP in Constant 2017 Dollars", y = "VDem Liberal Democracy Score") + geom_smooth(method=lm, color="purple") + geom_smooth(method=loess, color="blue") + theme_minimal() ``` --- ``` ## `geom_smooth()` using formula = 'y ~ x' ## `geom_smooth()` using formula = 'y ~ x' ``` <img src="RegressionModel_files/figure-html/unnamed-chunk-17-1.png" width="60%" style="display: block; margin: auto;" /> --- We can generalize the best linear predictor to multivariate situations: `$$m(\mathbb{X}) = \beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} + \cdots + \beta_{k} X_{k}$$` --- ``` r coords <- read.csv("https://gist.githubusercontent.com/metal3d/5b925077e66194551df949de64e910f6/raw/c5f20a037409d96958553e2eb6b8251265c6fd63/country-coord.csv") qogts$latitude <- NA for (i in 1:nrow(qogts)){ if (qogts$ccodealp[i] %in% coords$Alpha.3.code) qogts$latitude[i] <- 
coords$Latitude..average.[coords$Alpha.3.code==qogts$ccodealp[i]] } ``` --- ``` r demlatplot.shapes <- ggplot(qogts, aes(x = latitude, y = vdem_libdem)) + geom_point() + scale_x_continuous() + scale_y_continuous() + labs(title = "Geography and Democracy", x = "Latitude", y = "VDem Liberal Democracy Score") + geom_smooth(method=lm, color="purple") + geom_smooth(method=loess, color="blue") + theme_minimal() ``` --- ``` ## `geom_smooth()` using formula = 'y ~ x' ## `geom_smooth()` using formula = 'y ~ x' ``` <img src="RegressionModel_files/figure-html/unnamed-chunk-20-1.png" width="70%" style="display: block; margin: auto;" /> --- ``` r dem.lm <- lm(vdem_libdem ~ latitude + wdi_gdpcappppcon2017, data = qogts) ``` --- <table style="NAborder-bottom: 0; width: auto !important; margin-left: auto; margin-right: auto; font-size: 24px; width: auto !important; margin-left: auto; margin-right: auto;" class="table table table-striped table-hover table-condensed"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:center;"> Modernization and Latitude </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:center;"> 0.29919*** </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (0.00466) </td> </tr> <tr> <td style="text-align:left;"> latitude </td> <td style="text-align:center;"> 0.00042** </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (0.00015) </td> </tr> <tr> <td style="text-align:left;"> wdi_gdpcappppcon2017 </td> <td style="text-align:center;"> 0.00001*** </td> </tr> <tr> <td style="text-align:left;box-shadow: 0px 1.5px"> </td> <td style="text-align:center;box-shadow: 0px 1.5px"> (0.00000) </td> </tr> <tr> <td style="text-align:left;"> Num.Obs. </td> <td style="text-align:center;"> 5030 </td> </tr> <tr> <td style="text-align:left;"> R2 </td> <td style="text-align:center;"> 0.226 </td> </tr> <tr> <td style="text-align:left;"> R2 Adj. 
</td> <td style="text-align:center;"> 0.226 </td> </tr> <tr> <td style="text-align:left;"> AIC </td> <td style="text-align:center;"> −280.3 </td> </tr> <tr> <td style="text-align:left;"> BIC </td> <td style="text-align:center;"> −254.2 </td> </tr> <tr> <td style="text-align:left;"> Log.Lik. </td> <td style="text-align:center;"> 144.127 </td> </tr> <tr> <td style="text-align:left;"> RMSE </td> <td style="text-align:center;"> 0.24 </td> </tr> </tbody> <tfoot><tr><td style="padding: 0; " colspan="100%"> <sup></sup> + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001</td></tr></tfoot> </table> --- ###Population * The CEF and the BLP are at the population level. * A good (rough) way to think about this is that they are what you would get if you had infinite cases measured perfectly. --- ###Model * Even at the population level, we sometimes use models. * A *model* is a simplified mathematical representation that stands in for and helps explore a more complex, often unknown, and harder to understand true descriptive relationship. --- * If the CEF involves multiple predictor variables, the model will involve a relationship through a scatterplot in a space that is more than two-dimensional. --- ``` r library(plotly) ``` ``` ## ## Attaching package: 'plotly' ``` ``` ## The following object is masked from 'package:ggplot2': ## ## last_plot ``` ``` ## The following object is masked from 'package:stats': ## ## filter ``` ``` ## The following object is masked from 'package:graphics': ## ## layout ``` ``` r iv.1 <- seq(-3, 3, length.out = 50) iv.2 <- seq(-3, 3, length.out = 50) dv <- outer(iv.1, iv.2, function(iv.1, iv.2) iv.1^3 - iv.2^2) ourplot3d <- plot_ly(x = ~iv.1, y = ~iv.2, z = ~dv, type = "surface") %>% layout(scene = list( xaxis = list(title = "First IV"), yaxis = list(title = "Second IV"), zaxis = list(title = "DV") )) ``` ---
--- ``` r df <- expand.grid(iv.1 = iv.1, iv.2 = iv.2) df$dv <- as.vector(dv) lm_model <- lm(dv ~ iv.1 + iv.2, data = df) dv_blp <- matrix(predict(lm_model, newdata = df), nrow = length(iv.1), ncol = length(iv.2)) ``` --- ``` r ourplot3d <- plot_ly() %>% # Original surface add_surface(x = ~iv.1, y = ~iv.2, z = ~dv, colorscale = "Blues", opacity = 0.8, name = "Original Function") %>% # BLP surface add_surface(x = ~iv.1, y = ~iv.2, z = ~dv_blp, colorscale = "Reds", opacity = 0.8, name = "Best Linear Predictor") %>% layout( scene = list( xaxis = list(title = "First IV"), yaxis = list(title = "Second IV"), zaxis = list(title = "DV") ), legend = list(x = 0.8, y = 0.9) ) ``` ---
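One way to quantify how rough the linear approximation is here: rebuild the cubic surface and its fitted plane, then measure the gap between them (a sketch on the same grid as the previous slides; the summary lines are my additions):

``` r
# Rebuild the cubic surface and the in-sample BLP plane
iv.1 <- seq(-3, 3, length.out = 50)
iv.2 <- seq(-3, 3, length.out = 50)
df <- expand.grid(iv.1 = iv.1, iv.2 = iv.2)
df$dv <- df$iv.1^3 - df$iv.2^2
lm_model <- lm(dv ~ iv.1 + iv.2, data = df)

# Share of the surface's variance the plane captures
summary(lm_model)$r.squared
# Worst-case vertical distance between plane and surface
max(abs(df$dv - fitted(lm_model)))
```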
--- ###Regularity Conditions For the BLP to exist, the following must be true: * `\(E[Y^2] < \infty\)` * `\(E ||\mathbb{X}||^2 < \infty\)` * The columns of `\(\mathbb{X}\)` are linearly independent. --- It may seem weird to think of distributions with non-finite means/variances, but they do exist! The classic example is the Cauchy distribution, which is also the Student's T distribution with 1 degree of freedom. --- ``` r rcauchy(100) ``` ``` ## [1] 101.33637551 -20.94673976 0.48675717 -28.74199020 0.28324341 ## [6] 3.02319061 -0.69188778 -0.89587526 0.16984163 -0.41896682 ## [11] -0.01933092 -0.65683256 -2.42876525 -0.72531965 -2.19129340 ## [16] 16.59442818 9.60488813 -2.76820466 -4.03504129 -1.09807100 ## [21] -19.80211151 -0.38752697 0.21396960 5.96461360 -2.44220680 ## [26] -4.47208532 -0.02042096 -0.03683082 4.70831664 -0.60184240 ## [31] -212.04196352 -0.37178356 0.60360803 0.59500936 0.85762694 ## [36] 0.39131709 -0.73231646 2.73130186 0.96058210 -2.13941611 ## [41] -13.69917356 0.43534824 0.18095131 -150.44789776 0.85090413 ## [46] 2.05575846 4.18498108 4.05078612 -1.44145419 0.93549964 ## [51] -0.02685282 -0.23582261 -8.52340366 -5.02406196 0.58549198 ## [56] -1.25503927 5.77955018 0.98982200 -0.07887711 -2.15321006 ## [61] -0.20661371 -0.01242668 -8.03294024 1.47929947 -5.42638836 ## [66] 0.82696426 0.72470192 -1.05235382 -0.63486826 0.52197199 ## [71] -1.78328363 -0.28883178 0.37767367 1.17931372 1.48075792 ## [76] 1.05443301 -1.18684950 0.26651125 -1.53378823 0.31971810 ## [81] -1.37373689 0.89205552 0.52339569 -4.50300060 2.43229343 ## [86] -23.64005052 0.71723965 2.49080550 -0.94268686 1.67979731 ## [91] 0.35767551 9.86436893 0.27310610 -1.87827537 0.84039585 ## [96] 0.51945361 5.08238671 0.22852281 4.40180569 0.15480686 ``` ---  ---  ---  --- 
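We can see the consequence in simulation: because the Cauchy has no finite mean, the running average of draws never settles down the way it does for a distribution with a finite mean (a sketch; the object names are mine):

``` r
set.seed(7)
n <- 1e5

# Running mean of Cauchy draws: gets yanked around by occasional
# enormous values, no matter how large n becomes
cauchy_mean <- cumsum(rcauchy(n)) / seq_len(n)

# Running mean of standard normal draws: converges toward 0
normal_mean <- cumsum(rnorm(n)) / seq_len(n)

tail(cauchy_mean, 1)
tail(normal_mean, 1)
```

Rerun with different seeds: the final normal running mean is always near 0, while the Cauchy one lands somewhere different each time.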