class: center, middle, inverse, title-slide .title[ # 7: Omitted variable bias. Nonlinearity ] .subtitle[ ## Linear Models ] .author[ ###
Jaye Seawright
] .institute[ ###
Northwestern Political Science
] .date[ ### Jan. 28, 2026 ]

---
class: center, middle

<style type="text/css">
pre { max-height: 400px; overflow-y: auto; }
pre[class] { max-height: 200px; }
</style>

Consider the following two linear predictors:

`$$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + u$$`

`$$y = \beta^*_0 + \beta^*_1x_1 + u^*$$`

---

How does `\(\beta^*_1x_1\)` relate to `\(\beta_1x_1\)`?

---

Remember the regression formula:

`$$\beta^*_1 = \frac{\text{Cov}(x_1, y)}{\text{Var}(x_1)}$$`

---

Let's plug in our more complicated linear predictor:

`$$\beta^*_1 = \frac{\text{Cov}(x_1, \beta_0 + \beta_1x_1 + \beta_2x_2 + u)}{\text{Var}(x_1)}$$`

---

Covariance is linear in each argument, so we can split the numerator term by term:

`$$\scriptsize\beta^*_1 = \frac{\text{Cov}(x_1, \beta_0) + \text{Cov}(x_1, \beta_1x_1) + \text{Cov}(x_1, \beta_2x_2) + \text{Cov}(x_1, u)}{\text{Var}(x_1)}$$`

---

1. `\(\text{Cov}(x_1, \beta_0) = 0\)` because `\(\beta_0\)` is a constant.
2. `\(\text{Cov}(x_1, \beta_1x_1) = \beta_1 \text{Var} (x_1)\)` because `\(\text{Cov}(x_1, x_1) = \text{Var}(x_1)\)`.
3. `\(\text{Cov}(x_1, \beta_2x_2) = \beta_2 \text{Cov}(x_1, x_2)\)`
4. `\(\text{Cov}(x_1, u) = 0\)` because `\(u\)` is the error of the linear predictor, which is uncorrelated with the regressors by construction.

---

`$$\beta^*_1 = \frac{\beta_1 \text{Var} (x_1) + \beta_2 \text{Cov}(x_1, x_2)}{\text{Var}(x_1)}$$`

`$$\beta^*_1 = \frac{\beta_1 \text{Var} (x_1)}{\text{Var}(x_1)} + \frac{\beta_2 \text{Cov}(x_1, x_2)}{\text{Var}(x_1)}$$`

`$$\beta^*_1 = \beta_1 + \beta_2 \frac{\text{Cov}(x_1, x_2)}{\text{Var}(x_1)}$$`

---

This term:

`$$\beta_2 \frac{\text{Cov}(x_1, x_2)}{\text{Var}(x_1)}$$`

is the coefficient on `\(x_2\)` in the larger predictor times the slope from a regression of `\(x_2\)` on `\(x_1\)`. It is zero only when `\(\beta_2 = 0\)` or when `\(x_1\)` and `\(x_2\)` are uncorrelated.

---

If we think of the larger linear predictor, with both `\(x\)` variables, as in some sense the "true" model, then this difference between the models is a bias: the coefficient on `\(x_1\)` in the smaller model is off by the amount shown on the previous slide.
People call this *omitted variable bias.*

---

``` r
head(turnout)
```

```
## # A tibble: 6 × 4
##    Year Turnout Temperature   GDP
##   <dbl>   <dbl>       <dbl> <dbl>
## 1  1828    0.58          NA   881
## 2  1832    0.55          NA  1155
## 3  1836    0.58          NA  1896
## 4  1840    0.8           NA  1672
## 5  1844    0.79          NA  1629
## 6  1848    0.73          NA  2139
```

---

``` r
lm(Turnout ~ GDP, data=turnout)
```

```
##
## Call:
## lm(formula = Turnout ~ GDP, data = turnout)
##
## Coefficients:
## (Intercept)          GDP
##   6.499e-01   -4.975e-09
```

---

``` r
summary(lm(Turnout ~ GDP, data=turnout))
```

```
##
## Call:
## lm(formula = Turnout ~ GDP, data = turnout)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.15949 -0.07948 -0.01742  0.08628  0.17013
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  6.499e-01  1.540e-02  42.202   <2e-16 ***
## GDP         -4.975e-09  2.177e-09  -2.285   0.0268 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09805 on 48 degrees of freedom
## Multiple R-squared:  0.09811, Adjusted R-squared:  0.07932
## F-statistic: 5.222 on 1 and 48 DF,  p-value: 0.02677
```

---

``` r
lm(Turnout ~ GDP + Temperature, data=turnout)
```

```
##
## Call:
## lm(formula = Turnout ~ GDP + Temperature, data = turnout)
##
## Coefficients:
## (Intercept)          GDP  Temperature
##   5.751e-01   -3.525e-09    9.409e-04
```

---

``` r
summary(lm(Turnout ~ GDP + Temperature, data=turnout))
```

```
##
## Call:
## lm(formula = Turnout ~ GDP + Temperature, data = turnout)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.13323 -0.05916 -0.01721  0.02748  0.19650
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  5.751e-01  5.458e-01   1.054    0.299
## GDP         -3.525e-09  3.006e-09  -1.173    0.249
## Temperature  9.409e-04  1.066e-02   0.088    0.930
##
## Residual standard error: 0.09201 on 35 degrees of freedom
##   (12 observations deleted due to missingness)
## Multiple R-squared:  0.06614, Adjusted R-squared:  0.01278
## F-statistic: 1.239 on 2 and 35 DF,  p-value: 0.3019
```

---

``` r
-4.975e-09 - (-3.525e-09)
```

```
## [1] -1.45e-09
```

---

One question we might reasonably have is whether the relationship between GDP and turnout is actually linear.

---

``` r
turnoutlm <- lm(Turnout ~ GDP, data=turnout)
```

---

``` r
plot(turnoutlm, which = 1)
```

<img src="OmittedVariableBias_files/figure-html/unnamed-chunk-9-1.png" width="70%" style="display: block; margin: auto;" />

---

``` r
turnoutlm.nonlinear <- lm(Turnout ~ GDP + I(GDP^2) + I(GDP^3), data=turnout)
```

---

``` r
summary(turnoutlm.nonlinear)
```

```
##
## Call:
## lm(formula = Turnout ~ GDP + I(GDP^2) + I(GDP^3), data = turnout)
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -0.175003 -0.049561 -0.004461  0.061803  0.151498
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  6.689e-01  1.531e-02  43.685  < 2e-16 ***
## GDP         -4.591e-08  1.371e-08  -3.350  0.00162 **
## I(GDP^2)     3.769e-15  1.440e-15   2.616  0.01198 *
## I(GDP^3)    -7.973e-23  3.613e-23  -2.207  0.03237 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09022 on 46 degrees of freedom
## Multiple R-squared:  0.2682, Adjusted R-squared:  0.2204
## F-statistic: 5.618 on 3 and 46 DF,  p-value: 0.002286
```

---

``` r
library(ggplot2)
library(ggExtra)

beta0 <- turnoutlm.nonlinear$coefficients[1]  # Intercept
beta1 <- turnoutlm.nonlinear$coefficients[2]  # GDP
beta2 <- turnoutlm.nonlinear$coefficients[3]  # GDP^2
beta3 <- turnoutlm.nonlinear$coefficients[4]  # GDP^3

gdp_seq <- seq(min(turnout$GDP), max(turnout$GDP), length.out = 1000)
tempdensity <- data.frame(simgdp = runif(1000,min(turnout$GDP),1e+07),
                          simturnout = sample(turnout$Turnout, 1000, replace=TRUE))

# Calculate predicted turnout using the cubic polynomial
turnout_pred <- beta0 + beta1 * gdp_seq + beta2 * gdp_seq^2 + beta3 * gdp_seq^3

# Create data frame for plotting
plot_data <- data.frame(GDP = gdp_seq, Turnout = turnout_pred)

# Create the plot
turnoutplot <- ggplot(plot_data, aes(x = GDP, y = Turnout)) +
  geom_line(linewidth = 1.2, color = "steelblue") +
  labs(
    title = "Estimated Nonlinear CEF: Turnout ~ GDP",
    x = "GDP",
    y = "Predicted Turnout"
  ) +
  geom_point(data=tempdensity, aes(x=simgdp, y=simturnout), alpha = 0) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 9),
    panel.grid.minor = element_blank()
  ) +
  scale_y_continuous(limits = c(min(turnout_pred) - 0.1, max(turnout_pred) + 0.1),
                     breaks = seq(0, 1, by = 0.1))
```

---

<img src="OmittedVariableBias_files/figure-html/unnamed-chunk-13-1.png" width="70%" style="display: block; margin: auto;" />

---

<img src="OmittedVariableBias_files/figure-html/unnamed-chunk-14-1.png" width="70%" style="display: block; margin: auto;" />

---

<img src="OmittedVariableBias_files/figure-html/unnamed-chunk-15-1.png" width="70%" style="display: block; margin: auto;" />

---

<img src="OmittedVariableBias_files/figure-html/unnamed-chunk-16-1.png" width="70%" style="display: block; margin: auto;" />

---

<img src="OmittedVariableBias_files/figure-html/unnamed-chunk-17-1.png" width="70%" style="display: block; margin: auto;" />

---

<img src="OmittedVariableBias_files/figure-html/unnamed-chunk-18-1.png" width="70%" style="display: block; margin: auto;" />

---
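Returning to the omitted variable bias formula derived earlier, `\(\beta^*_1 = \beta_1 + \beta_2 \, \text{Cov}(x_1, x_2) / \text{Var}(x_1)\)`: a quick way to build intuition is to check it on simulated data. The sketch below is only an illustration. The sample size, the seed, the coefficient values, and the variable names (`x1`, `x2`, `y`, `long`, `short`, `aux`) are all assumptions made up for this example and are not part of the turnout data.

```r
# Minimal simulation sketch of the omitted variable bias formula.
# All values and names below are illustrative assumptions.
set.seed(7)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n)              # x2 is correlated with x1
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n)   # outcome depends on both predictors

long  <- lm(y ~ x1 + x2)   # larger predictor: estimates beta_1 and beta_2
short <- lm(y ~ x1)        # smaller predictor: estimates beta*_1
aux   <- lm(x2 ~ x1)       # slope is Cov(x1, x2) / Var(x1)

b1      <- unname(coef(long)["x1"])
b2      <- unname(coef(long)["x2"])
delta   <- unname(coef(aux)["x1"])
b1_star <- unname(coef(short)["x1"])

# Formula from the derivation: beta*_1 = beta_1 + beta_2 * delta
implied <- b1 + b2 * delta

# The short-regression slope and the formula's value should agree
print(c(short = b1_star, implied = implied, bias = b2 * delta))
```

In-sample, the two numbers agree up to floating-point error, but only when all three regressions use exactly the same observations. That caveat matters for the turnout example: the regression with `Temperature` drops 12 observations with missing values, so the raw difference of `-1.45e-09` between the two GDP coefficients mixes omitted variable bias with a change of sample. Refitting the GDP-only model on the complete cases would likely give a cleaner comparison.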