
class: center, middle, inverse, title-slide

.title[
# 7: Omitted variable bias. Nonlinearity
]
.subtitle[
## Linear Models
]
.author[
### <large>Jaye Seawright</large>
]
.institute[
### <small>Northwestern Political Science</small>
]
.date[
### Jan. 28, 2026
]

---

class: center, middle


Consider the following two linear predictors:

`$$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + u$$`

`$$y = \beta^*_0 + \beta^*_1x_1 + u^*$$`

---

How does `$\beta^*_1x_1$` relate to `$\beta_1x_1$`?

---

Remember the formula for the slope in a regression with one predictor:

`$$\beta^*_1 = \frac{\text{Cov}(x_1, y)}{\text{Var}(x_1)}$$`

---

Let's plug in our more complicated linear predictor:

`$$\beta^*_1 = \frac{\text{Cov}(x_1, \beta_0 + \beta_1x_1 + \beta_2x_2 + u)}{\text{Var}(x_1)}$$`

---

Covariance is linear in each argument, so we can expand the numerator term by term:

`$$\scriptsize\beta^*_1 = \frac{\text{Cov}(x_1, \beta_0) + \text{Cov}(x_1, \beta_1x_1) + \text{Cov}(x_1, \beta_2x_2) + \text{Cov}(x_1, u)}{\text{Var}(x_1)}$$`

---

1. `$\text{Cov}(x_1, \beta_0) = 0$` because `$\beta_0$` is a constant.

2. `$\text{Cov}(x_1, \beta_1x_1) = \beta_1 \text{Cov}(x_1, x_1) = \beta_1 \text{Var} (x_1)$`

3. `$\text{Cov}(x_1, \beta_2x_2) = \beta_2 \text{Cov}(x_1, x_2)$`

4. `$\text{Cov}(x_1, u) = 0$` because the error of a linear predictor is uncorrelated with the variables in it.

---
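
The four facts on the previous slide can be checked numerically. This is a minimal sketch on simulated data: the sample size, seed, and coefficients are arbitrary assumptions, and the OLS residual stands in for the error `$u$`.

``` r
set.seed(7)                          # arbitrary seed, for reproducibility only
n  <- 500
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)            # x2 is correlated with x1 by construction
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n) # illustrative coefficients (assumed)

fit <- lm(y ~ x1 + x2)
b   <- unname(coef(fit))             # b[1] intercept, b[2] x1 slope, b[3] x2 slope
u   <- resid(fit)                    # sample analogue of the error term

cov_terms <- c(
  constant = cov(x1, rep(b[1], n)),  # fact 1: covariance with a constant
  own      = cov(x1, b[2] * x1),     # fact 2: b[2] * var(x1)
  other    = cov(x1, b[3] * x2),     # fact 3: b[3] * cov(x1, x2)
  error    = cov(x1, u)              # fact 4: zero up to rounding
)
print(cov_terms)

# The four pieces add up to Cov(x1, y)
all.equal(sum(cov_terms), cov(x1, y))
```

---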

`$$\beta^*_1 = \frac{\beta_1 \text{Var} (x_1) + \beta_2 \text{Cov}(x_1, x_2)}{\text{Var}(x_1)}$$`

`$$\beta^*_1 = \frac{\beta_1 \text{Var} (x_1)}{\text{Var}(x_1)} + \frac{\beta_2 \text{Cov}(x_1, x_2)}{\text{Var}(x_1)}$$`

`$$\beta^*_1 = \beta_1 + \beta_2 \frac{\text{Cov}(x_1, x_2)}{\text{Var}(x_1)}$$`

---

This term:

`$$\beta_2 \frac{\text{Cov}(x_1, x_2)}{\text{Var}(x_1)}$$`

is the coefficient on `$x_2$` in the larger model times the slope from a regression of `$x_2$` on `$x_1$`.

---
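
The same relationship holds exactly for OLS estimates computed on one sample. The sketch below uses simulated data, so the variable names, seed, and coefficients are assumptions for illustration only.

``` r
set.seed(2026)                       # arbitrary seed
n  <- 1000
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n)            # x2 depends on x1
y  <- 1 + 2 * x1 + 1.5 * x2 + rnorm(n)

long  <- lm(y ~ x1 + x2)             # larger linear predictor
short <- lm(y ~ x1)                  # smaller linear predictor
aux   <- lm(x2 ~ x1)                 # slope equals Cov(x1, x2) / Var(x1)

# Coefficient on x2 times the slope from regressing x2 on x1
bias_term <- unname(coef(long)["x2"] * coef(aux)["x1"])

c(short_slope    = unname(coef(short)["x1"]),
  long_plus_bias = unname(coef(long)["x1"]) + bias_term)
```

The two numbers printed at the end agree up to floating-point error, whatever seed is used.

---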

If we treat the larger linear predictor, with both `$x$` variables, as in some sense the "true" model, then the difference between the two coefficients is a bias: the smaller model's coefficient on `$x_1$` is off by the amount shown on the previous slide.

People call this *omitted variable bias.*

---

``` r
head(turnout)
```

```
## # A tibble: 6 × 4
##    Year Turnout Temperature   GDP
##   <dbl>   <dbl>       <dbl> <dbl>
## 1  1828    0.58          NA   881
## 2  1832    0.55          NA  1155
## 3  1836    0.58          NA  1896
## 4  1840    0.8           NA  1672
## 5  1844    0.79          NA  1629
## 6  1848    0.73          NA  2139
```

---

``` r
lm(Turnout ~ GDP, data=turnout)
```

```
## 
## Call:
## lm(formula = Turnout ~ GDP, data = turnout)
## 
## Coefficients:
## (Intercept)          GDP  
##   6.499e-01   -4.975e-09
```

---

``` r
summary(lm(Turnout ~ GDP, data=turnout))
```

```
## 
## Call:
## lm(formula = Turnout ~ GDP, data = turnout)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.15949 -0.07948 -0.01742  0.08628  0.17013 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.499e-01  1.540e-02  42.202   <2e-16 ***
## GDP         -4.975e-09  2.177e-09  -2.285   0.0268 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09805 on 48 degrees of freedom
## Multiple R-squared:  0.09811,	Adjusted R-squared:  0.07932 
## F-statistic: 5.222 on 1 and 48 DF,  p-value: 0.02677
```

---

``` r
lm(Turnout ~ GDP + Temperature, data=turnout)
```

```
## 
## Call:
## lm(formula = Turnout ~ GDP + Temperature, data = turnout)
## 
## Coefficients:
## (Intercept)          GDP  Temperature  
##   5.751e-01   -3.525e-09    9.409e-04
```

---

``` r
summary(lm(Turnout ~ GDP + Temperature, data=turnout))
```

```
## 
## Call:
## lm(formula = Turnout ~ GDP + Temperature, data = turnout)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.13323 -0.05916 -0.01721  0.02748  0.19650 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  5.751e-01  5.458e-01   1.054    0.299
## GDP         -3.525e-09  3.006e-09  -1.173    0.249
## Temperature  9.409e-04  1.066e-02   0.088    0.930
## 
## Residual standard error: 0.09201 on 35 degrees of freedom
##   (12 observations deleted due to missingness)
## Multiple R-squared:  0.06614,	Adjusted R-squared:  0.01278 
## F-statistic: 1.239 on 2 and 35 DF,  p-value: 0.3019
```

---

``` r
# GDP coefficient without Temperature minus GDP coefficient with Temperature
-4.975e-09 - (-3.525e-09)
```

```
## [1] -1.45e-09
```

---
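
The two fits above use different samples: 50 elections for the model with GDP alone, but only 38 once Temperature is included (12 rows are dropped for missingness). The raw difference in coefficients therefore mixes omitted variable bias with a change of sample. A minimal sketch of a like-for-like comparison follows; `ovb_compare()` is a hypothetical helper written for this slide, and it is demonstrated on a simulated data frame rather than on `turnout`.

``` r
# Hypothetical helper: refit both models on rows where all three variables
# are observed, then compare the change in slope to the implied bias term.
ovb_compare <- function(data, y, x, z) {
  d     <- data[complete.cases(data[, c(y, x, z)]), ]
  short <- lm(reformulate(x, response = y), data = d)
  long  <- lm(reformulate(c(x, z), response = y), data = d)
  aux   <- lm(reformulate(x, response = z), data = d)
  c(n            = nrow(d),
    short        = unname(coef(short)[x]),
    long         = unname(coef(long)[x]),
    difference   = unname(coef(short)[x] - coef(long)[x]),
    implied_bias = unname(coef(long)[z] * coef(aux)[x]))
}

# Simulated stand-in data with missing values in z (assumed, not the real data)
set.seed(1)
sim   <- data.frame(x = rnorm(60))
sim$z <- 0.5 * sim$x + rnorm(60)
sim$y <- 1 + sim$x + 2 * sim$z + rnorm(60)
sim$z[1:12] <- NA

res <- ovb_compare(sim, "y", "x", "z")
print(res)
```

On the course data the call would be `ovb_compare(turnout, "Turnout", "GDP", "Temperature")`. On a common sample, `difference` and `implied_bias` coincide.

---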

One question we might reasonably have is whether the relationship between GDP and turnout is actually linear.

---

``` r
turnoutlm <- lm(Turnout ~ GDP, data=turnout)
```

---

``` r
plot(turnoutlm, which = 1)
```

---

``` r
turnoutlm.nonlinear <- lm(Turnout ~ GDP + I(GDP^2) + I(GDP^3), data=turnout)
```

---

``` r
summary(turnoutlm.nonlinear)
```

```
## 
## Call:
## lm(formula = Turnout ~ GDP + I(GDP^2) + I(GDP^3), data = turnout)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.175003 -0.049561 -0.004461  0.061803  0.151498 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.689e-01  1.531e-02  43.685  < 2e-16 ***
## GDP         -4.591e-08  1.371e-08  -3.350  0.00162 ** 
## I(GDP^2)     3.769e-15  1.440e-15   2.616  0.01198 *  
## I(GDP^3)    -7.973e-23  3.613e-23  -2.207  0.03237 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09022 on 46 degrees of freedom
## Multiple R-squared:  0.2682,	Adjusted R-squared:  0.2204 
## F-statistic: 5.618 on 3 and 46 DF,  p-value: 0.002286
```

---
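
One way to ask whether the squared and cubed terms jointly improve on the straight line is an F-test comparing the two nested models. The sketch below runs on simulated data; the data-generating values are assumptions chosen only to produce some curvature.

``` r
set.seed(42)                         # arbitrary seed
x <- runif(200, 0, 10)
y <- 0.6 - 0.05 * x + 0.012 * x^2 - 0.0008 * x^3 + rnorm(200, sd = 0.02)

linear <- lm(y ~ x)
cubic  <- lm(y ~ x + I(x^2) + I(x^3))

# F-test of the null that the coefficients on x^2 and x^3 are both zero
cmp <- anova(linear, cubic)
print(cmp)
```

For the models on these slides the analogous call is `anova(turnoutlm, turnoutlm.nonlinear)`; both are fit to the same 50 observations, so the comparison is valid.

---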

``` r
library(ggplot2)
library(ggExtra)

beta0 <- turnoutlm.nonlinear$coefficients[1]  # Intercept
beta1 <- turnoutlm.nonlinear$coefficients[2] # GDP
beta2 <- turnoutlm.nonlinear$coefficients[3]  # GDP^2
beta3 <- turnoutlm.nonlinear$coefficients[4] # GDP^3

gdp_seq <- seq(min(turnout$GDP), max(turnout$GDP), length.out = 1000)

tempdensity <- data.frame(simgdp = runif(1000,min(turnout$GDP),1e+07), simturnout = sample(turnout$Turnout, 1000, replace=TRUE))

# Calculate predicted turnout using the cubic polynomial
turnout_pred <- beta0 + 
                beta1 * gdp_seq + 
                beta2 * gdp_seq^2 + 
                beta3 * gdp_seq^3

# Create data frame for plotting
plot_data <- data.frame(GDP = gdp_seq, Turnout = turnout_pred)

# Create the plot
turnoutplot <- ggplot(plot_data, aes(x = GDP, y = Turnout)) +
  geom_line(linewidth = 1.2, color = "steelblue") +
  labs(
    title = "Estimated Nonlinear CEF: Turnout ~ GDP",
    x = "GDP",
    y = "Predicted Turnout"
  ) +
  geom_point(data=tempdensity, aes(x=simgdp, y=simturnout), alpha = 0) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 9),
    panel.grid.minor = element_blank()
  ) +
  scale_y_continuous(limits = c(min(turnout_pred) - 0.1, max(turnout_pred) + 0.1),
                     breaks = seq(0, 1, by = 0.1))
```

---
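
The fitted curve can also be obtained with `predict()` instead of multiplying coefficients by hand. This is a sketch on simulated data (the names `d`, `fit`, and `grid` are assumptions) showing that the two routes give the same values.

``` r
set.seed(3)                          # arbitrary seed
d   <- data.frame(x = runif(100, 0, 10))
d$y <- 0.6 - 0.05 * d$x + 0.004 * d$x^2 + rnorm(100, sd = 0.05)

fit  <- lm(y ~ x + I(x^2) + I(x^3), data = d)
grid <- data.frame(x = seq(min(d$x), max(d$x), length.out = 50))

# Route 1: let predict() apply the model formula to the new x values
by_predict <- unname(predict(fit, newdata = grid))

# Route 2: build the cubic polynomial from the coefficients by hand
b <- unname(coef(fit))
by_hand <- b[1] + b[2] * grid$x + b[3] * grid$x^2 + b[4] * grid$x^3

all.equal(by_predict, by_hand)
```

With the course model this would be `predict(turnoutlm.nonlinear, newdata = data.frame(GDP = gdp_seq))`. Adding `interval = "confidence"` also returns pointwise confidence bounds.

---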

![OVB 1](Images/Haber1.png)

---

![OVB 2](Images/Haber2.png)

---

![OVB 3](Images/Haber3.png)

---

![OVB 4](Images/Haber4.png)