Problem Set 3

Problem 1

1a. In a multiple regression model, \(Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \epsilon\), interpret \(\beta_1\) in your own words.

1b. Consider the slides’ example of terrorism incidents predicted by Trump vote share and 2012 margin. If the estimated equation is: \[\text{Terrorism} = 10 - 0.2 \times \text{TrumpShare} + 0.1 \times \text{D12Margin}\] interpret what happens to predicted terrorism incidents when:

- TrumpShare increases by 5 percentage points, holding D12Margin constant.
- D12Margin decreases by 3 percentage points, holding TrumpShare constant.
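As a sanity check on the arithmetic, the estimated equation from 1b can be coded directly. The function name and the baseline input values (50, 20) below are illustrative, not part of the assignment:

```r
# The fitted equation from 1b, written as a function (illustrative helper)
predict_terrorism <- function(trump_share, d12_margin) {
  10 - 0.2 * trump_share + 0.1 * d12_margin
}

# A 5-point increase in TrumpShare, D12Margin held constant
predict_terrorism(55, 20) - predict_terrorism(50, 20)  # -0.2 * 5 = -1
# A 3-point decrease in D12Margin, TrumpShare held constant
predict_terrorism(50, 17) - predict_terrorism(50, 20)  # 0.1 * (-3) = -0.3
```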

1c. The slides show that in multivariate regression, \(\beta_1\) represents the expected change in Y when \(X_1\) increases by 1 unit, with all other variables held constant. Why is the “all other variables held constant” condition crucial for interpreting \(\beta_1\) as a partial effect?


Problem 2

The Conditional Expectation Function (CEF) has a key property: it is the best predictor of Y given X in the mean-squared error sense.

Let \(m(X)\) be any function of X used to predict Y. The mean-squared error (MSE) is defined as: \[MSE(m) = E[(Y - m(X))^2]\]

2a. Show that for any predictor \(m(X)\), the MSE can be decomposed as: \[E[(Y - m(X))^2] = E[(Y - E[Y|X])^2] + E[(E[Y|X] - m(X))^2]\]

Hint: Start with \(Y - m(X) = (Y - E[Y|X]) + (E[Y|X] - m(X))\), then expand the square.
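A further hint: expanding the square produces a cross term, and the decomposition follows once that term is shown to vanish by the law of iterated expectations:

\[
2\,E\big[(Y - E[Y|X])\,(E[Y|X] - m(X))\big]
= 2\,E\Big[(E[Y|X] - m(X))\,E\big[Y - E[Y|X] \,\big|\, X\big]\Big] = 0,
\]

since \(E[Y - E[Y|X] \mid X] = 0\).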

2b. Using the decomposition from 2a, explain why the CEF (\(E[Y|X]\)) minimizes MSE among all possible predictors \(m(X)\).

2c. Connect this proof to the lecture discussion about why we can’t improve the CEF by adding something that depends on X. What does this imply about the relationship between the CEF error (\(Y - E[Y|X]\)) and X?


Problem 3

Install and load the required data:

# Install once if needed
install.packages("poliscidata")
library(poliscidata)
library(tidyverse)

# Clean and prepare the data
states_data <- states %>%
  select(state, vep12_turnout, prcapinc, religiosity, over64) %>%
  # Drop rows with missing values in any model variable so every model
  # below is fit on the same sample
  filter(!is.na(vep12_turnout), !is.na(prcapinc),
         !is.na(religiosity), !is.na(over64)) %>%
  mutate(income_thousands = prcapinc / 1000)  # income in $1,000s

3a.

Run a bivariate regression predicting voter turnout (vep12_turnout) from income. Use income_thousands (per capita income in thousands of dollars) so the slope can be read as the change in turnout per $1,000 of income.

# Your code here
turnout_bivariate <- lm(vep12_turnout ~ income_thousands, data = states_data)
summary(turnout_bivariate)

Create a visualization that shows:

1. The raw data points
2. The BLP (the fitted regression line)
3. A LOESS curve to approximate the true CEF

Then compare the two curves. Does the relationship appear linear?
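One possible sketch of this plot, assuming `states_data` from the setup chunk above. The guarded toy data exists only so the snippet runs standalone; with the real data it is skipped:

```r
library(ggplot2)

# Toy fallback so the sketch runs standalone (remove when using the real data)
if (!exists("states_data")) {
  set.seed(1)
  states_data <- data.frame(income_thousands = runif(50, 30, 60))
  states_data$vep12_turnout <- 40 + 0.5 * states_data$income_thousands + rnorm(50, 0, 3)
}

turnout_plot <- ggplot(states_data, aes(x = income_thousands, y = vep12_turnout)) +
  geom_point() +                                               # 1. raw data
  geom_smooth(method = "lm", se = FALSE, color = "blue") +     # 2. BLP (regression line)
  geom_smooth(method = "loess", se = FALSE, color = "red") +   # 3. LOESS ~ CEF
  labs(x = "Per capita income ($1,000s)", y = "2012 VEP turnout (%)")

print(turnout_plot)
```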

Questions for 3a:

1. Interpret the slope coefficient from the bivariate regression.
2. Based on the visualization, does the BLP appear to be a good approximation of the CEF? Explain.
3. Calculate and interpret R-squared.

3b.

Now run a multivariate regression predicting voter turnout from income (income_thousands), religiosity (religiosity), and age distribution (over64).

# Your code here
turnout_multivariate <- lm(vep12_turnout ~ income_thousands + religiosity + over64, 
                           data = states_data)
summary(turnout_multivariate)

Questions for 3b:

1. Interpret each coefficient in the multivariate model.
2. How does the coefficient for income change from the bivariate to the multivariate model? What might explain this change?
3. Calculate the predicted voter turnout for a state with: income = $50,000 (i.e., income_thousands = 50), religiosity = 50, over64 = 15%.
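For question 3, `predict()` with a one-row data frame avoids hand arithmetic. This sketch assumes `turnout_multivariate` from above; the guarded toy fit exists only so the snippet runs standalone:

```r
# Toy stand-in model so the sketch runs standalone (remove for the real data)
if (!exists("turnout_multivariate")) {
  set.seed(1)
  d <- data.frame(income_thousands = runif(50, 30, 60),
                  religiosity = runif(50, 0, 100),
                  over64 = runif(50, 10, 20))
  d$vep12_turnout <- 40 + 0.4 * d$income_thousands + rnorm(50)
  turnout_multivariate <- lm(vep12_turnout ~ income_thousands + religiosity + over64,
                             data = d)
}

# Note the units: $50,000 per capita income corresponds to income_thousands = 50
new_state <- data.frame(income_thousands = 50, religiosity = 50, over64 = 15)
predict(turnout_multivariate, newdata = new_state)
```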

3c.

Compare the two models:

# Model comparison
library(modelsummary)
models <- list("Bivariate" = turnout_bivariate, 
               "Multivariate" = turnout_multivariate)
modelsummary(models, stars = TRUE, output = "markdown")

# Calculate and compare R-squared
cat("Bivariate R-squared:", summary(turnout_bivariate)$r.squared, "\n")
cat("Multivariate R-squared:", summary(turnout_multivariate)$r.squared, "\n")

Questions for 3c:

1. Which model has better fit? Does adding variables substantially improve the model?


Problem 4

4a. Recall from the lectures that the BLP has two key properties: (1) \(E[e] = 0\) and (2) \(E[e \times X] = 0\). Verify these properties for your multivariate model from Problem 3:

# Extract residuals (avoid naming the object "residuals", which masks
# the base function) and the data actually used in the fit, so the
# vectors line up even if lm() dropped rows with missing values
resids <- residuals(turnout_multivariate)
mf <- model.frame(turnout_multivariate)

# Property 1: mean of residuals (should be ~ 0)
mean_residual <- mean(resids)
cat("Mean of residuals:", mean_residual, "\n")

# Property 2: correlation of residuals with each predictor
# (zero correlation is equivalent to E[eX] = 0 here, given Property 1)
cor_res_income <- cor(resids, mf$income_thousands)
cor_res_relig  <- cor(resids, mf$religiosity)
cor_res_age    <- cor(resids, mf$over64)

cat("Correlation with income:", cor_res_income, "\n")
cat("Correlation with religiosity:", cor_res_relig, "\n")
cat("Correlation with age:", cor_res_age, "\n")
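A note on what to expect: for OLS these properties hold in-sample by construction, up to floating-point error. A quick check on simulated data illustrates this; the same check applies to your fitted turnout model:

```r
# OLS residuals are exactly orthogonal to every regressor in-sample
# (up to floating-point error) -- illustrated here on simulated data
set.seed(42)
toy <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
toy$y <- 1 + 2 * toy$x1 - toy$x2 + rnorm(100)
fit <- lm(y ~ x1 + x2, data = toy)

X <- model.matrix(fit)
crossprod(X, residuals(fit))  # each entry ~ 0
```

The interesting question, then, is not whether these in-sample identities hold but what they do and do not tell you about the population CEF.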

Questions for 4:

1. Do the residuals from your model satisfy the BLP properties? What might it mean if they don’t?
2. Based on all your analyses, write a brief conclusion (3-4 sentences) about what affects voter turnout in U.S. states and how well linear regression captures these relationships.