PS 405: Linear Models — Final Summative Problem Set

Due:
Submit:
Instructions: This is an individual assignment. You may use course materials, textbooks, and R documentation, but you may not collaborate with others or use external code without citation.

Problem 1: Conceptual Foundations (15%)

In your own words, answer the following:

Explain the difference between using regression for description, prediction, and causal inference. Provide a political science example for each.
What does it mean for OLS to be the “Best Linear Unbiased Estimator” (BLUE)? Under what assumptions does this hold?
How can omitted variable bias threaten regressions? Give an example from a real or hypothetical study.
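As a concrete reference point for the omitted variable question, the following is a minimal simulated sketch in base R. All variable names and coefficient values are invented for illustration; none of them come from the course data.

```r
# Simulated illustration of omitted variable bias (all values are made up).
set.seed(405)
n  <- 10000
x1 <- rnorm(n)                         # regressor of interest
x2 <- 0.5 * x1 + rnorm(n)              # confounder, correlated with x1
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)   # true slope on x1 is 2

full  <- lm(y ~ x1 + x2)               # includes the confounder
short <- lm(y ~ x1)                    # omits the confounder

# In large samples the short-regression slope approaches
# 2 + 3 * 0.5 = 3.5 rather than the true value of 2.
b_full  <- unname(coef(full)["x1"])
b_short <- unname(coef(short)["x1"])
print(c(full = b_full, short = b_short))
```

The size of the bias is the product of the omitted variable's effect on the outcome and its slope on the included regressor, so either a zero effect or zero correlation removes it.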

Problem 2: OLS Proofs (15%)

Consider the simple linear regression model:

\[ Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \]

where \(\varepsilon_i \sim N(0, \sigma^2)\).

Explain in words what the Gauss-Markov theorem guarantees about the OLS estimator in this context.
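
For reference, under the model above the OLS estimators have the standard closed forms below, and the Gauss-Markov result can be written as a variance comparison. Here \(\tilde{\beta}_1\) is notation introduced for this note only: it stands for any other linear unbiased estimator of \(\beta_1\).

\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}, \qquad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} \]

\[ \operatorname{Var}(\hat{\beta}_1 \mid X) = \frac{\sigma^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \le \operatorname{Var}(\tilde{\beta}_1 \mid X) \]

The inequality needs only zero-mean, homoskedastic, uncorrelated errors; the normality stated above is not required for it.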

Problem 3: Multiple Regression & Interpretation (20%)

For the remaining problems, you may use data and a research question of your choice, or rely on the ones provided below. If you use your own, please briefly describe them.

Use the states dataset provided in the poliscidata package, which contains state-level political and demographic variables.

library(poliscidata)

data("states")

# Variables include:
# vep12_turnout: voter turnout rate in 2012
# uninsured_pct: % without health insurance
# college: % with college degree
# prcapinc: per capita income
# cig_tax: tax rate on cigarettes

Estimate a linear model where turnout (vep12_turnout) is the dependent variable and uninsured_pct, college, and cig_tax are independent variables.
Interpret the results in substantive terms. How would you explain the relationship between income and turnout to a non-technical audience?
Assess multicollinearity using Variance Inflation Factors (VIFs). Discuss whether it is a concern here.
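
One possible starting point for the estimation and VIF steps is sketched below. It assumes the car package is installed for vif(); the object names m1 and vifs are placeholders, and this is not the only acceptable approach.

```r
library(poliscidata)  # provides the states data
library(car)          # provides vif(); assumed to be installed

data("states")

# Turnout regressed on the three predictors named in the problem
m1 <- lm(vep12_turnout ~ uninsured_pct + college + cig_tax, data = states)
summary(m1)

# One VIF per predictor; values near 1 indicate little collinearity
vifs <- vif(m1)
print(vifs)
```

Common rules of thumb treat VIFs above roughly 5 or 10 as worth a closer look, but these cutoffs are conventions rather than formal tests.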

Problem 4: Diagnostics & Model Fit (20%)

Using the model from Problem 3:

Produce and interpret the following diagnostic plots:
- Residuals vs. Fitted
- Q-Q plot of residuals
- Scale-Location plot
Conduct a Breusch-Pagan test for heteroskedasticity. If heteroskedasticity is present, re-estimate the model's standard errors using robust (heteroskedasticity-consistent) estimators from sandwich and lmtest.
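
The diagnostic steps above could be sketched roughly as follows. The HC3 estimator is an assumed choice, since the problem does not specify a type, and the object names m1, bp, and robust are placeholders.

```r
library(poliscidata)
library(lmtest)    # bptest(), coeftest()
library(sandwich)  # vcovHC()

data("states")
m1 <- lm(vep12_turnout ~ uninsured_pct + college + cig_tax, data = states)

# Panels 1 to 3 of plot.lm: Residuals vs Fitted, Normal Q-Q, Scale-Location
par(mfrow = c(1, 3))
plot(m1, which = 1:3)
par(mfrow = c(1, 1))

# Breusch-Pagan test; the null hypothesis is constant error variance
bp <- bptest(m1)
print(bp)

# Same coefficients, heteroskedasticity-consistent standard errors
# (type = "HC3" is an assumption, not a requirement of the problem)
robust <- coeftest(m1, vcov. = vcovHC(m1, type = "HC3"))
print(robust)
```

Robust standard errors change the inference, not the point estimates, so the coefficient column should match the output of summary(m1).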

Problem 5: Interactions & Nonlinearity (15%)

Political scientists often argue that the effects of economic and educational inequalities on turnout aren’t linear.

Test whether nonlinear transformations of uninsured_pct and college improve your model from Problem 3.
Interpret any nonlinear transformations you add using:
- The coefficient table
- Relevant plots (using ggeffects or margins)
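
A minimal sketch of one way to approach this is shown below, assuming quadratic terms as the nonlinear transformation. Logs or splines would be equally legitimate choices, and the names m1, m_quad, cmp, and pred are placeholders.

```r
library(poliscidata)
library(ggeffects)  # ggpredict(); plotting its output also needs ggplot2

data("states")
m1 <- lm(vep12_turnout ~ uninsured_pct + college + cig_tax, data = states)

# Add squared terms for the two predictors of interest (an assumed form)
m_quad <- lm(vep12_turnout ~ uninsured_pct + I(uninsured_pct^2) +
               college + I(college^2) + cig_tax, data = states)

# Nested-model F-test: do the two squared terms jointly improve fit?
cmp <- anova(m1, m_quad)
print(cmp)
print(summary(m_quad)$coefficients)

# Predicted turnout across observed values of uninsured_pct,
# holding the other predictors at typical values
pred <- ggpredict(m_quad, terms = "uninsured_pct [all]")
plot(pred)
```

With a squared term in the model, the linear and quadratic coefficients must be read together, which is why a prediction plot is usually easier to interpret than the table alone.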

Problem 6: Communicating Results (15%)

Prepare a publication-quality regression table and figure.

Create a table (using modelsummary, stargazer, gt, or kableExtra) that displays:
- The model from Problem 3 (without nonlinearity)
- Any changes to the model from Problem 5 (potentially with nonlinearity)
- Robust standard errors in parentheses
Create a professional scatterplot with a regression line (using ggplot2) showing the relationship between uninsured_pct and college.
Write a short abstract (≤150 words) summarizing one key finding from your analysis, as if for a political science journal.
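
For the table and figure, a rough sketch using modelsummary and ggplot2 might look like the following. The quadratic specification, the HC3 estimator, the choice of which variable goes on which axis, and the object names are all assumptions made for illustration.

```r
library(poliscidata)
library(modelsummary)
library(ggplot2)

data("states")
m1 <- lm(vep12_turnout ~ uninsured_pct + college + cig_tax, data = states)
m_quad <- lm(vep12_turnout ~ uninsured_pct + I(uninsured_pct^2) +
               college + I(college^2) + cig_tax, data = states)

# vcov = "HC3" requests robust standard errors (computed via sandwich);
# modelsummary prints standard errors in parentheses by default
modelsummary(list("Linear" = m1, "Quadratic" = m_quad), vcov = "HC3")

# Scatterplot with a fitted OLS line; axis assignment is an assumption
fig <- ggplot(states, aes(x = college, y = uninsured_pct)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  labs(x = "Percent with college degree",
       y = "Percent without health insurance") +
  theme_minimal()
print(fig)
```

stargazer, gt, and kableExtra can produce equivalent tables; modelsummary is used here only because it accepts the robust variance estimator directly.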