Due Date: June 12, 2026

Problem 1: Democracy and Inequality with Multiple Imputation

Return to the inequality dataset from Lab 2. Fit the interaction model (Gini ~ Polity * log(GDP)) after performing multiple imputation with mice (5 imputations). Compare the results with the complete‑case analysis from Lab 2. Are there meaningful differences? What does this suggest about the missingness mechanism?
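As a reminder of the general workflow, here is a minimal sketch of impute, fit, and pool. It uses the nhanes example data that ships with mice rather than the Lab 2 inequality data, so the variable names (chl, bmi, age) and the seed are placeholders and not part of this assignment.

```r
library(mice)

# Toy data bundled with mice (NOT the Lab 2 inequality data)
data(nhanes)

# Complete-case fit: lm() drops rows with any missing value by default
fit_cc <- lm(chl ~ bmi * age, data = nhanes)

# Step 1: create 5 imputed datasets (seed chosen arbitrarily for reproducibility)
imp <- mice(nhanes, m = 5, seed = 27, printFlag = FALSE)

# Step 2: fit the same interaction model on each imputed dataset
fit_mi <- with(imp, lm(chl ~ bmi * age))

# Step 3: combine the 5 sets of estimates into one pooled result
pooled <- pool(fit_mi)

summary(fit_cc)$coefficients   # complete-case estimates and standard errors
summary(pooled)                # pooled estimates and standard errors
```

For Problem 1, the same three steps would be applied to the inequality data with the Gini ~ Polity * log(GDP) specification.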

Problem 2: Simulating Missingness Mechanisms in the AJR Data

The AJR replication data are fully observed. To understand the consequences of different missingness mechanisms and the performance of multiple imputation, we will artificially introduce missingness in two ways:

MCAR (Missing Completely at Random): Values are deleted entirely at random, independent of any observed or unobserved variables.
MAR (Missing at Random): The probability of a value being missing depends only on fully observed variables (e.g., latitude), not on the missing values themselves.

We will then compare three estimation strategies:

- Full data (the ground truth, before any deletion).
- Listwise deletion (complete‑case analysis on the data with missingness).
- Multiple imputation (using mice on the data with missingness).

The quantity of interest is the IV estimate of the effect of institutions (risk) on log GDP (loggdp), with institutions instrumented by settler mortality (logmort0).

Data Preparation

library(AER)
library(mice)
library(tidyverse)
library(broom)

# Load AJR data
ajr <- read.csv("../data/ajrdata.csv")

# Create analysis variables
ajr_clean <- ajr %>%
  mutate(
    institutions = risk,        # institutions measure (risk)
    logmort = logmort0,         # instrument: settler mortality (logmort0)
    lat_abst = abs(latitude)    # absolute latitude
  ) %>%
  dplyr::select(loggdp, institutions, logmort, lat_abst, neoeuro, asia, africa, other)

# Full data IV (ground truth)
iv_full <- ivreg(loggdp ~ institutions + lat_abst + neoeuro + asia + africa |
                 logmort + lat_abst + neoeuro + asia + africa,
                 data = ajr_clean)

Part A: MCAR Missingness

We delete 20% of the values in institutions and, independently, 20% of the values in logmort, completely at random.

set.seed(27)
ajr_mcar <- ajr_clean

# Introduce 20% missingness in institutions and logmort (MCAR)
n <- nrow(ajr_mcar)
ajr_mcar$institutions[sample(n, size = round(0.2 * n))] <- NA
ajr_mcar$logmort[sample(n, size = round(0.2 * n))] <- NA

# Listwise deletion
ajr_mcar_cc <- na.omit(ajr_mcar)
iv_mcar_cc <- ivreg(loggdp ~ institutions + lat_abst + neoeuro + asia + africa |
                    logmort + lat_abst + neoeuro + asia + africa,
                    data = ajr_mcar_cc)

# Multiple imputation
# YOUR JOB IS TO ADD CODE HERE THAT CARRIES OUT MULTIPLE IMPUTATION ON ajr_mcar TO RECOVER FROM THE MISSINGNESS THAT WE HAVE ADDED
# Store the pooled IV result as iv_mcar_mi; the comparison table at the end uses that name.

What happens when we use multiple imputation in this simulation? How close is it to the original results, both in terms of the estimate and the standard error?

Part B: MAR Missingness

Now we make missingness depend on lat_abst (absolute latitude). Specifically, observations with higher latitude are more likely to have missing values in institutions and logmort. This is a common MAR scenario where missingness is correlated with an observed covariate.

set.seed(27)
ajr_mar <- ajr_clean

n <- nrow(ajr_mar)

# Probability of missingness increases with latitude.
# We use a logistic probability: p(missing) = plogis(scale(lat_abst)).
# scale() centers lat_abst at zero, so p(missing) is 0.5 at the mean latitude
# and each variable loses roughly half of its values on average
# (considerably more than the 20% deleted in Part A).
# Higher latitude => higher chance of being missing.
p_miss <- plogis(scale(ajr_mar$lat_abst))
miss_institutions <- rbinom(n, 1, p_miss)
miss_logmort    <- rbinom(n, 1, p_miss)

ajr_mar$institutions[miss_institutions == 1] <- NA
ajr_mar$logmort[miss_logmort == 1] <- NA

# Listwise deletion
ajr_mar_cc <- na.omit(ajr_mar)
iv_mar_cc <- ivreg(loggdp ~ institutions + lat_abst + neoeuro + asia + africa |
                   logmort + lat_abst + neoeuro + asia + africa,
                   data = ajr_mar_cc)

# Multiple imputation
# YOUR JOB IS TO ADD CODE HERE THAT CARRIES OUT MULTIPLE IMPUTATION ON ajr_mar TO RECOVER FROM THE MISSINGNESS THAT WE HAVE ADDED
# Store the pooled IV result as iv_mar_mi; the comparison table at the end uses that name.

What happens when we use multiple imputation in this simulation? How close is it to the original results, both in terms of the estimate and the standard error?
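Before interpreting the Part B results, it can help to check how much data this kind of mechanism removes. The sketch below reproduces the plogis(scale(...)) construction on simulated values. The variable lat_toy and the sample size are hypothetical stand-ins, not the AJR data.

```r
set.seed(27)

# Hypothetical stand-in for lat_abst: 200 simulated absolute latitudes
n_toy   <- 200
lat_toy <- runif(n_toy, min = 0, max = 0.7)

# Same construction as in Part B: standardize, then map to (0, 1)
p_toy <- plogis(as.numeric(scale(lat_toy)))

# Two independent missingness indicators, one per variable
miss_a <- rbinom(n_toy, 1, p_toy) == 1
miss_b <- rbinom(n_toy, 1, p_toy) == 1

missing_summary <- c(
  mean_p         = mean(p_toy),              # average missingness probability
  share_miss_a   = mean(miss_a),             # realized share missing, variable A
  share_miss_b   = mean(miss_b),             # realized share missing, variable B
  share_complete = mean(!miss_a & !miss_b)   # rows listwise deletion would keep
)
round(missing_summary, 2)
```

On the lab data, the analogous checks would be colMeans(is.na(ajr_mar)) and nrow(ajr_mar_cc) / nrow(ajr_mar).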

Comparison Table

We compile the coefficient on institutions from each analysis.

# iv_mcar_mi and iv_mar_mi must be the pooled (mipo) objects returned by pool().
# In recent versions of mice, summary() of a pooled object is a data frame with
# one row per term, so we pick out the institutions row via the term column.
mi_row <- function(pooled) {
  s <- summary(pooled)
  s[s$term == "institutions", ]
}

results <- tibble(
  Scenario = c("Full Data", 
               "MCAR: Listwise", "MCAR: MI", 
               "MAR: Listwise", "MAR: MI"),
  Estimate = c(coef(iv_full)[["institutions"]],
               coef(iv_mcar_cc)[["institutions"]],
               mi_row(iv_mcar_mi)$estimate,
               coef(iv_mar_cc)[["institutions"]],
               mi_row(iv_mar_mi)$estimate),
  SE = c(sqrt(vcov(iv_full)["institutions", "institutions"]),
         sqrt(vcov(iv_mcar_cc)["institutions", "institutions"]),
         mi_row(iv_mcar_mi)$std.error,
         sqrt(vcov(iv_mar_cc)["institutions", "institutions"]),
         mi_row(iv_mar_mi)$std.error)
)

results

What do we conclude about the value of imputation? What happens to the standard errors in each simulation? If you see anything notable in the standard errors, diagnose why it happens and explain its implications for multiple imputation and causal inference.
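When thinking about the standard errors, it may help to recall how pooled standard errors are built. The sketch below applies Rubin's rules by hand. The helper pool_rubin is hypothetical and the inputs are made-up numbers, not AJR results. mice's pool() uses the same variance decomposition, together with a degrees-of-freedom adjustment that is omitted here.

```r
# Hypothetical helper: pool m point estimates and their squared standard
# errors with Rubin's rules.
pool_rubin <- function(estimates, variances) {
  m     <- length(estimates)
  q_bar <- mean(estimates)            # pooled point estimate
  w_bar <- mean(variances)            # within-imputation variance
  b     <- var(estimates)             # between-imputation variance
  total <- w_bar + (1 + 1 / m) * b    # total variance
  c(estimate = q_bar, std.error = sqrt(total),
    within = w_bar, between = b)
}

# Illustrative inputs for m = 5 imputations (not from the AJR data)
est <- c(0.90, 1.10, 1.00, 0.95, 1.05)
se  <- c(0.20, 0.22, 0.21, 0.20, 0.22)

pooled <- pool_rubin(est, se^2)
pooled
```

The pooled standard error is larger than the average within-imputation standard error whenever the estimates differ across imputations. A large between-imputation component signals that the imputed values, rather than the observed data, are driving much of the uncertainty.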

Political Science 406 Lab 9: Missing Data

2026-04-21