Due Date: January 23, 2026
Submission: https://canvas.northwestern.edu/courses/245562/assignments/1676715
1. Define, in your own words, the conditional expectation function and the best linear predictor. How are these two ideas related, and in what ways are they different?
2. Consider the multivariate BLP:
\[m(\mathbb{X}) = \beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} + \cdots + \beta_{k} X_{k}\]
Explain what each of the symbols used in the expression above would usually mean.
3. Explain the regularity conditions for the BLP to exist, and for each, give an example of a situation in which it would fail.
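A small simulation may help make the CEF/BLP distinction concrete before you write your own definitions. The sketch below uses made-up data (not the QoG data) in which the true CEF is quadratic; the variable names and the data-generating process are assumptions chosen only for illustration.

```r
# Hypothetical simulation: Y depends on X nonlinearly.
set.seed(405)
n <- 5000
x <- sample(1:5, n, replace = TRUE)   # discrete X, so the CEF is easy to estimate
y <- x^2 + rnorm(n)                   # assumed true CEF: E[Y | X = x] = x^2

# Sample analogue of the CEF: the mean of Y within each value of X
cef_hat <- tapply(y, x, mean)

# Sample analogue of the BLP: the least-squares line
blp_fit <- lm(y ~ x)
blp_hat <- predict(blp_fit, newdata = data.frame(x = 1:5))

# Side-by-side comparison at each value of X
round(cbind(x = 1:5, cef = cef_hat, blp = blp_hat), 2)

# In-sample MSE of each predictor; the within-X means cannot do worse than the line
mse_cef <- mean((y - cef_hat[as.character(x)])^2)
mse_blp <- mean(residuals(blp_fit)^2)
c(cef = mse_cef, blp = mse_blp)
```

The line is forced to be straight, so it overshoots the conditional means at some values of X and undershoots at others, while the within-X means track the curve.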
4. Let’s consider the widely studied relationship between wealth and democracy. This problem will guide you through an analysis using the Quality of Government (QoG) dataset, helping you contrast the Conditional Expectation Function (CEF) with the Best Linear Predictor (BLP).
Data Loading: Run the following code to load the data. If you encounter issues with the rqog package, use the alternative CSV file provided.
# Option 1: Using the rqog package (preferred)
# Install rqog only if it is not already available
if (!requireNamespace("rqog", quietly = TRUE)) devtools::install_github("ropengov/rqog")
library(rqog)
qogts <- read_qog(which_data = "standard", data_type = "time-series")
# Option 2: Alternative if rqog doesn't work (uncomment and run)
# qogts <- read.csv("https://github.com/jnseawright/ps405/raw/refs/heads/main/Data/qog_sample.csv")
# Clean the data for analysis
library(dplyr)
qog_clean <- qogts %>%
  select(wdi_gdpcappppcon2017, vdem_libdem) %>%
  filter(!is.na(wdi_gdpcappppcon2017), !is.na(vdem_libdem)) %>%
  rename(gdp_pc = wdi_gdpcappppcon2017, democracy = vdem_libdem)
4a.
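Create a visualization of the relationship between wealth (wdi_gdpcappppcon2017, or gdp_pc in the cleaned data) and democracy (vdem_libdem, or democracy in the cleaned data). Your plot should include the raw data points, a LOESS smooth (an approximation of the CEF), and a linear fit (the BLP).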
# Your code for 4a here
library(ggplot2)
# Create the plot with both LOESS (CEF approximation) and linear (BLP) fits.
# Colors are mapped inside aes() only, so the legend and scale_color_manual() apply;
# linewidth replaces size for lines in ggplot2 >= 3.4.0.
dem_wealth_plot <- ggplot(qog_clean, aes(x = gdp_pc, y = democracy)) +
  geom_point(alpha = 0.3, color = "gray50") + # Raw data
  geom_smooth(aes(color = "LOESS (CEF approx)"), method = "loess",
              se = TRUE, linewidth = 1.2) +
  geom_smooth(aes(color = "Linear (BLP)"), method = "lm",
              se = TRUE, linewidth = 1.2) +
  scale_color_manual(values = c("LOESS (CEF approx)" = "blue",
                                "Linear (BLP)" = "red")) +
  labs(title = "Wealth and Democracy: CEF vs. BLP",
       x = "GDP per capita, PPP (constant 2017 international $)",
       y = "Liberal Democracy Score (V-Dem)",
       color = "Fit Type") +
  theme_minimal() +
  theme(legend.position = "bottom")
# Display the plot
dem_wealth_plot
Questions for 4a:
1. Describe what each curve (LOESS and linear) suggests about the relationship between wealth and democracy.
2. Which fit seems more appropriate for these data and why?
3. Based on the LOESS curve, does the relationship appear to be linear throughout the range of GDP values?
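To put a number on where the two curves disagree, one option is to fit both smoothers directly and look at the gap between their fitted values. The sketch below does this on simulated data with an assumed concave relationship; the data and all object names are hypothetical, and `loess()` is called with its defaults.

```r
# Hypothetical data with a concave (diminishing-returns) relationship
set.seed(1)
x <- runif(2000, 1000, 60000)
y <- 0.1 * log(x) + rnorm(2000, sd = 0.05)

loess_fit <- loess(y ~ x)   # default span = 0.75, degree = 2
lm_fit <- lm(y ~ x)

# Evaluate both fits on a common grid and take the difference
grid <- data.frame(x = seq(min(x), max(x), length.out = 50))
gap <- predict(loess_fit, newdata = grid) - predict(lm_fit, newdata = grid)

# Gaps that keep the same sign over long stretches point to curvature the line misses
summary(gap)
```

With the real data, the same comparison can be run by replacing `x` and `y` with `gdp_pc` and `democracy` from `qog_clean`.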
4b.
Fit the empirical approximation of the Best Linear Predictor connecting democracy and wealth. Report and interpret the coefficients.
# Fit the linear model (BLP)
blp_model <- lm(democracy ~ gdp_pc, data = qog_clean)
# Display model summary
summary(blp_model)
# Alternative: Using modelsummary for nicer output
library(modelsummary)
modelsummary(blp_model, stars = TRUE, output = "markdown")
Questions for 4b:
1. Interpret the intercept and slope coefficients in substantive terms.
2. What is the predicted democracy score for a country with $20,000 GDP per capita? Show your calculation.
3. Calculate and interpret R-squared. What does it tell us about this BLP?
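The mechanics of a by-hand prediction and a by-hand R-squared can be sketched on a toy model before applying them to `blp_model`. Everything below is simulated; `toy`, `toy_model`, and the coefficients used to generate the data are assumptions, not estimates from the QoG data.

```r
# Hypothetical data: a linear relationship plus noise
set.seed(2)
toy <- data.frame(gdp_pc = runif(200, 1000, 60000))
toy$democracy <- 0.2 + 0.00001 * toy$gdp_pc + rnorm(200, sd = 0.1)
toy_model <- lm(democracy ~ gdp_pc, data = toy)

# Prediction at gdp_pc = 20000: intercept + slope * 20000
b <- coef(toy_model)
by_hand <- unname(b["(Intercept)"] + b["gdp_pc"] * 20000)
by_predict <- unname(predict(toy_model, newdata = data.frame(gdp_pc = 20000)))

# R-squared as 1 - (residual sum of squares / total sum of squares)
r2_by_hand <- 1 - sum(residuals(toy_model)^2) /
  sum((toy$democracy - mean(toy$democracy))^2)

c(by_hand = by_hand, by_predict = by_predict)
c(r2_by_hand = r2_by_hand, r2_summary = summary(toy_model)$r.squared)
```

Both pairs should agree, which is a quick check that a hand calculation matches what `predict()` and `summary()` report.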
4c.
Now let’s compare the BLP to a simple approximation of the CEF using grouped means:
# Create wealth groups
qog_clean <- qog_clean %>%
  mutate(wealth_group = case_when(
    gdp_pc < 10000 ~ "Low (<$10K)",
    gdp_pc >= 10000 & gdp_pc <= 30000 ~ "Medium ($10K-$30K)",
    gdp_pc > 30000 ~ "High (>$30K)"
  ))
# Calculate group means (simple CEF approximation)
group_means <- qog_clean %>%
  group_by(wealth_group) %>%
  summarize(
    mean_democracy = mean(democracy, na.rm = TRUE),
    mean_gdp = mean(gdp_pc, na.rm = TRUE),
    n = n()
  )
# Display group means
group_means
# Create comparison plot
library(ggplot2)
# Generate predictions from BLP for plotting
blp_predictions <- data.frame(
  gdp_pc = seq(min(qog_clean$gdp_pc, na.rm = TRUE),
               max(qog_clean$gdp_pc, na.rm = TRUE),
               length.out = 100)
)
blp_predictions$democracy_pred <- predict(blp_model, newdata = blp_predictions)
# Create the comparison visualization
comparison_plot <- ggplot(qog_clean, aes(x = gdp_pc, y = democracy)) +
  geom_point(alpha = 0.2, color = "gray50") +
  # BLP line (linewidth replaces size for lines in ggplot2 >= 3.4.0)
  geom_line(data = blp_predictions,
            aes(x = gdp_pc, y = democracy_pred, color = "BLP"),
            linewidth = 1.5) +
  # Group means (simple CEF approximation)
  geom_point(data = group_means,
             aes(x = mean_gdp, y = mean_democracy, color = "Group Means (CEF approx)"),
             size = 4, shape = 17) +
  # Vertical lines at group boundaries
  geom_vline(xintercept = c(10000, 30000), linetype = "dashed", alpha = 0.5) +
  scale_color_manual(values = c("BLP" = "red",
                                "Group Means (CEF approx)" = "darkgreen")) +
  labs(title = "Comparing BLP to Grouped Means (Simple CEF)",
       x = "GDP per capita, PPP (constant 2017 international $)",
       y = "Liberal Democracy Score",
       color = "Estimate Type") +
  theme_minimal() +
  theme(legend.position = "bottom")
comparison_plot
Questions for 4c:
1. How well does the BLP approximate the grouped means (our simple CEF approximation)?
2. In which wealth range does the BLP fit best? Where does it fit worst?
3. Discuss: Under what conditions might the BLP be a poor approximation of the true CEF for these data?
4. Calculate the mean squared error (MSE) for both the BLP and the grouped means approach (treating group means as predictions for all observations in that group). Which has lower MSE?
# Calculate MSE for BLP
blp_mse <- mean(residuals(blp_model)^2)
# Calculate MSE for grouped means approach
qog_with_group_preds <- qog_clean %>%
  left_join(select(group_means, wealth_group, mean_democracy), by = "wealth_group") %>%
  mutate(group_residual = democracy - mean_democracy)
group_mse <- mean(qog_with_group_preds$group_residual^2, na.rm = TRUE)
cat("BLP MSE:", round(blp_mse, 4), "\n")
cat("Grouped Means MSE:", round(group_mse, 4), "\n")
cat("Difference (BLP - Grouped):", round(blp_mse - group_mse, 4), "\n")
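One way to think about the grouped-means MSE is that predicting each observation with its group mean is the same as regressing the outcome on group indicators. The sketch below checks this on simulated data; the `toy` data frame, its data-generating process, and the shortened group labels are assumptions for illustration only.

```r
# Hypothetical data with a concave relationship
set.seed(3)
toy <- data.frame(gdp_pc = runif(300, 1000, 60000))
toy$democracy <- 0.1 * log(toy$gdp_pc) - 0.5 + rnorm(300, sd = 0.1)

# Same cut points as the wealth groups in the assignment code
toy$wealth_group <- ifelse(toy$gdp_pc < 10000, "Low",
                           ifelse(toy$gdp_pc <= 30000, "Medium", "High"))

# MSE when each observation is predicted by its group mean
group_pred <- ave(toy$democracy, toy$wealth_group)   # ave() defaults to the mean
mse_group <- mean((toy$democracy - group_pred)^2)

# MSE from a regression on group indicators (should match mse_group)
dummy_fit <- lm(democracy ~ wealth_group, data = toy)
mse_dummy <- mean(residuals(dummy_fit)^2)

# MSE from the straight-line fit, for comparison
mse_line <- mean(residuals(lm(democracy ~ gdp_pc, data = toy))^2)

c(group = mse_group, dummy = mse_dummy, line = mse_line)
```

Seen this way, the comparison is between two linear regressions with different regressors: one uses a single slope in GDP, the other uses three group-specific intercepts.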
4d.
Modernization theory in political science suggests that democracy increases with wealth but at a decreasing rate (diminishing returns).
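As a rough guide to how each specification encodes diminishing returns, the marginal effect of wealth can be written out for the two non-linear forms. Here \(x\) stands for GDP per capita, and the \(\beta\) symbols are generic coefficients for each model, not estimates.

\[m_{\log}(x) = \beta_{0} + \beta_{1}\log(x) \quad\Rightarrow\quad \frac{\partial m_{\log}}{\partial x} = \frac{\beta_{1}}{x}\]

\[m_{\text{quad}}(x) = \beta_{0} + \beta_{1}x + \beta_{2}x^{2} \quad\Rightarrow\quad \frac{\partial m_{\text{quad}}}{\partial x} = \beta_{1} + 2\beta_{2}x\]

With \(\beta_{1} > 0\), the log specification has a marginal effect that shrinks toward zero as \(x\) grows but never changes sign. With \(\beta_{1} > 0\) and \(\beta_{2} < 0\), the quadratic specification also has a shrinking marginal effect, but it turns negative beyond \(x^{*} = -\beta_{1}/(2\beta_{2})\). The linear BLP has a constant marginal effect and so cannot represent diminishing returns at all.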
# Let's explore a non-linear specification
# Option 1: Polynomial (quadratic) model
poly_model <- lm(democracy ~ poly(gdp_pc, 2, raw = TRUE), data = qog_clean)
summary(poly_model)
# Option 2: Log transformation (common for diminishing returns)
log_model <- lm(democracy ~ log(gdp_pc), data = qog_clean)
summary(log_model)
# Compare models
library(modelsummary)
model_comparison <- list(
  "Linear (BLP)" = blp_model,
  "Quadratic" = poly_model,
  "Log-Linear" = log_model
)
modelsummary(model_comparison, stars = TRUE, output = "markdown")
# Create comparison plot
library(patchwork)
# Generate predictions from all models
comparison_data <- data.frame(
  gdp_pc = seq(min(qog_clean$gdp_pc), max(qog_clean$gdp_pc), length.out = 200)
)
comparison_data$linear_pred <- predict(blp_model, newdata = comparison_data)
comparison_data$quadratic_pred <- predict(poly_model, newdata = comparison_data)
comparison_data$log_pred <- predict(log_model, newdata = comparison_data)
# Reshape for plotting
library(tidyr)
comparison_long <- comparison_data %>%
  pivot_longer(cols = -gdp_pc, names_to = "model", values_to = "prediction") %>%
  mutate(model = factor(model,
                        levels = c("linear_pred", "quadratic_pred", "log_pred"),
                        labels = c("Linear (BLP)", "Quadratic", "Log-Linear")))
# Plot all models together
model_comparison_plot <- ggplot(qog_clean, aes(x = gdp_pc, y = democracy)) +
  geom_point(alpha = 0.1, color = "gray50") +
  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  geom_line(data = comparison_long,
            aes(x = gdp_pc, y = prediction, color = model),
            linewidth = 1.2) +
  scale_color_brewer(palette = "Set1") +
  labs(title = "Comparing Linear and Non-Linear Specifications",
       x = "GDP per capita",
       y = "Democracy Score",
       color = "Model") +
  theme_minimal() +
  theme(legend.position = "bottom")
model_comparison_plot
Questions for 4d:
1. Does the LOESS curve from 4a support the modernization theory prediction of diminishing returns?
2. Compare the linear (BLP), quadratic, and log-linear models. Which seems to best capture the relationship suggested by the LOESS curve?
3. What are the trade-offs between using a simple linear model (BLP) versus a more flexible specification?
4. If you were writing a paper on wealth and democracy, which model would you choose and why? Consider both statistical fit and substantive interpretability.
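For the statistical-fit side of the comparison, one possible approach is to tabulate a few fit statistics for the three specifications. The sketch below does so on simulated data generated from a log relationship; the `toy` data and the `fits` and `fit_stats` objects are hypothetical, and the choice of adjusted R-squared, AIC, and BIC is an assumption about which summaries are useful, not a requirement of the assignment.

```r
# Hypothetical data generated from a log relationship
set.seed(4)
toy <- data.frame(gdp_pc = runif(500, 1000, 60000))
toy$democracy <- 0.1 * log(toy$gdp_pc) - 0.5 + rnorm(500, sd = 0.1)

# The same three specifications as in the assignment code
fits <- list(
  linear    = lm(democracy ~ gdp_pc, data = toy),
  quadratic = lm(democracy ~ poly(gdp_pc, 2, raw = TRUE), data = toy),
  log       = lm(democracy ~ log(gdp_pc), data = toy)
)

# Collect fit statistics; lower AIC/BIC and higher adjusted R-squared indicate better fit
fit_stats <- data.frame(
  model  = names(fits),
  adj_r2 = sapply(fits, function(m) summary(m)$adj.r.squared),
  aic    = sapply(fits, AIC),
  bic    = sapply(fits, BIC)
)
fit_stats
```

Fit statistics only address part of the question: a specification with slightly worse fit may still be preferable if its coefficients are easier to interpret substantively.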