Due Date: January 16, 2026
Submission: https://canvas.northwestern.edu/courses/245562/assignments/1676620
Explain in your own words: What is regression trying to achieve? How does it differ from simply calculating the correlation between two variables?
In your experience and judgment, what is the difference between a good and a bad scatterplot? What are the goals of visualizations like scatterplots in the context of social science data analysis?
Let’s figure out whether there were equal numbers of ICE arrests in 2024 in states with Democratic and Republican governors. We have two datasets: one pulled from Wikipedia that lists details about governors, and one recording ICE arrests. Let’s start by loading both into our R workspace.
governor_data <- read.csv("https://github.com/jnseawright/ps210/raw/refs/heads/main/Data/stategovs.csv")
ice_arrests <- read.csv("https://github.com/jnseawright/ps210/raw/refs/heads/main/Data/icearrests.csv")
Now we want to create a new variable in the ICE data that records the party of the governor of the state where each arrest happened.
#This command creates a new empty variable called Party.
ice_arrests$Party <- NA
#This block of commands is going to loop through the ice_arrests database and
#check the relevant partisanship of the governor for each arrest
for (i in seq_len(nrow(ice_arrests))) {
  #This command is checking if a given arrest happened in one of the 50 states.
  #Some arrests have no recorded location, some happen in international travel,
  #some happen on military bases, etc. For those, we'll record the party as
  #missing.
  if (!ice_arrests$State[i] %in% levels(as.factor(governor_data$State))) {
    ice_arrests$Party[i] <- NA
  } else {
    #When the party isn't missing, we'll set it from the governor data.
    ice_arrests$Party[i] <- governor_data$Party[governor_data$State == ice_arrests$State[i]]
  }
}
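The loop above looks up one arrest at a time. As a minimal sketch of the same lookup done in a single vectorized step, the example below uses match() on small made-up tables. The data frames toy_govs and toy_arrests and their values are hypothetical stand-ins, not the course data.

```r
# Toy stand-ins for the real data; all values are invented for illustration.
toy_govs <- data.frame(
  State = c("Illinois", "Texas", "Ohio"),
  Party = c("Democratic", "Republican", "Republican")
)
toy_arrests <- data.frame(
  State = c("Texas", "Illinois", "Guam", NA, "Texas")
)

# match() gives, for each arrest, the row of toy_govs with the same State,
# or NA when the state is not in the governor table. Indexing Party with
# that vector therefore leaves unmatched arrests with Party = NA.
toy_arrests$Party <- toy_govs$Party[match(toy_arrests$State, toy_govs$State)]

# Count arrests by party, keeping the missing category visible.
table(toy_arrests$Party, useNA = "ifany")
```

On a large dataset this approach is usually much faster than a row-by-row loop, and it should give the same result as long as each state appears only once in the governor table.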
3a. Run a regression predicting state-level ICE arrests as a function of the party controlling the state governorship. Interpret the results.
3b. Before running your regression, create an appropriate visualization to explore the relationship between governor’s party and ICE arrests. Consider using a boxplot, violin plot, or jittered scatterplot. Describe what you observe in the visualization and how it complements or contrasts with your regression results.
Hint: You may need to aggregate the data to state level or sample appropriately for visualization given the dataset’s size.
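One way to act on the hint, sketched under the assumption that each row of the data is a single arrest: count rows per state, then plot the state-level counts by party. The toy_arrests data frame and the Arrests column below are hypothetical and the values are invented, so this only illustrates the mechanics.

```r
# Toy arrest-level data: one row per arrest (invented values).
toy_arrests <- data.frame(
  State = c("Texas", "Texas", "Illinois", "Ohio", "Ohio", "Ohio"),
  Party = c("Republican", "Republican", "Democratic",
            "Republican", "Republican", "Republican")
)

# Give every arrest a count of 1, then sum within each State/Party pair.
# The formula interface of aggregate() drops rows with missing values,
# so arrests with no recorded party would be left out.
toy_arrests$Arrests <- 1
state_counts <- aggregate(Arrests ~ State + Party, data = toy_arrests, FUN = sum)
state_counts

# A boxplot of state-level arrest counts, one box per party.
boxplot(Arrests ~ Party, data = state_counts,
        xlab = "Governor's party", ylab = "Arrests per state (toy values)")
```

Working with one row per state keeps the plot readable and makes the unit of analysis match the state-level question being asked.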
A common feature of regression in social science applications is multivariate analysis. It might make sense to add state populations as a conditioning variable in the regression from the previous problem. Let’s start by adding data on population to our ice_arrests dataset.
state_pops <- read.csv("https://github.com/jnseawright/ps210/raw/refs/heads/main/Data/statepops.csv")
Now we can use a version of our code from above that copies in state populations instead of partisanship.
#This command creates a new empty variable called Population.
ice_arrests$Population <- NA
#This block of commands is going to loop through the ice_arrests database and
#check the relevant population of the state for each arrest
for (i in seq_len(nrow(ice_arrests))) {
  #This command is checking if a given arrest happened in one of the 50 states.
  #Some arrests have no recorded location, some happen in international travel,
  #some happen on military bases, etc. For those, we'll record the population as
  #missing.
  if (!ice_arrests$State[i] %in% levels(as.factor(state_pops$State))) {
    ice_arrests$Population[i] <- NA
  } else {
    #When the population isn't missing, we'll set it from the state population data.
    ice_arrests$Population[i] <-
      state_pops$Population2024[state_pops$State == ice_arrests$State[i]]
  }
}
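#This variable often reads in with commas and gets treated as text, so we'll
#make sure to convert it to an actual number. parse_number() comes from the
#readr package, so we load that first. The as.character() call guards against
#the case where the column was already read in as a number.
library(readr)
ice_arrests$Population <- parse_number(as.character(ice_arrests$Population))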
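To see what this conversion does, here is a small base R sketch that strips commas from text and converts the result to numbers. The population strings are made up for illustration, and gsub() with as.numeric() is an alternative shown for comparison, not the method used above.

```r
# Invented population strings in the comma-formatted style described above.
pop_text <- c("12,710,158", "31,290,831", NA)

# Remove every comma, then convert the remaining digits to a number.
# Missing values stay missing.
pop_number <- as.numeric(gsub(",", "", pop_text))
pop_number
```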
4a. Run a regression predicting state-level ICE arrests as a function of which party controls the governorship and also state population. Once again, interpret your results, also describing any interesting comparisons with the bivariate results in Problem 3.
4b. Create a scatterplot of ICE arrests against state population, using color to distinguish between states with Democratic and Republican governors. Add regression lines (either separate lines for each party or one overall line) to visualize the relationship. Discuss how this visualization helps you understand the multivariate regression results. What patterns do you see that might not be apparent from the regression table alone?
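The mechanics of Problem 4 can be sketched on toy data: fit a regression with both party and population, then draw a scatterplot colored by party with a separate fitted line for each party. Everything in toy_states is invented, and the column names Party, Population, and Arrests are assumptions about how a state-level dataset might be organized.

```r
# Toy state-level data (all values invented for illustration).
toy_states <- data.frame(
  Party = c("Democratic", "Democratic", "Democratic",
            "Republican", "Republican", "Republican"),
  Population = c(2, 6, 12, 3, 7, 11),           # millions
  Arrests = c(150, 420, 900, 300, 650, 1000)
)

# Multivariate regression: arrests as a function of party and population.
fit <- lm(Arrests ~ Party + Population, data = toy_states)
coef(fit)

# Scatterplot with one color per party.
party_colors <- c(Democratic = "blue", Republican = "red")
plot(toy_states$Population, toy_states$Arrests,
     col = party_colors[as.character(toy_states$Party)], pch = 19,
     xlab = "Population (millions, toy values)",
     ylab = "Arrests (toy values)")

# Add a separate bivariate regression line for each party.
for (p in names(party_colors)) {
  party_fit <- lm(Arrests ~ Population,
                  data = toy_states[toy_states$Party == p, ])
  abline(party_fit, col = party_colors[p])
}
legend("topleft", legend = names(party_colors), col = party_colors, pch = 19)
```

Separate lines allow the slope on population to differ by party, which the additive model in fit does not. Comparing the two is one way to judge whether a single shared slope is a reasonable simplification.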
Reflection question: Based on your analysis in Problems 3 and 4, what are at least two limitations of using governor’s partisanship as a predictor of ICE arrests? Consider both methodological issues (e.g., measurement, confounding variables) and substantive concerns (e.g., causal mechanisms, political context).