class: center, middle, inverse, title-slide .title[ # 2: Regression for Causal Inference ] .subtitle[ ## Quantitative Causal Inference ] .author[ ###
Jaye Seawright
] .institute[ ###
Northwestern Political Science
] .date[ ### April 9 and 14, 2026 ] --- class: center, middle --- ### Why Regression? From Experiments to Observational Studies - In an experiment, random assignment ensures that treatment is independent of potential outcomes. - In observational data, we worry that treated and control units differ on other characteristics (confounders). - Regression attempts to **adjust for those differences** by holding confounders constant. --- ### The Intuition: Adjusting for a Confounder ``` r lm_biased <- lm(Stateterrorism ~ TrumpShare, data = election2016) coef(lm_biased)["TrumpShare"] ``` ``` ## TrumpShare ## -0.2110919 ``` ``` r # Adjust for past election and Turnout lm_adjusted <- lm(Stateterrorism ~ TrumpShare + Turnout16 + Turnout12 + D12Margin, data = election2016) coef(lm_adjusted)["TrumpShare"] ``` ``` ## TrumpShare ## 0.02730885 ``` --- - The coefficient changes because we are now comparing treated and control units having subtracted off the estimated effects of turnout and electoral history. - This is the core idea: **control for confounders to mimic randomization**. --- ### When Does This Work? The Key Assumption **Unconfoundedness (selection on observables)** \[ D_i \perp (Y_i(0), Y_i(1)) \mid X_i \] --- - Conditional on observed covariates \(X\), treatment is as good as randomly assigned. - This means: no unmeasured confounders that affect both treatment and outcome. --- **Overlap / Common Support** For every value of \(X\), there must be both treated and control units. --- ### Visualizing Confounding with DAGs ``` ## Warning: package 'dagitty' was built under R version 4.5.3 ``` <!-- --> --- - \(X\) is a confounder: it affects both treatment \(D\) and outcome \(Y\). - Controlling for \(X\) blocks the backdoor path \(D \leftarrow X \rightarrow Y\). - This gives us the causal effect of \(D\) on \(Y\). --- ### What Does "Controlling For" Mean? - Intuitively: we are asking what the relationship between treatment and outcome would be if age were held constant. - Mathematically: regression uses the variation in treatment that is **not explained by age** to estimate the effect. **Key idea**: Including a confounder removes its influence from both treatment and outcome, reducing bias. --- ### When Is a Variable a Confounder? A confounder is a variable that: 1. Affects the treatment, 2. Affects the outcome, 3. Is not itself affected by the treatment. We can represent this with a **Directed Acyclic Graph (DAG)**. --- ### DAG: A Simple Confounder <!-- --> --- - `\(X\)` (age) influences both treatment `\(D\)` and outcome `\(Y\)`. - This opens a **backdoor path** `\(D \leftarrow X \rightarrow Y\)`. - Controlling for `\(X\)` **blocks** this path, giving us the direct effect `\(D \rightarrow Y\)`. --- ### Good Controls: Block Backdoor Paths Any variable that lies on a backdoor path from `\(D\)` to `\(Y\)` and is not itself a descendant of `\(D\)` is a **good control**. Example from the terrorism study: `past turnout` and `voting history` are plausible confounders – they might affect both vote choice in 2016 and subsequent terrorism. --- ### Bad Controls: Mediators <!-- --> --- - `\(M\)` is a **mediator**: it lies on the causal path from `\(D\)` to `\(Y\)`. - Controlling for `\(M\)` would **block part of the causal effect** we want to estimate. - This is a form of **overcontrol bias**. In the Peru study, if `enojado` (anger) is a mediator, we should **not** control for it when estimating the total effect of `simpletreat` on voting. --- ### Bad Controls: Colliders <!-- --> --- - `\(C\)` is a **collider**: it is caused by both `\(D\)` and `\(Y\)`. - Conditioning on a collider **opens a non‑causal path** `\(D \rightarrow C \leftarrow Y\)`, creating **selection bias**. --- Example: if we control for a variable that is itself affected by treatment and outcome (e.g., turnout in 2020 in the terrorism example), we can introduce bias. --- ### Bad Controls: Post‑Treatment Variables Any variable that is caused by the treatment is **post‑treatment**. Controlling for such a variable can: - Block part of the causal effect (if it is a mediator), - Create collider bias (if it is also affected by other factors), - Generally lead to incorrect estimates. --- **Rule of thumb**: Only control for variables measured **before** treatment. --- ### Translating DAGs to Regression Practice When building a regression model for causal inference: 1. **List all potential confounders** – variables that affect both treatment and outcome. 2. **Draw a DAG** (even informally) to check for backdoor paths. 3. **Do not control for mediators or colliders**. 4. **Do not control for post‑treatment variables**. 5. **Include interactions if you suspect heterogeneous effects** (more on this later). --- ### The Problem of Unobserved Confounders We can only control for variables we have measured. If an important confounder is unmeasured, our estimate may still be biased. This is why **sensitivity analysis** is crucial – we need to ask: how strong would an unmeasured confounder have to be to overturn our conclusion? (We will return to this in later weeks.) --- ### Summary: Regression for Causal Inference - Regression can reduce bias by adjusting for confounders. - Use DAGs to decide which variables to control for. - Avoid bad controls: mediators, colliders, post‑treatment variables. - Unobserved confounders remain a threat – be humble. - Heterogeneous effects mean the regression coefficient is a weighted average; explore variation. --- Let's see how this works in practice in an excellent application. --- <img src="images/luputitleauthor.JPG" width="90%" /> --- <img src="images/lupu1.JPG" width="90%" /> --- <img src="images/lupu2.JPG" width="90%" /> --- <img src="images/lupu3.JPG" width="90%" /> --- <img src="images/lupu4.JPG" width="70%" /> --- ``` r library(dagitty) ``` --- Let's assemble a DAG that captures what appears to be the theoretical intuition behind Lupu and Peisahkin's analysis. --- ``` r LupuPeisahkinDAG1 <- dagitty( "dag {Dekulakized -> Victimization PreSovietWealth -> Victimization SovietOpposition -> Victimization PreSovietReligiosity -> Victimization PriorRegion -> Victimization DeportationRegion -> Victimization DeportationRegion -> Religiosity PriorRegion -> Religiosity PreSovietReligiosity -> Religiosity SovietOpposition -> Religiosity PreSovietWealth -> Religiosity Victimization -> Religiosity Dekulakized -> Religiosity}" ) ``` --- ``` r plot( LupuPeisahkinDAG1 ) ``` ``` ## Plot coordinates for graph not supplied! Generating coordinates, see ?coordinates for how to set your own. ``` <img src="2regression_files/figure-html/unnamed-chunk-8-1.png" width="50%" /> --- * In this initial setup, there are a lot of variables that are confounders for the main relationship of interest, between Victimization and Religiosity. Nothing is post-treatment, etc. * Is this the only plausible causal framework? What about... --- ``` r LupuPeisahkinDAG2 <- dagitty( "dag {Dekulakized <- Victimization PreSovietWealth -> Victimization SovietOpposition -> Victimization PreSovietReligiosity -> Victimization PriorRegion -> Victimization DeportationRegion <- Victimization DeportationRegion -> Religiosity PriorRegion -> Religiosity PreSovietReligiosity -> Religiosity SovietOpposition -> Religiosity PreSovietWealth -> Religiosity Victimization -> Religiosity Dekulakized -> Religiosity}" ) ``` --- ``` r plot( LupuPeisahkinDAG2 ) ``` ``` ## Plot coordinates for graph not supplied! Generating coordinates, see ?coordinates for how to set your own. ``` <img src="2regression_files/figure-html/unnamed-chunk-10-1.png" width="50%" /> --- * Here, Victimization causes DeportationRegion and Dekulakized. How does this change the appropriate set of control variables? --- ``` r adjustmentSets(LupuPeisahkinDAG1, exposure = "Victimization", outcome = "Religiosity") ``` ``` ## { Dekulakized, DeportationRegion, PreSovietReligiosity, ## PreSovietWealth, PriorRegion, SovietOpposition } ``` ``` r adjustmentSets(LupuPeisahkinDAG2, exposure = "Victimization", outcome = "Religiosity") ``` ``` ## { PreSovietReligiosity, PreSovietWealth, PriorRegion, SovietOpposition ## } ``` --- ``` r impliedConditionalIndependencies(LupuPeisahkinDAG1) ``` ``` ## Dklk _||_ DprR ## Dklk _||_ PrSR ## Dklk _||_ PrSW ## Dklk _||_ PrrR ## Dklk _||_ SvtO ## DprR _||_ PrSR ## DprR _||_ PrSW ## DprR _||_ PrrR ## DprR _||_ SvtO ## PrSR _||_ PrSW ## PrSR _||_ PrrR ## PrSR _||_ SvtO ## PrSW _||_ PrrR ## PrSW _||_ SvtO ## PrrR _||_ SvtO ``` --- Are presidential regimes especially vulnerable to democratic backsliding? Let's try to get a regression-based answer as an application of what we've learned. --- ``` ## Warning: package 'ggdag' was built under R version 4.5.3 ``` ``` ## ## Attaching package: 'ggdag' ``` ``` ## The following object is masked from 'package:stats': ## ## filter ``` <img src="2regression_files/figure-html/unnamed-chunk-13-1.png" width="50%" /> --- ``` r library(tidyverse) ``` --- ``` r qog_std_ts_jan22 <- read_csv("https://github.com/jnseawright/PS406/raw/main/data/qog_std_ts_jan22.csv") ``` ``` ## Warning: One or more parsing issues, call `problems()` on your data frame for details, ## e.g.: ## dat <- vroom(...) ## problems(dat) ``` ``` ## Rows: 15168 Columns: 1915 ## ── Column specification ──────────────────────────────────────────────────────── ## Delimiter: "," ## chr (7): cname, cname_qog, ccodealp, version, cname_year, ccodealp_year, ... ## dbl (1905): ccode, year, ccode_qog, ccodecow, aid_cpnc, aid_cpsc, aid_crnc, ... ## lgl (3): psi_cpsi2, psi_edate2, psi_psi2 ## ## ℹ Use `spec()` to retrieve the full column specification for this data. ## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message. ``` ``` r qogdems2000 <- qog_std_ts_jan22 %>% filter(year==2000 & vdem_libdem > 0.5) qog2000demsin2020 <- qog_std_ts_jan22 %>% filter(year==2020 & cname %in% qogdems2000$cname) preslm <- lm(vdem_libdem ~ br_pres, data=qog2000demsin2020) ``` --- ``` r summ(preslm) ``` <table class="table table-striped table-hover table-condensed table-responsive" style="width: auto !important; margin-left: auto; margin-right: auto;"> <tbody> <tr> <td style="text-align:left;font-weight: bold;"> Observations </td> <td style="text-align:right;"> 61 (1 missing obs. deleted) </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> Dependent variable </td> <td style="text-align:right;"> vdem_libdem </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> Type </td> <td style="text-align:right;"> OLS linear regression </td> </tr> </tbody> </table> <table class="table table-striped table-hover table-condensed table-responsive" style="width: auto !important; margin-left: auto; margin-right: auto;"> <tbody> <tr> <td style="text-align:left;font-weight: bold;"> F(1,59) </td> <td style="text-align:right;"> 2.05 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> R² </td> <td style="text-align:right;"> 0.03 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> Adj. R² </td> <td style="text-align:right;"> 0.02 </td> </tr> </tbody> </table> <table class="table table-striped table-hover table-condensed table-responsive" style="width: auto !important; margin-left: auto; margin-right: auto;border-bottom: 0;"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> Est. </th> <th style="text-align:right;"> S.E. </th> <th style="text-align:right;"> t val. </th> <th style="text-align:right;"> p </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;font-weight: bold;"> (Intercept) </td> <td style="text-align:right;"> 0.71 </td> <td style="text-align:right;"> 0.03 </td> <td style="text-align:right;"> 27.19 </td> <td style="text-align:right;"> 0.00 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> br_pres </td> <td style="text-align:right;"> -0.06 </td> <td style="text-align:right;"> 0.04 </td> <td style="text-align:right;"> -1.43 </td> <td style="text-align:right;"> 0.16 </td> </tr> </tbody> <tfoot><tr><td style="padding: 0; " colspan="100%"> <sup></sup> Standard errors: OLS</td></tr></tfoot> </table> --- ``` r preslm2 <- lm(vdem_libdem ~ br_pres + as.factor(ht_colonial) + ccp_systyear + br_pvote, data=qog2000demsin2020) ``` --- ``` r summ(preslm2) ``` <table class="table table-striped table-hover table-condensed table-responsive" style="width: auto !important; margin-left: auto; margin-right: auto;"> <tbody> <tr> <td style="text-align:left;font-weight: bold;"> Observations </td> <td style="text-align:right;"> 58 (4 missing obs. deleted) </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> Dependent variable </td> <td style="text-align:right;"> vdem_libdem </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> Type </td> <td style="text-align:right;"> OLS linear regression </td> </tr> </tbody> </table> <table class="table table-striped table-hover table-condensed table-responsive" style="width: auto !important; margin-left: auto; margin-right: auto;"> <tbody> <tr> <td style="text-align:left;font-weight: bold;"> F(9,48) </td> <td style="text-align:right;"> 3.30 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> R² </td> <td style="text-align:right;"> 0.38 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> Adj. R² </td> <td style="text-align:right;"> 0.27 </td> </tr> </tbody> </table> <table class="table table-striped table-hover table-condensed table-responsive" style="width: auto !important; margin-left: auto; margin-right: auto;border-bottom: 0;"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> Est. </th> <th style="text-align:right;"> S.E. </th> <th style="text-align:right;"> t val. </th> <th style="text-align:right;"> p </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;font-weight: bold;"> (Intercept) </td> <td style="text-align:right;"> 2.50 </td> <td style="text-align:right;"> 0.76 </td> <td style="text-align:right;"> 3.29 </td> <td style="text-align:right;"> 0.00 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> br_pres </td> <td style="text-align:right;"> 0.03 </td> <td style="text-align:right;"> 0.04 </td> <td style="text-align:right;"> 0.76 </td> <td style="text-align:right;"> 0.45 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> as.factor(ht_colonial)1 </td> <td style="text-align:right;"> -0.31 </td> <td style="text-align:right;"> 0.14 </td> <td style="text-align:right;"> -2.26 </td> <td style="text-align:right;"> 0.03 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> as.factor(ht_colonial)2 </td> <td style="text-align:right;"> -0.09 </td> <td style="text-align:right;"> 0.07 </td> <td style="text-align:right;"> -1.42 </td> <td style="text-align:right;"> 0.16 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> as.factor(ht_colonial)5 </td> <td style="text-align:right;"> -0.12 </td> <td style="text-align:right;"> 0.05 </td> <td style="text-align:right;"> -2.28 </td> <td style="text-align:right;"> 0.03 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> as.factor(ht_colonial)6 </td> <td style="text-align:right;"> -0.24 </td> <td style="text-align:right;"> 0.11 </td> <td style="text-align:right;"> -2.26 </td> <td style="text-align:right;"> 0.03 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> as.factor(ht_colonial)7 </td> <td style="text-align:right;"> -0.13 </td> <td style="text-align:right;"> 0.08 </td> <td style="text-align:right;"> -1.59 </td> <td style="text-align:right;"> 0.12 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> as.factor(ht_colonial)9 </td> <td style="text-align:right;"> -0.07 </td> <td style="text-align:right;"> 0.13 </td> <td style="text-align:right;"> -0.49 </td> <td style="text-align:right;"> 0.63 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> ccp_systyear </td> <td style="text-align:right;"> -0.00 </td> <td style="text-align:right;"> 0.00 </td> <td style="text-align:right;"> -2.36 </td> <td style="text-align:right;"> 0.02 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> br_pvote </td> <td style="text-align:right;"> 0.04 </td> <td style="text-align:right;"> 0.05 </td> <td style="text-align:right;"> 0.95 </td> <td style="text-align:right;"> 0.35 </td> </tr> </tbody> <tfoot><tr><td style="padding: 0; " colspan="100%"> <sup></sup> Standard errors: OLS</td></tr></tfoot> </table> --- ### Is it better to include all but one control variable, or have fewer? It turns out, maybe surprisingly, that there isn't just one answer to this question. * Sometimes, you will get an estimate with less bias by leaving out valid control variables from an incomplete model. * Sometimes you will do better by including as many control variables as you can even if you don't know them all. * We don't know which scenario is which in general. --- ### How Much Does It Hurt to Be Wrong? Suppose we know we can get a good causal inference from a regression with two control variables and our treatment, but we don't have access to any way of measuring one of those control variables. We have two options: 1. Estimate a regression including the treatment and one control variable. 2. Estimate a bivariate regression including only the treatment. --- Both of these options will have omitted variable bias. 1. The first option will have the bias due to leaving out the unmeasurable variable. 2. The second option will have the bias due to leaving out the unmeasurable variable, and the bias due to leaving out the measured control variable. --- These two sources of bias can point in the same direction, in which scenario the second option will usually be worse than the first. Or the two sources of bias can point in opposite directions, in which scenario the second option can often be better than the first. --- <img src="images/Clarke1.png" width="90%" /> --- <img src="images/Clarke2.png" width="90%" /> --- As we can see, there are in principle about as many situations where leaving out partial sets of control variables is the right choice as there are situations where including them is the right choice. --- ### Regression for Causal Inference - If the causal effect is not constant across all cases, regression will not give a consistent estimate of the average treatment effect. - Instead, it estimates a covariance-adjusted weighted average of cases' treatment effects. --- Part of our goal in this seminar is to explore advanced techniques. You don't always have to implement each of these, but learning about them can give you good ideas and broaden your skill set! --- ### Aronow and Samii 2016 <img src="images/aronowsamii1.JPG" width="90%" /> --- ### Aronow and Samii 2016 <img src="images/aronowsamii2.JPG" width="90%" /> --- ### Aronow and Samii 2016 <img src="images/aronowsamii3.JPG" width="90%" /> --- ### Chattopadhyay and Zubizarreta 2023 <img src="images/Chattopadhyay1.png" width="90%" /> --- ### Chattopadhyay and Zubizarreta 2023 `$$\hat{\tau}_{OLS} = \sum_{i:D_{i}=1} w_{i} Y_{i} - \sum_{i:D_{i}=0} w_{i} Y_{i}$$` --- ### Chattopadhyay and Zubizarreta 2023 There is a different set of weights called the ''multi-regression imputation'' or MRI: `$$\hat{\tau}_{MRI} = \sum_{i:D_{i}=1} w^{MRI}_{i}(\bar{X}) Y_{i} - \sum_{i:D_{i}=0} w^{MRI}_{i}(\bar{X}) Y_{i}$$` `$$w^{MRI}_{i}(x) = n^{−1}_{T} + (X_{i} − \bar{X}_{T})^{T}S^{−1}_{T}(x − \bar{X}_{T})$$` --- ### Chattopadhyay and Zubizarreta 2023 If there is causal heterogeneity, the OLS weights are a biased estimator of ATE, but the MRI weights can be an unbiased estimator. --- ``` r library(lmw) ``` ``` ## Warning: package 'lmw' was built under R version 4.5.3 ``` ``` r qog2000demsin2020clean <- qog2000demsin2020 %>% filter(!is.na(br_pres) & !is.na(ccp_systyear) & !is.na(ht_colonial) & !is.na(br_pvote)) ``` --- ``` r preslmw.uri <- lmw(~ br_pres + as.factor(ht_colonial) + br_pvote, data = qog2000demsin2020clean, estimand = "ATT", method = "URI", treat = "br_pres") preslmw.urifit <- lmw_est(preslmw.uri, outcome = "vdem_libdem") ``` ``` ## Warning in meatHC(x, type = type, omega = omega): HC3 covariances are ## numerically unstable for hat values close to 1 (and undefined if exactly 1) as ## for observation(s) 29, 44 ``` ``` r summary(preslmw.urifit) ``` ``` ## ## Effect estimates: ## Estimate Std. Error 95% CI L 95% CI U t value Pr(>|t|) ## E[Y₁-Y₀|A=1] 0.02045 0.03772 -0.05535 0.09625 0.542 0.59 ## ## Residual standard error: 0.1371 on 49 degrees of freedom ``` --- ``` r preslmw.mri <- lmw(~ br_pres + as.factor(ht_colonial) + br_pvote, data = qog2000demsin2020clean, estimand = "ATT", method = "MRI", treat = "br_pres") preslmw.mrifit <- lmw_est(preslmw.mri, outcome = "vdem_libdem") ``` ``` ## Warning in meatHC(x, type = type, omega = omega): HC3 covariances are ## numerically unstable for hat values close to 1 (and undefined if exactly 1) as ## for observation(s) 11, 29, 44 ``` ``` r summary(preslmw.mrifit) ``` ``` ## ## Effect estimates: ## Estimate Std. Error 95% CI L 95% CI U t value Pr(>|t|) ## E[Y₁-Y₀|A=1] 0.02144 0.03880 -0.05667 0.09955 0.552 0.583 ## ## Residual standard error: 0.1396 on 46 degrees of freedom ## ## Potential outcome means: ## Estimate Std. Error 95% CI L 95% CI U t value Pr(>|t|) ## E[Y₀|A=1] 0.62856 0.03964 0.54877 0.70835 15.86 <2e-16 *** ## E[Y₁|A=1] 0.65000 0.03184 0.58590 0.71410 20.41 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` --- ``` r plot(preslmw.mri) ``` <img src="2regression_files/figure-html/unnamed-chunk-28-1.png" width="60%" />