class: center, middle, inverse, title-slide .title[ # 5: Instrumental Variables and RDD ] .subtitle[ ## Quantitative Causal Inference ] .author[ ###
J. Seawright
] .institute[ ###
Northwestern Political Science
] .date[
### May 1, 2025
]

---
class: center, middle

<style type="text/css">
pre {
  max-height: 400px;
  overflow-y: auto;
}
pre[class] {
  max-height: 200px;
}
</style>

---

### Endogeneity in OLS

- `\(E(\mathbf{u} | \mathbf{X}) = 0\)`?
- As you'll recall, `\(E(\hat{\mathbf{\beta}}) = \mathbf{\beta} + (\mathbf{X}^{T} \mathbf{X})^{-1} E(\mathbf{X}^{T} \mathbf{u})\)`. So, if `\(E(\mathbf{X}^{T} \mathbf{u}) = \mathbf{\nu} \neq 0\)`, then `\(E(\hat{\mathbf{\beta}} - \mathbf{\beta}) = (\mathbf{X}^{T} \mathbf{X})^{-1} \mathbf{\nu} \neq 0\)`.

---

### Consequences of Endogeneity

- When `\(\mathbf{X}\)` is endogenous, our estimate `\(\hat{\mathbf{\beta}}\)` will be a mixture of the desired relationship between `\(\mathbf{X}\)` and `\(\mathbf{y}\)` *and* the nuisance relationship between `\(\mathbf{X}\)` and `\(\mathbf{u}\)`.

---

### How Can Endogeneity Arise?

- Omitted explanatory variables
- Measurement error on the right-hand side of the model
- Simultaneity between the right- and left-hand sides of the model
- etc.

---

### What to Do When Endogeneity Is a Problem?

1. Give up.
2. Try to change the model by including all omitted relevant variables.
3. Find an instrument.
4. Find other data.

---

### Instrumental Variables

- Suppose the model is: `\(\mathbf{y} = \mathbf{W} \mathbf{\gamma} + \mathbf{x} \beta + \mathbf{\epsilon}\)`. The `\(\mathbf{W}\)` variables are exogenous, but the `\(\mathbf{x}\)` variable is endogenous.
- Now, assume that there exists a variable `\(\mathbf{z}\)` that *doesn't* belong in the regression model, with the following two characteristics:
- `\(cov(\mathbf{z}, \mathbf{x}) \neq 0\)`
- `\(E(\mathbf{z}^{T} \mathbf{\epsilon}) = 0\)`

---

### Instrumental Variables

If these conditions are met (doesn't belong in the regression, is correlated with `\(\mathbf{x}\)`, and has no connection with `\(\mathbf{\epsilon}\)`), then `\(\mathbf{z}\)` meets the mathematical definition of an *instrument*.

---

### Instrumental Variables

---


``` r
library(dagitty)
hypotheticalinstruments.dag <- dagitty(
  "dag {
  Polarization -> DemocraticErosion
  ElitePower -> DemocraticErosion
  Corruption -> Polarization
  Corruption -> DemocraticErosion
  SocialMedia -> Polarization
  PrimaryElections -> Polarization
  PrimaryElections -> ElitePower
  EconomicInequality -> ElitePower
  EconomicInequality -> Polarization
  Polarization [exposure]
  DemocraticErosion [outcome]}"
)
```

---


``` r
plot( hypotheticalinstruments.dag )
```

<img src="5naturalexperiments2_files/figure-html/unnamed-chunk-3-1.png" width="50%" />

---


``` r
instrumentalVariables(hypotheticalinstruments.dag)
```

```
## EconomicInequality | ElitePower
## PrimaryElections | ElitePower
## SocialMedia
```

---

### Instrumental Variables

- Let's momentarily consider a bivariate regression, `\(\mathbf{y} = \mathbf{x} \beta + \mathbf{\epsilon}\)`, with instrument `\(\mathbf{z}\)`.
- The OLS estimate of `\(\beta\)` is `\((\mathbf{x}^{T}\mathbf{x})^{-1} \mathbf{x}^{T}\mathbf{y}\)`.
- Consider instead the IV estimate of `\(\beta\)`: `\((\mathbf{z}^{T}\mathbf{x})^{-1} \mathbf{z}^{T}\mathbf{y}\)`.
- `\(E(\hat{\beta}_{IV}) = E((\mathbf{z}^{T}\mathbf{x})^{-1} \mathbf{z}^{T}\mathbf{y}) = E((\mathbf{z}^{T}\mathbf{x})^{-1} \mathbf{z}^{T} [\mathbf{x} \beta + \mathbf{\epsilon}]) = E((\mathbf{z}^{T}\mathbf{x})^{-1} \mathbf{z}^{T} \mathbf{x} \beta) + E((\mathbf{z}^{T}\mathbf{x})^{-1} \mathbf{z}^{T} \mathbf{\epsilon}) = \beta + 0\)`
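---

### Instrumental Variables

- A minimal simulation sketch, using made-up data with a true `\(\beta = 2\)` and a confounder `\(u\)` shared by `\(\mathbf{x}\)` and the error, illustrates the contrast between the two slope formulas (all names and numbers below are illustrative):


``` r
# Sketch: compare the OLS and IV slope formulas when x is confounded.
set.seed(406)
n <- 5000
u <- rnorm(n)                  # unobserved confounder
z <- rnorm(n)                  # instrument: shifts x, unrelated to u
x <- 0.6 * z + u + rnorm(n)    # endogenous regressor
y <- 2 * x + u + rnorm(n)      # true beta = 2; u ends up in the error term

sum(x * y) / sum(x * x)        # OLS slope: pulled away from 2 by u
sum(z * y) / sum(z * x)        # IV slope: should land close to 2
```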
---

### Instrumental Variables

- Now let's consider a multivariate regression, `\(\mathbf{y} = \mathbf{X} \mathbf{\beta} + \mathbf{\epsilon}\)`, with some `\(t \leq k\)` of the `\(\mathbf{X}\)` variables endogenous, and with `\(t\)` instruments `\(\mathbf{z}_{1} \ldots \mathbf{z}_{t}\)`.

---

### Instrumental Variables

- The OLS estimate of `\(\mathbf{\beta}\)` is `\((\mathbf{X}^{T}\mathbf{X})^{-1} \mathbf{X}^{T}\mathbf{y}\)`.
- Form the matrix `\(\mathbf{Z}\)`, containing the `\(t\)` instruments, as well as the `\(k - t\)` exogenous elements from `\(\mathbf{X}\)`.
- The IV estimate of `\(\mathbf{\beta}\)` is: `\((\mathbf{Z}^{T}\mathbf{X})^{-1} \mathbf{Z}^{T}\mathbf{y}\)`.

---

### Instrumental Variables

- As in the bivariate situation, given the IV assumptions, the IV estimator eliminates the problem of endogeneity.
- This estimator only works when the number of instruments exactly equals the number of endogenous variables (the just-identified case), so that `\(\mathbf{Z}^{T}\mathbf{X}\)` is square and invertible.

---

### Another Way of Thinking About Instrumental Variables

- Let's partition the independent variables in `\(\mathbf{X}\)` into `\(\mathbf{W}\)`, the `\(k - t\)` exogenous variables in the model of `\(\mathbf{y}\)`, and the `\(t\)` endogenous variables.
- So the `\(\mathbf{Z}\)` matrix is the `\(\mathbf{W}\)` matrix with `\(t\)` extra columns containing the instruments.

---

### Another Way of Thinking About Instrumental Variables

- Suppose we regress each column of the `\(\mathbf{X}\)` matrix on the matrix `\(\mathbf{Z}\)` and form the fitted values.
- `\(\hat{\mathbf{X}} = \mathbf{Z} (\mathbf{Z}^{T} \mathbf{Z})^{-1} \mathbf{Z}^{T} \mathbf{X}\)`
- Now use `\(\hat{\mathbf{X}}\)` in the place of `\(\mathbf{X}\)` in the OLS regression formula.

---

### Another Way of Thinking About Instrumental Variables

$$
`\begin{split}
\hat{\mathbf{\beta}}_{IV} = & (\mathbf{X}^{T}\mathbf{Z} (\mathbf{Z}^{T} \mathbf{Z})^{-1} \mathbf{Z}^{T} \mathbf{Z} (\mathbf{Z}^{T} \mathbf{Z})^{-1} \mathbf{Z}^{T} \mathbf{X})^{-1} \\
 & \mathbf{X}^{T}\mathbf{Z} (\mathbf{Z}^{T} \mathbf{Z})^{-1} \mathbf{Z}^{T} \mathbf{y} = \\
 & (\mathbf{X}^{T}\mathbf{Z} (\mathbf{Z}^{T} \mathbf{Z})^{-1} \mathbf{Z}^{T} \mathbf{X})^{-1} \\
 & \mathbf{X}^{T}\mathbf{Z} (\mathbf{Z}^{T} \mathbf{Z})^{-1} \mathbf{Z}^{T} \mathbf{y} = \\
 & (\mathbf{Z}^{T}\mathbf{X})^{-1} \mathbf{Z}^{T}\mathbf{y}
\end{split}`
$$

- The instrumental variables estimator gives the same coefficient estimates as running an OLS regression with `\(\hat{\mathbf{X}}\)`, the fitted values of `\(\mathbf{X}\)` predicted by `\(\mathbf{Z}\)`, in the place of `\(\mathbf{X}\)`.

---

### Variance in Instrumental Variables

- `\(\hat{\mathbf{X}}\)` is itself an estimate, so the usual OLS standard errors from the second-stage regression misstate the uncertainty: the residuals (and hence `\(\hat{\sigma}^{2}\)`) must be computed with the original `\(\mathbf{X}\)`, not `\(\hat{\mathbf{X}}\)`.
- Instead, the correct estimate of the variance of the coefficient estimates in IV is:
- `\(\hat{V} (\hat{\mathbf{\beta}}_{IV}) = \hat{\sigma}^{2} (\mathbf{Z}^{T} \mathbf{X})^{-1} \mathbf{Z}^{T} \mathbf{Z} (\mathbf{X}^{T} \mathbf{Z})^{-1}\)`

---

### Examples of Proposed Instruments

- Suppose we're interested in the relationship between education and some political variable.
- One proposed instrument for education, due to David Card (1995), is residential proximity to a college or university.
- A second proposed instrument for education, due to Angrist and Krueger (1991) is month of birth. - A third instrument, from Nguyen et al. (2016), involves genetic risk score for years of schooling. --- ### Examples of Proposed Instruments - Suppose our focus is on the relationship between economic performance and civil war in agricultural countries. - Miguel, Satyanath, Sergenti, E. (2004) suggest using rainfall as an instrument for economic performance. --- ``` r library(haven) mss_repdata_1_ <- read_dta("https://github.com/jnseawright/PS406/raw/main/data/mss_repdata%20(1).dta") ``` --- ``` r library(ivreg) migueliv <- ivreg(any_prio ~ gdp_g + gdp_g_l + y_0 + polity2l + ethfrac + relfrac + Oil + lpopl1 + lmtnest | GPCP_g + GPCP_g_l+ y_0 + polity2l + ethfrac + relfrac + Oil + lpopl1 + lmtnest, data=mss_repdata_1_) summary(migueliv) ``` ``` ## ## Call: ## ivreg(formula = any_prio ~ gdp_g + gdp_g_l + y_0 + polity2l + ## ethfrac + relfrac + Oil + lpopl1 + lmtnest | GPCP_g + GPCP_g_l + ## y_0 + polity2l + ethfrac + relfrac + Oil + lpopl1 + lmtnest, ## data = mss_repdata_1_) ## ## Residuals: ## Min 1Q Median 3Q Max ## -1.0098 -0.3114 -0.1342 0.3796 2.0431 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -0.438746 0.137120 -3.200 0.00143 ** ## gdp_g -0.528454 1.517953 -0.348 0.72784 ## gdp_g_l -2.076062 1.781017 -1.166 0.24413 ## y_0 -0.042668 0.020714 -2.060 0.03977 * ## polity2l 0.002769 0.003220 0.860 0.39005 ## ethfrac 0.225661 0.090639 2.490 0.01301 * ## relfrac -0.236262 0.103205 -2.289 0.02235 * ## Oil 0.043934 0.056533 0.777 0.43733 ## lpopl1 0.067683 0.017231 3.928 9.38e-05 *** ## lmtnest 0.077338 0.014966 5.168 3.06e-07 *** ## ## Diagnostic tests: ## df1 df2 statistic p-value ## Weak instruments (gdp_g) 2 733 8.646 0.000194 *** ## Weak instruments (gdp_g_l) 2 733 5.943 0.002752 ** ## Wu-Hausman 2 731 0.744 0.475485 ## Sargan 0 NA NA NA ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.4421 on 733 degrees of freedom ## Multiple R-Squared: 0.01679, Adjusted R-squared: 0.004723 ## Wald test: 10.27 on 9 and 733 DF, p-value: 5.189e-15 ``` --- ``` r library(lmtest) library(sandwich) ``` --- ``` r coeftest(migueliv, vcov = vcovCL(migueliv, cluster = ~country_name)) ``` ``` ## ## t test of coefficients: ## ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -0.4387459 0.3532897 -1.2419 0.21468 ## gdp_g -0.5284537 1.4250511 -0.3708 0.71087 ## gdp_g_l -2.0760619 1.0241329 -2.0271 0.04301 * ## y_0 -0.0426678 0.0483408 -0.8826 0.37772 ## polity2l 0.0027692 0.0044092 0.6281 0.53016 ## ethfrac 0.2256606 0.2757338 0.8184 0.41339 ## relfrac -0.2362620 0.2397070 -0.9856 0.32464 ## Oil 0.0439336 0.2123598 0.2069 0.83616 ## lpopl1 0.0676828 0.0498531 1.3576 0.17499 ## lmtnest 0.0773375 0.0385422 2.0066 0.04516 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` --- ### What's Wrong with Weak Instruments? - For an IV estimate of a regression with only one independent variable and only one instrument, the IV estimator is: `\((\mathbf{z}^{T} \mathbf{x})^{-1} \mathbf{z}^{T} \mathbf{y}\)`, which is the same as `\(cov(\mathbf{z}, \mathbf{y})/cov(\mathbf{z}, \mathbf{x})\)`. --- ### What's Wrong with Weak Instruments? 
- The `\(cov(\mathbf{z}, \mathbf{y})\)` may be thought of as a combination of three components: - the direct effect of `\(\mathbf{z}\)` on `\(\mathbf{y}\)`, - the indirect effect of `\(\mathbf{z}\)` on `\(\mathbf{y}\)` via `\(\mathbf{x}\)`, - and any correlation between `\(\mathbf{z}\)` and `\(\mathbf{u}\)`. --- ### What's Wrong with Weak Instruments? - If `\(cov(\mathbf{z}, \mathbf{x})\)` is big, then a moderate amount of contamination of `\(cov(\mathbf{z}, \mathbf{y})\)` with undesirable information will have only a small effect on the estimate. - If `\(cov(\mathbf{z}, \mathbf{x})\)` is very small, then even a small amount of contamination of `\(cov(\mathbf{z}, \mathbf{y})\)` with undesirable information will lead to serious bias in the estimate. --- ``` r summary(migueliv) ``` ``` ## ## Call: ## ivreg(formula = any_prio ~ gdp_g + gdp_g_l + y_0 + polity2l + ## ethfrac + relfrac + Oil + lpopl1 + lmtnest | GPCP_g + GPCP_g_l + ## y_0 + polity2l + ethfrac + relfrac + Oil + lpopl1 + lmtnest, ## data = mss_repdata_1_) ## ## Residuals: ## Min 1Q Median 3Q Max ## -1.0098 -0.3114 -0.1342 0.3796 2.0431 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -0.438746 0.137120 -3.200 0.00143 ** ## gdp_g -0.528454 1.517953 -0.348 0.72784 ## gdp_g_l -2.076062 1.781017 -1.166 0.24413 ## y_0 -0.042668 0.020714 -2.060 0.03977 * ## polity2l 0.002769 0.003220 0.860 0.39005 ## ethfrac 0.225661 0.090639 2.490 0.01301 * ## relfrac -0.236262 0.103205 -2.289 0.02235 * ## Oil 0.043934 0.056533 0.777 0.43733 ## lpopl1 0.067683 0.017231 3.928 9.38e-05 *** ## lmtnest 0.077338 0.014966 5.168 3.06e-07 *** ## ## Diagnostic tests: ## df1 df2 statistic p-value ## Weak instruments (gdp_g) 2 733 8.646 0.000194 *** ## Weak instruments (gdp_g_l) 2 733 5.943 0.002752 ** ## Wu-Hausman 2 731 0.744 0.475485 ## Sargan 0 NA NA NA ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.4421 on 733 degrees of freedom ## Multiple R-Squared: 0.01679, Adjusted R-squared: 0.004723 ## Wald test: 10.27 on 9 and 733 DF, p-value: 5.189e-15 ``` --- ### What's Wrong with Strong Instruments? - We know that `\(\mathbf{x}\)` and `\(\mathbf{u}\)` are related, or we wouldn't bother with the IV procedure. - If `\(\mathbf{x}\)` and `\(\mathbf{z}\)` are very strongly related, and `\(\mathbf{x}\)` and `\(\mathbf{u}\)` are also substantially related, then `\(\mathbf{z}\)` and `\(\mathbf{u}\)` are almost certainly substantially related, as well. - So a central assumption of IV regression fails. --- ``` r miguellm2 <- lm(any_prio ~ gdp_g + gdp_g_l +GPCP_g + GPCP_g_l + y_0 + polity2l + ethfrac + relfrac + Oil + lpopl1 + lmtnest + year:country_name, data=mss_repdata_1_) summary(miguellm2) ``` ``` ## ## Call: ## lm(formula = any_prio ~ gdp_g + gdp_g_l + GPCP_g + GPCP_g_l + ## y_0 + polity2l + ethfrac + relfrac + Oil + lpopl1 + lmtnest + ## year:country_name, data = mss_repdata_1_) ## ## Residuals: ## Min 1Q Median 3Q Max ## -0.92987 -0.12503 -0.02391 0.08826 1.07344 ## ## Coefficients: ## Estimate Std. 
Error t value ## (Intercept) -1.240e+02 2.551e+01 -4.861 ## gdp_g -4.108e-01 1.620e-01 -2.536 ## gdp_g_l -8.588e-02 1.574e-01 -0.546 ## GPCP_g -2.773e-02 5.904e-02 -0.470 ## GPCP_g_l -1.315e-01 5.968e-02 -2.204 ## y_0 1.216e+01 4.717e+00 2.577 ## polity2l -4.581e-03 2.999e-03 -1.528 ## ethfrac 8.706e+01 2.119e+01 4.108 ## relfrac 7.141e+01 2.700e+01 2.645 ## Oil -2.507e-02 1.285e-01 -0.195 ## lpopl1 5.051e-02 3.375e-01 0.150 ## lmtnest 2.545e+00 3.513e+00 0.724 ## year:country_nameAngola -7.233e-04 9.982e-03 -0.072 ## year:country_nameBenin 1.192e-02 1.037e-02 1.150 ## year:country_nameBotswana 1.147e-02 1.019e-02 1.126 ## year:country_nameBurkina Faso 8.857e-03 1.097e-02 0.808 ## year:country_nameBurundi 3.250e-02 1.112e-02 2.921 ## year:country_nameCameroon -1.018e-02 1.014e-02 -1.004 ## year:country_nameCentral African Republic -2.671e-03 1.023e-02 -0.261 ## year:country_nameChad -2.422e-03 1.002e-02 -0.242 ## year:country_nameCongo 4.287e-03 9.966e-03 0.430 ## year:country_nameDjibouti 1.642e-02 1.079e-02 1.522 ## year:country_nameEthiopia 2.734e-03 1.078e-02 0.254 ## year:country_nameGabon -1.396e-02 1.164e-02 -1.200 ## year:country_nameGambia 1.752e-02 1.073e-02 1.633 ## year:country_nameGhana 2.581e-04 1.101e-02 0.023 ## year:country_nameGuinea 1.332e-02 1.066e-02 1.249 ## year:country_nameGuinea-Bissau 4.076e-03 1.038e-02 0.393 ## year:country_nameIvory Coast -9.641e-03 1.001e-02 -0.964 ## year:country_nameKenya -9.574e-03 1.023e-02 -0.936 ## year:country_nameLesotho 2.928e-02 1.056e-02 2.772 ## year:country_nameLiberia -3.840e-03 1.008e-02 -0.381 ## year:country_nameMadagascar 2.880e-02 1.095e-02 2.630 ## year:country_nameMalawi 6.579e-03 9.954e-03 0.661 ## year:country_nameMali 1.777e-02 1.141e-02 1.557 ## year:country_nameMauritania 4.175e-02 1.195e-02 3.495 ## year:country_nameMozambique 3.336e-03 1.003e-02 0.333 ## year:country_nameNamibia 2.909e-03 1.051e-02 0.277 ## year:country_nameNiger 1.295e-02 1.048e-02 1.236 ## year:country_nameNigeria -6.873e-03 1.022e-02 -0.672 ## year:country_nameRwanda 2.860e-02 1.065e-02 2.685 ## year:country_nameSenegal 1.812e-02 1.112e-02 1.629 ## year:country_nameSierra Leone 7.694e-04 9.732e-03 0.079 ## year:country_nameSomalia 5.019e-02 1.219e-02 4.117 ## year:country_nameSouth Africa -1.563e-02 1.090e-02 -1.433 ## year:country_nameSudan 6.611e-03 1.016e-02 0.651 ## year:country_nameSwaziland 4.874e-03 9.863e-03 0.494 ## year:country_nameTanzania, United Republic of -8.465e-03 1.069e-02 -0.792 ## year:country_nameTogo 1.040e-02 1.048e-02 0.993 ## year:country_nameUganda -9.301e-03 1.038e-02 -0.896 ## year:country_nameZaire -6.586e-03 1.058e-02 -0.623 ## year:country_nameZambia 3.918e-03 1.033e-02 0.379 ## year:country_nameZimbabwe 1.104e-02 9.847e-03 1.121 ## Pr(>|t|) ## (Intercept) 1.45e-06 *** ## gdp_g 0.011436 * ## gdp_g_l 0.585406 ## GPCP_g 0.638763 ## GPCP_g_l 0.027882 * ## y_0 0.010172 * ## polity2l 0.127025 ## ethfrac 4.47e-05 *** ## relfrac 0.008363 ** ## Oil 0.845362 ## lpopl1 0.881061 ## lmtnest 0.469024 ## year:country_nameAngola 0.942250 ## year:country_nameBenin 0.250668 ## year:country_nameBotswana 0.260520 ## year:country_nameBurkina Faso 0.419562 ## year:country_nameBurundi 0.003601 ** ## year:country_nameCameroon 0.315773 ## year:country_nameCentral African Republic 0.794020 ## year:country_nameChad 0.809158 ## year:country_nameCongo 0.667215 ## year:country_nameDjibouti 0.128539 ## year:country_nameEthiopia 0.799902 ## year:country_nameGabon 0.230639 ## year:country_nameGambia 0.102895 ## year:country_nameGhana 0.981304 ## 
year:country_nameGuinea 0.212049 ## year:country_nameGuinea-Bissau 0.694642 ## year:country_nameIvory Coast 0.335632 ## year:country_nameKenya 0.349712 ## year:country_nameLesotho 0.005715 ** ## year:country_nameLiberia 0.703479 ## year:country_nameMadagascar 0.008726 ** ## year:country_nameMalawi 0.508872 ## year:country_nameMali 0.119819 ## year:country_nameMauritania 0.000505 *** ## year:country_nameMozambique 0.739436 ## year:country_nameNamibia 0.782099 ## year:country_nameNiger 0.216902 ## year:country_nameNigeria 0.501709 ## year:country_nameRwanda 0.007420 ** ## year:country_nameSenegal 0.103800 ## year:country_nameSierra Leone 0.937016 ## year:country_nameSomalia 4.31e-05 *** ## year:country_nameSouth Africa 0.152185 ## year:country_nameSudan 0.515574 ## year:country_nameSwaziland 0.621312 ## year:country_nameTanzania, United Republic of 0.428587 ## year:country_nameTogo 0.321189 ## year:country_nameUganda 0.370670 ## year:country_nameZaire 0.533638 ## year:country_nameZambia 0.704701 ## year:country_nameZimbabwe 0.262781 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.2992 on 690 degrees of freedom ## Multiple R-squared: 0.576, Adjusted R-squared: 0.5441 ## F-statistic: 18.03 on 52 and 690 DF, p-value: < 2.2e-16 ``` --- ### Encouragement Designs in Experiments - Intent-to-treat analysis - Use the treatment assignment as an instrument, the actual treatment received as the treatment variable, and the outcome as normal. --- ``` r library(readr) peruemotions <- read_csv("https://github.com/jnseawright/PS406/raw/main/data/peruemotions.csv") ``` --- ``` r summary(lm(outsidervote~simpletreat, data=peruemotions)) ``` ``` ## ## Call: ## lm(formula = outsidervote ~ simpletreat, data = peruemotions) ## ## Residuals: ## Min 1Q Median 3Q Max ## -0.6093 -0.4916 0.3907 0.5084 0.5084 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.49164 0.02874 17.104 <2e-16 *** ## simpletreat 0.11763 0.04962 2.371 0.0182 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.497 on 448 degrees of freedom ## Multiple R-squared: 0.01239, Adjusted R-squared: 0.01018 ## F-statistic: 5.62 on 1 and 448 DF, p-value: 0.01818 ``` --- ``` r summary(lm(outsidervote~enojado, data=peruemotions)) ``` ``` ## ## Call: ## lm(formula = outsidervote ~ enojado, data = peruemotions) ## ## Residuals: ## Min 1Q Median 3Q Max ## -0.6905 -0.5147 0.3095 0.4853 0.4853 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.51471 0.02463 20.90 <2e-16 *** ## enojado 0.17577 0.08062 2.18 0.0298 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.4975 on 448 degrees of freedom ## Multiple R-squared: 0.0105, Adjusted R-squared: 0.00829 ## F-statistic: 4.753 on 1 and 448 DF, p-value: 0.02976 ``` --- ``` r summary(ivreg(outsidervote~enojado|simpletreat,data=peruemotions)) ``` ``` ## ## Call: ## ivreg(formula = outsidervote ~ enojado | simpletreat, data = peruemotions) ## ## Residuals: ## Min 1Q Median 3Q Max ## -2.0804 -0.3716 -0.3716 0.6284 0.6284 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.37162 0.09586 3.877 0.000122 *** ## enojado 1.70882 0.96993 1.762 0.078788 . ## ## Diagnostic tests: ## df1 df2 statistic p-value ## Weak instruments 1 448 5.664 0.0177 * ## Wu-Hausman 1 447 4.608 0.0324 * ## Sargan 0 NA NA NA ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1 ## ## Residual standard error: 0.6688 on 448 degrees of freedom ## Multiple R-Squared: -0.7881, Adjusted R-squared: -0.7921 ## Wald test: 3.104 on 1 and 448 DF, p-value: 0.07879 ``` --- ``` r summary(lm(outsidervote~enojado+simpletreat, data=peruemotions)) ``` ``` ## ## Call: ## lm(formula = outsidervote ~ enojado + simpletreat, data = peruemotions) ## ## Residuals: ## Min 1Q Median 3Q Max ## -0.7439 -0.4807 0.3630 0.5193 0.5193 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.48066 0.02921 16.453 <2e-16 *** ## enojado 0.15639 0.08081 1.935 0.0536 . ## simpletreat 0.10687 0.04978 2.147 0.0324 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.4955 on 447 degrees of freedom ## Multiple R-squared: 0.0206, Adjusted R-squared: 0.01621 ## F-statistic: 4.7 on 2 and 447 DF, p-value: 0.009551 ``` --- ### LATE - Let's divide our population into four categories: 1. Compliers: will have high `\(\mathbf{x}\)` whenever `\(\mathbf{z}\)` is high, and low `\(\mathbf{x}\)` whenever `\(\mathbf{z}\)` is low. 2. Defiers: will have low `\(\mathbf{x}\)` whenever `\(\mathbf{z}\)` is high, and high `\(\mathbf{x}\)` whenever `\(\mathbf{z}\)` is low. 3. Always-takers: will have high `\(\mathbf{x}\)` no matter what. 4. Never-takers: will have low `\(\mathbf{x}\)` no matter what. --- ### LATE - The effect of the instrument on the treatment is `\(\%Compliers - \%Defiers\)`. - The effect of the instrument on the outcome, given the exclusion restriction, is (ATE for Compliers) times `\(\%Compliers -\)` (ATE for Defiers) times `\(\%Defiers\)`. --- ### Beyond LATE - Aronow and Carnegie propose estimating ATE by reweighting on the *compliance score*. - The compliance score is the probability that the received treatment is greater, when in the encouragement treatment, than it is in the control. --- ``` r peruemotionstrim <- na.omit(data.frame(enojado=peruemotions$enojado, outsidervote=peruemotions$outsidervote, simpletreat=peruemotions$simpletreat, Cuzco=peruemotions$Cuzco, age=peruemotions$age)) #packageurl <- "http://cran.r-project.org/src/contrib/Archive/icsw/icsw_1.0.0.tar.gz" #install.packages(packageurl, repos=NULL, type="source") library(icsw) ``` --- ``` r exp.reweight <- with(peruemotionstrim, icsw.tsls(D=enojado, Y=outsidervote, Z=simpletreat, X=cbind(1,Cuzco), W=cbind(Cuzco,age), R=100)) ``` --- ``` r exp.reweight$coefficients ``` ``` ## Cuzco D ## 0.35982616 -0.09801187 6.14701458 ``` ``` r exp.reweight$coefs.se.boot ``` ``` ## Cuzco D ## 0.2133571 1.8464919 105.6180688 ``` --- ``` r sidelm <- lm(enojado ~ simpletreat+Cuzco+age, data= peruemotionstrim) summary(sidelm) ``` ``` ## ## Call: ## lm(formula = enojado ~ simpletreat + Cuzco + age, data = peruemotionstrim) ## ## Residuals: ## Min 1Q Median 3Q Max ## -0.17130 -0.10300 -0.09608 -0.04607 1.00316 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.151607 0.042344 3.580 0.000382 *** ## simpletreat 0.068305 0.029371 2.326 0.020501 * ## Cuzco -0.035445 0.029606 -1.197 0.231876 ## age -0.002210 0.001263 -1.750 0.080903 . ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.2893 on 434 degrees of freedom ## Multiple R-squared: 0.02289, Adjusted R-squared: 0.01614 ## F-statistic: 3.39 on 3 and 434 DF, p-value: 0.01804 ``` --- ### When Might 2SLS or IV Be a Good Idea? - If there's a true randomization in the world that you can take advantage of -- but there still might be down-sides. 
- As a Hausmann test, to see if your most important results hold up under some alternative assumptions. - If a reviewer demands it of you. --- ### RDD - An RDD can be analyzed by comparing simple average scores just above and just below the threshold. - Alternatively, a (simple or complex) statistical model may be used to extrapolate from the data just above and just below the threshold. --- <img src="images/Lee1.png" width="90%" style="display: block; margin: auto;" /> --- <img src="images/Lee2.png" width="90%" style="display: block; margin: auto;" /> --- ``` r #This coding example due to Yuta Toyama. demmeans <- split(lmb_data$democrat, cut(lmb_data$lagdemvoteshare, 100)) %>% lapply(mean) %>% unlist() agg_lmb_data <- data.frame(democrat = demmeans, lagdemvoteshare = seq(0.01,1, by = 0.01)) ``` --- ``` r lmb_data <- lmb_data %>% mutate(gg_group = if_else(lagdemvoteshare > 0.5, 1,0)) gg_srd = ggplot(data=lmb_data, aes(lagdemvoteshare, democrat)) + geom_point(aes(x = lagdemvoteshare, y = democrat), data = agg_lmb_data) + xlim(0,1) + ylim(-0.1,1.1) + geom_vline(xintercept = 0.5) + xlab("Democrat Vote Share, time t") + ylab("Probability of Democrat Win, time t+1") + scale_y_continuous(breaks=seq(0,1,0.2)) + ggtitle(TeX("Effect of Initial Win on Winning Next Election: $\\P^D_{t+1} - P^R_{t+1}$")) ``` --- ``` r gg_srd + stat_smooth(aes(lagdemvoteshare, democrat, group = gg_group), method = "lm" , formula = y ~ x + I(x^2)) ``` <img src="5naturalexperiments2_files/figure-html/unnamed-chunk-25-1.png" width="70%" style="display: block; margin: auto;" /> --- ``` r gg_srd + stat_smooth(data=lmb_data %>% filter(lagdemvoteshare>.45 & lagdemvoteshare<.55), aes(lagdemvoteshare, democrat, group = gg_group), method = "lm", formula = y ~ x + I(x^2)) ``` <img src="5naturalexperiments2_files/figure-html/unnamed-chunk-26-1.png" width="90%" style="display: block; margin: auto;" /> --- <img src="5naturalexperiments2_files/figure-html/unnamed-chunk-27-1.png" width="70%" style="display: block; margin: auto;" /> --- ``` r lmb_subset <- lmb_data %>% filter(lagdemvoteshare>.48 & lagdemvoteshare<.52) lm_1 <- lm_robust(score ~ lagdemocrat, data = lmb_subset, se_type = "HC1") summary(lm_1) ``` ``` ## ## Call: ## lm_robust(formula = score ~ lagdemocrat, data = lmb_subset, se_type = "HC1") ## ## Standard error type: HC1 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF ## (Intercept) 31.20 1.334 23.39 3.788e-95 28.58 33.81 913 ## lagdemocrat 21.28 1.951 10.91 3.987e-26 17.45 25.11 913 ## ## Multiple R-squared: 0.1152 , Adjusted R-squared: 0.1142 ## F-statistic: 119 on 1 and 913 DF, p-value: < 2.2e-16 ``` --- ``` ## ## Call: ## lm_robust(formula = score ~ lagdemocrat, data = lmb_subset, se_type = "HC1") ## ## Standard error type: HC1 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF ## (Intercept) 31.20 1.334 23.39 3.788e-95 28.58 33.81 913 ## lagdemocrat 21.28 1.951 10.91 3.987e-26 17.45 25.11 913 ## ## Multiple R-squared: 0.1152 , Adjusted R-squared: 0.1142 ## F-statistic: 119 on 1 and 913 DF, p-value: < 2.2e-16 ``` --- ### RDD Windows - How wide a window above and below the break point? --- ``` r lmb_subset <- lmb_data %>% filter(lagdemvoteshare>.49 & lagdemvoteshare<.51) lm_1 <- lm_robust(score ~ lagdemocrat, data = lmb_subset, se_type = "HC1") summary(lm_1) ``` ``` ## ## Call: ## lm_robust(formula = score ~ lagdemocrat, data = lmb_subset, se_type = "HC1") ## ## Standard error type: HC1 ## ## Coefficients: ## Estimate Std. 
Error t value Pr(>|t|) CI Lower CI Upper DF ## (Intercept) 31.71 1.938 16.358 4.539e-47 27.90 35.52 428 ## lagdemocrat 23.97 2.799 8.564 1.985e-16 18.47 29.47 428 ## ## Multiple R-squared: 0.1453 , Adjusted R-squared: 0.1433 ## F-statistic: 73.34 on 1 and 428 DF, p-value: < 2.2e-16 ``` --- ``` r lmb_subset <- lmb_data %>% filter(lagdemvoteshare>.495 & lagdemvoteshare<.505) lm_1 <- lm_robust(score ~ lagdemocrat, data = lmb_subset, se_type = "HC1") summary(lm_1) ``` ``` ## ## Call: ## lm_robust(formula = score ~ lagdemocrat, data = lmb_subset, se_type = "HC1") ## ## Standard error type: HC1 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF ## (Intercept) 29.55 2.547 11.600 3.574e-24 24.52 34.57 202 ## lagdemocrat 29.13 3.845 7.577 1.250e-12 21.55 36.72 202 ## ## Multiple R-squared: 0.2177 , Adjusted R-squared: 0.2138 ## F-statistic: 57.41 on 1 and 202 DF, p-value: 1.25e-12 ``` --- ### RDD > Irrespective of the manner in which the bandwidth is chosen, one > should always investigate the sensitivity of the inferences to this > choice, for example, by including results for bandwidths twice (or > four times) and half (or a quarter of) the size of the originally > chosen bandwidth. Obviously, such bandwidth choices affect both > estimates and standard errors, but if the results are critically > dependent on a particular bandwidth choice, they are clearly less > credible than if they are robust to such variation in bandwidths. > (Imbens and Lemieux 2008) --- ### RDD - Green, Leong, Kern, Gerber, and Larimer find that an estimate of the optimal bandwidth proposed by Imbens and Kalyanaraman, in conjunction with local linear regression, helps RDD come very close to replicating experimental results. --- ``` r library(rdd) rddik <- RDestimate(score ~ lagdemvoteshare, cutpoint=0.5, data=lmb_data) summary(rddik) ``` ``` ## ## Call: ## RDestimate(formula = score ~ lagdemvoteshare, data = lmb_data, ## cutpoint = 0.5) ## ## Type: ## sharp ## ## Estimates: ## Bandwidth Observations Estimate Std. Error z value Pr(>|z|) ## LATE 0.12883 6072 18.84 1.637 11.511 1.163e-30 ## Half-BW 0.06442 3150 20.48 2.353 8.705 3.190e-18 ## Double-BW 0.25767 10512 22.45 1.162 19.321 3.583e-83 ## ## LATE *** ## Half-BW *** ## Double-BW *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## F-statistics: ## F Num. DoF Denom. DoF p ## LATE 516.1 3 6068 0 ## Half-BW 167.8 3 3146 0 ## Double-BW 1324.0 3 10508 0 ``` --- ``` r library(rdrobust) rdbw2 <- rdbwselect(lmb_data$score, lmb_data$lagdemvoteshare, c=0.5) summary(rdbw2) ``` ``` ## Call: rdbwselect ## ## Number of Obs. 13577 ## BW type mserd ## Kernel Triangular ## VCE method NN ## ## Number of Obs. 5670 7907 ## Order est. (p) 1 1 ## Order bias (q) 2 2 ## Unique Obs. 2878 3279 ## ## ======================================================= ## BW est. (h) BW bias (b) ## Left of c Right of c Left of c Right of c ## ======================================================= ## mserd 0.086 0.086 0.133 0.133 ## ======================================================= ``` --- ``` r lmb_subset <- lmb_data %>% filter(lagdemvoteshare>.41 & lagdemvoteshare<.59) lm_final <- lm_robust(score ~ lagdemocrat, data = lmb_subset, se_type = "HC1") summary(lm_final) ``` ``` ## ## Call: ## lm_robust(formula = score ~ lagdemocrat, data = lmb_subset, se_type = "HC1") ## ## Standard error type: HC1 ## ## Coefficients: ## Estimate Std. 
Error t value Pr(>|t|) CI Lower CI Upper DF ## (Intercept) 28.35 0.5779 49.06 0.000e+00 27.21 29.48 4304 ## lagdemocrat 28.77 0.8610 33.41 7.265e-218 27.08 30.46 4304 ## ## Multiple R-squared: 0.2067 , Adjusted R-squared: 0.2066 ## F-statistic: 1117 on 1 and 4304 DF, p-value: < 2.2e-16 ```