class: center, middle, inverse, title-slide .title[ # 5: Instrumental Variables and RDD ] .subtitle[ ## Quantitative Causal Inference ] .author[ ###
J. Seawright
] .institute[ ###
Northwestern Political Science
] .date[
### May 1, 2025
]

---
class: center, middle

<style type="text/css">
pre {
  max-height: 400px;
  overflow-y: auto;
}
pre[class] {
  max-height: 200px;
}
</style>

---

### Endogeneity in OLS

- `\(E(\mathbf{u} | \mathbf{X}) = 0\)`?
- As you'll recall, `\(E(\hat{\mathbf{\beta}}) = \mathbf{\beta} + (\mathbf{X}^{T} \mathbf{X})^{-1} E(\mathbf{X}^{T} \mathbf{u})\)`. So, if `\(E(\mathbf{X}^{T} \mathbf{u}) = \mathbf{\nu} \neq 0\)`, then `\(E(\hat{\mathbf{\beta}} - \mathbf{\beta}) = (\mathbf{X}^{T} \mathbf{X})^{-1} \mathbf{\nu} \neq 0\)`.

---

### Consequences of Endogeneity

- When `\(\mathbf{X}\)` is endogenous, our estimate `\(\hat{\mathbf{\beta}}\)` will be a mixture of the desired relationship between `\(\mathbf{X}\)` and `\(\mathbf{y}\)` *and* the nuisance relationship between `\(\mathbf{X}\)` and `\(\mathbf{u}\)`.

---

### How Can Endogeneity Arise?

- Omitted explanatory variables
- Measurement error on the right-hand side of the model
- Simultaneity between the right- and left-hand sides of the model
- etc.

---

### What to Do When Endogeneity Is a Problem?

1. Give up.
2. Try to change the model by including all omitted relevant variables.
3. Find an instrument.
4. Find other data.

---

### Instrumental Variables

- Suppose the model is: `\(\mathbf{y} = \mathbf{W} \mathbf{\gamma} + \mathbf{x} \beta + \mathbf{\epsilon}\)`. The `\(\mathbf{W}\)` variables are exogenous, but the `\(\mathbf{x}\)` variable is endogenous.
- Now, assume that there exists a variable `\(\mathbf{z}\)` that *doesn't* belong in the regression model, with the following two characteristics:
- `\(cov(\mathbf{z}, \mathbf{x}) \neq 0\)`
- `\(E(\mathbf{z}^{T} \mathbf{\epsilon}) = 0\)`

---

### Instrumental Variables

If these conditions are met (doesn't belong in the regression, is correlated with `\(\mathbf{x}\)`, and has no connection with `\(\mathbf{\epsilon}\)`), then `\(\mathbf{z}\)` meets the mathematical definition of an *instrument*.

---

### Instrumental Variables

---


``` r
library(dagitty)
hypotheticalinstruments.dag <- dagitty(
  "dag {
  Polarization -> DemocraticErosion
  ElitePower -> DemocraticErosion
  Corruption -> Polarization
  Corruption -> DemocraticErosion
  SocialMedia -> Polarization
  PrimaryElections -> Polarization
  PrimaryElections -> ElitePower
  EconomicInequality -> ElitePower
  EconomicInequality -> Polarization
  Polarization [exposure]
  DemocraticErosion [outcome]}"
)
```

---


``` r
plot( hypotheticalinstruments.dag )
```

<img src="5naturalexperiments2_files/figure-html/unnamed-chunk-3-1.png" width="50%" />

---


``` r
instrumentalVariables(hypotheticalinstruments.dag)
```

```
## EconomicInequality | ElitePower
## PrimaryElections | ElitePower
## SocialMedia
```

---

### Instrumental Variables

- Let's momentarily consider a bivariate regression, `\(\mathbf{y} = \mathbf{x} \beta + \mathbf{\epsilon}\)`, with instrument `\(\mathbf{z}\)`.
- The OLS estimate of `\(\beta\)` is `\((\mathbf{x}^{T}\mathbf{x})^{-1} \mathbf{x}^{T}\mathbf{y}\)`.
- Consider instead the IV estimate of `\(\beta\)`: `\((\mathbf{z}^{T}\mathbf{x})^{-1} \mathbf{z}^{T}\mathbf{y}\)`.
- `\(E(\hat{\beta}_{IV}) = E((\mathbf{z}^{T}\mathbf{x})^{-1} \mathbf{z}^{T}\mathbf{y}) = E((\mathbf{z}^{T}\mathbf{x})^{-1} \mathbf{z}^{T} [\mathbf{x} \beta + \mathbf{\epsilon}]) = E((\mathbf{z}^{T}\mathbf{x})^{-1} \mathbf{z}^{T} \mathbf{x} \beta) + E((\mathbf{z}^{T}\mathbf{x})^{-1} \mathbf{z}^{T} \mathbf{\epsilon}) = \beta + 0\)`
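---

### Instrumental Variables

- A minimal simulation sketch, using made-up data with a true `\(\beta = 2\)` and a confounder `\(u\)` shared by `\(\mathbf{x}\)` and the error, illustrates the contrast between the two slope formulas (all names and numbers below are illustrative):


``` r
# Sketch: compare the OLS and IV slope formulas when x is confounded.
set.seed(406)
n <- 5000
u <- rnorm(n)                  # unobserved confounder
z <- rnorm(n)                  # instrument: shifts x, unrelated to u
x <- 0.6 * z + u + rnorm(n)    # endogenous regressor
y <- 2 * x + u + rnorm(n)      # true beta = 2; u ends up in the error term

sum(x * y) / sum(x * x)        # OLS slope: pulled away from 2 by u
sum(z * y) / sum(z * x)        # IV slope: should land close to 2
```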
---

### Instrumental Variables

- Now let's consider a multivariate regression, `\(\mathbf{y} = \mathbf{X} \mathbf{\beta} + \mathbf{\epsilon}\)`, with some `\(t \leq k\)` of the `\(\mathbf{X}\)` variables endogenous, and with `\(t\)` instruments `\(\mathbf{z}_{1} \ldots \mathbf{z}_{t}\)`.

---

### Instrumental Variables

- The OLS estimate of `\(\mathbf{\beta}\)` is `\((\mathbf{X}^{T}\mathbf{X})^{-1} \mathbf{X}^{T}\mathbf{y}\)`.
- Form the matrix `\(\mathbf{Z}\)`, containing the `\(t\)` instruments, as well as the `\(k - t\)` exogenous elements from `\(\mathbf{X}\)`.
- The IV estimate of `\(\mathbf{\beta}\)` is: `\((\mathbf{Z}^{T}\mathbf{X})^{-1} \mathbf{Z}^{T}\mathbf{y}\)`.

---

### Instrumental Variables

- As in the bivariate situation, given the IV assumptions, the IV estimator eliminates the problem of endogeneity.
- This estimator only works when the number of instruments exactly equals the number of endogenous variables (the just-identified case), so that `\(\mathbf{Z}^{T}\mathbf{X}\)` is square and invertible.

---

### Another Way of Thinking About Instrumental Variables

- Let's partition the independent variables in `\(\mathbf{X}\)` into `\(\mathbf{W}\)`, the `\(k - t\)` exogenous variables in the model of `\(\mathbf{y}\)`, and the `\(t\)` endogenous variables.
- So the `\(\mathbf{Z}\)` matrix is the `\(\mathbf{W}\)` matrix with `\(t\)` extra columns containing the instruments.

---

### Another Way of Thinking About Instrumental Variables

- Suppose we regress each column of the `\(\mathbf{X}\)` matrix on the matrix `\(\mathbf{Z}\)` and form the fitted values.
- `\(\hat{\mathbf{X}} = \mathbf{Z} (\mathbf{Z}^{T} \mathbf{Z})^{-1} \mathbf{Z}^{T} \mathbf{X}\)`
- Now use `\(\hat{\mathbf{X}}\)` in the place of `\(\mathbf{X}\)` in the OLS regression formula.

---

### Another Way of Thinking About Instrumental Variables

$$
`\begin{split}
\hat{\mathbf{\beta}}_{IV} = & (\mathbf{X}^{T}\mathbf{Z} (\mathbf{Z}^{T} \mathbf{Z})^{-1} \mathbf{Z}^{T} \mathbf{Z} (\mathbf{Z}^{T} \mathbf{Z})^{-1} \mathbf{Z}^{T} \mathbf{X})^{-1} \\
 & \mathbf{X}^{T}\mathbf{Z} (\mathbf{Z}^{T} \mathbf{Z})^{-1} \mathbf{Z}^{T} \mathbf{y} = \\
 & (\mathbf{X}^{T}\mathbf{Z} (\mathbf{Z}^{T} \mathbf{Z})^{-1} \mathbf{Z}^{T} \mathbf{X})^{-1} \\
 & \mathbf{X}^{T}\mathbf{Z} (\mathbf{Z}^{T} \mathbf{Z})^{-1} \mathbf{Z}^{T} \mathbf{y} = \\
 & (\mathbf{Z}^{T}\mathbf{X})^{-1} \mathbf{Z}^{T}\mathbf{y}
\end{split}`
$$

- The instrumental variables estimator gives the same coefficient estimates as running an OLS regression with `\(\hat{\mathbf{X}}\)`, the fitted values of `\(\mathbf{X}\)` predicted by `\(\mathbf{Z}\)`, in the place of `\(\mathbf{X}\)`.

---

### Variance in Instrumental Variables

- `\(\hat{\mathbf{X}}\)` is itself an estimate, so the usual OLS standard errors from the second-stage regression misstate the uncertainty: the residuals (and hence `\(\hat{\sigma}^{2}\)`) must be computed with the original `\(\mathbf{X}\)`, not `\(\hat{\mathbf{X}}\)`.
- Instead, the correct estimate of the variance of the coefficient estimates in IV is:
- `\(\hat{V} (\hat{\mathbf{\beta}}_{IV}) = \hat{\sigma}^{2} (\mathbf{Z}^{T} \mathbf{X})^{-1} \mathbf{Z}^{T} \mathbf{Z} (\mathbf{X}^{T} \mathbf{Z})^{-1}\)`

---

### Examples of Proposed Instruments

- Suppose we're interested in the relationship between education and some political variable.
- One proposed instrument for education, due to David Card (1995), is residential proximity to a college or university.
- A second proposed instrument for education, due to Angrist and Krueger (1991) is month of birth. - A third instrument, from Nguyen et al. (2016), involves genetic risk score for years of schooling. --- ### Examples of Proposed Instruments - Suppose our focus is on the relationship between economic performance and civil war in agricultural countries. - Miguel, Satyanath, Sergenti, E. (2004) suggest using rainfall as an instrument for economic performance. --- ``` r library(haven) mss_repdata_1_ <- read_dta("https://github.com/jnseawright/PS406/raw/main/data/mss_repdata%20(1).dta") ``` --- ``` r library(ivreg) migueliv <- ivreg(any_prio ~ gdp_g + gdp_g_l + y_0 + polity2l + ethfrac + relfrac + Oil + lpopl1 + lmtnest | GPCP_g + GPCP_g_l+ y_0 + polity2l + ethfrac + relfrac + Oil + lpopl1 + lmtnest, data=mss_repdata_1_) summary(migueliv) ``` ``` ## ## Call: ## ivreg(formula = any_prio ~ gdp_g + gdp_g_l + y_0 + polity2l + ## ethfrac + relfrac + Oil + lpopl1 + lmtnest | GPCP_g + GPCP_g_l + ## y_0 + polity2l + ethfrac + relfrac + Oil + lpopl1 + lmtnest, ## data = mss_repdata_1_) ## ## Residuals: ## Min 1Q Median 3Q Max ## -1.0098 -0.3114 -0.1342 0.3796 2.0431 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -0.438746 0.137120 -3.200 0.00143 ** ## gdp_g -0.528454 1.517953 -0.348 0.72784 ## gdp_g_l -2.076062 1.781017 -1.166 0.24413 ## y_0 -0.042668 0.020714 -2.060 0.03977 * ## polity2l 0.002769 0.003220 0.860 0.39005 ## ethfrac 0.225661 0.090639 2.490 0.01301 * ## relfrac -0.236262 0.103205 -2.289 0.02235 * ## Oil 0.043934 0.056533 0.777 0.43733 ## lpopl1 0.067683 0.017231 3.928 9.38e-05 *** ## lmtnest 0.077338 0.014966 5.168 3.06e-07 *** ## ## Diagnostic tests: ## df1 df2 statistic p-value ## Weak instruments (gdp_g) 2 733 8.646 0.000194 *** ## Weak instruments (gdp_g_l) 2 733 5.943 0.002752 ** ## Wu-Hausman 2 731 0.744 0.475485 ## Sargan 0 NA NA NA ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.4421 on 733 degrees of freedom ## Multiple R-Squared: 0.01679, Adjusted R-squared: 0.004723 ## Wald test: 10.27 on 9 and 733 DF, p-value: 5.189e-15 ``` --- ``` r library(lmtest) library(sandwich) ``` --- ``` r coeftest(migueliv, vcov = vcovCL(migueliv, cluster = ~country_name)) ``` ``` ## ## t test of coefficients: ## ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -0.4387459 0.3532897 -1.2419 0.21468 ## gdp_g -0.5284537 1.4250511 -0.3708 0.71087 ## gdp_g_l -2.0760619 1.0241329 -2.0271 0.04301 * ## y_0 -0.0426678 0.0483408 -0.8826 0.37772 ## polity2l 0.0027692 0.0044092 0.6281 0.53016 ## ethfrac 0.2256606 0.2757338 0.8184 0.41339 ## relfrac -0.2362620 0.2397070 -0.9856 0.32464 ## Oil 0.0439336 0.2123598 0.2069 0.83616 ## lpopl1 0.0676828 0.0498531 1.3576 0.17499 ## lmtnest 0.0773375 0.0385422 2.0066 0.04516 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` --- ### What's Wrong with Weak Instruments? - For an IV estimate of a regression with only one independent variable and only one instrument, the IV estimator is: `\((\mathbf{z}^{T} \mathbf{x})^{-1} \mathbf{z}^{T} \mathbf{y}\)`, which is the same as `\(cov(\mathbf{z}, \mathbf{y})/cov(\mathbf{z}, \mathbf{x})\)`. --- ### What's Wrong with Weak Instruments? 
- The `\(cov(\mathbf{z}, \mathbf{y})\)` may be thought of as a combination of three components: - the direct effect of `\(\mathbf{z}\)` on `\(\mathbf{y}\)`, - the indirect effect of `\(\mathbf{z}\)` on `\(\mathbf{y}\)` via `\(\mathbf{x}\)`, - and any correlation between `\(\mathbf{z}\)` and `\(\mathbf{u}\)`. --- ### What's Wrong with Weak Instruments? - If `\(cov(\mathbf{z}, \mathbf{x})\)` is big, then a moderate amount of contamination of `\(cov(\mathbf{z}, \mathbf{y})\)` with undesirable information will have only a small effect on the estimate. - If `\(cov(\mathbf{z}, \mathbf{x})\)` is very small, then even a small amount of contamination of `\(cov(\mathbf{z}, \mathbf{y})\)` with undesirable information will lead to serious bias in the estimate. --- ``` r summary(migueliv) ``` ``` ## ## Call: ## ivreg(formula = any_prio ~ gdp_g + gdp_g_l + y_0 + polity2l + ## ethfrac + relfrac + Oil + lpopl1 + lmtnest | GPCP_g + GPCP_g_l + ## y_0 + polity2l + ethfrac + relfrac + Oil + lpopl1 + lmtnest, ## data = mss_repdata_1_) ## ## Residuals: ## Min 1Q Median 3Q Max ## -1.0098 -0.3114 -0.1342 0.3796 2.0431 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -0.438746 0.137120 -3.200 0.00143 ** ## gdp_g -0.528454 1.517953 -0.348 0.72784 ## gdp_g_l -2.076062 1.781017 -1.166 0.24413 ## y_0 -0.042668 0.020714 -2.060 0.03977 * ## polity2l 0.002769 0.003220 0.860 0.39005 ## ethfrac 0.225661 0.090639 2.490 0.01301 * ## relfrac -0.236262 0.103205 -2.289 0.02235 * ## Oil 0.043934 0.056533 0.777 0.43733 ## lpopl1 0.067683 0.017231 3.928 9.38e-05 *** ## lmtnest 0.077338 0.014966 5.168 3.06e-07 *** ## ## Diagnostic tests: ## df1 df2 statistic p-value ## Weak instruments (gdp_g) 2 733 8.646 0.000194 *** ## Weak instruments (gdp_g_l) 2 733 5.943 0.002752 ** ## Wu-Hausman 2 731 0.744 0.475485 ## Sargan 0 NA NA NA ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.4421 on 733 degrees of freedom ## Multiple R-Squared: 0.01679, Adjusted R-squared: 0.004723 ## Wald test: 10.27 on 9 and 733 DF, p-value: 5.189e-15 ``` --- ### What's Wrong with Strong Instruments? - We know that `\(\mathbf{x}\)` and `\(\mathbf{u}\)` are related, or we wouldn't bother with the IV procedure. - If `\(\mathbf{x}\)` and `\(\mathbf{z}\)` are very strongly related, and `\(\mathbf{x}\)` and `\(\mathbf{u}\)` are also substantially related, then `\(\mathbf{z}\)` and `\(\mathbf{u}\)` are almost certainly substantially related, as well. - So a central assumption of IV regression fails. --- ``` r miguellm2 <- lm(any_prio ~ gdp_g + gdp_g_l +GPCP_g + GPCP_g_l + y_0 + polity2l + ethfrac + relfrac + Oil + lpopl1 + lmtnest + year:country_name, data=mss_repdata_1_) summary(miguellm2) ``` ``` ## ## Call: ## lm(formula = any_prio ~ gdp_g + gdp_g_l + GPCP_g + GPCP_g_l + ## y_0 + polity2l + ethfrac + relfrac + Oil + lpopl1 + lmtnest + ## year:country_name, data = mss_repdata_1_) ## ## Residuals: ## Min 1Q Median 3Q Max ## -0.92987 -0.12503 -0.02391 0.08826 1.07344 ## ## Coefficients: ## Estimate Std. 
Error t value ## (Intercept) -1.240e+02 2.551e+01 -4.861 ## gdp_g -4.108e-01 1.620e-01 -2.536 ## gdp_g_l -8.588e-02 1.574e-01 -0.546 ## GPCP_g -2.773e-02 5.904e-02 -0.470 ## GPCP_g_l -1.315e-01 5.968e-02 -2.204 ## y_0 1.216e+01 4.717e+00 2.577 ## polity2l -4.581e-03 2.999e-03 -1.528 ## ethfrac 8.706e+01 2.119e+01 4.108 ## relfrac 7.141e+01 2.700e+01 2.645 ## Oil -2.507e-02 1.285e-01 -0.195 ## lpopl1 5.051e-02 3.375e-01 0.150 ## lmtnest 2.545e+00 3.513e+00 0.724 ## year:country_nameAngola -7.233e-04 9.982e-03 -0.072 ## year:country_nameBenin 1.192e-02 1.037e-02 1.150 ## year:country_nameBotswana 1.147e-02 1.019e-02 1.126 ## year:country_nameBurkina Faso 8.857e-03 1.097e-02 0.808 ## year:country_nameBurundi 3.250e-02 1.112e-02 2.921 ## year:country_nameCameroon -1.018e-02 1.014e-02 -1.004 ## year:country_nameCentral African Republic -2.671e-03 1.023e-02 -0.261 ## year:country_nameChad -2.422e-03 1.002e-02 -0.242 ## year:country_nameCongo 4.287e-03 9.966e-03 0.430 ## year:country_nameDjibouti 1.642e-02 1.079e-02 1.522 ## year:country_nameEthiopia 2.734e-03 1.078e-02 0.254 ## year:country_nameGabon -1.396e-02 1.164e-02 -1.200 ## year:country_nameGambia 1.752e-02 1.073e-02 1.633 ## year:country_nameGhana 2.581e-04 1.101e-02 0.023 ## year:country_nameGuinea 1.332e-02 1.066e-02 1.249 ## year:country_nameGuinea-Bissau 4.076e-03 1.038e-02 0.393 ## year:country_nameIvory Coast -9.641e-03 1.001e-02 -0.964 ## year:country_nameKenya -9.574e-03 1.023e-02 -0.936 ## year:country_nameLesotho 2.928e-02 1.056e-02 2.772 ## year:country_nameLiberia -3.840e-03 1.008e-02 -0.381 ## year:country_nameMadagascar 2.880e-02 1.095e-02 2.630 ## year:country_nameMalawi 6.579e-03 9.954e-03 0.661 ## year:country_nameMali 1.777e-02 1.141e-02 1.557 ## year:country_nameMauritania 4.175e-02 1.195e-02 3.495 ## year:country_nameMozambique 3.336e-03 1.003e-02 0.333 ## year:country_nameNamibia 2.909e-03 1.051e-02 0.277 ## year:country_nameNiger 1.295e-02 1.048e-02 1.236 ## year:country_nameNigeria -6.873e-03 1.022e-02 -0.672 ## year:country_nameRwanda 2.860e-02 1.065e-02 2.685 ## year:country_nameSenegal 1.812e-02 1.112e-02 1.629 ## year:country_nameSierra Leone 7.694e-04 9.732e-03 0.079 ## year:country_nameSomalia 5.019e-02 1.219e-02 4.117 ## year:country_nameSouth Africa -1.563e-02 1.090e-02 -1.433 ## year:country_nameSudan 6.611e-03 1.016e-02 0.651 ## year:country_nameSwaziland 4.874e-03 9.863e-03 0.494 ## year:country_nameTanzania, United Republic of -8.465e-03 1.069e-02 -0.792 ## year:country_nameTogo 1.040e-02 1.048e-02 0.993 ## year:country_nameUganda -9.301e-03 1.038e-02 -0.896 ## year:country_nameZaire -6.586e-03 1.058e-02 -0.623 ## year:country_nameZambia 3.918e-03 1.033e-02 0.379 ## year:country_nameZimbabwe 1.104e-02 9.847e-03 1.121 ## Pr(>|t|) ## (Intercept) 1.45e-06 *** ## gdp_g 0.011436 * ## gdp_g_l 0.585406 ## GPCP_g 0.638763 ## GPCP_g_l 0.027882 * ## y_0 0.010172 * ## polity2l 0.127025 ## ethfrac 4.47e-05 *** ## relfrac 0.008363 ** ## Oil 0.845362 ## lpopl1 0.881061 ## lmtnest 0.469024 ## year:country_nameAngola 0.942250 ## year:country_nameBenin 0.250668 ## year:country_nameBotswana 0.260520 ## year:country_nameBurkina Faso 0.419562 ## year:country_nameBurundi 0.003601 ** ## year:country_nameCameroon 0.315773 ## year:country_nameCentral African Republic 0.794020 ## year:country_nameChad 0.809158 ## year:country_nameCongo 0.667215 ## year:country_nameDjibouti 0.128539 ## year:country_nameEthiopia 0.799902 ## year:country_nameGabon 0.230639 ## year:country_nameGambia 0.102895 ## year:country_nameGhana 0.981304 ## 
year:country_nameGuinea 0.212049 ## year:country_nameGuinea-Bissau 0.694642 ## year:country_nameIvory Coast 0.335632 ## year:country_nameKenya 0.349712 ## year:country_nameLesotho 0.005715 ** ## year:country_nameLiberia 0.703479 ## year:country_nameMadagascar 0.008726 ** ## year:country_nameMalawi 0.508872 ## year:country_nameMali 0.119819 ## year:country_nameMauritania 0.000505 *** ## year:country_nameMozambique 0.739436 ## year:country_nameNamibia 0.782099 ## year:country_nameNiger 0.216902 ## year:country_nameNigeria 0.501709 ## year:country_nameRwanda 0.007420 ** ## year:country_nameSenegal 0.103800 ## year:country_nameSierra Leone 0.937016 ## year:country_nameSomalia 4.31e-05 *** ## year:country_nameSouth Africa 0.152185 ## year:country_nameSudan 0.515574 ## year:country_nameSwaziland 0.621312 ## year:country_nameTanzania, United Republic of 0.428587 ## year:country_nameTogo 0.321189 ## year:country_nameUganda 0.370670 ## year:country_nameZaire 0.533638 ## year:country_nameZambia 0.704701 ## year:country_nameZimbabwe 0.262781 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.2992 on 690 degrees of freedom ## Multiple R-squared: 0.576, Adjusted R-squared: 0.5441 ## F-statistic: 18.03 on 52 and 690 DF, p-value: < 2.2e-16 ``` --- ### Encouragement Designs in Experiments - Intent-to-treat analysis - Use the treatment assignment as an instrument, the actual treatment received as the treatment variable, and the outcome as normal. --- ``` r library(readr) peruemotions <- read_csv("https://github.com/jnseawright/PS406/raw/main/data/peruemotions.csv") ``` --- ``` r summary(lm(outsidervote~simpletreat, data=peruemotions)) ``` ``` ## ## Call: ## lm(formula = outsidervote ~ simpletreat, data = peruemotions) ## ## Residuals: ## Min 1Q Median 3Q Max ## -0.6093 -0.4916 0.3907 0.5084 0.5084 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.49164 0.02874 17.104 <2e-16 *** ## simpletreat 0.11763 0.04962 2.371 0.0182 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.497 on 448 degrees of freedom ## Multiple R-squared: 0.01239, Adjusted R-squared: 0.01018 ## F-statistic: 5.62 on 1 and 448 DF, p-value: 0.01818 ``` --- ``` r summary(lm(outsidervote~enojado, data=peruemotions)) ``` ``` ## ## Call: ## lm(formula = outsidervote ~ enojado, data = peruemotions) ## ## Residuals: ## Min 1Q Median 3Q Max ## -0.6905 -0.5147 0.3095 0.4853 0.4853 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.51471 0.02463 20.90 <2e-16 *** ## enojado 0.17577 0.08062 2.18 0.0298 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.4975 on 448 degrees of freedom ## Multiple R-squared: 0.0105, Adjusted R-squared: 0.00829 ## F-statistic: 4.753 on 1 and 448 DF, p-value: 0.02976 ``` --- ``` r summary(ivreg(outsidervote~enojado|simpletreat,data=peruemotions)) ``` ``` ## ## Call: ## ivreg(formula = outsidervote ~ enojado | simpletreat, data = peruemotions) ## ## Residuals: ## Min 1Q Median 3Q Max ## -2.0804 -0.3716 -0.3716 0.6284 0.6284 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.37162 0.09586 3.877 0.000122 *** ## enojado 1.70882 0.96993 1.762 0.078788 . ## ## Diagnostic tests: ## df1 df2 statistic p-value ## Weak instruments 1 448 5.664 0.0177 * ## Wu-Hausman 1 447 4.608 0.0324 * ## Sargan 0 NA NA NA ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1 ## ## Residual standard error: 0.6688 on 448 degrees of freedom ## Multiple R-Squared: -0.7881, Adjusted R-squared: -0.7921 ## Wald test: 3.104 on 1 and 448 DF, p-value: 0.07879 ``` --- ``` r summary(lm(outsidervote~enojado+simpletreat, data=peruemotions)) ``` ``` ## ## Call: ## lm(formula = outsidervote ~ enojado + simpletreat, data = peruemotions) ## ## Residuals: ## Min 1Q Median 3Q Max ## -0.7439 -0.4807 0.3630 0.5193 0.5193 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.48066 0.02921 16.453 <2e-16 *** ## enojado 0.15639 0.08081 1.935 0.0536 . ## simpletreat 0.10687 0.04978 2.147 0.0324 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.4955 on 447 degrees of freedom ## Multiple R-squared: 0.0206, Adjusted R-squared: 0.01621 ## F-statistic: 4.7 on 2 and 447 DF, p-value: 0.009551 ``` --- ### LATE - Let's divide our population into four categories: 1. Compliers: will have high `\(\mathbf{x}\)` whenever `\(\mathbf{z}\)` is high, and low `\(\mathbf{x}\)` whenever `\(\mathbf{z}\)` is low. 2. Defiers: will have low `\(\mathbf{x}\)` whenever `\(\mathbf{z}\)` is high, and high `\(\mathbf{x}\)` whenever `\(\mathbf{z}\)` is low. 3. Always-takers: will have high `\(\mathbf{x}\)` no matter what. 4. Never-takers: will have low `\(\mathbf{x}\)` no matter what. --- ### LATE - The effect of the instrument on the treatment is `\(\%Compliers - \%Defiers\)`. - The effect of the instrument on the outcome, given the exclusion restriction, is (ATE for Compliers) times `\(\%Compliers -\)` (ATE for Defiers) times `\(\%Defiers\)`. --- ### Beyond LATE - Aronow and Carnegie propose estimating ATE by reweighting on the *compliance score*. - The compliance score is the probability that the received treatment is greater, when in the encouragement treatment, than it is in the control. --- ``` r peruemotionstrim <- na.omit(data.frame(enojado=peruemotions$enojado, outsidervote=peruemotions$outsidervote, simpletreat=peruemotions$simpletreat, Cuzco=peruemotions$Cuzco, age=peruemotions$age)) #packageurl <- "http://cran.r-project.org/src/contrib/Archive/icsw/icsw_1.0.0.tar.gz" #install.packages(packageurl, repos=NULL, type="source") library(icsw) ``` --- ``` r exp.reweight <- with(peruemotionstrim, icsw.tsls(D=enojado, Y=outsidervote, Z=simpletreat, X=cbind(1,Cuzco), W=cbind(Cuzco,age), R=100)) ``` --- ``` r exp.reweight$coefficients ``` ``` ## Cuzco D ## 0.35982616 -0.09801187 6.14701458 ``` ``` r exp.reweight$coefs.se.boot ``` ``` ## Cuzco D ## 0.2133571 1.8464919 105.6180688 ``` --- ``` r sidelm <- lm(enojado ~ simpletreat+Cuzco+age, data= peruemotionstrim) summary(sidelm) ``` ``` ## ## Call: ## lm(formula = enojado ~ simpletreat + Cuzco + age, data = peruemotionstrim) ## ## Residuals: ## Min 1Q Median 3Q Max ## -0.17130 -0.10300 -0.09608 -0.04607 1.00316 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.151607 0.042344 3.580 0.000382 *** ## simpletreat 0.068305 0.029371 2.326 0.020501 * ## Cuzco -0.035445 0.029606 -1.197 0.231876 ## age -0.002210 0.001263 -1.750 0.080903 . ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.2893 on 434 degrees of freedom ## Multiple R-squared: 0.02289, Adjusted R-squared: 0.01614 ## F-statistic: 3.39 on 3 and 434 DF, p-value: 0.01804 ``` --- ### When Might 2SLS or IV Be a Good Idea? - If there's a true randomization in the world that you can take advantage of -- but there still might be down-sides. 
- As a Hausmann test, to see if your most important results hold up under some alternative assumptions. - If a reviewer demands it of you. --- ### RDD - An RDD can be analyzed by comparing simple average scores just above and just below the threshold. - Alternatively, a (simple or complex) statistical model may be used to extrapolate from the data just above and just below the threshold. --- <img src="images/Lee1.png" width="90%" style="display: block; margin: auto;" /> --- <img src="images/Lee2.png" width="90%" style="display: block; margin: auto;" /> --- ``` r #This coding example due to Yuta Toyama. demmeans <- split(lmb_data$democrat, cut(lmb_data$lagdemvoteshare, 100)) %>% lapply(mean) %>% unlist() agg_lmb_data <- data.frame(democrat = demmeans, lagdemvoteshare = seq(0.01,1, by = 0.01)) ``` --- ``` r lmb_data <- lmb_data %>% mutate(gg_group = if_else(lagdemvoteshare > 0.5, 1,0)) gg_srd = ggplot(data=lmb_data, aes(lagdemvoteshare, democrat)) + geom_point(aes(x = lagdemvoteshare, y = democrat), data = agg_lmb_data) + xlim(0,1) + ylim(-0.1,1.1) + geom_vline(xintercept = 0.5) + xlab("Democrat Vote Share, time t") + ylab("Probability of Democrat Win, time t+1") + scale_y_continuous(breaks=seq(0,1,0.2)) + ggtitle(TeX("Effect of Initial Win on Winning Next Election: $\\P^D_{t+1} - P^R_{t+1}$")) ``` --- ``` r gg_srd + stat_smooth(aes(lagdemvoteshare, democrat, group = gg_group), method = "lm" , formula = y ~ x + I(x^2)) ``` <img src="5naturalexperiments2_files/figure-html/unnamed-chunk-25-1.png" width="70%" style="display: block; margin: auto;" /> --- ``` r gg_srd + stat_smooth(data=lmb_data %>% filter(lagdemvoteshare>.45 & lagdemvoteshare<.55), aes(lagdemvoteshare, democrat, group = gg_group), method = "lm", formula = y ~ x + I(x^2)) ``` <img src="5naturalexperiments2_files/figure-html/unnamed-chunk-26-1.png" width="90%" style="display: block; margin: auto;" /> --- <img src="5naturalexperiments2_files/figure-html/unnamed-chunk-27-1.png" width="70%" style="display: block; margin: auto;" /> --- ``` r lmb_subset <- lmb_data %>% filter(lagdemvoteshare>.48 & lagdemvoteshare<.52) lm_1 <- lm_robust(score ~ lagdemocrat, data = lmb_subset, se_type = "HC1") summary(lm_1) ``` ``` ## ## Call: ## lm_robust(formula = score ~ lagdemocrat, data = lmb_subset, se_type = "HC1") ## ## Standard error type: HC1 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF ## (Intercept) 31.20 1.334 23.39 3.788e-95 28.58 33.81 913 ## lagdemocrat 21.28 1.951 10.91 3.987e-26 17.45 25.11 913 ## ## Multiple R-squared: 0.1152 , Adjusted R-squared: 0.1142 ## F-statistic: 119 on 1 and 913 DF, p-value: < 2.2e-16 ``` --- ``` ## ## Call: ## lm_robust(formula = score ~ lagdemocrat, data = lmb_subset, se_type = "HC1") ## ## Standard error type: HC1 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF ## (Intercept) 31.20 1.334 23.39 3.788e-95 28.58 33.81 913 ## lagdemocrat 21.28 1.951 10.91 3.987e-26 17.45 25.11 913 ## ## Multiple R-squared: 0.1152 , Adjusted R-squared: 0.1142 ## F-statistic: 119 on 1 and 913 DF, p-value: < 2.2e-16 ``` --- ### RDD Windows - How wide a window above and below the break point? --- ``` r lmb_subset <- lmb_data %>% filter(lagdemvoteshare>.49 & lagdemvoteshare<.51) lm_1 <- lm_robust(score ~ lagdemocrat, data = lmb_subset, se_type = "HC1") summary(lm_1) ``` ``` ## ## Call: ## lm_robust(formula = score ~ lagdemocrat, data = lmb_subset, se_type = "HC1") ## ## Standard error type: HC1 ## ## Coefficients: ## Estimate Std. 
Error t value Pr(>|t|) CI Lower CI Upper DF ## (Intercept) 31.71 1.938 16.358 4.539e-47 27.90 35.52 428 ## lagdemocrat 23.97 2.799 8.564 1.985e-16 18.47 29.47 428 ## ## Multiple R-squared: 0.1453 , Adjusted R-squared: 0.1433 ## F-statistic: 73.34 on 1 and 428 DF, p-value: < 2.2e-16 ``` --- ``` r lmb_subset <- lmb_data %>% filter(lagdemvoteshare>.495 & lagdemvoteshare<.505) lm_1 <- lm_robust(score ~ lagdemocrat, data = lmb_subset, se_type = "HC1") summary(lm_1) ``` ``` ## ## Call: ## lm_robust(formula = score ~ lagdemocrat, data = lmb_subset, se_type = "HC1") ## ## Standard error type: HC1 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF ## (Intercept) 29.55 2.547 11.600 3.574e-24 24.52 34.57 202 ## lagdemocrat 29.13 3.845 7.577 1.250e-12 21.55 36.72 202 ## ## Multiple R-squared: 0.2177 , Adjusted R-squared: 0.2138 ## F-statistic: 57.41 on 1 and 202 DF, p-value: 1.25e-12 ``` --- ### RDD > Irrespective of the manner in which the bandwidth is chosen, one > should always investigate the sensitivity of the inferences to this > choice, for example, by including results for bandwidths twice (or > four times) and half (or a quarter of) the size of the originally > chosen bandwidth. Obviously, such bandwidth choices affect both > estimates and standard errors, but if the results are critically > dependent on a particular bandwidth choice, they are clearly less > credible than if they are robust to such variation in bandwidths. > (Imbens and Lemieux 2008) --- ### RDD - Green, Leong, Kern, Gerber, and Larimer find that an estimate of the optimal bandwidth proposed by Imbens and Kalyanaraman, in conjunction with local linear regression, helps RDD come very close to replicating experimental results. --- ``` r library(rdd) rddik <- RDestimate(score ~ lagdemvoteshare, cutpoint=0.5, data=lmb_data) summary(rddik) ``` ``` ## ## Call: ## RDestimate(formula = score ~ lagdemvoteshare, data = lmb_data, ## cutpoint = 0.5) ## ## Type: ## sharp ## ## Estimates: ## Bandwidth Observations Estimate Std. Error z value Pr(>|z|) ## LATE 0.12883 6072 18.84 1.637 11.511 1.163e-30 ## Half-BW 0.06442 3150 20.48 2.353 8.705 3.190e-18 ## Double-BW 0.25767 10512 22.45 1.162 19.321 3.583e-83 ## ## LATE *** ## Half-BW *** ## Double-BW *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## F-statistics: ## F Num. DoF Denom. DoF p ## LATE 516.1 3 6068 0 ## Half-BW 167.8 3 3146 0 ## Double-BW 1324.0 3 10508 0 ``` --- ``` r library(rdrobust) rdbw2 <- rdbwselect(lmb_data$score, lmb_data$lagdemvoteshare, c=0.5) summary(rdbw2) ``` ``` ## Call: rdbwselect ## ## Number of Obs. 13577 ## BW type mserd ## Kernel Triangular ## VCE method NN ## ## Number of Obs. 5670 7907 ## Order est. (p) 1 1 ## Order bias (q) 2 2 ## Unique Obs. 2878 3279 ## ## ======================================================= ## BW est. (h) BW bias (b) ## Left of c Right of c Left of c Right of c ## ======================================================= ## mserd 0.086 0.086 0.133 0.133 ## ======================================================= ``` --- ``` r lmb_subset <- lmb_data %>% filter(lagdemvoteshare>.41 & lagdemvoteshare<.59) lm_final <- lm_robust(score ~ lagdemocrat, data = lmb_subset, se_type = "HC1") summary(lm_final) ``` ``` ## ## Call: ## lm_robust(formula = score ~ lagdemocrat, data = lmb_subset, se_type = "HC1") ## ## Standard error type: HC1 ## ## Coefficients: ## Estimate Std. 
Error t value Pr(>|t|) CI Lower CI Upper DF ## (Intercept) 28.35 0.5779 49.06 0.000e+00 27.21 29.48 4304 ## lagdemocrat 28.77 0.8610 33.41 7.265e-218 27.08 30.46 4304 ## ## Multiple R-squared: 0.2067 , Adjusted R-squared: 0.2066 ## F-statistic: 1117 on 1 and 4304 DF, p-value: < 2.2e-16 ```