12: Multiple Comparisons.

class: center, middle, inverse, title-slide

.title[
# 12: Multiple Comparisons.
]
.subtitle[
## Linear Models
]
.author[
### <large>Jaye Seawright</large>
]
.institute[
### <small>Northwestern Political Science</small>
]
.date[
### Feb. 16, 2026
]

---

class: center, middle


Last time, we saw that researchers test regression results by rejecting `$H_{0}$` when the probability associated with it is very low. By convention, there are thresholds at 0.05, 0.01, and 0.001 about which people get emotional.

---

Consider the probability of getting a heads on a coin toss.

Now consider [this](https://www.youtube.com/watch?v=gOwLEVQGbrM).

---

``` r
set.seed(42)  # For reproducibility
single_experiment <- rbinom(17, size = 1, prob = 0.5)
cat("Number of heads in one experiment:", sum(single_experiment), "\n")
```

```
## Number of heads in one experiment: 12
```

``` r
cat("Probability of 17 heads: ", 0.5^17, " (1 in", round(1/0.5^17), ")\n")
```

```
## Probability of 17 heads:  7.629395e-06  (1 in 131072 )
```

---

All right, but what if I flip the coin as many times as I want and count streaks after the fact?

``` r
set.seed(123)
found_perfect_streak <- FALSE
experiments_run <- 0

while(!found_perfect_streak) {
  experiments_run <- experiments_run + 1
  current_experiment <- rbinom(17, size = 1, prob = 0.5)
  if(sum(current_experiment) == 17) {
    found_perfect_streak <- TRUE
    cat("Found 17 heads after", experiments_run, "experiments\n")
    cat("That's", experiments_run * 17, "total coin flips\n")
  }
}
```

```
## Found 17 heads after 57621 experiments
## That's 979557 total coin flips
```

---

``` r
set.seed(42)
n_experiments <- 1000
results <- replicate(n_experiments, sum(rbinom(17, 1, 0.5)))

max_heads <- max(results)
cat("In", n_experiments, "experiments, the maximum heads was:", max_heads, "\n")
```

```
## In 1000 experiments, the maximum heads was: 15
```

``` r
cat("Probability of getting at least", max_heads, "heads by chance:",
    mean(results >= max_heads), "\n")
```

```
## Probability of getting at least 15 heads by chance: 0.002
```
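
---

The 0.002 above is the share of simulated experiments that reached the maximum. A short sketch of the exact calculation, assuming a fair coin and independent experiments (the names `p_single` and `p_any` are illustrative):

``` r
# Exact chance that one 17-flip experiment shows 15 or more heads:
# P(X >= 15) = P(X > 14) for X ~ Binomial(17, 0.5)
p_single <- pbinom(14, size = 17, prob = 0.5, lower.tail = FALSE)

# Chance that at least one of 1000 independent experiments does so
p_any <- 1 - (1 - p_single)^1000

cat("Single experiment:", signif(p_single, 3), "\n")
cat("At least one in 1000 experiments:", round(p_any, 3), "\n")
```

The result is rare for any one experiment, but fairly likely to show up somewhere among 1000 of them.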

---
### The Multiple Comparisons Problem

Every significance test we carry out is [a chance to falsely conclude that a relationship is meaningful when it is really just noise](https://xkcd.com/882/).

---

If we aren't careful, most times that we pay attention to multiple significance tests related to the same data and the same problem, we are at risk of finding false precision in our results.

---
### Why This Happens

Remember that we noted 0.05 as a probability level in significance tests that leads to excitement. This is often written as `$\alpha = 0.05$`. If our significance test has been done well, meeting the assumptions we talked about last time, then when the population regression slope is actually 0, the probability of getting a p-value of 0.05 or less on a single test is 0.05.

So this won't happen very often, and if we have a single result below this level, we can treat it seriously with meaningful confidence.

---

However, suppose that we carry out 20 independent significance tests --- including 20 unrelated independent variables in our regression, for example. Now, we can think of each significance test as being like a coin flip in our example above. The chance of finding at least one heads --- at least one value randomly below 0.05 even though the population regression slope is zero --- is going to be much higher than 0.05:

``` r
1 - (0.95)^20
```

```
## [1] 0.6415141
```
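
---

A simulation sketch of the same idea in a regression setting. The sample size, number of simulations, and variable names are assumptions chosen for illustration. The outcome is pure noise, so every "significant" slope is a false positive:

``` r
set.seed(2026)
n_obs <- 200         # observations per simulated dataset (illustrative)
n_predictors <- 20   # unrelated independent variables
n_sims <- 500        # simulated datasets

any_false_positive <- replicate(n_sims, {
  x <- matrix(rnorm(n_obs * n_predictors), nrow = n_obs)
  y <- rnorm(n_obs)  # the outcome is unrelated to every predictor
  # p-values for the 20 slopes (drop the intercept row)
  p <- summary(lm(y ~ x))$coefficients[-1, "Pr(>|t|)"]
  any(p <= 0.05)
})

# Share of datasets with at least one "significant" slope
fwer_hat <- mean(any_false_positive)
cat("Estimated chance of at least one false positive:", fwer_hat, "\n")
```

With independent predictors, the estimate should land near the value computed above.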

---

We have two main measures of failure when carrying out multiple comparisons:

1. The *family-wise error rate*, which is the probability of incorrectly rejecting even one null hypothesis.

2. The *false discovery rate*, which is the expected proportion of false discoveries among all discoveries (i.e., findings for which we reject the null hypothesis).

---
### Family-Wise Error Rate

`$$\text{FWER} = \text{Pr}(\text{at least one false positive})$$`

Intuition: “What’s the chance that any of my ‘significant’ findings is just noise?”

Use case: When you need strong control over false positives across all tests (e.g., confirmatory studies, many RCTs).
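
---

A brief sketch of how quickly the uncorrected FWER grows with the number of tests, assuming independent tests with every null hypothesis true (the grid of values for `m` is illustrative):

``` r
alpha <- 0.05
m <- c(1, 5, 10, 20, 50, 100)   # number of independent tests

# Pr(at least one false positive) = 1 - Pr(no false positives)
fwer <- 1 - (1 - alpha)^m

print(round(setNames(fwer, m), 3))
```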

---
### False Discovery Rate

`$$\text{FDR} = \text{E}\left(\frac{\text{False Positives}}{\text{All Positives}}\right)$$`
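
---

A simulation sketch of this expectation when the truth is known. The effect size of 3, the 50/50 split of true and false nulls, and all variable names are assumptions made for illustration. By the usual convention, the proportion is counted as 0 when nothing is rejected:

``` r
set.seed(1)
n_sims <- 500
m <- 100
is_null <- rep(c(TRUE, FALSE), each = 50)   # first 50 nulls are true
effects <- ifelse(is_null, 0, 3)            # illustrative effect size

# False discovery proportion in each simulated set of uncorrected tests
fdp <- replicate(n_sims, {
  p <- 2 * pnorm(-abs(rnorm(m, mean = effects)))
  rejected <- p <= 0.05
  if (sum(rejected) == 0) 0 else sum(rejected & is_null) / sum(rejected)
})

# The FDR is the average of this proportion across repetitions
cat("Estimated FDR without correction:", round(mean(fdp), 3), "\n")
```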

---
### Bonferroni

The first and simplest way to control the family-wise error rate is the Bonferroni correction: just divide `$\alpha$` by the number of tests you're carrying out.

Suppose our target significance level is `$\alpha$` and we're carrying out `$m$` total tests. Then the Bonferroni correction is to set our actual target significance level for each individual test at `$\frac{\alpha}{m}$` but interpret the results as still implying a family-wise error rate of no more than `$\alpha$`.

---

To see why, consider our example of 20 independent tests from earlier. We want `$\alpha = 0.05$`, so our Bonferroni-corrected target will be `$\frac{0.05}{20} = 0.0025$`.

``` r
1 - (1 - 0.0025)^20
```

```
## [1] 0.04883012
```
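
---

Comparing each p-value to `$\frac{\alpha}{m}$` is the same as multiplying each p-value by `$m$` (capped at 1) and comparing it to `$\alpha$`. A small sketch using base R's `p.adjust()`, with three p-values chosen for illustration:

``` r
p_vals <- c(0.004, 0.020, 0.122)   # illustrative p-values
alpha <- 0.05
m <- length(p_vals)

# Bonferroni by hand: scale each p-value by the number of tests, cap at 1
bonf_manual <- pmin(1, p_vals * m)

# The same adjustment from base R
bonf_adjusted <- p.adjust(p_vals, method = "bonferroni")

cat("Bonferroni-adjusted p-values:", bonf_adjusted, "\n")
cat("Significant under Bonferroni:", bonf_adjusted <= alpha, "\n")
```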

---

The Bonferroni bound holds whether or not our tests are independent. When the tests are positively correlated, though, the correction will typically be more conservative than it needs to be.
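
---

A simulation sketch of this point, under assumptions made here for illustration: 20 test statistics that are standard normal under the null and share a common correlation `rho`. The helper `fwer_bonf()` is hypothetical and not part of any package.

``` r
set.seed(2026)
m <- 20
alpha <- 0.05
n_sims <- 5000

# Estimated FWER of Bonferroni when all nulls are true and the
# m test statistics are equicorrelated with correlation rho
fwer_bonf <- function(rho) {
  mean(replicate(n_sims, {
    shared <- rnorm(1)   # component common to every test
    z <- sqrt(rho) * shared + sqrt(1 - rho) * rnorm(m)
    p <- 2 * pnorm(-abs(z))
    any(p <= alpha / m)
  }))
}

fwer_estimates <- c(independent = fwer_bonf(0), correlated = fwer_bonf(0.8))
print(fwer_estimates)
```

Under independence the estimate should sit just below `$\alpha$`. With strongly correlated tests it should fall well below `$\alpha$`, which is the sense in which the correction gives up power.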

---
### Holm Correction: A Stepwise Improvement

Bonferroni is conservative because it holds every test to the same strict threshold, `$\frac{\alpha}{m}$`. Holm's step-down procedure is more powerful while still controlling FWER:

1. Order p-values: `$p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(m)}$`
2. Starting with `$k = 1$`, compare `$p_{(k)}$` to `$\frac{\alpha}{m+1-k}$`
3. Reject hypotheses in order until the first `$p_{(k)}$` that exceeds its threshold; retain that hypothesis and all later ones

---

``` r
# Example: Three tests with p-values 0.004, 0.020, 0.122
p_vals <- c(0.004, 0.020, 0.122)
alpha <- 0.05
m <- length(p_vals)

# Holm correction manually
sorted_p <- sort(p_vals)
holm_thresholds <- alpha/(m + 1 - 1:m)

cat("Sorted p-values:", sorted_p, "\n")
```

```
## Sorted p-values: 0.004 0.02 0.122
```

``` r
cat("Holm thresholds:", round(holm_thresholds, 3), "\n")
```

```
## Holm thresholds: 0.017 0.025 0.05
```

``` r
# Step-down: a rejection counts only if every smaller p-value was also rejected
cat("Significant under Holm:", cumprod(sorted_p <= holm_thresholds) == 1, "\n")
```

```
## Significant under Holm: TRUE TRUE FALSE
```

---

``` r
# Using p.adjust
holm_adjusted <- p.adjust(p_vals, method = "holm")
cat("Holm-adjusted p-values:", round(holm_adjusted, 3), "\n")
```

```
## Holm-adjusted p-values: 0.012 0.04 0.122
```
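
---

The stopping rule matters once a larger p-value passes its own threshold after a smaller one has failed. A sketch with four illustrative p-values, where `holm_reject()` is a hypothetical helper written for this example:

``` r
# Holm step-down decisions, returned in the original order of p
holm_reject <- function(p, alpha = 0.05) {
  m <- length(p)
  ord <- order(p)
  passes <- p[ord] <= alpha / (m + 1 - seq_len(m))
  keep <- cumprod(passes) == 1   # stop at the first non-rejection
  rejected <- logical(m)
  rejected[ord] <- keep
  rejected
}

p_example <- c(0.010, 0.040, 0.030, 0.005)   # illustrative p-values

# 0.040 is below its own threshold (0.05), but 0.030 fails first,
# so the step-down rule stops before reaching it
holm_manual <- holm_reject(p_example)
holm_builtin <- p.adjust(p_example, method = "holm") <= 0.05

cat("Manual step-down:", holm_manual, "\n")
cat("Via p.adjust:    ", holm_builtin, "\n")
```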

---

### False Discovery Rate: Benjamini-Hochberg

While FWER methods control the chance of *any* false positive, FDR methods control the expected *proportion* of false positives among discoveries. They are less conservative, which helps when many of the tested effects are expected to be real.

---

**Benjamini-Hochberg Procedure:**
1. Order p-values: `$p_{(1)} \leq \cdots \leq p_{(m)}$`
2. Find the largest `$k$` where `$p_{(k)} \leq \frac{k}{m}\alpha$`
3. Reject all hypotheses 1 through `$k$`

---

``` r
# Same three p-values
p_vals <- c(0.004, 0.020, 0.122)
m <- length(p_vals)

# BH thresholds
bh_thresholds <- (1:m)/m * alpha

cat("BH thresholds:", round(bh_thresholds, 3), "\n")
```

```
## BH thresholds: 0.017 0.033 0.05
```

``` r
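# Step-up: reject every hypothesis up to the largest k with p_(k) <= (k/m) * alpha
cat("Significant under BH:", rev(cumsum(rev(sort(p_vals) <= bh_thresholds)) > 0), "\n")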
```

```
## Significant under BH: TRUE TRUE FALSE
```

``` r
# Using p.adjust
bh_adjusted <- p.adjust(p_vals, method = "BH")
cat("BH-adjusted p-values:", round(bh_adjusted, 3), "\n")
```

```
## BH-adjusted p-values: 0.012 0.03 0.122
```
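
---

Because Benjamini-Hochberg is a step-up rule, a hypothesis can be rejected even when its own p-value misses its own threshold, as long as a larger p-value passes. A sketch with four illustrative p-values, where `bh_reject()` is a hypothetical helper written for this example:

``` r
# Benjamini-Hochberg step-up decisions, returned in the original order of p
bh_reject <- function(p, alpha = 0.05) {
  m <- length(p)
  ord <- order(p)
  passes <- p[ord] <= seq_len(m) / m * alpha
  k <- if (any(passes)) max(which(passes)) else 0   # largest passing rank
  rejected <- logical(m)
  rejected[ord[seq_len(k)]] <- TRUE
  rejected
}

p_example <- c(0.040, 0.030, 0.001, 0.045)   # illustrative p-values

# Sorted, 0.030 and 0.040 miss their thresholds (0.025 and 0.0375),
# but 0.045 <= 0.05 at k = 4, so all four hypotheses are rejected
bh_manual <- bh_reject(p_example)
bh_builtin <- p.adjust(p_example, method = "BH") <= 0.05

cat("Manual step-up:", bh_manual, "\n")
cat("Via p.adjust:  ", bh_builtin, "\n")
```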

---

### Comparing Correction Methods

``` r
set.seed(8747)
# Simulate 100 tests: 50 null true (mean=0), 50 false (mean=0.5)
n_tests <- 100
effects <- c(rep(0, 50), rep(0.5, 50))
p_vals <- 2*pnorm(-abs(rnorm(n_tests, mean = effects)))

# Apply corrections
corrections <- list(
  "None" = p_vals,
  "Bonferroni" = p.adjust(p_vals, "bonferroni"),
  "Holm" = p.adjust(p_vals, "holm"),
  "BH" = p.adjust(p_vals, "BH")
)

# Count significant at alpha=0.05
sig_counts <- sapply(corrections, function(p) sum(p <= 0.05))
```

---

``` r
barplot(sig_counts, 
        main = "Significant Tests Under Different Corrections",
        ylab = "Number Significant (α=0.05)",
        col = c("#1c5253", "#d3bccc", "#b3dee2", "#f0b7b3"),
        ylim = c(0, max(sig_counts) + 1))
```

![](MultipleComparisons_files/figure-html/compare-corrections-2-1.png)
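
---

Since the simulation knows which nulls are true, the counts can be split into true and false discoveries. A self-contained sketch that reuses the simulation settings above (the names `is_null`, `methods`, and `discoveries` are introduced here for illustration):

``` r
set.seed(8747)
n_tests <- 100
effects <- c(rep(0, 50), rep(0.5, 50))
is_null <- effects == 0
p_vals <- 2 * pnorm(-abs(rnorm(n_tests, mean = effects)))

# Display name = method string understood by p.adjust()
methods <- c(None = "none", Bonferroni = "bonferroni", Holm = "holm", BH = "BH")

# For each method, count rejections of false nulls and of true nulls
discoveries <- sapply(methods, function(method) {
  rejected <- p.adjust(p_vals, method = method) <= 0.05
  c(true = sum(rejected & !is_null), false = sum(rejected & is_null))
})

print(discoveries)
```

With an effect size this small, power is low for every method, so the corrections mainly remove false discoveries at the price of missing most real effects.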

---

### When to Use Which Correction?

**FWER methods (Bonferroni, Holm):**
- Confirmatory research
- Small number of tests
- Severe consequences of false positives
- Clinical trials, policy evaluations

**FDR methods (Benjamini-Hochberg):**
- Exploratory research
- Many tests (e.g., genomics, large-scale surveys)
- Willing to accept some false positives to find more true effects

---

### Indexing: Reducing Multiple Outcomes

When you measure many related outcomes, combine them into an index:

**Mean Effects Index (Kling, Liebman, & Katz, 2004):**
1. Reorient outcomes so higher = better
2. Standardize each outcome by control group: `$z_{ik} = \frac{y_{ik} - \bar{y}_k^{\text{control}}}{SD_k^{\text{control}}}$`
3. Average z-scores for each unit
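
---

The steps above can be sketched on simulated data. Everything here is hypothetical: the outcome names, the sample size, and the common treatment effect of 0.2 are assumptions for illustration, and the outcomes are generated already oriented so that higher is better.

``` r
set.seed(12)
n <- 200
treat <- rep(c(0, 1), each = n / 2)

# Three related outcomes, each with a small treatment effect
outcomes <- cbind(
  income = 0.2 * treat + rnorm(n),
  health = 0.2 * treat + rnorm(n),
  school = 0.2 * treat + rnorm(n)
)

# Standardize each outcome by the control group's mean and SD
control <- treat == 0
z <- apply(outcomes, 2, function(y) (y - mean(y[control])) / sd(y[control]))

# Average the z-scores for each unit (na.rm skips missing outcomes)
index <- rowMeans(z, na.rm = TRUE)

# One test on the index instead of three separate tests
fit <- lm(index ~ treat)
print(summary(fit)$coefficients)
```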

---

**Advantages:**
- Increases power when effects are consistent
- Handles missing data gracefully
- Reduces multiple comparisons problem

---

### Design-Based Approaches

**Pre-analysis plans (PAPs):**
- Specify hypotheses and analysis plan before seeing data
- Define "families" of tests for correction
- Distinguish confirmatory vs. exploratory analyses

---

**Advantages of PAPs:**
- Clarifies multiple comparisons problem
- Prevents data dredging
- Increases credibility of results

---

**Replication:**
- The ultimate test: can others reproduce your findings?
- Particularly important when dealing with multiple comparisons

---

![Abstract](Images/Rooduijn1.png)

---

![Bonferroni Regression](Images/Rooduijn2.png)

---

![No Bonferroni Regression](Images/Rooduijn3.png)

---

![Abstract](Images/Castorena1.png)

---

![Explanation](Images/Castorena2.png)

---

![Results1](Images/Castorena3.png)

---

![Results2](Images/Castorena4.png)