class: center, middle, inverse, title-slide .title[ # 5: Approximating CEFs and Equalling Them ] .subtitle[ ## Linear Models ] .author[ ###
Jaye Seawright
] .institute[ ###
Northwestern Political Science
] .date[ ### Jan. 21, 2026 ]

---
class: center, middle

<style type="text/css">
pre {
  max-height: 400px;
  overflow-y: auto;
}
pre[class] {
  max-height: 200px;
}
</style>

We have a theorem that, if the conditional expectation function is linear, then it equals the population OLS regression.

---

What is a *theorem*?

---

Our theorem posits that the CEF is linear:

`$$E(y_{i}|x_{i}) = a + b x_{i}$$`

for some values of `\(a\)` and `\(b\)`.

---

Key property number one of the CEF:

`$$E(y_{i} - E(y_{i}|x_{i})) = 0$$`

---

###The CEF Error

* For any case `\(i\)`, the actual outcome `\(Y_{i}\)` will almost never be exactly equal to our prediction `\(E(Y_{i}|X_{i})\)`.

* The difference is the CEF error: `\(Y_{i} - E(Y_{i}|X_{i})\)`

---

* At the population level, we have an unlimited collection of these errors. Some are positive, some are negative.

* Question: What happens if we take the average of all these errors? `\(E(Y_{i} - E(Y_{i}|X_{i}))\)` = ?

---

* Why must the errors average to zero? Because if they didn't, our predictor wouldn't be the best one!

---

Thought Experiment:

* Suppose for a moment that `\(E(Y_{i} - E(Y_{i}|X_{i})) = c\)`, where `\(c > 0\)`.

* This would mean that, on average, our predictor `\(E(Y_{i}|X_{i})\)` is systematically too low by the amount `\(c\)`.

* If we knew this, we could create a better predictor! We could just add `\(c\)` to our old one: New Predictor = `\(E(Y_{i}|X_{i}) + c\)`

* This new predictor would have a smaller average error. But we defined `\(E(Y|X)\)` as the best predictor! (This is called proof by contradiction.)

---

There's a shorter proof that relies on the law of iterated expectations, `\(E(E(Y_{i}|X_{i})) = E(Y_{i})\)`:

`\(E(Y_{i} - E(Y_{i}|X_{i})) = E(Y_{i}) - E(E(Y_{i}|X_{i}))\)`

`\(E(Y_{i}) - E(E(Y_{i}|X_{i})) = E(Y_{i}) - E(Y_{i}) = 0\)`

---

Key property number two of the CEF:

`$$E((y_{i} - E(y_{i}|x_{i})) x_{i}) = 0$$`

---

What we're suggesting here is that the CEF error times `\(x_{i}\)` is equal to zero in expectation.
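
---

We can check both key properties numerically. This is a quick simulation sketch (written in Python with NumPy purely for illustration); the linear CEF `\(E(y|x) = 2 + 3x\)` is an invented example, not anything from our data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented example: a population where the true CEF is known,
# E(y | x) = 2 + 3x, with mean-zero noise around it.
n = 1_000_000
x = rng.normal(size=n)
y = 2 + 3 * x + rng.normal(size=n)

# CEF errors: actual outcomes minus the conditional-expectation predictions.
e = y - (2 + 3 * x)

print(np.mean(e))      # key property one: errors average to (about) zero
print(np.mean(e * x))  # key property two: errors times x average to (about) zero
```

With a million simulated cases, both averages land within sampling noise of zero, mirroring the population claims.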
---

Imagine a seesaw, with values of `\(x_{i}\)` being locations further left or right from the axis and values of the CEF error representing weights. What would happen if there were a large cluster of weights on one side or the other?

---

###Does the CEF Error Have Useful Information?

If the CEF error is related to `\(x_{i}\)`, that means there is information in the error that could still have been predicted by `\(x\)` but wasn't. The best available predictor wouldn't allow that.

---

Thought Experiment:

* Suppose `\(E((y_{i} - E(y_{i}|x_{i})) x_{i}) = c\)`, where `\(c > 0\)`.

* This means that when `\(x_i\)` is large, our errors tend to be positive (our predictions are too low), and when `\(x_i\)` is small, our errors tend to be negative (our predictions are too high).

* We could then build a better predictor by adding a small multiple of `\(x_i\)` to our old one:

* New Predictor = `\(E(Y_{i}|x_{i}) + \delta x_{i}\)`

---

Now, let's go back to our theorem.

`$$E(y_{i}|x_{i}) = a + b x_{i}$$`

`$$E(y_{i} - E(y_{i}|x_{i})) = 0$$`

`$$E((y_{i} - E(y_{i}|x_{i})) x_{i}) = 0$$`

---

Substituting the linear CEF into key property one:

`$$E(y_{i} - a - b x_{i}) = 0$$`

`$$a = E(y_{i}) - E(x_{i}) b$$`

---

Substituting the linear CEF into key property two, then replacing `\(a\)`:

`$$E((y_{i} - a - b x_{i}) x_{i}) = 0$$`

`$$E((y_{i} - E(y_{i}) + E(x_{i}) b - b x_{i}) x_{i}) = 0$$`

`$$E((y_{i} - E(y_{i}))x_{i} - (x_{i} b - E(x_{i}) b) x_{i}) = 0$$`

---

`$$b = \frac{E((y_{i} - E(y_{i}))x_{i})}{E((x_{i} - E(x_{i})) x_{i})} = \frac{cov(x,y)}{var(x)}$$`

---

So the CEF equals the BLP (best linear predictor) when the CEF is linear.

---

What happens when the CEF isn't linear?

---

* The relationship between campaign spending and vote share likely has diminishing returns: the first million dollars matters more than the tenth million.

* The relationship between education and social mobility plausibly follows an S-curve, with small gains at low education levels, rapid acceleration, then plateauing at higher levels.
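
---

The slope and intercept formulas can be checked by simulation even when the CEF is curved. This sketch (in Python/NumPy, purely illustrative) uses an invented diminishing-returns CEF, `\(E(y|x) = 10\sqrt{x}\)`, which no line matches exactly:

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented diminishing-returns example: the true CEF is E(y | x) = 10 * sqrt(x),
# so the CEF is not linear.
n = 1_000_000
x = rng.uniform(0, 10, size=n)           # e.g., campaign spending in millions
y = 10 * np.sqrt(x) + rng.normal(size=n)

# Population OLS formulas from the derivation:
# b = E((y - E(y)) x) / E((x - E(x)) x) = cov(x, y) / var(x)
# a = E(y) - E(x) b
b = np.mean((y - y.mean()) * x) / np.mean((x - x.mean()) * x)
a = y.mean() - b * x.mean()

# Even though the CEF is curved, the BLP errors keep both key properties.
e = y - (a + b * x)
print(np.mean(e))      # about zero
print(np.mean(e * x))  # about zero
```

The BLP's errors still average to zero and are still orthogonal to `\(x\)`; what the line loses is the curvature, not these balancing properties.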
---

<img src="ApproximatingCEFs_files/figure-html/unnamed-chunk-2-1.png" width="80%" style="display: block; margin: auto;" />

---

###How to proceed?

1. Model the complex, true relationship with flexible methods

2. Use a simple linear model that gives us one clear, interpretable number

---

The BLP preserves the same crucial properties we saw in the CEF:

* `\(E(e) = 0\)` (Errors average to zero across all observations)

* `\(E(e X) = 0\)` (Errors are uncorrelated with the independent variable)

---

Even when the true relationship is curved, the BLP finds the line that makes prediction errors balanced across all values of `\(X\)`.

---

<img src="ApproximatingCEFs_files/figure-html/unnamed-chunk-3-1.png" width="80%" style="display: block; margin: auto;" />

---

Let `\(Y = \text{Vote share}\)`, `\(X = \text{Campaign spending (millions)}\)`

What the BLP gives us:

* A single "average effect" of campaign spending across all spending levels

* It will over-predict for extremely low-spending and extremely high-spending campaigns

* It will under-predict for medium-spending campaigns

* But overall, the errors balance out, with no linear information left in the residuals
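
---

The over/under-prediction pattern can be seen by averaging BLP errors within spending bins. Another illustrative sketch (Python/NumPy): the concave CEF `\(E(y|x) = 10\sqrt{x}\)` and the bin cutoffs are invented for demonstration, not estimates from real campaign data.

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented setup: concave CEF E(y | x) = 10 * sqrt(x), x = spending in millions.
n = 1_000_000
x = rng.uniform(0, 10, size=n)
y = 10 * np.sqrt(x) + rng.normal(size=n)

# Fit the BLP with the population OLS formulas: b = cov(x, y) / var(x).
b = np.mean((y - y.mean()) * x) / np.mean((x - x.mean()) * x)
a = y.mean() - b * x.mean()
e = y - (a + b * x)

# Average BLP error by spending level (cutoffs chosen arbitrarily).
low = e[x < 2].mean()               # negative: line over-predicts at low spending
mid = e[(x >= 2) & (x < 8)].mean()  # positive: line under-predicts in the middle
high = e[x >= 8].mean()             # negative: line over-predicts at high spending
print(low, mid, high)
```

Negative average errors at the extremes and positive ones in the middle, yet they cancel overall: exactly the balancing act the BLP promises.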