class: center, middle, inverse, title-slide

.title[
# 8: Deriving OLS
]
.subtitle[
## Linear Models
]
.author[
### Jaye Seawright
]
.institute[
### Northwestern Political Science
]
.date[
### Feb. 2, 2026
]

---
class: center, middle

<style type="text/css">
pre {
  max-height: 400px;
  overflow-y: auto;
}

pre[class] {
  max-height: 200px;
}
</style>

* Up until this point, we've been thinking about models and parameters operating at the population level.

* To do empirical research, we're going to have to estimate: use finite amounts of data to come up with best guesses about population parameters.

---

*Plug-in principle:* In a formula for a model of interest, replace population moments with their sample equivalents.

(E.g., replace expectations with sample means, replace covariances with sample covariances, replace variances with sample variances...)

---

Remember the regression model:

`$$y = \beta_0 + \beta_1x_1 + \epsilon$$`

Let's assume `\(y\)` and `\(x_1\)` both have mean 0, so that `\(\beta_0\)` disappears.

`$$y = \beta_1x_1 + \epsilon$$`

Our approach is to choose `\(\beta_1\)` to minimize `\(\text{E}[(y - \beta_1x_1)^2]\)`.

---

To minimize `\(\text{E}[(y - \beta_1x_1)^2]\)`, we take its derivative with respect to `\(\beta_1\)` and set the result equal to zero.

`$$\frac{\partial \text{E}[(y - \beta_1x_1)^2]}{\partial \beta_1} = 0$$`

`$$\text{E}(2(y - \beta_1x_1)(-x_1)) = 0$$`

`$$\text{E}(-2 x_1 y + 2\beta_1x_1^2) = 0$$`

`$$\text{E}(-2 x_1 y) + \text{E}(2\beta_1x_1^2) = 0$$`

`$$-2\text{E}(x_1 y) + 2\beta_1\text{E}(x_1^2) = 0$$`

---

`$$-2\text{E}(x_1 y) + 2\beta_1\text{E}(x_1^2) = 0$$`

`$$-\text{E}(x_1 y) + \beta_1\text{E}(x_1^2) = 0$$`

Because `\(x_1\)` and `\(y\)` have mean 0, `\(\text{E}(x_1 y) = \text{cov}(x_1, y)\)` and `\(\text{E}(x_1^2) = \text{var}(x_1)\)`:

`$$-\text{cov}(x_1, y) + \beta_1\text{var}(x_1) = 0$$`

`$$\beta_1\text{var}(x_1) = \text{cov}(x_1, y)$$`

`$$\beta_1^* = \frac{\text{cov}(x_1, y)}{\text{var}(x_1)}$$`

Then our plug-in principle estimate, with `\(i\)` indexing the `\(N\)` observations, is:

`$$\hat\beta_1 = \frac{\sum_{i=1}^{N} x_{1i} y_i}{\sum_{i=1}^{N} x_{1i}^2}$$`

(The normalizing constants in the sample covariance and sample variance cancel.)

---

For multivariate models, the argument is basically the same, but the setup uses matrix algebra and matrix calculus.
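---

Before switching to matrices, the bivariate result can be checked numerically. The sketch below is not from the original analysis: it uses simulated data, and the sample size, seed, and true slope of 2 are arbitrary assumptions chosen only for illustration. It demeans both variables, applies the sum-of-cross-products formula, and compares the answer with `cov()/var()` and with `lm()`.

```r
# Hypothetical simulated data; n, the seed, and the slope of 2 are arbitrary.
set.seed(1)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)

# Demean both variables so the intercept drops out, as in the derivation.
x_c <- x - mean(x)
y_c <- y - mean(y)

# Plug-in estimate: sum of cross-products over sum of squares.
b_plugin <- sum(x_c * y_c) / sum(x_c^2)

# Two equivalent computations for comparison.
b_covvar <- cov(x, y) / var(x)
b_lm <- unname(coef(lm(y ~ x))["x"])

c(plugin = b_plugin, cov_var = b_covvar, lm = b_lm)
```

The three numbers should agree up to floating-point error, because demeaning plays the role of the mean-zero assumption and the normalizing constants cancel in the ratio.

---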
As we'll discuss a couple more times, the regression model in matrix form looks like this:

`$$\mathbf{Y} = \mathbf{X} \beta + \epsilon$$`

Here, `\(\mathbf{Y}\)` and `\(\epsilon\)` are `\(N\)` by 1 vectors. `\(\mathbf{X}\)` is an `\(N\)` by `\(k\)` matrix, where `\(k\)` is one plus the number of independent variables in the regression. Finally, `\(\beta\)` is a `\(k\)` by 1 vector of coefficients (including the intercept).

---

It will turn out that the plug-in estimate here is:

`$$\hat\beta = ( \mathbf{X}^T\mathbf{X})^{-1}(\mathbf{X}^T\mathbf{Y})$$`

---

Consider our turnout regressions from a bit back:

``` r
summary(lm(Turnout ~ GDP, data=turnout))
```

```
## 
## Call:
## lm(formula = Turnout ~ GDP, data = turnout)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.15949 -0.07948 -0.01742  0.08628  0.17013 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.499e-01  1.540e-02  42.202   <2e-16 ***
## GDP         -4.975e-09  2.177e-09  -2.285   0.0268 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09805 on 48 degrees of freedom
## Multiple R-squared:  0.09811, Adjusted R-squared:  0.07932 
## F-statistic: 5.222 on 1 and 48 DF,  p-value: 0.02677
```

---

The plug-in formula reproduces the GDP coefficient:

``` r
# Sample covariance over sample variance
with(turnout, cov(Turnout, GDP, use="complete")/var(GDP, use="complete"))
```

```
## [1] -4.975207e-09
```

---

``` r
summary(lm(Turnout ~ GDP + Temperature, data=turnout))
```

```
## 
## Call:
## lm(formula = Turnout ~ GDP + Temperature, data = turnout)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.13323 -0.05916 -0.01721  0.02748  0.19650 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  5.751e-01  5.458e-01   1.054    0.299
## GDP         -3.525e-09  3.006e-09  -1.173    0.249
## Temperature  9.409e-04  1.066e-02   0.088    0.930
## 
## Residual standard error: 0.09201 on 35 degrees of freedom
##   (12 observations deleted due to missingness)
## Multiple R-squared:  0.06614, Adjusted R-squared:  0.01278 
## F-statistic: 1.239 on 2 and 35 DF,  p-value: 0.3019
```

---

``` r
# Design matrix with a column of 1s for the intercept
xmat <- as.matrix(with(turnout, data.frame(Intercept=1, GDP=GDP, Temperature=Temperature)))
yvec <- turnout$Turnout
xmat
```

```
##       Intercept      GDP Temperature
##  [1,]         1      881          NA
##  [2,]         1     1155          NA
##  [3,]         1     1896          NA
##  [4,]         1     1672          NA
##  [5,]         1     1629          NA
##  [6,]         1     2139          NA
##  [7,]         1     2864          NA
##  [8,]         1     4022          NA
##  [9,]         1     3839          NA
## [10,]         1     6467          NA
## [11,]         1     7060          NA
## [12,]         1     8520          NA
## [13,]         1     7910        51.5
## [14,]         1    10510        48.6
## [15,]         1    10970        51.0
## [16,]         1    12420        53.0
## [17,]         1    14300        51.2
## [18,]         1    13300        51.6
## [19,]         1    18700        53.9
## [20,]         1    22900        51.8
## [21,]         1    27700        50.1
## [22,]         1    39400        50.9
## [23,]         1    48300        51.1
## [24,]         1    91500        51.3
## [25,]         1    84700        51.5
## [26,]         1    97000        50.7
## [27,]         1    58500        48.2
## [28,]         1    83100        52.2
## [29,]         1   100400        53.8
## [30,]         1   211400        49.4
## [31,]         1   261600        50.6
## [32,]         1   351600        51.1
## [33,]         1   428200        51.5
## [34,]         1   515300        51.9
## [35,]         1   649800        48.3
## [36,]         1   892700        50.3
## [37,]         1  1213000        52.3
## [38,]         1  1783000        51.6
## [39,]         1  2732000        52.8
## [40,]         1  3772000        51.0
## [41,]         1  4881000        53.3
## [42,]         1  6256000        53.6
## [43,]         1  7817000        54.3
## [44,]         1  9817000        53.7
## [45,]         1 11734000        52.1
## [46,]         1 14706538        51.9
## [47,]         1 16068805        56.6
## [48,]         1 18525933        56.2
## [49,]         1 21751238        55.7
## [50,]         1 28708161        55.5
```

``` r
yvec
```

```
##  [1] 0.58 0.55 0.58 0.80 0.79 0.73 0.70 0.79 0.81 0.74 0.78 0.71 0.82 0.79 0.78
## [16] 0.79 0.75 0.79 0.73 0.65 0.65 0.59 0.62 0.49 0.49 0.57 0.53 0.57 0.59 0.53
## [31] 0.52 0.63 0.60 0.64 0.63 0.63 0.57 0.56 0.55 0.55 0.51 0.56 0.50 0.52 0.57
## [46] 0.58 0.55 0.56 0.63 0.58
```

---

``` r
# Temperature is NA for rows 1-12; keep the 38 complete rows that lm() used
xmat <- xmat[13:nrow(xmat),]
yvec <- yvec[13:length(yvec)]
```

---

``` r
# X'X
t(xmat)%*%xmat
```

```
##               Intercept          GDP  Temperature
## Intercept          38.0 1.538179e+08       1976.1
## GDP         153817885.0 2.501400e+15 8374162782.9
## Temperature      1976.1 8.374163e+09     102911.9
```

``` r
# X'Y
t(xmat)%*%yvec
```

```
##                     [,1]
## Intercept         23.170
## GDP         87519078.090
## Temperature     1203.719
```

---

``` r
# (X'X)^{-1}; the very small tol lets solve() invert this badly scaled matrix
solve(t(xmat)%*%xmat, tol=1e-20)
```

```
##                 Intercept           GDP   Temperature
## Intercept    3.519096e+01  1.349966e-07 -6.867170e-01
## GDP          1.349966e-07  1.067320e-15 -2.679037e-09
## Temperature -6.867170e-01 -2.679037e-09  1.341396e-02
```

``` r
# (X'X)^{-1} X'Y
solve(t(xmat)%*%xmat, tol=1e-20)%*%t(xmat)%*%yvec
```

```
##                      [,1]
## Intercept    5.750768e-01
## GDP         -3.524858e-09
## Temperature  9.408781e-04
```

These match the `lm()` coefficients from the regression on GDP and Temperature.

---
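For reference, here is a sketch of one common route to the matrix formula. It follows the same steps as the bivariate case, but minimizes the sample sum of squared residuals rather than a population expectation. The symbol `\(S(\beta)\)` is introduced here only for illustration, and the last step assumes that `\(\mathbf{X}^T\mathbf{X}\)` is invertible (no perfect collinearity among the columns of `\(\mathbf{X}\)`).

`$$S(\beta) = (\mathbf{Y} - \mathbf{X}\beta)^T(\mathbf{Y} - \mathbf{X}\beta)$$`

`$$S(\beta) = \mathbf{Y}^T\mathbf{Y} - 2\beta^T\mathbf{X}^T\mathbf{Y} + \beta^T\mathbf{X}^T\mathbf{X}\beta$$`

`$$\frac{\partial S(\beta)}{\partial \beta} = -2\mathbf{X}^T\mathbf{Y} + 2\mathbf{X}^T\mathbf{X}\beta = 0$$`

`$$\mathbf{X}^T\mathbf{X}\beta = \mathbf{X}^T\mathbf{Y}$$`

`$$\hat\beta = (\mathbf{X}^T\mathbf{X})^{-1}(\mathbf{X}^T\mathbf{Y})$$`

With a single mean-zero regressor and no intercept, `\(\mathbf{X}^T\mathbf{X}\)` is the sum of squares and `\(\mathbf{X}^T\mathbf{Y}\)` is the sum of cross-products, so this reduces to the bivariate plug-in estimate.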