Let’s look at whether missing data are a problem in the study of democracy and economic development. With your group, access the Quality of Governance (QoG) dataset, and assemble a regression model predicting countries’ level of democracy based on per capita GDP and any control variables you see as relevant. Are there missing data? What are some possible solutions? Show how implementing a good solution would modify the results (if at all).

Start by loading relevant data:

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(devtools)
## Loading required package: usethis
#install_github("ropengov/rqog")
library(rqog)
#Download the QOG data:
qog_standard <- read_qog(which_data="standard", data_type = "time-series")
## Local file not found. 
## Downloading QoG qog_std_ts_jan23.csv data 
## from http://www.qogdata.pol.gu.se/data/qog_std_ts_jan23.csv
## in file: C:\Users\jnsno\AppData\Local\Temp\RtmpCWQVZU/rqog/qog_std_ts_jan23.csv
## Reading cache file C:\Users\jnsno\AppData\Local\Temp\RtmpCWQVZU/rqog/qog_std_ts_jan23.csv

You can consult the very long codebook for these data. It’s immediately apparent that a lot of variables are included. There may in fact be too many for your computer to handle missingness during our class period. So your first task is to choose a set of variables for the model you want to estimate in the end. You need a measure of the outcome (democracy), the cause (economic development), and any control variables.

In addition, select a handful of variables you think might be usefully related to the variables in your model in terms of providing information about their values when there is missingness.

At this point, you’re ready to start carrying out your missing-data analysis. Fill in the sample code below with variables that make sense for your analysis. In the example code below, I do this for an analysis focusing on democracy and carbon footprints.

qog_demcarbon_data <- data.frame(carbonfootprint = qog_standard$ef_carb, 
                                 democracy = qog_standard$vdem_libdem,
                                 gdp = qog_standard$wdi_gdpcapcon2015,
                                 education = qog_standard$wdi_gerp,
                                 lifeexp = qog_standard$wdi_lifexp,
                                 fossilfuels = qog_standard$wdi_fossil,
                                 corruption = qog_standard$bci_bci,
                                 year = qog_standard$year,
                                 country = qog_standard$cname)

Now look at the pattern of missingness in your smaller subset of the data:

library(naniar)
## Warning: package 'naniar' was built under R version 4.5.3
vis_miss(qog_demcarbon_data)

library(UpSetR)
## Warning: package 'UpSetR' was built under R version 4.5.3
gg_miss_upset(qog_demcarbon_data)
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the UpSetR package.
##   Please report the issue to the authors.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the UpSetR package.
##   Please report the issue to the authors.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: The `size` argument of `element_line()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## ℹ The deprecated feature was likely used in the UpSetR package.
##   Please report the issue to the authors.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

What do you conclude? Are there some variables that seem to have more missingness than others? Can you see patterns where certain variables tend to be missing together?

For my data, I notice that there is a large block of observations where all variables are simultaneously missing. They turn out to be clustered in the earlier years, which I remove. You may have a similar problem, or not, depending on your variables. It’s worth exploring.

table(qog_standard$year,is.na(qog_standard$vdem_libdem))
##       
##        FALSE TRUE
##   1946    64  137
##   1947    66  135
##   1948    72  129
##   1949    74  127
##   1950    77  124
##   1951    77  124
##   1952    78  123
##   1953    78  123
##   1954    80  121
##   1955    82  119
##   1956    85  116
##   1957    86  115
##   1958    86  115
##   1959    87  114
##   1960    90  111
##   1961   107   94
##   1962   107   94
##   1963   114   87
##   1964   115   86
##   1965   119   82
##   1966   124   77
##   1967   127   74
##   1968   129   72
##   1969   131   70
##   1970   131   70
##   1971   133   68
##   1972   135   66
##   1973   135   66
##   1974   135   66
##   1975   138   63
##   1976   143   58
##   1977   145   55
##   1978   145   55
##   1979   146   54
##   1980   146   54
##   1981   147   53
##   1982   147   53
##   1983   147   53
##   1984   147   53
##   1985   147   53
##   1986   147   53
##   1987   147   53
##   1988   147   53
##   1989   147   53
##   1990   147   52
##   1991   147   51
##   1992   164   34
##   1993   168   30
##   1994   168   30
##   1995   168   30
##   1996   168   30
##   1997   168   30
##   1998   168   30
##   1999   168   30
##   2000   168   30
##   2001   168   30
##   2002   170   28
##   2003   170   28
##   2004   170   28
##   2005   170   28
##   2006   172   26
##   2007   172   26
##   2008   172   26
##   2009   172   26
##   2010   172   26
##   2011   173   25
##   2012   173   25
##   2013   173   25
##   2014   173   25
##   2015   173   25
##   2016   173   25
##   2017   173   25
##   2018   173   25
##   2019   173   25
##   2020   173   25
##   2021   173   25
##   2022     0  198
table(qog_standard$year,is.na(qog_standard$ef_carb))
##       
##        FALSE TRUE
##   1946     0  201
##   1947     0  201
##   1948     0  201
##   1949     0  201
##   1950     0  201
##   1951     0  201
##   1952     0  201
##   1953     0  201
##   1954     0  201
##   1955     0  201
##   1956     0  201
##   1957     0  201
##   1958     0  201
##   1959     0  201
##   1960     0  201
##   1961    83  118
##   1962    83  118
##   1963    89  112
##   1964    91  110
##   1965    95  106
##   1966    97  104
##   1967    99  102
##   1968    99  102
##   1969   100  101
##   1970   101  100
##   1971   103   98
##   1972   103   98
##   1973   103   98
##   1974   104   97
##   1975   106   95
##   1976   107   94
##   1977   107   93
##   1978   107   93
##   1979   110   90
##   1980   113   87
##   1981   115   85
##   1982   115   85
##   1983   115   85
##   1984   117   83
##   1985   117   83
##   1986   117   83
##   1987   117   83
##   1988   117   83
##   1989   117   83
##   1990   117   82
##   1991   117   81
##   1992   131   67
##   1993   135   63
##   1994   135   63
##   1995   135   63
##   1996   136   62
##   1997   136   62
##   1998   136   62
##   1999   137   61
##   2000   138   60
##   2001   138   60
##   2002   138   60
##   2003   138   60
##   2004   138   60
##   2005   138   60
##   2006   139   59
##   2007   139   59
##   2008   139   59
##   2009   139   59
##   2010   139   59
##   2011   139   59
##   2012   140   58
##   2013   140   58
##   2014   140   58
##   2015   140   58
##   2016   140   58
##   2017   140   58
##   2018   161   37
##   2019     0  198
##   2020     0  198
##   2021     0  198
##   2022     0  198
qog_demcarbon_data <- qog_demcarbon_data %>% filter(year>1960)

There are multiple possible solutions for missing data, but a common approach is multiple imputation. If your group wants to pursue this as a solution, you can start with the sample code below.

library(mice)
## 
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
## 
##     filter
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
imputed_demcarbon_data <- mice(qog_demcarbon_data, m = 5, method = 'pmm', seed = 123)
## 
##  iter imp variable
##   1   1  carbonfootprint  democracy  gdp  education  lifeexp  fossilfuels  corruption
##   1   2  carbonfootprint  democracy  gdp  education  lifeexp  fossilfuels  corruption
##   1   3  carbonfootprint  democracy  gdp  education  lifeexp  fossilfuels  corruption
##   1   4  carbonfootprint  democracy  gdp  education  lifeexp  fossilfuels  corruption
##   1   5  carbonfootprint  democracy  gdp  education  lifeexp  fossilfuels  corruption
##   2   1  carbonfootprint  democracy  gdp  education  lifeexp  fossilfuels  corruption
##   2   2  carbonfootprint  democracy  gdp  education  lifeexp  fossilfuels  corruption
##   2   3  carbonfootprint  democracy  gdp  education  lifeexp  fossilfuels  corruption
##   2   4  carbonfootprint  democracy  gdp  education  lifeexp  fossilfuels  corruption
##   2   5  carbonfootprint  democracy  gdp  education  lifeexp  fossilfuels  corruption
##   3   1  carbonfootprint  democracy  gdp  education  lifeexp  fossilfuels  corruption
##   3   2  carbonfootprint  democracy  gdp  education  lifeexp  fossilfuels  corruption
##   3   3  carbonfootprint  democracy  gdp  education  lifeexp  fossilfuels  corruption
##   3   4  carbonfootprint  democracy  gdp  education  lifeexp  fossilfuels  corruption
##   3   5  carbonfootprint  democracy  gdp  education  lifeexp  fossilfuels  corruption
##   4   1  carbonfootprint  democracy  gdp  education  lifeexp  fossilfuels  corruption
##   4   2  carbonfootprint  democracy  gdp  education  lifeexp  fossilfuels  corruption
##   4   3  carbonfootprint  democracy  gdp  education  lifeexp  fossilfuels  corruption
##   4   4  carbonfootprint  democracy  gdp  education  lifeexp  fossilfuels  corruption
##   4   5  carbonfootprint  democracy  gdp  education  lifeexp  fossilfuels  corruption
##   5   1  carbonfootprint  democracy  gdp  education  lifeexp  fossilfuels  corruption
##   5   2  carbonfootprint  democracy  gdp  education  lifeexp  fossilfuels  corruption
##   5   3  carbonfootprint  democracy  gdp  education  lifeexp  fossilfuels  corruption
##   5   4  carbonfootprint  democracy  gdp  education  lifeexp  fossilfuels  corruption
##   5   5  carbonfootprint  democracy  gdp  education  lifeexp  fossilfuels  corruption
## Warning: Number of logged events: 1
fit_demcarbon <- with(imputed_demcarbon_data, lm(carbonfootprint ~ democracy + I(log(gdp))))
pooled_demcarbon_results <- pool(fit_demcarbon)
summary(pooled_demcarbon_results)
##          term    estimate  std.error   statistic       df      p.value
## 1 (Intercept) -6.92926105 0.12490998 -55.4740401 12.08367 6.393995e-16
## 2   democracy -0.06171357 0.08306484  -0.7429566 14.87980 4.690763e-01
## 3 I(log(gdp))  1.03298626 0.01740758  59.3411721 11.48297 1.196320e-15

As a group, discuss your results and interpret their meaning. Does adjusting for missing data change the results in any meaningful way? Send your results and a paragraph of explanation to your TA.