Let’s look at whether missing data are a problem in the study of democracy and economic development. With your group, access the Quality of Governance (QoG) dataset, and assemble a regression model predicting countries’ level of democracy based on per capita GDP and any control variables you see as relevant. Are there missing data? What are some possible solutions? Show how implementing a good solution would modify the results (if at all).
Start by loading relevant data:
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(devtools)
## Loading required package: usethis
#install_github("ropengov/rqog")
library(rqog)
#Download the QOG data:
qog_standard <- read_qog(which_data="standard", data_type = "time-series")
## Local file not found.
## Downloading QoG qog_std_ts_jan23.csv data
## from http://www.qogdata.pol.gu.se/data/qog_std_ts_jan23.csv
## in file: C:\Users\jnsno\AppData\Local\Temp\RtmpCWQVZU/rqog/qog_std_ts_jan23.csv
## Reading cache file C:\Users\jnsno\AppData\Local\Temp\RtmpCWQVZU/rqog/qog_std_ts_jan23.csv
You can consult the very long codebook for these data. It’s immediately apparent that a lot of variables are included. There may in fact be too many for your computer to handle missingness during our class period. So your first task is to choose a set of variables for the model you want to estimate in the end. You need a measure of the outcome (democracy), the cause (economic development), and any control variables.
In addition, select a handful of variables you think might be usefully related to the variables in your model in terms of providing information about their values when there is missingness.
At this point, you’re ready to start carrying out your missing-data analysis. Fill in the sample code below with variables that make sense for your analysis. In the example code below, I do this for an analysis focusing on democracy and carbon footprints.
qog_demcarbon_data <- data.frame(carbonfootprint = qog_standard$ef_carb,
democracy = qog_standard$vdem_libdem,
gdp = qog_standard$wdi_gdpcapcon2015,
education = qog_standard$wdi_gerp,
lifeexp = qog_standard$wdi_lifexp,
fossilfuels = qog_standard$wdi_fossil,
corruption = qog_standard$bci_bci,
year = qog_standard$year,
country = qog_standard$cname)
Now look at the pattern of missingness in your smaller subset of the data:
library(naniar)
## Warning: package 'naniar' was built under R version 4.5.3
vis_miss(qog_demcarbon_data)
library(UpSetR)
## Warning: package 'UpSetR' was built under R version 4.5.3
gg_miss_upset(qog_demcarbon_data)
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the UpSetR package.
## Please report the issue to the authors.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the UpSetR package.
## Please report the issue to the authors.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: The `size` argument of `element_line()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## ℹ The deprecated feature was likely used in the UpSetR package.
## Please report the issue to the authors.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
What do you conclude? Are there some variables that seem to have more missingness than others? Can you see patterns where certain variables tend to be missing together?
For my data, I notice that there is a large block of observations where all variables are simultaneously missing. They turn out to be clustered in the earlier years, which I remove. You may have a similar problem, or not, depending on your variables. It’s worth exploring.
table(qog_standard$year,is.na(qog_standard$vdem_libdem))
##
## FALSE TRUE
## 1946 64 137
## 1947 66 135
## 1948 72 129
## 1949 74 127
## 1950 77 124
## 1951 77 124
## 1952 78 123
## 1953 78 123
## 1954 80 121
## 1955 82 119
## 1956 85 116
## 1957 86 115
## 1958 86 115
## 1959 87 114
## 1960 90 111
## 1961 107 94
## 1962 107 94
## 1963 114 87
## 1964 115 86
## 1965 119 82
## 1966 124 77
## 1967 127 74
## 1968 129 72
## 1969 131 70
## 1970 131 70
## 1971 133 68
## 1972 135 66
## 1973 135 66
## 1974 135 66
## 1975 138 63
## 1976 143 58
## 1977 145 55
## 1978 145 55
## 1979 146 54
## 1980 146 54
## 1981 147 53
## 1982 147 53
## 1983 147 53
## 1984 147 53
## 1985 147 53
## 1986 147 53
## 1987 147 53
## 1988 147 53
## 1989 147 53
## 1990 147 52
## 1991 147 51
## 1992 164 34
## 1993 168 30
## 1994 168 30
## 1995 168 30
## 1996 168 30
## 1997 168 30
## 1998 168 30
## 1999 168 30
## 2000 168 30
## 2001 168 30
## 2002 170 28
## 2003 170 28
## 2004 170 28
## 2005 170 28
## 2006 172 26
## 2007 172 26
## 2008 172 26
## 2009 172 26
## 2010 172 26
## 2011 173 25
## 2012 173 25
## 2013 173 25
## 2014 173 25
## 2015 173 25
## 2016 173 25
## 2017 173 25
## 2018 173 25
## 2019 173 25
## 2020 173 25
## 2021 173 25
## 2022 0 198
table(qog_standard$year,is.na(qog_standard$ef_carb))
##
## FALSE TRUE
## 1946 0 201
## 1947 0 201
## 1948 0 201
## 1949 0 201
## 1950 0 201
## 1951 0 201
## 1952 0 201
## 1953 0 201
## 1954 0 201
## 1955 0 201
## 1956 0 201
## 1957 0 201
## 1958 0 201
## 1959 0 201
## 1960 0 201
## 1961 83 118
## 1962 83 118
## 1963 89 112
## 1964 91 110
## 1965 95 106
## 1966 97 104
## 1967 99 102
## 1968 99 102
## 1969 100 101
## 1970 101 100
## 1971 103 98
## 1972 103 98
## 1973 103 98
## 1974 104 97
## 1975 106 95
## 1976 107 94
## 1977 107 93
## 1978 107 93
## 1979 110 90
## 1980 113 87
## 1981 115 85
## 1982 115 85
## 1983 115 85
## 1984 117 83
## 1985 117 83
## 1986 117 83
## 1987 117 83
## 1988 117 83
## 1989 117 83
## 1990 117 82
## 1991 117 81
## 1992 131 67
## 1993 135 63
## 1994 135 63
## 1995 135 63
## 1996 136 62
## 1997 136 62
## 1998 136 62
## 1999 137 61
## 2000 138 60
## 2001 138 60
## 2002 138 60
## 2003 138 60
## 2004 138 60
## 2005 138 60
## 2006 139 59
## 2007 139 59
## 2008 139 59
## 2009 139 59
## 2010 139 59
## 2011 139 59
## 2012 140 58
## 2013 140 58
## 2014 140 58
## 2015 140 58
## 2016 140 58
## 2017 140 58
## 2018 161 37
## 2019 0 198
## 2020 0 198
## 2021 0 198
## 2022 0 198
qog_demcarbon_data <- qog_demcarbon_data %>% filter(year>1960)
There are multiple possible solutions for missing data, but a common approach is multiple imputation. If your group wants to pursue this as a solution, you can start with the sample code below.
library(mice)
##
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
##
## filter
## The following objects are masked from 'package:base':
##
## cbind, rbind
imputed_demcarbon_data <- mice(qog_demcarbon_data, m = 5, method = 'pmm', seed = 123)
##
## iter imp variable
## 1 1 carbonfootprint democracy gdp education lifeexp fossilfuels corruption
## 1 2 carbonfootprint democracy gdp education lifeexp fossilfuels corruption
## 1 3 carbonfootprint democracy gdp education lifeexp fossilfuels corruption
## 1 4 carbonfootprint democracy gdp education lifeexp fossilfuels corruption
## 1 5 carbonfootprint democracy gdp education lifeexp fossilfuels corruption
## 2 1 carbonfootprint democracy gdp education lifeexp fossilfuels corruption
## 2 2 carbonfootprint democracy gdp education lifeexp fossilfuels corruption
## 2 3 carbonfootprint democracy gdp education lifeexp fossilfuels corruption
## 2 4 carbonfootprint democracy gdp education lifeexp fossilfuels corruption
## 2 5 carbonfootprint democracy gdp education lifeexp fossilfuels corruption
## 3 1 carbonfootprint democracy gdp education lifeexp fossilfuels corruption
## 3 2 carbonfootprint democracy gdp education lifeexp fossilfuels corruption
## 3 3 carbonfootprint democracy gdp education lifeexp fossilfuels corruption
## 3 4 carbonfootprint democracy gdp education lifeexp fossilfuels corruption
## 3 5 carbonfootprint democracy gdp education lifeexp fossilfuels corruption
## 4 1 carbonfootprint democracy gdp education lifeexp fossilfuels corruption
## 4 2 carbonfootprint democracy gdp education lifeexp fossilfuels corruption
## 4 3 carbonfootprint democracy gdp education lifeexp fossilfuels corruption
## 4 4 carbonfootprint democracy gdp education lifeexp fossilfuels corruption
## 4 5 carbonfootprint democracy gdp education lifeexp fossilfuels corruption
## 5 1 carbonfootprint democracy gdp education lifeexp fossilfuels corruption
## 5 2 carbonfootprint democracy gdp education lifeexp fossilfuels corruption
## 5 3 carbonfootprint democracy gdp education lifeexp fossilfuels corruption
## 5 4 carbonfootprint democracy gdp education lifeexp fossilfuels corruption
## 5 5 carbonfootprint democracy gdp education lifeexp fossilfuels corruption
## Warning: Number of logged events: 1
fit_demcarbon <- with(imputed_demcarbon_data, lm(carbonfootprint ~ democracy + I(log(gdp))))
pooled_demcarbon_results <- pool(fit_demcarbon)
summary(pooled_demcarbon_results)
## term estimate std.error statistic df p.value
## 1 (Intercept) -6.92926105 0.12490998 -55.4740401 12.08367 6.393995e-16
## 2 democracy -0.06171357 0.08306484 -0.7429566 14.87980 4.690763e-01
## 3 I(log(gdp)) 1.03298626 0.01740758 59.3411721 11.48297 1.196320e-15
As a group, discuss your results and interpret their meaning. Does adjusting for missing data change the results in any meaningful way? Send your results and a paragraph of explanation to your TA.