class: title-slide

# Session 5.2: Missing data imputation<span style="display:block; margin-top: 10px ;"></span>
##
###
### Imperial College London

<!-- Can also separate the various components of the extra argument 'params', eg as in ### Imperial College London, , MSc in Epidemiology -->

---
layout: true

.my-footer[
.alignleft[ © Marta Blangiardo | Monica Pirani ]
.aligncenter[ MSc in Epidemiology ]
.alignright[ Imperial College London ]
]

---

# Learning Objectives

After this session you should be able to:
<span style="display:block; margin-top: 10px ;"></span>
- Appreciate the importance of thinking about why data are missing, and of stating your modelling assumptions
<span style="display:block; margin-top: 10px ;"></span>
- Understand the disadvantages of complete case analysis
<span style="display:block; margin-top: 10px ;"></span>
- Learn how Bayesian methods can be used for handling missing data
<span style="display:block; margin-top: 10px ;"></span>
- Be able to run the above models in R-INLA
<span style="display:block; margin-top: 10px ;"></span>

The topics in this lecture are covered in Chapter 12 of Gómez-Rubio (2020): https://becarioprecario.bitbucket.io/inla-gitbook/

---

# Why do we care about missing data?

- Missing data are common!
- Usually inadequately handled in both observational and experimental research
<span style="display:block; margin-top: 10px ;"></span>
- For example, Wood, White, and Thompson (2004) reviewed 71 recently published BMJ, JAMA, Lancet and NEJM papers
  - 89% had partly missing outcome data
  - In 37 trials with repeated outcome measures, 46% performed complete case analysis
  - Only 21% reported sensitivity analysis
<span style="display:block; margin-top: 20px ;"></span>
- Sterne, White, Carlin, Spratt, Royston, Kenward, Wood, and Carpenter (2009) reviewed articles using Multiple Imputation in BMJ, JAMA, Lancet and NEJM from 2002 to 2007
  - 59 articles found, with use doubling over the six-year period
  - However, the reporting was almost always inadequate

---

# Outline

1\. [How do missing data arise?](#how)
<span style="display:block; margin-top: 30px ;"></span>
2\. [Example: children height and weight](#example)
<span style="display:block; margin-top: 30px ;"></span>
3\. [Bayesian imputation](#model)
<span style="display:block; margin-top: 30px ;"></span>
4\. [Extending the model](#extending)
<span style="display:block; margin-top: 30px ;"></span>

---
name: how

<span style="display:block; margin-top: 250px ;"></span>
.myblue[.center[.huge[
**How do missing data arise?**]]]

---

# How do missing data arise?

<center><img src=./img/Missing_DAG.png width='80%' title='Directed acyclic graph of the missingness mechanism'></center>

---

# Different types of missing data: MCAR

- There are three types of missing data, depending on why the missingness arises. Let's define `\(m_i\)` as the variable indicating whether the `\(i\)`-th observation is missing
`$$m_i \sim \text{Bernoulli}(p_i)$$`

1\. .red[Missing completely at random] (MCAR) occurs when the missingness is independent of both the observed and the unobserved data:
`$$\text{logit}(p_i) = \theta_0$$`
This means that the missing values can be ignored and the analysis can be conducted as usual.

---

# Different types of missing data: MAR and MNAR

2\.
.red[Missing at random] (MAR) occurs when the missingness depends ONLY on the observed data:
`$$\text{logit}(p_i) = \theta_0 + \bf{x}_i \mathbf{\theta}_1$$`
In this case, the missingness mechanism can be introduced into the model so that missing observations are imputed as part of the model fitting.

--

<span style="display:block; margin-top: 40px ;"></span>
3\. .red[Missing not at random] (MNAR) occurs when the missingness depends on both the observed and the missing data:
`$$\text{logit}(p_i) = \theta_0 + \bf{x}_i \mathbf{\theta}_1 + \lambda y_i$$`
This scenario is difficult to tackle since there is no information about the missingness mechanism or the missing data.

---

# Missing response or missing covariates

- Additionally, it is crucial to distinguish between missing values .blue[in the response] and .blue[in the covariates].
<span style="display:block; margin-top: 30px ;"></span>
- When the missingness is in the response (and it is MCAR or MAR), the missing values can naturally be predicted, as the distribution of the response is determined by the statistical model being fit (the posterior predictive distribution in the Bayesian approach)
<span style="display:block; margin-top: 30px ;"></span>
- Missingness in the covariates requires additional steps:
<span style="display:block; margin-top: 10px ;"></span>
  - A model needs to be specified on the covariate (imputation model) if the missingness mechanism is MCAR or MAR
<span style="display:block; margin-top: 10px ;"></span>
  - An additional model of missingness needs to be specified if the mechanism is MNAR
<span style="display:block; margin-top: 10px ;"></span>
- Here we will consider only missing values in the response
<span style="display:block; margin-top: 10px ;"></span>
- Dealing with missing values in covariates in `R-INLA` requires some additional complexity; if you are interested in learning more, see Gómez-Rubio, Cameletti, and Blangiardo (2022)

---
name: example

<span style="display:block; margin-top: 250px ;"></span>
.myblue[.center[.huge[
**Example: height and weight of children**]]]

---

# Example: height and weight of children

- Information on 10,030 children measured within the Fifth Dutch Growth Study 2009 (Schönbeck, Talma, Van Dommelen, Bakker, Buitendijk, HiraSing, and Van Buuren, 2013)
<span style="display:block; margin-top: 20px ;"></span>
- Data available as the `fdgs` dataset in the `mice` package (Van Buuren and Groothuis-Oudshoorn, 2011)
<span style="display:block; margin-top: 20px ;"></span>

<table class=" lightable-classic" style='font-family: "Arial Narrow", "Source Sans Pro", sans-serif; width: auto !important; margin-left: auto; margin-right: auto;'>
<thead>
<tr>
<th style="text-align:left;"> Variable </th>
<th style="text-align:left;"> Description </th>
<th style="text-align:right;"> Missing </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left;"> id </td>
<td style="text-align:left;"> Child ID </td>
<td style="text-align:right;"> 0 </td>
</tr>
<tr>
<td style="text-align:left;"> reg </td>
<td style="text-align:left;"> Region (5 levels) </td>
<td style="text-align:right;"> 0 </td>
</tr>
<tr>
<td style="text-align:left;"> age </td>
<td style="text-align:left;"> Age (year) </td>
<td style="text-align:right;"> 0 </td>
</tr>
<tr>
<td style="text-align:left;"> sex </td>
<td style="text-align:left;"> Sex </td>
<td style="text-align:right;"> 0 </td>
</tr>
<tr>
<td style="text-align:left;"> hgt </td>
<td style="text-align:left;"> Height (cm) </td>
<td style="text-align:right;"> 23 </td>
</tr>
<tr>
<td style="text-align:left;"> wgt </td>
<td style="text-align:left;"> Weight (kg) </td>
<td style="text-align:right;"> 20 </td>
</tr>
<tr>
<td style="text-align:left;"> hgt.z </td>
<td style="text-align:left;"> Re-scaled height (as a Z-score) </td>
<td style="text-align:right;"> 23 </td>
</tr>
<tr>
<td style="text-align:left;"> wgt.z </td>
<td style="text-align:left;"> Re-scaled weight (as a Z-score) </td>
<td style="text-align:right;"> 20 </td>
</tr>
</tbody>
</table>

---

# Summarising the data

- The data can be summarised using

``` r
> options(width = 100)
> summary(fdgs)
```

```
       id             reg            age                sex            hgt             wgt        
 Min.   :100001   North: 732   Min.   : 0.008214   boy :4829   Min.   : 46.0   Min.   :  2.585  
 1st Qu.:106353   East :2528   1st Qu.: 1.618754   girl:5201   1st Qu.: 83.8   1st Qu.: 11.600  
 Median :203855   South:2931   Median : 8.084873               Median :131.5   Median : 27.500  
 Mean   :180091   West :2578   Mean   : 8.157936               Mean   :123.9   Mean   : 32.385  
 3rd Qu.:210591   City :1261   3rd Qu.:13.547570               3rd Qu.:162.3   3rd Qu.: 51.100  
 Max.   :401955                Max.   :21.993155               Max.   :208.0   Max.   :135.300  
                                                               NA's   :23      NA's   :20       
     hgt.z               wgt.z         
 Min.   :-4.470000   Min.   :-5.04000  
 1st Qu.:-0.678000   1st Qu.:-0.62475  
 Median :-0.019000   Median : 0.02600  
 Mean   :-0.006054   Mean   : 0.04573  
 3rd Qu.: 0.677000   3rd Qu.: 0.70700  
 Max.   : 3.900000   Max.   : 4.74100  
 NA's   :23          NA's   :20        
```

- Note that several variables in the dataset have missing observations, in particular height (`hgt`) and weight (`wgt`), which are commonly used as response or predictors in models.

---

# Subsetting the data

- In order to provide a smaller dataset to speed up computations, only the children with missing values (in height or weight) and another 1,000 children taken at random will be used in the analysis

``` r
> # Subset 1, observations with NA's
> subset1 <- which(is.na(fdgs$wgt) | is.na(fdgs$hgt))
> 
> # Subset 2, random sample of 1000 individuals
> set.seed(1)
> subset2 <- sample((1:nrow(fdgs))[-subset1], 1000)
> 
> # Subset 1 + subset 2
> fdgs.sub <- fdgs[c(subset1, subset2), ]
```

---

# Summary of the data

``` r
> summary(fdgs.sub)
```

```
       id             reg           age               sex           hgt              wgt        
 Min.   :100098   North: 78   Min.   : 0.07118   boy :493   Min.   : 46.00   Min.   :  2.585  
 1st Qu.:106293   East :275   1st Qu.: 1.74264   girl:550   1st Qu.: 86.28   1st Qu.: 11.960  
 Median :204306   South:298   Median : 8.59411              Median :136.05   Median : 29.000  
 Mean   :183214   West :250   Mean   : 8.56536              Mean   :127.04   Mean   : 33.614  
 3rd Qu.:211388   City :142   3rd Qu.:14.16427              3rd Qu.:165.10   3rd Qu.: 53.100  
 Max.   :401949               Max.   :21.88364              Max.   :199.00   Max.   :117.300  
                                                            NA's   :23       NA's   :20       
     hgt.z              wgt.z         
 Min.   :-4.26300   Min.   :-4.0750  
 1st Qu.:-0.66800   1st Qu.:-0.6520  
 Median :-0.04550   Median :-0.0070  
 Mean   :-0.02313   Mean   : 0.0202  
 3rd Qu.: 0.64550   3rd Qu.: 0.6960  
 Max.   : 3.20000   Max.   : 3.6300  
 NA's   :23         NA's   :20       
```

---
name: model

<span style="display:block; margin-top: 250px ;"></span>
.myblue[.center[.huge[
**Bayesian imputation**]]]

---

# Model specification

- We first predict weight as a function of age and sex. We specify:
`$$wgt_i \sim \text{Normal}(\alpha+\beta_1 sex_i + \beta_2 age_i, \sigma^2)$$`

- Where `wgt` is missing, we will have `NA` in the corresponding vector, e.g.

``` r
> fdgs.sub$wgt[1:10]
```

```
 [1] 23.200     NA     NA     NA  8.445     NA  6.960 10.420     NA 14.100
```

- Remember the posterior predictive distribution (presented in lecture 2.2):
`$$p(\mathbf{y}^*|\mathbf{y}) = \int p(\mathbf{y}^*|\theta)p(\theta|\mathbf{y})d\theta$$`

- Here `\(\mathbf{y}^*\)` identifies the missing values, while `\(\mathbf{y}\)` is the set of observed values for `wgt`

---

# Running the model in `R-INLA`

``` r
> library("INLA")
> wgt.inla <- inla(wgt ~ age + sex, data = fdgs.sub,
+   control.predictor = list(compute = TRUE), control.compute = list(return.marginals.predictor = TRUE))
> wgt.inla$summary.fixed
```

```
                 mean         sd 0.025quant  0.5quant 0.975quant      mode          kld
(Intercept)  6.259187 0.43604778   5.403965  6.259187   7.114408  6.259187 8.791297e-12
age          3.330942 0.03369553   3.264855  3.330942   3.397029  3.330942 8.996111e-12
sexgirl     -2.378944 0.44616249  -3.254002 -2.378944  -1.503883 -2.378944 8.779814e-12
```

``` r
> wgt.inla$summary.hyperpar
```

```
                                              mean           sd 0.025quant   0.5quant 0.975quant
Precision for the Gaussian observations 0.01972868 0.0008731002 0.01805437 0.01971668 0.02147505
                                              mode
Precision for the Gaussian observations 0.01969729
```

- Note that we need to include `control.predictor = list(compute = TRUE)` so `INLA` can estimate the predictive distribution for the missing observations, while
`control.compute = list(return.marginals.predictor = TRUE)` tells `inla` that we want access to the entire posterior distribution of the predictions (rather than only the summary)

---

# Getting the posterior prediction

We now extract the indices of the children with missing values so we can report their predictive distributions:

``` r
> wgt.na = which(is.na(fdgs.sub$wgt))
> rownames(fdgs.sub)[wgt.na]
```

```
 [1] "275"  "1278" "1419" "2135" "2684" "2940" "3069" "3189" "3543" "4262" "5687" "6485" "7101"
[14] "7108" "7506" "8064" "8065" "8067" "8098" "8588"
```

``` r
> # Obtain the predictive distribution
> wgt.inla$summary.fitted.values[wgt.na, c("mean", "sd")][1:5,]
```

```
                           mean        sd
fitted.Predictor.0002 23.067927 0.3202367
fitted.Predictor.0003 27.617341 0.3330419
fitted.Predictor.0004 54.019923 0.3767163
fitted.Predictor.0006 13.144502 0.3929115
fitted.Predictor.0009  7.838159 0.3938823
```

Remember that you can also access the marginal posterior distributions (rather than the summary) using `wgt.inla$marginals.fitted.values`

---

# Imputing `hgt` using the same approach

- Similarly, a model can be fit to explain height based on age and sex and to compute the predictive distribution of the missing observations:

``` r
> hgt.inla = inla(hgt ~ age + sex, data = fdgs.sub,
+   control.predictor = list(compute = TRUE),
+   control.compute = list(return.marginals.predictor = TRUE))
> hgt.inla$summary.fixed
```

```
                  mean         sd 0.025quant  0.5quant 0.975quant      mode          kld
(Intercept) 77.148121 0.71985093  75.736271 77.148122  78.559965 77.148122 9.653372e-12
age          6.022575 0.05539618   5.913926  6.022575   6.131223  6.022575 1.041944e-11
sexgirl     -4.716077 0.72990150  -6.147629 -4.716080  -3.284512 -4.716080 8.929221e-12
```

``` r
> hgt.inla$summary.hyperpar
```

```
                                               mean           sd 0.025quant    0.5quant  0.975quant
Precision for the Gaussian observations 0.007402752 0.0003279352 0.00677109 0.007398761 0.008055958
                                               mode
Precision for the Gaussian observations 0.007386158
```

---

# Imputing `hgt` using the same approach

.pull-left[
<span style="display:block; margin-top: 24px ;"></span>

- We can obtain the predictions using

``` r
> hgt.na = which(is.na(fdgs.sub$hgt))
> hgt.inla$summary.fitted.values[hgt.na, c("mean", "sd")][1:10,]
```

```
                           mean        sd
fitted.Predictor.0001 122.24529 0.5353218
fitted.Predictor.0005  83.24902 0.6830427
fitted.Predictor.0007  74.64156 0.6783904
fitted.Predictor.0008  82.80382 0.6856290
fitted.Predictor.0010  88.26140 0.6011378
fitted.Predictor.0012  96.55557 0.6133715
fitted.Predictor.0013  86.31596 0.6656378
fitted.Predictor.0016  81.84746 0.6912339
fitted.Predictor.0017  77.75821 0.7159237
fitted.Predictor.0018  81.59988 0.6370189
```
]

.pull-right[

- We can plot the entire posterior distribution for each child with:

``` r
> # First child with missing value
> plot(hgt.inla$marginals.fitted.values[[hgt.na[1]]], type = "l")
```

<img src="./img/unnamed-chunk-12-1.png" style="display: block; margin: auto;" width="80%">
]

---
name: extending

<span style="display:block; margin-top: 250px ;"></span>
.myblue[.center[.huge[
**Extending the model**]]]

---

# Joint model of height and weight

- The two previous models consider height and weight separately, but there is clearly a high correlation between height and weight, largely driven by the age and sex of the child.

```
[1] 0.9999865
```

- We can build a joint model for height and weight to exploit a correlated effect between the coefficients of age in both models.
`\begin{align*}
hgt_i &= \alpha_h + \beta_{h1} sex_i + \beta_{h2} age_i + \epsilon_{1i}\\
wgt_i &= \alpha_w + \beta_{w1} sex_i + \beta_{w2} age_i + \epsilon_{2i}\\
\end{align*}`
with
- `\(\alpha_h,\alpha_w\)` the model intercepts
- `\(\beta_{h1},\beta_{w1}\)` the effects of sex
- `\(\beta_{h2},\beta_{w2}\)` the effects of age
- `\(\mathbf{\epsilon}_{1},\mathbf{\epsilon}_{2}\)` the error terms

(note that this specification is equivalent to the one above for the separate models)

---

# Joint model of height and weight: prior

- For `\(j = 1, 2\)` (i.e. sex and age), the pair of coefficients `\((\beta_{hj}, \beta_{wj})\)` is modeled using a bivariate Gaussian distribution with mean 0 and covariance matrix with variances `\(1/\tau_{jh}\)` and `\(1/\tau_{jw}\)`, and covariance `\(\rho_j / \sqrt{\tau_{jh} \tau_{jw}}\)`, where `\(\rho_j\)` is the correlation parameter.

- First, the bivariate response variable needs to be put in a two-column matrix, given that the model will be made of two data distributions

``` r
> n = nrow(fdgs.sub)
> y = matrix(NA, nrow = 2 * n, ncol = 2)
> y[1:n, 1] = fdgs.sub$hgt
> y[(n + 1):(2*n), 2] = fdgs.sub$wgt
```

- Similarly, as we have two intercepts, we need to define these explicitly as covariates with all values equal to one:

``` r
> I = matrix(NA, nrow = 2 * n, ncol = 2)
> I[1:n, 1] = 1
> I[(n + 1):(2*n), 2] = 1
```

---

# Correlated effects

- Now we need to define the correlated effects. We will use a random effect specification similar to what we saw with the hierarchical models (`iid`), but modified so that the two coefficients are correlated: `f(..., model = "iid2d")`.

- In order to do so we need to modify the variables `age` and `sex`, as they will be passed to the model as weights of the latent random effects in the `iid2d` specification.
We need them to be twice as long to match the dimension of the response:

``` r
> age.joint = rep(fdgs.sub$age, 2)
> sex.joint = rep(fdgs.sub$sex, 2)
```

- Finally, we need two index vectors to indicate which coefficient from the `iid2d` model is required. These indices will be 1 for the first half of the observations (to indicate that the coefficient is `\(\beta_h\)`) and 2 for the second half (to indicate that the coefficient is `\(\beta_w\)`).

``` r
> idx.age = rep(1:2, each = n)
> idx.sex = rep(1:2, each = n)
```

---

# Model fitting

The model is fit and summarised as seen below

``` r
> # Model formula
> joint.f = y ~ -1 + I + f(idx.sex, sex, model = "iid2d", n = 2) + f(idx.age, age, model = "iid2d", n = 2)
> # Model fit
> fdgs.joint = inla(joint.f,
+   data = list(y = y, I = I, sex = sex.joint, age = age.joint, idx.age = idx.age, idx.sex = idx.sex),
+   family = rep("gaussian", 2),
+   control.predictor = list(compute = TRUE))
> # Summary fixed (intercepts)
> fdgs.joint$summary.fixed
```

```
        mean        sd 0.025quant 0.5quant 0.975quant      mode          kld
I1 80.576064 1.1586252  78.320145 80.57052  82.863385 80.570440 5.987726e-09
I2  8.248398 0.6944761   6.896143  8.24496   9.619856  8.244844 5.876125e-09
```

---

# Hyperparameters

- The hyperparameters can be obtained through

``` r
> fdgs.joint$summary.hyperpar[,1:5]
```

```
                                                  mean           sd  0.025quant    0.5quant
Precision for the Gaussian observations    0.007402459 0.0003223362 0.006784751 0.007396622
Precision for the Gaussian observations[2] 0.019709156 0.0008617376 0.018107329 0.019676897
Precision for idx.sex (component 1)        0.566326783 0.1292450436 0.366565745 0.548421459
Precision for idx.sex (component 2)        1.457373682 0.3144833499 0.927335353 1.427427363
Rho1:2 for idx.sex                         0.706425863 0.0563253871 0.577594152 0.712744034
Precision for idx.age (component 1)        0.222518895 0.0468590100 0.149488087 0.216240200
Precision for idx.age (component 2)        0.689820457 0.1228098244 0.475817181 0.680410489
Rho1:2 for idx.age                         0.886190458 0.0262330701 0.824263263 0.889790732
                                            0.975quant
Precision for the Gaussian observations    0.008053636
Precision for the Gaussian observations[2] 0.021499338
Precision for idx.sex (component 1)        0.871873607
Precision for idx.sex (component 2)        2.159541792
Rho1:2 for idx.sex                         0.798086390
Precision for idx.age (component 1)        0.332777840
Precision for idx.age (component 2)        0.957771160
Rho1:2 for idx.age                         0.926798478
```

---

# Variable effects

- The coefficients for `age` and `sex` are part of the random effects of the model:

``` r
> # Sex
> fdgs.joint$summary.random$idx.sex
```

```
  ID      mean        sd 0.025quant  0.5quant 0.975quant      mode          kld
1  1 -3.883723 0.6442724  -5.161137 -3.878537  -2.635653 -3.878438 1.732361e-08
2  2 -2.118500 0.3851571  -2.882579 -2.115280  -1.372207 -2.115121 1.629804e-08
```

``` r
> # Age
> fdgs.joint$summary.random$idx.age
```

```
  ID     mean         sd 0.025quant 0.5quant 0.975quant     mode          kld
1  1 6.023171 0.05532938   5.914660 6.023171   6.131684 6.023170 0.000000e+00
2  2 3.329908 0.03369925   3.263817 3.329908   3.395997 3.329908 7.820951e-13
```

---

# Estimates of missing values

- Finally, to access the predicted values for the children with missing data, we need to remember that we now have the response `y` stacked (first height, then weight)

- As height is the first variable, we can use `hgt.na` to get the indices of the children with missing data

- For weight, we need to add `n` (the total number of observations) to each index in `wgt.na`, so the data for weight will start at index 1044

``` r
> joint.wgt.na = wgt.na + n
```

``` r
> # Height
> fdgs.joint$summary.fitted.values[hgt.na, c("mean", "sd")][1:10,]
> # Weight
> fdgs.joint$summary.fitted.values[joint.wgt.na, c("mean", "sd")][1:10,]
```

---

# Comparing the results of the joint and separate models

.pull-left[
.red[Weight]
<img src="./img/unnamed-chunk-25-1.png" >
]

.pull-right[
.red[Height]
<img src="./img/unnamed-chunk-26-1.png" >
]

---

# References

Gómez-Rubio, V. (2020). _Bayesian inference with INLA_. CRC Press.
Gómez-Rubio, V., M. Cameletti, and M. Blangiardo (2022). "Missing data analysis and imputation via latent Gaussian Markov random fields". In: _SORT-Statistics and Operations Research Transactions_, pp. 217-244. Schönbeck, Y., H. Talma, P. Van Dommelen, et al. (2013). "The world’s tallest nation has stopped growing taller: the height of Dutch children from 1955 to 2009". In: _Pediatric Research_ 73.3, pp. 371-377. Sterne, J. A., I. R. White, J. B. Carlin, et al. (2009). "Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls". In: _BMJ_ 338. Van Buuren, S. and K. Groothuis-Oudshoorn (2011). "mice: Multivariate imputation by chained equations in R". In: _Journal of Statistical Software_ 45, pp. 1-67. Wood, A. M., I. R. White, and S. G. Thompson (2004). "Are missing outcome data adequately handled? A review of published randomized controlled trials in major medical journals". In: _Clinical trials_ 1.4, pp. 368-376.
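
---

# Appendix: simulating the missingness mechanisms

The three missingness mechanisms defined at the start of this lecture can be illustrated with a short simulation. This is a hypothetical sketch, not part of the lecture code: the variable names and the values of `\(\theta_0\)`, `\(\theta_1\)` and `\(\lambda\)` are arbitrary choices.

``` r
# Simulate a response y with a single observed covariate x
set.seed(42)
n <- 500
x <- rnorm(n)
y <- 2 + 1.5 * x + rnorm(n)

# Inverse logit, to turn linear predictors into probabilities
expit <- function(z) 1 / (1 + exp(-z))

# MCAR: logit(p) = theta0 (constant, independent of the data)
m.mcar <- rbinom(n, 1, expit(-1.5))

# MAR: logit(p) = theta0 + theta1 * x (depends only on observed data)
m.mar <- rbinom(n, 1, expit(-1.5 + x))

# MNAR: logit(p) = theta0 + theta1 * x + lambda * y (depends on the response too)
m.mnar <- rbinom(n, 1, expit(-1.5 + x + 0.8 * y))

# Apply each mechanism to the response
y.mcar <- ifelse(m.mcar == 1, NA, y)
y.mar  <- ifelse(m.mar  == 1, NA, y)
y.mnar <- ifelse(m.mnar == 1, NA, y)

# Compare the observed (complete case) means with the complete-data mean
c(truth = mean(y),
  mcar = mean(y.mcar, na.rm = TRUE),
  mar  = mean(y.mar,  na.rm = TRUE),
  mnar = mean(y.mnar, na.rm = TRUE))
```

Under MCAR the complete case mean is approximately unbiased; under MNAR high values of `y` are preferentially missing, so the observed mean is pulled downwards. Under MAR the marginal mean is also distorted, but a model that conditions on `x` (as in the imputation models of this lecture) remains valid.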