class: title-slide

# Session 5.2: Missing data imputation<span style="display:block; margin-top: 10px ;"></span>
##
###
### Imperial College London

<!-- Can also separate the various components of the extra argument 'params', eg as in ### Imperial College London, , MSc in Epidemiology -->

---
layout: true

.my-footer[
.alignleft[ © Marta Blangiardo | Monica Pirani ]
.aligncenter[ MSc in Epidemiology ]
.alignright[ Imperial College London ]
]

---

# Learning Objectives

After this session you should be able to:
<span style="display:block; margin-top: 10px ;"></span>
- Appreciate the importance of thinking about why data are missing, and of stating your modelling assumptions
<span style="display:block; margin-top: 10px ;"></span>
- Understand the disadvantages of complete case analysis
<span style="display:block; margin-top: 10px ;"></span>
- Learn how Bayesian methods can be used for handling missing data
<span style="display:block; margin-top: 10px ;"></span>
- Be able to run the above models in R-INLA
<span style="display:block; margin-top: 10px ;"></span>

The topics in this lecture are covered in Chapter 12 of Gómez-Rubio (2020): https://becarioprecario.bitbucket.io/inla-gitbook/

---

# Why do we care about missing data?

- Missing data are common!
- Usually inadequately handled in both observational and experimental research
<span style="display:block; margin-top: 10px ;"></span>
- For example, Wood, White, and Thompson (2004) reviewed 71 recently published BMJ, JAMA, Lancet and NEJM papers
  - 89% had partly missing outcome data
  - In 37 trials with repeated outcome measures, 46% performed complete case analysis
  - Only 21% reported sensitivity analysis
<span style="display:block; margin-top: 20px ;"></span>
- Sterne, White, Carlin, Spratt, Royston, Kenward, Wood, and Carpenter (2009) reviewed articles using Multiple Imputation in BMJ, JAMA, Lancet and NEJM from 2002 to 2007
  - 59 articles found, with use doubling over the six-year period
  - However, the reporting was almost always inadequate

---

# Outline

1\. [How do missing data arise?](#how)
<span style="display:block; margin-top: 30px ;"></span>
2\. [Example: children height and weight](#example)
<span style="display:block; margin-top: 30px ;"></span>
3\. [Bayesian imputation](#model)
<span style="display:block; margin-top: 30px ;"></span>
4\. [Extending the model](#extending)
<span style="display:block; margin-top: 30px ;"></span>

---
name: how

<span style="display:block; margin-top: 250px ;"></span>
.myblue[.center[.huge[
**How do missing data arise?**]]]

---

# How do missing data arise?

<center><img src=./img/Missing_DAG.png width='80%' title='Directed acyclic graph of the missingness mechanism'></center>

---

# Different types of missing data: MCAR

- There are three types of missing data, depending on why the missingness arises. Let's define `\(m_i\)` as the variable indicating whether the `\(i\)`-th observation is missing
`$$m_i \sim \text{Bernoulli}(p_i)$$`

1\. .red[Missing completely at random] (MCAR) occurs when the missingness is independent of both the observed and the unobserved data:
`$$\text{logit}(p_i) = \theta_0$$`
This means that the missing values can be ignored and the analysis can be conducted as usual.

---

# Different types of missing data: MAR and MNAR

2\.
.red[Missing at random] (MAR) occurs when the missingness depends ONLY on the observed data:
`$$\text{logit}(p_i) = \theta_0 + \bf{x}_i \mathbf{\theta}_1$$`
In this case, the missingness mechanism can be introduced into the model so that missing observations are imputed as part of the model fitting.

--

<span style="display:block; margin-top: 40px ;"></span>
3\. .red[Missing not at random] (MNAR) occurs when the missingness depends on both the observed and the missing data:
`$$\text{logit}(p_i) = \theta_0 + \bf{x}_i \mathbf{\theta}_1 + \lambda y_i$$`
This scenario is difficult to tackle since there is no information about the missingness mechanism or the missing data.

---

# Missing response or missing covariates

- Additionally, it is crucial to distinguish between missing values .blue[in the response] and .blue[in the covariates].
<span style="display:block; margin-top: 30px ;"></span>
- When the missingness is in the response (and it is MCAR or MAR), the missing values can naturally be predicted, as the distribution of the response is determined by the statistical model being fit (the posterior predictive distribution in the Bayesian approach)
<span style="display:block; margin-top: 30px ;"></span>
- Missingness in the covariates requires additional steps:
<span style="display:block; margin-top: 10px ;"></span>
  - A model needs to be specified on the covariate (imputation model) if the missingness mechanism is MCAR or MAR
<span style="display:block; margin-top: 10px ;"></span>
  - An additional model of missingness needs to be specified if the mechanism is MNAR
<span style="display:block; margin-top: 10px ;"></span>
- Here we will consider only missing values in the response
<span style="display:block; margin-top: 10px ;"></span>
- Dealing with missing values in covariates in `R-INLA` requires some additional complexity; if you are interested in learning more, see Gómez-Rubio, Cameletti, and Blangiardo (2022)

---
name: example

<span style="display:block; margin-top: 250px ;"></span>
.myblue[.center[.huge[
**Example: height and weight of children**]]]

---

# Example: height and weight of children

- Information on 10,030 children measured within the Fifth Dutch Growth Study 2009 (Schönbeck, Talma, Van Dommelen, Bakker, Buitendijk, HiraSing, and Van Buuren, 2013)
<span style="display:block; margin-top: 20px ;"></span>
- Data available as the `fdgs` dataset in the `mice` package (Van Buuren and Groothuis-Oudshoorn, 2011)
<span style="display:block; margin-top: 20px ;"></span>

<table class=" lightable-classic" style='font-family: "Arial Narrow", "Source Sans Pro", sans-serif; width: auto !important; margin-left: auto; margin-right: auto;'>
<thead>
<tr>
<th style="text-align:left;"> Variable </th>
<th style="text-align:left;"> Description </th>
<th style="text-align:right;"> Missing </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left;"> id </td>
<td style="text-align:left;"> Child ID </td>
<td style="text-align:right;"> 0 </td>
</tr>
<tr>
<td style="text-align:left;"> reg </td>
<td style="text-align:left;"> Region (5 levels) </td>
<td style="text-align:right;"> 0 </td>
</tr>
<tr>
<td style="text-align:left;"> age </td>
<td style="text-align:left;"> Age (year) </td>
<td style="text-align:right;"> 0 </td>
</tr>
<tr>
<td style="text-align:left;"> sex </td>
<td style="text-align:left;"> Sex </td>
<td style="text-align:right;"> 0 </td>
</tr>
<tr>
<td style="text-align:left;"> hgt </td>
<td style="text-align:left;"> Height (cm) </td>
<td style="text-align:right;"> 23 </td>
</tr>
<tr>
<td style="text-align:left;"> wgt </td>
<td style="text-align:left;"> Weight (kg) </td>
<td style="text-align:right;"> 20 </td>
</tr>
<tr>
<td style="text-align:left;"> hgt.z </td>
<td style="text-align:left;"> Re-scaled height (as a Z-score) </td>
<td style="text-align:right;"> 23 </td>
</tr>
<tr>
<td style="text-align:left;"> wgt.z </td>
<td style="text-align:left;"> Re-scaled weight (as a Z-score) </td>
<td style="text-align:right;"> 20 </td>
</tr>
</tbody>
</table>

---

# Summarising the data

- The data can be summarised using

``` r
> options(width = 100)
> summary(fdgs)
```

```
       id             reg            age                sex            hgt             wgt        
 Min.   :100001   North: 732   Min.   : 0.008214   boy :4829   Min.   : 46.0   Min.   :  2.585  
 1st Qu.:106353   East :2528   1st Qu.: 1.618754   girl:5201   1st Qu.: 83.8   1st Qu.: 11.600  
 Median :203855   South:2931   Median : 8.084873               Median :131.5   Median : 27.500  
 Mean   :180091   West :2578   Mean   : 8.157936               Mean   :123.9   Mean   : 32.385  
 3rd Qu.:210591   City :1261   3rd Qu.:13.547570               3rd Qu.:162.3   3rd Qu.: 51.100  
 Max.   :401955                Max.   :21.993155               Max.   :208.0   Max.   :135.300  
                                                               NA's   :23      NA's   :20       
     hgt.z               wgt.z         
 Min.   :-4.470000   Min.   :-5.04000  
 1st Qu.:-0.678000   1st Qu.:-0.62475  
 Median :-0.019000   Median : 0.02600  
 Mean   :-0.006054   Mean   : 0.04573  
 3rd Qu.: 0.677000   3rd Qu.: 0.70700  
 Max.   : 3.900000   Max.   : 4.74100  
 NA's   :23          NA's   :20        
```

- Note that several variables in the dataset have missing observations, in particular height (`hgt`) and weight (`wgt`), which are commonly used as response or predictors in models.

---

# Subsetting the data

- In order to provide a smaller dataset to speed up computations, only the children with missing values (in height or weight) and another 1,000 children taken at random will be used in the analysis

``` r
> # Subset 1, observations with NA's
> subset1 <- which(is.na(fdgs$wgt) | is.na(fdgs$hgt))
> 
> # Subset 2, random sample of 1000 individuals
> set.seed(1)
> subset2 <- sample((1:nrow(fdgs))[-subset1], 1000)
> 
> # Subset 1 + subset 2
> fdgs.sub <- fdgs[c(subset1, subset2), ]
```

---

# Summary of the data

``` r
> summary(fdgs.sub)
```

```
       id             reg           age               sex           hgt              wgt        
 Min.   :100098   North: 78   Min.   : 0.07118   boy :493   Min.   : 46.00   Min.   :  2.585  
 1st Qu.:106293   East :275   1st Qu.: 1.74264   girl:550   1st Qu.: 86.28   1st Qu.: 11.960  
 Median :204306   South:298   Median : 8.59411              Median :136.05   Median : 29.000  
 Mean   :183214   West :250   Mean   : 8.56536              Mean   :127.04   Mean   : 33.614  
 3rd Qu.:211388   City :142   3rd Qu.:14.16427              3rd Qu.:165.10   3rd Qu.: 53.100  
 Max.   :401949               Max.   :21.88364              Max.   :199.00   Max.   :117.300  
                                                            NA's   :23       NA's   :20       
     hgt.z              wgt.z         
 Min.   :-4.26300   Min.   :-4.0750  
 1st Qu.:-0.66800   1st Qu.:-0.6520  
 Median :-0.04550   Median :-0.0070  
 Mean   :-0.02313   Mean   : 0.0202  
 3rd Qu.: 0.64550   3rd Qu.: 0.6960  
 Max.   : 3.20000   Max.   : 3.6300  
 NA's   :23         NA's   :20       
```

---
name: model

<span style="display:block; margin-top: 250px ;"></span>
.myblue[.center[.huge[
**Bayesian imputation**]]]

---

# Model specification

- We first predict weight as a function of age and sex. We specify:
`$$wgt_i \sim \text{Normal}(\alpha+\beta_1 sex_i + \beta_2 age_i, \sigma^2)$$`

- Where `wgt` is missing, we will have `NA` in the corresponding vector, e.g.

``` r
> fdgs.sub$wgt[1:10]
```

```
 [1] 23.200     NA     NA     NA  8.445     NA  6.960 10.420     NA 14.100
```

- Remember the posterior predictive distribution (presented in lecture 2.2):
`$$p(\mathbf{y}^*|\mathbf{y}) = \int p(\mathbf{y}^*|\theta)p(\theta|\mathbf{y})d\theta$$`

- Here `\(\mathbf{y}^*\)` identifies the missing values, while `\(\mathbf{y}\)` is the set of observed values for `wgt`

---

# Running the model in `R-INLA`

``` r
> library("INLA")
> wgt.inla <- inla(wgt ~ age + sex, data = fdgs.sub,
+   control.predictor = list(compute = TRUE), control.compute = list(return.marginals.predictor = TRUE))
> wgt.inla$summary.fixed
```

```
                 mean         sd 0.025quant  0.5quant 0.975quant      mode          kld
(Intercept)  6.259187 0.43604778   5.403965  6.259187   7.114408  6.259187 8.791297e-12
age          3.330942 0.03369553   3.264855  3.330942   3.397029  3.330942 8.996111e-12
sexgirl     -2.378944 0.44616249  -3.254002 -2.378944  -1.503883 -2.378944 8.779814e-12
```

``` r
> wgt.inla$summary.hyperpar
```

```
                                              mean           sd 0.025quant   0.5quant 0.975quant
Precision for the Gaussian observations 0.01972868 0.0008731002 0.01805437 0.01971668 0.02147505
                                              mode
Precision for the Gaussian observations 0.01969729
```

- Note that we need to include `control.predictor = list(compute = TRUE)` so `INLA` can estimate the predictive distribution for the missing observations, while
`control.compute = list(return.marginals.predictor = TRUE)` tells `inla` that we want access to the entire posterior distribution of the predictions (rather than only the summary)

---

# Getting the posterior prediction

We now extract the indices of the children with missing values so we can report their predictive distributions:

``` r
> wgt.na = which(is.na(fdgs.sub$wgt))
> rownames(fdgs.sub)[wgt.na]
```

```
 [1] "275"  "1278" "1419" "2135" "2684" "2940" "3069" "3189" "3543" "4262" "5687" "6485" "7101"
[14] "7108" "7506" "8064" "8065" "8067" "8098" "8588"
```

``` r
> # Obtain the predictive distribution
> wgt.inla$summary.fitted.values[wgt.na, c("mean", "sd")][1:5,]
```

```
                           mean        sd
fitted.Predictor.0002 23.067927 0.3202367
fitted.Predictor.0003 27.617341 0.3330419
fitted.Predictor.0004 54.019923 0.3767163
fitted.Predictor.0006 13.144502 0.3929115
fitted.Predictor.0009  7.838159 0.3938823
```

Remember that you can also access the marginal posterior distributions (rather than the summary) using `wgt.inla$marginals.fitted.values`

---

# Imputing `hgt` using the same approach

- Similarly, a model can be fit to explain height based on age and sex and to compute the predictive distribution of the missing observations:

``` r
> hgt.inla = inla(hgt ~ age + sex, data = fdgs.sub,
+   control.predictor = list(compute = TRUE),
+   control.compute = list(return.marginals.predictor = TRUE))
> hgt.inla$summary.fixed
```

```
                  mean         sd 0.025quant  0.5quant 0.975quant      mode          kld
(Intercept) 77.148121 0.71985093  75.736271 77.148122  78.559965 77.148122 9.653372e-12
age          6.022575 0.05539618   5.913926  6.022575   6.131223  6.022575 1.041944e-11
sexgirl     -4.716077 0.72990150  -6.147629 -4.716080  -3.284512 -4.716080 8.929221e-12
```

``` r
> hgt.inla$summary.hyperpar
```

```
                                               mean           sd 0.025quant    0.5quant  0.975quant
Precision for the Gaussian observations 0.007402752 0.0003279352 0.00677109 0.007398761 0.008055958
                                               mode
Precision for the Gaussian observations 0.007386158
```

---

# Imputing `hgt` using the same approach

.pull-left[
<span style="display:block; margin-top: 24px ;"></span>

- We can obtain the predictions using

``` r
> hgt.na = which(is.na(fdgs.sub$hgt))
> hgt.inla$summary.fitted.values[hgt.na, c("mean", "sd")][1:10,]
```

```
                           mean        sd
fitted.Predictor.0001 122.24529 0.5353218
fitted.Predictor.0005  83.24902 0.6830427
fitted.Predictor.0007  74.64156 0.6783904
fitted.Predictor.0008  82.80382 0.6856290
fitted.Predictor.0010  88.26140 0.6011378
fitted.Predictor.0012  96.55557 0.6133715
fitted.Predictor.0013  86.31596 0.6656378
fitted.Predictor.0016  81.84746 0.6912339
fitted.Predictor.0017  77.75821 0.7159237
fitted.Predictor.0018  81.59988 0.6370189
```
]

.pull-right[

- We can plot the entire posterior distribution for each child with:

``` r
> # First child with missing value
> plot(hgt.inla$marginals.fitted.values[[hgt.na[1]]], type = "l")
```

<img src="./img/unnamed-chunk-12-1.png" style="display: block; margin: auto;" width="80%">
]

---
name: extending

<span style="display:block; margin-top: 250px ;"></span>
.myblue[.center[.huge[
**Extending the model**]]]

---

# Joint model of height and weight

- The two previous models consider height and weight separately, but there is clearly a high correlation between height and weight, largely driven by the age and sex of the child.

```
[1] 0.9999865
```

- We can build a joint model for height and weight to exploit a correlated effect between the coefficients of age in both models.
`\begin{align*}
hgt_i &= \alpha_h + \beta_{h1} sex_i + \beta_{h2} age_i + \epsilon_{1i}\\
wgt_i &= \alpha_w + \beta_{w1} sex_i + \beta_{w2} age_i + \epsilon_{2i}\\
\end{align*}`
with
- `\(\alpha_h,\alpha_w\)` the model intercepts
- `\(\beta_{h1},\beta_{w1}\)` the effects of sex
- `\(\beta_{h2},\beta_{w2}\)` the effects of age
- `\(\mathbf{\epsilon}_{1},\mathbf{\epsilon}_{2}\)` the error terms

(note that this specification is equivalent to the one above for the separate models)

---

# Joint model of height and weight: prior

- For `\(j = 1, 2\)` (i.e. sex and age), the pair of coefficients `\((\beta_{hj}, \beta_{wj})\)` is modeled using a bivariate Gaussian distribution with mean 0 and covariance matrix with variances `\(1/\tau_{jh}\)` and `\(1/\tau_{jw}\)`, and covariance `\(\rho_j / \sqrt{\tau_{jh} \tau_{jw}}\)`, where `\(\rho_j\)` is the correlation parameter.

- First, the bivariate response variable needs to be put in a two-column matrix, given that the model will be made of two data distributions

``` r
> n = nrow(fdgs.sub)
> y = matrix(NA, nrow = 2 * n, ncol = 2)
> y[1:n, 1] = fdgs.sub$hgt
> y[(n + 1):(2*n), 2] = fdgs.sub$wgt
```

- Similarly, as we have two intercepts, we need to define these explicitly as covariates with all values equal to one:

``` r
> I = matrix(NA, nrow = 2 * n, ncol = 2)
> I[1:n, 1] = 1
> I[(n + 1):(2*n), 2] = 1
```

---

# Correlated effects

- Now we need to define the correlated effects. We will use a random effect specification similar to what we saw with the hierarchical models (`iid`), but modified so that the two coefficients are correlated: `f(..., model = "iid2d")`.

- In order to do so we need to modify the variables `age` and `sex`, as they will be passed to the model as weights of the latent random effects in the `iid2d` specification.
We need them to be twice as long to match the dimension of the response:

``` r
> age.joint = rep(fdgs.sub$age, 2)
> sex.joint = rep(fdgs.sub$sex, 2)
```

- Finally, we need two index vectors to indicate which coefficient from the `iid2d` model is required. These indices will be 1 for the first half of the observations (to indicate that the coefficient is `\(\beta_h\)`) and 2 for the second half (to indicate that the coefficient is `\(\beta_w\)`).

``` r
> idx.age = rep(1:2, each = n)
> idx.sex = rep(1:2, each = n)
```

---

# Model fitting

The model is fit and summarised as seen below

``` r
> # Model formula
> joint.f = y ~ -1 + I + f(idx.sex, sex, model = "iid2d", n = 2) + f(idx.age, age, model = "iid2d", n = 2)
> # Model fit
> fdgs.joint = inla(joint.f,
+   data = list(y = y, I = I, sex = sex.joint, age = age.joint, idx.age = idx.age, idx.sex = idx.sex),
+   family = rep("gaussian", 2),
+   control.predictor = list(compute = TRUE))
> # Summary fixed (intercepts)
> fdgs.joint$summary.fixed
```

```
        mean        sd 0.025quant 0.5quant 0.975quant      mode          kld
I1 80.576064 1.1586252  78.320145 80.57052  82.863385 80.570440 5.987726e-09
I2  8.248398 0.6944761   6.896143  8.24496   9.619856  8.244844 5.876125e-09
```

---

# Hyperparameters

- The hyperparameters can be obtained through

``` r
> fdgs.joint$summary.hyperpar[,1:5]
```

```
                                                  mean           sd  0.025quant    0.5quant
Precision for the Gaussian observations    0.007402459 0.0003223362 0.006784751 0.007396622
Precision for the Gaussian observations[2] 0.019709156 0.0008617376 0.018107329 0.019676897
Precision for idx.sex (component 1)        0.566326783 0.1292450436 0.366565745 0.548421459
Precision for idx.sex (component 2)        1.457373682 0.3144833499 0.927335353 1.427427363
Rho1:2 for idx.sex                         0.706425863 0.0563253871 0.577594152 0.712744034
Precision for idx.age (component 1)        0.222518895 0.0468590100 0.149488087 0.216240200
Precision for idx.age (component 2)        0.689820457 0.1228098244 0.475817181 0.680410489
Rho1:2 for idx.age                         0.886190458 0.0262330701 0.824263263 0.889790732
                                            0.975quant
Precision for the Gaussian observations    0.008053636
Precision for the Gaussian observations[2] 0.021499338
Precision for idx.sex (component 1)        0.871873607
Precision for idx.sex (component 2)        2.159541792
Rho1:2 for idx.sex                         0.798086390
Precision for idx.age (component 1)        0.332777840
Precision for idx.age (component 2)        0.957771160
Rho1:2 for idx.age                         0.926798478
```

---

# Variable effects

- The coefficients for `age` and `sex` are part of the random effects of the model:

``` r
> # Sex
> fdgs.joint$summary.random$idx.sex
```

```
  ID      mean        sd 0.025quant  0.5quant 0.975quant      mode          kld
1  1 -3.883723 0.6442724  -5.161137 -3.878537  -2.635653 -3.878438 1.732361e-08
2  2 -2.118500 0.3851571  -2.882579 -2.115280  -1.372207 -2.115121 1.629804e-08
```

``` r
> # Age
> fdgs.joint$summary.random$idx.age
```

```
  ID     mean         sd 0.025quant 0.5quant 0.975quant     mode          kld
1  1 6.023171 0.05532938   5.914660 6.023171   6.131684 6.023170 0.000000e+00
2  2 3.329908 0.03369925   3.263817 3.329908   3.395997 3.329908 7.820951e-13
```

---

# Estimates of missing values

- Finally, to access the predicted values for the children with missing data, we need to remember that we now have the response `y` stacked (first height, then weight)

- As height is the first variable, we can use `hgt.na` to get the indices of the children with missing data

- For weight, we need to add `n` (the total number of observations) to each index in `wgt.na`, so the data for weight will start at index 1044

``` r
> joint.wgt.na = wgt.na + n
```

``` r
> # Height
> fdgs.joint$summary.fitted.values[hgt.na, c("mean", "sd")][1:10,]
> # Weight
> fdgs.joint$summary.fitted.values[joint.wgt.na, c("mean", "sd")][1:10,]
```

---

# Comparing the results of the joint and separate models

.pull-left[
.red[Weight]
<img src="./img/unnamed-chunk-25-1.png" >
]

.pull-right[
.red[Height]
<img src="./img/unnamed-chunk-26-1.png" >
]

---

# References

Gómez-Rubio, V. (2020). _Bayesian inference with INLA_. CRC Press.
Gómez-Rubio, V., M. Cameletti, and M. Blangiardo (2022). "Missing data analysis and imputation via latent Gaussian Markov random fields". In: _SORT-Statistics and Operations Research Transactions_, pp. 217-244. Schönbeck, Y., H. Talma, P. Van Dommelen, et al. (2013). "The world’s tallest nation has stopped growing taller: the height of Dutch children from 1955 to 2009". In: _Pediatric Research_ 73.3, pp. 371-377. Sterne, J. A., I. R. White, J. B. Carlin, et al. (2009). "Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls". In: _BMJ_ 338. Van Buuren, S. and K. Groothuis-Oudshoorn (2011). "mice: Multivariate imputation by chained equations in R". In: _Journal of Statistical Software_ 45, pp. 1-67. Wood, A. M., I. R. White, and S. G. Thompson (2004). "Are missing outcome data adequately handled? A review of published randomized controlled trials in major medical journals". In: _Clinical trials_ 1.4, pp. 368-376.
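
---

# Appendix: simulating the missingness mechanisms

The three missingness mechanisms defined at the start of this lecture can be illustrated with a short simulation. This is a hypothetical sketch, not part of the lecture code: the variable names and the values of `\(\theta_0\)`, `\(\theta_1\)` and `\(\lambda\)` are arbitrary choices.

``` r
# Simulate a response y with a single observed covariate x
set.seed(42)
n <- 500
x <- rnorm(n)
y <- 2 + 1.5 * x + rnorm(n)

# Inverse logit, to turn linear predictors into probabilities
expit <- function(z) 1 / (1 + exp(-z))

# MCAR: logit(p) = theta0 (constant, independent of the data)
m.mcar <- rbinom(n, 1, expit(-1.5))

# MAR: logit(p) = theta0 + theta1 * x (depends only on observed data)
m.mar <- rbinom(n, 1, expit(-1.5 + x))

# MNAR: logit(p) = theta0 + theta1 * x + lambda * y (depends on the response too)
m.mnar <- rbinom(n, 1, expit(-1.5 + x + 0.8 * y))

# Apply each mechanism to the response
y.mcar <- ifelse(m.mcar == 1, NA, y)
y.mar  <- ifelse(m.mar  == 1, NA, y)
y.mnar <- ifelse(m.mnar == 1, NA, y)

# Compare the observed (complete case) means with the complete-data mean
c(truth = mean(y),
  mcar = mean(y.mcar, na.rm = TRUE),
  mar  = mean(y.mar,  na.rm = TRUE),
  mnar = mean(y.mnar, na.rm = TRUE))
```

Under MCAR the complete case mean is approximately unbiased; under MNAR high values of `y` are preferentially missing, so the observed mean is pulled downwards. Under MAR the marginal mean is also distorted, but a model that conditions on `x` (as in the imputation models of this lecture) remains valid.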