class: title-slide

# Session 4.1: Hierarchical Models, Priors and Model Checking<span style="display:block; margin-top: 10px ;"></span>

### Imperial College London

<!-- Can also separate the various components of the extra argument 'params', eg as in
### Imperial College London, , MSc in Epidemiology -->

---
layout: true

.my-footer[ .alignleft[ © Marta Blangiardo | Monica Pirani ] .aligncenter[ MSc in Epidemiology ] .alignright[ Imperial College London ] ]

<style>
pre {
  overflow-x: auto;
}
pre code {
  word-wrap: normal;
  white-space: pre;
}
</style>

---

# Learning Objectives

After this session you should be able to:
<span style="display:block; margin-top: 10px ;"></span>
- Understand the different modelling assumptions for hierarchical data
<span style="display:block; margin-top: 10px ;"></span>
- Be able to specify a hierarchical model for Poisson data
<span style="display:block; margin-top: 10px ;"></span>
- Be able to perform prediction in a Bayesian approach
<span style="display:block; margin-top: 10px ;"></span>
- Distinguish and choose between several prior distributions for the precision/variance parameter
<span style="display:block; margin-top: 10px ;"></span>
- Use the WAIC as a tool for model selection.
<span style="display:block; margin-top: 10px ;"></span>

The topics treated in this lecture are covered in Chapter 5 of Blangiardo and Cameletti (2015).

---

# Outline

1\. [What are hierarchical models](#hierarchical)
<span style="display:block; margin-top: 30px ;"></span>
2\. [Different modelling assumptions](#modelling-assumptions)
<span style="display:block; margin-top: 30px ;"></span>
3\. [Parameter interpretation](#interpretation)
<span style="display:block; margin-top: 30px ;"></span>
4\. [Hierarchical regression](#Hier-regression)
<span style="display:block; margin-top: 30px ;"></span>
5\. [Choice of prior](#Prior)
<span style="display:block; margin-top: 30px ;"></span>
6\. [Model selection](#Modelselection)

---
name: hierarchical

<span style="display:block; margin-top: 250px ;"></span>
.myblue[.center[.huge[
**What are hierarchical models**]]]

---

# What are hierarchical models?

**Hierarchical model** is a very broad term that refers to a wide range of model set-ups
<span style="display:block; margin-top: 20px ;"></span>
- Multilevel models
<span style="display:block; margin-top: 10px ;"></span>
- Random effects models
<span style="display:block; margin-top: 10px ;"></span>
- Random coefficient models
<span style="display:block; margin-top: 10px ;"></span>
- Variance-component models
<span style="display:block; margin-top: 10px ;"></span>
- Mixed effect models
<span style="display:block; margin-top: 10px ;"></span>

.content-box-blue[**Key feature**: Hierarchical models are statistical models that provide a formal framework for analysis with a complexity of structure that matches the system being studied.]

---

# The hierarchical approach

- Attempts to capture (model) and understand the structure of the data
<span style="display:block; margin-top: 10px ;"></span>

--

- Is flexible:
  - all sources of correlation and heterogeneity can be incorporated in a modular fashion, in particular by the introduction of unit-specific parameters
  - can be combined with other types of models, e.g. for missing data or measurement error
<span style="display:block; margin-top: 10px ;"></span>

--

- We wish to make inference on models with many parameters `\((\lambda_1,\ldots,\lambda_N)\)` measured on N units (individuals, areas, time-points, etc.)
which are related or connected by the structure of the problem.
<span style="display:block; margin-top: 10px ;"></span>

--

- Unit-specific parameters will .red[borrow strength] from the corresponding parameters associated with the other units

---

# Motivating example: Disease mapping

- To summarise spatial and spatio-temporal variation in disease risk
- **Question**: Which areas have particularly high or low disease rates?
- **Question**: Can we explain some of the variation in disease rates by area-level covariates?

--

- Data are the observed `\((y_{i})\)` and expected number of cases in area `\(i\)`: `\(E_{i} = \sum_k n_{ik} r_k\)`, where `\(r_k\)` is the reference rate for stratum `\(k\)` (age, sex, ...)
- Rare disease and/or small areas: Poisson framework
`$$y_i \sim \text{Poisson}(\rho_i E_i)$$`
<span style="display:block; margin-top: -10px ;"></span>
where `\(\rho_i\)` is the **unknown RR** in area `\(i\)`

.content-box-beamer[
### Non smoothed estimates of the RR (SMR or SIR)
`\begin{align*}
\text{SMR}_i &=\frac{y_i}{E_i}\\
\hbox{Var}(\hbox{SMR}_i) &= \frac{y_i}{E_i^2}
\end{align*}`
]
<span style="display:block; margin-top: 10px ;"></span>

--

- .red[very imprecise: areas with small] `\(\class{red}{E_i}\)` .red[have a high associated variance]
- .red[estimated independently: makes no use of risk estimates in other areas of the map]

---

# Motivating example: Disease mapping

*Example*:
<span style="display:block; margin-top: 10px ;"></span>
- observed cases of lip cancer `\(y_i\)` diagnosed in Scotland in 1975-1980 at county level, `\(i=1,\ldots,56\)` areas
<span style="display:block; margin-top: 10px ;"></span>
- expected numbers of cases `\(E_i\)` are also available, computed using age/sex standardised reference rates and population counts

--

Assume a Poisson likelihood for the disease counts in each area:
`$$y_i\sim \text{Poisson}(\lambda_i)\qquad\qquad \lambda_i = \rho_i E_i \qquad\qquad i=1,\ldots,56$$`
<span style="display:block; margin-top: 10px ;"></span>
- We have 56 parameters `\(\rho_i\)` (one for each area). What prior do we specify on `\(\rho_i\)`?

---

# Expected numbers of cases - definition

- Expected number of cases if the population had the same stratum-specific mortality/incidence rates as in a reference area
- Adjustments (strata): age, gender ...
<span style="display:block; margin-top: 10px ;"></span>

*Indirect standardisation*: `\(E_i = \sum_k n_{ik} r_k\)` with
- `\(r_k\)`: disease rate for stratum `\(k\)` in the reference population
- `\(n_{ik}\)`: population at risk in area `\(i\)`, stratum `\(k\)`
<span style="display:block; margin-top: 20px ;"></span>

.red[If internal comparison:] `\(\color{red}{\sum_{i=1}^N O_i = \sum_{i=1}^N E_i}\)`
<span style="display:block; margin-top: 10px ;"></span>

<span style="display:block; margin-top: 20px ;"></span>
*Direct standardisation*: apply the disease rate in the population of interest (e.g. the UK) to a standard population, e.g. the European standard population
<span style="display:block; margin-top: 10px ;"></span>

*External comparison*: used when the reference population is not the population of the study of interest. For example, to calculate the expected numbers in London, risks in England could be used.
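To make the indirect standardisation formula concrete, here is a minimal sketch in R; the stratum populations and reference rates are made-up numbers, purely for illustration.

``` r
# Indirect standardisation for a single area, with three hypothetical age strata
n_ik <- c(1000, 800, 500)       # population at risk in area i, by stratum k
r_k  <- c(0.001, 0.005, 0.020)  # stratum-specific rates in the reference population
E_i  <- sum(n_ik * r_k)         # expected number of cases: 1 + 4 + 10 = 15
```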
---

# Expected numbers of cases - calculation

<center><img src=./img/Expected.png width='60%' title='Lung cancer incidence in males, all ages, using the rates in England and Wales as reference, for the period 1985-2009'></center>
<span style="display:block; margin-top: -10px ;"></span>
`$$\hbox{SIR}_A=\frac{118}{126.38}=0.93$$`
<span style="display:block; margin-top: -10px ;"></span>
- There are fewer incident cases of lung cancer for males in ward A than expected under the England and Wales (EW) rates, after adjusting for differences in age.
- In R we can perform indirect standardisation using the package `SpatialEpi` (we will see it in the Practical in week 6).

---
name: modelling-assumptions

<span style="display:block; margin-top: 250px ;"></span>
.myblue[.center[.huge[
**Modelling assumptions**]]]

---

# Different modelling assumptions

.content-box-beamer[
### Identical parameters
- Assume `\(\rho_i = \rho\)` `\(\rightsquigarrow\)` all the data can be pooled and the individual areas ignored.
- Assume a prior `\(\rho \sim \text{Gamma}(1,1)\)` `\(\rightsquigarrow\)` conjugate prior
]

--

<span style="display:block; margin-top: 20px ;"></span>
- One parameter generates all the observations
<span style="display:block; margin-top: 10px ;"></span>
- Very easy to implement as it is conjugate (no need for INLA) and all the data are .red[pooled] to produce one estimate of the parameter of interest
<span style="display:block; margin-top: 10px ;"></span>
- Can be unrealistic (it does not take into account differences between the areas)

---

# Different modelling assumptions

.content-box-beamer[
### Independent parameters
- All the `\(\rho_i\)` are unrelated, meaning that the areas are analysed independently
- Assume a prior `\(\rho_i \sim \text{Gamma}(1,1); \qquad i=1,\ldots,56\)` `\(\rightsquigarrow\)` individual estimates of `\(\rho_i\)` are likely to be highly variable (unless sample sizes are very large)
]

--

<span style="display:block; margin-top: 20px ;"></span>
- Every area is treated separately (no exchange of information between areas). Estimates are close to the SMR `\((\rho_i \approx y_i / E_i)\)`.
<span style="display:block; margin-top: 10px ;"></span>
- Again there is no need for INLA, as conjugacy can be exploited.

---

# Different modelling assumptions

.content-box-beamer[
### Similar (exchangeable) parameters
- All the `\(\rho_i\)` are assumed to be *similar* `\(\rightsquigarrow\)` they come from the same distribution (are generated by the same parameters)
- Assume a hierarchical prior `\(\rho_i \sim \text{Gamma}(a,b)\)` where `\(a\)` and `\(b\)` are unknown parameters and need to be estimated.
]

--

<span style="display:block; margin-top: 20px ;"></span>
- Different levels of analysis
- Allow the exchange of information between different levels as they are all connected to each other
- Assign hyperprior distributions to `\(a\)` and `\(b\)`, for instance
`$$a \sim \text{Exp}(1); b\sim \text{Exp}(1)$$`

---

# Graphical representation of lip cancer hierarchical model

<center><img src=./img/DAG.png width='100%' title=''></center>

---

# A more flexible hierarchical prior for the relative risks

- A gamma random effect prior for the `\(\rho_i\)` is mathematically convenient, but might be restrictive:
  - Covariate adjustment is difficult
  - Not possible to allow for spatial correlation between risks in nearby areas

--

- A Normal random effect prior on the `\(\log \rho_i\)` is more flexible:
`\begin{align*}
y_i &\sim \text{Poisson}(\lambda_i = \rho_i E_i)\\
\eta_i &= \log \rho_i = b_0 + v_i\\
v_i &\sim \text{Normal}(0, \sigma^2_v)
\end{align*}`

--

- Need to specify hyperprior distributions for:
  - `\(\sigma^2_v\)` (between-area variance), e.g. `\(1/\sigma^2_v \sim \text{Gamma}(1,0.001)\)`
  - `\(b_0\)` (mean log relative risk), e.g. `\(b_0 \sim \text{Normal}(0,0.0001)\)`
<span style="display:block; margin-top: -10px ;"></span>

--

### Advantages of this approach:
<span style="display:block; margin-top: -20px ;"></span>
The posterior for each `\(v_i\)` *borrows strength* from the likelihood contributions of **all** the areas, via their joint influence on the estimate of the unknown population (prior) parameter `\(\sigma^2_v\)` → *global smoothing* of the area RR → reflects our *full uncertainty* about the true values of `\(\sigma^2_v\)`

.content-box-blue[
Such models are called .red[*Hierarchical*] or .red[*Random effects*] or .red[*Multilevel*] models
]

---
name: interpretation

<span style="display:block; margin-top: 250px ;"></span>
.myblue[.center[.huge[
**Interpretation**]]]

---

# Parameter interpretation and useful quantities

- `\(\rho_i\)` is the relative risk for area `\(i\)` compared to the average area with the same structure used in the expected values.
- `\(v_{i}\)` is the random effect for area `\(i\)`; it can also be seen as a latent variable which captures the effect of unknown or unmeasured area-level covariates.
- If area-level covariates are spatially structured we should take this into account when modelling `\(v_i\)` (we will see this later)
- `\(\text{exp}(v_{i})\)` is the relative risk in area `\(i\)` compared to the risk for the whole study region
- The variance of the random effects `\(\sigma^2_v\)` reflects the amount of extra-Poisson variation in the data

---

# Lip cancer dataset

``` r
> LipCancer <- read.csv("scotlip.csv")
> LipCancer
```

```
# A tibble: 6 × 11
  CODENO       AREA PERIMETER RECORD_ID DISTRICT NAME          CODE      y    POP     E     x
   <int>      <dbl>     <dbl>     <int>    <int> <chr>         <chr> <int>  <int> <dbl> <int>
1   6126  974002000    184951         1        1 Skye-Lochalsh w6126     9  28324  1.38    16
2   6016 1461990000    178224         2        2 Banff-Buchan  w6016    39 231337  8.66    16
3   6121 1753090000    179177         3        3 Caithness     w6121    11  83190  3.04    10
4   5601  898599000    128777         4        4 Berwickshire  w5601     9  51710  2.53    24
5   6125 5109870000    580792         5        5 Ross-Cromarty w6125    15 129271  4.26    10
6   6554  422639000    118433         6        6 Okney         w6554     8  53199  2.4     24
```

- `DISTRICT` identifies the area
- `y` identifies the counts of cancer cases
- `E` identifies the expected cases of cancer using the entire region under study as reference
- `x` identifies the exposure to sun (percentage of agriculture, farming and fishery workers)

---

# In `R-INLA`

We first populate the `formula` environment

``` r
> formula.inla <- y ~ 1 +
+   f(RECORD_ID,model="iid", hyper=list(prec=list(prior="loggamma",
+   param=c(1,0.01))))
```

- The model specification is exactly the same as for a GLM;
- Anything with `f(.)` specifies a random effect; in this case `iid` represents the exchangeable structure.

Then we run the model through

``` r
> lipcancer.poisson <- inla(formula.inla,family="poisson",
+   data=LipCancer, E=E,
+   control.predictor=list(compute=TRUE),
+   control.compute=list(config=TRUE),
+   control.fixed=list(mean.intercept=0,prec.intercept=0.00001))
```
<span style="display:block; margin-top: -10px ;"></span>
Note that
- `control.fixed` allows us to specify the parameters of the prior for the fixed effects (intercept)
- `control.predictor` tells `INLA` to compute the posterior marginals of the linear predictor (useful for prediction - see later)
- `control.compute` allows us to include model selection indexes, as well as to draw samples from the joint posterior

---

# Results for lip cancer in Scotland example

- `\(\text{exp}(b_0 + v_i)\)` is the relative risk of lip cancer in area `\(i\)` relative to the average area with the same age/sex structure (see map)
- `\(\sigma_v\)` is the between-area standard deviation of the log relative risk of lip cancer
- As INLA returns the precision, we need to convert it into a standard deviation using

``` r
> sigma.v<- inla.tmarginal(function(x) sqrt(1/x),
+   lipcancer.poisson$marginals.hyperpar[[1]])
```

And we can calculate quantiles with

``` r
> inla.qmarginal(seq(0,1,0.2),sigma.v)
```

```
[1] 0.5014723 0.6745473 0.7256807 0.7734031 0.8335807 1.1506232
```

---

# Maps: comparing SMR with smoothed estimates

.pull-left[
### SMR
<span style="display:block; margin-top: -10px ;"></span>
<img src="./img/ggplot1-1.png" >
]

.pull-right[
### Posterior mean
<span style="display:block; margin-top: -10px ;"></span>
<img src="./img/ggplot2-1.png" >
]

<!--
# Quantile ratios

To obtain the quantile ratio we need to follow these steps:

1\.
Obtain the **joint posterior distribution** for the model under consideration
<span style="display:block; margin-top: -10px ;"></span>

``` r
> joint.post <- inla.posterior.sample(100,lipcancer.poisson)
> names(joint.post[[1]])
```

```
[1] "hyperpar" "latent"   "logdens"
```

``` r
> joint.post[[1]]$latent[1:3,]
```

```
Predictor:1 Predictor:2 Predictor:3
   1.799016    1.697503    1.248352
```
<span style="display:block; margin-top: -10px ;"></span>
Note that:
- `joint.post` is a list of 100 elements and each element includes a value from

1\. the joint posterior distribution for the hyperparameters `joint.post$hyperpar`

2\. the joint posterior distribution for the linear predictor `\(\bm{\eta}\)` in `joint.post$latent` (rows 1 to N)

3\. the joint posterior distribution for the random effects `\(\bm{v}\)` in `joint.post$latent` (rows N+1 to 2N)
-->

<!--
# Quantile ratios

2\. For each iteration rank the areas based on their `\(v_i\)` values

``` r
> joint.v <- matrix(NA,56,100)
> for(i in 1:100){
+   joint.v[,i]<- joint.post[[i]]$latent[57:112]
+ }
```

- Calculate `\(v_3\)` and `\(v_{53}\)` (5% and 95%) and build the ratio

``` r
> v5perc <- apply(joint.v,2, function(x) quantile(x,0.05))
> v95perc <- apply(joint.v,2, function(x) quantile(x,0.95))
> QR90<- mean(exp(v95perc-v5perc))
> QR90
```

```
[1] 10.82089
```

- The `\(QR90\)` points towards a large spatial variability.
-->

---

# SMR versus posterior mean RR for selected areas

<center><img src=./img/shrinkage-1.png width='50%' title='SMR versus posterior mean RR for selected areas'></center>

- Comparing the SMR and the area-level posterior mean from the model shows a shrinkage towards the global (national) mean

---
name: Hier-regression

<span style="display:block; margin-top: 250px ;"></span>
.myblue[.center[.huge[
**Hierarchical Regression**]]]

---

# Regression in `INLA`

It is easy to move from hierarchical models to regression models with random effects.
<span style="display:block; margin-top: 10px ;"></span>
<!-- **Example**: We consider a dataset which contains measurements on whether Swiss Willow Tits were found on three separate visits in several locations across Switzerland. We want to understand the relationship between the presence of the birds, elevation and forest cover. -->
**Example**: Still considering the Lip cancer dataset, we want to evaluate whether the variable `\(x\)`, which represents the percentage of agriculture, farming and fishery workers, is associated with the outcome. The hypothesis is that exposure to sun could be a risk factor for lip cancer.

```
# A tibble: 6 × 11
  CODENO       AREA PERIMETER RECORD_ID DISTRICT NAME          CODE      y    POP     E     x
   <int>      <dbl>     <dbl>     <int>    <int> <chr>         <chr> <int>  <int> <dbl> <int>
1   6126  974002000    184951         1        1 Skye-Lochalsh w6126     9  28324  1.38    16
2   6016 1461990000    178224         2        2 Banff-Buchan  w6016    39 231337  8.66    16
3   6121 1753090000    179177         3        3 Caithness     w6121    11  83190  3.04    10
4   5601  898599000    128777         4        4 Berwickshire  w5601     9  51710  2.53    24
5   6125 5109870000    580792         5        5 Ross-Cromarty w6125    15 129271  4.26    10
6   6554  422639000    118433         6        6 Okney         w6554     8  53199  2.4     24
```

--

We specify a random effect Poisson model
`\begin{eqnarray*}
y_i &\sim& \text{Poisson}(E_i \rho_i)\\
\text{log}(\rho_i) &=& b_0 + \beta_1 x_{i} + v_i\\
v_i &\sim& \text{Normal}(0, \sigma^2_v)
\end{eqnarray*}`
where `\(x_{i}\)` is the exposure to sun in each area. Note that `\(b_0\)`, `\(\beta_1\)`, `\(1/\sigma^2_v\)` are given independent "noninformative" priors.
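To make the data-generating process explicit, here is a minimal simulation sketch of this model; it reuses `E` and `x` from `LipCancer`, and the values of `b0`, `beta1` and `sigma_v` are purely illustrative (not the posterior estimates).

``` r
# Simulate counts from the random effect Poisson model above
set.seed(44)
b0      <- -0.5                                               # illustrative intercept
beta1   <- 0.07                                               # illustrative covariate effect
sigma_v <- 0.5                                                # illustrative random effect sd
v     <- rnorm(nrow(LipCancer), mean = 0, sd = sigma_v)       # area random effects
rho   <- exp(b0 + beta1 * LipCancer$x + v)                    # relative risks
y.sim <- rpois(nrow(LipCancer), lambda = LipCancer$E * rho)   # simulated counts
```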
---

# `R-INLA` code

.pull-left[

``` r
> formula.reg.inla <- y ~ 1 + x +
+   f(RECORD_ID,model="iid", hyper=list(prec=list(prior="loggamma",
+   param=c(1,0.01))))
>
> lipcancer.reg <- inla(formula.reg.inla,family="poisson",
+   data=LipCancer, E=E,
+   control.predictor=list(compute=TRUE),
+   control.compute=list(config=TRUE),
+   control.fixed=list(mean.intercept=0,prec.intercept=0.00001))
>
> lipcancer.reg$summary.fixed
```

```
                   mean         sd 0.025quant    0.5quant  0.975quant        mode          kld
(Intercept) -0.49119663 0.15719392 -0.8050305 -0.48970373 -0.18591772 -0.48973949 4.376006e-08
x            0.06841513 0.01402741  0.0408243  0.06840119  0.09608742  0.06840194 3.293171e-08
```
]

.pull-right[
Plot the random effects for the disease mapping model and for the ecological regression

``` r
> plot(lipcancer.poisson$summary.random$RECORD_ID$mean,lipcancer.reg$summary.random$RECORD_ID$mean, ylim=c(-1.5,1.5),xlim=c(-1.5,1.5))
> abline(0,1)
```

<img src="./img/swt-plot-1.png" style="display: block; margin: auto;" width="75%">
]

---

# Comparing maps

.pull-left[
### Random effect `\(v_i\)` from disease mapping
<center><img src=./img/ggplot_reg-1.png width='80%' title='Random effect from disease mapping'></center>
]

.pull-right[
### Random effect `\(v_i\)` from ecological regression
<span style="display:block; margin-top: -10px ;"></span>
<center><img src=./img/ggplot_reg-2.png width='80%' title='Random effect from ecological regression'></center>
]

---
name: Prior

<span style="display:block; margin-top: 250px ;"></span>
.myblue[.center[.huge[
**Choice of prior**]]]

---

# How to specify priors?

- *Relatively* easy to specify priors on regression parameters
  - Typical choice is a Normal distribution
  - By tuning the variance it can be made more or less informative
  - .alert[The scale of the variable it represents needs to be considered]

--

<span style="display:block; margin-top: 40px ;"></span>
- Variances are more complex (and a bit more controversial)
  - In small area studies we usually work with Poisson/Binomial distributions for the data, which have no variance parameter; the main interest is in the random effect variance.

--

<span style="display:block; margin-top: 40px ;"></span>
- A Gamma `\((\epsilon,\epsilon)\)` can be used on the precision, but inference could be sensitive to the choice of `\(\epsilon\)`. Typically, to ensure vague priors, small values of `\(\epsilon\)` are specified (e.g. 0.1, 0.01). However, this prior has also been criticised (e.g. Gelman, 2006) as it has a spike at values around 0.

--

- .red[Be careful, as "non informative" prior distributions are sensitive to changes of scale.]

---

# Changing the scale

- For instance, starting with a Uniform prior on the standard deviation we end up with a high density on low values of the precision

<img src="./img/tau_plot-1.png" style="display: block; margin: auto;" width="60%">

---

# Remember...

- `INLA` parametrises models in terms of the precision and the default prior is `\(\log \left(1/\sigma^2\right) \sim \text{logGamma}(1,0.00005)\)`
- However, alternatives can be built, for instance:
  - Truncated Normal on the log precision (`logtnormal`)
  - Uniform prior on the standard deviation: as it is not implemented we need to specify it through an `expression`, as follows (see the sketch below)
  `UN.prior = "expression: log_dens = 0 - log(2) - theta / 2; return(log_dens);"`

<span style="display:block; margin-top: 50px ;"></span>
.content-box-blue[In general we need to be careful to check the level of information (weakly or strongly informative) on the scale we are interested in (e.g. the variance) and see what this corresponds to on the standard deviation/precision scale (on which the prior is usually specified).
]
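As an illustration of how such an `expression` prior can be passed to `f()` - a sketch only, reusing `LipCancer` from the earlier slides; `formula.unif` and `lipcancer.unif` are hypothetical object names:

``` r
# Uniform prior on the standard deviation, written as an "expression:" prior
# on theta = log(precision), and plugged into the iid random effect
UN.prior <- "expression: log_dens = 0 - log(2) - theta / 2; return(log_dens);"

formula.unif <- y ~ 1 + f(RECORD_ID, model = "iid",
                          hyper = list(prec = list(prior = UN.prior)))

lipcancer.unif <- inla(formula.unif, family = "poisson",
                       data = LipCancer, E = E,
                       control.predictor = list(compute = TRUE))
```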
See Gómez-Rubio (2020) for more information on how to specify priors in `INLA`.

---
name: Modelselection

<span style="display:block; margin-top: 250px ;"></span>
.myblue[.center[.huge[
**Model selection**]]]

---

# Which model?

.center[.content-box-blue[.Large[**All models are wrong, some models are useful.** G. Box]]]

--

<span style="display:block; margin-top: 30px ;"></span>
So the question is: how is my model doing?
<span style="display:block; margin-top: 20px ;"></span>
1. in terms of model assumptions
<span style="display:block; margin-top: 20px ;"></span>
2. compared to other models

--

We can answer the first question using the .red[posterior predictive distribution]
<span style="display:block; margin-top: 30px ;"></span>
We can answer the second question using .red[methods based on the trade-off between a measure of model fit and a measure of model complexity]

---

# Posterior predictive distribution

.red[Main idea: If the combined model assumptions are reasonable, then our posterior model should be able to simulate data that are similar to the original data]

- Let's assume we want to find the relationship between air pollution and asthma attacks and we collect data on the outcome over 500 days in a London Local Authority

--

We propose the following model (y = number of asthma attacks, x = level of `\(PM_{10}\)` in the previous 3 days, `\(i=1,\ldots,500\)`):
`\begin{align}
y_i &\sim \text{Poisson}(Pop \rho_i)\\
\text{log}(\rho_i) &= b_0 + \beta x_i + v_i\\
b_0, \beta &\sim N(0,0.001)\\
v_i &\sim N(0,\sigma^2_{v})\\
\text{log}(1/\sigma^2_{v}) &\sim \text{logGamma}(1,0.00005)
\end{align}`

The assumptions are

1. that the data are distributed as .red[Poisson]
2. that there is a linear relationship between air pollution and the log number of asthma attacks
3. that the days are similar (we include a random effect)

---

# Posterior predictive distribution

- We run the model and predict observations `\(y^*_1,\ldots,y^*_{500}\)` based on the posterior distribution of the parameters (note that we need to include `control.predictor` in the `inla` function to access these):

``` r
> asthma_formula1 <- y ~ x + as.factor(dow) + f(ID, model="iid")
> asthma_model1 <- inla(asthma_formula1, data=data,family="poisson",
+   control.predictor=list(compute=TRUE),control.compute=list(config=TRUE,
+   waic=TRUE, return.marginals.predictor=TRUE))
```

Then to get the `\(y^*_i\)` values we need to incorporate an additional step, getting a posterior sample from the fitted values using

.pull-left[

``` r
> post<- inla.posterior.sample(100,asthma_model1)
> names(post[[1]])
```

```
[1] "hyperpar" "latent"   "logdens"
```

``` r
> post[[1]]$latent[1:3,]
```

```
Predictor:1 Predictor:2 Predictor:3
  1.1431768   0.8692804   1.8645721
```
]

.pull-right[
<span style="display:block; margin-top: -10px ;"></span>
Note that:
- `post` is a list of 100 elements and each element includes a value from

1\. the joint posterior distribution for the hyperparameters `post$hyperpar`

2\. the joint posterior distribution for the linear predictor `\(\bm{\eta}\)` in `post$latent` (rows 1 to N)

3\.
the joint posterior distribution for the random effects `\(\bm{v}\)` in `post$latent` (rows N+1 to 2N)
]

---

# Posterior predictive distribution

Then we sample from a Poisson whose mean is the joint posterior sample of the linear predictor (exponentiated to go back to the original scale)

``` r
> # Exponentiate
> predlist <- do.call(cbind, lapply(post,
+   function(X) exp(X$latent[startsWith(rownames(X$latent), "Pred")])))
> # Sample from the Poisson
> pois.samples <- apply(predlist, 2, function(Z) rpois(n = length(Z), lambda = Z))
> pois.samples <- as.data.frame(pois.samples)
```

.pull-left[

```
[1] 500 100
```

```
  V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1  3  4  2  8  2  0  7  1  3   2
2  4  0  3  2  4  5  2  3  2   0
3  4  4  4  7  3  7  5  5  5   3
```
]

.pull-right[
Now we plot it against the original data

<img src="./img/unnamed-chunk-9-1.png" style="display: block; margin: auto;" width="75%">
]

---

# Posterior predictive distribution

Now let's assume we run a different model
`\begin{align}
y_i &\sim \text{Normal}(\theta_i,\sigma^2)\\
\theta_i &= b_0 + \beta x_i\\
b_0, \beta &\sim N(0,0.001)
\end{align}`

The assumptions are

1. that the data are distributed as .red[Gaussian]
2. that there is a linear relationship between air pollution and the number of asthma attacks

---

# Posterior predictive distribution

.pull-left[
- We run the model and predict observations `\(y^*_1,\ldots,y^*_{500}\)` based on the posterior distribution of the parameters, similarly to the previous model

``` r
> asthma_formula2 <- y ~ x
> asthma_model2 <- inla(asthma_formula2, data=data,family="Gaussian",
+   control.predictor=list(compute=TRUE),
+   control.compute=list(config=TRUE,waic=TRUE, return.marginals.predictor=TRUE))
```
]

.pull-right[
<img src="./img/unnamed-chunk-12-1.png" style="display: block; margin: auto;" width="75%">
]

---

# Model comparison: Bayesian p-value

- We can use the posterior predictive distribution and compare it to the observed data through a *p-value*

.pull-left[

``` r
> # Bayesian p-values: column 1 for model 1, column 2 for model 2
> # (gaussian.samples holds the posterior predictive samples from model 2)
> Bayesian_p <- matrix(NA,500,2)
>
> # Proportion of times the prediction is smaller than the observed data
> for(i in 1:500){
+   Bayesian_p[i,1] <- sum(pois.samples[i,]<y[i])/100
+   Bayesian_p[i,2] <- sum(gaussian.samples[i,]<y[i])/100
+ }
```
]

.pull-right[
<img src="./img/unnamed-chunk-14-1.png" style="display: block; margin: auto;" width="75%">
]

- Ideally we would expect a uniform distribution of the p-values, which would tell us there is no pattern of over- (or under-) estimation in the prediction
- Here model 1 seems far better than model 2

---

# Model comparison: fit vs complexity

- When the interest lies mainly on the prior distribution or on the functional form of some parameters, the deviance of the model can be used to evaluate the goodness of fit. Given the data `\(\bm{y}\)` with distribution `\(p(\bm{y}\mid \theta)\)`, the deviance of the model is defined as:
`\begin{eqnarray*}
D(\theta) = -2 \hbox{log} p(\bm{y} \mid \theta)
\end{eqnarray*}`
where `\(\theta\)` identifies the parameter of the likelihood

--

- Example:
`\(y_i \sim \hbox{Binomial}(n_i, \theta_i) \rightsquigarrow p(\mathbf{y}\mid \theta) = \prod_{i=1}^n \left( \begin{array}{c} n_i \\ y_i \end{array} \right) \theta_i^{y_i} (1-\theta_i)^{n_i-y_i}\)`

`\(\displaystyle D(\theta) = -2\left[\sum_i y_i \log \theta_i + (n_i-y_i)\log(1-\theta_i)+ \log\left(\begin{array}{c}n_i\\y_i \end{array}\right)\right]\)`

---

# Mean deviance

- The deviance of the model measures the variability linked to the likelihood, i.e. the probabilistic structure used for the observations (conditional on the parameters)
- This quantity is a random variable in the Bayesian framework, so it is possible to synthesise it through several indexes (mean, median, etc.)
- Many authors suggested using the posterior mean deviance `\(\overline{D} = E_{\theta\mid y} [D(\theta)]\)` as a measure of fit

**DRAWBACK:** more complex models will fit the data better and so will have a smaller `\(\overline{D}\)`

- Need to have some measure of *model complexity* to trade off against `\(\overline{D}\)`

---

# Watanabe AIC - WAIC

- A natural way to compare models is to use a criterion based on the trade-off between the fit of the data to the model and the corresponding complexity of the model
- Let `\(m_i\)` and `\(v_i\)` be the posterior mean and variance of the log predictive density for the `\(i^{th}\)` unit
- The effective model size is
`$$p_W = \sum_{i=1}^nv_i$$`
<span style="display:block; margin-top: 10px ;"></span>
- The criterion is
`$$WAIC = -2\sum_{i=1}^nm_i + 2p_W$$`
- The WAIC is readily available in `INLA` using `control.compute=list(waic=TRUE)`
- Linked to cross-validation
- WAIC has a model-fit and a model-complexity component
- Smaller WAIC indicates the preferred model

---

# Back to our example...

We run the model adding the `waic` (here for model 1, it is the same for model 2):

``` r
> asthma_model1 <- inla(asthma_formula1, data=data,family="poisson",control.predictor=list(link=1,compute=TRUE),
+   control.compute=list(waic=TRUE))
```

And now check the value of the WAIC
<span style="display:block; margin-top: 20px ;"></span>

``` r
> # Poisson data distribution
> asthma_model1$waic$waic
```

```
[1] 1956.986
```

``` r
> # Normal data distribution
> asthma_model2$waic$waic
```

```
[1] 2163.642
```

The first model is still preferred as the WAIC is smaller.

---

# Summary

- Hierarchical models allow **borrowing of strength** across units
→ the posterior distribution of each unit-specific parameter borrows strength from the likelihood contributions for all the units, via their joint influence on the posterior estimates of the unknown hyper-parameters
→ improved efficiency

- Judgements of exchangeability need careful assessment
→ units suspected a priori to be systematically different might be modelled by including relevant covariates so that residual variability more plausibly reflects exchangeability

- Subgroups of prior interest should be considered separately

--

Be careful with the prior specification
- non informative on one scale might be informative on another
- always run some sensitivity analyses, changing the prior and investigating how this affects the estimates of the parameters of interest
- the posterior predictive distribution is useful to check if a model is in line with the data under study
- the WAIC is a useful tool for model selection, easy to calculate in INLA → bear in mind that it can only be used to compare models; similarly to the AIC, it does not have an absolute meaning.

---

# References

Blangiardo, M. and M. Cameletti (2015). _Spatial and spatio-temporal Bayesian models with R-INLA_. John Wiley & Sons.

Gelman, A. (2006).
"Prior distributions for variance parameters in hierarchical models". In: _Bayesian Analysis_ 1".3", pp. 515-534. Gómez-Rubio, V. (2020). _Bayesian inference with INLA_. CRC Press.