class: title-slide

# Session 2.1: Bayesian inference
<span style="display:block; margin-top: 10px ;"></span>

### Imperial College London

---
layout: true

.my-footer[
.alignleft[ © Marta Blangiardo | Monica Pirani ]
.aligncenter[ MSc in Epidemiology ]
.alignright[ Imperial College London ]
]

<style>
pre {
  overflow-x: auto;
}
pre code {
  word-wrap: normal;
  white-space: pre;
}
</style>

---

# Learning Objectives

After this session you should be able to:
<span style="display:block; margin-top: 10px ;"></span>

- Understand how Bayes Theorem can be applied to random variables
<span style="display:block; margin-top: 10px ;"></span>

- Describe what conjugacy means
<span style="display:block; margin-top: 10px ;"></span>

- Obtain the posterior distribution for the Beta-Binomial and Gamma-Poisson families
<span style="display:block; margin-top: 10px ;"></span>

The topics treated in this lecture are presented in Chapter 3 of the book by Blangiardo and Cameletti (2015) and in Sections 2.3, 3.1-3.4 and 5.2-5.3 of the book by Johnson, Ott, and Dogucu (2022).

---

# Outline

1\. [Bayes Theorem for random variables](#BayRandom)
<span style="display:block; margin-top: 30px ;"></span>

2\. [A quick recap](#recap)
<span style="display:block; margin-top: 30px ;"></span>

3\. [What is conjugacy?](#Conjugacy)
<span style="display:block; margin-top: 30px ;"></span>

4\. [Some conjugate models: Beta-Binomial](#Beta-Binomial)
<span style="display:block; margin-top: 30px ;"></span>

5\. [Some conjugate models: Gamma-Poisson](#PoissonGamma)

---
name: BayRandom

<span style="display:block; margin-top: 250px ;"></span>

.myblue[.center[.huge[ **Bayes Theorem for random variables**]]]

---

# An example

- A clinical trial is carried out to assess the efficacy of a preventive treatment for migraine
<span style="display:block; margin-top: 20px ;"></span>

- A group of 20 patients is offered the new drug, and each patient has to report whether they have a migraine episode in the week after taking the medication
<span style="display:block; margin-top: 20px ;"></span>

- Our aim is to estimate the probability of success of this new drug `\((\theta)\)`
<span style="display:block; margin-top: 20px ;"></span>

- Our data here is `\(Y\)`: the number of patients reporting no migraine episodes
<span style="display:block; margin-top: 20px ;"></span>

Note that `\(Y\)` is a **random variable** (see recording 3 for a more detailed description of this concept)
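---

# An example

To make the idea of `\(Y\)` as a random variable concrete, we can simulate the trial. This is a purely hypothetical illustration (not part of the study), assuming for the moment a true success probability of `\(\theta = 0.4\)`:

``` r
> # Five hypothetical replicates of the trial: each value is the
> # number of successes among n = 20 patients when theta = 0.4
> set.seed(44)
> rbinom(5, size = 20, prob = 0.4)
> # each replicate returns a different count between 0 and 20
```

Before the trial is run, the value of `\(Y\)` is uncertain; the Binomial model introduced below describes this uncertainty for a given `\(\theta\)`.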
---

# Prior probability model

As a first step we need to assign a prior on `\(\theta\)`.
As a simplification, let's assume that `\(\theta\)` is discrete and can only take the values 0.1, 0.4, 0.8, with the following probability function, which specifies the prior probability of each possible `\(\theta\)` value:
<span style="display:block; margin-top: 30px ;"></span>

``` r
> library(kableExtra)
> df <- data.frame(theta = c(0.1, 0.4, 0.8), p = c(0.2, 0.5, 0.3))
> df %>% kbl(col.names = c("\\(\\theta\\)", "\\(p(\\theta)\\)")) %>%
+   kable_classic(full_width = F, position = "center")
```

<table class=" lightable-classic" style='font-family: "Arial Narrow", "Source Sans Pro", sans-serif; width: auto !important; margin-left: auto; margin-right: auto;'>
<thead>
<tr>
<th style="text-align:right;"> \(\theta\) </th>
<th style="text-align:right;"> \(p(\theta)\) </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:right;"> 0.1 </td>
<td style="text-align:right;"> 0.2 </td>
</tr>
<tr>
<td style="text-align:right;"> 0.4 </td>
<td style="text-align:right;"> 0.5 </td>
</tr>
<tr>
<td style="text-align:right;"> 0.8 </td>
<td style="text-align:right;"> 0.3 </td>
</tr>
</tbody>
</table>

<span style="display:block; margin-top: 30px ;"></span>

Note that this prior reflects information from a previous study on a similar compound and puts 50% probability on the event that 40% of the patients will report no migraine episodes.

---

# The Binomial data model

- Our random variable `\(Y\)` can take any discrete value between 0 and 20 (the total number of patients)
- It will depend on the probability `\(\theta\)`
- For our formal Bayesian analysis, we must model this dependence of `\(Y\)` on `\(\theta\)` through a **conditional probability model**
- We make two assumptions about the trials: (1) the outcome of any one patient does not influence the outcome of another; and (2) the probability of success does not change among patients
- We can use the Binomial model

.content-box-green[
`\(Y \sim \mbox{Binomial}(n, \theta)\)` with conditional probability function
`$$p(y \mid \theta) = \frac{n!}{y!(n-y)!}\theta^{y}(1-\theta)^{n-y}$$`

- The mean of this distribution is `\(E(Y) = n \theta\)`
- The variance is `\(V(Y) = n \theta(1-\theta)\)`

.red[ Check out recording 4 for a recap on the Binomial distribution]
]

---

# The Binomial data model

This model allows us to calculate ANY conditional probability. For instance, conditioning on `\(\theta=0.4\)`, let's see the difference in the probability of getting 12 or 15 successes

1. `$$p(y=12 \mid \theta=0.4) = \frac{20!}{12!8!}0.4^{12}(1-0.4)^{8}= 0.035$$`

2. `$$p(y=15 \mid \theta=0.4) = \frac{20!}{15!5!}0.4^{15}(1-0.4)^{5} = 0.0013$$`

--

Note you can get these results in R using

``` r
> #1
> dbinom(12, 20, 0.4)
```

```
[1] 0.03549744
```

``` r
> #2
> dbinom(15, 20, 0.4)
```

```
[1] 0.001294494
```

---

# Binomial likelihood function

- The Binomial provides a theoretical model of the data we might observe
- In the end we observe 10 successes out of the 20 patients `\((y=10)\)`
- The next step in our Bayesian analysis is to determine how compatible these particular data are with the various possible values of `\(\theta\)`

.content-box-green[
We need to evaluate the *likelihood* of getting 10 successes in the trial under each possible value of `\(\theta\)`
]

Similarly to last week with events, the likelihood function follows from evaluating the conditional probability function `\(p(y=10\mid \theta)\)` for all the possible `\(\theta\)` values:

1. `$$p(y=10 \mid \theta=0.1) = \frac{20!}{10!10!}0.1^{10}(1-0.1)^{10}= 0.00000644$$`

2. `$$p(y=10 \mid \theta=0.4) = \frac{20!}{10!10!}0.4^{10}(1-0.4)^{10} = 0.117$$`

3. `$$p(y=10 \mid \theta=0.8) = \frac{20!}{10!10!}0.8^{10}(1-0.8)^{10} = 0.002$$`
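---

# Binomial likelihood function

These three evaluations can be obtained in a single call, since `dbinom()` is vectorised over its probability argument:

``` r
> # likelihood of y = 10 successes out of n = 20 patients
> # under each of the three candidate values of theta
> dbinom(10, 20, c(0.1, 0.4, 0.8))
```

```
[1] 6.442043e-06 1.171416e-01 2.031414e-03
```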
---

# Likelihood and probability function

Let's recall the fundamental difference between the probability function and the likelihood function
<span style="display:block; margin-top: 50px ;"></span>

.pull-left[
.content-box-yellow[
When `\(\theta\)` is known, the **conditional probability function** `\(p(\cdot\mid \theta)\)` allows us to compare the probabilities of different values of the data `\(Y\)` occurring under that `\(\theta\)`:
`$$p(y_1\mid \theta) \text{ compared to } p(y_2\mid \theta)$$`
]
]

.pull-right[
.content-box-yellow[
When `\(Y\)` is known, the **likelihood function** `\(L(\cdot\mid y) = p(y \mid \cdot)\)` allows us to compare the relative likelihood of the data `\(y\)` under different possible values of `\(\theta\)`:
`$$L(\theta_1 \mid y) = p(y \mid \theta_1) \text{ compared to } \\L(\theta_2 \mid y) = p(y \mid \theta_2)$$`
]
]

<span style="display:block; margin-top: 50px ;"></span>

So `\(L(\cdot \mid y)\)` provides the tool we need to evaluate the relative compatibility of the data `\(Y=y\)` with various `\(\theta\)` values.

---

# Normalising constant

- Now we have a prior for `\(\theta\)` and a likelihood, and as Bayesians we want to **balance** these two pieces of information to obtain the posterior
- Something is missing...

--

The normalising constant! This is the total probability of having `\(y=10\)` successes. How do we get this?

--

.content-box-green[
**Law of total probability**:
`$$P(A) = \sum_j P(A \mid B_j)P(B_j) \text{ (for events)}$$`
`$$P(Y=y) = \sum_j p(y \mid \theta_j)\,p(\theta_j) \text{ (for discrete probability functions)}$$`
]

<span style="display:block; margin-top: 20px ;"></span>

So we get
`$$P(Y=10) = 0.00000644 \times 0.2 + 0.117 \times 0.5 + 0.002 \times 0.3 = 0.059$$`

Interpretation: averaging over all the possible values of `\(\theta\)`, there is only around a 6% chance of observing 10 successes.
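---

# Normalising constant

The same calculation can be carried out in R. This is a minimal sketch, reusing the prior table and `dbinom()` from the previous slides:

``` r
> theta <- c(0.1, 0.4, 0.8)
> prior <- c(0.2, 0.5, 0.3)
> # law of total probability:
> # sum over theta of p(y = 10 | theta) * p(theta)
> sum(dbinom(10, 20, theta) * prior)
```

```
[1] 0.05918149
```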
---

# Posterior distribution

Finally we are ready to apply Bayes theorem and get the posterior distribution
`$$p(\theta \mid y=10) = \frac{p(\theta)L(\theta \mid y=10)}{p(y=10)} \text{ for } \theta \in \{0.1,0.4,0.8\}$$`

which gives us:
`$$p(\theta=0.1 \mid y=10) = \frac{0.2 \times 0.00000644}{0.059} = 0$$`

`$$p(\theta=0.4 \mid y=10) = \frac{0.5 \times 0.117}{0.059} = 0.99$$`

`$$p(\theta=0.8 \mid y=10) = \frac{0.3 \times 0.002}{0.059} = 0.01$$`

---

# Comparing prior and posterior

.pull-left[
<table class=" lightable-classic" style='font-family: "Arial Narrow", "Source Sans Pro", sans-serif; width: auto !important; margin-left: auto; margin-right: auto;'>
<thead>
<tr>
<th style="text-align:right;"> \(\theta\) </th>
<th style="text-align:right;"> \(p(\theta)\) </th>
<th style="text-align:right;"> \(p(\theta \mid y=10)\) </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:right;"> 0.1 </td>
<td style="text-align:right;"> 0.2 </td>
<td style="text-align:right;"> 0.00 </td>
</tr>
<tr>
<td style="text-align:right;"> 0.4 </td>
<td style="text-align:right;"> 0.5 </td>
<td style="text-align:right;"> 0.99 </td>
</tr>
<tr>
<td style="text-align:right;"> 0.8 </td>
<td style="text-align:right;"> 0.3 </td>
<td style="text-align:right;"> 0.01 </td>
</tr>
</tbody>
</table>
]

.pull-right[
<img src="./img/unnamed-chunk-4-1.png" style="display: block; margin: auto;" width="110%">
]

---

# Posterior shortcut

.pull-left[
- Moving forward we can actually forget about calculating the normalising constant
- Note that `\(p(y=10)\)` appears in the denominator of all the posterior probabilities `\(p(\theta \mid y=10)\)`
<span style="display:block; margin-top: 20px ;"></span>
.content-box-green[
It normalises the posterior probabilities so they sum to 1
]
- So we can simply acknowledge that `\(p(y=10) = c\)` is a constant and write the posterior as:
`$$p(\theta \mid y=10) \propto p(\theta)\times L(\theta \mid y)$$`
- The proportionality means that the unnormalised posterior preserves the relative relationship between the `\(\theta\)` values; dividing by `\(c\)` only rescales the probabilities so they sum to 1 (see the sketch on the next slide)
]
.pull-right[
<img src="./img/unnamed-chunk-5-1.png" style="display: block; margin: auto;" width="110%">
]
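---

# Posterior shortcut

A minimal R sketch of the shortcut, continuing with the `theta` and `prior` objects defined earlier: normalising the product prior `\(\times\)` likelihood reproduces the posterior exactly, without ever computing `\(p(y=10)\)` explicitly.

``` r
> unnorm <- prior * dbinom(10, 20, theta)  # prior x likelihood
> round(unnorm / sum(unnorm), 4)           # rescale to sum to 1
```

```
[1] 0.0000 0.9897 0.0103
```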
---
name: recap

<span style="display:block; margin-top: 250px ;"></span>

.myblue[.center[.huge[ **A quick recap**]]]

---

# So a quick general recap: Bayesian inference

Bayesian inference makes a fundamental distinction between

- Observable quantities `\(y\)`, i.e. the data
<span style="display:block; margin-top: 20px ;"></span>

- Unknown quantities `\(\theta\)`
<span style="display:block; margin-top: 30px ;"></span>

- `\(\theta\)` can be statistical parameters, missing data, mismeasured data...
<span style="display:block; margin-top: 20px ;"></span>

`\(\rightarrow\)` parameters are treated as random variables
<span style="display:block; margin-top: 20px ;"></span>

`\(\rightarrow\)` in the Bayesian framework, we make probability statements about model parameters

--

<span style="display:block; margin-top: 50px ;"></span>

.red[Note that in the Frequentist framework, parameters are fixed non-random quantities and the probability statements concern the data]

---

# Bayesian inference

- As with any statistical analysis, we start by building a model which specifies `\(p(Y=y \mid \theta)\)`
<span style="display:block; margin-top: 20px ;"></span>

- This is the .red[data distribution], which relates all variables into a .red[**full probability model**]
<span style="display:block; margin-top: 20px ;"></span>

- The choice of data distribution depends on the nature of the data: e.g. are we analysing continuous or discrete data, are the data symmetric or skewed, etc.
<span style="display:block; margin-top: 20px ;"></span>

- Once we observe the data, we can use descriptive tools (e.g. plots) to visualise them and guide the choice of likelihood
<span style="display:block; margin-top: 20px ;"></span>

---

# Bayesian inference

From a Bayesian point of view
<span style="display:block; margin-top: 20px ;"></span>

- `\(\theta\)` is unknown, so it should have a .red[probability distribution] reflecting our uncertainty about it before seeing the data
<span style="display:block; margin-top: 20px ;"></span>

`\(\rightarrow\)` need to specify a .red[prior distribution] `\(p(\theta)\)`
<span style="display:block; margin-top: 20px ;"></span>

- `\(y\)` is known, so we should condition on it
<span style="display:block; margin-top: 20px ;"></span>

`\(\rightarrow\)` use Bayes theorem to obtain the conditional probability distribution for the unobserved quantities of interest given the data:
`$$p(\theta \mid y)= \frac{ p(\theta)\, p(y \mid \theta)} {\int p(\theta)\,p(y \mid \theta)\,d\theta} \propto p(\theta)\,p(y \mid \theta)$$`
<span style="display:block; margin-top: 20px ;"></span>

This is the .red[posterior distribution]

--

- The prior distribution `\(p(\theta)\)` expresses our uncertainty about `\(\theta\)` .red[before] seeing the data
- The posterior distribution `\(p(\theta \mid y)\)` expresses our uncertainty about `\(\theta\)` .red[after] seeing the data

---

# The posterior distribution

The posterior distribution forms the basis for all inference: it can be summarised to provide
<span style="display:block; margin-top: 10px ;"></span>

- point and interval estimates of Quantities of Interest (QOI), e.g. treatment effect, small area estimates, ...
<span style="display:block; margin-top: 10px ;"></span>

- point and interval estimates of any function of the parameters
<span style="display:block; margin-top: 10px ;"></span>

- the probability that a QOI (e.g. treatment effect) exceeds a critical threshold
<span style="display:block; margin-top: 10px ;"></span>

- predictions of the QOI in a new unit
<span style="display:block; margin-top: 10px ;"></span>

- prior information for future experiments, trials, surveys, ...
<span style="display:block; margin-top: 10px ;"></span>

- inputs for decision making
<span style="display:block; margin-top: 10px ;"></span>

- ...

---
name: Conjugacy

<span style="display:block; margin-top: 250px ;"></span>

.myblue[.center[.huge[ **What is conjugacy?**]]]

---

# How to select a prior

- Selecting the prior is a crucial part of a Bayesian analysis

.content-box-green[
- There is no single right way to select a prior
- The choices often depend on the objective of the study and the nature of the data
<span style="display:block; margin-top: -10px ;"></span>
]

- There are other criteria to consider when choosing a prior model:

**Computational ease** Especially if we don't have access to computing power, it is helpful if the posterior model is easy to build.

**Interpretability** The posterior balances (is a compromise between) the data and the prior. A posterior model is interpretable, and thus more useful, when you can look at its formulation and identify the contribution of the data relative to that of the prior.

--

We now introduce a class of models that satisfies both properties

.content-box-yellow[
<span style="display:block; margin-top: -20px ;"></span>
### Conjugate prior
<span style="display:block; margin-top: -20px ;"></span>
Let the parameter `\(\theta\)` have prior `\(p(\theta)\)`, and let the model for the data `\(Y\)` conditional on `\(\theta\)` have likelihood function `\(L(\theta \mid y) = p(y \mid \theta)\)`. If the resulting posterior `\(p(\theta \mid y) \propto p(\theta) \times p(y \mid \theta)\)` is of the same family as the prior, then we say the prior is a **conjugate prior**.
<span style="display:block; margin-top: -10px ;"></span>
]
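---

# What is conjugacy? A numerical check

As a quick numerical illustration of this definition (a sketch, assuming a hypothetical Beta(2, 2) prior and the trial data `\(y=10\)`, `\(n=20\)` from earlier), we can compare the grid-normalised product prior `\(\times\)` likelihood with the Beta(12, 12) posterior that conjugacy predicts:

``` r
> grid <- seq(0.001, 0.999, by = 0.001)
> # unnormalised posterior: Beta(2, 2) prior x Binomial likelihood
> unnorm <- dbeta(grid, 2, 2) * dbinom(10, 20, grid)
> post <- unnorm / sum(unnorm * 0.001)  # normalise numerically
> # conjugate result: Beta(2 + y, 2 + n - y) = Beta(12, 12)
> max(abs(post - dbeta(grid, 12, 12)))  # close to 0
```

The two densities agree up to numerical error on the grid: the posterior stays in the Beta family, which is what makes the Beta prior conjugate for the Binomial likelihood.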
---
name: Beta-Binomial

<span style="display:block; margin-top: 250px ;"></span>

.myblue[.center[.huge[ **Some conjugate models: Beta-Binomial**]]]

---

# Beta-Binomial model for proportions: example

- We consider an early investigation of a new drug
<span style="display:block; margin-top: 30px ;"></span>

- Experience with similar compounds has suggested that response rates between 0.2 and 0.6 could be feasible
<span style="display:block; margin-top: 30px ;"></span>

- We interpret this as a distribution with mean 0.4 and standard deviation 0.1
<span style="display:block; margin-top: 30px ;"></span>

- A Beta(9.2, 13.8) distribution has these properties (check recording 7 to see how to go from the .red[mean and sd] to the .red[a and b] parameters of a Beta distribution)
<span style="display:block; margin-top: 30px ;"></span>

- Suppose we now treat `\(n=20\)` volunteers with the compound and observe `\(y=15\)` positive responses

---

# Identifying the different model components

- Assuming patients are independent, with common unknown response rate `\(\theta\)`, leads to a Binomial data distribution

`\begin{align}
p(y \mid n, \theta) &= \left( \begin{array}{c} n \\ y \end{array} \right) \theta^y (1-\theta)^{n-y} \; \propto \; \theta^y (1-\theta)^{n-y}
\end{align}`

- `\(\theta\)` needs a continuous prior distribution:

`\begin{align}
\theta & \sim \hbox{Beta}(a,b)\\
p(\theta) &= \frac{\Gamma (a+b)}{\Gamma (a) \Gamma(b)} \; \theta^{a-1} (1-\theta)^{b-1}
\end{align}`

---

# Combining prior and data

Combining the Binomial data and the Beta prior gives the following posterior distribution

.content-box-green[
`\begin{align}
p(\theta \mid y, n) & \propto p(y \mid \theta, n) p(\theta)\\
& \propto \theta^y (1-\theta)^{n-y} \theta^{a-1} (1-\theta)^{b-1}\\
& = \theta^{y+a-1} (1-\theta)^{n-y+b-1}
\end{align}`
]

The posterior is still a Beta distribution (with different parameters):
`$$\theta \mid y, n \sim \hbox{Beta}(y+a, \; n-y+b)$$`

---

# Comparing prior, likelihood and posterior

``` r
> # Prior parameters
> a <- 9.2; b <- 13.8
> # Data
> y <- 15; n <- 20
>
> library(ggplot2)
> library(gridExtra)  # provides grid.arrange()
> # prior density
> p1 <- ggplot(data = data.frame(x = c(0, 1)), aes(x)) +
+   stat_function(fun = dbeta, n = 101, args = list(shape1 = a, shape2 = b)) +
+   ylab(expression(p(theta))) + scale_y_continuous(breaks = NULL)
> # Binomial likelihood evaluated on a grid of theta values
> theta_grid <- seq(0, 1, 0.01)
> lik <- dbinom(y, n, theta_grid)
> p2 <- ggplot(data.frame(x = theta_grid, lik = lik), aes(x, lik)) +
+   geom_line() + ylab(expression(paste("L(", theta, "|y=15)"))) +
+   scale_y_continuous(breaks = NULL) + xlab("x")
> # posterior density: Beta(a + y, b + n - y)
> p3 <- ggplot(data = data.frame(x = c(0, 1)), aes(x)) +
+   stat_function(fun = dbeta, n = 101, args = list(shape1 = a + y, shape2 = b + n - y)) +
+   ylab(expression(paste("p(", theta, "|y=15)"))) + scale_y_continuous(breaks = NULL)
> grid.arrange(p1, p2, p3, nrow = 3)
```

<img src="./img/unnamed-chunk-6-1.png" style="display: block; margin: auto;" width="40%">
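---

# Summarising the Beta posterior

Because the posterior is a known distribution, Beta(24.2, 18.8), its summaries come in closed form. A minimal sketch, continuing with the objects defined above:

``` r
> # posterior mean: (a + y) / (a + b + n)
> (a + y) / (a + b + n)
```

```
[1] 0.5627907
```

``` r
> # 95% credible interval from the Beta quantiles
> qbeta(c(0.025, 0.975), a + y, b + n - y)  # roughly 0.42 to 0.71
```

The posterior mean 0.56 lies between the prior mean 0.4 and the observed proportion `\(15/20 = 0.75\)`: the posterior is a compromise between prior and data.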
---
name: PoissonGamma

<span style="display:block; margin-top: 250px ;"></span>

.myblue[.center[.huge[ **Some conjugate models: Gamma-Poisson**]]]

---

# Gamma-Poisson model for count data: example

- .red[For a recap on the Poisson distribution see recording 5]
<span style="display:block; margin-top: 20px ;"></span>

- In epidemiology we are often interested in estimating the .red[rate] or .red[relative risk], rather than the .red[mean], for Poisson data:
<span style="display:block; margin-top: 20px ;"></span>

  - Suppose we observe `\(y=7\)` cases of leukaemia in one region;
  - The expected number of cases is `\(E=4\)`
<span style="display:block; margin-top: 20px ;"></span>

- **Data distribution**: Poisson with mean `\(\theta = \lambda \times E\)`, where `\(\lambda\)` is the unknown relative risk:
`$$p(y \mid \lambda, E) = \frac{(\lambda E)^{y}e^{-\lambda E} }{y!}$$`
<span style="display:block; margin-top: 20px ;"></span>

- **Prior**: `\(\mathrm{Gamma}(a, b)\)` on the risk `\(\lambda\)`:
`$$p(\lambda) = \frac{b^a}{\Gamma(a)} \lambda^{a-1} e^{-b \lambda}$$`

.red[Check recording 8 for a recap on the Gamma distribution]

---

# Combining likelihood and prior

This implies the following posterior

`\begin{align*}
p(\lambda \mid y) & \propto p(\lambda)\,p(y \mid \lambda) \\
& = \frac{b^a}{\Gamma(a)}\lambda^{a-1}e^{-b\lambda}\,e^{-\lambda E}\frac{(\lambda E)^{y}}{y!}\\
& \propto \lambda^{a+y-1} e^{-(b+E)\lambda}\\
& = \mathrm{Gamma}(a + y ,\, b + E)
\end{align*}`

--

The posterior is another (different) Gamma distribution, with mean
`$$E(\lambda \mid y)= \frac{a + y }{ b + E}$$`

So the posterior mean depends on the prior parameters `\((a, b)\)` and on the data `\((y, E)\)`

---

# Prior, likelihood, posterior for the Leukaemia example

.pull-left[
- Assuming a prior `\(\lambda \sim \mathrm{Gamma}(2,2)\)`
<span style="display:block; margin-top: 20px ;"></span>

- Considering the data `\(y=7, E=4\)`
<span style="display:block; margin-top: 20px ;"></span>

- We obtain a posterior `\(\lambda \mid y \sim \mathrm{Gamma}(9,6)\)` centered around 1.5 (an R sketch of this calculation is given after the references)
]

.pull-right[
<center><img src=./img/PoissonGamma.png width='75%' title='Prior, likelihood and posterior for the leukaemia example'></center>

The posterior becomes a compromise between the prior and the data
]

---

# References

Blangiardo, M. and M. Cameletti (2015). _Spatial and Spatio-temporal Bayesian Models with R-INLA_. John Wiley & Sons.

Johnson, A. A., M. Q. Ott, and M. Dogucu (2022). _Bayes Rules!: An Introduction to Applied Bayesian Modeling_. CRC Press.
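---

# The Leukaemia example in R

A minimal sketch of the Gamma-Poisson update above, using the values from the slides (`\(a=2\)`, `\(b=2\)`, `\(y=7\)`, `\(E=4\)`):

``` r
> a <- 2; b <- 2   # Gamma(2, 2) prior on lambda
> y <- 7; E <- 4   # observed and expected counts
> # conjugate update: posterior is Gamma(a + y, b + E) = Gamma(9, 6)
> (a + y) / (b + E)   # posterior mean
```

```
[1] 1.5
```

``` r
> qgamma(c(0.025, 0.975), a + y, b + E)  # 95% credible interval
```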