class: title-slide # Session 9.1: Introduction to Geostatistics<span style="display:block; margin-top: 10px ;"></span> ## ### ### <!-- Can also separate the various components of the extra argument 'params', eg as in ### , , AirAware workshop --> --- layout: true .my-footer[ .alignleft[ © Pirani | Blangiardo | Villejo ] .aligncenter[ AirAware workshop ] .alignright[ , NA ] ] --- # Learning Objectives At the end of this session you should be able to: <span style="display:block; margin-top: 20px ;"></span> - know the common models used for **geostatistical data** analysis, i.e. Gaussian fields (GF); <span style="display:block; margin-top: 20px ;"></span> - understand the assumptions of stationarity and isotropy; <span style="display:block; margin-top: 20px ;"></span> - understand and compute the variogram/semivariogram. <span style="display:block; margin-top: 30px ;"></span> The topics covered in this lecture can be further investigated in the books: - Section 6.4 of the book **Spatial and Spatio-Temporal Bayesian models with R-INLA** https://github.com/michelacameletti/Spatial-and-Spatio-Temporal-Bayesian-Models-with-R-INLA - Chapter 12 of the book **Spatial Statistics for Data Science: Theory and Practice with R** https://www.paulamoraga.com/book-spatial/index.html - Chapter 12, Sections 12.1-12.3 of the book **Spatial Data Science** https://r-spatial.org/book/part-1.html - Chapter 2, Section 2.1 of the book **Hierarchical Modeling and Analysis for Spatial Data** --- # Outline <span style="display:block; margin-top: 30px ;"></span> 1\. [Introduction to spatial modeling for geostatistical data (based on GF)](#introduction) <span style="display:block; margin-top: 30px ;"></span> 2\. [Assumptions of stationarity and isotropy](#stationarity) <span style="display:block; margin-top: 30px ;"></span> 3\. [The variogram and semivariogram](#vario) --- name: introduction <span style="display:block; margin-top: 250px ;"></span> .myblue[.center[.huge[ **Introduction to spatial modeling for geostatistical data (based on GF)**]]] --- # Geostatistical data: definition - The difference between models for .red[**geostatistical**] (or point referenced) data and spatial models for areal data is that here we treat space as continuous, not discretised (areas). <span style="display:block; margin-top: 20px ;"></span> - We are concerned here with spatial data structures where the process of interest is a spatial field `$$\{Z(\mathbf{s}): \mathbf{s} \in \mathcal{D} \}$$` i.e. real values stochastic process characterized by a spatial index `\(\bm{s}\)` which varies .red[**continuously**] in the fixed domain `\(\mathcal D\)`. <span style="display:block; margin-top: 20px ;"></span> - Data are measured (possibly with error) at `\(n\)` spatial locations `\((\bm{s}_{1},\ldots,\bm{s}_{n})\)` and are denoted by `\(\bm{Z}=\left(Z(\bm{s}_1),\ldots, Z(\bm{s}_n)\right)=(Z_1,\ldots,Z_n)\)`. --- # Geostatistical data: example .pull-left[ <span style="display:block; margin-top: 20px ;"></span> - Examples: <span style="display:block; margin-top: 20px ;"></span> - in environmental science: rainfall, air pollution concentrations, radioactive emission in soil, etc. <span style="display:block; margin-top: 10px ;"></span> - in epidemiology: prevalence of a disease measured at different villages distributed over a region of interest. <span style="display:block; margin-top: 10px ;"></span> - in ecology: density of mosquitoes responsible for disease transmission measured using traps placed at different locations. ] .pull-right[
] --- # Aims - To **reconstruct the spatial field** from sparse observations taken at a finite number of spatial locations. <span style="display:block; margin-top: 20px ;"></span> - To use the spatial dependence to **predict values** of the spatial field (together with associated uncertainty) at locations where there are no observations. <span style="display:block; margin-top: 30px ;"></span> - Example below: Each point represents a weather station in Brazil, and we have an associated `\(Z(s)\)`, which is air temperature. <span style="display:block; margin-top: 20px ;"></span> - Using geostatistical methods we can reconstruct a latent spatial field from the finite set of observations taken at a finite number of spatial locations. <center><img src=./img/Brazil.jpg width='43%' title=''></center> --- # Gaussian fields - The common methodological framework to geostatistical models is that of .red[**Gaussian fields (or processes)**], which are based on the multivariate Normal distribution. <span style="display:block; margin-top: 21px ;"></span> - A spatial process `\(Z(\bm{s})\)` is a .red[**Gaussian field**] (GF) if for any `\(n \geq 1\)` and for each set of locations `\((\bm{s}_{1},\ldots,\bm{s}_{n})\)`, the vector `\((Z(\bm{s}_1),\ldots, Z(\bm{s}_n))\)` follows a .red[multivariate Normal distribution] with mean `\(\bm{\mu}=(\mu(\bm{s}_{1}), \ldots, \mu(\bm{s}_{n}))\)` and spatially structured covariance `\(\bm{\Sigma}\)`. <span style="display:block; margin-top: 21px ;"></span> - The generic element of `\(\bm{\Sigma}\)` is defined by a **spatial covariance function** (sometimes called a kernel) `\(\mathcal C(\cdot,\cdot)\)` such that `\(\Sigma_{ij}=\mbox{Cov}\left(Z(\bm{s}_{i}),Z(\bm{s}_{j})\right)=\mathcal C(Z(\bm{s}_{i}),Z(\bm{s}_{j}))\)`. --- name: stationarity <span style="display:block; margin-top: 250px ;"></span> .myblue[.center[.huge[ **Stationarity and Isotropy**]]] --- # Stationarity - A spatial process is .red[stationary] if its *statistical behaviour is similar across the map*. (e.g. the variability of PM2.5 is comparable in different neighbourhoods) <span style="display:block; margin-top: 20px ;"></span> - Why it matters: it lets us describe spatial dependence using a function of **distance**, enabling prediction and uncertainty quantification. <span style="display:block; margin-top: 20px ;"></span> - We are going to explore different degrees of stationarity: - Strict (or strong) stationarity - Weak (or second-order) stationarity - Intrinsic stationarity --- # Strict (or strong) stationarity - The spatial process is called .red[**strict (or strong) stationary**] if shifting the locations by the same vector `\(\bm{h}\)`, the **joint distribution** of the values does not change. - A strictly stationary process repeats itself throughout the spatial domain. - In practice, we rarely use strict stationarity directly. We mainly rely on **weak (second-order) stationarity** --- # Weak (or second-order) stationarity and Isotropy .red[Weak stationarity or second order stationarity] - It imposes conditions only on the mean and covariance, which are translation invariant. <span style="display:block; margin-top: 10px ;"></span> - The spatial process is called .red[**weak (or second-order) stationary**] if <span style="display:block; margin-top: 10px ;"></span> - `\(\bm{\mu}\)` is constant, i.e., `\(\mu(\bm{s}_{i}) = \mu\)` for each `\(i\)` <span style="display:block; margin-top: 10px ;"></span> - the spatial covariance function depends on length and orientation of the vector `\(\bm{h}\)` linking two points `\(\bm{s}_{i}\)` and `\(\bm{s}_{j}= \bm{s}_{i} + \bm{h}\)`, that is `\(\mbox{Cov}\left(Z(\bm{s}_{i}),Z(\bm{s}_{j})\right)=\mathcal C(\bm{s}_{i}-\bm{s}_{j}) = \mathcal C(\bm{h})\)` <span style="display:block; margin-top: 10px ;"></span> - Interpretation: two locations have the same covariance whenever they have the same **lag** \(h\) (same distance, and possibly same direction). <span style="display:block; margin-top: 20px ;"></span> -- .red[Isotropy] - If covariance depends **only on distance** (not direction), the process is isotropic: `$$\mathcal C(\bm{s}_{i}-\bm{s}_{j}) = \mathcal C(\|\bm{h}\|)$$` (where `\(\| \cdot \|\)` denotes the Euclidean norm, i.e. Euclidean distance) <span style="display:block; margin-top: 10px ;"></span> - In other words, the covariance depends only on the distance between two points irrespective of the geographical direction (north-south or east-west) of one from the other <span style="display:block; margin-top: 20px ;"></span> - If dependence changes with direction, we have .red[anisotropy] (e.g., prevailing winds, mountain ranges). --- # Intrinsic stationarity and the variogram - .read[Intrinsic stationarity] is a relaxed form of stationarity, which is based on the difference in the process between locations. - For a choice of two locations, we assume that: - the difference in means will be zero (i.e. constant mean assumption) - the variance of increments depends only on `\(\bm{h}\)`: `$$Var(Z(\mathbf{s+h})-Z(\mathbf{s})) = 2\gamma(\mathbf{h})$$` - The function `\(2\gamma(\mathbf{h})\)` is called .blue[Variogram] and `\(\gamma(\mathbf{h})\)` is called .blue[Semivariogram] - The gamma symbol, `\(\gamma\)`, is the standard symbol for variability in a variogram. Commonly, the terms variogram and semivariogram are used interchangeably, although the semivariogram is half of the variogram (we adopt this common terminology here). - Interpretation: larger `\(\gamma(\mathbf{h})\)` implies that locations separated by `\(\mathbf{h}\)` are **less similar**. --- # Variogram - The (semi)variogram is useful in exploratory analysis for geostatistical data, as it summarizes the strength of association as function of the distance (and in case of anisotropy, direction). - The variogram analysis is often associated with .blue[kriging], a widely used interpolation method for geostatistical data. - The variogram is generally estimated by the experimental (or empirical) variogram, `\(\hat{\gamma}(\mathbf{h})\)`, which measures the similarity of values as a function of the distance between their locations: `$$\hat{\gamma}(\mathbf{h})=\frac{1}{2N(\mathbf{h})}\sum_{i}^{N(\mathbf{h})} (Z(\mathbf{s}_i+\mathbf{h})-Z(\mathbf{s}_{i}))^2$$` where `\(\mathbf{h}\)` is the distance class and `\(N(\mathbf{h})\)` is the set of pairs of points; therefore the sum is taken over all sample points separated by a lag vector of magnitude `\(\mathbf{h}\)`. --- # Characteristics of the variogram - .red[Sill]: the value at which the variogram levels off (corresponds to the overall variability of the data) - .red[Range]: the distance at which the variogram reaches the sill (corresponds to the distance at which observations become independent) - .red[Nugget]: The nugget represents the combination of sampling error and short-range variability that causes two samples apparently taken from the same location to have different values. <center><img src=./img/Semivariogram.jpg width='37%' title=''></center> Note that the partial sill is the sill minus the nugget --- # Some common isotropic variogram functions [1] - After obtaining the empirical variogram, we fit a model to it (i.e. theoretical variogram) to get a smooth fit. Popular variogram functions are: <span style="display:block; margin-top: 20px ;"></span> <center><img src=./img/Type_Vario_new.png width='60%' title=''></center> .center[For details, see Banerjee, Carlin, and Gelfand (2014), Sections 2.1.2 - 2.1.4] --- # Some common isotropic variogram functions [2] - `\(t^2\)` is the nugget - `\(t^2 + \sigma^2\)` is the sill, therefore `\(\sigma^2\)` is the partial sill - the value `\(1/\phi\)` is the range, while `\(\phi\)` is called the decay parameter - For the exponential model, strictly speaking, the range is infinite. In this case the notation of .blue[effective range], `\(h_0\)`, is often used. - It is the distance at which the correlation is negligible, dropping to 0.05. Setting: `\begin{eqnarray*} \exp(-h_0\phi) &=& 0.05 \\ \Longrightarrow h_0 &=& -\log(0.05)/\phi \\ \Longrightarrow h_0 &\approx& 3/\phi \end{eqnarray*}` since `\(\log(0.05)\approx -3\)`. --- # Some isotropic variogram functions [3] Theoretical variograms for three models, respectively linear, spherical, and exponential <span style="display:block; margin-top: 20px ;"></span> <center><img src=./img/Popular_vario.jpg width='55%' title=''></center> .center[Source: Banerjee, Carlin, and Gelfand (2014)] --- # Steps of the variogram analysis We use here the Meuse data set, which includes four heavy metals measured in the top soil in a flood plain along the river Meuse (it runs from France to the Netherlands) <center><img src=./img/Picture_Tom.jpg width='40%' title=''></center> .center[Source: Hengl (2009), https://edepot.wur.nl/485517] <span style="display:block; margin-top: 20px ;"></span> - .blue[Steps]: (a) sampling locations and measured values of the target variable, (b) (semi)variogram cloud showing semivariances for all pairs (log-transformed variable), (c) semivariances aggregated to lags of about 100 m, and (d) the final (semi)variogram model fitted using `gstat`. --- # Basic code for the computation of the variogram in `R` [1] To demonstrate the computation of the variogram in `R`, we use the Meuse data set. Among the metals, we work with zinc ``` r > library(sp) > library(gstat) > > # We use Meuse dataset, which includes concentrations of zinc > # measured at 155 sampling sites within the Meuse River plain > data(meuse) > > # Transform the dataframe into a SpatialPointDataFrame > coordinates(meuse) = ~x+y # the function coordinates > # promotes the data.frame meuse > # into a SpatialPointsDataFrame > > # Bubble plot > bubble(meuse, "zinc", col=c("#00ff0088", "#00ff0088"), + main = "zinc concentrations (ppm)") > > hist(meuse$zinc) # we see a strong right skew in the data, so we log-transform them > > # Lagged scatter plot > hscat(log(zinc)~1, meuse,(0:9)*100) # the correlation is quite strong when the lag > # is between 100 meters, then decrease with distance ``` --- # Basic code for the computation of the variogram in `R` [2] ``` r > # Construct the variogram > meuse.vgm = variogram(log(zinc)~1, meuse) # we assume a constant trend for > # the variable log(zinc) > > # Plot the experimental variogram > plot(meuse.vgm) > plot(meuse.vgm, plot.numbers = TRUE, pch = "+") # The numbers of points in the > # lag group used to compute the corresponding value of gamma(h) > > # Fit a variogram model > model.1 = fit.variogram(meuse.vgm, vgm("Sph")) > plot(meuse.vgm, model=model.1) > > # Look at the result of the fit > model.1 > > # We can also specify a set of models. In this case the best fitting is returned > model.2 = fit.variogram(meuse.vgm, vgm(c("Exp", "Sph"))) > model.2 # here the spherical model with nugget=0.051, partial sill =0.591 and range=897 is chosen > > # Specify theoretical variogram with its characteristics > model.final = fit.variogram(meuse.vgm, vgm(psill=0.59,"Sph",range=897,nugget=0.05)) > plot(meuse.vgm, model=model.final) ``` --- # References Banerjee, S., B. P. Carlin, and A. E. Gelfand (2014). _Hierarchical modeling and analysis for spatial data_. Chapman and Hall/CRC. Hengl, T. (2009). "A practical guide to geostatistical mapping".