--- title: "Introduction to BMEmapping" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Introduction to BMEmapping} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ## Introduction Spatial datasets encountered in environmental science, geostatistics, hydrology, climatology, and engineering frequently contain heterogeneous observations characterized by varying levels of uncertainty. In many practical applications, some measurements are observed precisely and may therefore be treated as hard data, whereas others are only partially known and are more appropriately represented as interval-valued or uncertain observations, commonly referred to as soft data. Traditional spatial interpolation methods often simplify or ignore such uncertainty by replacing interval observations with single representative values, potentially leading to biased predictions and underestimated uncertainty. Bayesian Maximum Entropy (BME) provides a rigorous probabilistic framework for integrating heterogeneous spatial information while explicitly accounting for uncertainty in soft observations. The BME framework combines general knowledge, typically represented through covariance or variogram structures, with site-specific knowledge derived from both hard and soft data to estimate posterior probability density functions and generate spatial predictions. By preserving interval uncertainty throughout the modeling process, BME methods produce spatial predictions that more realistically reflect the underlying uncertainty in the observed data. The **BMEmapping** package provides a unified computational framework for Bayesian Maximum Entropy spatial modeling within the \textsf{R} environment. The package includes tools for constructing BME data objects, estimating posterior probability density functions, fitting variogram models, performing spatial prediction, visualizing uncertainty, and evaluating predictive performance using cross-validation procedures. Both conventional Bayesian Maximum Entropy (CBME) and quantile-based Bayesian Maximum Entropy (QBME) methodologies are implemented within the package. Specifically, **BMEmapping** is designed to perform spatial interpolation at unobserved locations using both hard and soft-interval data. This vignette introduces the fundamental functionality of the package and guides users through its basic usage. ## Main Functions The primary functions available in **BMEmapping** include: `bme_map` - Creates a `BMEmapping` object that stores all spatial coordinates, hard observations, and soft-interval information required for Bayesian Maximum Entropy (BME) interpolation. `prob_zk` - Computes and optionally visualizes the posterior probability density estimate at an unobserved location using the Conventional Bayesian Maximum Entropy (CBME) approach. `q_prob_zk` - Computes and optionally visualizes the posterior probability density estimate at an unobserved location using the Quantile-Based Bayesian Maximum Entropy (QBME) approach. `bme_predict` - Predicts the posterior mean or posterior mode, together with the associated posterior variance when applicable, using the CBME approach. `q_bme_predict` - Predicts the posterior mean or posterior mode, together with the associated posterior variance when applicable, using the QBME approach. `bme_cv` - Performs leave-one-out cross-validation on the hard data locations using the CBME approach and returns prediction diagnostics and model performance measures. `q_bme_cv` - Performs leave-one-out cross-validation on the hard data locations using the QBME approach and returns prediction diagnostics and predictive performance statistics. ### Input Requirements The **BMEmapping** functions require the following input arguments: #### Data parameters `x`: A matrix or data frame specifying the geographic prediction location(s) where posterior estimation or spatial prediction is to be performed. `ch`: A matrix or data frame containing the geographic coordinates of the hard data locations. `cs`: A matrix or data frame containing the geographic coordinates of the soft-interval data locations. `zh`: A numeric vector containing the observed hard data values corresponding to the locations in `ch`. `a`: A numeric vector containing the lower bounds of the soft-interval data corresponding to the locations in `cs`. `b`: A numeric vector containing the upper bounds of the soft-interval data corresponding to the locations in `cs`. #### Variogram parameters Before applying the CBME implementation, a variogram model must be fitted to characterize the spatial dependence structure of the data. This requires specification of the variogram model type together with the associated covariance parameters: `model`: The variogram model type. Supported options include `"exp"` (Exponential), `"sph"` (Spherical), and `"gau"` (Gaussian). The selected model should reflect the underlying spatial dependence structure of the data. `nugget`: The nugget effect of the variogram, representing measurement error and/or microscale spatial variability. `sill`: The sill of the variogram, representing the plateau value of the semivariance. `range`: The range (or effective range) of the variogram, representing the distance beyond which spatial correlation becomes negligible. #### Optional parameters Additional optional parameters are also available: `nhmax`: Maximum number of neighboring hard data points included in the BME integration process. `nsmax`: Maximum number of neighboring soft-interval data points included in the BME integration process. `zk_range`: A numeric vector specifying the range over which the posterior density is evaluated at the prediction location. `n`: An integer specifying the number of evaluation points used to approximate the posterior density over `zk_range`. `nq`: An integer specifying the number of quantile levels used in the QBME framework. Unless otherwise specified, all optional parameters are automatically assigned their default values. Additional details regarding the arguments and optional settings can be found in the corresponding function documentation (e.g., `?prob_zk`). ## A Data Example The `utsnowload` dataset included in the package contains detrended reliability-targeted design snow load (RTDSL) measurements collected from 232 locations across the state of Utah. Among these locations, 65 observations are available as precise measurements and are treated as hard data, while the remaining 167 observations are represented as interval-valued soft data. The dataset therefore provides an ideal example for illustrating how BME methods can incorporate heterogeneous information sources into a unified spatial prediction framework. A detailed discussion of the methodological background and applications of interval-based BME modeling is provided in Duah et al. (2025) and related references. The present tutorial focuses on practical implementation of the **BMEmapping** package and demonstrates complete workflows for: - organizing hard and soft spatial observations, - constructing BME data objects, - visualizing heterogeneous spatial data, - fitting variogram models, - estimating posterior probability density functions, - generating BME spatial predictions, - constructing prediction maps, and - evaluating predictive performance through cross-validation. The tutorial additionally demonstrates how the quantile-based Bayesian Maximum Entropy (QBME) framework can improve computational efficiency while maintaining predictive accuracy in applications involving substantial interval uncertainty. ### Loading Required Libraries The following libraries are required for spatial preprocessing, visualization, variogram modeling, and Bayesian Maximum Entropy interpolation. ```{r warning=FALSE, message=FALSE} library(BMEmapping) library(ggplot2) library(sf) library(gstat) library(dplyr) library(tidyr) library(scales) library(knitr) library(gridExtra) library(maps) library(ggspatial) ``` ### The `utsnowload` Dataset The example dataset included in the package is then loaded. ```{r } data("utsnowload") ``` #### Overview of the `utsnowload` Dataset The `utsnowload` dataset contains detrended reliability-targeted design snow load measurements together with spatial coordinates and interval bounds describing uncertainty associated with soft observations. The dataset is structured such that: - the first 65 observations correspond to hard data locations, - the remaining observations correspond to soft interval data, - longitude and latitude define the spatial coordinates, and - lower and upper interval bounds quantify uncertainty for soft observations. Complete documentation for the `utsnowload` dataset can be accessed using the command ```{r eval=FALSE} ?utsnowload ``` Summary statistics for the dataset are presented below. ```{r } summary(utsnowload) ``` The plot below displays the geographical locations of the hard and soft data. ```{r fig.width = 6, fig.height = 6, fig.align='center', warning=FALSE} utah <- map_data("state", region = "utah") utsnowload_df <- utsnowload utsnowload_df$type <- rep(c("hard", "soft"), times = c(67, 165)) ggplot() + geom_polygon( data = utah, aes(x = long, y = lat, group = group), fill = "grey95", color = "black" ) + geom_point( data = utsnowload_df, aes(x = longitude, y = latitude, color = type, shape = type), size = 1.5, show.legend = TRUE ) + coord_quickmap() + scale_x_continuous( name = "longitude", labels = function(x) paste0(abs(x), "° ", ifelse(x < 0, "W", "E")) ) + scale_y_continuous( name = "latitude", labels = function(y) paste0(abs(y), "° ", ifelse(y < 0, "S", "N")) ) + annotation_north_arrow( location = "tr", # bottom-left which_north = "true", style = north_arrow_fancy_orienteering, height = unit(1.2, "cm"), width = unit(1.2, "cm") ) + labs(color = "data type", shape = "data type") + theme( panel.background = element_rect(fill = "white", color = "black"), panel.grid.major = element_blank(), panel.grid.minor = element_blank() ) ``` ### Organizing Hard and Soft Data A critical step in BME modeling involves separating the observations into hard and soft components. The spatial coordinates associated with hard and soft observations are extracted independently, together with the corresponding observed values and interval bounds. For illustrative purposes, a subset of the `utsnowload` dataset consisting of the first 30 hard-data locations is selected. This reduced dataset facilitates demonstration of the data organization and visualization workflow while limiting computational demands. ```{r} # hard data locations ch <- utsnowload[1:30, c("latitude", "longitude")] # hard data values zh <- utsnowload[1:30, c("hard")] # soft data locations cs <- utsnowload[68:227, c("latitude", "longitude")] # lower and upper bounds of soft data (intervals) a <- utsnowload[68:227, c("lower")] b <- utsnowload[68:227, c("upper")] ``` The extracted information is subsequently organized into a **BMEmapping** object, which serves as the central data structure for posterior density estimation and BME prediction. ```{r } data_object <- bme_map(ch, cs, zh, a, b) ``` ### Visualizing Hard and Soft Data Exploratory visualization provides important insight into the spatial coverage and uncertainty structure of the dataset prior to variogram estimation and spatial interpolation. The `plot()` method for **BMEmapping** objects displays: - hard-data locations, - soft-data locations, - interval uncertainty associated with soft observations, and - overall spatial support across the study region. ```{r fig.width = 6, fig.height = 6, fig.align='center'} plot(data_object) ``` This visualization facilitates identification of spatial clustering, regions with sparse observations, and areas associated with substantial interval uncertainty. Such preliminary assessment is important because the quality and spatial distribution of both hard and soft observations strongly influence posterior density estimation and predictive performance. BME prediction is demonstrated using a subset of the soft-data locations from the `utsnowload` dataset as prediction locations. The prediction locations, denoted by `x_k`, are obtained by extracting the geographic coordinates of the selected target locations. These coordinates are subsequently supplied to the BME prediction functions for posterior density estimation and computation of summary statistics such as the posterior mean and posterior mode. The prediction locations used in this illustration are defined below: ```{r} xk <- utsnowload[228:232, c("latitude", "longitude")] xk ``` ### BME Estimation Methods Two BME prediction approaches are considered throughout this tutorial. #### Conventional Bayesian Maximum Entropy (CBME) The conventional Bayesian Maximum Entropy framework assumes that the soft observations follow a uniform distribution over the specified interval bounds. Under this assumption, analytical expressions for the interval mean and variance are incorporated into the covariance structure used during posterior estimation. The CBME approach is computationally efficient and has been widely used in applications involving interval uncertainty. However, the uniform assumption may oversimplify uncertainty structures when the true distribution within the interval is asymmetric or nonuniform. #### Quantile-Based Bayesian Maximum Entropy (QBME) The quantile-based Bayesian Maximum Entropy framework extends the conventional approach by estimating covariance structures using multiple quantile-based representations of the soft interval observations. Instead of relying solely on interval midpoints or uniform assumptions, QBME constructs several quantile-specific realizations and combines the resulting covariance estimates. This strategy allows the method to better capture interval asymmetry and uncertainty while preserving the probabilistic structure of the BME framework. In many applications, QBME has been shown to improve computational efficiency while maintaining predictive accuracy comparable to conventional BME approaches. ### CBME Implementation #### Variogram Estimation Specification of an appropriate variogram model is a fundamental component of CBME interpolation because the covariance structure determines the spatial dependence relationships used during posterior estimation. Because the number of hard observations is relatively limited, a combined dataset consisting of hard observations and representative values derived from the soft intervals is commonly used for variogram estimation. In this tutorial, the midpoints of the soft intervals are combined with the hard observations to provide a practical approximation of the underlying spatial variability. Experimental variograms and fitted covariance models are estimated using the `gstat` package. The plot of the variogram is shown below: ```{r fig.width = 6, fig.height = 6, fig.align='center'} # combine hard data and midpoints of soft data df <- data.frame(rbind(ch, cs), c(zh, (a + b) / 2)) colnames(df) <- c("latitude", "longitude", "z") sf_data <- sf::st_as_sf(df, coords = c("longitude", "latitude")) # Empirical variogram vg <- variogram(z ~ 1, data = sf_data) vg_model <- fit.variogram( vg, model = vgm(c("Exp", "Sph")) ) # variogram plot plot(vg, vg_model) ``` The variogram model and parameters are given as: ```{r} # variogram model vg_model ``` #### Posterior Density Estimation One of the principal advantages of the BME framework is the ability to estimate complete posterior probability density functions rather than producing only single-point predictions. Posterior densities can be estimated at selected prediction locations using the BME prediction functions provided in the package. Before performing full-scale BME prediction, it is often instructive to examine the posterior distributions at a few selected locations (preferably the some prediction locations). Although this step is not strictly necessary and can be skipped, it provides valuable insights into the shape, spread, and asymmetry of the posterior distributions. Moreover, it allows users to verify that the chosen domain for evaluation (`zk_range`) sufficiently encompasses the possible values of the target variable and to make adjustments if necessary. ```{r fig.width = 6, fig.height = 6, fig.align='center'} # set model parameters model <- as.character(vg_model[2, 1]) nugget <- vg_model[1, 2] sill <- vg_model[2, 2] range <- vg_model[2, 3] # default zk_range p_1 <- prob_zk(xk[1,], data_object, model, nugget, sill, range) p_2 <- prob_zk(xk[2,], data_object, model, nugget, sill, range) p_3 <- prob_zk(xk[3,], data_object, model, nugget, sill, range) p_4 <- prob_zk(xk[4,], data_object, model, nugget, sill, range) p_df <- cbind.data.frame(p_1, p_2[, 2], p_3[, 2], p_4[, 2]) names(p_df) <- c("zk_i", "p1", "p2", "p3", "p4") # Function to generate ggplot for a given column name plot_prob_curve <- function(df, pi_col) { ggplot(df, aes(x = zk_i, y = .data[[pi_col]])) + geom_line(color = "blue", linewidth = 0.5) + labs(x = "z", y = "f(z)") + theme_minimal(base_size = 10) + theme( panel.background = ggplot2::element_rect(fill = "white", color = "black") ) } # Generate individual plots p1 <- plot_prob_curve(p_df, "p1") p2 <- plot_prob_curve(p_df, "p2") p3 <- plot_prob_curve(p_df, "p3") p4 <- plot_prob_curve(p_df, "p4") # Arrange in a 2x2 grid grid.arrange(p1, p2, p3, p4, ncol = 2) ``` Inspection of the posterior density plots above reveals that the estimated density functions reach zero at certain sub-regions within the default `zk_range`, for example $(-4.25, -2)$ and $(1.5, 2.5)$. This indicates that the true support of the posterior distribution is likely much narrower than the initially chosen default range. To improve the accuracy and computational efficiency of the BME predictions, it is advisable to refine the posterior domain by excluding regions where $f(z) = 0$, focusing on the interval where the posterior density is strictly positive. In this example, the refined range can be set to `[-2, 1.5]`, which better captures the plausible values of the target variable. Using the updated range, the posterior densities at the selected prediction locations are recomputed as shown below: ```{r fig.width = 6, fig.height = 6, fig.align='center'} # updated zk_range: [-2, 1.5] q_1 <- prob_zk(xk[1,], data_object, model, nugget, sill, range, zk_range = c(-2, 2)) q_2 <- prob_zk(xk[2,], data_object, model, nugget, sill, range, zk_range = c(-2, 2)) q_3 <- prob_zk(xk[3,], data_object, model, nugget, sill, range, zk_range = c(-2, 2)) q_4 <- prob_zk(xk[4,], data_object, model, nugget, sill, range, zk_range = c(-2, 2)) q_df <- cbind.data.frame(q_1, q_2[, 2], q_3[, 2], q_4[, 2]) names(q_df) <- c("zk_i", "q1", "q2", "q3", "q4") # Generate individual plots q1 <- plot_prob_curve(q_df, "q1") q2 <- plot_prob_curve(q_df, "q2") q3 <- plot_prob_curve(q_df, "q3") q4 <- plot_prob_curve(q_df, "q4") grid.arrange(q1, q2, q3, q4, ncol = 2) ``` #### Spatial Prediction Predictions within the BME framework can be computed using either the posterior mean or the posterior mode, depending on the desired interpretation and application. The posterior mean represents the expected value of the posterior probability density function and provides an average estimate that minimizes mean squared prediction error. In contrast, the posterior mode corresponds to the value associated with the highest posterior probability density and may provide a more representative estimate when the posterior distribution is asymmetric or multimodal. In addition to point predictions, the BME framework also allows estimation of the associated posterior variance, which quantifies the uncertainty of the prediction at each location. The posterior variance provides important information regarding prediction reliability, spatial confidence, and the influence of nearby hard and soft observations. Together, the posterior mean or mode and the posterior variance offer a comprehensive probabilistic characterization of the spatial prediction process. The posterior mode predictions for the estimation locations are computed as ```{r} bme_predict(xk, data_object, model, nugget, sill, range, n = 100, zk_range = c(-2, 2), type = "mode") ``` The posterior mean predictions for the estimation locations are computed as ```{r} bme_predict(xk, data_object, model, nugget, sill, range, n = 100, zk_range = c(-2, 2), type = "mean") ``` #### Cross-Validation Analysis To quantitatively evaluate predictive accuracy of the BME model, leave-one-out cross-validation (LOOCV) is conducted at all hard data locations. During each iteration, a single hard data observation is excluded from model fitting, predicted using the remaining observations, and compared with the corresponding measured value. This procedure generates pointwise predictions and residuals for every hard data location while maximizing data utilization, making LOOCV particularly suitable for small datasets where partitioning into separate training and testing subsets would reduce the effective sample size. Because each iteration excludes a different observation, slight variations occur in the fitted variogram for each prediction location. Predictive performance is quantified using mean error (ME), mean absolute error (MAE), root mean square error (RMSE), and coefficient of determination (R$^2$). ```{r} CBME_cv <- bme_cv(data_object, model, nugget, sill, range, n = 100, zk_range = c(-2, 2), type = "mean") CBME_cv ``` Key summary metrics for model performance is given computed as ```{r} summary(CBME_cv) ``` Diagnostic plots are generated to evaluate model adequacy through visual assessment of residual behavior. These plots facilitate identification of deviations from assumptions including spatial independence, homoscedasticity, and stationarity. Residual patterns may also reveal locations associated with unusually large prediction errors or spatial clustering, potentially indicating the need for model refinement or incorporation of additional data considerations. The residual plots are generated by ```{r fig.width = 6, fig.height = 6, fig.align='center'} plot(CBME_cv) ``` ### The QBME Implementation Unlike the CBME approach, QBME does not require users to specify a variogram model. Instead, spatial dependence is internally captured through quantile specific variogram estimation, allowing the method to adaptatively represent variability across different levels of the data distribution. #### Posterior Distribution Estimation at Individual Locations Similarly to the CBME, it is advisable to examine the posterior distributions at a few selected locations in other to choose an appropriate `zk_range`. The posterior density plots of the selected locations are shown below: ```{r fig.width = 6, fig.height = 6, fig.align='center'} # default zk_range q_1 <- q_prob_zk(xk[1,], data_object) q_2 <- q_prob_zk(xk[2,], data_object) q_3 <- q_prob_zk(xk[3,], data_object) q_4 <- q_prob_zk(xk[4,], data_object) q_df <- cbind.data.frame(q_1, q_2[, 2], q_3[, 2], q_4[, 2]) names(q_df) <- c("zk_i", "q1", "q2", "q3", "q4") # Generate individual plots q1 <- plot_prob_curve(q_df, "q1") q2 <- plot_prob_curve(q_df, "q2") q3 <- plot_prob_curve(q_df, "q3") q4 <- plot_prob_curve(q_df, "q4") # Arrange in a 2x2 grid grid.arrange(q1, q2, q3, q4, ncol = 2) ``` Based on the above plots, the `zk_range` which better captures the plausible values of the target variable can be set to `[-1.5, 1]`. #### Spatial Prediction The posterior mode predictions for the estimation locations are computed as ```{r } q_bme_predict(xk, data_object, n = 100, zk_range = c(-1.5, 1), type = "mode") ``` The posterior mean predictions for the estimation locations are computed as ```{r } q_bme_predict(xk, data_object, n = 100, zk_range = c(-1.5, 1), type = "mean") ``` #### Cross-Validation Analysis The LOOCV on the hard data is computed as ```{r } QBME_cv <- q_bme_cv(data_object, n = 100, zk_range = c(-1.5, 1), nq = 8, type = "mean") QBME_cv ``` Key summary metrics for model performance is given computed as ```{r } summary(QBME_cv) ``` The residual plots are generated by ```{r fig.width = 6, fig.height = 6, fig.align='center'} plot(QBME_cv) ``` ## Conclusion The **BMEmapping** package provides a comprehensive framework for Bayesian Maximum Entropy spatial interpolation using both hard and soft spatial observations. By explicitly incorporating interval uncertainty into the interpolation process, the package enables more realistic uncertainty quantification and probabilistic spatial prediction than methods based solely on precise observations. The workflows demonstrated throughout this tutorial illustrate how Bayesian Maximum Entropy methods can be efficiently implemented within the \textsf{R} environment for spatial prediction, posterior density estimation, variogram modeling, uncertainty visualization, and predictive assessment. The package is particularly well suited for environmental and geostatistical applications involving heterogeneous datasets characterized by incomplete, interval-valued, or uncertain observations.