Bayesian validation framework for dynamic epidemic models

Complex models of infectious diseases are used to understand the transmission dynamics of the disease, project the course of an epidemic, predict the effect of interventions and/or provide information for power calculations of community level intervention studies. However, there have been relatively few opportunities to rigorously evaluate the predictions of such models till now. Indeed, while there is a large literature on calibration (fitting model parameters) and validation (comparing model outputs to data) of complex models based on empirical data, the lack of uniformity in accepted criteria for such procedures for models of infectious diseases has led to simple procedures being prevalent for such steps. However, recently, several community level randomized trials of combination HIV intervention have been planned and/or initiated, and in each case, significant epidemic modeling efforts were conducted during trial planning which were integral to the design of these trials. The existence of these models and the (anticipated) availability of results from the related trials, provide a unique opportunity to evaluate the models and their usefulness in trial design. In this project, we outline a framework for evaluating the predictions of complex epidemiological models and describe experiments that can be used to test their predictions.

ecology, and other fields (see for example Caswell, 1976;Guttorp and Walden, 1987;Sampson and Guttorp, 1999;Poole and Raftery, 2000). Different types of models are available for this purpose, including stochastic, agent-based and deterministic differential equation models, the choice of which must be suited to the question of interest (see Halloran et al., 2017). While stochastic and agent based models incorporate random variation adding individual interactions such as partnership formation/dissolution, differential equation models are typically simpler and run faster, and are quite useful for analyses that require multiple runs of the model. These models have also dominated the field of epidemic modeling, especially in complex diseases like HIV/AIDS (Johnson and White, 2011). In HIV epidemiology, these complex, dynamic models of infectious disease are used to understand the transmission dynamics of the disease, to obtain estimates and short-term projections of HIV epidemics (see Stanecki et al., 2012;Dorrington et al., 2005), to predict the effect of interventions (Case et al., 2018;de Montigny et al., 2018), to evaluate program implementation costs (Eaton et al., 2014;Schwartlander et al., 2011), to interpret clinical trial results (Adamson et al., 2019;Dimitrov et al., 2019) and to provide information for power calculations of community level intervention studies (Hayes and Moulton, 2017). For example, the increase in availability of antiretroviral therapy (ART) to individuals with HIV (WHO, 2013;Johnson et al., 2013) has opened up questions about the trajectory of the epidemic in presence of ART, including the probability of, and the level of coverage necessary for, complete eradication of the disease. To answer these questions, several models have been proposed to quantify the short and long term consequences of widespread ART availability and use on HIV prevalence, incidence and mortality, and to assess the level of coverage of ART necessary to substantially reduce or eliminate HIV (Granich et al., 2009;Kretzschmar et al., 2013).
Validity of results from a model is greatly dependent on the accuracy of the model structure and the input parameters, so a key goal during model development is to quantify and reduce the uncertainty about the structure and select model parameters which reproduce available epidemic data, using all of the available sources of evidence. This step is called calibration, and it involves careful tuning of the model structure to epidemiological endpoints, informed by available knowledge in the literature, and establishes credibility of the model (Ramin and Arhonditsis, 2013). Another important step in epidemiological model building is model validation, which is defined as a set of methods that help assess a model's performance in making predictions. There are different types of validation processes, including face validity (the extent to which a model and its assumptions correspond to the current science as judged by experts), internal validity (whether the model behaves as intended and has been properly implemented), cross validity (how the model behaves when calibrated to one part of the data and used to predict the other), and external validity (to compare the predicted outcomes to measurable results). Each type of validation has its own sets of methods, strengths, limitations, and best practices (see Eddy et al., 2012).
For a clinical trial that was conducted based on insights gained from a model(s) in the planning phase, it can be quite useful to check how accurately the model(s) predicted the outcomes of interest when the trial is over. This is a possible external validation step. It is also quite useful to see if any source of bias in the model can be re-accounted for, and whether the model structure can be recalibrated to ensure its re-usability in the future (Eddy et al., 2012). The results of these analyses can help investigators determine applicability of the model in the public-health decision making process, and in planning of future trials.
In this article, we propose a framework for model comparison and evaluation based on observed data. To be precise, we are interested in external validation (will be called validation hereon) and re-calibration of a model, based on observed data. It is clear that the most powerful way to validate the prediction of a mathematical model is to record the model and its predictions in advance of new observations. Recently several community level randomized trials of combination HIV intervention have been planned and/or initiated (Boily et al., 2012). Significant epidemic modeling efforts were conducted during trial planning and became integral to the design of these trials. The existence of these models that have been designed to predict trial results in a specific setting, and the (anticipated) availability of results from those trials, provides a unique opportunity to evaluate those models and their usefulness in trial design through an external validation step. In this type of analysis, the goal is to test and potentially validate the model predictions of intervention with the trial results at the end of the trial. If the trial results are similar to the model predictions, then this validates the model for the trial in question, and the model can be used for further analyses with more confidence (Boily et al., 2012).
Interestingly, there have been relatively few such opportunities to rigorously evaluate the predictions of HIV epidemic models until now. Indeed, while there are multiple theoretical methods for calibration and external validation of complex models (see Bayarri et al., 2007;Kennedy and O'Hagan, 2001;Poole and Raftery, 2000), fairly simple procedures have been used for calibration and validation of models partially due to lack of substantial high quality, population-level data on HIV incidence, HIV prevalence and sexual behavior. The main goal of this article is to develop a step-by-step methodological framework for model validation, comparison and selection, when detailed data are available. Our approach takes into account the key uncertainties in model parameters, allowing for subsequent recalibration of the model, if needed and/or possible. An additional objective is to determine the extent to which external data can be used to reject a complex epidemiological model. The key components of the approach include the use of Gaussian process response-surface method, and introduction of Bayesian representations of model bias and uncertainty, following the works of Bayarri et al. (2007), Kennedy and O'Hagan (2001) and Kennedy et al. (2002).
In Section 2, we formulate the problem and propose our methodological framework for validation of a dynamic epidemic model based on the Bayesian validation framework of Bayarri et al. (2007). Additionally, we propose a way to make decisions regarding model fit based on Bayesian credible intervals. In Section 3, we describe a mathematical model (M W ) of a heterosexual HIV epidemic originally considered in Woods et al. (2018) to investigate how the proportion of early transmission affects the impact of ART on reducing HIV incidence. The model includes stages of HIV infection, flexible sexual mixing, and changes in risk behavior over the epidemic, and was calibrated to HIV prevalence data from South Africa. Next, we apply the proposed methodology to evaluate the performance of this model under departures of different model assumptions. In other words, we try to falsify this model under null and alternative conditions, and note how it performs in such situations. In Section 4, we discuss the implications of our results. Some additional results are provided in Appendix.

Model framework
2.1.1. Epidemiological model definition-In this article, we consider epidemiological models of the form y M = M(x, θ). Here y M ∈ Ω M , the model output, typically emulates a population level outcome of interest, such as HIV prevalence. Each entry in y M represent the projected outcome at different times within different demographics or different risk subgroups. The model M is a complex non-linear function, typically defined as a solution to a system of ordinary differential equations, which translates a vector of community-specific parameters x ∈ Ω x and a vector of global parameters θ ∈ Ω θ into the outcome y M .
Throughout this paper a community will be any population within which an epidemic can be modeled independently from other communities. A community may be all individuals in a given city or region or may only be a subset, such as people who inject drugs, within which the epidemic may be evolving independently. A community may be divided into several interconnected sub-populations, for example divided by demography.
The community-specific parameters x (a vector of size p x ) denote information about attributes of a given community within which we wish to calculate the model output y M . In the context of HIV prevention trials, y can be the projected HIV prevalence by gender and age group, and x might include community-specific information about level of ART usage, initial testing coverage (that differ by subgroups or communities), HIV prevalence for that site at some previous time point. The joint spaces Ω x and Ω M together represent the granularity of the model, the class of all possible sub-populations and communities for which the model M, after calibration, can predict the population-level outcome separately.
The 'global' parameters θ (of size p θ ) describe the part of the model that is universal, or fixed across different communities. For example, in the context of a HIV trial, θ contain various biologic, behavioral or intervention characteristics (e.g., parameters for intervention efficacy, transmission risk per act, sexual mixing, background level of male circumcision etc.) which are shared between community-specific simulations.
The components of y M , x, and θ depend on the modeling context and intended granularity.
For example, it may be crucial for a model on HIV transmission to predict the response separately for men who have sex with men (MSMs) and transgender women (TGW), two crucial sub-populations at risk for the disease. However that distinction may be inconsequential in the context of some other disease (say cancer), and a universal response may be preferred under a model aiming to predict the disease for these sub-populations. We define the 'model structure' of M to be the three spaces Ω x , Ω θ , and Ω M . able to predict the underlying truth without error. In practice, however, it is quite likely that there will be uncertainty regarding the true values of x and θ, and the model structure M is only a useful approximation of the underlying mechanics. Ideally, we would assess a model M by comparing the modeled epidemiological value y M with the true value (denoted by y R ), though this may not be possible, as the reality will have to be estimated with error from a sample of the population. All these sources of uncertainties must be considered in assessing the model predictions.
2.1.3. Field data vs. reality-Suppose we are interested in evaluating our model for K different subgroups, stratified under one or more baseline/longitudinal characteristics or communities, based on model projections at a given time during the study and the true outcome at that time. As mentioned before, we assume that the outcome is not observed directly, but it can be measured without bias. Let y R,k be the true outcome for community k, k = 1,…,K, at the specified time, and assume that we observe y F,k , the 'field data' for this community at that time, as reality measured with error, that is, where ϵ F (y R,k ) are mean 0 errors with variance 1∕λ F (y R,k ), for a precision process λ F (·), which may or may not be explicitly known (we discuss this in details in Section 2.3.1). Further distributional assumptions can be made on ϵ F (y R,k ) under specific problem setups, for example, one operational assumption often used in validation problems is that ϵ F (y R,k )'s are independent normal random errors (see Bayarri et al., 2007). However, one might have to consider transformations of the responses y R,k and y F,k to enable easier distributional assumptions and interpretation of the error function ϵ F (·) in Eq. (1).
Note: Note that although y F and y M can be evaluated at several times during the study, we consider only one fixed time point here to compare the model projections with observed data, and as a result to simplify notations, we have dropped the suffix t from y F , y M as well as y R .

Induced prior for y M -Information
about different parameters of a model are often available from previously conducted trials or relevant observational studies in the literature. Knowledge from these various sources are combined for the quantification of the community-specific and global parameters, and the model is then carefully calibrated, resulting in well-informed prior distributions for these quantities. Thus, in this article, we assume that X k , 1 the community-specific parameters for the kth subgroup, and Θ, the global parameter vector, are random variables with distributions G X k ( ⋅ ; η X k ) and ℋ Θ ⋅ ; η Θ respectively, and with underlying hyperparameters η X k and η Θ . These distributions quantify the level of uncertainties about the different model inputs, the community-specific and global parameters. Also note, that the distributions of X k and Θ jointly induce a distribution, denoted by ℱ X k , Θ M , on the random model function M(X k ,Θ).
1 We will use X k , Θ etc to denote Random Variables, while x k and θ will be reserved to denote observed realizations of such, and/or to denote fixed quantities.
Note: Simplification of the above setup is possible too, for example, if information about X k is available through high quality estimates x k , we can assume that the true underlying community-specific parameter set is known apriori, and X k = x k . Similarly, information about ℋ Θ can be available only as scalars, Θ = θ , that are available as estimates/calibrated values for the 'true' structural parameters of the model. In such cases, we may choose to use these scalars directly as 'true' (or best known) values in the analysis.

Model discrepancy
2.2.1. The discrepancy function, D M -Statistical assessment of a model prediction is based on the difference between the true value y R and the model prediction. To that effect, we define the discrepancy function D M , which we will use throughout this article. For a given model M with an instance of community-specific parameters x and global parameters θ, the discrepancy between the model output and the relevant response process y is given as, In our problem setup, as previously mentioned in Section 2.1.4, X k , the community-specific parameters for the kth subgroup, and Θ, the global parameter vector, are random variables, while the true response for is y R,k . Hence, the 'random' discrepancy function for subgroup k becomes D M (y R,k ,X k ,Θ) = y R,k − M(X k ,Θ). We can analyze the distribution of D M (y R,k ) for k = 1,…,K (for each of the different subgroups/communities) to evaluate model performance. The variation in D M (y R,k ) between these communities can be useful in identifying areas where the model has failed to adequately capture the effect of factors on the response.
Note: Although we are interested in learning about D M (y R,k ), in the absence of concrete knowledge about y R , we typically have to rely on D M (y F,k ) to infer about this quantity.

2.2.2.
A Bayesian approach to evaluate D M -Since D M (X k ,Θ) is a function of the random variables X k and Θ, for which we have distributions of the form of G X k ( ⋅ ; η X k ) and ℋ Θ ⋅ ; η Θ , one trivial way to obtain and analyze the distribution of D M is through the transformation D M (X k ,Θ) = y R,k − M(X k ,Θ) on the induced distribution ℱ X k , Θ M . Although the true response y R,k is typically unknown, one can replace y R,k by its observed estimate y F,k . Although the resultant distribution is well-defined, it can only be used to assess the model in its current state. However, our interests may also lie in identifying the possible sources of error in the model, and if possible, recalibrate these faulty parts, based on observed data, to improve model performance.
Hence, using a Bayesian analysis described in the next section (Section 2.3), we propose to estimate the posterior distribution of the community-specific parameters X k , global parameters Θ, and the discrepancy function D M (X k ,Θ), to assess the model for community k and identify its sources of discrepancy. Ideally, the distribution of D M should be centered around zero with small variance. We will show how to quantitatively evaluate this property, even in high dimensions using the idea of posterior tail probability (or p T P ). We will also show how to use this procedure to identify the different sources of error or bias in model parameters, both community-specific and global, and model structure.

Gaussian Stochastic Processes (GaSP) model as a prior for D M -We
define a functional prior P D M (x, θ) on D M (x, θ) for each pair (X = x,Θ = θ) using a Gaussian Stochastic Process (GaSP). A GaSP is useful in modeling functional outputs like D M that depend smoothly on its arguments, x and θ (see Sacks et al., 1989;Currin et al., 1991).
Although a GaSP is most helpful when all of the model arguments are continuous, it can also handle cases when one or more (but not all) of its arguments are discrete, by specifying a separate Gaussian response surface at each level of the discrete factors. This, combined with the priors on X and Θ, produces a prior for D M (X,Θ).
Consider our setup of K subgroups (or communities), with community-specific random variables X k and a global random variable Θ. Suppose we want to run the model M for X k = x k , for k ∈ {1,…,K}, and Θ = θ, and evaluate the discrepancy of the model for these subgroups for the given realizations. Define the K × 1 vector of discrepancies for the K subgroups as d M = D M x 1 , θ , …, D M x K , θ T . Let Z k denote the multivariate random variable Z k = (X k ,Θ), such that z: = z 1 T , …, z K T T forms a K × p z matrix, where p z = p x + p θ and z k = x k , θ is a realization of Z K . The GaSP formulation assigns a K-variate Gaussian distribution for the vector d M with mean function μ d (z) = μ d z 1 , …, μ d z K T . The covariance function C d (·,·) for the Gaussian process is given as, for parameters β d = β d, 1 , …, β d, p z and α d = α d, 1 , …, α d, p z that control the amount of correlation between D M (z k ) and D M (z l ), and with λ d controlling the precision of the Gaussian surface. Note that for elements z k = x k , θ and z l = x l , θ (where k,l ∈ {1,…,K}), the last p θ elements in the sum ∑ j = 1 p z β d, j z k, j − z l, j α d, j are 0, as z k, j + p x = z l, j + p x = θ j for j = 1,…,p θ . Hence the above simplifies to C d z k , z l = 1 λ d exp − ∑ j = 1 p x β d, j x k, j − x l, j α d, j and as a result only the first p x elements of the hyperparameters β d and α d need to be specified. Now let us briefly discuss the specifications of the hyperparameters μ d , λ d , α d and β d . First, to limit the number of hyperparameters in the model, the components of α d are fixed. For example, α d = 2 gives us the standard Gaussian process formulation (see Bayarri et al., 2007), and for the rest of the article, we will consider the elements of α d to be fixed at that value. The mean function of the GaSP, μ d (·), can be chosen to be fixed at a constant μ, the most intuitive choice being μ = 0. To generalize the analysis further, we can specify symmetric mean 0 distributions to be used as priors if required. For elements of β d , we can either specify priors for them though another set of hyperparameters, or fix them at estimates, determined using data driven methods. However, the precision parameter λ d is typically kept stochastic, and we can specify priors like inverse gamma for λ d with the hyperparameters (shape and scale) determined in a data-driven manner. Please see a more detailed discussion on how to choose these hyperparameters for a practical setting as our simulation example in Section 3.2.

Assessing the model
2.3.1. Specification for λ F in the Bayesian analysis-Often λ F (y R,k ) can be determined by the sampling design of the field data y F,k for a given subgroup or community k. Consider the case of a epidemiological model that predicts HIV prevalence, where reality y R,k is a proportion in the interval (0,1), and is estimated through sampling. Now if the observed 'field data' y F is collected based on a set of N samples, then it will be distributed as Ny F , k Bin(N, y R, k ) . Here the precision parameter λ F has an explicit expression, Note that for a large enough sample size, y F,k is approximately distributed as Gaussian with variance 1∕λ F (y R,k ). A natural estimate for λ F (y R,k ) can be produced by replacing y R,k in (4) by its estimate y F,k . However in case the error structure is unknown or the estimates are unreliable, standard priors like 1∕λ F can be used if replicates are available, otherwise data dependent priors centered at a suitably defined estimate of λ F (y R,k ) can be used in the analysis instead. analysis. This analysis can be further modified based on different distributional assumptions on the various parts of the model (community-specific and/or global) and the observed data (error variance). One such modified analysis has been discussed in the Appendix, which was used in our simulation exercise. From here on, we will also assume that ϵ F (y R,k )'s are independent normal random errors. This assumption is certainly valid in many cases of epidemiological modeling, when model predictions are of standard forms (for example, binomial proportions like prevalence or rates like log-incidence) and the field data estimates are obtained using survey data or other standard methods, as the estimation error is approximately Gaussian due to the Central Limit Theorem. This is the case in our simulation example that we will study in detail later.

Calculation of posterior of D M -Using
Suppose we want to run the analysis for K different communities with random communityspecific parameters for the kth community, X k , with X k G X k , and a random global parameter set Θ, with distribution ℋ Θ . Recall the equations that connect y R,k , y F,k and M; for community k, where ϵ F (y R, k ) N(0, 1/λ F (y R, k )) .
( 5) Also, suppose that the precision process is unknown, and we have prior distributions of the form λ F (y R,k ) ~ P(λ F |y R,k ) for community k. Given the unknowns, these produce a multivariate normal density for the collection of all field data y F,k for k ∈ {1,…,K}, denoted by f(y F , 1 , …, y F , K | Θ, X 1 , …, X K , λ F (y R, 1 ), …, λ F (y R, K ), Removing the known (or fixed) quantities from the likelihood, the posterior density of the unknowns given the data y F,k for k ∈ {1,…,K} can be written as Note that the prior P (Θ, X 1 , …, be written as where P (D M | Θ, X 1 , …, X K , λ d ) is the multivariate GASP prior for the discrepancy at all K communities. From here onward, we will refer to the k-variate discrepancy random variable The posterior distribution can be determined through MCMC techniques (Robert and Casella, 2004). The MCMC step can be performed in a number of different ways. In our simulation study, we have used the Metropolis Hastings (MH) algorithm to calculate the posterior distribution, however alternative techniques like Bayesian Melding (see Poole and Raftery, 2000), Hamiltonian Monte Carlo or HMC (see Chatzilena et al., 2019;Betancourt, 2017) can be used instead. For example, in comparison with the traditional MH algorithm, HMC can offer greater computational efficiency, especially in higher dimensional or more complex modeling situations. Please also see Robert and Changye (2020) for a comparative analysis of these methods to determine the most optimal framework for a given problem. In the Appendix, we describe the MH procedure for a simplified setup, which will be the setup for the simulation example presented in Section 3.

Identifying sources of error-The
Bayesian procedure employed to obtain the posterior distribution of discrepancy vector D M = D M X 1 , Θ , …, D M X K , Θ can be run in a number of different ways depending on the targeted goals of the validation procedure, as described below. We first describe what we mean by 'updating a prior' in this context. Note that the main goal of our Bayesian analysis is to obtain the posterior distribution of D M . This posterior distribution is informed by the priors on D M , the model structure M, the stochastic distributions G X k on X k for k ∈ {1,…,K} and ℋ Θ on Θ, and the observed data y F,k for k ∈ {1,…,K}. Now, in the process of obtaining the posterior distribution of D M , we can additionally choose to obtain the posterior distributions of any, some or all of the random variables X 1 ,…,X K ,Θ. For parameters that we choose to update, their stochastic distributions are considered as priors and the relevant posteriors are obtained through the MCMC procedure. Parameters which are not being updated are not considered directly in the MCMC procedure. While, we do have to generate realizations of these parameters from their current stochastic distributions for calculating the likelihood in each MCMC update. Their distributions are not updated in the MCMC steps. In the Appendix, we have described how to achieve this in our simulation setup through the Metropolis Hastings procedure.
Although the flow chart in Fig. 1 shows the flow of the analysis when the objective is to re-calibrate all parameters (if necessary to obtain a better model fit), Researchers may be interested in running only part of the analysis flow depending on study objectives. Below we present a few such ways to run the analysis: 1. Evaluation of current model To assess the model in its current state, we can run the analysis without updating the priors ℋ Θ or G X . As a result we only obtain the posterior distribution of D M along with the posteriors of the hyperparameters controlling the discrepancy distribution in the model.

2.
Recalibration of community-specific parameters If we are interested in assessing the model after allowing for recalibration/re-tuning of the communityspecific parameters, we can update G X k to obtain G X k post , the posterior distribution of X k along with the posterior distribution of D M . Note that one might choose to update all of the community-specific parameters at a time together, or one by one, or in groups of more than one.

3.
Recalibration of community-specific and global parameters If we are interested in assessing the model after allowing for recalibration of both the global and the community-specific parameters, we can choose to update ℋ Θ to obtain ℋ Θ post , the posterior distribution of Θ as well, along with G X k post , the posterior distribution of X k and the posterior distribution of D M . Like in step 2, one might choose to update any combination of community-specific and global parameters at a time together, or in steps (for which new data become available after the last model evaluation), or till they are all updated.

Note:
The entire analysis can be conducted in a single step or in a series of several steps, with each step devoted to updating and analyzing a specific parameter or a group of parameters. The order in which the community-specific and global parameters are updated in the analysis above is interchangeable, depending on the context and necessity. For example, the order can be determined by running sensitivity analyses to gauge the influence of each parameter (or parameter set) on the final model outputs, and then updating them in order of importance.
After running the analysis in one of the aforementioned ways, the posterior distribution of the discrepancy, P post (D M ), is assessed carefully to see if it is considerably removed from being centered at 0, which may suggest (depending on the analysis run) that any or all of (i) the structure of the model, (ii) the priors for Θ and (iii) the priors for X K need to be updated.
A step by step flowchart for the entire scope of the analysis under the discussed setup is given in Fig. 1.

Quantification of model performance with Posterior Tail Probability
(p T P )-When the model assumptions are satisfied, that is, when we have informative and precise priors on X 1 ,…,X K ,Θ and the model is a close approximation of the reality, we expect the posterior distribution of the discrepancy to be roughly centered around 0. However, under violation(s) of any one or more of these assumptions, the posterior distribution will either suffer a shift in mean, or show inflated variance, be multimodal, or any combination of the above. In this section, we will concern ourselves only with the first type of alternative, that is, when the posterior distribution suffers a mean shift. The goal is to quantify the extent of this shift as a model validation metric. Note that our interests may lie in validating the model for (i) all K communities together, (ii) for a given subset of these K communities, or (iii) for each community separately. We present methods for the second situation here, and it is easy to see that the first and the third situations can be viewed as special cases of the second.
Let us assume that we want to assess the discrepancy of the model M at J out of K communities, X i 1 , X i 2 , …, X i J with i 1 ,i 2 ,…,i J ∈ {1,2,…,K} and 1 ≤ J ≤ K. Let D M be the multivariate random variable (of size J × 1) representing discrepancy at these communities. The Mahalonobis distance for a discrepancy vector D M = d from a given center μ and a given covariance matrix V is The Posterior Tail Probability (p T P ) measures the probability that an observation drawn at random from the posterior distribution of D M lies further away from the mean of that distribution than does 0, the expected discrepancy under the null hypothesis that the Model is true. Effectively, it is a (two-sided) tail probability, calculated based on the relative location of 0 with respect to the posterior distribution of D M .
Let η and V be the posterior mean and covariance matrix for the posterior distribution for D M , and d is the sample mean of discrepancies from the MCMC analysis, while V is a sample estimate for V. Now note that the Mahalonobis distance for a discrepancy sample D M = d from the center η and for the given covariance matrix V is given as

Author Manuscript
Author Manuscript Author Manuscript

Author Manuscript
To estimate p T P from the MCMC data, d j , we:
Note that p T P estimates the probability that Δ η, V D M ≥ Δ η, V (0), that is, Fig. 2 gives us an overview of how the posterior tail probability might look like when discrepancy is univariate (for a single community). Under H 0 , the true center (mean) of the posterior discrepancy is 0, hence the distance between the true center and 0 is also trivially 0, that is, Δ η=0,V (0) = 0. Thus for any sample, its Mahalonobis distance from the true center (or its mean) will be greater than or equal to the distance between the true center and 0, and its posterior tail probability will be 1 (the probability under the shaded region A in the leftmost figure in Fig. 2). On the other hand, under a hypothetical alternative situation when the model is not valid (H 1 ) and that the true center (or mean) is at K, the Mahalonobis distance between a randomly chosen sample and the center will be greater than that between the center and 0 if the sample belongs to either of the shaded regions A and C, and will be lesser than that between the center and 0 if it belongs to region B. Now, observe from Fig. 2, that the area of the region B is exactly 1 − p T P , and that the interval that bounds it forms a credible interval of level p T P for the posterior discrepancy around its posterior mean, that is, the interval (0,2K) is a C p T P for the posterior discrepancy D M . Mahalonobis distance helps to extend this idea to multivariate discrepancy vectors, where p T P gives us the minimum α level of a credible region centered around the multivariate posterior mean that does not contain 0.
Because the MCMC chain consists of autocorrelated samples, we estimate V using an approach based on batch means Wakefield (see Chapter 3 of 2013), Glynn and Iglehart (see Chapter 3 of 1990). We split the output of N MCMC samples into L batches each of length I, with I chosen large enough that the batch means have low serial correlation, and then estimate V using the variance of the batch means. In our simulation examples, the number of MCMC samples, N, is 5000 and thus we have used L = 50 and I = 100. The procedure is given below in steps:

1.
Compute the mean of the function of interest (for us, it is the vector of discrepancies at J communities) within each batch. That is, for l = 1,…,L, μ l = 1 I ∑ i = (l − 1)I + 1 lI d i

2.
The overall mean is then given as μ =

A simulation study
In this section, we study our formulation in the context of HIV epidemic models. We start off this section by describing the model, and then we discuss the simulation settings and the results.

An HIV transmission model
Antiretroviral therapy (ART) has been shown to reduce the infectiousness of HIV infected persons, but only after HIV testing and diagnosis, linkage to care, and successful viral suppression (Hallett et al., 2009;Johnson et al., 2013). Thus a large proportion of HIV transmissions may occur during a period of high infectiousness in the first few months after infection, and is perceived as a threat to the impact of HIV "treatment-as-prevention" strategies. If any intervention is applied within the study period, the model can project the effect of this intervention over the next period, compared to the scenario when no intervention is applied. In our hypothetical case, we assume that a combination intervention program is initiated in 2015, and the HIV prevalence is simulated up to 2030. The interventions that are modeled through this combination package in the model are (i) enhanced antiretroviral therapy (ART), which reduces the likelihood that a positive individual will transmit infection and the mortality rate from late-stage infections (ii) behavioral interventions (for example, counseling on condom use or other safe sex practices), which acts to modulate the force of infection; and (iii) medical male circumcision (MMC), which reduces the risk of men to acquire HIV infection. Model equations can be found in the Appendix (and also in Woods et al., 2018). There are 10 global parameters (same for all communities) in the model integrated in the model structure, namely,

1.
The reduction in the rate of transmission from individuals on ART, ε A .

2.
The reduction in the risk of acquisition of HIV in circumcised men, ε C .

3.
The rate of leaving the population (due to aging out or death due to caused unrelated to HIV), μ 1 .

5.
The rate of progression from infected state I1 to I2, σ 1 .

6.
The mortality rate due to HIV in late stage positive individuals in I2, μ 2 .

7.
The mortality rate due to HIV for individuals on ART, μ 3 .

8.
The rate of transmission of HIV (per partnership), β(t), that controls the forces of infection, λ 1 and λ 2 .

9.
Background prevalence rate of HIV in the population in 1980, p ini , considered same for all communities.

10.
The proportion of people assumed to be already circumcised in 2013, same for each community, c.
The parameter τ, denoting the proportion of men entering the population at a given community who are not at risk, is fitted (using least squares) separately for each community, based on the population dynamics of that community at the time when the communityspecific parameter was recorded (in our case prevalence data for that community in 2013). Note that τ is not considered as community-specific or global parameter, as it is not an input for the model but rather it is fitted based on the input (community-specific and global) parameters. The parameters ε A and ε C directly affect the not-at-risk parameter τ, as well as the forces of infection, λ 1 and λ 2 . The prevalence of HIV, as well as progression and mortality rates due to the disease at a given time t is determined by the (time varying) rate of transmission parameter, β(t), and the proportion of the population who are not at risk, τ. As mentioned before, the model was originally calibrated to HIV epidemics data from SSA, and the best point estimates for these parameters were calculated. We denote this estimate vector by θ 0 .

The analysis setup
Before we can apply the Bayesian validation procedure on M W , it is important to correctly define the response of interest, one which can give us a substantial idea about the fit of the model. In this analysis, the response was chosen to be the HIV prevalence in each community at the beginning of the year 2020, 5 years after the interventions were initiated in these communities. Next we need to define the distributions G X k and ℋ Θ for X k , k = 1,…,6, and Θ respectively. To simplify things, we assume that there is no estimation error in x 0 = {x 1 ,…,x 6 }, that is, the available information on the HIV prevalence in 2013 in these 6 communities reflect the actual truth, or in other words, X k := x k . We next define priors on the model parameters. Since all of the parameters are proportions between 0 and 1, we consider a logit-Gauss prior in our analysis, that is, logit(Θ) ~ N(μ,σ 2 ), with μ := logit(θ 0 ), and σ 2 := p∕(θ 0 (1 − θ 0 )), for a given scale parameter p.
For a given instance Θ := θ and a given timepoint within 1980 and 2030, the model M W outputs the HIV prevalence vector M W x 1 , θ , …, M W x 6 , θ . We assume that HIV prevalence in 2020 for community k can be estimated through a simple random sample of size n k drawn from the population in question. Thus, given a true HIV prevalence of y k at community k in 2020, we have an unbiased estimate y k , which can be written as y k = Z k n k , where Z k B n k , y k with sampling variability V ar y k = y k 1 − y k n k . We chose a sample size n k = 2000 for each community in our simulations. Since prevalence is a proportion between 0 and 1, the HIV prevalence values are logit transformed, so that D M (x,θ) is distributed over the entire real line. We assume Eq. (5) from Section 2 hold with M = logit(M W ) and X k := x k , where,

3.
y F , k = logit y k , for k = 1,…,6. Note that due to this transformation, λ F y R, k = 1/V ar y F , k = n k e y R, k 1 + e y R, k 2 , and also that λ F y R, k = λ F y F , k .
As discussed in Section 2.2.3, we assume a GaSP prior for the discrepancies D M . For the Bayesian analysis, we can specify priors on the GaSP hyperparameters, μ d , β d and λ d , as well. However, due to limited data availability, and owing to the fact that no direct data about model discrepancies are available, introducing too many hyperparameters in the model can increase collinearity between the GaSP parameters. Thus, we allow only λ d to be stochastic, and fix the rest of the hyperparameters at reasonable values, following recommendations of Bayarri et al. (2007), and as discussed in Section 2.2.3. This is achieved in the following manner: the vector y F , 1 − M x 1 , θ 0 , …, y F , 6 − M x 6 , θ 0 is treated as a realization from a multivariate normal with constant mean vector μ d and covariance matrix , where z i = (x i ,θ 0 ) and Λ F is a diagonal matrix with kth diagonal entry λ F (y F,k ). We use standard GaSP fitting software to obtain MLE estimates β d , that are used as the fixed value for β d in this analysis. The mean function μ d is fixed at 0. We assume an inverse gamma prior for λ d with shape parameter α λ d = 1 and scale parameter β λ d = 5λ d where λ d is the MLE estimate of λ d obtained from the GaSP analysis described above, following suggestions from Bayarri et al. (2007). For the Metropolis Hastings algorithm, we also specify proposal distributions for unknown random variables, namely,

•
A Gaussian distribution for D M (x,θ).
• A Beta distribution for Θ (since all are quantities between 0 and 1).

•
An Inverse Gamma distribution for λ d .

The simulation scenarios
We evaluate the model M W under different scenarios, given by the following:

Setting 1: The Null scenario:
The model is completely accurate, that is, both the model structure M and the parameter set θ 0 are accurate, that is, y R,k = M(x k ,θ 0 ), for k = 1,…,6. The priors ℋ Θ ⋅ ; η Θ with hyperparameters η Θ = θ 0 , η Θ * reflect a valid quantification of the uncertainty regarding Θ. Plot for this setting is given in Fig. 4.

2.
Setting 2: Faulty prior information on Θ: The model structure M is accurate, but information collected on one or more of the parameters is erroneous. To create this scenario, we start off with the assumption that θ 0 is still the true parameter vector, but we calibrated the model at an incorrect value θ 0 * , which is created by perturbing some of the elements of θ 0 . The priors are now given by ℋ Θ ⋅ ; θ 0 * , η Θ * . In this scenario the following holds for each community k. In this case, we should be able to falsify the model with running only step 1 of Section 2.3.3. Running step 3 however should recalibrate the parameter set. We can evaluate the accuracy of the parameters after the recalibration step.
To simplify the analyses, in this scenario, we will consider priors only on the faulty parameters, while keeping the others fixed at their estimates (given in θ 0 ). This is, similar to what mentioned in Section 2.3.3, an effective simplification of step 3 of running the analysis, where we update the parameter set in multiple iterations, starting off with ones that we are most uncertain about while keeping others fixed at their priors or point estimates, and then moving onto the next batch if the validity of the model is not still achieved. Plots for this setting are given in Figs. 5-7.

3.
Setting 3: Faulty model structure: The model structure itself is wrong, which means that M(x,θ) does not correctly specify the reality, even if efforts are made to recalibrate the model in its current form with its given community-specific and global parameters. We create this scenario by producing a reality which is a distortion of the model outputs, that is, we define the alternative (true underlying) model structure as (1+c)M W instead of the presumed structure M W , meaning we add 100c% bias to the untransformed model outputs, where c is a constant. Thus, we can write y R, k = M x k , θ 0 + D 2 x k , θ 0 , where D 2 x k , θ 0 = logit (1 + c)M W x k , θ 0 − logit M W x k , θ 0 .
In this case, we should be able to falsify the model with running either of the steps 1 and 3 from Section 2.3.3. Plot for this setting is given in Fig. 8. The value of c chosen for this exercise is c = 0.25.

Results
The results from the above analyses are presented in Figs. 4-8 and in Table 1. Each plot, except Fig. 7, shows the posterior distribution of the discrepancy D M W x k , Θ for communities k = 1,…,6 under the different simulation scenarios. Fig. 7 shows the posterior distribution of the parameter σ 1 , the rate of progression from infected state I1 to I2, with and without performing the recalibration step under setting 2 (Faulty prior on Θ). The analyses are repeated for multiple realizations of the observed data, resulting in different density curves for the discrepancy in each community for each such realization. In Figs. 4-8, 10 such curves are presented for each simulation scenario. We also calculate the posterior tail probability (as discussed in Section 2.3.4) for each scenario, aggregated over the different MCMC runs, and present the results in Table 1. Some of our observations can be summarized as the following: 3.4.1. Null case: both model and priors are accurate-Under the first simulation setting (the null scenario), the discrepancy for all six communities are centered around 0 both when ℋ Θ is not updated, and also when we do update it (see Fig. 4). This is expected, since the discrepancy distribution should be centered around 0 when model assumptions are satisfied. However, some individual runs are not centered at 0 at all communities, even after updating -this might be because in those runs, the sampling errors for the communities in question are large.
Updating the prior, increased p T P from 0.83 to 0.88 Table 1. Under H 0 , p T P should eventually converge 1, given an infinite number of samples. Under finite sample setting, the posterior mean will never be exactly at 0 due to both random fluctuations in the MCMC algorithm and the prior on λ d , and hence the estimated values p T P will always be less than 1. This provides a benchmark for how good we should expect a model to perform with respect to this metric, given that a 'perfect' model scores about 0.85 here.

3.4.2.
Recalibrating faulty priors-Under the second simulation scenario, the prior for σ 1 , the rate of progression from infected state I1 to I2, is misspecified downwards by a margin of 50%. We consider two types of faulty priors, (i) Flat -logitgauss prior with scale p = 0.1 (Fig. 7A), (ii) Narrow -logitgauss prior with scale p = 0.01 (Fig. 7B). Although both of these priors for σ 1 are shifted by the same amount, the flat prior assigns a substantial probability to the true value (black vertical line in 7 than the narrow prior, where it assigns no probability at all).
For both the narrow and flat prior, the model discrepancy is biased substantially away from zero in almost every community when the priors are not updated (as seen in Figs. 5 and 6). However the posterior distribution of D M includes zero with a higher posterior tail probability in the flat prior than in the narrow prior. This is reflected in the moderately low p T P of 0.63 for the flat prior, compared to 5.3 × 10 −4 for the narrow prior.
When the prior of σ 1 is updated in the MCMC analysis, we can obtain a posterior distribution (for σ 1 ) centered around the true value only it is in the support for the faulty prior originally (Fig. 7). As a result, the posterior distribution of the discrepancy shifts towards 0 in the flat prior but not in the narrow prior (see Figs. 5 and 6), after the recalibration step is performed. After recalibration, the p T P of the flat prior improves to 0.81, which is comparable to the null case. On the other hand, for the narrow prior, p T P only improves to 0.088. Thus, even though the model structure is correct, whether we are able to recalibrate the model and verify its validity against the collected data can depend purely on the quantification of uncertainty in the stored value of its parameters, and whether the support of their stored stochastic information contain their true values.

Faulty model structure-
Under the third simulation setting, the underlying model structure itself is incorrect, producing biased model outputs, and as a result, the discrepancy is expected to be centered away from 0 even when updating the prior distributions ℋ Θ for all parameters. Even though that is what we overwhelmingly see in Fig. 8, and that most simulations show a positive bias, one or two curves still appear to be centered around 0. So for those individual simulations, there are not any perceivable difference between the results and model predictions, even though the underlying model structure is faulty. Moreover, this mean shift is often less visible for some communities than others, for example in the case of community 3, which had the lowest prevalence in 2013. Since this scenario was created by inflating the model outputs by 25%, very little discrepancy was introduced in this community, and thus if we were to make a decision about whether the posterior distribution for the discrepancy is centered away from 0 or not, based on individual Bayesian credible intervals at 5% level of error, the hypothesis may be rejected for some communities, but not for others. In addition, for communities with very low prevalence, the sampling error is much higher, and can sometimes dominate the actual discrepancy in the outputs. Under faulty model scenario, p T P values are low, whether we update the priors or not, but one thing that we notice from Table 1 is that updating the priors (recalibrating) always results in higher values of p T P .

Discussion
In this article, we outline a framework for evaluating the predictions of complex epidemiological models and describe experiments that can be used to test them. We propose assessing models by calculating the posterior distribution of the model discrepancy using a Bayesian framework. This allows for rapid identification of communities and/or subgroups for which the model performs poorly, and allows for an overall (or locally for each community) goodness of fit evaluation using the posterior tail probability. This methodology can then be systematically applied to update model priors to improve fit.
We apply this framework to a simple model of a heterosexual HIV epidemic, M W , which was created to investigate how the proportion of early transmission affects the impact of ART on reducing HIV incidence. We test the model under various set of assumptions, including the null scenario when all the assumptions are satisfied, along with alternate scenarios when one or more of those assumptions fail to hold, and discuss the scope of recalibration of its parameters when the model is false.
One interesting finding is the existence of 'uninformative communities', that is, communities that do not provide information on model validity. In our case, community 3 acts in this way. It had very low prevalence initially, and also at each of the subsequent time points. Looking at the discrepancy plots for community 3 in Figs. 4-8, it becomes obvious that even if there is true discrepancy in the model output, due to a faulty model structure or because one or more of its parameter distributions are wrongly estimated, there is little or no visible sign of that in any of these plots, that is, most of the density plots can be seen to be centered around 0. If anything, the posterior distributions show a higher variability than that at other communities, hinting at the uninformativeness of this particular community.
The complexity of epidemiologic models creates a robustness that makes it quite challenging to falsify them. For example, in the posterior distribution plots for the discrepancy for the faulty model scenario (see Fig. 8), we see that even though most simulations show a positive bias, one or two of the curves do appear to be still centered around 0. So for those individual simulations, there are not any perceivable difference between the results and model predictions, even though the underlying model structure is faulty. Moreover, this mean shift is often less visible for some communities than others (like in the case of community 3), and hence if we were to make a decision about whether the posterior distribution for the discrepancy is centered away from 0 or not, based on individual Bayesian credible intervals at 5% level of error, the hypothesis may be rejected for some communities/ subgroups, but not for others. Although these individual inferences are crucial to figure out for which communities the model fails, it is also important to make inference on the overall strength of the model, that is, to aggregate the inferences over these different communities/ subgroups. The measure, p T P , defined in Section 2.3.4 based on Mahalonobis distances, helps mitigate that issue somewhat. Delineating faulty model structure from insufficient calibration is a crucial aspect of model validation. This motivated our focus on these scenarios separately.
The methods developed here are for a fixed point of time at which model outputs are recorded and compared with external data. However, these methods can potentially be extended to incorporate comparison of model projections at multiple time points with real-time data as it becomes available over time. For this, one needs to also account for inter-person correlation and time series effects, and it might be worthwhile to pursue this extension in future research. Also note that the (prior) distribution for λ d , the precision parameter for D M was chosen in a data-driven manner, and the rest of the hyperparameters for the GASP prior on D M were either fixed at pre-specified or data-driven values. It might be an interesting exercise to conduct a sensitivity analysis to explore the effect of this formulation on the analysis, for example by considering non-data-driven priors for λ d with varying width, and see how that affects the results.
Note that there can be many different ways a model can have a faulty structure. A faulty structure for an ODE model means that mechanisms and processes which are essential for the population and transmission dynamics are not properly represented in the model. This can be due to multiple factors, including (i) not capturing essential confounders in the model, (ii) not accounting for human mobility patterns in the ODEs, (iii) unexpected events (for example a pandemic, intervention rollout, changes in standards of care) that might render the model structure outdated. Incorporating those factors in the model requires extensive structural changes which were outside the scope of this project. Instead, we assumed that whenever the model structure is wrong, the final effect will necessarily be seen in the predictions, which will be biased estimates (the bias coming from the faulty structure) of the reality. Hence, we decided to mimic the faulty structure by introducing bias in model outputs, without discussing the source of the bias, which can be due to any of the above reasons, or other sources.
Also note that from Section 2, our methods will continue to work for any model of the form y = M(x, θ), where M(·, ·) is essentially a function with arguments x and θ. Thus, the Bayesian validation framework will work for other types of models such as network models and stochastic agent-based models, as long as their parameter sets x and θ have concrete forms and are estimable (identifiable). However, one limitation the Bayesian validation framework is the fact that the MCMC analysis relies on multiple model runs over different instances of x and θ generated from the priors. ODE models are generally simple to run and computationally less expensive compared to other models (like agent-based models, network models), which might make the validation procedure extremely computationally burdensome, and therefore, infeasible to run in practice.
It is also worthwhile to mention that in the context of HIV transmission models, reduction in HIV incidence is a better choice as an outcome of interest for evaluation of the impact of real-world interventions. Although the HIV transmission model that we use in our simulation settings was designed to project HIV prevalence, our methods are applicable to modeling analyses where incidence is being estimated. In addition, we note that while incidence is a more desirable metric than prevalence, it is also much harder to estimate than prevalence.
Given a model structure and its parameter priors in their current states and the observed data, 'optimal' discrepancy is achieved through re-calibration of its parameters in the Bayesian validation framework but re-calibration is unlikely to reduce the discrepancy to 0. Rather, re-calibration would ensure the smallest possible discrepancy is achieved based on what we knew before (the model) and what we know now (the observed data). To achieve this 'optimal' discrepancy, our suggestion is to re-calibrate the parameters in steps, possibly one at a time, where the order can be determined by running sensitivity analyses to gauge the influence of each of these parameters on the final model outputs, and then updating the parameters in order of importance.
For parameters that have identifiability issues, model validation will have to endure some of the same challenges that model calibration face. For example, multimodality in a parameter distribution might indicate presence of latent groups or other structural issues in the model, which may or may not result in multimodality in the discrepancy distribution as well, all of which are important issues to consider in model validation, and we believe that the Bayesian validation framework can be used to diagnose these issues further. As far as the posterior tail probability is concerned, it should be noted that the posterior tail probability alone should not be the only aspect considered in making a decision on model validation, and other visual aspects like multimodality suggesting presence of latent groups or other issues should be taken into consideration as well. Multimodality in the discrepancy distribution may affect the posterior tail probability, for example in the scenario when the discrepancy distribution is bimodal, and the two modes of the discrepancy distribution lie on either side of 0, and the mean lies in between and closer to 0, we might obtain high p tp indicating good model fit when clearly there are structural issues in the model. This is because p tp in its current form is effective only when the discrepancy distribution is unimodal, as it defined around the mean of the discrepancy distribution assuming unimodality (see Section 2.3.4). So, when the discrepancy distribution is indeed multimodal, it will be more useful to redefine it in terms of the modes of the discrepancy distribution, and run multiple posterior tail probability analyses, one for each mode.
The HIV Modeling Consortium (www.hivmodelling.org) is a large network of mathematical modelers that aims to strengthen the use of models in decision making in HIV. In the past it has brought together models from different groups to quantify and characterize the extent to which different models predict different impacts of the same interventions (Eaton et al., 2015). The project has revealed a large amount of variation in model outputs, which has led to urgent questions being asked about whether models can be validated. To investigate this further, the Consortium developed a protocol for archiving models and their predictions in 2012, and invited researchers to submit their mathematical models for this exercise. With the methods described in this article, these models can be meaningfully tested and validated, and possibly even recalibrated for future use. where S1 denotes circumcised and susceptible men who are not at risk of infection, S2 denotes circumcised and susceptible men who are at risk of infection, S3 denotes uncircumsized and susceptible men who are at risk of infection, S4 denotes uncircumcised and susceptible men who are not at risk of infection, I1 denotes men in the first HIV infection state, I2 denotes individuals in the second (late stage) infection state, AI denotes infected men who will receive the antiretroviral therapy (as part of the intervention package) in 2015, and NAI denotes infected men who will never receive ART or other interventions.
The number of individuals entering the population is denoted by B, while the rate of individuals leaving the population is denoted by μ 1 . The proportion of individuals entering the population who are not at risk is denoted by τ, and c is the proportion of those entering the population who are circumcised men. λ 1 and λ 2 describe the force of infection in circumcised and uncircumcised individuals respectively. σ 1 describes the rate of progression from infected state I 1 to I 2 , and α is the proportion of individuals in state I 2 who will receive ART. μ 2 is the rate of death in late stage positive individuals (I 2 ) and μ 3 is the rate of death for individuals on ART treatment.
The force of infection is described by the following equations: λ 1 = 1 − ε C β(t)η I 1 + I 2 + NAI + 1 − ε A AI S 1 + S 2 + S 3 + S 4 + I 1 + I 2 + AI + NAI λ 2 = β(t)η I 1 + I 2 + NAI + 1 − ε A AI S 1 + S 2 + S 3 + S 4 + I 1 + I 2 + AI + NAI where ε C is the reduction in the risk of acquisition of HIV in circumcised men, β(t) is the rate of transmission, η is used to modulate the force of infection to simulate a behavior change intervention, and ε A is the reduction in the rate of transmission from individuals on treatment. The number of individuals entering the population is described by: where ε represents the population growth rate and P represents the population size 15 years prior to the current time period. P was defined in this way to avoid differences in prevalence caused by interventions affecting the birth rate. The population at the year of the intervention is scaled to ensure that all regions have the same population size at the start of the intervention period.
There is also an additional parameter 'beta trend parameter' which is applied in all communities and is used to scale the rate of transmission β. This is separate from the other parameters so that it may be incorporated into uncertainty analyses. If time t is less than the time when the trend is turned on (t trend ), β is equal to the initial value specified, if time is greater than t trend , β is modified using the beta trend parameter (ω).
If t ≤ t trend , β(t) = β, If t > t trend , β(t) = β 1 − ω t − t trend − 1 where t trend corresponds to the year 2015 and as such this modulation only occurs once the fitting has been completed (we fit the model to 2013 prevalence). Both β and ω are the same across all communities. Posterior tail probabilities (p T P ), as denoted by the colored regions, under the Null and an hypothetical Faulty model structure scenario for discrepancy at a single community: (i) Under the Null, the mean of the discrepancy distribution is 0 itself, so any random sample from the discrepancy distribution will have Mahalonobis distance from its center greater than that between 0 and the center, and thus the posterior tail probability region will cover the entire discrepancy distribution, as shown by the shaded region A. (ii) Under the alternative, when the center of the discrepancy is at K units away from 0, the Mahalonobis distance between a randomly chosen sample and the center will be greater than that between the center and 0 if the sample belongs to either of the shaded regions A and C, and will be lesser than that between the center and 0 if it belongs to region B. Thus the posterior tail probability region will cover regions A and C. Model Diagram, S1 denotes circumcised and susceptible men who are not at risk of infection, S2 denotes circumcised and susceptible men who are at risk of infection, S3 denotes uncircumsized and susceptible men who are at risk of infection, S4 denotes uncircumcised and susceptible men who are not at risk of infection, I1 denotes men in the first HIV infection state, I2 denotes individuals in the second (late stage) infection state, AI denotes infected men who will receive the antiretroviral therapy (as part of the intervention package) in 2015, and NAI denotes infected men who will never receive ART or other interventions.  Setting 1: Posterior distribution of discrepancy in the null scenario for 10 MCMC chains, each run with a different realization of the observed data.