A Bayesian model of acquisition and clearance of bacterial colonization incorporating within-host variation

Bacterial populations that colonize a host can play important roles in host health, including serving as a reservoir that transmits to other hosts and from which invasive strains emerge, thus emphasizing the importance of understanding rates of acquisition and clearance of colonizing populations. Studies of colonization dynamics have been based on assessment of whether serial samples represent a single population or distinct colonization events. With the use of whole genome sequencing to determine genetic distance between isolates, a common solution to estimate acquisition and clearance rates has been to assume a fixed genetic distance threshold below which isolates are considered to represent the same strain. However, this approach is often inadequate to account for the diversity of the underlying within-host evolving population, the time intervals between consecutive measurements, and the uncertainty in the estimated acquisition and clearance rates. Here, we present a fully Bayesian model that provides probabilities of whether two strains should be considered the same, allowing us to determine bacterial clearance and acquisition from genomes sampled over time. Our method explicitly models the within-host variation using population genetic simulation, and the inference is done using a combination of Approximate Bayesian Computation (ABC) and Markov Chain Monte Carlo (MCMC). We validate the method with multiple carefully conducted simulations and demonstrate its use in practice by analyzing a collection of methicillin-resistant Staphylococcus aureus (MRSA) isolates from a large, recently completed longitudinal clinical study. An R implementation of the method is freely available at: https://github.com/mjarvenpaa/bacterial-colonization-model.

This shows that the acceptance probability does not depend on the realisation of b* (i.e. the sampled λ), and that the sampling procedure for (n_eff, µ, λ) used in the main text is equivalent to performing one blocked Metropolis-Hastings update with the proposal r in Eq 1. This ensures that the full algorithm is a valid Gibbs sampler. (In fact, similar arguments are used to show that the Gibbs sampler is in general a special case of Metropolis-Hastings; see Theorem 10.13 in Monte Carlo Statistical Methods by Robert and Casella, 2004.) Empirical experiments showed that this sampling approach produces chains with better mixing than other possible approaches, such as the straightforward implementation where (n_eff, µ) and λ are sampled separately. The algorithm might remain valid even if a new λ were sampled also when the proposed (n_eff, µ) is rejected, but we do not investigate this here.
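The blocked update above can be illustrated on a toy target (not the model of this paper): for a joint density p(x, λ) = p(x) p(λ | x), proposing x' by a random walk and drawing λ' exactly from p(λ | x') makes the conditional factors cancel, so the acceptance ratio depends only on the marginal p(x). A minimal sketch, with an arbitrary N(0, 1) marginal and N(x, 1) conditional chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_marginal(x):
    # toy marginal target: x ~ N(0, 1) (up to an additive constant)
    return -0.5 * x**2

def sample_conditional(x):
    # exact draw from the blocked component's conditional: lambda | x ~ N(x, 1)
    return rng.normal(x, 1.0)

# blocked Metropolis-Hastings: propose (x', lambda') jointly, with
# lambda' ~ p(lambda | x'); the conditional densities cancel in the
# acceptance ratio, which therefore involves only p(x)
x, lam = 0.0, 0.0
xs = []
for _ in range(50_000):
    x_prop = x + rng.normal(0.0, 1.0)      # symmetric random walk on x
    lam_prop = sample_conditional(x_prop)  # exact draw, cancels in the ratio
    if np.log(rng.uniform()) < log_marginal(x_prop) - log_marginal(x):
        x, lam = x_prop, lam_prop          # lambda is accepted only with x
    xs.append(x)

xs = np.array(xs[5_000:])                  # drop burn-in
print(xs.mean(), xs.var())                 # should be close to 0 and 1
```

Note that, exactly as argued above, a rejected x' discards the freshly drawn λ' as well; keeping it would require the separate justification mentioned in the text.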

Gibbs sampler step for η_i
We derive the conditional density of the parameter η_i, i = 1, . . . , N. Neglecting all the terms in the unnormalised posterior that do not depend on the current η_i yields the conditional density up to a constant. We show that the conditional density in the last case (z_i2 = 1 and d_i > 0) is a finite Gamma mixture. Because d_i is an integer, applying the binomial formula and some algebra gives the mixture representation, where c is a normalisation constant and Γ(·) is the Gamma function. The weights are clearly positive and must sum to 1 since the formula must be a valid pdf; thus c = (∑_{j=0}^{d_i} w̃_j)^{-1}, where w̃_j is the unnormalised weight obtained from Eq 12 by setting c = 1. The weight formula also clearly holds if d_i = 0, in which case w_0 = 1. There appears to be no simple formula for the normalisation constant c for general k; however, we do not actually need to normalise the weights at all if we proceed as discussed in the following paragraph.
We then use the fact that if a_1, . . . , a_n are the logarithms of the unnormalised bin probabilities of a Categorical distribution, and if the random variables g_1, . . . , g_n independently follow the standard Gumbel distribution, which has the cdf F(g) = exp(−exp(−g)), then argmax_{j=1,...,n} {g_j + a_j} has the Categorical distribution with bin probabilities exp(a_j)/∑_{k=1}^{n} exp(a_k), j = 1, . . . , n. We use this fact also to generate random numbers from the conditional density of z_i.
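This Gumbel-max trick is easy to verify numerically; the following sketch (with arbitrary example log-weights) draws categorical samples from unnormalised log bin probabilities and compares the empirical frequencies to the softmax probabilities:

```python
import numpy as np

rng = np.random.default_rng(1)

# unnormalised log bin probabilities a_1, ..., a_n (arbitrary example values)
a = np.array([0.3, -1.2, 2.0, 0.0])
probs = np.exp(a) / np.exp(a).sum()          # normalised bin probabilities

# Gumbel-max: argmax_j (g_j + a_j), with g_j ~ standard Gumbel,
# is Categorical with bin probabilities `probs`
n_draws = 200_000
g = rng.gumbel(size=(n_draws, a.size))
draws = np.argmax(g + a, axis=1)

freq = np.bincount(draws, minlength=a.size) / n_draws
print(np.round(probs, 3), np.round(freq, 3))  # the two should agree closely
```

The appeal, as noted above, is that no normalisation of the weights is ever needed: only the unnormalised logarithms enter the argmax.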
Sampling from the conditional density of η_i can now be done conveniently using Algorithm 1, where w̃_j are the unnormalised weights. This algorithm is fast (unless d_i is large) because, unlike standard alternatives such as inverse transform sampling, it requires no numerical root finding, evaluation of special mathematical functions, or integration.

Prior distribution for t_0
We specified the prior density for t_0 using hierarchical modelling to be able to learn about these values. However, to select the prior hyperparameters k, α and β so that they reflect our prior knowledge about reasonable values of t_0i, we need to obtain the probability law of t_0. The joint prior pdf of t_0 can be obtained by marginalising λ, where 1_N is an N-dimensional vector with all elements equal to 1 and B(·) denotes the multivariate Beta function. The marginals can be obtained from Eq 20 by setting N = 1, and it is seen that they are compound Gamma distributions. Specifically, the mean and the variance of t_0i are

E(t_0i) = kβ/(α − 1) (for α > 1),    V(t_0i) = kβ²(k + α − 1)/((α − 1)²(α − 2)) (for α > 2).

The covariance and the correlation coefficient are obtained as follows. Consider t_0i, t_0j with i ≠ j. Writing E(t_0i t_0j) as a double integral, we recognise the integrand as the unnormalised pdf in Eq 20 (but with N = 2, k + 1 in place of k and α − 2 in place of α, which also implies that the integral is finite if α > 2) and use the fact that this density integrates to 1. It now follows that

Cov(t_0i, t_0j) = k²β²/((α − 1)²(α − 2)),    Corr(t_0i, t_0j) = k/(k + α − 1),

which hold for α > 2. If k ≥ 1, then the mode of this density is t_0 = β(k − 1)/(α + N) 1_N (Eq 30). The derivation of this fact goes as follows. If k ∈ (0, 1), then the density is unbounded as any t_0i → 0 and the mode does not exist. If k = 1, then the mode is clearly t_0 = 0 and it is unique. In the case k > 1 the mode can be found by differentiating the logarithm of the density. Setting the gradient to zero shows that, for all j = 1, . . . , N,

(Nk + α) t_0j = (k − 1)(β + ∑_{i=1}^{N} t_0i).

Symmetry suggests that the solution is the same for all elements j and, clearly, the point t_0 = β(k − 1)/(α + N) 1_N satisfies these equations. To confirm that this is the only solution, we write the equations as the linear system ((Nk + α) I_N − (k − 1) J_N) t_0 = (k − 1) β 1_N, where I_N is the N × N identity matrix and J_N is the N × N matrix with all elements equal to 1.
Now, each row sum (with the diagonal element excluded) satisfies (N − 1)(k − 1) < (N − 1)k + α + 1 = (Nk + α) − (k − 1), so the coefficient matrix is diagonally dominant and thus non-singular. If k > 1, then the density is clearly zero at the boundaries where t_0i = 0 for some i, and so the mode does not lie on the boundary of the support of the density. If k ≥ 1, the mode is thus unique and is given by Eq 30.
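These marginal moments can be checked by forward simulation. Assuming the rate parametrisation of the hierarchy, λ ~ Gamma(α, β) and t_0i | λ i.i.d. Gamma(k, λ), the marginal mean is kβ/(α − 1) and the pairwise correlation is k/(k + α − 1); a Monte Carlo sketch with arbitrary example hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, beta, k, N = 5.0, 2.0, 3.0, 2      # example hyperparameters
M = 400_000                               # Monte Carlo replications

# hierarchical prior: lambda ~ Gamma(alpha, rate=beta), then
# t_0i | lambda i.i.d. Gamma(k, rate=lambda) (NumPy uses scale = 1/rate)
lam = rng.gamma(alpha, 1.0 / beta, size=M)
t0 = rng.gamma(k, 1.0 / lam[:, None], size=(M, N))

mean_mc = t0[:, 0].mean()
corr_mc = np.corrcoef(t0[:, 0], t0[:, 1])[0, 1]

mean_exact = k * beta / (alpha - 1)       # marginal mean of t_0i
corr_exact = k / (k + alpha - 1)          # correlation induced by shared lambda
print(mean_mc, mean_exact, corr_mc, corr_exact)
```

The positive correlation between t_0i and t_0j arises purely from the shared λ, which is exactly why the hierarchy lets the data inform all the t_0i jointly.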

Prior predictive distribution
We can analytically derive the mean and the variance of the prior predictive distribution of a new distance d̃, corresponding to the time interval t̃ between the sequenced genomes, under the different-strain case. These facts are used to guide the selection of the prior hyperparameters. The full prior density can be visualised using simulation. Consider the prior µ ∼ U([a_µ, b_µ]) and the priors for t_0 and λ as in the main text. The prior predictive mean and variance are then given by Eqs 33 and 34, where l = (a_µ + b_µ)/2 and h = (a_µ² + a_µ b_µ + b_µ²)/3. For Eq 33 we must have α > 1 and for Eq 34 α > 2. The derivation of these facts goes as follows. From Eq 5 of the main text we obtain E(d̃ | µ, t_0i) = µ(2 t_0i + t̃). Using the law of total expectation shows that E(g(d̃)) = E(E(g(d̃) | µ, t_0i)). These two facts now imply Eq 33. The derivation of the variance goes similarly. We use the property of the Poisson distribution that E(d̃(d̃ − 1) | µ, t_0i) = µ²(2 t_0i + t̃)². Using the law of total expectation (with g(d̃) = d̃(d̃ − 1)) and the formulas for the mean and the variance of t_0i, we obtain E(d̃(d̃ − 1)). The final result now follows from the identity V(d̃) = E(d̃(d̃ − 1)) + E(d̃) − E(d̃)² and further simplification.
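The prior predictive mean can likewise be checked by forward simulation. Under the stated hierarchy (µ uniform and independent of t_0, λ ~ Gamma(α, β) in the rate parametrisation, t_0 | λ ~ Gamma(k, λ), d̃ | µ, t_0 ~ Poisson(µ(2 t_0 + t̃))), the mean is l(2kβ/(α − 1) + t̃) with l = (a_µ + b_µ)/2; the hyperparameter values below are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
a_mu, b_mu = 0.5, 1.5                     # uniform prior bounds for mu
alpha, beta, k = 5.0, 2.0, 3.0            # hyperparameters for lambda and t_0
t_tilde = 4.0                             # time between the two sequenced genomes
M = 300_000

# forward-simulate the hierarchy: mu, lambda, t_0, then the Poisson distance
mu = rng.uniform(a_mu, b_mu, size=M)
lam = rng.gamma(alpha, 1.0 / beta, size=M)
t0 = rng.gamma(k, 1.0 / lam)
d = rng.poisson(mu * (2.0 * t0 + t_tilde))

l = (a_mu + b_mu) / 2.0                   # E(mu)
mean_exact = l * (2.0 * k * beta / (alpha - 1) + t_tilde)
print(d.mean(), mean_exact)
```

The same simulation also visualises the full prior predictive density of d̃, as mentioned in the text (e.g. via a histogram of `d`).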

Additional results from ABC inference
In this short section we show results obtained with an alternative discrepancy function in the ABC analysis of the external data D_0. The experimental details are the same as before, except that we use the Euclidean distance instead of the L1 distance in Eq 8 to compare the simulated and observed distributions. The results are shown in Fig A. When the data from patients A-M are excluded (panel B of Fig A), the resulting posterior is similar to before, although with a slightly elevated mutation rate. In the full-data case (panel A), the estimated mutation rate is even larger, although the shape of the density remains rather similar. We believe this happens because, compared to the L1-based discrepancy, the Euclidean distance gives substantially more weight to the few cases in the data with relatively high observed distances; in the fitting this is compensated by a larger mutation rate. The L1-based discrepancy is less affected by such outliers. Consequently, we consider the robust results based on the L1 discrepancy more reasonable and thus chose the corresponding ABC posterior as the prior for the mixture model.
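The outlier sensitivity described above is easy to see in a toy comparison of the two discrepancies; the binned counts below are hypothetical, with the last bin of the observed distribution playing the role of a rare large-distance case:

```python
import numpy as np

# hypothetical binned distance distributions (observed vs simulated);
# the last bin of `observed` acts as an outlying large-distance case
observed  = np.array([40.0, 30.0, 15.0, 5.0, 10.0])
simulated = np.array([42.0, 31.0, 14.0, 6.0,  1.0])

diff = observed - simulated
l1 = np.abs(diff).sum()                   # L1 discrepancy
l2 = np.sqrt((diff ** 2).sum())           # Euclidean (L2) discrepancy

# share of each discrepancy attributable to the outlier bin
share_l1 = np.abs(diff[-1]) / l1
share_l2 = diff[-1] ** 2 / (diff ** 2).sum()
print(share_l1, share_l2)                 # the L2 share is markedly larger
```

Because squaring amplifies the largest deviation, an ABC fit under the Euclidean discrepancy shifts parameters (here, the mutation rate) to shrink that single bin, which is the behaviour observed in Fig A.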

Details on computing the acquisition and clearance rates
We first describe how to obtain the uncertainty estimates for Table 2 of the main text. We denote the same-strain probability in case 1) by s ∈ [0, 1]. The number of cases of type 1) is n, and the indicator of whether data point i represents the same strain is x_i ∈ {0, 1}, i = 1, . . . , n. Because the different STs can be distinguished from each other, n is considered known and fixed. The number of same-strain cases l = ∑_{i=1}^{n} x_i is assumed to follow a Binomial distribution, l ∼ Bin(l | s, n). Using the uninformative prior s ∼ U([0, 1]), we then obtain the posterior p(s | l) = Beta(s | l + 1, n − l + 1), which follows from a standard result of Bayesian modelling.
However, we do not know with certainty whether each data point represents the same strain; we only have the posterior probabilities (obtained from the MCMC samples of p(z | D, D_0)), which complicates the analysis. Nevertheless, we see that

p(s | D, D_0) = ∑_{l=0}^{n} Beta(s | l + 1, n − l + 1) p(l | D, D_0),    (Eq 38)

which shows that the posterior of s is a Beta mixture density whose weights p(l | D, D_0) are the posterior probabilities of the possible numbers of same-strain cases; these can be computed from the existing MCMC chain of the mixture model. We immediately obtain E(s | D, D_0) = ∑_{l=0}^{n} (l + 1)/(n + 2) p(l | D, D_0), and a credible interval can be computed numerically by inverting the cumulative distribution function corresponding to Eq 38. The above computations apply similarly to the cases other than 1).
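Given the weights p(l | D, D_0), the Beta-mixture posterior of s can be summarised by direct Monte Carlo instead of inverting the mixture cdf; a sketch with hypothetical weights (in practice these come from the MCMC chain):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10                                    # number of cases (known and fixed)
# hypothetical posterior probabilities p(l | D, D_0), l = 0, ..., n
p_l = rng.dirichlet(np.ones(n + 1))

# posterior mean of s from the Beta mixture: E(s) = sum_l (l+1)/(n+2) p(l)
mean_exact = np.sum((np.arange(n + 1) + 1) / (n + 2) * p_l)

# Monte Carlo from the mixture: draw l, then s | l ~ Beta(l+1, n-l+1)
M = 400_000
l = rng.choice(n + 1, size=M, p=p_l)
s = rng.beta(l + 1, n - l + 1)

ci = np.percentile(s, [2.5, 97.5])        # 95% credible interval for s
print(s.mean(), mean_exact, ci)
```

Sampling from the mixture avoids any root finding, at the cost of Monte Carlo error in the interval endpoints; the cdf-inversion route described in the text gives the interval exactly.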
As the final part of this section, we describe in detail how the acquisition and clearance rates and their uncertainty estimates in Table 3 of the main text are computed. If the samples are consecutive and not negative, the definitions A-C of Table 3 are used directly. If negative samples are involved, we must deal with potential "false negatives". Because we cannot know for certain whether each negative sample is a false negative (or, more generally, what exactly happens between any pair of samples), and to keep the analysis simple, we need to make some strict assumptions. Table A contains the rules used to determine the acquisition and clearance events in cases that involve negative samples. For example, the first line of Table A shows the case where the last sample is negative while the preceding sample is not. We assume that in this case there is always one clearance event (i.e. D in Table 3), although it is also possible that the negative sample is a false negative and the patient was in fact still colonised. Additionally, for the events on lines 3-6, depending on the posterior probability of the same strain in each case, we assume either that the strain was not cleared (i.e. each negative sample is a false negative) or that there was one clearance and one acquisition event.
To take into account the posterior uncertainty in the same- and different-strain classifications, we compute the quantities A-E for each parameter value sampled from the posterior and then compute the acquisition and clearance rates for each of these values. (In fact, unlike the other quantities, C is not random, since the STs can be determined with certainty.) Finally, the mean and the 95% quantile interval are computed numerically from the resulting set of sampled acquisition and clearance rate values.
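The propagation of classification uncertainty into the rates can be sketched as follows; the per-interval same-strain probabilities, the person-time, and the one-event-per-"different strain"-pair counting rule below are hypothetical simplifications of the actual definitions in Tables 3 and A:

```python
import numpy as np

rng = np.random.default_rng(5)
# hypothetical posterior same-strain probabilities for consecutive sample pairs
p_same = np.array([0.95, 0.10, 0.80, 0.50, 0.99])
person_time = 12.0                        # total follow-up time (arbitrary units)
M = 200_000

# for each posterior draw, sample the same/different classifications and count
# one clearance event for every pair classified as "different strain"
same = rng.uniform(size=(M, p_same.size)) < p_same
n_events = (~same).sum(axis=1)
rates = n_events / person_time

mean_rate = rates.mean()
lo, hi = np.percentile(rates, [2.5, 97.5])
print(mean_rate, (lo, hi))
```

The resulting spread of `rates` reflects only the classification uncertainty; in the full analysis each draw additionally uses a different posterior parameter sample, so the reported intervals combine both sources of uncertainty.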