Detecting within-host interactions from genotype combination prevalence data

Parasite genetic diversity can provide information on disease transmission dynamics but most methods ignore the exact combinations of genotypes in infections. We introduce and validate a new method that combines explicit epidemiological modelling of coinfections and regression Approximate Bayesian Computing (ABC) to detect within-host interactions. Using genital infections by different types of Human Papillomaviruses (HPVs) as a test case, we show that, if sufficiently strong, within-host parasite interactions can be detected from epidemiological data and that this detection is robust even in the face of host heterogeneity in behaviour. These results suggest that the combination of mathematical modelling and sophisticated inference techniques is promising to extract additional epidemiological information from existing datasets.

Rank distribution for HPV infections. Black dots show data from 5412 sexually active women in the Costa Rica Vaccine Trial reported by [8]. Lines show maximum likelihood fits performed using the bbmle package in R [31].
HPV coinfections may interfere with chronic infection and cancer. For example, when 32 high-risk HPV types coinfect with low-risk types, time to diagnosis is longer and the 33 risk of progression to cancer is lower [21]. To summarise, we do know that HPV types 34 may interact within hosts but it is unclear whether these interactions are sufficiently 35 strong to be detected at the population level. 36 Binary or rank models 37 Most epidemiological models that allow for parasite genotypes to coexist within a host 38 only allow for up to two genotypes per host and do not allow for cotransmission, 39 although there are exception for both [22][23][24][25]. In spite of these simplifications, these 40 models have been instrumental in epidemiology [26]. 41 Studies on macro-parasites have long been focusing on high multiplicity of host 42 infection [27]. They showed that the distribution of the number of macro-parasites per 43 host, which we here refer to as the 'rank' of an infection, can provide information 44 regarding the contact structure within the host population. In absence of heterogeneity 45 of any kind, one would expect rank distributions to follow a Poisson distribution. 46 Interestingly, in many populations, the number of macro-parasites per host tends to 47 follow a negative-binomial distribution, which is often interpreted as evidence for some 48 sort of host population structure or a specific functional response [28][29][30]. 49 For microparasites, similar studies have been developed, where the rank of the 50

3/27
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which was not . http://dx.doi.org/10.1101/256586 doi: bioRxiv preprint first posted online Jan. 31  infection corresponds to the number of genotypes detected in a host. For example, 51 Chaturvedi et alii [8] showed that a Poisson distribution can be rejected for HPV This is consistent with the result of the study that identifies the 'number of lifetime sex 56 partners' as the cofactor the most strongly associated with being infected by multiple 57 types instead of a single type [8].

58
Parasite combination prevalences 59 Intuitively, there should be more information in the prevalence of each combination of 60 genotypes than in the rank prevalence. With 5 circulating genotypes, there are only 6 61 host ranks whereas there are 32 combinations (Figure 2A). Some studies have therefore 62 used combination prevalence data to detect interactions. Their approach was to 63 compare the observed prevalence of each combination to an expected value derived from 64 the total prevalence of each genotype. required time series and not just cross-sectional data (see [33] on how to infer interaction parameters from time series using particle filtering techniques). However, 70 the restricted number of strain they used also potentially limited the power of their 71 conclusion (3 ranks and 2 total prevalences versus 4 combinations).

72
Although longitudinal data is generally richer for epidemiological inference [34], it is 73 not always available and we often need to deal with equilibrium prevalences. To analyse 74 such data, the study by Vaumourin et alii [35] considered systems with a larger number 75 of genotypes using a variety of existing techniques (generalised chi-square, network, and 76 multinomial GLM approaches) and developed a new association screening approach that 77 has the advantage to identify and rank combinations based on their deviation from the 78 expectation (see the Methods). To test the power and accuracy of each method, they 79 used simulated distributions but without an explicit epidemiological model.

80
Inference using explicit modelling 81 We wish to assess whether, in a setting where n prevalent parasite genotypes or species 82 are circulating, the prevalence of the 2 n coinfected host classes gives us more 83 information about the way parasites spread and interact within their hosts than the 84 n + 1 rank prevalences. More precisely, our hypothesis is that modelling epidemiological 85 dynamics explicitly can allow us to distinguish between within-host interactions and 86 other types of heterogeneities generated from the host contact structure. Indeed, it is 87 known that for many infectious diseases, especially sexually-transmitted ones [36], some 88 hosts may act as 'super-spreaders' [37]. Intuitively, these hosts should be more exposed 89 and therefore have higher infection ranks independently of any features of the parasites 90 themselves (as mentioned in the case of HPV above [8]).

91
HPV offers an ideal setting to test these questions because coinfections are frequent 92 and rich data exists. Based on the literature, we use our model to evaluate our ability 93 to test the hypothesis that oncogenic HPV types, also called 'high-risk' (HR) types, 94 have a competitive advantage (or disadvantage) when competing with non-oncogenic 95 types or 'low-risk' (LR) types that tend to cause warts. Given that the probability of 96 HPV transmission per sexual contact is high [38], we assume that any interaction 97 between HR and LR types takes place through the recovery rate. 98

5/27
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which was not . http://dx.doi.org/10.1101/256586 doi: bioRxiv preprint first posted online Jan. 31, 2018; epidemiological dynamics. This is made possible by a recent analytical framework that 100 can handle an arbitrary number of types in a Susceptible-Infected-Susceptible (SIS) 101 model [25]. In order to assess the ability to infer interactions from the observed 102 coinfection classes, we use a regression-based Approximate Bayesian Computing (ABC) 103 approach [39,40]. We show that our method performs well on simulated data and that 104 existing methods that lack an explicit epidemiological setting cannot distinguish 105 genotype interaction from general host heterogeneity.

107
Associations and interaction strength 108 First we use existing methods developed to detect significant associations between 109 parasites from coinfection data. These have been tested by generating distributions but 110 without any epidemiological model.

111
The chi-square approach exhibits a slightly positive correlation between the 112 probability that the test is significant and the intensity of interaction between types 113 (estimated by fitting the data using a logistic regression model, Fig 3A). However, even 114 with only 1,000 individuals sampled (in black), most of the observed prevalence 115 distributions tend to deviate from the expected one. With 5,000 hosts sampled or more 116 (in gray), most combinations lead to significant tests ( Fig 3A).

117
The GLM approach seems to be more robust to sample size ( Fig 3B) and the 118 positive association between interaction intensity and test significance only occurs if 119 5,000 or 10,000 individuals are sampled. As for the chi-spare approach, most of the 120 associations remain significant. 121 Vaumourin et al. [35] cleverly proposed to analyse coinfection combination data 122 using network-based approaches. For the combination network, we found that 123 non-significant runs exhibited higher interaction intensity than significant runs, which 124 was unexpected (Fig S3A). We also found a slight decrease in connectance with 125 increasing interaction intensity, which could be consistent with some combinations being 126 removed due to genotype interaction ( Fig 3C). Inferring genotype interactions from the distribution of the combination prevalences using the chi-square (A), the GLM (B), the network (C and D) and the association screening (E and F) approaches. The grayscale indicates the size of the target dataset (100 targets for the network approach and 1000 for the others). Lines show a generalised linear model fit. In A and B the data was scattered vertically for clarity. C and D show the combination and parasite network connectances only when significant. E shows the number of significant interactions and F the fraction of correct predictions based on the correlations from the learning dataset (see Fig S1). Parameter values are drawn in the same prior as the ABC (see Fig S3).

7/27
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which was not . http://dx.doi.org/10.1101/256586 doi: bioRxiv preprint first posted online Jan. 31, 2018; For the parasite network, when only 1,000 hosts were sampled significant runs 128 exhibited strikingly high interaction strengths ( Fig S3B). We also find an increase in 129 connectance with interaction strength, but only when sampling 5,000 or 10,000 hosts 130 ( Fig 3D). This result should be interpreted with caution since parasite network 131 connectance was rarely significant (2, 10 and 15 of the 100 test runs were significant for 132 1,000, 5,000 and 10,000 hosts sampled respectively). In comparison, combination 133 connectance was significant for 21, 31, 32 of the 100 runs depending on sampling 134 intensity.

135
Finally, the association screening approach reports an increase in the number of 136 significant associations (i.e. more or less than expected) with host sample size ( Fig 3E). 137 By computing equilibrium prevalences for 1,000 parameter values, we estimated the 138 correlation between interaction intensity and the prevalence of each host combination 139 ( Fig S1). This allowed us to determine whether the prediction made by the association 140 screening algorithm was correct or not. The fraction of predictions that match our 141 prediction is generally close to 50% with a slight increasing trend with interaction

8/27
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which was not . http://dx.doi.org/10.1101/256586 doi: bioRxiv preprint first posted online Jan. 31, 2018; Epidemiological model: single runs 146 We first show the fraction of each host combination for two scenarios, one with 147 moderate interactions (parameter set #2 with k ≈ 0.02, Fig 4A) and another with 148 strong interactions (parameter set #7 with k ≈ 0.25, Fig 4B). When the interactions are 149 weak, we clearly see the different ranks with uninfected hosts on the top, then a row 150 with the five singly infected host types, etc. When interaction strength increases, these 151 ranks become impossible to distinguish. Fig 4A also illustrates that each parasite genotype in this model has its own infection duration, since they do not all have the 153 same prevalence in single infection. Importantly, we only show the total prevalence of 154 each combination but these may differ among each of the two host types (prevalence is 155 higher in the high rank combinations in the 'superspreader' population). Our goal is to 156 infer the intensity and sign of the interaction between HR and LR genotypes (parameter 157 k) in a heterogeneous host population. 158

9/27
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which was not . http://dx.doi.org/10.1101/256586 doi: bioRxiv preprint first posted online Jan. 31, 2018; searching a prior distribution in the parameter space. This distribution is shown for all 160 the key parameters in Fig S2. We drew 50,001 parameter sets in this prior, used them 161 to simulate equilibrium densities (as shown in Fig 4). We assessed the performances of 162 the ABC approach following a leave-one-out cross-validation procedure, where we 163 treated one simulation as observed data and the remaining as learning data.
164 Figure 5 shows the results for parameter set #3 and illustrates how using more 165 summary statistics helps to narrow the distribution from the prior for a dataset with 166 10,000 individuals. If we only use the ranks, we do narrow the prior distribution but its 167 width remains large enough such that 0 (no interaction) cannot be ruled out from the 168 95% Highest Posterior Density (HPD), which can be seen as a credibility interval 5B). 169 Using the combinations in addition to the ranks as summary statistics for the ABC 170 allows us to narrow this interval and to exclude 0 from the 95% confidence interval (5C). 171 Using additional information, for example being able to distinguish between the two 172 host types, would narrow it even more as we will see below.  Logically, the width of the 95% HPD for the estimate of interaction intensity 181 decreased with the number of host sampled (Fig 6A). On the same figure, we see that 182 including more summary statistics also decreased the width of this interval, especially 183 for an infinite sample size.

184
In terms of the relative error regarding the interaction parameter (k), we found a 185 similar effect with a lower error when more host were sampled or more summary 186 statistics were involved (Fig 6B). However, using the combinations in addition to the 187 ranks only improved the analysis if enough hosts were sampled (5,000 or 10,000). In

189
If we focus on the runs for which we could not exclude an absence of interaction 190 (i.e. 0 lied within the 95% HPD), we see that the number of such runs decreased as the 191 number of summary statistics increased (Fig S6). We also see that, in these runs, Finally, the proportion of errors, that is when the target value was outside the 95% 198 HPD was close to the expected 5% (6.25% with the ranks and 5% with both the ranks 199 and the combinations) but it slightly increased with interaction strength (Fig 6D). 200

11/27
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which was not . http://dx.doi.org/10.1101/256586 doi: bioRxiv preprint first posted online Jan. 31, 2018; Multiple infections are known to affect the virulence of an infection [41], the spread of 202 infectious diseases [5] and their evolution [6]. This is due to the fact that when sharing 203 a host, parasites can interact in various ways [42]. The goal of this study was to 204 determine to what extent the prevalence of parasite combinations can inform us on such 205 interactions.

206
By generating prevalence data from an mechanistic epidemiological model, we were 207 able to first test the power of existing heuristic methods based on the distribution of 208 classes. Overall, these results show that these methods are limited. This is largely due 209 to the fact that we introduced host heterogeneity in the model, which affects the 210 distribution of host classes in a way that cannot be distinguished from interaction 211 between parasite genotypes. This therefore corroborates a limitation often mentioned in 212 such studies, which is that departures from expected distributions need not be due to 213 interaction between genotypes. 214 We then used an ABC approach to infer parameters from the model. We show that 215 this yields more consistent results than existing heuristic methods. Quite expectedly, 216 the accuracy of the method increases with the number of hosts sampled. We also show 217 that using the prevalence of all the combinations of host classes tends to decrease the 218 error made compared to using only the prevalence of infection ranks. Finally, adding 219 knowledge about host type ('super-spreader' or 'normal-spreader') can further improve 220 the power of the inference.

221
The fact that decent results can be obtained by only using the rank of the infections 222 may seem surprising considering the difficulty from existing models to infer interactions. 223 One reason for this could be that we have a mechanistic model, which limits the range 224 of rank distributions that can be explored. Another reason is that we here use the same 225 model to generate the target dataset and the learning datasets, which facilitates the 226 ABC inference. 227 We do not report it here but the accuracy of the inference varied widely across 228 parameters. For the interaction parameter (k), the inference reduced the initial 95% 229 HPD of the prior by 66%. In comparison, this was less than for the transmission 230 probability (β, 75%), but much better than for the assortativity parameter (a, 45%), 231

12/27
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.

232
There are several ways to extend this framework. One would be to use more 233 powerful regression techniques, such as neural networks. However, these may be more 234 difficult to parameterise. Furthermore, even though it contains several parameters, our 235 model remains relatively simple compared to the power of these algorithms. One 236 possibility to address this could be to use a agent-based model with sophisticated agent 237 behaviours to generate a richer dataset. This would be useful in itself to generate test 238 runs with known parameter values to further test the power of our method on more 239 noisy data. It would also allow to control for biases related to the contact network 240 structure between hosts and the dynamical aspect of sexual partnerships that have been 241 shown to interfere with the detection of coinfection interactions [43].

242
Finally, the next step is, of course, to test this model using actual epidemiological 243 data. We here used HPV as a case study but it would be possible to study coinfections 244 between different parasite species, although this might require substantial modifications 245 in the model to capture the life-history of each parasite. Even in the case of HPV, 246 analysing real data will require to add several processes we chose to ignore here. First, 247 HPV detection tests may exhibit cross-reactivity between HPV types, thus inflating the 248 prevalence of some combinations. This effect if well described and can be handled for 249 each detection test. Second, when hosts are infected by many HPV types, some of these 250 may not be detected, thus decreasing the prevalence of high-rank infections. This effect 251 is more subtle and would require to be inferred in the model.

252
Overall, ABC and machine learning allow us to extract the information from the 253 equilibrium prevalence of all the combinations of genotype prevalences. Therefore, 254 combining coinfection modelling with epidemiological data can bring new elements to 255 the controversy regarding the importance of interactions between HPV types. 256

13/27
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which was not . http://dx.doi.org/10.1101/256586 doi: bioRxiv preprint first posted online Jan. 31, 2018; The epidemiological model 258 The model is based on the deterministic ODE-based framework introduced by Sofonea 259 et al. [25] that allows for an arbitrary number of parasite genotypes to circulate in a 260 host population without assuming any particular infection pattern (see [4] for the 261 importance of this relaxation). Furthermore, the framework enables cotransmission in  [20,44]. Given that we focus on HPV infections in young adults, we 275 neglect infection-induced mortality. 276 Mathematically, the dynamics can be captured in a compact form using the master equation [25]: where y is the vector of densities of the 2 n host classes, denotes the Hadamard

14/27
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which was not . http://dx.doi.org/10.1101/256586 doi: bioRxiv preprint first posted online Jan. 31, 2018; simplicity, we neglect host demography (births and deaths) and assume that the host 283 population size is constant. Given that infected hosts do not always sero-convert and 284 that natural immunity is much lower than vaccine-induced immunity [15], we neglect 285 immunisation in the model.

286
Population structure The model was enhanced by splitting the host population into two sub-populations that differ in their contact rates ('super-spreader' versus 'normal-spreader' hosts) as shown in Figure 2B. Contact between the two sub-populations follows a classical pattern based on the assortment (a) between host types, the proportion of each host type (p 1 = p and p 2 = 1 − p) and their activity rates (equal to c 1 = 1 and c 2 = h, with h ≥ 1). Overall, the contact rate between a 'recipient' individual from sub-population j and a 'donor' individual from sub-population i is where δ ij is the Kronecker delta and h is the difference in activity between the two host 287 classes.

288
This population structure implies that we have two vectors of host classes (y 1 and y 2 ). If we denote the combined vector y • = (y 1 , y 2 ), the master equation can be written similarly to 1 by updating the matrices in the following way: where 1 denotes the 2 n -dimensional column vector with unit elements, and Φ is 289 obtained by repeating each 2 n × 2 n block Φ [i] of the original 2 n × 2 2n matrix Model simulations The model was implemented and simulated in R. The script is 292 already available upon request and will be published on a repository along with the raw 293

15/27
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.

294
The equilibrium prevalences from the deterministic model were used to generate 295 datasets in finite populations of 1000, 5000 and 10,000 hosts assuming a multinomial 296 distribution where the probability to draw a host with a given genotype combination 297 was equal to this combination's prevalence.

298
HPV interactions For simplicity, within-host dynamics were neglected here and the 299 effect of genotype diversity on the infection parameters was modelled in the following 300 way. First, we assumed that genotype transmission was unaffected by the presence of 301 other genotypes in the host. This was motivated by the very high transmission 302 probability of HPV per contact [38]. Second, we assumed that interactions between 303 HPV types take place through the recovery rates.

304
Even with 5 genotypes, this could mean 20 interaction parameters (e.g. how the 305 presence of genotype A affect the clearance rate of genotype B). To reduce this 306 complexity, we assumed that genotypes could be sorted into two groups. Biologically, 307 these groups can correspond to high-risk (i.e. carcinogenic) and low-risk HPV types, or 308 to any other binary classification. Whenever a genotype from the second group coinfects 309 a host with a genotype from the other group, its individual recovery rate is multiplied 310 by a factor 1 + k, with k ∈ [−0.5, 0.5]). Genotypes from the first group are assumed to 311 be unaffected by the presence of other genotypes (otherwise we would need an 312 additional parameter and assumptions as to the interaction between the two 313 parameters). Depending on whether k is greater or lower than 1, we expect host classes 314 containing genotypes from the second group to be under-or over-represented 315 respectively. We assumed that one of the groups contains 3 genotypes and the other 2 316 but a different partitioning would lead to similar results and should eventually be 317 decided based on the data.

318
Inference from distributions 319 In order to compare our framework to existing methods, we use 4 techniques used by 320 Vaumourin et al. [35], who implemented them in R. These are briefly described here but 321 readers interested in more detailed should refer to the original publication. 322

16/27
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which was not . http://dx.doi.org/10.1101/256586 doi: bioRxiv preprint first posted online Jan. 31, 2018; count of each combination based on the genotype prevalences [35]. From these 324 simulations, a 95% confidence envelope is calculated for each combination, thus allowing 325 to detect deviation from the expected distribution in the dataset. Connectance was computed using the igraph R package. These scripts are available 338 upon request and will be published on a repository.

340
The methods used here follow that developed in phylodynamics [40] and apply them to 341 different summary statistics. In short, Approximate Bayesian Computation (ABC) is a 342 likelihood-free method to infer parameter values from a given dataset [46]. It consists in 343 simulating many datasets, for which by definition the underlying parameters are known, 344 and comparing them to the target dataset the parameters of which we want to estimate. 345 This comparison is often done by breaking the datasets into summary statistics. We use 346 regression-ABC [39], which is divided into two steps. First, a rejection step, where only 347 the simulated runs that are close enough from the target are kept. Second, a regression 348 model is learnt on the remaining runs. Once we know how to map summary statistics to 349 the parameter space, we can infer the parameters from any target dataset from which 350

17/27
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which was not . http://dx.doi.org/10.1101/256586 doi: bioRxiv preprint first posted online Jan. 31, 2018; Here, using model (1) and following [25], we calculated the equilibrium prevalences 352 of each of the 64 host classes (32 classes for each host type) for 50,001 parameter   We report three sets of summary statistics:

359
• the ranks set: the rank prevalences and the total prevalence of each genotype, plus all the differences between the combination prevalence and the corresponding 365 rank prevalence (64 summary statistics), that is 148 summary statistics.

366
The first set is intended to be compared to classical methods that ignore combinations 367 of genotypes, the second is based on the type of data that could be easily accessed and 368 the third is for a very optimistic scenario in which we would know which group every 369 host belongs to. 370 We compared several levels of tolerance using a preliminary run of the model (with 371 narrower priors) and identified 50% as an optimal cut-off for the rejection: lowering the 372 tolerance did not improve the inference (measured via the fraction of runs where the 373 target value ended up in the 95% HPD), whereas increasing it decreased the inference 374 quality.
was removed and used as a target, whereas the remaining runs were used to learn the 381 regression model (after performing a rejection step). We repeated the operation 100 382 times to generate 100 target datasets.  Figure S2).

26/27
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.

27/27
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.