Clustering of contacts relevant to the spread of infectious disease

Mathematical


Introduction
Mathematical models of infectious disease transmission require assumptions about mixing between different subgroups in a population that can potentially lead to transmission between infected and susceptible individuals.The simplest assumption is that every-one has the same probability of contacting each other, but this can sometimes lead to misleading results (Keeling and Rohani, 2008).Indeed, many infection control interventions such as vaccinating children (Thorrington et al., 2015) or closing schools during a pandemic (House et al., 2011) are predicated on the assumption that certain subgroups in the population are the main transmitters.
A more realistic assumption is to subdivide the population based on some characteristic, and introduce a matrix of contact rates capable of transmitting infection between each subgroup, called the "who acquires infection from whom" (WAIFW) matrix (Vynnycky and White, 2010).Age is the characteristic most commonly used as a source of heterogeneity in mixing patterns.A model with age-stratified contact rates can be fitted to age-specific data on infection history (such as sero-epidemiological data, which marks the prevalence of previous infection) to estimate the age-specific effective contact rates in the WAIFW matrix.
To inform the elements of the WAIFW matrix, the number of social contacts that individuals in different age groups report can http://dx.doi.org/10.1016/j.epidem.2016.08.001 1755-4365/© 2016 The Author(s).Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).be empirically measured and used as proxies to the actual contact rates underlying transmission (Beutels et al., 2006;Edmunds et al., 1997;Wallinga et al., 2006).The largest such study is a diary-based survey of 7290 participants in eight European countries collected in 2006 as part of the POLYMOD project (Mossong et al., 2008).Since then contact studies have been carried out in other parts of the world using similar methodology.
In these studies, participants are asked to record the contacts that they have made over a single day and classify them using a number of characteristics (such as physical/non-physical, long/short, home/school/work etc.) Since it is unrealistic to measure effective contacts (i.e.contacts that can transmit a particular infection) between individuals directly, some form of self-reported social behaviour such as face-to-face conversation or skin-to-skin contact is used as a proxy.It is assumed that the age distribution of these social contacts is related to the age distribution of effective contacts by a constant proportionality factor, an assumption referred to as the "social contact hypothesis" (Wallinga et al., 2006).Hence the age-specific transmission matrix is completely described by estimating this factor.Subsequent analyses (Melegaro et al., 2011) found that stratifying the WAIFW matrix according to different characteristics of contacts (with a different proportionality constant for each type of contact) significantly improved the goodness of fit of the models to serological markers of past infection for respiratory infections.This implies that different characteristics of social contact contribute differently to infection transmission.
A previous study explored the use of formal clustering algorithms to group POLYMOD survey respondents based on the number and location of their contacts (Kretzschmar and Mikolajczyk, 2009).The study found that respondents across different countries fell into a similar range of contact profiles, with the work, school and household contact profiles most common.However, this does not tell us whether there are certain types of contacts (rather than respondents) which may be particularly relevant for the transmission of particular infectious diseases.Furthermore, the relevant clusters of contacts may be disease-specific, since different infections have separate routes of transmission.
Hence an alternative approach would be to explore what kind of contacts (rather than respondents) are most relevant to infection transmission.The only previous work in this area has focused on single dimensions of contacts (Melegaro et al., 2011).This simply indicated that "intimate" (i.e.physical, home, long-duration and frequent) contacts are better able to explain age-dependent patterns in the acquisition of serological markers for varicella-zoster virus (VZV) and parvovirus B19, the two infection examined in the study.Since the dimensions of intimacy are highly correlated, it may be more informative to take all the characteristics of social contacts into account collectively when stratifying the WAIFW matrix.
In particular contacts made by different respondents could be grouped into clusters, and then examined to identify which clusters best explain patterns of infection acquisition in the corresponding populations.Here, we explore the use of clustering algorithms to determine clusters of social contacts which are similar to each other based on multi-dimensional characteristics of social contacts.
A range of clustering algorithms have been developed such as hierarchical clustering, partitioning clustering and latent class clustering (Everitt et al., 2011).In hierarchical clustering, each element is plotted on a graph to determine the optimal number of clusters.When the number of elements becomes very large (such as the thousands of contacts for each country in the POLYMOD survey), this graphical method becomes very cumbersome.In latent class clustering, a multivariate distribution is imposed on the data and the validation of the clustering results depends on several assumptions such as the parametric form of the multivariate distribution, the dependency between variables and the approximation of likelihood estimation.Since social contact data include different types of variables (binary and ordinal), some of which are structured, it is difficult to identify an appropriate multivariate distribution to describe the data.Hence we used partitioning clustering, and in particular the k medoids method to classify contacts.
We then investigate whether these clusters enhance our understanding of transmission patterns by using them to fit a transmission dynamic model to age-dependent patterns in the acquisition of varicella-zoster virus serological markers, as an example of a childhood respiratory infection with clear, long-lived markers of past infection and no vaccination history at the time of data collection.
Models were fitted to data on the presence of antibodies to VZV from serum samples collected in 1996 from 1300 participants aged 0-19 in Poland, 2091 participants aged 0-20 in England and Wales (fitted to Great British contacts), 2760 participants aged 0-39 in Belgium, 2500 participants aged 0-79 in Finland and 2517 participants aged 0-79 in Italy (Vyse et al., 2004).The samples were collected from unlinked anonymised (apart from age) residual sera following microbiological or biochemical investigations (Osborne et al., 2000), and tested as part of the European Commissionfunded second European Sero-epidemiological Network (ESEN2) (Melegaro et al., 2011;Nardone et al., 2007).Children under 5 years old were oversampled with the sample size in each age group ranging from 117 to 192.For those aged 5-20 years approximately 100 sera in each one-year age group were tested.All serological tests for VZV-specific IgG were performed at Preston Public Health Laboratory using a commercial ELISA assay.
Population data were obtained from national statistics offices in the five countries considered as in previous analyses (Melegaro et al., 2011).

Cluster analysis of social contact data
A cluster is defined as a group of social contacts whose members are more similar to each other than to non-members.The similarity of two contacts is defined using the Manhattan distance measure between the two (Everitt et al., 2011), based on four contact characteristics: proximity (physical, non-physical), duration (<5 min, 5-15 min, 15 min to 1 h, 1-4 h, >4 h), frequency (daily, weekly, monthly, a few times a year, first time) and location (home, work, school, leisure, transport, other).Variables representing each characteristic were recoded so that the distance between any two contacts lies between 0 and 1, so the Manhattan distance becomes the Gower's similarity coefficient which is an appropriate proximity measurement for mixed data (Gower, 1971) (see Appendix A1 for details in Supplementary data).Contacts recorded as taking place in multiple locations were assigned to a single location based on the following hierarchy: home > work > school > leisure > other > transport.The hierarchy was based on the putative duration of contacts in the given settings as suggested by previous investigators (Kretzschmar and Mikolajczyk, 2009).Social contacts were classified into clusters using the Partitioning Around Medoids (PAM) algorithm, the most common realisation of k-medoids clustering.This approach pre-defines a fixed number of clusters, selects typical elements as centroids of each cluster and assigns other elements to a cluster according to their distance to the centroid (Kaufman and Rousseeuw, 1990).
The number of clusters was varied between 2 and 15, with the upper limit included to avoid possible problems with data sparsity.The optimal number of clusters for the PAM algorithm was then determined using average silhouette width (ASW), which measures how well each contact is assigned to its cluster (Kaufman and Rousseeuw, 1990).Replication analysis (Breckenridge, 2000;Walesiak et al., 2008) was performed to assess the robustness of the clustering results, by splitting the dataset into several subsets and using the same algorithm separately for each subset to compare classification agreement.

Fitting dynamic transmission models
We constructed a realistic age-structured (Schenzle, 1984) susceptible-infected-recovered (SIR) dynamic model of VZV transmission (with no natural mortality assumed).In this model, the next generation matrix representing the potential number of infection transmission events per person, is calculated as the product of the (adjusted) age-dependent contact matrix, a proportionality factor q representing the proportion of contacts that can potentially lead to a new infection and vector representing the size of the susceptible population in each age group.For each cluster i of contacts (1 ≤ i ≤ N where N is the total number of clusters), we generated N separate contact matrices C i with corresponding proportionality factors q i .Hence the next generation matrix is the linear combination of all contact matrices C 1 q 1 w + . . .+ C N q N w where w is a vertical vector of the population size in each age group.The basic reproduction R 0 was calculated as the principal eigenvalue of the next generation matrix.
A grid search algorithm was used to find the q i 's which maximised the binomial likelihood of the model given seroepidemiological data for VZV infections (Melegaro et al., 2011;Ogunjimi et al., 2009).We defined relevant clusters as those with corresponding best fitting q i 's that were non-zero, and hence which contributed to the next generation matrix.An iterative method was used to obtain the proportion of immunes at each age group as a function of the q i 's.Quality of model fit with different numbers of clusters was measured using the Akaike Information Criterion (AIC), its small-sample-size corrected version (AICc) and the Bayesian Information Criterion (BIC).Goodness of fit was also compared to models with contacts stratified by a single characteristic only.

Uncertainty intervals
Uncertainty intervals for the proportionality factor q and basic reproduction number R 0 were estimated by bootstrap sampling from both social contact data and sero-epidemiological data with the same sample size as the original data (Melegaro et al., 2011).For social contact data, bootstrap samples were generated by re-sampling participants' identity numbers.Social contact matrices were then estimated for each bootstrap sample.For the sero-epidemiological data, bootstrap samples were generated by randomly drawing from a Bernoulli distribution with probability of success equal to the observed seroprevalence for the specific age group.After this, the newly generated social contact matrices and sero-epidemiological data were matched to repeatedly estimate the transmission parameters.
All analyses were performed in R version 2.15.2 (R Core Team, 2012), using the cluster and clusterSim packages.

Sensitivity analysis
To explore whether between-country differences in fitted models were due to heterogeneity in the age ranges over which seroprevalence data was available, we repeated our analysis using only seroprevalence data in the range 0-20 years (which is the maximum range for which data were available in all five countries).

Optimal number of clusters
Evaluating the optimal number of clusters involves three components: the quality of the clustering (measured using ASW), how well the resulting clusters fit the transmission model of VZV seroprevalence (measured using AIC, AICc and BIC) and the number of clusters that are actually relevant to this model fit (Fig. 1).
In all countries, ASW decreased between 2 and 5 clusters (indicating worse clustering quality), probably because most contact characteristics have either 2 (physical/non-physical) or 5 (other characteristics) levels of encoding.After 5 clusters, the ASW gradually increased, with the increases occurring at the points where the number of relevant clusters increased.The three measures of goodness of fit to VZV seroprevalence data (AIC, AICc and BIC) were highly consistent with each other.
We determined the optimal number of clusters by maximizing both quality of the clustering as well as goodness of fit to the point where further increases in the number of clusters would not make noticeable changes in the quality of clustering, model fits or the number of relevant clusters substantially.The optimal number of clusters is 12 in Great Britain, 14 in Italy, 15 in Belgium, 8 in Finland and 7 in Poland.The related corrected Rand index for the clustering in each country ranges from 89% to 95%, which suggests that the clustering is robust (Breckenridge, 2000).The Rand index represents the average agreement between two data clustering in multiple replications (with ten replications used in this analysis) if contact data are randomly divided into two samples and each sample is clustered into same number of clusters separately.

Comparisons between different stratifications
When contact data are stratified by clusters, only 2-3 clusters in each country are found to be relevant to VZV transmission (i.e. have non-zero q i 's when the model is fitted to VZV serology).Previous work stratifying contacts by single characteristic suggested that more intimate contacts (physical, home/school, frequent, long duration) best explain VZV transmission (Melegaro et al., 2011).When clustering contacts using multiple characteristics, the most relevant clusters are still likely to contain more intimate contacts.However, the picture is more nuanced and varies across the five countries.
Details of the most relevant clusters are given in Table 2, Fig. 2, Fig. 3 and Appendix A.2 (Supplementary data).Briefly, in Finland and Italy, the countries with the most strongly age-assortative overall contact matrices, three contact clusters are relevant.Overall assortativity is driven by two clusters dominated by daily school contacts: one physical and one non-physical.A third cluster, dominated by non-physical leisure contacts in Finland and physical other contacts in Italy, is also assortative but less so than the school contacts.In both countries there are two clusters with home contacts which are not found to be relevant.
In Great Britain, Belgium and Poland, only two clusters are relevant: one consisting mainly of highly assortative daily physical  school contacts, and the second of less assortative physical home contacts.The home clusters show a strong secondary diagonal probably representing inter-generational contacts, particularly in Poland.In all three countries, there are non-relevant home and school clusters more dominated by longer (>4 h) contacts than the home and school clusters actually found relevant.
Table 1 compares the best fitting model with the optimal number of clusters in each country, with models of contacts stratified by single characteristics as used in previous analyses (Melegaro et al., 2011).In every country, a model based on clustering contacts using multiple characteristics has a better quality fit (measuring using AIC) than models with contacts stratified by a single characteristics.However, the improved fit is at the expense of poorer model estimations (larger uncertainty intervals) around the estimated R 0 .
The models based on clusters with multiple characteristics estimate that R 0 ranges from 3.79 (95% range 2.79-9.44) in Great Britain to 7.61 (95% range 5.91-32.33) in Poland.The point estimates are similar to estimates using single characteristics and consistent with a range of 3-8 in the literature (Goeyvaerts et al., 2010;Iozzi et al., 2010;Melegaro et al., 2011;Ogunjimi et al., 2009).Fig. 4 shows model predictions of VZV seropositivity using the optimal number of clusters compared them with the true propor-tion of seropositive samples.On the whole model predictions fit the sero-epidemiological data well.

Sensitivity analysis
When only seroprevalence data for 0-20 year olds is used in Belgium, Finland and Italy to ensure comparability with Great Britain and Poland, the overall results are not greatly altered.The same three highly assortative clusters (dominated by two clusters of school contacts) are relevant in Italy.In Finland, two assortative clusters (dominated by school and leisure contacts respectively) are relevant, so there is little overall change in age-assortativity.
In Belgium, there is a slight shift as the highly assortative cluster dominated by school contacts is dropped but a less assortative cluster dominated by home contacts remains relevant, so overall the contact matrix is somewhat less assortative.Full details are in Appendix A3 (Supplementary data).

Discussion
Understanding social contact rates is essential to most realistic models of infectious disease transmission.Contact studies such as POLYMOD and its successors have advanced the field by allowing such rates to be measured empirically rather than assumed.However, infection transmission events cannot usually be directly observed so assumptions have to be made about the contribution of different types of reported contacts used as proxies to contacts leading to transmission.We have conducted the first study to investigate the contacts that best explain past infection status data, taking into account more than one characteristic at a time using clustering algorithms.
When these reported contacts are clustered using clustering algorithms, only a few of these clusters are found to be relevant to VZV transmission.VZV transmission in all five countries is dominated by a group of highly assortative contacts and a group of less assortative contacts.However, the balance between the two groups changes across countries: the former group dominates in Finland and Italy, while the latter group dominates in Poland.Great Britain and Belgium lie in between.These country characteristics are robust to constraining the VZV data used to fit models to data from 0 to 20 year olds so they do not appear to be driven by country differences in the availability of VZV seroprevalence data.
Interestingly, although school contacts appear to play a key role across all five countries in providing the assortativeness in contacts, non-assortativeness in contacts derive from home, leisure or other contacts.This matches information from VZV surveillance in England that suggests school and preschool children play a key role in transmission (Brisson et al., 2001).This explanation has face validity since VZV is transmitted between, as well as within, households (Organisation for Economic Cooperation and Development, 2008).Of note, the proportion of children 4 years and under enrolled in education in 2006 (when the POLYMOD contact study was conducted) is highest in Belgium, followed by Italy, United Kingdom, Finland and finally Poland.This exactly matches the order in which assortative contacts dominate the clusters relevant to VZV transmission in the five countries, with the exception of Finland.In Finland, although formal education starts late (at age 7), there is high enrolment in publicly subsidised day-care which is not captured in standardised inter-country statistics.
The quality of fit for models based on clusters stratified by multiple variables was only slightly better than for previous models based on contacts stratified by a single variable alone.The similarity between fit quality with single and multiple stratifications of contacts suggests that a single characteristic largely succeeds in capturing the dichotomy between relevant and non-relevant contacts.This is probably because contacts characteristics are highly associated with each other (e.g.contacts that are physical, long duration, home and frequent at the same time are disproportionately represented among all contacts).Also, by considering several contact characteristics simultaneously, our clustering method gives insights into the infection transmission process that may not be seen when stratifying by a single characteristic.For instance, we can determine the relevance of short duration physical contacts for which single characteristic stratification would produce conflicting conclusions depending on whether the stratification was by duration or proximity.
Recent analyses have shown that using an age-specific proportionality factor enables models to better fit VZV serology in many European countries (Goeyvaerts et al., 2010;Santermans et al., 2015).In our analysis, we vary proportionality factors according to other contact characteristics, but assume that they are ageindependent within each cluster of contacts.This was done for simplicity and to allow direct comparison with previous work (Melegaro et al., 2011) but may have resulted in a poorer fit to data.

Conclusions
Using the k medoids clustering algorithms to identify clusters of contacts relevant to VZV transmission gives similar model fit quality and R 0 values as contacts stratified by a single characteristic.However, considering several characteristics simultaneously provides key insights into the type of contacts that are most relevant to infection transmission, and those that seem not to play an important role.Firstly, it confirms that single-characteristic stratifications are able to provide optimal fits to data, probably because they are able to capture the most intimate contacts responsible for most of transmission.Secondly, we find that school-age children play a key role in almost all contacts relevant to VZV transmission across five countries, and contacts at school in particular are most important in countries with highly assortative contact matrices.This kind of analysis may therefore provide an alternative or supplementary approach to previous single characteristic stratification models, as well as validation for previous methods of contact stratification.Extending such analyses to other countries and infections may be useful for future work.

Fig. 1 .
Fig. 1.Quality of clustering (measured using ASW), quality of fit (measured using AIC, AICc and BIC) and number of relevant clusters for models in each country with different number of clusters.Vertical dashed line shows the model with the optimal number of clusters.

Fig. 2 .
Fig. 2.The distributions of the characteristics of contact for each cluster in Great Britain.Each cluster corresponds to one numbers below the x-axis.The red star mark indicates the relevant cluster whose estimated q is not equal to zero.Results for other countries are given in Appendix A2 (Supplementary data).(For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 3 .
Fig. 3.The corrected age-specific contact matrices for the two identified relevant clusters (the cluster number 1 and 5) with sampling weighting and reciprocal correction.

Fig. 4 .
Fig. 4. Proportions of sample positive for VZV antibodies by age (blue circle with size proportional to sample size), and model predictions of prevalence (red dash line) by age.The grey lines give the 95% bootstrap interval.(For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Table 2
Characteristics of the most relevant clusters in each country.