A Direct Comparison of Two Densely Sampled HIV Epidemics: The UK and Switzerland

Phylogenetic clustering approaches can elucidate HIV transmission dynamics. Comparisons across countries are essential for evaluating public health policies. Here, we used a standardised approach to compare the UK HIV Drug Resistance Database and the Swiss HIV Cohort Study while maintaining data-protection requirements. Clusters were identified in subtype A1, B and C pol phylogenies. We generated degree distributions for each risk group and compared distributions between countries using Kolmogorov-Smirnov (KS) tests, Degree Distribution Quantification and Comparison (DDQC) and bootstrapping. We used logistic regression to predict cluster membership based on country, sampling date, risk group, ethnicity and sex. We analysed >8,000 Swiss and >30,000 UK subtype B sequences. At 4.5% genetic distance, the UK was more clustered and MSM and heterosexual degree distributions differed significantly by the KS test. The KS test is sensitive to variation in network scale, and jackknifing the UK MSM dataset to the size of the Swiss dataset removed the difference. Only heterosexuals varied based on the DDQC, due to UK male heterosexuals who clustered exclusively with MSM. Their removal eliminated this difference. In conclusion, the UK and Swiss HIV epidemics have similar underlying dynamics and observed differences in clustering are mainly due to different population sizes.

Within MSM clusters, 25% of transmissions occur within 6 months of infection 2 . Heterosexuals in the UK display far less clustering and slower epidemic dynamics: only 2% of transmissions occur within 6 months of infection 1 . Currently, 100,000 people are living with HIV (0.15% prevalence), one quarter of whom are unaware of their infection 12 . Highly Active Antiretroviral Therapy (HAART) became available to residents in 1996 and to all in 2012. HIV positive people on successful treatment in the UK have normal life expectancy 13 .
Switzerland had the highest HIV prevalence in Europe in the 1980s 14 . HIV initially spread among MSM and PWID 15 with heterosexual transmission starting to play a role after the mid-1980s. The number of new diagnoses declined in the 1990s, owing to needle exchange, heightened awareness, wide-scale testing, and the introduction of HAART in 1996. The number of new HIV diagnoses in Switzerland has fluctuated since 2000 with no clear time trend. The MSM and PWID epidemics display limited overlap. The heterosexual subtype B epidemic appears to be reseeded by PWID, migration, and (to a limited extent) MSM, with the importance of PWID in driving new infections decreasing over time 3,5,16 (fewer than 3 PWID infections per year in the last 5 years 17 ). Meanwhile, only 25% of non-B infections arise within Swiss-specific clusters, indicating a growing role for immigration 5 .
One important focus of phylogenetic analyses is on clusters: groups of sequences more related to each other than to the rest of the tree. Clustered sequences represent epidemiologically linked infections with short durations of time between transmissions and thus clusters represent the leading edge of the epidemic. High clustering of sequences within a country indicates rapid transmission and a bias towards within-country transmission.
Despite the two countries sharing similar epidemic histories, separate analyses have suggested different structures of the two countries' epidemics, including distinct proportions of clustered sequences. In the UK, 24%, 40% and 22% of patients infected with HIV-1 subtypes B 17 , A1 and C 12 , respectively, cluster, whereas the numbers for Switzerland are 55%, 21% and 16% 3,5 . Clearly these differences arise in part due to distinct cluster definitions: the SHCS defines a cluster as ≥ 10 sequences supported by > 80% bootstrap 3 , while the UK studies defined clusters by a genetic distance (GD) ≤ 4.5% and bootstrap ≥ 90% 2 . Variable cluster definitions are a common problem in the literature. Bootstraps from 70% to 99% are used, in combination with GD from 1.5% to 4.5%. Another reason for the disparity could be differences in sampling procedures. However it is also possible that the contact and transmission processes between Switzerland and the UK differ. Given these observed differences, it is unclear to what extent findings from one country can be applied to the other, and more importantly whether results from either can be extended to other less densely sampled European epidemics.
Because access to data from national cohorts is subject to restrictions, and because analyses have been conducted according to in-house bioinformatics pipelines, the differences between the two epidemics have never been elucidated. Here, we present an analysis conducted in parallel on the two epidemics using a standardised approach. We hypothesised that because of its geography, the Swiss epidemic might be more integrated into the European epidemic and less clustered because of unsampled links in Swiss transmission chains. Using the same cluster definition, we compared the cluster distributions (the number and sizes of clusters) of the two countries as an indicator of underlying epidemic dynamics. We determined whether clustering was affected by risk group and ethnicity and compared the degree distributions (the number of linked partners) of heterosexual, MSM and PWID to test whether differences between countries were down to any specific risk group. Finally, we tested whether the UK and Swiss epidemics intermingled with the same foreign countries through sequence analysis.

Results
Baseline demographics. HIV (Table 1). These differences were in part a result of the different subtype composition across the two countries, but even within subtype B there were notable differences, with proportionally more cases among heterosexuals and PWID in Switzerland.
In both countries individuals for whom sequence and epidemiological data were available broadly matched the characteristics of the HIV diagnosed population as a whole, in terms of risk group, sex, ethnicity and age distribution 19,20 . Swiss sequence dates go back to 1995 owing to retrospective sequencing of samples from the SHCS Bio-bank 21 . Fig. 1). Concordantly, the UK was more clustered in the univariate analysis (Table 2) with odds for being in a cluster 4 times higher at 4.5% GD. At 4.5% GD, subtypes C and A1 were also more clustered in the UK, but there was no significant difference at 1.5% GD. The subsequent analysis focused mainly on subtype B. Because of the difference in sampling time distributions and demographics between the two subtype B datasets, we considered the logistic regression adjusting for those variables. More recent samples were much more likely to cluster than older samples (Table 3) and when the model was adjusted for sample date, the Swiss epidemic was more clustered than the UK epidemic at 1.5% GD. At 4.5% GD, clustering remained higher for the UK but the strength of the association was halved compared with the univariate model. The effect of risk group and ethnicity on clustering was consistent across the two countries, with MSM in both countries showing the highest propensity for clustering.
Degree Distributions. Degree distributions for each country and risk group were generated based on cluster size distributions and compositions. For example a cluster containing 3 heterosexuals is equivalent to 3 heterosexuals each with degree 2. Statistical frameworks exist to formally compare degree distributions [23][24][25] and include bootstrapping to simulate the effect of the sampling process and test the robustness of conclusions.
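The conversion from clusters to degree distributions described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the input format (one list of risk-group labels per cluster) and the function name `degree_distributions` are assumptions, and the example clusters are invented. Every member of a cluster of size n is given degree n - 1, whatever the risk group of its partners.

```python
from collections import Counter

def degree_distributions(clusters):
    """Return {risk_group: Counter({degree: number_of_nodes})}.

    `clusters` is a list of clusters; each cluster is a list holding the
    risk-group label of every member sequence (hypothetical input format).
    """
    by_group = {}
    for members in clusters:
        degree = len(members) - 1  # each member is linked to all the others
        for group in members:
            by_group.setdefault(group, Counter())[degree] += 1
    return by_group

# Invented example: one all-HET cluster, one MSM pair, one mixed cluster.
example_clusters = [
    ["HET", "HET", "HET"],
    ["MSM", "MSM"],
    ["MSM", "HET", "MSM", "MSM"],
]
dists = degree_distributions(example_clusters)
```

In this toy input the three heterosexuals in the first cluster each receive degree 2, matching the worked example in the text, while the heterosexual in the mixed cluster receives degree 3.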
Degree distributions for the populations as a whole and for each risk group were compared using the Kolmogorov-Smirnov (KS) tests 24 and the Degree Distribution Quantification and Comparison (DDQC) algorithm 23 . Based on the KS test, there was no difference between the two countries at 1.5% GD. At 4.5% GD, distributions differed significantly for HET, MSM and the population as a whole, but not for PWID (Supplementary Table 2). The difference appeared to be driven by the longer tail of the UK distributions ( Fig. 2), indicating the existence of larger clusters in the UK. The UK epidemic thus comprised not only a higher proportion of sequences in clusters, but also clusters were larger.
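To illustrate why a KS comparison depends on dataset size, the sketch below computes the two-sample KS statistic (the largest gap between the two empirical cumulative distributions) and a textbook asymptotic p-value. It is an assumption-laden illustration rather than the analysis used here: the function names are hypothetical, the asymptotic formula is only approximate for discrete data such as node degrees, and in practice a library routine such as `scipy.stats.ks_2samp` would normally be preferred.

```python
import math
from bisect import bisect_right

def ks_statistic(a, b):
    """Largest absolute gap between the empirical CDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    gap = 0.0
    for v in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, v) / len(a)
        cdf_b = bisect_right(b, v) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

def ks_pvalue_asymptotic(d, n, m):
    """Approximate p-value from the asymptotic Kolmogorov distribution."""
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 1e-9:
        return 1.0
    total = sum((-1) ** (j - 1) * math.exp(-2 * j * j * lam * lam)
                for j in range(1, 101))
    return max(0.0, min(1.0, 2 * total))

# Invented degree samples.
d_stat = ks_statistic([1, 1, 2, 2], [1, 2, 3, 4])

# The same gap between distributions is far "more significant" with more nodes.
p_small = ks_pvalue_asymptotic(0.1, 100, 100)
p_large = ks_pvalue_asymptotic(0.1, 1000, 1000)
```

The statistic itself does not change with sample size, but the p-value attached to a given gap shrinks as the number of nodes grows, which is the sensitivity to scale that motivates the size-matched comparison.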
Because the KS test is sensitive to network size, we applied the DDQC, which is robust to differences in scale. The DDQC measures the distance between networks based on features extracted from their degree distributions. However, it does not indicate whether the distances calculated are significant. In order to generate null distributions, the UK and Swiss degree distributions were compared to themselves through bootstrapping. The UK and Swiss degree distributions were each bootstrapped 100 times to simulate the effect of sampling (see Methods) and the DDQC distance calculated between the true data and each bootstrap replicate (Fig. 3). Between-country DDQC values were considered significant if they exceeded the 95th percentile of the within-country DDQC values.
We hypothesised that the difference highlighted by the KS test at 4.5% might be the result of a difference in scale between the two epidemics. The UK population (and the HIV+ population) is much larger than that of Switzerland and so the pool of partners available is bigger. To examine the effect of epidemic size on clustering and degree distributions, we down-sampled the UK subtype B datasets to match the size of the Swiss datasets. In parallel, the Swiss datasets were bootstrap sampled with replacement. When these equal-sized resampled datasets were compared, the UK and Swiss degree distributions overlapped for the population as a whole and for the MSM population, but not for heterosexuals (Fig. 4).
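The significance rule used for the DDQC comparison can be sketched as a simple percentile check. This is a hedged illustration: the function names, the nearest-rank percentile definition and all distance values below are assumptions chosen for the example, not results from the study.

```python
import math

def percentile(values, q):
    """Nearest-rank percentile of `values` for an integer q in 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

def exceeds_null(between_distance, within_distances, q=95):
    """True if the between-country distance lies above the q-th percentile
    of the within-country (bootstrap) distances."""
    return between_distance > percentile(within_distances, q)

# Invented null distribution: 100 within-country distances 0.01 .. 1.00.
within = [i / 100 for i in range(1, 101)]
threshold = percentile(within, 95)
flag_high = exceeds_null(0.96, within)
flag_low = exceeds_null(0.50, within)
```

With 100 bootstrap replicates, as used here, the nearest-rank 95th percentile is simply the 95th smallest within-country distance.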
In the true and the jack-knife sampled UK heterosexual population, we observed male heterosexuals with high degree (> 20) not present in the Swiss data (Fig. 4). The largest exclusively heterosexual cluster in the UK comprised 27 individuals (bootstrap = 0.9, GD = 4.5%); all heterosexuals with higher degree were in clusters dominated by MSM. When we dropped heterosexuals with degree > 26 from the UK sample (125 individuals out of a total of 1556 heterosexuals), the DDQC distance between the two networks fell within its null distribution (DDQC = 0.17, Fig. 3).

Cross border transmission.
We investigated intermingling between national and foreign sequences by removing the 80% national criterion. We used a tight GD threshold (1.5%, 70% bootstrap) to capture close transmission partners. At this threshold, Swiss sequences clustered with 162 non-Swiss sequences and UK sequences clustered with 353 non-UK sequences. For Switzerland, Western European countries provided over 75% of the links. For the UK, 50% of close links were with other European countries and 20% originated from other Anglophone countries: Australia, Canada and the USA (Supplementary Figure 1).

Discussion
The aim of this study was to compare epidemic dynamics between the two most densely sampled HIV epidemics, the UK and Switzerland, while adhering to data governance procedures and privacy protection requirements. We found that the fraction of sequences in transmission clusters was similar between the UK and Switzerland for a strict GD threshold (1.5%) but that they differed at a more relaxed GD threshold (4.5%). This suggests that the two epidemics resemble each other at a micro-level but differ at a macro-level. Because a statistical framework for comparing cluster distributions directly is lacking, we generated degree distributions based on cluster sizes and compared them through formal statistical tests: the KS test, the DDQC and bootstrapping. Based on the KS test, there were differences between the UK and Swiss subtype B degree distributions at 4.5% GD. However, downsampling the UK dataset to the size of the Swiss dataset rendered this difference insignificant in MSM and the population as a whole, but not heterosexuals (Fig. 4). In parallel, only heterosexuals showed a significant difference based on the DDQC test, which corrects for network size.
The degree distribution of UK heterosexuals had a long tail representing male heterosexuals clustered exclusively with MSM. Previous UK analyses have demonstrated that a proportion of self-reported male heterosexuals are likely to have been infected through sex with men 26 , which appears to be the case here. When those high degree heterosexuals were removed from the dataset, the UK and Swiss heterosexual degree distributions no longer differed. Male heterosexuals who have sex with men are likely to also have sex with women and provide a bridge between MSM and heterosexual epidemics. This is a likely route for the spread of non-B subtypes among MSM in the UK 4 . More detailed analyses of the Swiss epidemic have found little overlap between the MSM and HET epidemics 3 ; however, the Swiss heterosexual with the highest degree was similarly part of a HET/MSM cluster comprising 36 individuals, while the largest exclusively heterosexual cluster contained only 9. In fact, 47% of UK and 38% of Swiss heterosexuals were in HET/MSM clusters. The present analysis cannot determine whether bridging is more common in the UK or whether risk group classification is assessed more thoroughly in Switzerland.
There was more overlap between PWID and heterosexuals in Switzerland than in the UK (23% vs 12%), but the difference in PWID degree distributions was not significant. Although this could be due to sample size, the seeding of the heterosexual epidemic by PWID in Switzerland is likely to be an old process 3 while the bridging between HET and MSM in the UK is ongoing 4,26 .
Our findings were consistent across bootstrap thresholds. At 1.5% GD, we found no difference between the degree distributions of the Swiss and UK epidemics. At tight thresholds mostly pairs and recently infected patients are captured and these groupings are similar across the two countries. At 4.5% GD, the UK was more clustered and so the UK HIV RDB is more likely to capture larger transmission chains. However, the downsampled UK epidemic degree distribution overlapped with the Swiss degree distribution. While the proportion of clustered individuals in the UK is higher, the difference is seemingly due to the UK's greater epidemic size rather than to differences in contact or transmission processes. Both countries are similarly integrated into global unsampled epidemics, and this study underlines the importance of HIV public health interventions at the European and global levels.
Figure 3 (caption fragment). The tops of the coloured bars represent the mean distance of within-country comparisons and the whiskers represent the 95th percentiles. The DDQC distance was then calculated between the UK and Swiss degree distributions (black triangles). The distance between countries was considered significant if it exceeded the 95th percentile from the simulated values, which was the case only for HET at 4.5% GD (indicated by *). When we removed heterosexuals who were likely to have been infected through sex with men from the UK dataset, the DDQC distance between the UK and Swiss HET degree distributions fell within the simulated null distribution (orange triangle).
Transmission between European countries has been analysed in more depth elsewhere 27 . In agreement with that analysis, we found Spain to be a major mixing partner for both Switzerland and the UK. Germany and the Czech Republic were also identified as significant. We found increased linkage between the UK and other Anglophone countries. In Switzerland strong segregation has been observed between German and French-speaking regions 3 and this language-dependency of HIV transmission warrants investigation at the global scale.
Both countries have noted the subtype diversification of their respective epidemics 9,28 , yet the difference in size between the UK and Swiss subtype A1 and C datasets (18,000 vs 900, respectively) rendered a comparison meaningless. In Switzerland fewer than 25% of non-B infections were acquired in the country 5 , whereas in the UK over 50% of infections in individuals born abroad are thought to have occurred in the UK 11 . Local non-B non-heterosexual transmission appears far more extensive in the UK 4 .
Although degree distributions are a blunt tool for elucidating the dynamics of an epidemic 29 , they allowed us to apply statistically robust methods to compare the two epidemics without the need for exchanging sensitive data. Sequences from national databases were never exchanged and while this precluded a combined phylogenetic analysis of UK and Swiss sequences, one of the strengths of the study stems from undertaking such an analysis without compromising patient privacy. A second issue, the distributions of sample dates differing between the two cohorts, arose because the SHCS has conducted extensive retrospective sequencing on patients diagnosed early on in the epidemic and for whom samples had been stored. The SHCS coverage of older samples explains in part the lower clustering observed in Switzerland. However, clustering remained significantly higher in the UK at 4.5% GD after sample date was adjusted for. Thirdly, international comparisons were based on the LANL database which is in essence a large-scale convenience sample and not necessarily representative. We suggest the apparently important contribution of the Czech Republic to both epidemics may arise from recent submission of large numbers of sequences from that country.
In conclusion, we showed that apparent major differences in clustering patterns between the UK and Switzerland subtype B epidemics can be explained for the most part by differences in size and sampling time. This is the first study leveraging the vast amounts of data available in multiple national HIV databases. We made use of data without breaching data governance procedures and highlighted that transmission trends in these two countries are driven by similar underlying factors. From a methodological perspective, our study highlights the importance of using the same cluster-detection algorithm and correcting for demographic factors when comparing clustering patterns across settings.

Methods
Data. Switzerland. 9,232 HIV pol sequences were retrieved from the SHCS DRDB. The SHCS DRDB aggregates all HIV resistance tests for patients of the SHCS. SmartGene is responsible for data storage and management (http://www.smartgene.com). The DRDB is part of the SHCS, which is an ongoing national clinical cohort of HIV patients aged 16 and above with biannual follow up (http://www.shcs.ch) 20 . Sequences were assigned subtypes using REGA 30,31 ; subtypes B (91%), A1 (5%), and C (5%) were analysed. The SHCS has been approved by the ethics committees of all participating institutions, and written, informed consent has been obtained from participants.
As submission of the UK and Swiss sequence datasets to public databases would permit transmission network identification and thus risk breaching patient confidentiality, we have followed earlier practice 3,33 . A random sample of 10% of each subtype and country has been submitted to GenBank (accession numbers available in supplementary material).
Background sequences. All pol (HXB2 positions 2253-3870) sequences of HIV subtype A1, B, and C longer than 900 bases were retrieved from LANL (January 2014). To limit the size of alignments, the ten closest sequences to each of the local (UK and Swiss) sequences were selected using Viroblast 34 . For this step, UK sequences were removed from LANL alignments before the UK Viroblast run, and Swiss LANL sequences were removed before the Swiss run.
Only the earliest available sequence for each individual was used. All sequences were stripped of 44 sites associated with drug resistance based on the 2013 International AIDS Society list 35 .
Tree Building and Cluster Picking. Duplicate sequences were removed. Maximum likelihood phylogenetic trees were constructed for each country and subtype separately (six trees in total) using FastTree v2.0 36 with 100 bootstraps. Initially clusters were selected for further analysis if they were supported by bootstrap thresholds of 70%, 80%, 90% and 95% and maximum GD of 1.5% or 4.5% (8 thresholds total) 22 . Of the initially identified clusters, those in the Swiss trees were further selected to contain at least 80% SHCS sequences, and clusters in the UK trees at least 80% UK sequences. In a separate analysis, we examined all clusters with at least one UK or Swiss sequence (within the respective datasets) to investigate mixing between national and foreign sequences. The automated pipeline included analysis with the Cluster Picker and Cluster Matcher 22 as well as processing through Python and R scripts (available upon request). The Cluster Picker was upgraded to recognise IUPAC nucleotide ambiguity codes as matches (version available upon request from the authors), increasing clustering by around 15% in both datasets.
From the Cluster Picker and Cluster Matcher output files, we generated degree distributions (the number of links for each node). As files contained risk group composition for each cluster, it was possible to break down degree distribution by risk group. Nodes were sampled with replacement from the network with information on their cluster membership. Nodes sampled with the same cluster membership were linked together in each bootstrapped network, so that clusters sometimes increased in size, sometimes decreased in size or otherwise disappeared, and degree distribution was re-estimated each time. Jack-knife resampling where the number of nodes sampled was smaller than the full network size was also performed.
Statistical analysis. The number of sequences clustering at different thresholds between the two epidemics was compared using Fisher's exact test with Bonferroni correction (24 comparisons across clustering thresholds and subtypes). Degree distributions were compared using the KS test 24 and the DDQC algorithm 23 . The KS test is a nonparametric test which compares the cumulative distribution of two samples to estimate whether they have been drawn from the same distribution and is frequently used to compare degree distributions 25 . The DDQC was developed specifically to compare degree distributions and corrects for differences in population size while the KS test does not. The DDQC extracts a vector of eight values from degree distributions for comparison. In brief, the range of node degrees is divided into eight regions based on the minimum, maximum, mean and standard deviation of the degree distribution. The probability of the degree of any node being contained within each interval is calculated. The distance between two networks is the sum of the absolute differences for each of the eight features extracted.
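The bootstrap resampling and the DDQC distance described above can be sketched together as follows. This is an illustrative reconstruction under stated assumptions, not the scripts used in the study. The text does not give the exact boundaries of the eight regions, so the sketch assumes four regions delimited by the minimum, mean minus one standard deviation, mean, mean plus one standard deviation and the maximum, each split in half. It also assumes that unclustered sequences (cluster id `None`) count as nodes of degree 0. All function names and the toy network are hypothetical.

```python
import random
import statistics
from collections import Counter

def degrees_from_membership(cluster_ids):
    """Degree of each node = number of other nodes sharing its cluster id."""
    sizes = Counter(c for c in cluster_ids if c is not None)
    return [sizes[c] - 1 if c is not None else 0 for c in cluster_ids]

def bootstrap_degrees(cluster_ids, rng):
    """Resample nodes with replacement, relink nodes that share a cluster."""
    drawn = rng.choices(cluster_ids, k=len(cluster_ids))
    return degrees_from_membership(drawn)

def ddqc_features(degrees):
    """Eight interval probabilities (region boundaries are an assumption)."""
    lo, hi = min(degrees), max(degrees)
    mu = statistics.fmean(degrees)
    sd = statistics.pstdev(degrees)
    cuts = [lo, max(lo, mu - sd), mu, min(hi, mu + sd), hi]
    edges = []
    for left, right in zip(cuts, cuts[1:]):
        edges.extend([left, (left + right) / 2])  # split each region in two
    edges.append(hi)
    feats = []
    for i in range(8):
        left, right = edges[i], edges[i + 1]
        if i == 7:  # last interval is closed so the maximum is counted
            count = sum(1 for d in degrees if left <= d <= right)
        else:
            count = sum(1 for d in degrees if left <= d < right)
        feats.append(count / len(degrees))
    return feats

def ddqc_distance(deg_a, deg_b):
    """Sum of absolute differences between the two feature vectors."""
    return sum(abs(x - y)
               for x, y in zip(ddqc_features(deg_a), ddqc_features(deg_b)))

# Toy network: three clusters plus ten unclustered sequences.
rng = random.Random(0)
toy_ids = ["c1"] * 5 + ["c2"] * 3 + ["c3"] * 2 + [None] * 10
observed = degrees_from_membership(toy_ids)
null = [ddqc_distance(observed, bootstrap_degrees(toy_ids, rng))
        for _ in range(100)]
```

Because each network's features are defined relative to its own minimum, maximum, mean and standard deviation, the comparison is made on the shape of the distribution rather than on its absolute scale, which is the property that distinguishes the DDQC from the KS test in this analysis.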
The UK subtype B dataset used here was 3.75 times larger than the Swiss dataset; the UK MSM dataset was 5.7 times larger and the UK heterosexual dataset was 1.3 times larger (Table 1). To investigate the effect of the difference in size of the pool of possible infectors, the UK dataset was jack-knife sampled to the size of the Swiss dataset. One hundred jack-knife replicates were generated, and in each replicate the degree distribution was re-estimated based on the links present in the sample.
A logistic regression model was used to characterise the factors influencing clustering in the two countries. The model was applied with cluster membership as the outcome variable and with the country of origin (UK or Switzerland) as the main exposure variable. Sampling dates, risk group, sex, and ethnicity were adjusted for. Statistical analyses were conducted in R 37 .
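The jack-knife down-sampling step can be sketched as sampling nodes without replacement and recomputing degrees from the links that survive. This is a minimal sketch under assumptions: the function name, the representation of the network as one cluster id per node (`None` for unclustered sequences, counted as degree 0) and the toy sizes are all hypothetical.

```python
import random
from collections import Counter

def jackknife_degrees(cluster_ids, n_keep, rng):
    """Keep n_keep nodes (sampled without replacement) and recompute each
    kept node's degree from the cluster members that remain."""
    kept = rng.sample(cluster_ids, n_keep)
    sizes = Counter(c for c in kept if c is not None)
    return [sizes[c] - 1 if c is not None else 0 for c in kept]

# Toy "large" network: one cluster of 10 plus 5 unclustered sequences.
rng = random.Random(1)
large_ids = ["a"] * 10 + [None] * 5

# One hundred replicates down-sampled to a hypothetical smaller size of 6.
replicates = [jackknife_degrees(large_ids, 6, rng) for _ in range(100)]

# Keeping every node reproduces the original degrees.
full = sorted(jackknife_degrees(large_ids, len(large_ids), rng))
```

Down-sampling can only remove links, so the maximum degree in each replicate is bounded by the sample size, which is why a larger dataset tends to show a longer-tailed degree distribution even when the underlying process is the same.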