How the Dynamics and Structure of Sexual Contact Networks Shape Pathogen Phylogenies

The characteristics of the host contact network over which a pathogen is transmitted affect both epidemic spread and the projected effectiveness of control strategies. Given the importance of understanding these contact networks, it is unfortunate that they are very difficult to measure directly. This challenge has led to an interest in methods to infer information about host contact networks from pathogen phylogenies, because in shaping a pathogen's opportunities for reproduction, contact networks also shape pathogen evolution. Host networks influence pathogen phylogenies both directly, by governing opportunities for evolution, and indirectly, by changing prevalence and incidence. Here, we aim to separate these two effects by comparing pathogen evolution on different host networks that share similar epidemic trajectories. This approach allows us to examine the direct effects of network structure on pathogen phylogenies, largely controlling for confounding differences arising from population dynamics. We find that networks with more heterogeneous degree distributions yield pathogen phylogenies with more variable cluster numbers, smaller mean cluster sizes, shorter mean branch lengths, and somewhat higher tree imbalance than networks with relatively homogeneous degree distributions. However, in particular for dynamic networks, we find that these direct effects are relatively modest. These findings suggest that the role of the epidemic trajectory, the dynamics of the network and the inherent variability of metrics such as cluster size must each be taken into account when trying to use pathogen phylogenies to understand characteristics of the underlying host contact network.


NATSAL-based network formation
We use survey data on the number of sexual partnerships over five years to create dynamic contact networks with assortative mixing by activity level, in which individuals with higher activity levels have shorter partnerships. Edges are created by the following method:
1. Each node (i) is assigned an "aggregate degree" k_i, representing the number of contacts the node will have over the five-year simulation period. The (long-tailed) distribution of k_i values is taken from the 2000 National Survey of Sexual Attitudes and Lifestyles (NATSAL), collected in Britain from adults aged 16-44. These data are publicly available at the UK Data Archive (www.data-archive.ac.uk) [1].
2. Next, half the nodes are designated "relationship initiators" and each of these nodes (j) is given k_j start times for its attempted relationships. Start times are selected uniformly within the simulation run-time. In order of increasing start time, each initiator attempts to make links to non-initiators n who have not yet formed their desired contacts, or to other initiators if creating an edge representing a homosexual partnership. To incorporate assortative mixing, there is a 70% probability that the end-node is required to have a k_i value similar (within 15%) to that of the start-node.
3. To model the notion that highly active individuals probably have shorter partnerships than individuals with few relationships, edge duration is chosen depending on the aggregate degree, or overall activity level, of the individuals that it links. Specifically, each node has a preferred relationship duration d_i, which is exponentially distributed with a mean inversely proportional to the node's degree, i.e. d_i = M/(ν * k_i), where M is the simulation time. The duration d_ij of an edge between nodes i and j is a weighted average of the preferred duration d_i of node i and d_j of node j.
This method leads to the creation of a dynamic contact network whose aggregate contact distribution matches the survey data. We use networks of 50,000 individuals and a simulation period of 260 weeks. Table S1 lists properties of the networks and Figure S1 plots the distribution of aggregate degrees over the simulation period. The networks produced by the NATSAL-based method show a large amount of degree heterogeneity, with contact numbers ranging from 1 to 150 over five years. In order to compare trees produced by sampling populations with different levels of contact heterogeneity, we also produced networks with much more homogeneous degree distributions.
These networks, referred to below as ER (Erdős-Rényi) networks, are formed by considering every valid pair of nodes and (independently of all other node pairs) creating an edge between them with a given probability p, resulting in a Poisson distribution of degree. We chose p such that the same duration of infectiousness and transmissibility resulted in broadly comparable epidemic trajectories on the static ER and NATSAL networks (p = 0.0001, giving a mean degree of approximately 2.25; see Table S1).
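As an illustration of the independent-edge construction just described, the following minimal Python sketch draws each pair of nodes independently with probability p and tabulates the resulting degrees. It is not the simulation code used for the results: the function names er_edges and degrees, the seed, and the small values of n and p are illustrative assumptions, and every pair of nodes is treated as valid because the validity rule is not spelled out here.

```python
import random
from itertools import combinations


def er_edges(n, p, rng):
    """Return the edges of a random graph in which each pair (i, j), i < j,
    is linked independently with probability p.
    Assumption: every pair of nodes counts as a valid pair."""
    return [(i, j) for i, j in combinations(range(n), 2) if rng.random() < p]


def degrees(n, edges):
    """Count how many edges touch each node."""
    deg = [0] * n
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    return deg


rng = random.Random(1)
n, p = 400, 0.01  # illustrative values, far smaller than the 50,000-node networks
edges = er_edges(n, p, rng)
deg = degrees(n, edges)
mean_degree = sum(deg) / n
print(f"{len(edges)} edges, mean degree {mean_degree:.2f}")
```

Looping over all pairs is quadratic in the number of nodes, so a sketch of this kind is only practical for small illustrative networks.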
Edge duration in the homogeneous networks is calculated from the nodes' current degrees at the time the edge is created, as nodes are not given a target degree in this case. The same equation, d_i = M/(ν * k_i), is then used, with k_i representing actual degree instead of target degree and ν being a constant input parameter. However, using the same value for the duration parameter ν in both networks results in the homogeneous network having a significantly higher mean edge duration, higher concurrency and consequently a larger epidemic. We compensated for this by increasing ν to 3.5.
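The duration rule can be sketched in a few lines of Python. This is a hedged illustration under stated assumptions, not the code used for the results: the function names, the seed and the value nu = 1.0 are illustrative, degrees are assumed to be at least 1, and the weight w in the weighted average is a hypothetical parameter (set to 0.5 only as a placeholder) because the weighting is not specified above.

```python
import random


def preferred_duration(k, M, nu, rng):
    """Draw a preferred duration from an exponential distribution
    with mean M / (nu * k). Assumes k >= 1."""
    mean = M / (nu * k)
    return rng.expovariate(1.0 / mean)  # expovariate takes the rate, 1 / mean


def edge_duration(d_i, d_j, w=0.5):
    """Weighted average of the two preferred durations.
    The weight w is hypothetical; 0.5 is only a placeholder."""
    return w * d_i + (1.0 - w) * d_j


rng = random.Random(42)
M = 260.0  # simulation time in weeks
nu = 1.0   # illustrative value of the duration parameter

# Empirical mean preferred duration for a low- and a high-activity node
low = [preferred_duration(2, M, nu, rng) for _ in range(20000)]
high = [preferred_duration(50, M, nu, rng) for _ in range(20000)]
mean_low = sum(low) / len(low)
mean_high = sum(high) / len(high)
print(f"k=2: {mean_low:.1f} weeks, k=50: {mean_high:.1f} weeks")
```

Raising ν shortens every preferred duration by the same factor, which is the lever used above to bring the homogeneous networks closer to the NATSAL-based ones.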
Static networks of each type were created by discarding all information on edge timing, with all edges being active throughout the entire simulation run-time. Figure S1 shows the prevalence of the pathogen for the simulations from which trees were derived. The top row shows the prevalence for the results given in the main text and illustrates the increased variability in population dynamics for static networks. Transmission was curtailed on static networks to prevent a very rapid explosion in pathogen population.

Supplementary Results
We also compared trees resulting from the same networks but with a reduced pathogen duration of infectiousness of 10 weeks. The results were broadly similar to those reported in the main text, except that here we did observe a difference in tree imbalance, with NATSAL-like networks having higher imbalance in the trees than ER networks. However, the branch lengths were not appreciably different (see Figures S3 and S4).
In addition, we explored the effect of 'bottlenecking' of mutations at the time of transmission. This affects the number of mutations in a sequence passed to a new host. Again, our central results are not affected: in NATSAL-derived trees, there are more clusters, smaller branch lengths and slightly higher imbalance (see Figure S3).

Clustering
The numbers and sizes of phylogenetic clusters depend on how clusters are defined and on where the cut-off is located. For this reason, we determined the numbers and sizes of clusters in the trees for cut-offs varying between 0 and 0.1 substitutions per site, and for cut-offs at particular portions of each tree's total genetic distance. Results are shown in Figures S5, S6, S7 and S8.
Figure S4: Cluster count, branch lengths and imbalance for prevalence and incidence matched pair of simulations with duration of infectiousness 10 weeks. Dashed lines indicate the expected value of the imbalance, as in the main text.
Figure S5: Numbers of clusters for varying clustering cut-off. As before, NATSAL networks yield phylogenies with more variable cluster numbers than ER networks.
Figure S6: Numbers of clusters for varying clustering cut-off, when the cut-off is a portion of the trees' total genetic distance from leaves to root.
Figure S7: Sizes of clusters for varying clustering cut-off.
Figure S8: Sizes of clusters for varying clustering cut-off, when the cut-off is a portion of the trees' total genetic distance from leaves to root.
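To illustrate how cluster numbers and sizes change as the cut-off is swept, the following Python sketch counts clusters at several cut-offs from a matrix of pairwise genetic distances. It rests on assumptions that are not taken from the analysis above: clusters are defined here by single linkage (two tips are joined whenever their distance is at or below the cut-off), only groups of two or more tips are counted, and the function name and the toy distance matrix are hypothetical.

```python
def clusters_at_cutoff(dist, cutoff):
    """Return the sorted sizes of single-linkage clusters (size >= 2)
    for a symmetric matrix of pairwise distances.
    Assumption: tips i and j are joined when dist[i][j] <= cutoff."""
    n = len(dist)
    parent = list(range(n))

    def find(x):
        # Follow parent links to the root, compressing the path on the way
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] <= cutoff:
                parent[find(i)] = find(j)

    sizes = {}
    for i in range(n):
        root = find(i)
        sizes[root] = sizes.get(root, 0) + 1
    return sorted(s for s in sizes.values() if s >= 2)


# Toy distances in substitutions per site between four sampled tips
dist = [
    [0.00, 0.01, 0.08, 0.09],
    [0.01, 0.00, 0.07, 0.09],
    [0.08, 0.07, 0.00, 0.02],
    [0.09, 0.09, 0.02, 0.00],
]
sweep = {c: clusters_at_cutoff(dist, c) for c in (0.0, 0.015, 0.05, 0.1)}
for cutoff, sizes in sweep.items():
    print(cutoff, len(sizes), sizes)
```

In this toy example the number of clusters first rises and then falls as the cut-off grows, because small clusters merge into larger ones; this is one reason to report results across a range of cut-offs.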

Phylogenetic error
The mutation rate in the main results was sufficiently high that phylogenetic error should not have played a role in shaping the results we reported. However, particularly for bacterial pathogens in situations of rapid transmission, under some circumstances, mutations may not accrue rapidly enough for genealogies to be reliably inferred from phylogenetic data. We therefore explored lower mutation rates, and Figure S9 shows the results. While we have argued that differences in cluster numbers and sizes are highly variable and therefore are not robust detectors of network differences, they are the most robust to phylogenetic error.
Table S1: Summary of network properties. Results reported for dynamic networks are based on an average of 10 snapshots taken of the networks between 100 and 200 weeks, when relationship dynamics have stabilised. Because they are snapshots and relationships are dynamic, only a few of the existing relationships are present in a given snapshot, so the numbers of edges (at a time) in the dynamic networks are considerably smaller than the numbers of edges in the corresponding static networks. Numbers in parentheses are the standard deviations of reported values over these snapshots.