Insights into the transmission of respiratory infectious diseases through empirical human contact networks

In this study, we present representative human contact networks among Chinese college students. Unlike schools in the US, human contacts within Chinese colleges are extremely clustered, partly due to the highly organized lifestyle of Chinese college students. Simulations of influenza spreading across real contact networks are in good accordance with real influenza records; however, epidemic simulations across idealized scale-free or small-world networks show considerable overestimation of disease prevalence, thus challenging the widely applied idealized human contact models in epidemiology. Furthermore, the special contact pattern within Chinese colleges results in disease spreading patterns distinct from those of US schools. Remarkably, class cancelation, though simple, shows a mitigating power equal to quarantine/vaccination applied to ~25% of college students, which quantitatively explains its success in Chinese colleges during the SARS period. Our findings greatly facilitate reliable prediction of epidemic prevalence, and thus should help establish effective strategies for controlling respiratory infectious diseases.


Distinguishing regular CPIs from transient CPIs formed by purely random interactions
As shown in Fig. 2 in the paper, the durations of interactions vary greatly. In fact, the real CPIs can be divided into two categories according to CPI duration, namely, the transient CPIs formed by purely random human interactions, and regular CPIs representing meaningful interactions among students [1]. Thus, we employed a statistical mixture model to describe the CPI durations, and determined the optimal duration threshold to distinguish these two categories. As shown in Supplementary Fig. S3, the CPI duration histogram of the SCAU network on November 18, 2011 shows two peaks, where the left peak (denoting transient CPIs) can be approximated by a geometric distribution, and the right one (denoting regular CPIs) can be approximated by a Poisson distribution.
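Consider $N$ CPIs with durations denoted as $x_1, \cdots, x_N$. Suppose the parameters of the geometric distribution and the Poisson distribution are $p$ and $\lambda$, respectively. Then the distribution of CPI duration can be described by the mixture model below.

$$\Pr(x_n = k) = \pi_1 (1 - p)^{k-1} p + \pi_2 \frac{\lambda^k e^{-\lambda}}{k!}. \qquad (1)$$

Here, $\pi_1$ and $\pi_2$ denote the weights of the two distributions, respectively. The parameters $\pi_1, \pi_2, p, \lambda$ were calculated using the expectation-maximization (EM) technique, where the expectation step (E-step) and maximization step (M-step) were executed alternately until convergence.

E-step: At the $(t+1)$-st iteration, for each CPI duration sample $x_n$ ($n = 1, 2, \ldots, N$), we calculated the likelihoods for $x_n$ being a transient CPI or a regular CPI using the parameters $\pi_1^{(t)}, \pi_2^{(t)}, p^{(t)}, \lambda^{(t)}$ estimated at the $t$-th iteration. In particular, writing $\gamma_{n,1}$ and $\gamma_{n,2}$ for these two likelihoods, we have:

$$\gamma_{n,1}^{(t+1)} = \frac{\pi_1^{(t)} (1 - p^{(t)})^{x_n - 1} p^{(t)}}{\pi_1^{(t)} (1 - p^{(t)})^{x_n - 1} p^{(t)} + \pi_2^{(t)} (\lambda^{(t)})^{x_n} e^{-\lambda^{(t)}} / x_n!} \qquad (2)$$

and

$$\gamma_{n,2}^{(t+1)} = 1 - \gamma_{n,1}^{(t+1)}.$$

M-step: Update the parameter estimates $\pi_1^{(t+1)}, \pi_2^{(t+1)}, p^{(t+1)}, \lambda^{(t+1)}$ as below:

$$p^{(t+1)} = \frac{\sum_{n=1}^{N} \gamma_{n,1}^{(t+1)}}{\sum_{n=1}^{N} \gamma_{n,1}^{(t+1)} x_n}, \qquad \lambda^{(t+1)} = \frac{\sum_{n=1}^{N} \gamma_{n,2}^{(t+1)} x_n}{\sum_{n=1}^{N} \gamma_{n,2}^{(t+1)}},$$

and

$$\pi_1^{(t+1)} = \frac{1}{N} \sum_{n=1}^{N} \gamma_{n,1}^{(t+1)}, \qquad \pi_2^{(t+1)} = 1 - \pi_1^{(t+1)}.$$

Having determined the two distributions, the intersection point of the two weighted components can be used as the threshold for distinguishing transient interactions. In the example, the threshold is calculated as 6 Bluetooth scans, which equals 30 minutes since the Bluetooth scan is repeated every 5 minutes (Supplementary Fig. S3). Similar observations were obtained from CPI networks on other dates due to the stability of the transient CPI distribution.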
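The EM procedure described above can be sketched in Python as follows. This is a minimal illustration, not the code used in the study: the function names, the starting values, and the convergence tolerance are assumptions, and the threshold is taken as the last duration before the weighted Poisson component first exceeds the weighted geometric component.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability mass at k, computed in log space for stability."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def geometric_pmf(k, p):
    """Geometric probability mass at k = 1, 2, ... (duration in scans)."""
    return (1 - p) ** (k - 1) * p


def fit_mixture(durations, n_iter=500, tol=1e-10):
    """EM for pi1 * Geometric(p) + pi2 * Poisson(lam) on positive integers."""
    n = len(durations)
    # Starting values are an assumption of this sketch, not taken from the paper.
    pi1, p, lam = 0.5, 0.5, max(durations) / 2
    for _ in range(n_iter):
        # E-step: probability that each sample is a transient CPI.
        resp = []
        for x in durations:
            g = pi1 * geometric_pmf(x, p)
            q = (1 - pi1) * poisson_pmf(x, lam)
            resp.append(g / (g + q))
        # M-step: weighted maximum-likelihood updates.
        r1 = sum(resp)
        new_pi1 = r1 / n
        new_p = r1 / sum(r * x for r, x in zip(resp, durations))
        new_lam = sum((1 - r) * x for r, x in zip(resp, durations)) / (n - r1)
        shift = abs(new_pi1 - pi1) + abs(new_p - p) + abs(new_lam - lam)
        pi1, p, lam = new_pi1, new_p, new_lam
        if shift < tol:
            break
    return pi1, 1 - pi1, p, lam


def transient_threshold(pi1, pi2, p, lam, k_max=1000):
    """Largest duration before the Poisson component first dominates."""
    for k in range(1, k_max + 1):
        if pi2 * poisson_pmf(k, lam) > pi1 * geometric_pmf(k, p):
            return k - 1
    return k_max
```

Calling `fit_mixture` on a list of durations (in scans) and passing the result to `transient_threshold` gives a cut-off in scans; multiplying by the 5-minute scan interval converts it to minutes.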

Construction of idealized CPI networks
Idealized networks were constructed from real CPI networks. Two different scenarios existed here: (1) constructing one idealized network from one real CPI network, which was utilized in network analysis; and (2) constructing one idealized network from several consecutive real CPI networks, which was used in epidemic simulation.
1. Constructing one idealized network from one real CPI network. Suppose the real network has $n$ individuals and $m$ CPIs with weights $w_1, \cdots, w_m$. Then the idealized network to be constructed should have $n$ individuals and $m$ CPIs with identical weight $\hat{w} = \frac{1}{m} \sum_{i} w_i$.
2. Constructing one idealized network from several consecutive real CPI networks. Suppose there are $T$ real networks, which have an identical number $n$ of individuals but different numbers of CPIs ($m_t$ CPIs for the $t$-th network) with different weights ($w_{t,i}$ for the $i$-th CPI in the $t$-th network). Then the idealized network will have $n$ individuals and $\bar{m}$ CPIs with identical weight $\hat{w}$, where $\bar{m} = \frac{1}{T} \sum_{t} m_t$ and $\hat{w} = \frac{1}{T \bar{m}} \sum_{t,i} w_{t,i}$.

With the number of individuals and CPIs determined and a common weight fixed, we constructed three different idealized networks: a uniformly-random network, a scale-free network, and a small-world network. The construction processes were based on the Erdős-Rényi model, the Barabási-Albert model, and the Newman-Watts model, respectively [2]. Suppose the real CPI network has $n$ individuals and $m$ CPIs with identical weight $w$. The algorithms for the construction are as follows, where we use "nodes" to denote "individuals" and "edges" to denote "CPIs".
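As a small worked illustration of the averaging in the two scenarios above, the following Python sketch computes the edge count and the common weight of the idealized network from per-day lists of CPI weights. The function name and the input layout (one list of weights per day) are assumptions made for the example; note that the averaged edge count may be fractional, and how it is rounded is not specified here.

```python
def idealized_parameters(daily_weights):
    """Return (m_bar, w_hat) for T daily networks given as lists of CPI weights.

    With a single day (T = 1) this reduces to scenario (1):
    m_bar = m and w_hat = mean of the weights.
    """
    T = len(daily_weights)
    total_edges = sum(len(weights) for weights in daily_weights)
    total_weight = sum(sum(weights) for weights in daily_weights)
    m_bar = total_edges / T               # average number of CPIs per day
    w_hat = total_weight / (T * m_bar)    # average weight per CPI
    return m_bar, w_hat
```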

Generate uniformly-random networks
Uniformly-random networks are generated by starting from $n$ isolated nodes and repeatedly connecting two randomly chosen unconnected nodes until $m$ edges are formed (Algorithm 1).

Algorithm 1 Construct uniformly-random network
1: Construct a network with n isolated nodes.
2: while number of edges < m do
3:     Randomly select two unconnected nodes, and connect them with an edge weighted w.
4: end while
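Algorithm 1 can be sketched in Python as below. The function name, the edge representation (a dictionary keyed by node pairs), and the `seed` parameter are illustrative assumptions.

```python
import random


def uniformly_random_network(n, m, w, seed=None):
    """Connect random unconnected node pairs until m edges of weight w exist."""
    if m > n * (n - 1) // 2:
        raise ValueError("m exceeds the number of possible node pairs")
    rng = random.Random(seed)
    edges = {}
    while len(edges) < m:
        u, v = rng.sample(range(n), 2)      # two distinct nodes
        key = (min(u, v), max(u, v))        # undirected edge, stored once
        if key not in edges:                # skip pairs that are already connected
            edges[key] = w
    return edges
```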

Generate scale-free networks
In the scale-free network generating process, we first initialize $n_0$ isolated nodes, and at every subsequent step we add a new node and connect it to $\Delta m$ old nodes with $w$-weighted edges. The first newly added node should be connected to all of the $n_0$ old nodes to avoid isolated nodes in the final network. Thus, if we connect every new node to $n_0$ old ones, we have $\frac{m}{n - n_0} = n_0$, and we get $n_0 = \frac{n - \sqrt{n^2 - 4m}}{2}$. To make sure $n_0$ is an integer, this value is rounded to an integer. Since $n_0$ edges have already been placed in the network when the first new node is added, we set the number of new edges in every following step as $\Delta m = \frac{m - n_0}{n - n_0 - 1}$. If $\Delta m$ is not an integer, we apply a small trick to achieve our goal of $m$ total edges: we add $\lfloor \Delta m \rfloor$ edges with probability $\lceil \Delta m \rceil - \Delta m$, and $\lceil \Delta m \rceil$ edges with probability $\Delta m - \lfloor \Delta m \rfloor$. The construction process is shown in Algorithm 2.

Algorithm 2 Construct scale-free network
...
    Let $\Delta m = \Delta m_1$ with probability $p_1$, or $\Delta m = \Delta m_2$ with probability $p_2$;
    Select $\Delta m$ nodes randomly with probability proportional to their degrees, and connect them to the newly added node with weight $w$.
14: end while
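The scale-free construction described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function name, the use of `round` to make $n_0$ an integer, and the cap on the number of targets for early nodes are choices made for this example. Because the number of edges per step is randomized, the total number of edges equals $m$ only in expectation.

```python
import math
import random


def scale_free_network(n, m, w, seed=None):
    """Preferential-attachment sketch with about m edges of weight w."""
    if n * n < 4 * m:
        raise ValueError("n^2 >= 4m is required for a real-valued n0")
    rng = random.Random(seed)
    # Assumption: n0 is made an integer with round(); the rule is not given here.
    n0 = max(1, round((n - math.sqrt(n * n - 4 * m)) / 2))
    edges = {}
    degree = [0] * n

    def connect(u, v):
        edges[(min(u, v), max(u, v))] = w
        degree[u] += 1
        degree[v] += 1

    # The first new node (index n0) is connected to all n0 initial nodes.
    for old in range(n0):
        connect(old, n0)

    later_nodes = n - n0 - 1
    delta = (m - n0) / later_nodes if later_nodes > 0 else 0.0
    base = math.floor(delta)
    for new in range(n0 + 1, n):
        # floor(delta) or floor(delta) + 1 edges, so the mean is delta.
        k = base + (1 if rng.random() < delta - base else 0)
        pool = [u for u in range(new) if degree[u] > 0]
        k = min(k, len(pool))  # cannot attach to more nodes than are available
        targets = set()
        while len(targets) < k:
            # Select an old node with probability proportional to its degree.
            targets.add(rng.choices(pool, weights=[degree[u] for u in pool])[0])
        for t in targets:
            connect(t, new)
    return edges
```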

Generate small-world networks
In the small-world network generating process, we first initialize a regular circle with $n$ nodes and $m_0$ edges, and then randomly add $m_1$ edges. The number of edges used in the initial circle must be a multiple of $n$; thus $m_0 = n \lfloor \frac{m}{n} \rfloor$ and $m_1 = m - m_0$. The detailed construction is shown in Algorithm 3.
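The small-world construction described above can be sketched in Python as follows. The sketch assumes that the initial regular circle links each node to its $\lfloor m/n \rfloor$ nearest successors, and it applies the rewiring step of Algorithm 3 with probability 0.01 per visited edge; the function name, the `rewire_prob` parameter, and the edge representation are illustrative assumptions.

```python
import random


def small_world_network(n, m, w, rewire_prob=0.01, seed=None):
    """Ring lattice with m0 = n * (m // n) edges, plus m1 = m - m0 random edges."""
    k = m // n
    if 2 * k >= n or m > n * (n - 1) // 2:
        raise ValueError("too many edges for n nodes")
    rng = random.Random(seed)
    adj = {u: set() for u in range(n)}

    def add(u, v):
        adj[u].add(v)
        adj[v].add(u)

    # Regular circle: each node is linked to its k nearest successors.
    for u in range(n):
        for step in range(1, k + 1):
            add(u, (u + step) % n)

    # Randomly add the remaining m1 edges between unconnected nodes.
    added = 0
    while added < m - n * k:
        u, v = rng.sample(range(n), 2)
        if v not in adj[u]:
            add(u, v)
            added += 1

    # Rewire each visited edge with a small probability; edge count is preserved.
    for u in range(n):
        for v in list(adj[u]):
            if v in adj[u] and rng.random() < rewire_prob:
                free = [x for x in range(n) if x != u and x not in adj[u]]
                if free:
                    x = rng.choice(free)
                    adj[u].discard(v)
                    adj[v].discard(u)
                    add(u, x)

    return {(u, v): w for u in adj for v in adj[u] if u < v}
```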

Algorithm 3 Construct small-world network
...
5:     Randomly choose two unconnected nodes, and connect them with an edge weighted w.
6: end while
7: for every node do
8:     for every edge connected to the node do
9:         Rewire the edge towards another unconnected and randomly chosen node with probability of 0.01.
10:     end for
11: end for

Periodicity analysis of CPI networks
Taking the school schedule into consideration, one might expect strong periodicity from CPI networks in schools. To investigate the periodicity of our CPI networks, we calculated the similarity between two aggregation CPI networks $N_i$ and $N_{i+\Delta t}$, where $N_i$ denotes the CPI network on the $i$-th day. If there is a periodicity of $\Delta t$, we will observe considerable similarities between $N_i$ and $N_{i+\Delta t}$ for all $i$. The similarities were measured using the following 4 features.
• Degree distribution similarity, defined over the degrees of the individuals in $N_i$ and $N_j$: here $i$ is defined as $i = j + \Delta t$, $d_k^i$ denotes the degree of the $k$-th individual in $N_i$, and $\bar{d}^i$ denotes the average degree of $N_i$.
• Strength distribution similarity, defined over the strengths of the individuals in $N_i$ and $N_j$: here $i$ is defined as $i = j + \Delta t$, $s_k^i$ denotes the strength of the $k$-th individual in $N_i$, and $\bar{s}^i$ denotes the average strength in $N_i$.
• Average fraction of common neighbors, based on the number of common neighbors of the $k$-th individual shared by $N_i$ and $N_j$.
• Average fraction of repeated contacts: here $i$ is defined as $i = j + \Delta t$, $s_k^i$ denotes the strength of the $k$-th individual in $N_i$, $s_{kl}^i$ denotes the number of contacts between the $k$-th and the $l$-th individuals on the $i$-th day, and $RC(k, N_i, N_j) = \sum_{l=1}^{n} \min\{s_{kl}^i, s_{kl}^j\}$ denotes the number of repeated contacts of the $k$-th individual in $N_i$ and $N_j$.
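The repeated-contact count $RC(k, N_i, N_j)$ defined above can be computed as in the following Python sketch. The contact data are assumed to be stored as nested dictionaries (`contacts[k][l]` is the number of contacts between individuals `k` and `l` on one day). The normalization in `average_repeated_fraction`, which divides $RC$ by the strength $s_k^i$ and averages over individuals with nonzero strength, is an assumption of this example.

```python
def repeated_contacts(k, contacts_i, contacts_j):
    """RC(k, N_i, N_j) = sum over l of min(s^i_kl, s^j_kl)."""
    row_i = contacts_i.get(k, {})
    row_j = contacts_j.get(k, {})
    return sum(min(count, row_j.get(l, 0)) for l, count in row_i.items())


def average_repeated_fraction(contacts_i, contacts_j):
    """Mean of RC(k) / s^i_k over individuals with s^i_k > 0 (assumed form)."""
    fractions = []
    for k, row in contacts_i.items():
        strength = sum(row.values())  # s^i_k
        if strength > 0:
            rc = repeated_contacts(k, contacts_i, contacts_j)
            fractions.append(rc / strength)
    return sum(fractions) / len(fractions) if fractions else 0.0
```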
For the consecutive 28 SCAU aggregation CPI networks, these similarity features were calculated and shown in Supplementary Fig. S6. The figure demonstrates a periodicity of 7 days, which is consistent with the school schedule. However, the periodicity is not so strong; thus, the long-term CPI networks cannot be constructed by simply repeating one week of CPI networks.

The relationship between CPI data characteristics and data-collection coverage
The CPI data in SCAU were gathered from 174 volunteers, who account for only ∼10% of the whole undergraduate community. The small sample size might affect the calculation of some network properties [3], and a large-scale data collection program in the future would help clarify the extent to which the sample size affects the estimation of network properties and the understanding of disease spread. However, this is not an inherent drawback, as we can demonstrate, theoretically and experimentally, that the network characteristics of CC and strength distribution are insensitive to the sample size. In other words, the sample size in the study is sufficient for most of the network analysis (Supplementary Fig. S10).

Suppose that the underlying CPI network (denoted as A) among the whole community has $N$ nodes and $M$ edges, and the acquired CPI network (denoted as B, which is a random sample of A) has $n$ nodes and $m$ edges. We will prove that the density and CC of the sampled network B are good approximations to those of the underlying network A.

Proof (density). Denote the $N$ nodes of A as $A_1, A_2, \cdots, A_N$, and let $a_{ij}$ indicate whether $A_i$ and $A_j$ are connected; $B_i$ ($i = 1, 2, \cdots, n$) and $b_{ij}$ ($i = 1, 2, \cdots, n$, $j = 1, 2, \cdots, n$) are defined similarly. Consider the sampling process as a permutation $\pi$, where a subset $C$ of $n$ nodes is first randomly selected from the $N$ nodes in A and is then permuted randomly. Then we have

$$m = \sum_{1 \le i < j \le n} b_{\pi(i)\pi(j)}.$$

Note that $b_{\pi(i)\pi(j)}$ is binary and the probability

$$\Pr(b_{\pi(i)\pi(j)} = 1) = \frac{M}{\binom{N}{2}} = \frac{2M}{N(N-1)},$$

since the nodes in B are randomly selected from the nodes in A. Thus we have

$$E\left[\frac{m}{\binom{n}{2}}\right] = \frac{1}{\binom{n}{2}} \sum_{1 \le i < j \le n} \Pr(b_{\pi(i)\pi(j)} = 1) = \frac{2M}{N(N-1)}.$$

Therefore, the expectation of the density of B equals the density of A.

Proof (CC). There are two different definitions of the clustering coefficient (CC). Here, we prove the claim under each of the two definitions in turn.
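1. The clustering coefficient of a network is defined as the mean CC of all nodes. Consider a specific node $v$ with degree $D$ in A. Suppose there are a total of $M$ edges among the $D$ neighbors of $v$ (in this item, $M$ and $m$ denote edge counts among the neighbors of $v$ rather than in the whole network). Then the CC of $v$ is defined as:

$$CC_A(v) = \frac{2M}{D(D-1)}.$$

Suppose that in the network B, the node $v$ has a degree of $d$ and there are a total of $m$ edges among the $d$ neighbors. Then the CC of $v$ in B can be calculated as:

$$CC_B(v) = \frac{2m}{d(d-1)}.$$

Because the $d$ sampled neighbors are a random subset of the $D$ neighbors, the density argument above applies to the neighborhood of $v$, so the expectation of $CC_B(v)$ equals $CC_A(v)$.

2. Another definition is that CC can be calculated as three times the ratio of the number of triangles to the number of connected triples. Suppose that in the sampling process to generate B from A, every node is selected with probability $p$. Thus, every connected triple, as well as every triangle, is selected with probability $p^3$. Hence, we have

$$CC(B) = \frac{3 \times \#\{\text{triangles in } B\}}{\#\{\text{connected triples in } B\}} \approx \frac{3 \times p^3 \times \#\{\text{triangles in } A\}}{p^3 \times \#\{\text{connected triples in } A\}} = CC(A).$$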
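The insensitivity of density and CC to node sampling can also be checked numerically. The sketch below is an illustration on a synthetic random graph, not on the SCAU data: it induces subgraphs on random node subsets and compares the mean density and the mean CC (triangle-based definition) with the values of the full network. All function names and the choice of test graph are assumptions of this example.

```python
import random
from itertools import combinations


def random_graph(n, p, rng):
    """Synthetic test graph: each node pair is connected with probability p."""
    adj = {u: set() for u in range(n)}
    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def density(nodes, adj):
    """Edge density of the subgraph induced by the node set `nodes`."""
    n = len(nodes)
    edges = sum(len(adj[u] & nodes) for u in nodes) / 2
    return 2 * edges / (n * (n - 1))


def transitivity(nodes, adj):
    """3 * triangles / connected triples in the induced subgraph."""
    closed = triples = 0
    for u in nodes:
        nbrs = [v for v in adj[u] if v in nodes]
        triples += len(nbrs) * (len(nbrs) - 1) // 2
        # Each triangle is counted once per corner, i.e. three times in total.
        closed += sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return closed / triples if triples else 0.0


def subsample_means(adj, coverage, repeats, rng):
    """Mean density and CC over random induced subgraphs at a coverage level."""
    all_nodes = sorted(adj)
    size = max(2, round(coverage * len(all_nodes)))
    dens, ccs = [], []
    for _ in range(repeats):
        nodes = set(rng.sample(all_nodes, size))
        dens.append(density(nodes, adj))
        ccs.append(transitivity(nodes, adj))
    return sum(dens) / repeats, sum(ccs) / repeats
```

Comparing `subsample_means(adj, coverage, repeats, rng)` with `density(set(adj), adj)` and `transitivity(set(adj), adj)` at several coverage levels mirrors the kind of experiment summarized in Supplementary Fig. S10.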

The network-based strategies for disease control
At present, the popular approaches to controlling the transmission of respiratory infectious diseases are targeted quarantine/vaccination, i.e., selecting a collection of individuals for quarantine or vaccination. The percentage of selected individuals is denoted as the quarantine/vaccination coverage. A variety of network-based selection strategies have been proposed according to individuals' characteristics calculated from CPI networks, including degree (contact number), strength, betweenness, clustering coefficient (CC), and the primary eigenvector.
In this study, we evaluated the following network-based strategies for disease control.
1. Degree strategy: The quarantine/vaccination subjects were selected according to the number of CPI partners, i.e., the node degree in CPI networks. The individuals were ranked with descending order of degrees, and were removed in this order until the required vaccination coverage was achieved. The rationale underlying this selection is to perform quarantine/vaccination on the individuals with frequent CPIs with others.
2. Strength strategy: The quarantine/vaccination subjects were selected according to the total duration of interaction with others. The ranking and removing policy is identical to that used in the degree strategy.
3. CC strategy: For an individual, CC denotes the fraction of its CPI partner pairs that interact with each other, out of the total possible number of its CPI partner pairs. A high CC usually means that the individual is unlikely to constitute a bottleneck for disease spread; thus, we chose nodes with lower CC for quarantine/vaccination, i.e., the individuals were ranked in ascending order of CC, opposite to the ordering used in the degree strategy. The removing policy is identical to that used in the degree strategy.
4. Betweenness centrality strategy: An individual has high betweenness centrality if it lies on the shortest paths of many individual pairs. Specifically, for an individual $i$, its betweenness centrality is defined as $C_B(i) = \sum_{s \neq t \neq i} \frac{\sigma_{st}(i)}{\sigma_{st}}$, where $s$ and $t$ denote two individuals in the network, $\sigma_{st}$ denotes the total number of shortest paths between $s$ and $t$, and $\sigma_{st}(i)$ is the number of those paths passing through $i$. All individuals were ranked in nonincreasing order of betweenness centrality, and the top $k$ individuals were selected for quarantine/vaccination such that the required coverage was achieved.
5. Betweenness centrality strategy (greedy version): After the individual with the highest betweenness centrality was removed, the betweenness centrality of every remaining individual was updated, and the next candidate for removal was selected accordingly. This process was repeated until the required coverage was achieved.
6. Eigenvector centrality strategy: The eigenvector centrality of a vertex is the corresponding component of the primary eigenvector of the adjacency matrix $A$, where entry $a_{ij}$ denotes the CPI duration between individuals $i$ and $j$. Intuitively, an individual with high eigenvector centrality is more likely to be a hub of the CPI network. The ranking and removing policy is identical to that used in the degree strategy.
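The ranking-and-removal policy shared by the strategies above can be sketched in Python for the degree, strength, and CC strategies. The network is assumed to be stored as a nested dictionary (`adj[u][v]` is the CPI duration between `u` and `v`); the function name, the rounding of the coverage to a whole number of individuals, and the convention that individuals with fewer than two partners have CC equal to 0 are assumptions of this example. The betweenness and eigenvector strategies would only differ in the score function.

```python
from itertools import combinations


def select_targets(adj, coverage, strategy="degree"):
    """Return the individuals chosen for quarantine/vaccination."""

    def degree(u):
        return len(adj[u])                    # number of CPI partners

    def strength(u):
        return sum(adj[u].values())           # total interaction duration

    def clustering(u):
        nbrs = list(adj[u])
        k = len(nbrs)
        if k < 2:
            return 0.0                        # assumption: CC = 0 when undefined
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        return 2 * links / (k * (k - 1))

    if strategy == "degree":
        key = lambda u: -degree(u)            # descending degree
    elif strategy == "strength":
        key = lambda u: -strength(u)          # descending strength
    elif strategy == "cc":
        key = clustering                      # ascending CC (low CC first)
    else:
        raise ValueError("unknown strategy: " + strategy)

    ranked = sorted(adj, key=key)             # stable sort keeps input order on ties
    count = round(coverage * len(adj))
    return ranked[:count]
```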
Besides these network-based strategies, class cancelation is also commonly applied as a disease control strategy in boarding schools. During a class cancelation period in boarding schools, all students are ordered to stay within the campus, with all classes canceled, thus showing a CPI pattern nearly identical to that on weekends. In the study, class cancelation was simulated by simply replacing the CPI networks on weekdays with the CPI networks on weekends.

Figure S1: The characteristics of SCAU CPI networks on weekdays (A) and weekends (B). The characteristics include density (blue), CC (magenta), efficiency (green) and modularity (red). It can be observed that all CPI networks on weekdays show similar characteristics, and so do the CPI networks on weekends. The only exception is the CPI network on November 12th: there was an activity in SCAU, which resulted in relatively frequent transient interactions. These frequent transient interactions further led to significantly low modularity.

Figure S2: Real CPI networks acquired on November 07 (Monday) and November 12 (Saturday), 2011 in USTB, denoted as (A) USTB1107 and (B) USTB1112, respectively. Here, a node represents an individual student, and if there is a CPI between two students, an edge is drawn between the two corresponding nodes with edge width proportional to the aggregate CPI duration over the entire day.

Figure S3: Histogram of CPI durations among SCAU volunteers on November 18, 2011. The histogram is approximated by the mixture of a geometric distribution and a Poisson distribution, plotted with red and blue lines, respectively. The inset shows in detail the two lines intersecting at CPI = 6.5 scans, which implies that contacts lasting less than or equal to 6 Bluetooth scans (or 6 × 5 = 30 minutes) are transient interactions.
Figure S4: Degree distribution of real CPI networks acquired on November 1 (Tuesday, panels A and F), November 2 (Wednesday, panels B and G), November 3 (Thursday, panels C and H), November 4 (Friday, panels D and I), and November 6 (Sunday, panels E and J), 2011 in SCAU, denoted as SCAU1101, SCAU1102, SCAU1103, SCAU1104 and SCAU1106, respectively. Panels A-E are shown in linear scale, while panels F-J are shown in log-log scale. It can be observed that the CPI networks on weekdays (SCAU1101, SCAU1102, SCAU1103, SCAU1104) deviate significantly from the idealized Poisson distribution of the counterpart uniformly-random network with the same number of nodes and edges. In contrast, the network on the weekend (SCAU1106) shows a power-law distribution.

At each coverage level, a total of 100 random subsets of vertices were selected and 100 subgraphs were induced from these vertices. The figures indicate that the expected values of the density and CC of the subgraphs approximate those of the original networks. As expected, the higher the coverage, the smaller the variance of the characteristics.