Inferring network properties based on the epidemic prevalence

Dynamical processes running on different networks behave differently, which makes the reconstruction of the underlying network from dynamical observations possible. However, to what level of detail the network properties can be determined from incomplete measurements of the dynamical process is still an open question. In this paper, we focus on the problem of inferring the properties of the underlying network from the dynamics of a susceptible-infected-susceptible epidemic and we assume that only a time series of the epidemic prevalence, i.e., the average fraction of infected nodes, is given. We find that some of the network metrics, namely those that are sensitive to the epidemic prevalence, can be roughly inferred if the network type is known. A simulated annealing link-rewiring algorithm, called SARA, is proposed to obtain an optimized network whose prevalence is close to the benchmark. The output of the algorithm is applied to classify the network types.


Introduction
Graphs are the underlying structures of many systems and many dynamic processes on those systems can be modeled by a spreading process on their underlying graphs (Pastor-Satorras et al. 2015;Anderson et al. 1992;Harris 1974). The difference in the underlying graphs may lead to contrasting dissimilar behavior of the process. One wellknown result is that the mean-field epidemic threshold of the spreading process vanishes with the size of the scale-free network (Pastor-Satorras and Vespignani 2001;Chatterjee and Durrett 2009), while the threshold of a sparse homogeneous network is non-zero.
Another key difference is that a near-threshold spreading process is localized just above the threshold in a heterogeneous network, but delocalized in a homogeneous network (Goltsev et al. 2012;Liu and Mieghem 2019). Moreover, the autocorrelation of the infection state of each node in a regular graph is irrelevant to the curing rate in the steady state (Liu and Van Mieghem 2018). In a real scenario, reviewing of the spreading data of cholera in London in 1854 under the susceptible-infected-susceptible (SIS) model indicates that the trajectory of the prevalence reflecting network properties supporting the hypotheses that the Broad Street pump was the source of the cholera outbreak and that cholera does not spread via the air (Paré et al. 2018). Since the dynamics of different networks behave differently, the inverse question raises: "How much can we deduce about the underlying contact network by measuring the dynamics on the network?" The inverse question is meaningful when the direct measurement of the underlying graph is unavailable. For example, a disease control agency usually has the statistics of disease infection, but the underlying graph bearing the spreading of the disease is generally unknown.
Much work on the inverse problem exists (Mateos et al. 2019;Dong et al. 2019). Most of the papers focus on reconstructing the underlying graphs by measuring the timedependent dynamical state of each node (Shandilya and Timme 2011;Berry et al. 2012;Timme and Casadiego 2014;Nitzan et al. 2017;Prasse and Van Mieghem 2018;Netrapalli and Sanghavi 2012;Myers and Leskovec 2010;Sefer and Kingsford 2015;Gomez Rodriguez et al. 2010). With the complete dynamics of each node, the network may be approximately reconstructed by different heuristic algorithms, e.g., the Bayesian methods (Friston 2002;Pajevic and Plenz 2009), the conflict-based method (Ma et al. 2015), statistical inference based method (Ma et al. 2018) and the compressed sensing or lasso methods (Shen et al. 2014). Different networked dynamical processes have been studied, such as the evolutionary game model ), the SIS model (Shen et al. 2014) and the Ising model (Ma et al. 2018). Apart from reconstructing simple networks, there are many works on the reconstruction of the stochastic temporal networks , multilayer networks (Mei et al. 2018), weighted networks (Ching et al. 2015) and directed networks (Hempel et al. 2011).
All of the above methods are based on the data from all or at least most nodes, but in real scenarios, individual-level observations of spreading are hard to obtain while most of the epidemic data are population-level (Shaman and Kohn 2009;Shaman et al. 2010). Motivated by the incompleteness of realistic situations, we study how much about the underlying network can be deduced with incomplete measurements. We assume that only the prevalence, which is the average fraction of infected nodes in the network is measured, but not the infection state of each node. Under this setting, network reconstruction does not seem possible, but inferring some network properties may be possible, in particular, when additional information apart from the prevalence is available. In this work, we confine ourselves to four types of classical network models: the scale-free (SF) graphs (Goh et al. 2001), the Barabási-Albert (BA) graphs (Barabási and Albert 1999), the Erdős-Rényi (ER) random graphs (Erdős and Rényi 1959) and the Watts-Strogatz (WS) small-world graphs (Watts and Strogatz 1998). The network size N of these networks considered in this work is not larger than 2000. Additionally, we focus on the SIS epidemic process on networks, which is one of the basic models resembling the dynamics of many networked systems and assume that the infection and curing rate of the SIS process are known. Under our assumptions, part of the network properties can be inferred, provided that the network type is additionally given. Furthermore, the network type among the four above-mentioned graphs can be identified, given the network size N and the average degree E [ D], which is also emphasized by recent work from a different approach (Di Lauro et al. 2019): the ER, regular and BA graphs are distinguished by the epidemic prevalence.
The paper is organized as following: In "The SIS process" section, we briefly review the SIS process on networks. In "Correlations between the SIS prevalence and network metrics" section, we evaluate the correlation between the network metric difference and the corresponding SIS prevalence difference given the network type. A high correlation implies that, if an estimated network, whose prevalence is close to the benchmark prevalence, can be found, then the metric of this estimated network may be also close to the metric of the benchmark network. We further verify the possibility of estimating the network metrics, whose differences are highly correlated with the prevalence difference. In "Distinction between network types" section, we propose a simulated annealing linkrewiring algorithm (SARA) to find a possible network whose prevalence is close to the benchmark. The output of the algorithm is applied to classify the network types. In "Estimating the topology of small networks and prevalence" section, we test the performance of SARA by inferring the structure of small networks and by forecasting the future trend of the prevalence. Finally, we conclude in "Summary" section.

The SIS process
We consider the SIS process on an unweighed, undirected network without self-loops. In the network, all the nodes are divided into two compartments: infected nodes and susceptible (healthy) nodes. An infected node can infect each healthy neighbor with rate β and the infected node can be cured spontaneously with rate δ, both as Poisson processes. If we denote the infection state of node i at time t by a Bernoulli random variable X i (t), with X i (t) = 1 being infected and X i (t) = 0 being healthy, the exact SIS process of node i in an N-node network is governed by the following equation, where a ki ∈ {0, 1} is the element of the adjacency matrix A of the network. In the brackets of the right-hand side of (1), the first term represents the curing process and the second term represents the infection process. If the effective infection rate τ β/δ is above an epidemic threshold, then the infection can persist in the network; below the threshold, the epidemic dies out exponentially fast for sufficiently long time. The endemic phase and allhealthy phase are identified by the time-dependent prevalence y(t) = 1 . In this paper, the SIS prevalence is generated by an event-driven simulation based on the Gillespie algorithm (Gillespie 1977;Liu and Van Mieghem 2017;St-Onge et al. 2019).

Preliminaries
Two different networks may produce a similar prevalence, and thus we need to understand which network properties are important factors in the SIS process. If the SIS prevalence is sensitive to a specific network metric, then the prevalence generated by two networks with different values of this metric may be distinct. Assume that we have a benchmark network with a metric M b and an estimated network with the metric M e . If the time series of the prevalence on the benchmark and estimated networks are {y b (i t)} i=0,...,T−1 and {y e (i t)} i=0,...,T−1 , respectively, then their correlation can be evaluated by computing the prevalence difference and the metric difference If we have n corresponding realizations of the differences (D pi , D Gi ) for i = 1, . . . , n, then we can compute their correlation by the Pearson correlation coefficient ( Van Mieghem 2014, p. 26), Only if ρ(D p , D G ) approaches one, then the metric M and the prevalence y(t) are highly correlated, which indicates that inferring the metric from the prevalence may be possible.

Evaluated network metrics
The graph metrics considered in this section are shown in Table 1.
The assortativity ρ D , which is the degree correlation between connected nodes (Van , can be calculated as where d i and d j are the degrees of nodes at the end of a link i ∼ j, and L is the number of links.
The average clustering coefficient C G , which is the probability that the node pairs with same neighbors are also connected, can be computed as where i is the number of triangles containing node i. Some of the above metrics can be strongly correlated with the prevalence y(t). For example, the epidemic threshold τ HMF c derived from the heterogeneous mean-field (HMF) approach (Pastor-Satorras et al. 2015) is where D is the degree of a randomly selected node and the epidemic threshold τ (1) c derived from NIMFA (Mieghem et al. 2009) is Many graph metrics can also be bounded. For example, the average degree follows E[D] λ 1 in connected graphs ( Van Mieghem 2010) and the largest eigenvalue of the

Correlation analysis
For any pair of networks, the prevalence difference D p and the metric difference D G can be calculated based on (2) and (3). For each network metric, we calculate the correlations via (4) between a set of metric differences D G and their corresponding prevalence differences D p on four network models: the SF graphs (Goh et al. 2001), the BA graphs (Barabási and Albert 1999), the ER random graphs (Erdős and Rényi 1959) and the WS small-world graphs (Watts and Strogatz 1998). Specifically, the SF graphs are generated by the configuration model (Goh et al. 2001;Catanzaro and Pastor-Satorras 2005) and the degree exponent parameter γ is uniformly at random chosen in the interval [2.5, 3.0] in this paper. Specifically, we first randomly generate the four kinds of networks each with 100 realizations. The network sizes N and the average degrees E [ D] are chosen uniformly at random in the interval [1000,2000] and [4,12], respectively. The effective infection rate is set as τ = 3.0, which is above the epidemic threshold of every network realization. Two kinds of initial state are chosen: y 0 = 0.2 or y 0 = 1.0, which means that 20% of the nodes are randomly chosen to be infected or all nodes are infected initially. For each network and initial state, a corresponding time series of the prevalence is obtained by averaging over 100 realizations of the SIS simulation. We mark the prevalence difference D p under initial condition y 0 as D p (y 0 ). We further denote the metric difference D G for one specific metric as D G (metric). All metrics shown in "Evaluated network metrics" section are considered and the Pearson correlation coefficients ρ D p (y 0 ), D G (metric) are calculated by Eq. (4). The sample size of each correlation coefficient is 100 2 = 4950. Table 2 and     However, there are relatively weak correlations between the difference of the prevalence D p and the differences of the network size N, the largest degree d max , the algebraic connectivity μ N−1 , the assortativity ρ D and the average clustering coefficient C G . Moreover, the initial state has very slightly influence on the correlations. To summarize, if the type of the underlying graph is given, then inferring the network properties, whose differences D G are highly correlated to the difference of the prevalence D p , is possible. A straightforward method is randomly generating the network realizations by the corresponding network model and selecting the one realization produces minimum prevalence difference D p .

Inferring network metrics given the network type
We further try to infer the network metrics based on the prevalence from a single realization of the SIS process given the network type. Specifically, for each network type, we first generate 1000 benchmark networks whose network sizes N and average degrees E [D] are chosen uniformly at random in the interval [200,500] and [4,8], respectively. For each benchmark network, one corresponding benchmark prevalence is generated from only one realization of the SIS process.
We then try to estimate the network metrics of each benchmark network as follows. For each benchmark, 1000 networks with the same network type as the benchmark network are generated. The network sizes N and average degrees E[D] of the generated networks are also chosen uniformly at random in the interval [200,500] and [4,8], respectively. The network with the smallest prevalence difference D p to the benchmark is selected as the estimated network. The metrics of this estimated network are regarded as the estimated metrics of the benchmark network.
We measure the performance of the metric inference under the mean absolute error (MAE) and the mean squared error (MSE). The MAE and MSE for n underlying graphs is given by and where M ei and M bi denote the estimated and real metrics of the benchmark network G i , i = 1, 2, · · · , n.
Tables in the Additional file 1 show MAE and MSE of each network metric for different network types (the ER random graphs, the WS small-world graphs, the BA graphs and the SF graphs). For the treatment group, we calculate MAE and MSE of each metrics which are estimated by selecting the network whose prevalence is closest to the benchmark. For the control group, we calculate MAE and MSE of each metrics which are estimated by randomly generating a network whose network sizes N and average degrees E [D] are chosen uniformly at random in the interval [200,500] and [4,8]. For the network metrics whose differences are closely correlated with the prevalence difference D p , i.e., the average degree E[D], the second moment of degree E[D 2 ], the average shortest path length E[H], the global efficiency E[1/H] and the spectral radius λ 1 , their MAE and MSE of the treatment group are much less than those of the control group, which indicates that these metrics can be roughly deduced based on the prevalence given the network type. However, for the network metrics whose differences are weakly correlated with the prevalence difference D p , i.e., the network size N, the largest degree d max , the algebraic connectivity μ N−1 , the assortativity ρ D and the average clustering coefficient C G , their MAE and MSE of the treatment group are close to those of the control group.

Distinction between network types
In this section, we try to distinguish the type of the underlying network given the time series of the prevalence {y b (i t)} i=0,...,T−1 , the network size N, the number of links L and the effective infection rate τ . We propose a simulated annealing link-rewiring algorithm (SARA) to optimize a network whose prevalence can be close to the input prevalence benchmark and the performance difference between different rewiring mechanisms in SARA can be applied to identify the graph type.

Simulated annealing link-rewiring algorithm
The basic principle of SARA is that the links of an estimated network are continually rewired based on different rewiring methods to minimize the prevalence difference D p between the optimized network and the benchmark network.
The algorithm operates iteratively and a random network is initialized. In each iteration, the network will be renewed by rewiring the links of partial nodes. A new corresponding time series of the prevalence {y e (i t)} i=0,...,T−1 can be generated by simulating the SIS process on the network and its difference D p to the benchmark time series of the prevalence {y b (i t)} i=0,...,T−1 is calculated. If the difference D p decreases, then the rewired network will be accepted. If D p increases, then the rewired network is accepted with an acceptance probability p and rejected with rejection probability 1 − p to prevent local optima. Moreover, a stable final converging result is obtained provided that the acceptance probability p decreases with the iterations. The final result of this algorithm is an estimated graph, whose corresponding prevalence is almost the same as the benchmark prevalence. Inspired by the generation processes of ER graphs and BA scale-free graphs, we consider two different rewiring methods: randomly connecting (RC) and preferential attachment (PA). In RC, the selected nodes are rewired uniformly at random to the rest of the nodes in the network, and in PA, the selected nodes are rewired to a node with probability proportional to the node's degree. The pseudo-code of SARA is shown in Algorithm 1.

Algorithm 1: Pseudo-code of the simulated annealing link-rewiring algorithm (SARA)
Input : {y(i t)} i=0,...,T−1 , N, L, τ , initial temperature V tmp , cooling rate 0 < r < 1 and step length S N Output: Estimated network G e , final prevalence difference D p 1 An initial network is chosen uniformly at random from the set of all networks with N nodes and L links.

for iteration bound do 3
Uniformly randomly choose N c = round(S N × D p ) nodes.

4
Delete all links of each chosen node and then rewire these links to new neighbors.

5
If we randomly choose new neighbors without preference (RC), the probability p i that the rewired link is connected to a neighbor i is p i = 1/(N − N c ), where node i belongs to the N − N c uncollected nodes.

6
If we rewire links based on preferential attachment mechanism (PA) , the probability p i that the rewired link is connected to a neighbor i is where d i is the degree of node i in residual network and the sum is made over all unselected nodes.

7
If n isolated nodes appear after the rewiring process in step 5 or step 6, we remove n links uniformly at random and rewire them to the isolated nodes based on the RC or PA mechanism, respectively. This step continues until there is no isolated node in the network.

8
Simulate the SIS process on the new network and calculate the prevalence difference D 2 to the benchmark.

Distinction between the network types
We try to distinguish four kinds of graphs (the SF graphs, the BA graphs, the ER random graphs and the WS small-world graphs) based on the optimized prevalence curves generated by SARA. The experiment and the results are as follows. For each network model, we generate 100 network realizations with N = 1000 nodes and L = 4000 links as the benchmark networks. For the SF graphs, the degree exponent γ ranges in the interval γ ∈[2.5, 3.0]. For the SW graphs, the rewiring probability p r ∈[0.5, 1.0]. The corresponding time series of the prevalence are obtained by averaging 10 realizations with effective infection rate τ = 1, which is above the epidemic threshold for benchmark networks. For each benchmark graph realization and corresponding prevalence, we apply SARA with RC and PA mechanisms separately and obtain two corresponding prevalence differences D RC and D PA from the final output of the optimization, respectively. The performance difference between these two rewiring mechanisms provides a possibility of identifying the types of underlying graphs by applying different rewiring methods in SARA. We then try to classify the networks by the difference value D RC − D PA . Figure 1 shows that these four kinds of networks can be almost exactly classified by the difference value D RC − D PA . Indeed, D RC > D PA for almost all SF and BA graphs while D RC < D PA for almost all ER and SW graphs as shown in Fig. 1a. We exam the classification performance by the receivers operating characteristic (ROC) curve, which is a curve of the True Positive Rate (TPR) where  The area under the ROC curve (AUROC) depicts the accuracy of classification. If AUROC = 1, then the classification is perfect. In Fig. 1b and Fig. 1d, the ROC curves of the difference value D RC − D PA between any two kinds of networks show that these networks can be distinguished almost exactly.

Estimating the topology of small networks and prevalence
The network output of SARA: an example In this section, we test the feasibility of approximately reconstructing small graphs from the prevalence. We show example output of SARA under the benchmark of a small tree network and a small wheel network. In SARA, the initialized networks are chosen uniformly at random from all networks with the same number of nodes and links as the benchmark networks. The rewiring methods are selected to be the one with a smaller difference of the prevalence in the output. As shown in Fig. 2, the main features of the benchmark networks are captured fairly well by the final output of SARA.

Forecast the future trend of epidemic prevalence
Any benchmark prevalence from either homogeneous or heterogeneous networks can be fitted well by SARA. Therefore, we can further analyze the feasibility of predicting the future prevalence evolution by fitting the few initial prevalence observations. Fig. 2 The reconstruction of a tree network and a wheel network. The left parts are the benchmark underlying network and the right parts are the estimated networks. The curves are the difference of prevalence against the number of iterations. The difference of prevalence is already small when the number of iterations is around 150. The prevalence curves are obtained by averaging 500 realizations and only the central node is infected initially We fit only the initial part (10%) of the time series of the prevalence {y(i t)} i=0,..., T/10 −1 generated by four different benchmark networks and compare the whole prevalence output of the algorithm with the benchmark prevalence. RC rewiring is applied for ER and WS graphs, and PA rewiring is applied for BA and SF graphs. As shown in Fig. 3a about the ER and WS graphs, the estimated prevalence (dashed curves) are close to the benchmark (solid curves). However, as shown in Fig. 3b, the prediction is inaccurate for BA and SF graphs.

Summary
We study the feasibility of inferring properties of the underlying graphs based on the SIS prevalence. Pearson's correlations (4) between the differences of prevalence and the network metrics are evaluated. Given network type, the difference of the epidemic prevalence is highly related to the differences of some network metrics, such as the average degree E [D], the second moment of degree E[D 2 ], the average shortest path length E[H], the global efficiency E[1/H] and the spectral radius λ 1 . If the network type is known, then these metrics can be roughly estimated by finding a network whose prevalence curve is close to the benchmark. To distinguish the network type, we further propose an algorithm SARA, which can find a network whose epidemic prevalence is close to the benchmark. Given network size and the number of links, four network types (the SF graphs, the BA graphs, the ER random graphs and the WS small-world graphs) can be classified by different rewiring methods combined with SARA. Visually, the output network of SARA captures the features of small benchmark networks well. Finally, we show that it is possible to predict the later prevalence from the initial stage prevalence for homogeneous networks.
The prevalence in the SIS model resembles the population-level observations. Population-level observations lose details of nodal infection but may still provide information about the underlying network. In real scenarios, the population-level observations are available for many different infectious diseases, such as influenza, Ebola virus disease, Zika virus disease, etc. Disease control agencies may take advantage of the populationlevel observations to understand the detailed spreading pattern, further forecast the outbreaks more accurately and control the diseases more efficiently. For example, a small The results about BA graphs and SF graphs. Two kinds of initial states are chosen: y 0 = 0.2 and y 0 = 1.0. The time series of the prevalence are the mean of 10 propagation diameter of the network inferred by the population-level observations implies that modern transportation plays a role; a large clustering coefficient means that spreading is effectively exploring a community or geographical area; using the initial stage prevalence, it is possible to approximately reconstruct the small-size local network containing the initial infections. Limitations like those in our experiments, such as the demanding of extra parameters apart from the prevalence, still exist, but on the other side, additional known knowledge, e.g. population distribution, may be available and helps the inference of the network properties.