Epidemic inference through generative neural networks

Reconstructing missing information in epidemic spreading on contact networks can be essential in prevention and containment strategies. For instance, identifying and warning infective but asymptomatic individuals (e.g., manual contact tracing) helped contain outbreaks in the COVID-19 pandemic. The number of possible epidemic cascades typically grows exponentially with the number of individuals involved. The challenge posed by inference problems in the epidemics processes originates from the difﬁculty of identifying the almost negligible subset of those compatible with the evidence (for instance, medical tests). Here we present a new generative neural networks framework that can sample the most probable infection cascades compatible with observations. Moreover, the framework can infer the parameters governing the spreading of infections. The proposed method obtains better or comparable results with existing methods on the patient zero problem, risk assessment, and inference of infectious parameters in synthetic and real case scenarios like spreading

infections in workplaces and hospitals.

Introduction
Discrete-state stochastic compartmental models have been traditionally used to model infectious diseases (1)(2)(3) and provide a simple and unified mathematical framework for a wide variety of spreading processes occurring in social and technological systems (4).The time-forward simulations of most epidemic models, even those incorporating detailed demographic and mobility data, can be efficiently performed using Monte-Carlo based sampling techniques or, at least at the meta-population level, exploiting approximation methods, such as stochastic differential equations and moment closure schemes (5).These computational methods have been largely applied to large-scale epidemic forecasting (6)(7)(8)(9) and containment (10)(11)(12).Their effectiveness crucially depends on the capacity to exploit the available information on the past behavior of the epidemic outbreak.At the meta-population level, such information, represented by temporal series of aggregate quantities (e.g.daily number of newly infected individuals inside a reference population) can be rather easily included within traditional Bayesian computational frameworks based on Monte Carlo sampling techniques (13)(14)(15).The COVID-19 pandemic has motivated the interest for the large-scale adoption of epidemic surveillance techniques and digital contact tracing through smartphone applications (16, 17), which could make it possible to access/use a large amount of (possibly inaccurate) individual-based observational data, traditionally available only for case studies in rather small and controlled environments (18-20).The availability of individual-based observational data unveils a crucial limitation of traditional Monte-Carlo based inferential techniques.While the number of possible epidemic realizations generated by a specific epidemic model on a given contact network scales exponentially with the systems size and the duration of the process, those compatible with individual-based observations are just an exponentially small fraction of them (21).It follows that inferential methods based on the direct sampling of epidemic realizations on individual-based contact networks rapidly become inefficient as the size of the outbreak increases (21).As an example, let us consider several simulated epidemic realizations (with the same initial condition, consisting of a single infected individual, and the same epidemic parameters) in a real graphs of temporal contacts between patients and staff members of a hospital (22) (see Fig. 1).We define the daily configuration of the system as the daily epidemic state of the individuals (infected/not-infected) during the epidemic process.
The plots in Fig. 1 indicate a fast divergence of the configurations of the simulated epidemics, even though they start from the same individual with the same epidemic parameters.Choosing, for instance, the final configuration of one epidemic cascade as the individual-based observation of the system, it is very unlikely to obtain the same configuration from a direct sampling of the epidemic model.
A step forward in this field is represented by the introduction of efficient algorithms for Bayesian inference based on Belief Propagation (BP) (23,24).In the Bayesian inference framework, the objective is to approximately compute the posterior probability of the system, assuming the epidemic model as a prior and the individual-based observations as the evidence.
BP-based algorithms make it possible to obtain estimates of the local marginals of the posterior distribution and, as shown in (23,25,26), this approach outperforms competing methods on sparse contact networks on a variety of inference problems.In particular, the integration of such algorithms in the framework of digital contact tracing for COVID-19 was recently shown to provide a better assessment of the individual risk and improve the mitigation impact of nonpharmaceutical interventions strategies (27).
BP-based algorithms may experience non-convergence issues, for instance in dense and very structured contact networks, a phenomenon that calls for the search of alternative inference methods which could overcome such a limitation while maintaining comparable performances on sparse networks.Here we propose to use generative neural networks, specifically autoregressive neural networks (ANN), to learn the posterior probability of an epidemic process and efficiently sample from it.In practice, the autoregressive neural network can generate realizations of the epidemic process according to the stochastic dynamical rules of the prior model but compatible with the evidence.Deep autoregressive neural networks are used to generate samples according to a probability distribution learned from data, for instance for images (28), audio (29), text (30,31) and protein sequences (32) generation tasks and, more generally, as a probability density estimator (33)(34)(35).Autoregressive neural networks have recently been used to approximate the joint probability distributions of many (discrete) variables in statistical physics models (36), and applied in different physical contexts (37)(38)(39)(40).In this work, we show how to use a deep autoregressive neural network architecture to efficiently sample from a posterior distribution composed of a prior, given by the epidemic propagation model (even though the parameters of such model can be contextually inferred), and from an evidence given by (timescattered) observations of the state of a subset of individuals.Neural networks have already been applied to epidemic forecasting (41)(42)(43) but rarely to epidemic inference and reconstruction problems.Two recent preliminary works apply neural networks to epidemic inference problems: in (44) the patient zero problem is tackled using graph neural networks, while a similar technique is applied to epidemic risk assessment in (45).The presented approach allows to address successfully a large class of epidemic inference problems, ranging from the patient-zero problem and individual risk assessment to the inference of the parameters of the propagation model under a unique neural network framework.We believe this to be a strong point in favour of this technique.In all such problems, the proposed autoregressive neural network architecture provides results that are at least as good as all other methods considered for comparison and outperforms them in most cases.The implementation of the algorithm and the instructions to reproduce the results are available at (46).

Materials and Methods
The posterior probability of the epidemic process.The dynamics of epidemic spreading in a contact network is commonly described by means of individual-based stochastic models in which individuals can be in a finite set of possible states, usually called epidemic compartments.For the sake of concreteness, consider the discrete-time SIR model, in which x t i ∈ X = {S, I, R} stands for the individual i being at time step t in the Susceptible (S), Infected (I) or Recovered (R) state.The infection of a susceptible individual due to a contact with an infected individual occurs with rate λ, while infected individuals recover in time with rate µ (heterogeneous epidemic parameters can be considered as well if necessary).In the epidemic propagation model, both the epidemic parameters and the temporal structure of the underlying contact network are assumed to be given and known, so that the individual transition probability of the corresponding Markov chain reads as follows and is the indicator function that is equal to 1 when x i t = X and zero otherwise, x t ∂i represents the set of individuals that are in contact with node i at time t.Defining the individual epidemic-state trajectory as x i = {x 0 i , . . ., x T i }, the probability of an epidemic realization x = {x 1 , . . ., x N } between time 0 and time T is where p(x 0 1 , . . ., x 0 N ) is a prior distribution on the initial state, which is usually assumed to be factorized, i.e. p(x 0 1 , . . ., x 0 N ) = i p(x 0 i ).
Suppose that some information about the individual states at different times is available (e.g because individuals exhibit symptoms or undergo medical tests).We will assume that this information comes in the form of a set of independent observed variables O r following known probabilistic laws p r (O r |x tr ir ).For example, if an individual i r has been observed in state X r at time t r , we will have p r (O r |x tr ir ) = 1 x tr ir = X r .False negative and positive rates in tests can be easily represented generalizing the expression of p r .
Under the assumption of independence between observations, the conditional probability of , and the posterior probability of an epidemic cascade x given the observation O becomes where Here, x ∂i represents the set of individuals that have been in contact with i at least one time during the interval [0, T ].The quantity Z = x i Ψ i (x i , x ∂i ) is the normalization constant of the posterior probability or model evidence.Several quantities of interest can be computed from the posterior distribution.
For instance, the problem of identifying the initial source of an outbreak in a population (the so-called patient-zero problem) requires to estimate the marginal probability p(x 0 i = I|O) = x 1(x 0 i = I)p(x|O) for every individual i.On the other hand, if the present time is T , the marginal probability p(x T i = I|O) provides a measure of the current epidemic risk for every individual i.The exact computation of probability marginals requires a sum over all admissible realizations of the process, which becomes unfeasible for more than a few dozens of individuals because their number typically grows exponentially with the number of individuals.In the next paragraph, we describe a method to approximately compute the marginals of the posterior distribution in Eq.6 using Autoregressive Neural Networks (ANNs).
Learning the posterior probability using autoregressive neural networks.Given a realization x of the epidemic process and a permutation π = {π 1 , π 2 , . . ., π N } of the individuals of the system, which imposes a specific ordering to the variables {x i }, the probability of the realization x can be written as the product of conditional probabilities (chain rule) in the form where x i = {x 0 i , . . ., x T i } and x <i = {x j |π j < π i } is the set of epidemic-state trajectories of individuals with label lower than i according to the given permutation π.The distribution p(x) can be approximated by a trial distribution q θ (x) with the same conditional structure which can be interpreted as a (possibly deep) autoregressive neural network depending on a set of parameters θ = {θ i }.From the analytical expression of the probabilistic model p(x), and thus that of the posterior distribution p (x|O) defined in Eq.6, the operation of parameters learning can be performed using a variational approach proposed in Ref. (36), in which the (reversed) Kullback-Liebler (KL) divergence is minimized with respect to the parameters θ of the trial distribution q θ (x).The minimization of the KL divergence can be performed using standard gradient descent algorithms in which the gradient of the KL divergence with respect to the parameters of the trial distribution reads where we used that x q θ (x) ∇ θ log q θ (x) = x ∇ θ q θ (x) = 0 and log(Z) x ∇ θ q θ (x) = 0 due to the normalization x q θ (x) = 1.The computational bottleneck of this calculation is that in the equations 9 and 12, the sum runs over all possible epidemic realizations, a set that grows exponentially with the size of the system.This issue is avoided by exploiting the generative power of autoregressive neural networks by training them using generated sample data through ancestral sampling.This means that the averages over the autoregressive probability distribution can be approximated as a sum over a large number of independent samples extracted from the autoregressive probability distribution q θ , in which the conditional structure of the autoregressive neural network allows to use the ancestral sampling procedure (47), see Fig. 2.
A common way to represent the conditional probabilities in Eq.8 is by means of feedforward deep neural networks with sharing schemes architectures (35,48) to reduce the number of parameters.Due to the possible high variability in the dependence of p(x i |x <i ) on x <i (40), instead of adopting a sharing parameters scheme we reduce the number of parameters by limiting the dependency of the conditional probability to a subset of x <i .The subset considered is formed by all x j ∈ x <i such that x j is at most a second-order neighbor of i in the graph induced by the contact network, i.e. the one in which there is an edge between two individuals if they had at least one contact during the epidemic process.The approximation employed turns out to be exact when the interaction graph is acyclic (see SM for a proof).The function to minimize, the Kullback-Leibler divergence in Eq.9, could attain large (or even infinite) negative values, causing convergence issues in the parameter learning process.As illustrated in the SI, this is avoided by introducing a regularization parameter, a fictitious temperature, and an annealing procedure to improve the convergence.
Inferring the parameters of the propagation model.In a real case scenario, the epidemic parameters governing the propagation model are usually unknown and they should be inferred from the available data.Calling Λ the set of these parameters (e.g. for uniform SIR models Λ = (λ, µ)), the goal is to estimate them by computing the values Λ * that maximize the likelihood function given the set of observations O, i.e.
The quantity Z is the same normalization constant introduced in Eq.6, where the dependence on the parameters was dropped.Formally, Recalling that P (x|O) = Z −1 i Ψ i (x i , Λ) and thanks to Gibbs' inequality we have that where we first replaced the probability function P (x|O) with the variational probability distribution q θ (x) and defined the energetic and entropic terms Since S q does not depend from Λ, minimizing H q with respect to parameters Λ corresponds to maximizing log Z(Λ).The quantity H q and its derivatives w.r.t.Λ can be computed efficiently, in an approximate way, by replacing the sum over all configurations with the average on the samples extracted by ancestral sampling from the autoregressive probability distribution q θ .Therefore, we use the following heuristic procedure, inspired by the Expectation-Maximization (EM) algorithm, to infer the parameters, while minimizing the KL divergence between q θ and the posterior probability p(x|O).During the learning process, two sequential steps are performed: 1. Update the parameters {θ i } of the autoregressive neural network to minimize the KL divergence in Eq.9.
2. Update the parameters Λ to maximize the quantity H q .
These steps are repeated until the end of the learning process.

Results
As a preliminary illustration of the ability of the proposed Autoregressive Neural Network (ANN) to sample epidemic realizations from a given posterior distribution, we reconsider the example in Fig. 1, focusing on the blue epidemic cascade.We train the ANN to learn the posterior probability composed by the prior, i.e. the epidemic model that generates the blue cascade, and the evidence, i.e. its final configuration at day 12.The result is shown in Fig. 3.The epidemic cascades generated by the ANN have Hamming distances from the reference one that reduce to zero at day 12 (central-bottom plot) and a fraction of them have prior probabilities larger than the probability of the (blue) epidemic cascade taken as reference, right-bottom plot in Fig. 3.This example suggests that the ANN approach can generate epidemic cascades compatible with the observations and sampled according the prior epidemic model.
In the following, we exploit the ability of the ANN to sample epidemic cascades from a posterior distribution to tackle three challenging epidemic inference problems: i) the patientzero problem, in which the unique source of a partially observed epidemic outbreak has to be identified, ii) the risk assessment problem, in which the epidemic risk of each individual has to be estimated from partial information during the evolution of the epidemic process, and iii) the inference of the epidemic parameters.Results are compared with those obtained using already existing methods in the field of epidemic inference.We also evaluate how the efficiency of the ANN algorithm depends on the size of the epidemic outbreak, measuring the number of generated epidemic samples necessary to obtain nearly optimal results.The comparison between different inference techniques, the Autoregressive Neural Network (ANN), a Belief Propagation based approach (SIB) (17, 33), together with the Soft Margin estimator (SM), is carried out on both random graphs and real-world contact networks.The Soft Margin (SM) estimator ( 21) is based on Monte Carlo methods in which samples are weighted according to the overlap between the observations and the generated epidemic cascade (see SM for details).The Belief Propagation approach (23), implemented in the SIB software (27,49) provides exact inference on acyclic contact networks and performs very effectively on sparse network structures.In the present work, we focus instead on real-world contact networks, which turns out to have relatively dense interaction patterns.The first contact network, taken from the dataset InVS13 (50), is related to a work environment (work), while the second one was collected in a hospital (hos-pital) (22).In both cases, the dataset used is the temporal list of contacts, respectively between 95 and 330 individuals, for a period of two weeks.Since the real duration δ t i,j of each contact is known, the probability of infections between individuals i, j at time t is computed as , where γ is the rate of infection.For comparison, we also consider synthetic (AUC) obtained in all cases considered.The improvement is also evident when analyzing the fraction of true patients zero correctly identified by each algorithm (left bar plots in Fig. 4).As an example, in the hospital case, ANN identifies correctly the patient zero in the 74% of the instances, SIB in the 54% and SM in the 35% of them.In all cases, the performances the ANN algorithm are comparable or better than those of the other approaches.The results on the patient zero problem reveal the ability of the ANN algorithm to efficiently generate epidemic cascades according to the posterior probability defined in Eq. 6.
Scaling properties with the size of the epidemic outbreak.From the results presented in the previous paragraph, Autoregressive Neural Networks seems to be very effective in tackling classical epidemic inference problems, particularly on dense contact networks, where the performances of BP-based methods are expected to decrease.It is, however, critical to check how the convergence property of the learning processes scales with the size of the epidemic outbreak.
For this analysis, we consider the patient zero problem on a tree contact network with a unique epidemic source and where the state of the system at the final time T is fully observed.With this choice, we ensure that the probability marginals computed by the SIB algorithm are exact; hence they can be taken as a reference to compare the performances of the other algorithms.
The ANN algorithm with a second-order neighbors approximation is exact when the interaction graph is acyclic (see SM for details), assuming that the architecture of the neural networks used is sufficiently expressive to capture the complexity of posterior probability.On the other hand, since the SM algorithm based on a Monte Carlo technique, it can give estimates of marginal probabilities with arbitrary accuracy when a sufficiently large sample of epidemic cascades is generated.
In the case of complete observation of the final state, the larger the epidemic size (i.e., the total number of infected individuals at time T ), the larger is the number of epidemic cascades that are compatible with the observation.For instance, in an epidemic realization of duration T time steps in which n I individuals are tested infected and N − n I tested susceptible at time T , the number of epidemic configurations compatible with the observations scales as T n I .Both ANN and SM rely on sampling procedures, so their performances could suffer from convergence issues when the epidemic size (n I ) increases.We compute the total number of samples generated by the ANN during the learning process and the number of samples of epidemic cascades generated by SM in the Monte Carlo procedure.In both cases, we assume that convergence is is the estimated marginal probability that individual i is infected at the initial time according to each method.The results on the scaling properties of the ANN and SM as a function of the epidemic size on a tree contact network with degree and depth both equal to 6 (tree) are shown in Figure 5.Here we set the duration of the epidemic cascades to T = 15 days.The ANN algorithm has a quasi-linear dependence with the epidemic size; conversely, the SM algorithm exhibits a very sharp increase in the number of simulations necessary for good estimates of the marginals, and already for epidemic sizes of order ten individuals, good estimates are difficult to obtain.
Epidemic risk assessment.The risk assessment problem consists in finding the individuals who have the highest probability of being infected at a specific time given a partial observation O.In particular, we consider here a realization of the SIR model with µ = 0 (i.e.only the states S and I are available) where half of the infected individuals are observed with certainty at the final time T .The results of the risk-assessment analysis obtained by means of the ANN algorithm are compared with those provided by the SIB algorithm and two other methods recently proposed in (27).The Simple-Mean-Field (SMF) algorithm is based on a mean-field descrip-rrg 1 src rrg 2 src proximity 1 src work 1 src ANN 0.710 ± 0.010 0.670 ± 0.009 0.734 ± 0.010 0.889 ± 0.005 SIB 0.710 ± 0.010 0.671 ± 0.009 0.732 ± 0.010 0.886 ± 0.005 SMF 0.704 ± 0.010 0.671 ± 0.009 0.724 ± 0.009 0.796 ± 0.007 CT 0.685 ± 0.009 0.659 ± 0.008 0.711 ± 0.008 0.790 ± 0.006 Table 1: Epidemic risk assessment results.Area under the Receiving Operating Characteristic (ROC) curves for the risk assessment problem on random regular graphs (rrg) with 1 and 2 sources, on the proximity random graphs and work real-world contact network.The results are averaged over 100 different epidemic cascades generated with the same epidemic parameters.For each case, the ROC curve for the classification of the unobserved infected individuals at the final time is computed.In the rrg case, the epidemic parameters are λ A measure of the ability to correctly identify the unobserved infected individuals at the final time T is represented by the area under the Receiving Operating Characteristic (ROC) curve.This quantity, averaged over 100 instances of the epidemic realizations, is reported, for the methods considered above, in Table 1, for different contact networks (rrg, proximity and work).All algorithms perform similarly on random graphs, whereas ANN and SIB outperform the other two methods in the case of the work contact network.
Epidemic Parameters inference.The parameters Λ governing the epidemic process can be simultaneously inferred during the learning process of the ANN algorithm using a heuristic method inspired by Expectation Maximization (see Materials and Methods).Other iterative algorithms, such as SIB, can incorporate such a parameter likelihood climbing step during their convergence (51).A comparison between the performances of the two algorithms in learning the infectiousness parameter governing the spreading process on different contact networks (tree, rrg, proximity and work) is displayed in Fig. 6 (left plot), in which we adopt the same setting of the patient-zero problem where the states of all individuals are known at the final time T .The ANN algorithm largely outperforms SIB in rrg and proximity graphs, obtaining comparable results for the tree and work instances.We also test the performance of parameters inference in a more challenging scenario where the population is split into two classes, with two different rates of infections γ 1 , γ 2 (which could correspond, for instance, to a simplified scenario of vaccinated/not-vaccinated individuals).The states of all individuals at final time T = 14 are observed for ten epidemic cascades on the hospital contact network.Then we infer the parameters with two different epidemic models: in the first one, the population is correctly divided (we call this the true model); in the second, we split the population randomly (null model).The goal is to verify whether the true model has a larger likelihood than the null model, that is it can better explain the observations.In the central plot of Fig. 6, we observe how well the true model can infer the correct values of the infections rate of the two sub-populations.As expected, the two values of γ inferred with the null model are similar to each other but different from the correct ones.From the rightmost plot of Fig. 6, we observe that the log-likelihood of the true model is much larger of the one of the null model, indicating the former better explains the observations.This example shows how the ANN approach can therefore be used to select the epidemic model that best explains the observations based on the estimate of their log-likelihood.

Discussion
This work shows how significant individual-based epidemiological inference problems defined on contacts networks can be successfully addressed using autoregressive neural networks.In problems such as patient zero detection and epidemic risk assessment, the proposed method exploits the generative power of autoregressive neural networks, in order to learn to generate epidemic realizations that are sampled according to the epidemic model and, at the same time, are compatible with the observations.When the parameters of the model are unknown, it is also capable of inferring them during the learning process.The approach is flexible enough to be easily applied to other epidemic inference problems and with different propagation models.
The proposed architectures for the autoregressive networks significantly reduce the number of necessary parameters with respect to the vanilla implementation.Moreover, convergence properties are improved by means of a regularization method which exploits the introduction of a fictitious temperature and an associated annealing process.
According to the results obtained on three different problems (patient zero, risk assessment, and parameters inference) on both synthetic and real contact networks, the proposed method equals the currently best methods in the literature on epidemic inference, outperforming them in several cases.In particular, the ANN approach is computationally less demanding than standard Monte Carlo methods and very effective also on dense contact networks, where more efficient algorithms based on message-passing methods might experience convergence issues.
For this reason, the method seems very promising for epidemic inference problems defined in small communities such as hospitals, workplaces, schools, cruises, where contact data could be available.In such contexts it could be used to detect the source of an outbreak, measure the risk of individuals being infected to improve contact tracing, or to estimate the channels of contagion as well as the infectivity of classes of people thanks to the possibility of inferring the propagation parameters.
Even though showing suitable scaling property with the system size, the proposed approach, reasonably, needs new architectural schemes and learning processes to be applied in epidemic inference problems regarding more than several thousand individuals.Work is in progress in this direction, possibly guided by symmetries and regularity of the prior epidemic models.ANN 1 q (x 1 ) < l a t e x i t s h a 1 _ b a s e 6 4 = " < l a t e x i t s h a 1 _ b a s e 6 4 = " < l a t e x i t s h a 1 _ b a s e 6 4 = " p g h n w P w 4 O Y g C + m 3 5 0 < l a t e x i t s h a 1 _ b a s e 6 4 = " F 6 i 3 a U i n t w 1 n 8 f p l 0 6 j X n r N a 4 b l S b 9 X k z R X J I j s g J c c g 5 a Z I r 0 i J t w s k 9 e S L P 5 M V 6 t F 6 t N + v 9 Z 7 R g z T M H 5 A + s j 2 / h E 5 j v < / l a t e x i t > x 3 < l a t e x i t s h a 1 _ b a s e 6 4 = " 4 5 Figure 2: Ancestral sampling of epidemic cascades.Left.Ancestral sampling of epidemic cascades using artificial neural networks.For each individual i there is a neural network ANN i that computes the probability q x i |x i−1 . . .x 1 of its time trajectory x i given the time of previous individuals.The time trajectory x i is extracted from the conditional probability q x i |x i−1 . . .x 1 and passed to the following neural networks.Right.Each neural network is composed of several fully connected layers (see SM for details).Epidemic cascades generated by the ANN trained on a posterior probability composed by a prior, the epidemic model that generate the blue cascade, and the evidence, its final configuration at day 12.The contact network is a real contact graph measured in a hospital (22).An example of the epidemic cascade generated is shown in second row (1 AN N , orange).The third row represent the hamming distance between them (see caption   We consider the estimation of the marginal probabilities to be infected at time zero with interactions graphs given by a tree of degree and depth both equal to 6 and spanning 15 days of duration.The epidemic cascade are generated with µ = 0 and different values of λ (λ ∈ [0.1, 0.6]).For the ANN algorithm, we consider the number of samples generated during the learning process, that is 10 3 samples for each annealing steps (see SM for details).For each instance, we run the annealing process with 2 n number of steps with n ∈ {5, 6, . . ., 18}.Each point is a single instance, if the algorithm converges between 2 n−1 and 2 n steps, the number of samples reported in the plot is the number of steps multiplied by 10 3 samples extracted at each step.For the SM algorithm, each point in the plot is the average number of simulated epidemics necessary to reach convergence to a good estimate of the marginals (worst 10% results were discarded).No point is reported when more than ten infected individuals are observed, because more than 10% of the instances did not converge within 2 • 10 8 simulated epidemics.overcome this issue, an annealing procedure is adopted, in which a fictitious inverse temperature β is introduced in the computation the gradient of the KL divergence, The minimization procedure starts with β = 0, where all the configurations are allowed with uniform probability, then the parameter β is slowly increased until it reaches β = 1, at which the original expression of the loss function is recovered.

Neural network architecture
In the proposed approach, each conditional probability function p i (x i |x <i ) has to be approximated with a neural network q θ i i (x i |x <i ).For monotonic models such as the SIR epidemic model, the state-space representation of temporal trajectories as a sequence x i ∈ X T +1 of T + 1 individual states is redundant, and it turns out to be more efficient to represent them by the time instant t I i ∈ (0, T +1) at which individual i becomes infected and time instant t R i ∈ (0, T +1) at which it recovers, the time T + 1 corresponding to the case of no infection or recovery occurred for i in the interval [0, T ].More precisely, the case t I i = T + 1, t R i = T + 1 corresponds to individual i being susceptible at every time of the epidemic process, and t I i = T + 1, t R i = T + 1 corresponds to individual i being infected but not recovering in the time interval [0, T ].
This parametrization reduces the state space that is explored by our method, decreasing the number of incorrect time trajectories generated (for instances, time trajectory with transitions from state I to S).
In order to compute the probability of a trajectory of a single individual, we employ two neural networks, one for the time of infection, and one for the recovery time.Each of these networks is given as input a subset of the time trajectories of individuals with lower sorting index π i (see next section), and the one of the recovery time is also given as input the infection time of the same individual.The input of the networks are the infection and recover times of previous individuals, encoded in one hot scheme.Therefore, the (conditional) probability distributions represented by the networks are, for each individual i, q for the infection times, and q for the recovery times, where θ i,I and θ i,R are the weights of each network.
In our implementation, each neural network is a multi-layer perceptron (MLP) composed of three hidden layers plus one output layer; each layer is fully connected, and can be written as where L k is the input vector (output of layer k), W k ∈ θ is the matrix of the weights, b k ∈ θ is the bias vector, and σ is the activation function, which is Relu for the hidden internal layers and Softmax for the last layer.The width of each layer (the number of neurons) varies linearly from the input size of each network to the output size.
The size of the input and outputs of the neural networks used for each individual depends on the observations and the contact graph in the following way: a) the input size depends on the number of individuals with lower index that are considered (depending on the approximation made, see section 2.1) and b) the observations made on an individual restrict the phase space of the possible instants of infection (for example, if we observe individual i is in the infected state at time t O , this means that his/her infection time t I i ≤ t O and the recovery time t R i > t O ).This last condition affects the size of the output of each network, but also the input size of the networks for the individuals with higher index, whose network depends on it: for example, if individual i is observed in state S at the final time (T ) with certainty, this implies that the infection and recovery times are fixed, t I i = T + 1 and t R i = T + 1, and the network will always give the same value for them.Thus, in this case, for all individuals j with π j > π i , the dependency on i will effectively disappear.

Graph approximations
We now discuss the approximation to the dependency of the conditional probabilities for the individuals' trajectories.The probability of a whole epidemic cascade can be written in the following autoregressive expression: The dependencies in the factors q θ i i (x i |x <i ) can be reduced, in the case of acyclic contact networks, to only those corresponding to the next-nearest neighbors with lower index (we relegate the derivation to section 3), but in the general case of contact networks with cycles it is just an approximation, which we employ for all contact networks analyzed in the present work.
In order to check this approximation, we have run several tests on the patient zero problem (see main text for the definition), applied to the proximity random contact graphs.For this purpose, we train the ANN either (a) considering the full dependencies on all nodes with lower permutation index (we call this version full graph), (b) consider only the dependency on the first and second neighbors (called next nearest neighbors approximation) with lower permutation index, which is the approximation used in all the rest of this work, (c) considering only the first neighbors (nearest neighbors) with lower permutation index, or (d) ignoring the dependency on the rest of the graph (called mean field approximation).
Figure 1 shows how the accuracy in finding the patient zero is influenced by the approximation chosen, while in figure 2 the reduction of the number of parameters in the network, with respect to the full graph case, is shown.We see that the next nearest neighbors approximation gives estimates which are on par with the full graph case, while employing less parameters, thus reducing the space and time needed for the training.
Figure 1: Accuracy in finding the patient zero with different approximations of the dependencies between individuals, compared with sib (Belief Propagation) method and random guessing of the source among the infected.The "full graph" case is when no approximation on the dependencies is made.The average is made over 100 epidemic cascade on contacts network generated from a proximity random graph (see main text) with N = 100 individuals.The epidemic parameters used fro the generation are λ = 0.03 and µ = 0.02, with T = 15 time instants.The legend also shows the Area under each curve (AUC).
Figure 2: Percentage of parameters needed, on average, when considering the approximation of the dependencies between individuals, with respect to the case of no approximation as described in section 2.1.See figure 1 for the accuracy and generation details.

Training procedure
To train the whole network, we do 10 000 steps of the annealing procedure described in the main text (except for the hospital case where we do 20 000 steps), increasing β linearly from 0 to 1.At each step, we generate 10 000 epidemic cascades before updating the weights using the ADAM optimizer (1) with learning rate lr = 0.001.
In the case of the risk inference problem, we also include a prior probability in the model, with strength decreasing linearly with β.This prior is computed from the probabilities of each individual being sampled as S, I or R at the last time instant t = T , and it is designed in such a way that, when β = 0, the probabilities for the three states are equal.
The running time of the ANN algorithm on a single instance ranges from one hour to two days on a single GPU (Nvidia TITAN RTX), depending on the structure of the observations O and that of the underlying contact network.Our implementation using the PyTorch framework (2) is publicly available in the repository (3).The results presented in the main text and in SI can be reproduced using the following repository (4).

Simplified Conditional probability on acyclic graph
In this section we show that in the case of epidemic spreading in a acyclic interaction networks, it is possible to restrict the dependence of previous nodes in the conditional probabilities Eq.7, to the second neighbors with lower index.Let us consider the configuration x of N variables, a probability distribution p(x) = 1 Z a ψ a (x a ) factorized over a set of factors {a}.We first demonstrate the following statements: Lemma 1 (Markov blanket).Let p (x) = 1 Z a ψ a (x a ), and I ∪ J ∪ K = {1, . . ., N } be disjoint and assume no factor depends on x I and x K simultaneously.Then p( Proof.Considering proportionality ∝ with respect to x I only p (x I |x J , x K ) ∝ p (x I , x J , x K ) ∝ a ψ a (x a ) ∝ a∈∂I ψ a (x a ) and p (x I |x J ) ∝ p (x I , x J ) ∝ x K a ψ a (x a ) ∝ a∈∂I ψ a (x a ).As both distributions are normalized wrt x I , they must be equal.This implies that where the last line follows from the derivation above.
Lemma 2 (Separated neighborhood).Let p (x) = 1 Z a ψ a (x a ), and G = (V ∪ A, E) be the associated bipartite factor graph, with E = {(i, a) ∈ V × A : ψ a depends on variable x i }.Let I ∪ J ∪ K ⊆ {1, . . ., N } be disjoint and assume that every path from I to K in G must pass through J (equivalently, removing vertices in J leave I and K separated).Then p(x I , x K |x J ) = p(x I |x J )p(x K |x J ), and p(x I |x J∪K ) = p(x I |x J ) Proof.Let I be the connected component of I in G\J and K = V \(I ∪J).As all paths from I to K pass through J, no factors can depend on variables both in I and in K .By Lemma 1, Corollary 1 (Restricted autoregression).By calling I = {i}, J = {j < i : j ∈ ∂i} and K = {j < i : j / ∈ ∂i}, we obtain that for an ordering of nodes such that J separates i from K we get p(x i |{x j : j ∈ ∂i, j < i}) = p(x i |{x j : j < i}).
The last corollary is defined for a single node i.In considering the problem of approximating the posterior probability of an epidemic spreading process we distinguish two graphs: the first one, the contact graph, the nodes are the time trajectory of the states x i of each individuals and the edges are the contacts between them.The second, the factor graph, has as nodes, again, the time trajectory of individuals and as factors those ({Ψ i }) in the equation 6 in the main text.If the contact graph is acyclic and the nodes are topological ordered 1 then, thanks to the corollary 1, the following identity holds: where x ∂ 2 <i define the sets of nodes up to the next nearest neighbors (in the contacts graph) with index lower than i.The set of nodes x ∂ 2 <i in the contact graph correspond to the x ∂<i in the factor graph.To better visualize the two graphs, in Fig. 3 we show two cases of contact acyclic graphs.The first one represents a linear chain defined by the contacts among individuals.The corresponding contact graph is acyclic but the factor graph contains cycles.In this case, for instance, considering the conditional probability of node 7 we have that P (x 7 |x <i ) = P (x 7 |x ∂ 2 <i ) = P (x 7 |x 4 , x 6 ).The node 4, 6 separated the nodes 7 from the previous nodes in the factor graph, as the corollary 1 require.Similarly, the same concept apply to the second case, where we consider a tree graph.In both cases the nodes are ordered topologically so the equation ( 4) holds for each conditional probability.

Soft margin estimator
We also use the Soft Margin estimator from (6) for finding the patient zero.This is a Monte Carlo based estimator, which applies Bayes formula to estimate the source probability P (s = i): given an epidemic x(i) which is the result of the simulation, starting from individual i, and the set of observation x O on the epidemic, the probability of node k being the source of the epi-1 A topological ordered graph is such that for each pair of nodes (i, j) with i < j, we have that d i ≤ d j , where d i is the distance of node i from the node zero.In practice, defining a starting node (a root) we enumerate the nodes according to the distance from the root (in a acyclic graph there is a unique path connecting each nodes) (5).We have extended the method to the SIR model.To evaluate P (x O | s = k), we run M Monte Carlo simulations in which the source of the epidemic is k and compare the resulting epidemic cascade x i with the observations.For this we use the Jaccard similarity function φ X (x i , x O ), relating how many individuals are in state X (either infected, X = I or recovered, X = R) in the generated cascade and the observations.Clearly, if no individuals are observed in state X, then φ X = 1 regardless of the realization.The probability of observing a certain configuration, given the source of the epidemic, can then be written as: contact networks: a random regular graph (rrg) with N = 100 individuals and degree equal to 10, and a random geometric graph (proximity), in which N = 100 individuals are randomly placed on a square of linear size √ N .In the latter, the probability that individuals i and j are in contact is e −d ij /l , where d ij is the distance between i and j and l is a parameter (set to l = 10) that controls the density of contacts.For both synthetic contact networks, we generated epidemic processes (SIR epidemic cascades, see Material and Methods for details) with a duration of 15 days.The patient zero problem.Given the exact knowledge of the final state of the epidemics at time T , the patient-zero problem consists in identifying the (possibly unique) source of the epidemics.In a Bayesian framework, this problem can be tackled by computing for each individual the marginal probability of being infected at time t = 0 given a set of observations O.This quantity can be estimated from the posterior distribution (Eq.6 in the Material and Methods) with all three algorithms (ANN, SIB and SM) considered in this work.For each contact network (rrg, proximity, work, hospital), results were computed averaging over 100 realizations of the epidemic model.The three algorithms were used to rank infected and recovered individuals in decreasing order according to the estimated probability of being infected at time zero for each epidemic realization.Fig. 4 displays, for each algorithm, the normalized cumulative number (over 100 instances) of patient-zero individuals actually found as a function of the accordingly ranked fraction of infected or recovered individuals.The ANN algorithm outperforms all the other methods as indicated by the larger values of the area under the curve

( 1 )
rrg = 0.035 for the single source and λ (2) rrg = 0.03 for the double source case.For the proximity random graphs, λ prox = 0.03.In the case of the work network, the model has rate of infection γ work = 10 −3 .tion of the epidemic process in which information about the observed individuals is heuristically included.The Contact Tracing (CT) algorithm computes the individual risks according to the number of contacts with observed infected individuals in the last τ = 5 time steps (days).

Figure 1 :
Figure1: Simulated epidemic cascades in a hospital contact network.One thousand epidemic cascades simulated (with the same epidemic parameters) on a real contact graph measured in a hospital(22) (detailed information about epidemic models and contact networks are listed in the results section).The epidemics started from the same individual.Two samples (blue, and orange) of epidemic cascades are shown in the first and second rows of the figure.The third row represents the distance between them, where in this case the blue dots are the infected individuals present in the cascade 1 but not in cascade 2 and the orange ones are those present in cascade 2 but not in cascade 1.In the third row, the total number of blue and orange dots gives the Hamming distance between the two daily configurations.Left-bottom plot.Cumulative number of infected individuals for 1000 epidemic cascades started from the same individual.Right-bottom plot.Hamming distance (δ(1, i)) between the cascade 1 and all the others i ∈ [2, 3 . . .1000].
r k W a D 9 C e 0 I D I s s u B C k U 5 0 w w R t K C M 8 1 S M 0 8 D y a R 7 u / P e L p F E u u W e l y n W l U C 3 P k s m R Q 3 J M i s Q l 5 6 R K r k i N 1 A k n D + S J P J M X 6 9 F 6 t d 6 s 9 5 / S J W v W c 0 D + w P r 4 B r b A n b A = < / l a t e x i t >

Figure 3 :
Figure3: Epidemics cascades generated by the ANN.Epidemic cascades generated by the ANN trained on a posterior probability composed by a prior, the epidemic model that generate the blue cascade, and the evidence, its final configuration at day 12.The contact network is a real contact graph measured in a hospital(22).An example of the epidemic cascade generated is shown in second row (1 AN N , orange).The third row represent the hamming distance between them (see caption Fig.1.)Left-bottom plot.Cumulative number of infected individuals for epidemic cascade simulated (blue curve) and generated by the ANN (i AN N ∈ [1, 2 . . .1000].Central-bottom plot.Hamming distance (δ(1, i AN N ) between blue epidemic cascade and those generated by the ANN i AN N ∈ [2, 3 . . .1000].Right-bottom plot.Distribution of the values of the prior probability of the generated epidemic cascades (P AN N ).The blue vertical line is the value of the prior probability of the blue cascade (log(P i ).

Fig. 1 .
Figure3: Epidemics cascades generated by the ANN.Epidemic cascades generated by the ANN trained on a posterior probability composed by a prior, the epidemic model that generate the blue cascade, and the evidence, its final configuration at day 12.The contact network is a real contact graph measured in a hospital(22).An example of the epidemic cascade generated is shown in second row (1 AN N , orange).The third row represent the hamming distance between them (see caption Fig.1.)Left-bottom plot.Cumulative number of infected individuals for epidemic cascade simulated (blue curve) and generated by the ANN (i AN N ∈ [1, 2 . . .1000].Central-bottom plot.Hamming distance (δ(1, i AN N ) between blue epidemic cascade and those generated by the ANN i AN N ∈ [2, 3 . . .1000].Right-bottom plot.Distribution of the values of the prior probability of the generated epidemic cascades (P AN N ).The blue vertical line is the value of the prior probability of the blue cascade (log(P i ).

Figure 4 :
Figure 4: Results of the patient zero problem.For each case, the right plot shows the fraction of patient-zero discovered (in 100 different epidemic cascades) as a function of the fraction of infected or recovered individuals ranked according to the probability to be patient zero given by the three algorithms ANN, SIB, and SM (the values of the area under the curve [AUC] are shown in the insets).The left bar plots represent the fraction of patients zero correctly identified at the first position of the ranking given by the algorithms.For the rrg we consider the following epidemic parameters λ = 0.04 and µ = 0.02 and for proximity λ = 0.03, µ = 0.02.The epidemic parameters for (work) and (hospital) are respectively γ = 10 −3 , µ = 0.02 and γ = 2 • 10 −4 , µ = 0.02.

#Figure 5 :
Figure5: Scaling properties with the size of epidemic cascades.Number of samples generated by the ANN and SM algorithms to reach convergence.We consider the estimation of the marginal probabilities to be infected at time zero with interactions graphs given by a tree of degree and depth both equal to 6 and spanning 15 days of duration.The epidemic cascade are generated with µ = 0 and different values of λ (λ ∈ [0.1, 0.6]).For the ANN algorithm, we consider the number of samples generated during the learning process, that is 10 3 samples for each annealing steps (see SM for details).For each instance, we run the annealing process with 2 n number of steps with n ∈ {5, 6, . . ., 18}.Each point is a single instance, if the algorithm converges between 2 n−1 and 2 n steps, the number of samples reported in the plot is the number of steps 2 n +2 n−1

Figure 6 :
Figure 6: Inference of epidemic infectiousness parameters.Left plot.Average relative error in the inference of the infectiousness parameters over ten epidemic cascade per interaction graph.On tree, rrg and proximity networks, the discrete-time SIR model has infection probability respectively equal to λ tree = 0.35, λ rrg = 0.04, λ proximity = 0.03.The work case has rate of infection γ work = 10 −3 .The initial conditions for the parameter learning process were set to λ init = 0.5 for tree, λ init = 0.1 for RRG and proximity networks and to γ init = 10 −2 for the work network.Central plot.Box plot for the case of two classes of individuals with different rate of infection γ 1 , γ 2 inferred by the ANN.We consider two inference model where the population is divided according the propagation model (true model) and randomly (null model), see the text for details.The true model is able to correctly infer the parameters with only ten different epidemic cascade.Right plot.Box blot of the log-likelihood difference between the true and null model.

Figure 3 :
Figure 3: Two example of contact graphs ordered topologically.The two cases represent a linear chain on the left, and a tree on the right.The square represents the factor Ψ i of the posterior probability of the dynamical process we want approximate.The circle represents the time trajectory of the states x i of each individuals.The dashed lines are the contacts among individuals, and in both cases they generate a acyclic contact graph.Instead For instance the nodes 7 is separated (in the factor graph) from the rest of graph containing the nodes with minor index by the orange nodes.