Introduction

The current outbreak of coronavirus disease 2019 (COVID-19) has become a global crisis due to its quick and wide spread over the world. As of August 24, 2020, the outbreak of COVID-19 has caused 84,981 confirmed cases and 4634 fatalities in Mainland China1. For the purpose of control and prevention, various containment measures have been implemented in Mainland China since January 19, 2020, including traffic restrictions, contact tracing, mandatory face masks in public spaces, entry or exit screening, isolation, quarantine and awareness campaigns. Especially on January 23, 2020, a strict travel restriction was introduced in Wuhan, Hubei province, and the city has been locked down since then2.

A good understanding of the epidemic dynamic would greatly enhance the control and prevention of COVID-19 as well as other infectious diseases, while dynamic model is probably one of the oldest mathematical tools to study the law of epidemic development whose history can be traced backed to the well-known Susceptible–Infected–Removed (SIR) model proposed in 1950s3. Due to the usefulness and advantage in prediction, and especially inference, SIR and its modified models are still widely applied in the study of SARS4, H1N15, and particularly the COVID-19 pandemic6,7,8,9. We hereby present a brief review on some of the representative works as follows.

Recently, Tang et al.10 proposed a deterministic compartmental model by taking the clinical progression, epidemiological status, and the intervention measures into account. However, it implicitly assumed that the disease is not infectious during incubation period, which is not the case in COVID-19. In addition, it assumed that quarantine was implemented as soon as the infection occurred, which fails to reflect the inevitable latency brought by medical tracking. In the study of Wu et al.11, it proposed an extended SEIR model by considering transmissions among cities. However, it did not take the control measures into account such as tracing and quarantine. Furthermore, it also assumed that COVID-19 is not infectious before symptoms onset. For more similar or simple deterministic ODE models to COVID-19, we refer to Liu et al.12 for an overview. Yang et al.13 employed a discrete time difference equation (DE) model to predict the epidemic trend of COVID-19. The proposed model correctly took the infectious incubation into account. However, this model did not consider the time needed for medical tracking or the time lag between symptoms onset and diagnosis. Besides, the rationale behind the assumption of the equal transmission probability between symptomatic and asymptomatic virus carriers was questionable.

In contrast to the deterministic models (ODE or DE) summarized above, the transmission of disease between individuals in real world is inevitably random in nature. As a result, numerous stochastic dynamics models have been developed since the pioneering randomization of SIR model14. In fact, a deterministic ODE model can often be seen as the mean-field equation of the corresponding stochastic counterpart. Under certain conditions, the mean-field equation may represent the evolution of the expectation of the corresponding stochastic model. In some more generalized cases, the mean-field equation is a large scale approximation of the corresponding stochastic model, which can be seen as a process version of Law of Large Numbers. However, if the size of outbreak is not comparable to that of the total population, the randomness is more significant, and hence a stochastic model is a better choice to quantify the uncertainty in estimates and predictions in such case. Furthermore, stochastic dynamic model is also known for its expandability to incorporate individual variations15, or even spatial structures16, which may not be fully captured by its mean-field equations. To our knowledge, stochastic dynamic modeling for COVID-19 is yet relatively rare comparing to its deterministic counterparts, though preliminary approaches such as statistic exponential growth models were considered in recent studies17,18. Recently in the study of Chinazzi et al.19, an existing discrete time stochastic model was employed to estimate the “effect of travel restrictions on the spread” of COVID-19. However, the unique features of COVID-19, such as the infectious incubation and asymptomatic carriers, as well as control measures such as medical tracking, are still yet to be captured in their work.

To remedy the aforementioned issues in the existing studies, and depict a more realistic transmission mechanism, we propose a novel stochastic compartmental model which captures the unique transmission dynamics of COVID-19 and the effects of intervention measures implemented in Mainland China. Our proposed stochastic model aims to study the COVID-19 outbreak in the following aspects: (1) estimation of key epidemiology parameters; (2) prediction of epidemic development; (3) estimation of unobservable carriers and epidemic containment date; and (4) assessment of control measures.

The rest of this paper is structured as follows. In “Methods” section we describe the data used in this study, and introduce the proposed stochastic dynamic model and parameter estimations. “Results” section presents our findings. We discuss our results, advantages and limitations in “Discussion” section.

Methods

Data sources

Data used in this study include numbers of confirmed diagnosis, recoveries and fatalities in the following major provinces and cities of China: Beijing, Shanghai, Chongqing, Guangdong, Zhejiang and Hunan. These public available data were retrieved from local Health Commission based on a daily update20,21,22,23,24,25. The corresponding population of residents in each region is collected from China National Bureau of Statistics26. Note that we exclude Hubei province which was the epicenter due to the following reasons: (1) the medical resources in Hubei province were overburdened at the beginning of the epidemic, and not all individual with confirmed diagnosis could get immediate hospitalization; (2) the diagnostic criteria were changed overtime in Hubei which resulted in a massive surge of confirmed cases in mid February27; and (3) the fatality rate in Hubei province was much higher than other regions in China. These features distinct the dynamic model in Hubei from other regions of China, which should be considered in our future studies.

Model description

In our study, none of selected provinces and cities has more than 2000 accumulated confirmed cases by now (see table S3 in Supplementary F). These number, though alerting, are not comparable to the total population in provinces or cities, which are of an order 10 million100 million (see Table S4 in Supplementary F). Hence, a novel stochastic dynamic model is designed to capture the unique features of the COVID-19 outbreak, where the unique features here refer to

  1. 1.

    Infectious incubation period: unlike SARS, COVID-19 is infectious before symptoms onset28.

  2. 2.

    Large portion of asymptomatic virus carriers: it has been found that the proportion of asymptomatic infected population is non-negligible29.

  3. 3.

    Unprecedented contact control and medical tracking measures: various containment measures have been implemented in Mainland China since January 19, 2020; especially on January 23, strict travel restriction was introduced in Wuhan at an unprecedented scale, and the city has been locked down since then2; at the same time, great efforts have also been taken contact tracing and quarantine, for example, forty thousand close contacts was successfully tracked more than in Zhejiang Province30,31.

To our knowledge, these features have not yet been fully captured by the existing stochastic dynamic models for the epidemic.

Under mild assumptions that (1) motions of all individuals in the system are independent, and (2) the total population in the system is a fixed number of N, we propose a new stochastic model with state variables S, E, Q, IN, IH, R and D which stand for susceptible, exposed, quarantined, symptomatic, hospitalized, recovered and dead population respectively. Some states can be further divided into substates, see Supplementary A for more detail. Note that each individual can be classified into one of the above states at a specific time. The evolution of population size in each state over time forms a continuous time Markov Process can be described as follows:

  1. i.

    Infection: Every infected case in E or IN passes a pathogen to its secondary case at Poisson rate λE = λINθ or λIN respectively. To be specific, a primary case chooses an individual randomly from the total population, and the individual would be infected if it is of state S. At each transmission event,

    • with probability ρ, the secondary case would be symptomatic in the future at a Poisson rate of rs, meanwhile, this contact is traceable with probability q;

    • with probability 1 − ρ, the secondary case would NOT be symptomatic in the future meanwhile, this contact is traceable with probability q.

  2. ii.

    Quarantine: If the contact is traceable, the corresponding secondary case would be quarantined, namely lose its infectivity, at a Poisson rate of rq. Note we assume such individuals would be quarantined or hospitalized till recovery or death.

  3. iii.

    Hospitalization: Every symptomatic patient in IN would be admitted to hospital (IH) at a Poisson rate of rH. With probability pl, it would be a light/mild case and with probability 1 − pl, it would be a severe case.

  4. iv.

    Symptoms relief: A severe case relieves symptoms to light/mild symptoms at a Poisson rate of rb.

  5. v.

    Recovery: Asymptomatic patients, symptomatic but yet hospitalized patients and hospitalized patient with light/mild symptoms would recover at Poisson rates of γA, γIN and γIH respectively.

  6. vi.

    Death: Symptomatic but yet hospitalized patients and hospitalized patient with serve symptoms would die at Poisson rates of δIN and δIH respectively.

The process can be illustrated by Figure S1 in Supplementary A.

Estimation of model parameters

The proposed model in “Model description” section provides a comprehensive and realistic description to the transmission mechanism of the current outbreak of COVID-19. However, with limited information retrieved from the public available data, a state-collapsed version of the stochastic process in Fig. 1 is used for the purpose of parameter estimation, which would ease the identifications of the initial values in the model. See Supplementary B for rationale behind such simplification.

Figure 1
figure 1

States-collapsed version of the stochastic process.

The sizes of IH and a substate of R (namely RH in Supplementary A) over time t can be observed directly from the collected data, that is, the number of existing confirmed cases and reported recoveries at t respectively; while the remaining states are latent, namely not observable. Among the latent states, the initial value S(0) can be approximated by the population of permanent residents in the city or province, Eq(0) is zero as there was no quarantine implemented before January 23, 2020 and RN(0) can be set to any number as it would not affect the estimation and prediction of the model. However, the initial values, IN(0) and E(0), are also non-observable and could be a challenge to determine a priori32. In this study, IN(0) and E(0) are treated as unknown parameters and to be estimated together with other model parameters as described below.

There is a total of 9 model parameters in the proposed model for each selected region. They are λIN, θE, ρ, q, γIH, γA, rs, rq and rH, among which, rs, rq and rH are related to the clinical characteristics of the disease and can be prefixed through existing studies. To be more specific, rH is the inverse of the average time from symptoms onset to diagnosis, rs is the inverse of the mean incubation period, while rq is the inverse of mean difference between infectious period and serial interval. Based on preliminary trials, we find that there is very limited information of γA which can be obtained from the data, and the estimate is highly influenced by the choices of prior. A possible explanation is that γA is less related to the observations. Hence, instead of estimating γA with large uncertainty, we prefix γA = 1/10. Sensitivity analysis is conducted on the different choices of γA.

The rest of parameters would be estimated from the model. The parameters, ρ and θE, are directly related to the nature of the disease, and hence are considered as constants in China, while, λIN, q and γIH may vary in different regions depending the local medical resources, population densities and containment measures. Furthermore, it is more realistic to consider λIN and γIH as time varying parameters to reflect the effect of intervention measures and improvement of the medical treatment. In this study, a simple setting for the time varying function is used, that is, λIN(t) = 1{t<T1}λIN + 1{t>T1}IN and γIH(t) = 1{t<T2}IH + 1{t>T2}γIH. The time T1 is set to be January 29 as there was an obvious change of rates occurred on January 29 illustrated in Figure 2 of You et al.33 We use the observed \(\frac{{\Delta R_{{H_{t} }} }}{IH(t)}\) to approximate γIH on day t, where ∆RHt = RH(t + 1) − RH(t). The time T2 is selected to be the time when γIH has a significant change for each province or city. See Supplementary C for more detailed estimation method including the construction of likelihood functions.

Results

Parameter estimations

A summary of the estimated model parameters is given in Table 1, from which we find that

  1. 1.

    The estimate of ρ is not sensitive to the choice of γA, about 30% infected individuals are asymptomatic.

  2. 2.

    The estimate of θ decreases as γA decreases, but the change is not significant, patients with symptoms are about twice as likely to pass a pathogen to others as asymptomatic virus carriers.

  3. 3.

    The estimate of q increases slightly as γA decreases. Zhejiang has the highest q in the selected regions, which is consistent with the remarkable efforts made by the Government of Zhejiang, which till March 2, 2020 has successfully tracked more than 40,000 close contacts30.

  4. 4.

    The estimated initial populations for states E and IN in each region vary over different choices of γA, but are still within the same order of magnitude.

  5. 5.

    The estimates of λIN and a change slightly over different choice of γA.

Table 1 Parameter estimation.

Prediction of the epidemic trend

Based on the estimated parameters, trajectories of the epidemic are simulated using the proposed stochastic dynamic model. For each region, 1000 simulations are conducted to produce the 95% confidence interval for the epidemic evolution of some key populations. In this section, the predictions of populations of states, the containment time of the outbreak, the controlled reproduction number Rc, and a test on the effectiveness of the current medical tracking policy are reported.

Figure 2 plots the 95% confidence interval of:

  1. 1.

    accumulated confirmed cases, namely, the sum of IH,RH and D.

  2. 2.

    the population of state IH, representing the people in hospitals.

  3. 3.

    population of active virus carriers, consisting of the states E and IN.

Note that the first two populations can be directly observed, while populations of E and IN is not observable. Note that despite more data are now available since the first submission of this work, we decide to use data collected before February 22 for model fitting. This is because that the first wave of COVID-19 pandemic in China has been under good control since mid February, and the number of daily confirmed cases was under 5 in most provinces after February 22. The collected data after February 22 would be used for evaluation of the fitted model. In Fig. 2, the observed accumulated numbers of confirmed cases perfectly lie in the calculated confidence intervals of, while the number of standing hospitalized cases seems to be overestimated. A possible explanation to it is that the mean recover time was shortened at later stage due to the improvement in treatment. Nonetheless, our model provides a good understanding towards the transmission mechanism of COVID-19 in China.

Figure 2
figure 2

Predicted confidence interval for key states.

The containment time of the outbreak is defined as the time when the number of active virus carriers is, for the first time, less than a threshold Tc, here we let Tc = 10. Figure 3 shows the 95% confidence interval of the containment time of the outbreak for each region. Among the six regions in this study, Shanghai is predicted to have the earliest containment time of February 28, while the containment time in Guangdong is predicted to be the latest, around March 15. Comparing the prediction with the observed data, we find that in Beijing, Shanghai and Guangdong, the prediction is consistent while in Chongqing, Hunan and Zhejiang, the containment time is slightly overestimated.

Figure 3
figure 3

Prediction of the containment time of the outbreak. The containment time of 1000 simulations is plotted as a histogram and is fitted with normal distribution for each region. The y axis represents the density of containment time.

The controlled reproduction number, Rc, reflecting the transmission ability of the epidemic, is one of the most important quantities in epidemiology. We refer readers to Supplementary D for the approximation of Rc in this study. Simulation results for the approximated Rc in each region is in Fig. 4. In most provinces and cities our estimated Rc is between 2 and 3 before control measures, and it drops rapidly to about 0.2 between January 29 and February 1 in all selected regions.

Figure 4
figure 4

Predicted time-varying Rc curve.

Finally, we evaluate the effectiveness of the current medical tracking policy with a hypothetical controlled test by setting the probability of quarantine q = 0 in the proposed model with the rest of estimates unchanged. Under this setting, the epidemic would still be contained, due to the reduction of contact rate and diagnosis waiting time. However, there are significant delays in the dates of containment if q = 0, indicating the current medical tracking policy contribute significantly to the containment of the epidemic (see Fig. 5).

Figure 5
figure 5

Prediction of the containment time of the outbreak with q = 0. The containment time of 1000 simulations is plotted as a histogram and is fitted with normal distribution for each region. The y axis represents the density of containment time.

Discussion

In this article, we propose a novel stochastic dynamic model to depict the transmission mechanism of COVID-19. In comparison with some existing dynamic models on COVID19, our model features the employment of a stochastic dynamic as well as a comprehensive account for the infectious incubation period, the asymptomatic virus carriers, and the contact tracing measure with time latency. Moreover our proposed model also sets the foundation for further studies with individual/network based models, which may not have an exact mean-field counterpart.

Based on our proposed model, we find that (1) about 30% of infections are asymptomatic which is lower than what was estimated in Tang et al.10 but consistent with the finding of 29.2% in Hu et al.34 among a small sample and 30.8% in Japanese evacuation data35; (2) virus carriers with symptoms are about twice as likely to pass a pathogen to others as asymptomatic virus carriers which is consistent with the finding in the study of Li et al.36; (3) the current containment measures are effective to reduce the contact and transmission rate; (4) the containment time of the outbreak is around late February to early March; (5) the time-varying Rc was estimated to be around two, which is of the same magnitude as reported in Wu et al.11 and Liu et al.7 at the beginning of the epidemic, and it drops rapidly due to the implementation of containment measures; and (6) besides the control measures on exposure rate, the current contact tracing policy contributes significantly to the containment of the epidemic. Furthermore, the proposed model fits well in other region in China, and can be easily extended to regions outside China, see Supplementary E.

With the above findings, we suggest all nations to unite in action on the agenda of containment of COVID-19 by (1) allowing testing without symptoms; (2) introducing contact tracing and quarantine; and (3) conducting measures on reducing exposure rate. Though the epidemic is under good control in China currently, we could not let our guard down. The simulated result shows that if containment measures are relaxed after 3 weeks of the containment date, the epidemic has a probability of 0.415 to resurge in Beijing, China. The probability goes up to 0.658 if containment measures are relaxed after 2 weeks of containment date and 0.878 if 1 week. The calculation of resurgence probability was inspired by Hao et al.7.

However, we acknowledge that there are limitations in the propose model,

  1. 1.

    Given the limited available data, certainly parameters, especially those “far away” from observation in the proposed model may have a potential risk of identification issues.

  2. 2.

    The proposed model does not apply if significant changes apply to the current epidemic control/treatment measure in the future;

  3. 3.

    The proposed model needs further modification if a non-negligible portion of the asymptomatic patients remain infectious after the end of quarantine.

  4. 4.

    Parameter estimates may lose precision if the stochastic model differs excessively from its simplification described in “Discussion” section.

In our future study, we propose to complement/generalize our current stochastic model from the following aspects:

  1. i.

    Improved medical tracking dynamic In this ongoing work, we will introduce a more realistic dynamic for medical tracking, such that medical tracking is triggered in a more physical manner by the affirmative diagnosis of the transmission source. In such model, the quarantine process of each exposed agent depends on his/her contact history. Thus an individual based, rather than compartment dynamic is needed.

  2. ii.

    Introduction of medical service capacity The maximal capacity of the medical service system will be considered, which might be overloaded when facing a massive outbreak. This idea was first inspired by an Amateur Demonstration.37 In fact, this was the key feature of what happened in Wuhan at the beginning phase of the COVID-19 outbreak. We may also allow such capacity to be time/configuration dependent to model the contribution of cabin hospitals in Wuhan.

  3. iii.

    Population flows over cities We will model migration of people over different regions, which could have played an essential role in the spread of the epidemic before the Chinese New Year of 2020. We will also consider the transmissions on board and even allow the population flow to react on the information they have about the epidemic situation.