Projecting the course of COVID-19 in Turkey: A probabilistic modeling approach

Background/aim The COVID-19 pandemic originated in Wuhan, China, in December 2019 and became one of the worst global health crises ever. While struggling with the unknown nature of this novel coronavirus, many researchers and groups attempted to project the progress of the pandemic using empirical or mechanistic models, each one having its drawbacks. The first confirmed cases were announced early in March, and since then, serious containment measures have taken place in Turkey. Materials and methods Here, we present a different approach, a Bayesian negative binomial multilevel model with mixed effects, for the projection of the COVID-19 pandemic and we apply this model to the Turkish case. The model source code is available at https://github.com/kansil/covid-19. We predicted the confirmed daily cases and cumulative numbers from June 6th to June 26th with 80%, 95%, and 99% prediction intervals (PI). Results Our projections showed that if we continued to comply with the measures and no drastic changes were seen in diagnosis or management protocols, the epidemic curve would tend to decrease in this time interval. Also, the predictive validity analysis suggests that the proposed model projections should have a PI around 95% for the first 12 days of the projections. Conclusion We expect that drastic changes in the course of COVID-19 in Turkey will cause the model to suffer in predictive validity, and this can be used to monitor the epidemic. We hope that the discussion on these projections and the limitations of the epidemiological forecasting will be beneficial to the medical community, and policy makers.


Introduction
Coronaviruses are enveloped viruses with a positive-sense single-stranded RNA genome. Seasonal coronaviruses (HCoV-229E, HCoV-OC43, HCoV-NL63, HKU1-CoV) are some of the foremost causes of the common cold, and SARS-CoV and MERS-CoV are responsible for severe acute respiratory syndrome (SARS) and Middle East respiratory syndrome (MERS), respectively. These pathogens are the ones with the greatest impact on human health within the family Coronaviridae [1].
On March 11th, 2020, the Ministry of Health of the Republic of Turkey announced the country's first confirmed COVID-19 case 4 . According to the official numbers, as of June 5th, 2020, there were 168,340 confirmed COVID-19 cases and 4,648 deaths in Turkey. Of the currently active cases, 592 patients were treated in intensive care units, and 269 of them were followed with invasive mechanical ventilation support 5 .
From the date when the first confirmed patient was announced until today, a large number of social, political, economic, legal, military, religious, and cultural preventive measures were taken to slow the spread of the epidemic in Turkey; implementing curfews in metropolitan cities, establishing awareness of social distancing measures, national and international travel restrictions, closing of nonessential businesses, interrupting collectively religious ceremonies and postponement of summons, referral, and discharge procedures in military barracks are some examples of these measures. The full chronological list of the interventions is available upon request, as a resource for further studies.
Immediately after the announcement of the COVID-19 epidemic in China, dissemination dynamics of the virus, and measures to prevent the spread, along with how the healthcare services should respond, were the urgent questions for researchers. Various modeling studies were initiated to tackle this task, such as the first modeling study on COVID-19 carried out by Wu et al., where they investigated the number of cases exported from Wuhan internationally to infer the number of infections in Wuhan from December 1st, 2019 to January 25th, 2020. They reported an estimated number of 75,815 individuals infected with SARS-CoV-2, which was much higher than the official numbers. Additionally, according to a remarkable finding, researchers stated that a 50% reduction in transmissibility would push down the viral reproductive number to about 1.3, which can significantly slow the epidemic and prevent a sharp peak during the first half of 2020 [3].
Researchers in the MRC Center for Global Infectious Diseases Analysis of the Imperial College of London have also been reporting their findings on the COVID-19 epidemic in China since January 2020. So far, they evaluated striking topics such as estimating the total number of patients, the efficiency of nonpharmaceutical interventions, the degree of online community involvement, the potential impact of the COVID-19 epidemic on other diseases such as HIV, tuberculosis, and malaria, and using mobility data to estimate transmission dynamics [4][5][6][7][8]. Meanwhile, in mid-March 2020, the Institute for Health Metrics and Evaluation (IHME) of the University of Washington published its empirical model [9]. The IHME started live forecasting at the state level for the USA and the national level for 17 selected countries 6 . They later expanded the number of countries projected to 50. The IHME started sharing their projections for Turkey COVID-19 recently, on May 15th, 2020. Likewise, the Robert Koch Institute (RKI) in Berlin published a new mechanistic model called SIR-X based on the confirmed cases for COVID-19 epidemic in China [10]. Their forecasting for 98 countries is also publicly available 7 . Later, Jianxhi Luo and their team from Singapore University of Technology and Design published foresight for 131 countries between April 18th, 2020, and May 11th, 2020 (white paper) based on a conventional mechanistic model known as SIR 8 . Several other real-time projections are also publicly shared online by different groups during the COVID-19 pandemic.
In this study, we have implemented a Bayesian negative binomial based multilevel mixed effects model inspired by IHME's COVID-19 model for the projection of COVID-19 pandemic in Turkey from June 6th to June 26th. While presenting our projections here, we would like to open a discussion on the utility of these models for monitoring the dissemination and analysis of the effects of interventions during the COVID-19 pandemic in Turkey.

Materials and methods
We model the progression of the epidemic in Turkey using a top-down empirical approach. The approach is similar to the COVID-19 model of the Institute of Health Metrics and Evaluation (IHME) of the University of Washington [9]. The COVID-19 projection of the IHME is a curvefitting approach where the cumulative death curves in different states are fit with a logistic curve. Specifically, the scaled cumulative distribution curve of the Gaussian distribution. The base function is therefore of the form of (Eq. 1), where p is the scaling factor and determines the asymptotic limit, i.e., the ultimate number of total events (deaths or cases), α is the rate of increase in the number of events at the center of the epidemic wave, which in turn is β. One can thus think of the epidemic wave as a Gaussian kernel on the peak day (the day where the number of events is highest). The β parameter would then be the mean of this curve in terms of days (which is the unit of the independent variable t), and α, proportional to the reciprocal of the standard deviation.
In the COVID-19 projection of the IHME, this curve is fit on the cumulative death data for each state in the U.S., Chinese provinces, and some European countries. They use the Gaussian kernel fit to the cumulative death rate of each location as a base kernel and generate 13 shifted versions of this base kernel. They then linearly combine them in a hierarchical generalized linear model (GLM) using the location (state/province/country), social distancing and lockdowns enforced for each location, and, more recently, cell phone mobility data as covariates. The COVID-19 projection of the IHME estimates the time-to-death from the day of infection, the case fatality rate, and thus the number of cases retrospectively, by working backward from the number of deaths in a particular location at a specific day to go back to the day those infections occurred. Projections of expected cases are calculated similarly by working backward from the predicted number of deaths.
As the number of confirmed cases is noisier and affected by the scale of testing and the testing policies in each region, the IHME uses the number of deaths. They contend that the number of deaths is a more reliable metric and thus have to perform this lagged estimation of cases. In the case of Turkey, the number of reported COVID-19 deaths are coupled to the number of confirmed cases. Specifically, for a death to be reported as a COVID-19 death, the patient has to be a confirmed case. So, in the case of Turkey, the number of confirmed cases is no more or no less reliable than the number of deaths. Therefore, we modeled directly on the confirmed case data and did not calculate the retrospective inference.
We also performed fitting differently than the IHME's model. In addition to using a maximum likelihood curve fit, we modeled the uncertainty of our model using Bayesian regression. The IHME's model calculates uncertainty by fitting the cumulative death numbers and generates confidence intervals from the parameter covariance and residuals of the MLE fit. One issue with this approach is that the cumulative numbers, by nature, are not independent. Each successive day's sum is dependent 9  on the previous days' sums, as well as the underlying latent process. This dependency causes the projections of the IHME to underestimate uncertainty, which is also noticed by other researchers. One study has shown that the COVID-19 projections of the IHME (as published) have predictions that fall outside the 95% prediction interval in 49%-73% of the time [11]. The IHME team has updated their methodology subsequently to address these concerns, but those updates currently are not yet documented.
Instead of using cumulative numbers in calculating the predictive interval, we used the daily numbers of confirmed cases to prevent the serial dependency mentioned above. We used a Bayesian formulation with the generative model given in Eq.
Here, each day (t) of the epidemic for each location (l) is a draw from a negative binomial (NB) distribution with mean m l,t, and reciprocal dispersion r. Namely, we model the number of cases each day and in each location as the count from a Poisson random variable with rate m l,t . We allow overdispersion in this variable, hence we allowed for the choice of negative binomial distribution instead of Poisson. This mean count is m l,t , which is derived from the base model in Equation 2. Here, N l is the population, and p l , α l , β l are the parameters for location l as discussed above. ε is the unbiased error term drawn from a normal distribution with 0 mean, parametrized by the covariance matrix of random effects.
The model is essentially a 2-step negative binomial Bayesian regression where the posterior is parametrized by the expectation for the number of cases each day in each country, which is calculated by the output of the scaled Gaussian given in Equation 2.
The parameters for the above model have been estimated with the Hamiltonian Monte Carlo sampling, using the Stan (v2.19) [12] probabilistic programming platform under R (v3.6.1) 9 . Default weakly informed priors provided by Stan were used for the parameters and the covariance matrix. The regression was done in the log space (using Stan's neg_binomial_2_log parametrization), using 12 MCMC chains, with 2000 burn-in iterations, and 5000 sampling iterations in each chain.

Results
We have fit the above model to the daily confirmed COVID-19 nationwide case numbers officially released by the Ministry of Health, Turkey until June 5th, 2020 10,11 . We used corresponding data for countries similar in epidemic progression to Turkey to estimate the random effects. The countries defined as the "locations" for the model are Turkey, Belgium, France, Germany, Italy, Spain, Sweden, and the United Kingdom. Day-zero for each country is when the number of cases surpassed 3 per 10 million population during the epidemic for that country. Figure 1 shows the 3-day moving averages of the case numbers in these countries shifted to match their day-zero and also scaled by their populations. Table 1 summarizes the key information for these countries. The ideal set of locations to use would have been different provinces in Turkey. Unfortunately, those data is not made publicly available. Therefore, we assume that this list of countries will allow us to estimate random effects relatively accurately. Our justification for this assumption is presented in the predictive validity section.
We ran the Hamiltonian Monte Carlo sampling on the official daily cases for Turkey from day-zero (March 17th, 2020) through to June 5th, 2020 (the present, as of this writing). The sampling converged very well with agreement 10  among the chains, and there were no divergent traces. The description of the estimated posterior distribution parameters ( r, p, α, and β ) are as follows: r is the reciprocal dispersion parameter of the negative binomial; P is the asymptotic limit of the sigmoidal growth and is indicative of the total number of cases to be expected for this wave; α is the rate of growth at the steepest point of the curve, and β is the estimated center of the wave (the 40th day of the epidemic is April 26th, 2020 for Turkey) ( Table 2 and Figure 2).
These posterior distribution parameters were estimated and used to sample the posterior predictives for the following 20 days after June 5th, 2020. The uncertainty (i.e., the prediction) intervals were found by taking the respective (80%, 95%, and 99%) quantiles of the posterior predictive sample. Figure 3 presents the prediction bands and the maximum likelihood point estimate for daily cases. Figure 4 is the cumulative form of the preceding figure and shows the predicted cumulative predictions until June 26th. The case number and cumulative number estimates of the model for the first, mid, and last day of the projection are listed as an example to show how our results   Figure 3-4 and Tables  3-4 ).

Evolution of the model parameters over time
We calculated the estimates of the parameters p, α, and β daily from April 5th, 2020, up to June 5th, 2020. For each day, the observations from the first day of the epidemic up to that day were fit, and the resulting trends plotted alongside a 3-day moving average of the daily case numbers. The resulting plot is shown in Figure 5. The noteworthy aspect of this analysis is that it demonstrates how much in flux the model parameters are until the day of maximum cases per day is reached. The parameter estimates stabilize after the peak, although there is still some drift.

Predictive validity
The predictive validity of the model is evaluated by rerunning the analysis only for the confirmed cases up to May 5th, 2020, holding the information for the last 30 days (between May 6th,2020 and June 5th, 2020) out of the analysis. The percentage of held out observations that remained inside the different prediction bands of the posterior predictive of the May 5 model was calculated. The results show that, nominally, the predictions are reliable within 10 days to 2 weeks into the future. The 95% prediction interval starts failing (i.e., the days that fall outside the interval become more than 5%) after 13 days. Likewise, the 99% prediction interval starts failing after 23 days ( Figure 6). We, therefore, stipulate that our model would not be appropriate in making projecting for periods longer than about 20 days. The 20-day predictions as of June 5, 2020 are presented in Figures 3 and 4.

Discussion
Projecting the COVID-19 pandemic presents a challenge as it is a novel virus and even the dynamics of worldwide transmission are not known precisely. The basic reproduction number (R 0 ) for COVID-19 is reported to vary from 2.2 to 5.7. While the main course of transmission is person-to-person, additional mechanisms of contact transmission with surfaces, objects, or even animals are  Table 2. Distribution summaries of the model parameters as of June 5th, 2020: r is the reciprocal dispersion parameter of the negative binomial, p is the asymptotic limit of the sigmoidal growth and is indicative of the total number of cases to be expected for this wave. α is the rate of growth at the steepest point of the curve, and β is the estimated center of the wave (the 40th day of the epidemic is April 26nd, 2020 for Turkey).  The probability density plots of the Bayesian estimates of the model parameters as of June 5th, 2020. r is the reciprocal dispersion parameter of the negative binomial. p is the asymptotic limit of the sigmoidal growth and is indicative of the total number of cases to be expected for this wave. α is the rate of growth at the steepest point of the curve, and β is the estimated center of the wave (the 40th day of the epidemic is April 26nd, 2020 for Turkey) transmission [8,9]. For example, using masks decreases the infection rate by 70%-95% 14 . After the announcement of the first COVID-19 case in Turkey, crucial interventions were set by the Turkish government like in many countries. Most of these interventions were implemented simultaneously or successively.
While struggling with these unclear conditions, many researchers and groups still try to produce mathematical models to forecast the future of the pandemic. RKI and SUTD have published their projections based on conventional epidemiological models. However, it is seen that epidemiological models applied to real-time  modeling of epidemic or pandemic periods are very sensitive to the initial assumptions on multiple factors with significant variations. Modeling epidemics like the COVID-19 pandemic requires long-term analysis and high dimensional data. Hence, the central assumption of the SIR and SIR-X models, where all susceptibles were dropped from the transmission process by either infection or containment, is not valid, as no one will stay isolated entirely for extended periods.
Giordano et al. published a study in which possible scenarios of the implementation of countermeasures were modelled, and they showed that restrictive socialdistancing measures should be combined with widespread testing and contact tracing to control the pandemic [18]. For instance, if the lockdown is weakened in Italy, the number of patients may start to increase. Moreover, Ngonghala et al. showed in their modeling study that early termination of social-distancing measures might cause a new devastating wave in New York [19]. Prem et al. also highlighted the importance of physical distancing measures in their modeling study [20]. While these interventions are taking place, it will not be possible to analyze the course of the pandemic with the same assumptions, as the real-world circumstances are rapidly changing, and the projections of long-term case estimates can result in misleading results. Based on these arguments, RKI and SUTD have discontinued publicly publishing their worldwide projections based on the SIR-X and SIR models, respectively.
The COVID-19 projections of the UW/IHME, which inspired our model, assume a Gaussian distribution for the distribution of events (deaths or cases). However, as seen in Figure 3, the distribution of the confirmed cases in Turkey was not symmetrical as the Gaussian distribution assumes, and the number of new cases increased sharply. Still, it shows a gradual decrease causing the right skewness in the distribution. Therefore, in our generative model, we prefer to use negative binomial distribution, as stated in Equation 2. Even though we do not perform curve fitting for the distribution of the confirmed cases, we produce future projections of COVID-19 cases within reliable uncertainty bands. Several applications of negative binomial models are proposed, such as the assessment of the COVID-19 pandemic risk [21], demographic associations [22], or estimation of the distribution of the infection time [23]. Different from these studies, we have applied the Bayesian negative binomial multilevel model with mixed effects for COVID-19 case number modeling.
The conventional epidemiological models result in unrealistic overestimations, especially at the early stages of the spread. Similarly, the retrospective evaluation of our model showed a high flux in the model parameters before the day of maximum cases per day. Even though our projections showed a high variation in the early stages, as the spread continues with the accumulation of new data, it can project with lower flux for the estimates in 95% PI ( Figure 5). In the proposed model, we defined a 20-day forecast for Turkey with 95% PI. We anticipate that if we continue to comply with the measures and no drastic changes are seen in diagnosis or management protocols, the epidemic curve will tend to decrease in this time interval. During this period, we aim to investigate the epidemic curve dynamically by observing if the confirmed cases stay within the prediction intervals and monitor the course of the epidemic to give feedback on the effects of possible interventions to give insights into planners and policy-makers. An unexpected drift outside the PI bands will indicate the presence of a recent change in the course of COVID-19 spread in Turkey. These drifts in parameters are more likely to happen due to changes in the local interventions, such as business and curfew hours/days, public transportation, etc.
There are several limitations to the study that need to be addressed. First, it should be noted that with the proposed methodology, we are not necessarily modeling reality but rather modeling the numbers. The model only reflects the real magnitude and the timeline of the pandemic to the extent the provided data are representative of the reality. Secondly, one can use many lagged covariates in the model: mobility information (from sources like Apple and Google), changes in the climate parameters, or lockdowns enforced, etc. We do not have precise information about these covariates, and about the lag between the exposure and the presentation of the symptoms. The inclusion of such lagged covariates is thus left as future work. Also, the model attempts to model random effects even though we missed data on the daily case numbers of individual provinces. We tried to overcome this limitation by using the data from a group of European countries to estimate the random effects. We cannot access data on mobility or preventive measures at the provincial level in detail. If these missing data are provided, the proposed model can also measure the efficacy of preventive measures independently.
Lastly, the proposed model is ultimately a "single wave" model. If the current wave coincides with a new epidemic wave of significant size, both the accuracy and precision of the projections will drop dramatically. This limitation allows us to use the model as an early detection tool. If the model suddenly starts suffering significantly in predictive validity, this may indicate the beginning of a Figure 5. Evolution of the model parameter estimates over time, for regressions run each day from April 5 -June 5, 2020. The gray bands have the 95% confidence intervals. Note that the last plot (p) has a logarithmic y-axis. new wave. A multiwave model of the same form as the one we present here is possible: one in which the basis of the negative binomial is not a single Gaussian but a mixture of Gaussians. This type of extension to our approach and its ensuing research problems, like finding the minimum number of Gaussians needed to model a given set of observations, is also part of future work.

Conclusion
In this study, we propose a new methodology for the projection of COVID-19 pandemic inspired by the COVID-19 projections of the IHME. Intensive data requirements of epidemiological models and the fact that IHME's COVID-19 projections tend to underestimate uncertainty led us to form our model. As a second wave is expected due to seasonal variations of coronaviruses, understanding the dynamics of the COVID-19 pandemic during the first wave through our model projections will be beneficial, and maybe also essential, for forecasting the efforts in the next stage, and the assessment of the response strategies.
All models projecting COVID-19 provide estimations, and they should be utilized for assessing the effectiveness of various interventions rather than giving precise predictions. Currently, not only Turkey, but also many countries are progressively lifting their containment measures. The implementation of the reopening will mark the second phase of the pandemic, and monitoring based on the model projections is expected to be valuable to develop a well-defined strategy for the management of removing containment measures with a particular order and timeline.