Seeing the unseen—the iceberg phenomenon in the first months of the COVID19 pandemic

We analyze the evolution of the COVID19 infections in the first months of the pandemics and show that the basic compartmental SIR model cannot explain the data, some characteristic time series being by more than an order of magnitude different from the fit function over significant parts of the documented time interval. To correct this large discrepancy, we amend the SIR model by assuming that there is a relatively large population that is infected but was not tested and confirmed. This assumption qualitatively changes the fitting possibilities of the model and, despite its simplicity, in most cases the time series can be well reproduced. The observed dynamic is only due to the transitions between two infected compartments, which are the unconfirmed infected and the confirmed infected, and the rate of closing the cases (by recovery or death) in the confirmed infected compartment. We also discuss some relevant extensions of this model, to improve the interpretation and the fitting of the data. These findings qualitatively and quantitatively evidences the “iceberg phenomenon” in epistemology.


Introduction
In the absence of a vaccine and efficient antiviral drugs, the containment of the COVID19 pandemics relies mainly on non-pharmaceutical interventions. Such interventions depend upon the correct determination of the parameters that characterize the spread of the virus (like the transmission, recovery, and death rates), which are fed into mathematical models to predict the time evolution of the pandemics. But obtaining the correct numbers is difficult while the pandemics unfolds, because of incomplete data and uncertainties in their interpretation. Nevertheless, a careful analysis of the data may guide one to adapt the statistical models to consistently describe the observations and to recover the missing information. The data is only the tip of the iceberg [1]. Our challenge is to calculate the real size of the block of ice hiding beyond our observation possibilities-that is, to really find how fast the virus spreads and how many people are affected.
The mathematical modeling of the spreading of infectious diseases is centuries old [2] and eventually started with the work of Daniel Bernoulli on the variola spreading [3]. These models may be grouped in two categories: stochastic and deterministic. Whereas the stochastic models, based mainly on simulations, may be more suitable for smaller communities, the deterministic models are usually employed for large communities, like countries or cities, where the fluctuations around averages are small and the time evolution is accurately determined by a small number of parameters. Because the analyzed population is divided into compartments, the deterministic models are also called compartmental models [4]. The most simple such model is the SIR model, where p and λ will be called the infection rate and the recovery rate, respectively. In general, p and λ may also be functions of t (for example, p is influenced by regulations imposed to the population, such as social distancing, whereas λ is influenced by the response of the public health system and the time passed since the infection), but since we discussed only the early stage of the pandemic, when the conditions did not improve significantly, we shall assume that they are constants. From Eqs. (1), only the first two are independent, the third one being a consequence of the normalization condition s + y + r = 1, which yieldsṙ = −ṡ −ẏ. Equations (1) may be solved with the initial boundary conditions s(0) = s 0 and y(0) = y 0 (see Appendix A for more details).

Method 1 (M1): fitting the time series of the active cases
With the formalism presented above, one may try to describe the evolution of the COVID19 pandemics. The official numbers of infected, dead, and recovered persons may be found on several databases. We use the one from GitHub (https://github.com/CSSEGISandData/COVID-19).
In Fig. 1 we plot y(t) (Active cases) for Romania, Hubei (China), Germany, and South Korea, together with the fit of the SIR model (minimum square deviation), with constant p and λ (or a). This simplified model does not fit the data very well and more complex models should be employed. But although the details of fitting the time series of the Active cases are important, here we shall focus on a more conspicuous problem. In Fig. 2 we show the time series of the Closed cases of the same countries and on the same periods as in Fig. 1. We observe that the SIR results for the same parameters as in Fig. 1 differ from the recorded data by orders of magnitude. Such discrepancies cannot be mended by adjustments of the parameters p and λ or by taking into consideration reasonable time variations of them. We did similar fits also for many countries and we could not find any country in which the M1 method would provide a good approximation for the early stages of the pandemic evolution.
2.2. Method 2 (M2): fitting the time series of the total infections One may try another approach. The time variation of s, and therefore of y, is determined by the number of infectious persons, which may not be accurately described by the number of active cases, as it is assumed in the SIR model. This may have happened because there was no medical procedure that could eliminate a patient from the list of active cases immediately when it becomes noninfectious. Rather, one was eliminated from this list after a certain interval of time, when it was tested negative for the COVID19 disease. This may have allowed for an overestimation of the number of active cases at any moment in time. For this reason, we may assume that the (total) number of infected people, which is y(t) + r(t) = 1 − s(t), is more appropriate for the fitting procedure. In this way we obtain the plots of Fig. 3. Obviously, this fits the time series of the infectious population much better than Method 1, but it fails to fit (by orders of magnitude) the time series of the Active cases, as one can see in Fig. 4. Furthermore, for longer time intervals, the fitting functions for Romania and Germany of Fig. 3 are likely to differ significantly from the data. We investigated many countries but obtained similar results. Therefore, we observed that the SIR model cannot simultaneously fit the time series of the active cases and the total number of infections (that is, the active cases plus the closed cases). We may expect that by changing the constant p and λ parameters with functions of time we may be able to better fit the active cases or the closed cases, but not both of them at the same time, because of the huge differences observed between the predictions and the time series. This prompts us to make some substantial changes in the model.

Methods
To explain the discrepancies observed in the previous section, we extend the SIR model by assuming that the number of infected people is significantly different from the one we know, because, on one hand, a small fraction of the population has been tested and, on the other hand, the number of infected people was still very small. For simplicity, we shall refer to this as the SIR-Iceberg model or SIRI. We assume that beside the known infected population I 1 , there are I 2 infected persons that we do not know about, such that the total infected population is Similarly, the know and unknown recovered populations are denoted by R 1 and R 2 , respectively (see Fig. 5 for the schematic representation of this model). If the total population is again denoted by N , then the susceptible population is If we introduce the densities s ≡ S/N , y 1 ≡ I 1 /N , y 2 ≡ I 2 /N , r 1 ≡ R 1 /N , r 2 ≡ R 2 /N , then the system of coupled differential equations becomeṡ  In the system (2) we assumed that the rates of contamination due to the infected populations I 1 are I 2 , denoted by p 1 and p 2 , respectively, are different, because the population I 1 is much better monitored and isolated from the healthy population (one may safely assume that p 1 p 2 ). Furthermore, a contaminated person first enters the I 2 population and it has the rate ξ (assumed constant) to enter the known infected population I 1 (by random testing or because of severe symptoms). The rates by which the cases are closed (by recovery or death) in the populations I 1 and I 2 , denoted by λ 1 and λ 2 , respectively, may be different because of the following two (competing) reasons: (1) the population I 1 gets better health care and (2) the symptoms of I 1 may be (on average) more severe than the ones of the population I 2 , because they may have led to the observation of COVID19. Further, the rate ζ of transferring population from R 2 to R 1 is mostly due to the postmortem detection of the SARS-Cov-2 virus, because in the early stages the pandemic antibody tests were only being developed. In practice, this may be very small, so we set ζ = 0 in the numerical calculations.
The initial conditions are s 0 ≡ s(0), y 1,0 ≡ y 1 (0), y 2,0 ≡ y 2 (0), r 1,0 ≡ r 1 (0), and r 2,0 ≡ r 2 (0). Since not all these parameters are independent (s 0 + y 1,0 + y 2,0 + r 1,0 + r 2,0 = 1), we may choose, for example, r 2,0 to be determined by the other ones. To fit the data, according to the arguments above we choose p 1 = p 2 /10 or p 2 /20 (the final differences are insignificant), λ 1 = λ 2 = ap 2 , and ζ = 0. In this way we are left with the fitting parameters p 2 , a, and ξ. The results of the fit are shown in Figs. 6 -9.   We observe that the SIRI model fits significantly better all the countries, despite its simplicity. Nevertheless, in some cases discrepancies between the model predictions and the data still exist. In Figs. 6 -9 we show two contrasting situations. While the fit for European countries are quite good (see, for example, Romania and Germany in Figs. 6 and 7, respectively), the data for Hubei

Discussion
Whereas the SIR model with constant p and λ (infection and healing rates) completely fails to describe the data because of the large discrepancies between the fits of different time series (if we fit one time series, another one may differ from the fitting function by more than one order of magnitude in significant parts of the fitting range), the SIRI model fits well the available data for most of the countries studied (e.g. Figs. 6 and 7 and Appendix B), although some discrepancies still remain for some countries (e.g. Figs. 8 and 9). We also notice that s decreases fast. According to our assumptions, s cannot be directly calculated from the available data, since we do not know y 2 and r 2 (or, equivalently, I 2 and R 2 ). We estimate it in the model. If s becomes too small, then the time variations of y 1 and r 1 are not determined by the actual infections, but by the rate of people transfer from compartment I 2 into compartment I 1 (COVID19 diagnosis at some time after the infection) and from I 1 to R 1 (recovery or death of people confirmed with COVID19).
The discrepancies between the data and the SIRI estimates may be due to the fact that the parameters that we use in the models should be replaced with functions of time, according to the time evolution of the pandemic (people awareness, health system improvement, governmental measures, time dependence of the recovery rates, etc.) and the testing of the population. Another important aspect is that the spread of the disease is not geographically homogeneous and one should split countries (or regions) into more localized containers in which the infections are more uniformly distributed. Furthermore, even in the same geographical area, some people are more exposed to the virus than others, which would lead to further splitting of the population     into containers. These aspects are not incorporated into our model, but our description, while it remains simple, is reliable. Nevertheless, maybe the most important conclusion of this work is that there was a significant part of the population which was infected, but was not detected. Taking such aspects into account would significantly improve the estimation of parameters like the disease contagiousness, virulence, and healing rate, improving also the prediction of the new waves of pandemics.

Acknowledgments
We are grateful to Prof. Dan Pîrjol for drawing our attention to this subject and for introducing us to the basic SIR model. The work was supported by the UEFISCDI project PN 19060101/2019 and Romania-JINR collaboration projects.
Authors' contributions DVA formulated the concept and the SIRI model, contributed to the numerical calculations, to the interpretation of the results, and wrote the first draft of the paper. ITAA discussed the models, contributed to the numerical calculations (he wrote and used the C++ codes), to the interpretation of the results, and to the writing of the paper (which includes drawing most of the figures).
Appendix A. The SIR model As explained in the main text of the article, the SIR mdel consists of three compartments, Susceptible, Infected, and Recovered, each of them being populated by S, I, and R persons. The where p and λ have been defined in Section 1. By dividing the system (A.1) by N , we obtain the system (1). If we divide the Eqs. (1) by p and denote x ≡ pt and a ≡ λ/p, Eqs. (1) get the simpler form [7] s In the equations above, one can neglect r 0 , since this is in general either zero or very small. We see that D(w, a, y 0 , r 0 ) ≥ 0. Furthermore, the integral diverges if and only if the denominator D becomes zero at the lower integration limit. If we denote this limit by w ∞ , then D(w ∞ , a, y 0 , r 0 ) = 0 and lim x→∞ w(x) = w ∞ . Therefore, w ∞ may be determined from the equation Since s < 0 for any x, then w < 0, so w ∞ < w 0 < 0.   Figure A1. The function f a (w) (continuous lines) and the constants a ln(1 − y 0 ) (dashed lines) from the equation that gives the asymptotic value of w, when x → ∞ (we assume r 0 = 0).
To find the solution of Eq. (A.7), we denote f a (w) ≡ 1 + aw − e w and we observe that f a (±∞) = −∞ and f a (0) = 0. The derivative f a (w) = a − e w is monotonically decreasing, from lim w→−∞ f a (w) = a to lim w→∞ f a (w) = −∞. This implies that f a (w) = 0 has two solutions, one of them being w = 0 and the other one being smaller or bigger than zero, depending on a. The function f a (w) has a maximum at w m , which satisfies the condition 0 = f a (w m ) = a − e wm , that is, w m = ln a. Therefore, if a < 1, w m < 0, so the second solution is negative, whereas if a > 1, w m > 0 and the second solution is positive. If a = 1, then both solutions coincide with w m = 0. These aspects may be seen in Fig. A1.
From Eq. (A.7) and Fig. A1 we can see that if a ≥ 1, then lim y 0 +r 0 →0 w ∞ = 0, whereas if a < 1, then lim y 0 +r 0 →0 w ∞ < 0. In epidemics a < 1, that is, a small initially infected population may spread the disease in a finite fraction of a large population. In this case, since at the beginning y 0 + r 0 1, then s 0 1 and y (x = 0) of Eq. (A.2b) is positive. On the other hand, if a ≥ 1, then y (x) < 0 and the number of active cases decreases monotonically.
Appendix A.1. Fitting the data with SIR To find the approximation of w for x 1, we may start either from Eq. (A.5) or from Eq. (A.6). From Eq. (A.5), for example, using w (x 1) ≈ w (0) = −y 0 , we obtain directly We assume that y 0 1 − a < 1, so we can use the approximations 1 − y 0 − r 0 − a ≈ 1 − a and ln(1 − y 0 − r 0 ) ≈ 0. If an epidemic had evolved for some time and is far from the beginning, as is the case of the COVID19 pandemic, one should fit the whole time series. The evolution of COVID19 was very complex and determined by many factors. We do not intend to describe here the whole evolution, but we analyze only time intervals shorter than 100 days. Day 1 is the day in which the number of confirmed cases exceeded 100 in the analyzed country, such that too large fluctuations that appear in the beginning due to the bad statistics may be partially avoided, while we still analyze the evolution from an early stage. During such a time interval, the evolution of the number of cases in each country may be modeled by parameters that do not vary in time (the parameters are different from country to country), since the measures taken by governments and governmental institutions were only being implemented and conditions did not vary much. The COVID19 started spreading in different countries or regions at different times, so we fit each country or region separately. As explained before, we take the data from https://github.com/CSSEGISandData/COVID-19 and form the vectors of data S d , I d , and R d , corresponding to the compartments S, I, and R, respectively. The unit of time is 1 day and t = 0 at Day 1. If N is the total population of the country or region analyzed, we calculate the densities s d = S d /N , y d = I d /N , and r d = R d /N . On the other hand, we calculate the time evolution s(t), y(t), and r(t), using Eqs. (1) and the initial conditions s 0 = s d (1), y 0 = y d (1), and r 0 = r d (1). The functions s(t), y(t), and r(t) depend on the parameters p and λ, which we determine by fitting the data using the least squares method.
The results of the fitting procedure were explained in Sections 2.1 and 2.2 Eventually the most important time series is y d , which represents the time evolution of the number of active cases. Therefore, in Section 2.1 we determined the parameters p and λ by fitting y(t) to y d (with the initial conditions as specified above). The results are plotted in Fig. 1. We notice that while some of the details may not be fitted by this method, the calculated functions reproduce the general behavior of the time series. The details may be adjusted by assuming that p depends on time (because of the restrictions imposed and the reaction of the population), whereas λ depends both on the time of the infection and on the time elapsed from the infection (because of the time variation of the health care system efficiency and of the human body response). Using such adjustments, one may expect to be able to fit the time series y(t), but the real problem of this method is that the same parameters should be used to calculate the estimation of the other time series, namely of s(t) and r(t). It is shown in Fig. 2 that the difference between r(t) and the data y d (t) is very big-in relevant time intervals, the estimation is more than one (or even two) orders of magnitude bigger than the reported values. Therefore, by adjusting the parameters to fit the y d time series in more detail is not expected to make such a huge difference in the values of r(t) to compensate for the differences observed in Fig. 2.
In Section 2.2, we tried to fit the time series of the confirmed cases y d + r d . We applied again the least squares method and we obtained the results presented in Fig. 3. Even though the calculations y(t) + r(t) are similar to the data y d + r d , this time the active cases y(t) differ by orders of magnitude from the time series y d , as one can see in Fig. 4. As in the case of M1, the fine adjustments of p and λ to fit the time series of confirmed cases cannot compensate for the big differences obtained for the active cases.
We also tried to fit simultaneously y d and r d , by applying the least square method to both sets, but by doing so, in the best case we could fit only one of the time series. Visually, in general this method led to worse results for both, y d and r d , whereas quantitatively, the time series of bigger numbers was better fit than the time series consisting of the smaller numbers.   Figure B1. Examples of time series of COVID19 infections fitted with the modified model SIRI. As the total populations, we used the values found on Wikipedia: Italy -N = 60.36 × 10 6 , Spain -N = 46.94 × 10 6 , the Netherlands -N = 5.518 × 10 6 , the United States -N = 328 × 10 6 , and Brazil -N = 209.5 × 10 6 . We see that in most cases, the fit is quite good (eventually with the exception of the Netherlands).