Application of probabilistic models for extreme values to the COVID-2019 epidemic daily dataset

Worldwide, COVID-19 coronavirus disease is spreading rapidly in a second and third wave of infections. In this context of increasing infections, it is critical to know the probability of a specific number of cases being reported. We collated data on new daily confirmed cases of COVID-19 outbreaks in Argentina, Brazil, China, Colombia, France, Germany, India, Indonesia, Iran, Italy, Mexico, Poland, Russia, Spain, U.K., and the United States, from 20 January 2020 to 28 August 2021. A selected sample of almost ten thousand data points is used to validate the proposed models. Generalized Extreme-Value Distribution Type 1 (Gumbel) and Exponential (one- and two-parameter) models were introduced to analyze the probability of new daily confirmed cases. The data presented in this document for each country provide the daily probability of rate incidence. In addition, the frequencies of historical events, expressed as a return period in days, are provided for the complete data set.



Value of the Data
• Data on daily COVID-19 cases are now easy to obtain. Authorities are beginning to compile, cross-check and release these data for examination and analysis, so they are widely available in most countries. However, it is not easy to associate a probability of occurrence with each daily case report.
• These data can be updated through official reports and specialized websites. The database presented here is easy to update as the epidemic progresses (including the third wave in some countries). In the dataset, new daily cases are associated with their probability of occurrence. They can be used to determine the probability of recent infections at specific sites.
• The likelihood of a new outbreak of COVID-19 in any of the countries above can be estimated using the extreme-value probability distribution with the best fit.
• This dataset also supports expanding understanding of the differences in geographic scale in forecasting COVID-19 case counts. The authors of [2] show that statistically significant differences exist, based on percentage error metrics, when the same forecasting method is used at different levels of geographic resolution.
• The probability distributions presented are a complement to a forecasting model. This dataset provides the daily probability of rate incidence, which could be explored alongside forecasting data to gain further insight into the validity of different forecasts at varied geographic scales as a result of population size differences across countries.
• To provide health institutions, research centers and authorities with probabilistic tools to respond to changes in the epidemic, the Matlab code that systematizes the frequency calculations is included.

Data Description
Worldwide, COVID-19 coronavirus disease is spreading rapidly in a second and third wave of infections. In this context of increasing infections, it is critical to know the probability of a specific number of cases being reported [3]. Daily data on new confirmed cases of COVID-19 in 16 of the most affected countries (Argentina, Brazil, China, Colombia, Italy, Spain, France [4], Germany, India [5], Indonesia, Iran, Mexico, Poland, Russia, U.K., and the United States) from 20 January 2020 to 28 August 2021 were collected from the COVID-19 Dashboard by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University (JHU) (https://coronavirus.jhu.edu/). A sample of almost ten thousand daily data points is used to validate the proposed models. Figs. 1 to 4 show an example of the frequency analysis fit. A comparison between the fitted models (Exp-1P, Exp-2P and Gumbel) and the daily data on confirmed cases of COVID-19 is shown in Figs. 5 to 7.
Very specific studies on COVID-19 forecasting are currently available. Autoregressive models of the ARMA(p,q) type are commonly used. For example, [6] applies Hidden Markov Chain models to Moroccan data and [7] uses Recurrent Neural Networks; these studies are "forecasting" models. However, there are few studies on the probability of a specific number of infections occurring in a day. This is one of the highlights of this dataset. We propose using a frequency analysis to assign a probability of occurrence (infection) to a particular day in a specific country.

A theoretical frequency analysis consists of fitting a data series to a probability distribution function P(x), which represents the probability of occurrence of a random variable. This procedure must be applied when the event of interest is associated with a return period longer than the data record; it is called theoretical because the event cannot be estimated from an empirical frequency table. There are several probability distribution functions. Those most successfully used are: normal, log-normal, exponential, gamma, Pearson type III (or three-parameter gamma), log-Pearson type III and the extreme-value types I, II and III (Gumbel, Fréchet and Weibull, respectively). Mixed probability functions are also used, i.e. they can take into account two or three data sets. For daily COVID-19 data we propose to use the extreme-value distributions shown below.

Gumbel distribution
Here $s$ is the standard deviation and $\bar{x}$ is the mean of the data; $\alpha$ is the scale parameter and $\beta$ is the location parameter. The event associated with a return period $T$ is obtained by equating the non-exceedance probability, $P = 1 - 1/T$, with the distribution function and solving for $x$.
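A commonly used moment-based form of the Gumbel model, consistent with the symbols $s$, $\bar{x}$, $\alpha$ and $\beta$ used here, can be sketched as follows. The parameterization (with $\alpha$ as scale, $\beta$ as a location-type parameter and $\gamma \approx 0.5772$ the Euler-Mascheroni constant) is an assumption and is not necessarily the exact set of equations used by the authors.

```latex
\begin{align*}
F(x) &= \exp\!\left[-\exp\!\left(-\frac{x-\beta}{\alpha}\right)\right], \\
\alpha &= \frac{\sqrt{6}}{\pi}\, s \approx 0.7797\, s, \qquad
\beta = \bar{x} - \gamma\,\alpha \approx \bar{x} - 0.45\, s, \\
1 - \frac{1}{T} &= F(x)
\;\Longrightarrow\;
x = \beta - \alpha \ln\!\left[-\ln\!\left(1 - \frac{1}{T}\right)\right].
\end{align*}
```

Under this form, a longer return period $T$ always maps to a larger daily case count $x$, because $-\ln(1 - 1/T)$ decreases towards zero as $T$ grows.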
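
Exponential I distribution
Here $\bar{x}$ is the mean and $\beta$ is the single parameter of the distribution. As for the Gumbel model, the distribution function is equated with the non-exceedance probability for the return period and solved for $x$.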
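For the one-parameter exponential model, a minimal sketch of the same derivation is given here. It assumes the usual form in which $\beta$ equals the sample mean; this choice is an assumption, not a confirmed transcription of the authors' equations.

```latex
\begin{align*}
F(x) &= 1 - \exp\!\left(-\frac{x}{\beta}\right), \qquad \beta = \bar{x}, \\
1 - \frac{1}{T} &= F(x)
\;\Longrightarrow\;
x = -\beta \ln\!\left(\frac{1}{T}\right) = \beta \ln T.
\end{align*}
```

This is the special case $\mu = 0$ of the two-parameter result in equation (12).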

Exponential II distribution
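Here $s$ is the standard deviation and $\bar{x}$ is the mean; $\beta$ is the scale parameter and $\mu$ is the location parameter. Equating the distribution function with the non-exceedance probability for the return period $T_r$ and solving for $x$ gives

$$x = \mu + \left[-\ln\!\left(\frac{1}{T_r}\right)\right]\beta \qquad (12)$$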
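The return-period expressions for the three models can be evaluated numerically. The following Python sketch is illustrative only: the function name `return_levels`, the moment estimators (Gumbel scale $\sqrt{6}\,s/\pi$, Exp-1P $\beta = \bar{x}$, Exp-2P $\beta = s$ and $\mu = \bar{x} - s$) and the use of the sample standard deviation are assumptions, and the result need not match the authors' Matlab code.

```python
import math
from statistics import mean, stdev

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant


def return_levels(cases, T):
    """Daily case count expected to be equalled or exceeded once every T days.

    `cases` is a sequence of daily new confirmed cases; T is a return
    period in days and must be greater than 1.
    """
    if T <= 1:
        raise ValueError("return period T must be greater than 1 day")
    xbar = mean(cases)
    s = stdev(cases)  # sample standard deviation (n - 1 denominator), assumed

    # Gumbel, assumed moment estimators: scale alpha, location beta.
    alpha = math.sqrt(6) / math.pi * s
    beta = xbar - EULER_GAMMA * alpha
    gumbel = beta - alpha * math.log(-math.log(1 - 1 / T))

    # One-parameter exponential, assumed beta = mean: x = beta * ln(T).
    exp_1p = xbar * math.log(T)

    # Two-parameter exponential, equation (12) with assumed beta = s and
    # mu = mean - s: x = mu + beta * ln(T).
    exp_2p = (xbar - s) + s * math.log(T)

    return {"gumbel": gumbel, "exp_1p": exp_1p, "exp_2p": exp_2p}


if __name__ == "__main__":
    sample = [10, 20, 30, 40, 50]  # made-up numbers, not taken from the dataset
    for T in (10, 100, 365):
        print(T, return_levels(sample, T))
```

Calling `return_levels` with a country's daily series and, for example, `T = 365` gives the case count that each model associates with a one-in-365-days event.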

Experimental design, materials and methods
Generalized Extreme-Value Distribution Type 1 (Gumbel) [8] and Exponential models were introduced to analyze the probability of new daily confirmed cases. The data presented in this document for each country provide the daily probability of rate incidence [9]. In addition, the frequencies of historical events, expressed as a return period in days, are provided for the complete data set. Table 1 shows the estimated parameters of the distributions used. This probabilistic analysis is derived from the frequency analysis for each country. Only some countries are shown here as examples; the complete probabilistic analysis can be obtained from the database accompanying this paper.

If a series of extreme values is used, the maximum value recorded on each day must be used. This series is used when the design must be based on the most adverse conditions. The empirical return period of this data series is obtained with the expression proposed by Hosking et al. [10], in which T is the empirical return period, in days; n is the total number of data in each country; and m is the rank of the value in a list ordered from highest to lowest.

When historical records of a phenomenon, defined here as daily data, are used, each record should be assigned a return period according to its observed cumulative frequency (frequency table). To calculate it, the frequency or recurrence interval of each observed event is assumed to allow a return period to be assigned to each datum. This is known as the observed (empirical) return period. Since the return period has a purely probabilistic definition, the return period T of a daily event x should be defined as the inverse of the probability P(x) of that event occurring. This means that the probability of x being equalled or exceeded by another event must be expressed as $P(X \ge x) = 1/T$.

Funding
This work was financially supported by Consejo Nacional de Ciencia y Tecnología, CONACYT, Mexico.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.