A KdV-SIR equation and its analytical solutions: An application for COVID-19 data analysis

To describe the time evolution of infected persons associated with an epidemic wave, we recently derived the KdV-SIR equation that is mathematically identical to the Kortewegde Vries (KdV) equation in the traveling wave coordinate and that represents the classical SIR model under a weakly nonlinear assumption. This study further discusses the feasibility of applying the KdV-SIR equation and its analytical solutions together with COVID-19 data in order to estimate a peak time for a maximum number of infected persons. To propose a prediction method and to verify its performance, three types of data were generated based on COVID-19 raw data, using the following procedures: (1) a curve fitting package, (2) the empirical mode decomposition (EMD) method, and (3) the 28-day running mean method. Using the produced data and our derived formulas for ensemble forecasts, we determined various estimates for growth rates, providing outcomes for possible peak times. Compared to other methods, our method mainly relies on one parameter, σo (i.e., a time independent growth rate), which represents the collective impact of a transmission rate (β) and a recovery rate (ν). Utilizing an energy equation that describes the relationship between the time dependent and independent growth rates, our method offers a straightforward alternative for estimating peak times in ensemble predictions.


Introduction
Accurate and reliable prediction for the spread and evolution of epidemics is important for societal safety and economic activity. For example, between late 2019 and early 2022, SARS-CoV-2 (COVID-19) caused more than 5.6 million worldwide deaths. As indicated in the April 2021 World Economic Outlook Report, the global economy contracted by 3.5 percent in 2020 (Yeyati and Filippini 2021 [1]).
Mathematical models are effective methods for representing and predicting epidemic waves. Among pioneering epidemic models, the so-called SIR model, consisting of three, first-order ordinary differential equations (ODEs), was proposed by Kermack and McKendrick (1927) [2] for describing the evolution of susceptible (S), infected (I), and recovered (R) persons, respectively. Time changes for these three groups of people are determined by initial conditions and three parameters, including the transmission rate, the recovered rate, and the total number of people within a population. Such a model was applied to produce a bell-shaped curve to qualitatively depict the time evolution of infected (dead) persons (e.g., Figure 1 of Kermack and McKendrick 1927 [2]), referred to as an epidemic wave.
Since 1927, the SIR model has been widely investigated and its complexities have been expanded. Information obtained from such investigations can help us build more reliable methods for predicting * Corresponding author. new epidemics and for estimating the impact of new policies regarding epidemic spread, thus providing good guidance. For example, a well-known control parameter (i.e., known as the basic reproduction number, , ''R-naught'') has been proposed to illustrate the collective impact of transmission and recovered rates, and to qualitatively measure the ''width'' of a time varying epidemic wave. A flattened epidemic wave is depicted by a small with a relatively low transmission rate. Different viruses yield various epidemic waves, if they exist, with different slopes, suggesting different values of . The appearance of multiple COVID-19 epidemic waves, which may indicate the appearance of different virus variants, suggests the need to consider different values of . However, whether (or not) one or more control parameter(s) can be identified for depicting the time evolution of multiple COVID-19 epidemic waves should first be addressed.
Epidemic models are effective tools for guiding epidemic interventions. Since the emergence of COVID-19 disease, the SIR model and its variants have been widely applied for addressing whether (or not) and how different control scenarios (such as confinement strategies) may prevent virus growth or spread. In spite of the successful application of SIR-type models in the projection of various epidemics, fundamental and challenging questions regarding the (intrinsic) predictability of epidemics remain. For example, a recent predictability study by  [3] indicated challenges for predicting the turning point and the end of expanding COVID-19 epidemic waves. In the study by Castro et al. (2020) [3], the two pioneering studies of Prof. Lorenz (e.g., Lorenz 1963a [4], 1972 [5]) were cited as support for the analysis. To complement their discussions, our study applied the KdV-SIR equation and its analytical solutions in order to illustrate the sensitivity of predictability horizons on control parameters and initial conditions. To facilitate discussions, here, we first provide a brief review of Lorenz's definition of predictability (Lorenz 1963a [4], 1963b [6], 1972 [5], 1993 [7]).
First, as originally proposed by (Lorenz 1963b; Shen 2014 [8]), two kinds of predictability for weather and climate are intrinsic predictability and practical predictability. Recently, such a concept was applied for studying epidemic waves by Paxson and Shen (2022) [9]. Intrinsic predictability only displays a dependence on the nature of an ''epidemic'', and practical predictability is limited by imperfect initial conditions and/or mathematical formulas. As a result, different epidemics (Severe Acute Respiratory Syndrome (SARS) and COVID-19) may have different intrinsic predictability. As suggested by distinct predictability in weather systems that possess very different spatial patterns and temporal variations, a first step in examining (or revealing) whether (or not) intrinsic predictability may exist is to check whether (or not) the evolution of an epidemic wave can be described by a specific ''pattern'' (i.e., a specific curve). If so, we can then further check whether (or not) one or more control parameters are found within a curve. For example, based on statistical observations (e.g., Figure 1 of Kermack and McKendrick 1927 [2]), the time evolution of infected (or dead) persons in some epidemic waves can be described by a solitary wave with a bell-shaped curve. Therefore, asking whether (or not) such a solitary wave can be found in other epidemics (e.g., the COVID-19 epidemic) is natural. If found, efforts can then be made to identify parameters (or factors) that determine the shapes of waves. To achieve such a goal, Paxson and Shen (2022) [9] previously derived the KdV-SIR equation, solved the equation for its analytical solution, and derived an energy equation in order to identify control parameters (e.g., the time independent relative growth rate, to be discussed) for idealized epidemic waves. In this study, the validity of such analytical solutions for depicting the overall shape of COVID-19 epidemic waves using USA and world data is further examined.
Second, as discussed in Paxson and Shen (2022) [9], although the KdV-SIR equation and the non-dissipative Lorenz model are mathematically identical, SIR-type models are basically non-chaotic. Challenges regarding accurate predictions in non-chaotic and chaotic systems are different. For this study, we apply the KdV-SIR equation and its analytical solutions in order to illustrate the dependence of prediction horizons on initial conditions and system parameters. Such sensitivities of solutions are then compared to the sensitive dependence of solutions on initial conditions (SDIC), known as the butterfly effect, within chaotic systems (e.g., Shen et al. 2022a, b, c [10-12]).
As a baseline for sensitivity comparisons within the Lorenz and SIR models below, we briefly review major findings of SDIC and butterfly effects in Lorenz's studies (Lorenz 1963a [4], 1972 [5], 1969 [13]). The pioneering study by Prof.  proposed a very elegant, but profound, set of three, first-order ODEs that were applied in order to rediscover the SDIC based on Poincare (1890 [14], 1908 [15]). SDIC and solution boundedness are two major features within a chaotic system (Lorenz 1963a [4]; Jordan and Smith 2007 [16]; Shen 2019a [17]). The SDIC suggests that a tiny perturbation in the initial condition may lead to a very different time evolution for the solution. For example, when considering control and parallel runs, obtained using the same model and parameters containing slightly different initial conditions, the two runs produce almost identical results during an initial period of time, but significantly different results after a finite period of time (e.g., Figure 2 of Shen 2019a [17]). In solutions, boundedness is indicated by the finite size of a butterfly's wings. From the perspective of prediction, key points derived from the SDIC within the Lorenz 1963a [4] model are as follows: predictability is finite and predictability cannot be continuously improved using continual improvement in the initial conditions (for state variables).
The Lorenz (1963a) [4] study and an influential presentation by Lorenz (1972 [5]) entitled ''Predictability: Does the Flap of a Butterfly's Wings in Brazil Set Off a Tornado in Texas?'' are widely accepted as the foundations for chaos theory. Chaos theory has been viewed as one of the three major scientific discoveries of the 20th century, along with relativity and quantum mechanics. Although some researchers have interchangeably cited the 1963a [4] and 1972 [5] studies in regards to the butterfly effect and predictability, recent studies suggest that the original butterfly effect, within the Lorenz (1963a [4]) study, and the metaphorical butterfly effect, within the Lorenz (1972 [5]) study, referred to the first and second kinds of butterfly effects, respectively, are different (Shen 2014 [8]; Shen et al. 2022a,b [10]). The former is defined by the SDIC and has been widely examined in nonlinear dynamics, while the latter has been applied to emphasize the enabling role of a tiny perturbation in creating a tornado. In this study, to simplify discussions, we only consider the 1st kind of butterfly effect (i.e., SDIC) within the Lorenz 1963a [4] model for our comparison of sensitivities within the KdV-SIR equation. All existing tools cannot address issues surrounding the 2nd kind of butterfly effect.
Third, the sensitivities of prediction to changes in model systems (e.g., different mathematical formulas or different values of parameters) and to changes in the initial values of state variables should be separately considered. Here, we briefly review related discussions in Lorenz's (1993 [7]) book entitled ''The Essence of Chaos'' that further discussed the meaning of chaos and SDIC. On page 23 of the book, Lorenz defined interior and exterior changes as changes in initial ''conditions'' that can or cannot alter fundamental features (i.e., the socalled virtual constants in his book) of a system. For example, interior changes can be made by perturbing the initial values of one or more state variables, while exterior changes are achieved by specifying a different value of parameters or changing mathematical equations.
Based on the above classification, the SDIC indicates a sensitivity dependence on interior changes. On the other hand, a sensitivity to exterior changes does not imply chaos. Based on the above discussions, three major categories of predictability dependence can be identified within various SIR-type systems, including a dependence on: (1) initial errors for state variables; (2) estimated (or pre-selected) values of parameters such as the transmission rate, the recovered rate, and/or the basic reproduction number; and (3) the assumption and complexities of epidemic models (e.g., whether a confinement strategy or time varying parameters are applied). Thus, to address the question of whether (or how) the turning point and the end of an expanding epidemic can be precisely forecast for an ongoing event, all three dependences should be examined. In this study, we simply illustrate the first two types of sensitivities using one control parameter (i.e., a time independent relative growth rate) derived from the KdV-SIR equation.
As first suggested by Paxson and Shen (2022) [9], the KdV-SIR equation makes it convenient to apply approaches in nonlinear dynamics for illustrating the characteristics of an epidemic wave and studying its predictability. Over the past several decades, various kinds of bifurcations, such as pitchfork, trans-critical, Hopf, Neimark-Sacker (secondary Hopf), Period-doubling (flip), and/or fold (Saddle-node) bifurcations, have been proposed and analyzed [16,18]. Recently, numerous studies have investigated bifurcation analyses using various methods and applications, including the following models: (1) the Planar discrete-time Hindmarsh-Rose oscillator, (2) the Generalized perturbed KdV equation, (3) the Kopel model with nonsymmetric response between oligopolists, (4) the Lotka-Volterra prey-predator model, and (5) the Generalized Kopel triopoly model [19][20][21][22][23]. While future studies will apply methods for analysis of the KdV-SIR equation, this study focuses on the application of the KdV-SIR equation and its analytical solutions for COVID-19 data analysis. As compared to the exact analytical solution of the SIR model obtained in a parametric form in [24], our system and solutions have an advantage of simplicity for illustrating the fundamental dynamics of an epidemic wave, and the relationship between basic and time varying reproduction numbers [9].
Using time-varying networks [25], past research has investigated the impact of modular and temporal connectivity patterns on the spread of epidemics. A community or module is formed when there are more internal connections within a community than external connections between communities. In simpler terms, a community is a subgraph with more links within itself than outside it [26]. A network is considered time-varying when the time scales for both the network evolution and the spreading process are comparable. In addition, models based on scale-free and small-world networks have been used to study the transmission of recent diseases such as SARS through travelers moving between cities [27]. Large-scale numerical simulations of SIS (susceptible infected-susceptible)-based models were conducted by Ferreira et al. [28] in order to determine the effective threshold for systems with a finite size. While network-based approaches that produced promising results may be considered in future studies, this study focuses on the spread of epidemics within a community.
In this paper, the KdV-SIR equation is first introduced, and its analytical solutions and the energy equation are presented as a mechanism for analyzing multiple epidemic waves using COVID-19 data in the USA (and the world). The feasibility of applying the simple mathematical formula for prediction is discussed. The paper is organized as follows. Section 2 introduces the classical SIR model, the KdV-SIR equation, its analytical solutions, and the energy equation. We then propose a methodology for the analysis of COVID-19 data, which includes the removal of high-frequency signals using 14-or 28-day running means, Fourier transforms, the empirical mode decomposition (EMD), and a curve fitting software package. Section 3 provides an analysis of COVID-19 data obtained from the USA and the world. We discuss the role of an initial relative growth rate for determining predictability horizons and for depicting the shape (e.g., the width) of epidemic waves at different periods of time. A methodology for prediction is proposed. Concluding remarks are provided at the end. The Appendix provides derivations of the KdV equation in a traveling-wave coordinate for comparison with the KdV-SIR equation.

The SIR model
The well-known epidemic model, proposed by Kermack and McKendrick (1927) [2] over ninety-five years ago, is written as follows: = . (3) The model contains three state variables and three positive parameters. State variables, , , and , represent susceptible, infected, and recovered individuals, respectively. Three parameters, , , and , denote a transmission rate, a recovery rate, and a fixed population ( = + + ), respectively. Eqs. (1) to (3) are referred to as the SIR model. A sum of Eqs. (1) to (3) yields the following conservative law: In this study, (0) and 0 are interchangeably used to represent the initial values of . Such a convention is applied to (0) and . However, while (0) represents an initial value of , , as discussed below, is reserved for the so-called basic reproduction number. Earlier studies (e.g., Strogatz 2015 [18]) reported that Eqs. (1) to (3) can be transformed into the following single, 1st-order ordinary differential equation (ODE) for the state variable : To derive the so-called KdV-SIR equation that describes the evolution of infected people, a recent study by Paxson and Shen (2022) [9] applied a weakly nonlinear assumption in Eq. (5). Below, the KdV-SIR equation and its analytical solutions are reviewed.

The KdV-SIR equation and its analytical solution
To illustrate the fundamental dynamics and predictability of epidemic waves, here, we consider a condition of 0 ≈ , and define the following two important parameters: ( -naught), that was derived from equations (e.g., Paxson and Shen, 2022 [9]), represents the basic reproduction number. Within the scientific literature, a simpler version of the basic reproduction number is defined as * = , which is a function of two parameters and . Thus, unless otherwise stated, for different cases, we keep the ratio 0 ∕ to have a fixed ratio between and * . While is used in calculations of solutions, the value of * is listed in the title of each figure. The assumption 0 ≈ implies a very small number of infected and recovered persons at the beginning of an epidemic. For this study, we chose = 10,000 and = 9, 900. In addition to , ( -naught), that represents a time-independent relative growth rate, is introduced as a control parameter. In Section 3, the role of is illustrated using real data.
As discussed in Paxson and Shen (2022) [9], Eq. (5), along with a weakly nonlinear assumption and the introduction of Eq. (6a) and Eq. (6b), yields: and The above 2nd-order ODE for , derived from the SIR model, is referred as the KdV-SIR equation; and its analytical solution, of a hyperbolic secant squared function, is called a solitary epidemic wave. A simple method for verifying the derivation of Eq. (7) is as follows: the integral of Eq. (7b) gives a hyperbolic tangent function that is a solution of Eq. (5) under a weakly nonlinear assumption (e.g., Anderson and May 1991 [29]; Keeling and Rohani 2008 [30]; Paxson and Shen 2022 [9]). The same mathematical form between the KdV-SIR equation and the KdV equation in the traveling wave coordinate is briefly summarized in the Appendix. To our knowledge, the KdV-SIR equation for variable (i.e., Eq. (7)) was first documented by Paxson and Shen (2022) [9] for studying epidemic waves. Below, both the ODE and its analytical solution are analyzed in order to propose a method for real data analysis.
From Eqs. (3) and (7), we can determine using: From 7, we can compute as follows: = − − . Fig. 1 displays analytical solutions of and , as well as . The time evolution for infected persons appears as a single, isolated wave crest and is monotonic on both sides of the crest. Stated alternatively, the curve contains one peak surrounded by two long tails, each on one side of the peak. On the other hand, as a result of the weakly nonlinear assumption, as discussed in Paxson and Shen (2022b) [31], the solution of displays a non-monotonic feature, an outcome which does not appear within the classical SIR model. This issue may be viewed as one of the weaknesses of this approach. Fortunately, state variables and were the focus in this study.

A formula for real data analysis
To propose a method for analyzing real world data, we first re-write Eq. (7b) as follows: Here, the right hand side indicates that the solution is determined by three parameters, , , and . Eq. (9) enables the application of a limited number of data points for estimating the parameter in order to construct a solution curve. Below, such a capability is first illustrated by showing the relationship between the time independent growth rate, , and time varying growth rates, ( ). For real-world data analysis, the availability of data may give the following two scenarios: (1) sufficient data points that can statistically determine all of the three (or two) parameters using optimization methods, and (2) limited data points that can help determine or estimate the time-independent growth rate. Activities in the first scenario are referred to as ''analysis'', while those in the second scenario are called ''forecasts''. The first scenario was performed using curve fitting methods. To make predictions, the 2nd scenario is our focus. Thus, our method is fundamentally different from typical curve fitting methods. Below, the two scenarios are discussed.

Predictability horizons
To make predictions using SIR-type models (e.g., Eqs. (1) to (3)), in addition to accurate initial conditions, accurate estimates of the two parameters (i.e., and ) are often required. By comparison, predictions using Eq. (9) can be made when an initial (i.e., = at = ), , and are known. For example, the values of = , , and yield: The above represents a predictability horizon. From the perspective of solution evolution, the ''parameter'' acts as a phase lag in Eq. (9). When the three parameters, , , and are known, Eq. (9) can describe the time evolution of at any time. In reality, for on-going epidemic events, daily or weekly reports of infected persons may provide initial conditions for the state variable . Thus, in order to apply Eq. (10) for constructing a solution curve, a crucial step is to determine the two parameters, and . As discussed in Section 3, such a goal can be achieved using least square methods that require sufficient data points. Below, using data only available for a finite period of time, we provide simple mathematical equations for determining .

The relationship between time-varying and independent growth rates:
vs.
An equation for the conservation of total energy ( ) is written as follows: Here, 1 is an integration constant, representing an initial . Eq. (11) was derived by multiplying both sides of Eq. (7a) by ′ and by performing a time integration. The dependence of various types of solutions on the initial is revealed by the contour lines of the TE in Fig. 2, showing oscillatory solutions (in blue), a homoclinic orbit (in green), and unbounded solutions (in red) within the − ' phase space.
By considering 1 = 0 in Eq. (11), we can obtain the following equation: The left hand side represents a time varying relative growth rate squared ( ( ) 2 ), while the right hand side is a function of the time independent growth rate ( ). Given a constant value of , ( ) 2 is a linear function of and monotonically decreases with time for a solitary wave solution (before the peak). Stated alternatively, the time independent represents an initial growth rate for infected persons whose solution can be simply approximated by an exponential function of time (i.e., ). During the intermediate stage, the local growth rate for infected persons gradually decreases with time and becomes zero at the peak for newly infected persons. While Eq. (12) was derived from the energy equation, it can also be directly derived using the derivative of the analytical solution in Eq. (7b), yielding: After taking the square of both sides of Eq. (13) and applying the identity Based on data obtained during an initial period of time, ( ) 2 can be determined using a finite difference method, as follows: Here, ( ) = ( ), is an integer, and is a time step, Thus, by using Eqs. (12) and (14), we can estimate 2 , as follows: When an idealized solitary wave solution of Eq. (7b) is applied in Eq. (14), the estimates of 2 using Eq. (15) should produce a constant.
As an illustration, estimated values of at three different times, including at the ''initial'' point (for = 0), at the half width point where = (1∕2) , and at the inflection-turning point where = (2∕3) , are: As discussed above, the right hand side in Eq. (15) is first computed using the finite difference equation in Eq. (14) and real data. In reality, the reflection-turning point (i.e., at = 2 ∕3) may be determined  when ′′ = 0 (i.e., by monitoring when ′ changes a sign). However, the half width point can only be determined during a post-process 1 .

Methods for raw data processing
The above method applies real data for estimating the time independent growth rate ( ) in order to construct a solitary wave solution in the form of Eq. (9). Considering a real world scenario, when an outbreak begins, reports for the number of daily infected persons become available. Such data are referred to as ''raw'' data and often contain high frequency fluctuations. In this study, we removed high frequency fluctuations by applying a 14-day or 28-day running mean, the Fast Fourier transform (FFT), and the empirical mode decomposition (EMD) (e.g., Huang et al. 1998 [35]). Processed data without high frequency fluctuations are referred to as ''filtered data''. For realtime prediction, local analysis methods such as a running 1 In a separate paper (e.g., Paxson and Shen 2022 [9]), a half width point was selected in order to study the skewness of an asymmetric wave. For example, = ∕2 yields two points, one on the left hand side and the other on the right hand side of . The time interval between each of the data points and the peak point is different for an asymmetric wave. As a result, the ratio of the two time intervals indicates skewness of the wave. mean and the EMD method are preferred. Since methods employing running mean and FFT are fairly common, below, we only provide a review of the EMD method. An analysis of processed data using all three methods is presented in Section 3.

The empirical mode decomposition (EMD) method
Here, we briefly discuss how the EMD is applied in order to extract Intrinsic Mode Functions (IMFs) from the time series of raw ''Data'' denoted as ( )) and how to construct ''filtered data'' by computing a sum of selected IMFs with or without the residual of data. We use the notation ( ) with = 0, 1, ⋯ to represent a time series of the variable ( ) within a domain of [0, ]. An element ( ) is the value of at = . Here is a time increment, defined as ∕ . Each IMF is also a time series over the same time domain. The th IMF is denoted as ( ) with a subscript '' ''. A residual of data, ( ), is defined as the difference between the raw data and the sum of '' '' selected IMFs. The subscript '' '' indicates the number of selected IMFs. For example, when the first IMFs are selected, the corresponding residual ( ) becomes: which is based on the property that the raw data can be represented by the linear combinations of IMFs and the residual: Next, we introduce the process of applying the EMD for determining IMFs. The following notation is adopted: Therefore, initially, no IMFs exist and we have ( ) = ( ) from Eq. (17). Input data ( ) were obtained from the ''remaining'' data once the '' ''th IMF was determined and removed. Thus, the initial input data ( ) = ( ) and, thus, ( ) = ( ) as = 0. So-called shifting processes were applied, as follows:  Step 7 were repeated 10 times). Based on the above sifting processes, IMFs possess the following features: IMFs are symmetric with respect to the local zero mean; the number of zero crossings and the number of extrema differ by at most one; the IMFs are still time domain functions that represent local variability of the original signal at a particular range of frequencies (wavelengths). Mathematically, a strict definition of the IMF is as follows: ( ) + ( ) = 0 (Wang et al. 2020 [36]), namely its local means are zero.
Using the above processes, IMFs and residual functions are defined at the same grid points as those for the raw data. The residual, , represents differences between the raw data and the sum of the first IMFs. By applying Eq. (18), the raw data ( ) can also be represented by the first ( + 1) IMFs and +1 , as follows: Eqs. (18) and (19) lead to: Since +1 ( ) is purely oscillatory with respect to +1 ( ), we can consider +1 ( ) as the local mean of ( ). Thus, EMD is a ''Reynolds type'' decomposition for sifting out (or extracting) periodic components from the data by separating the local mean from fluctuations using spline fits (Huang et al. 1998 [32]).

Procedures for constructing a solution curve
As we know, curve fitting methods can be applied for the ''analysis'' of an epidemic event (an epidemic wave). In contrast to the analysis, a method for the ''prediction'' of an on-going epidemic solitary wave is proposed. As a prototype with some simplifications to provide a proof  of concept, our method is based on direct application of the solution in Eq. (9) and the crucial step for determining the time independent growth rate ( ) using Eqs. (14) and (15). The procedures for real data applications are summarized as follows: 1. We applied the methods of running means, FFT, and EMD in order to obtain ''filtered data''. The first method has the potential for performing predictions, while the 2nd and 3rd methods are applied for verification. 2. We applied filtered data in order to compute time varying growth rates ( ( )) using Eq. (14) and estimate the timeindependent relative growth rate ( ) using Eq. (15). 3. We assumed one or several hypothetical values of (for forecasts) or determined using real data (for hind-casts). (2) and (3), can be predicted (i.e., ''forecasted'') based on Eq. (10). 5. The three parameters were used to construct the solitary wave solution using Eq. (9).

Given an initial value of and the values of and in Steps
Given a specific solitary wave solution described by Eq. (9), should be a constant. However, for real data analysis, estimated may be time varying. For such a situation, all of the estimated in Step 2 may be averaged over a period of time to represent the time independent growth rate ( ). For curve fitting using all of the data, an initial guess of is determined from filtered data. Then, a nonlinear least squares method is applied to determine and .

Discussion
A preliminary verification of the aforementioned methods along with mathematical formulas was conducted using the following approach by Paxson and Shen (2022) [9]. Numerical solutions obtained from the full SIR model with three different values of * = 1.8, 3.0, and 5 were used as ''filtered data''. We then applied the filtered data in order to estimate the values of which is time independent for an ideal solitary wave. As shown in Table 2 of Paxson and Shen (2022) [9], agreement between all estimated and ''true'' values suggested that filtered data for the above three cases can be approximated using the solitary wave solution in Eq. (9) when a realistic estimate of is obtained using Eq. (15). By plugging each estimated in Eq. (9) for constructing a solution, we determined that constructed solutions are realistic prior to the peak of epidemic waves. On the other hand, when an asymmetric epidemic wave appears at a larger (i.e., * = 5), our method produces larger errors following the wave peak. Detailed results can be found in Figure 7 of Paxson and Shen (2022) [9]. In this study, two terms, filtered data and smooth data, are used interchangeably. Below, we present an analysis of COVID-19 data in the USA, while an additional analysis for world data is included in Appendix of Paxson (2022) [37].

An overview of filtered and idealized data
Here, we briefly discuss the properties of COVID-19 raw data in the USA and the world, available from the following GitHub repository: https://github.com/CSSEGISandData/COVID-19/raw/master/css e_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirm ed_global.csv For COVID-19 data of infected persons in the USA, the number of COVID-19 daily data is 543, yielding a total period of time of = 543 days. When FFT was applied, that requires to meet a sampling theorem, a spectrum had a frequency interval of = 1∕543. Raw data and spectra are provided in Fig. 3. Low-pass filters for a minimal period of 7, 14, and 28 days were applied in order to remove highfrequency signals. We kept the first 20, 39, and 78 Fourier modes and performed an inverse FFT to obtain filtered data. Fig. 4a provides the corresponding filtered data.  For the three sets of data, as expected, data that contained periods of 28 days or larger were the smoothest. Since our study focused on prediction using the analytical expressions of Eqs. (9) and (15), such smooth data were an ideal choice for the first step.
As additional support, the EMD method was also applied in order to remove high-frequency signals for obtaining filtered data. Based on a comparison of the spectral analysis of raw data and each of the IMFs, the sum of IMFs 3 to 8 produced filtered data, as shown in Fig. 4b, comparable to data found in Fig. 4a.
Compared to the filtered data provided in Fig. 4a and b, the 28 day running means, as shown in Fig. 4c, were more applicable for realworld, on-going events. A major difference amongst the three sets of data was that running means could be performed ''locally'', making is possible for real-time predictions.
In addition to filtered data obtained using the EMD and running means methods, we created the 4th type of filtered data as idealized data for verification, as follows. Using the analytical solitary wave solution in Eq. (9), we first obtained an initial guess of parameter by using the filtered data in Fig. 4, and then determined the two parameters, and by fitting the curve to raw data. A final value of was computed by plugging and into Eq. (10). The three parameters were then used to construct a solution curve in Eq. (9), referred to as a fitted curve. As discussed, such a fitted curve was used to ''generate'' idealized data for verification. Fig. 5 provides raw data and the corresponding fitted curves. We roughly define the simplicity of data as the degree to which data could be represented using the mathematical form in Eq. (9). Based on the simplicity of the above smooth data, data were analyzed in the following order: (1) ''idealized data'' obtained from fitted curves; (2) smooth data obtained using the EMD method; and (3) filtered data obtained using 28-day running means. The three types of data are referred to as idealized data, EMD-based smooth data, and running mean (or moving average) data, respectively. Table 1 lists the last two types of data and the corresponding estimates. Since data obtained using a low pass filter with a cutoff frequency of (1/2419200) Hz (i.e., the smallest period of 28 days) and a 28-day running mean method were comparable, in Fig. 4, data obtained using the low pass filter are provided as a comparison.

A verification using idealized data
Using various kinds of ''idealized data'', a proof of concept was proposed and verified. For example, as previously mentioned, Paxson and Shen (2022) [9] used numerical solutions of the full SIR model as idealized data and then applied the procedures described in Section 2.8 in order to estimate predictability horizons. For this work, we applied fitted curves to produce idealized data at different resolutions for such predictions. Fig. 5 displays the idealized solution curves for Waves 1, 3, and 4 within the USA. As discussed, these fitted curves were constructed by fitting a mathematical expression (e.q., Eq. (9)) to the raw data. Therefore, fitted curves are capable of providing data at any fine temporal grid spacing. As a result, the fitted curves are ideal for examining the dependence of results on grid spacing. Three panels in Fig. 6 display the − diagram for all three fitted curves by using three different temporal resolutions of 1 day, 1∕4 days, and 1∕24 days. The first resolution is the original resolution of the raw data. The second one, with a 6 h resolution, has been used in conventional weather observations. In theory, as suggested by an analysis of Eqs. (10) Based on estimates for in Fig. 6, we then constructed solution curves. Estimates at the ''raw'' resolution of one day were considered.
Due to coarse resolution, the estimated varied with time. As an illustration, below, we consider the estimated at three specific times, the initial position, the half width point, and the turning point (i.e., = 2 ∕3), respectively, which mimics a real scenario for an on-going event that continuously produces new data, yielding different estimates for ''local'' . In Fig. 7  be re-written as follows: Here, ( ) represents a ( ) for each of the above three times. Each (i.e., ) is computed using Eqs. (16a), (16b), or Eq. (16c). By plugging estimated values of and an assumed into Eq. (21), we obtained . As shown in Fig. 7, from Eq. (9), solution curves were constructed. Due to a ''coarse'' resolution, yielding slightly various values of at different times, the corresponding constructed solution curves displayed relatively small differences. Since an estimated value of is smaller at a point closer to the peak (e.g., at a turning point), a slower growth rate produces smaller values prior to the peak. An estimated peak time has a larger time lag (i.e., appearing at a later time). Here, we emphasize that simply based on Eq. (9) the idealized wave only contains one epidemic wave, that is used as a baseline for comparisons with results from more realistic data. Results obtained using ''idealized data'' from Fig. 7 (as well as Figure 7 of Paxson and Shen 2022 [9]) provide proof of concept, including the accurate implementation of codes for calculations of Eqs. (9) and (15). Below, the dependence of solution curves on estimates of are further discussed using less simple data (i.e., EMD-based smooth data and 28-day running means).

Estimates of predictability horizons using EMD-based smooth data
Our method for prediction is based on analytical expression in Eqs. (9) and (15). The key feature of this method is as follows: Eq. (15) estimates the time independent growth rate using any local time varying growth rate ( ), and Eq. (9) applies the estimated to both determine a prediction horizon (i.e., in our study) and to construct a solution curve. As discussed above as well as in Paxson and Shen (2022) [9], our method may work fine when estimates of at different times do not display large temporal variations. Below, we consider cases that possess large variances in in order to pave the way for real data application. We first apply our method to EMD-based smooth data for predictions. As compared to idealized data, EMDbased smooth data include different temporal fluctuations for local growth rates. As shown in Fig. 8, local growth rates ( ( )) yield various estimates for at different times. For estimated , we assume that one of the values represents the dominant mode of an epidemic wave, Table 2 The ensemble run estimated 0 range and statistics for US EMD Wave 1.
was estimated using Eq. (10) (Eq. (21)), while represents the time when the maximum of filtered data appears. The term ''Ref diff'' indicates a relative error for an estimated predictability horizon, defined as | − | .
(a) Data plotted in Fig. 8(a). (c) Data plotted in Fig. 8(a). and that the remaining values include perturbations associated with errors and other factors (e.g., confinement restrictions). Results determined using each of the estimated values of contribute an ensemble member.
Ensemble methods (e.g., the dynamic Monte Carlo approach, DMC, [38]) have been applied to study the dependence of epidemiological infection spreading on different values of the risk factor RF, which is associated with the probability (defined as 1 -RF) for infected persons to be recovered. In this study, different ensemble members have different values of . Then, the ensemble of predicted results (e.g., the predicted values of ) are analyzed in order to provide an indication of the range of possible outcomes. Namely, a set of predictions from W. Paxson and B.-W. Shen Table 3 The ensemble run estimated 0 range and statistics for US EMD Wave 3.
(a) Data plotted in Fig. 8 (b) Data plotted in Fig. 8 all ensemble members is first computed and, then, the statistics of predicted values of from ensemble predictions are computed. Idealized data that contain a single epidemic wave can be depicted using a ''single'' time independent growth rate. However, as discussed using idealized data (e.g., Fig. 6), variations in estimates of may appear as a result of coarse time resolution. Additionally, since all data for the entire period were used to obtain intrinsic mode functions, a specific wave (e.g., Wave 3 or Wave 4 in Fig. 4) identified using EMDbased smooth data may, indeed, contain two overlapped waves (e.g., an overlap of the right tail of Wave 2 and the left tail of Wave 3 in Fig. 4). As a result, time variations for local growth rates ( ( ) as well as ) make it complicated to apply our formulas for predictions.
Before presenting statistics for the ensemble of predictions, similar to the previous case, we analyzed solution curves for three ensemble members at three different times, as shown in Fig. 9. As expected from the analysis of in Fig. 8, the ensemble runs produced larger variances for the estimated values of and their solution curves as compared to ensemble runs using idealized data, as shown in Fig. 7.
In summary, idealized data were first used to verify the implementation of codes for calculations using Eqs. (9) and (15). When EMD-based smooth data were used as less idealized data that produces temporal variations for estimates of , we used various estimated values of in order to perform ensemble predictions. While, in reality, variances in may appear for different reasons, such fluctuations arise from W. Paxson and B.-W. Shen Table 4 The ensemble run estimated 0 range and statistics for US EMD Wave 4.
(a) Data plotted in Fig. 8 unavoidable variations in local time varying growth rates driven by the collective impact of many factors (e.g., confinement restrictions). Our simple ensemble prediction method did not attempt to reveal the impact of individual factors on real epidemic events. On the other hand, at this point, we provide a specific reason for variances in EMD-based smoothed data. During the sifting process for determining intrinsic mode functions, (as discussed in Section 2.7), a spline method was used for constructing the upper and low envelopes. Usage of the spline method caused variations in estimated values for ( ) and, thus, . As a result, ensemble runs applying the EMD-based smoothed data are provided to illustrate the ''spread'' of predictions. Their statistics are not discussed. By comparison, when 28-day running mean data are used in ensemble runs, the statistics of ensemble predictions are analyzed.

Estimates of predictability horizons using 28-day running mean data
All of the above exercises were performed in order to apply our method for analyzing running mean data. Here, as shown in Fig. 4c, we focus on 28-day running means for COVID-19 data within the USA. A zoomed-in view of Waves 1, 3, and 4 from the same data is provided in Fig. 10. For each of the three waves, we first applied the filtered data in order to compute local growth rates ( ( )) using the finite difference method in Eq. (14). We then plugged local growth rates into Eq. (15) to obtain estimates of 0 . As shown in Fig. 11, estimates of 0 significantly varied with time. As compared to cases obtained using idealized data and EMD-based smoothed data, in Fig. 10, we provide solution curves constructed at three different times. Compared to the previous cases, larger variations in the estimates of 0 produced a larger spread among solution curves. Additionally, since the first local maximum for each wave was used as for predictions, larger discrepancies appeared when a specific wave may have had two local maxima (e.g., Wave 1 in Fig. 10a). On the other hand, the key point of our method for real world applications is as follows: when daily reports for new infected cases are available, yielding estimates of 0 , our method can be applied to determine when a given may appear. By recognizing the intrinsic properties of the epidemic waves and practical reasons that result in time varying 0 , ensemble predictions can be applied to provide a possible time interval for the occurrence of a given . Below, we provide an illustration.
To produce consistent predictions, an additional step was applied for processing the 28-day running mean data. We removed large variations for estimated values of 0 by computing local means, as follows. The upper and lower envelopes for estimated values of 0 were first computed and then averaged in order to determine local means, yielding a smooth curve for estimated 0 values, as shown in Fig. 10. Such a procedure was derived from Steps 4 and 5 of the sifting processes Table 5 The ensemble run estimated 0 range and statistics for US 28 day mean Wave 1.
(a) Data plotted in Fig. 13 A smoother curve can further be obtained by repeating the sifting processes, which, for now, are skipped. The local mean curves in Fig. 11 are used for ensemble predictions. Fig. 12 displays the constructed curves based on the selected range of 0 from the ensemble runs in Fig. 13. Box-Whisker plots for the range of estimated 0 and from the ensemble runs are provided in Figs. 14 and 15, respectively. Although estimated (i.e., predicted) values display variances, similar to ensemble weather predictions, we believe that advanced analysis methods can produce insightful information, which is the subject of a future study.

Concluding remarks
''Prediction is very difficult, especially if it's about the future''. -Niels Bohr (Physics Nobel Prize 1922). Since the COVID-19 outbreak in 2019, significant effort has been undertaken in order to improve predictions for the spread and evolution of epidemics that caused significant societal and economic impacts. To illustrate the fundamental dynamics of an epidemic wave, including the evolution of solitary wave solutions and the meaning of the basic reproduction number, we derived a nonlinear, 2nd-order ordinary differential equation (ODE) for the time evolution of infected persons   [9]). Since it represents the classical SIR model under a weakly nonlinear assumption and shares the same mathematical form as the KdV equation in the traveling wave coordinate, the 2nd order ODE has been referred to as the KdV-SIR equation. The KdV-SIR equation, as well as its critical points and stability, was analyzed and applied for revealing the dynamics of solitary solutions. Our study provides a proof of concept for applying the analytical solution of the KdV-SIR equation to COVID-19 data in order to estimate predictability horizons for solitary epidemic waves (i.e., the peak times of epidemic waves).
The validity of our method relies on whether (or not) a real world epidemic event or its major portion (i.e., its envelope) can be represented using an idealized epidemic wave. The analytical solution of the idealized epidemic wave in Eq. (9) is described by a hyperbolic secant squared function with three parameters, , , and , representing the maximum for the infected persons, the time for the appearance of , and the time independent growth rate, respectively. The energy equation derived from the KdV-SIR model yields an analytical expression for the relationship between the time independent growth rate and time varying growth rates (e.g., Eq. (15)). Based on mathematical formulas, illustrating the dependence of solitary wave solutions on the initial conditions and the system's parameter (i.e., an estimated growth rate) becomes possible. As a result of oscillatory components near the peak (i.e., the existence of a turning point), errors associated with perturbations in initial conditions and/or the system parameter (i.e., ) do not grow without a limit.
When a report for real cases is available for determining a local growth rate at a specific time, Eq. (15)) can be applied for estimating the time independent growth rate ( ) and, thus, for constructing a solution curve using Eq. (9) for the prediction of a time horizon (i.e., an estimated value of peak time). In this study, three kinds of data were Table 6 The ensemble run estimated 0 range and statistics for US 28 day mean Wave 3.
(a) Data plotted in Fig. 13 (c) Data plotted in Fig. 13 Table 7 The ensemble run estimated 0 range and statistics for US 28 day mean Wave 4.
(a) Data plotted in Fig. 13  generated from COVID-19 raw data, based on the following three methods: (1) a curve fitting package, (2) the EMD method, and (3) a 28-day running mean method, referred to as idealized data, EMDbased smoothed data, and running mean data, respectively. All of the data were ''comparable'' to filtered data obtained using a low-pass filter with a cutoff frequency of 1/(28 days) (e.g., Fig. 4). Idealized data were first applied in order to verify the performance of implemented codes in related calculations. As expected and illustrated, when temporal resolution was sufficiently high, our method produced nearly perfect predictions. However, errors may appear in association with the coarse temporal resolution of raw data, suggesting the importance of acquiring COVID-19 data at a high temporal resolution. The same procedures were then applied for EMD-based smooth data. As a result of variances in estimates of , we additionally performed ensemble predictions using different estimated to produce the statistics of predicted time horizons. When the aforementioned procedures were applied to 28-day running means, large variances in estimated values were smoothed by averaging the upper and low envelopes for estimated in order to obtain local means. The local means for estimated values were then used for ensemble predictions. As A box-Whisker plot for the range of estimated 0 ( ), based on ensemble runs, using the 28-day Running Mean (red) and EMD-based smooth (blue) data. Panels (a)-(c) provide results for US Wave 1, 3, and 4, respectively. Note that there is no 0 available from the time interval between the turning point and the peak for EMD-based smooth data in panel (a). There is also no 0 for the time interval between the half width point and the peak time for EMD-based smooth data in panel (b). In plots, outliers are indicated with hollow circles.
summarized in Tables 2-7, variances in estimates of produced large variances in estimates of peak times. However, as compared to other methods that require the estimates of two parameters, and , our method with parameter , which can be determined from daily reports of infected persons, can be used alongside other methods in order to provide multiple ensemble forecasts. The methodology required for deriving ''useful'' results from multiple ensemble forecasts is the subject of a future study.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability
Data will be made available on request.  where is a parameter and is an integration constant. The constant is determined to be = 0 if a solution ( ( )) and its derivative ( ( )∕ ) approach 0 when → ±∞. As a result, Eq. (A.2) of the KdV equation is also identical to our KdV-SIR equation, Eq. (7a).