Model selection among growth curve models that have the same number of parameters

A model selection method was proposed to determine the most appropriate model among growth curve models that have the same number of parameters. It uses a measure of mean relative squared error and regression equations derived from difference equations for the growth curve models. The difference equations have exact solutions that lie on the exact solutions of the differential equations that define the growth curve models. The regression equations derived from the difference equations perfectly reproduce the parameter values. The proposed method selects the appropriate model when the data lie on an exact solution of a differential equation. It was verified to be practical with six actual datasets, with the Gompertz and logistic curve models, which are often used for forecasting, as the alternative growth curve models.

Subjects: Agriculture; Mathematical Modeling; Non-Linear Systems; Multivariate Statistics; Statistics for Business, Finance & Economics; Software Engineering & Systems Development; Reliability & Risk Analysis; Management of Technology & Innovation; Sales


PUBLIC INTEREST STATEMENTS
Growth curve models such as the Gompertz and logistic models are often used for forecasting in various fields. Selecting an appropriate model is crucial for forecasting success because forecasting with an inappropriate model can result in seriously incorrect forecasts. However, growth curve models have tended to be selected on an ad hoc basis. In this paper, a model selection method was proposed to determine the most appropriate model among growth curve models that have the same number of parameters. It uses a measure of mean relative squared error and regression equations derived from difference equations for the growth curve models. The difference equations have exact solutions that lie on the exact solutions of the differential equations that define the growth curve models. The regression equations derived from the difference equations perfectly reproduce the parameter values. The proposed method selects the appropriate model when the data lie on an exact solution of a differential equation. It was verified to be practical with six actual datasets.

Introduction
Accurate parameter estimation is indispensable for accurate forecasting with a growth curve model. Inaccurate parameter estimation leads to inaccurate forecasting because estimating the parameters of a model is equivalent to forecasting with it. Accurate parameter estimation is accomplished with difference equations that have exact solutions (Satoh, 2000; Satoh, 2001; Yamada et al., 2002). These exact solutions lie on the exact solutions of the differential equations that define the growth curve models. Thus, the difference equations conserve the properties of the differential equations (Hirota, 1979; Hirota, 2000; Hirota & Takahashi, 2003; Satoh, 2000). We call such difference equations integrable difference equations. The integrable difference equations are easily applied to regression equations to obtain parameter estimates and have advantages over ordinary difference equations (Satoh, 2001). Furthermore, forecasting with the integrable difference equations yields accurate parameter estimates in the early stage (Satoh, 2000). Such applications of the integrable difference equations are called "applied discrete systems."
Accurate parameter estimation alone, however, is not enough to provide accurate forecasting. For accurate forecasting, an appropriate model must be selected and its parameters must be accurately estimated. Selecting an appropriate model is crucial for forecasting success (Martino, 2003) because forecasting with an inappropriate model can result in seriously incorrect forecasts (Chu, Wu, Kao, & Yen, 2009; Martino, 2003; Yamakawa, Rees, Salas, & Alva, 2013). However, growth curve models tend to be selected on an ad hoc basis (Chu et al., 2009; Yamakawa et al., 2013). There are only a few frameworks for model selection, though there have been several empirical studies (e.g., Meade & Islam, 1995).
The first method, which is based on the straightness of a linearized transformation (Young & Ord, 1989), assumes that the saturation level is known. However, this assumption does not generally hold.
The second method is based on fitness (Chu et al., 2009; Vieira & Hoffmann, 1977; Yamakawa et al., 2013; Young & Ord, 1989), which is usually used as a criterion for model selection. Unlike all the other methods, which select the more appropriate model from only the logistic and Gompertz curve models, the method proposed by Chu et al. (2009) selects the most appropriate model from three or more models. The fitting procedures use a nonlinear least squares method (Chu et al., 2009; Vieira & Hoffmann, 1977; Yamakawa et al., 2013). The nonlinear least squares method may converge slowly or fail to converge, and the resulting parameter estimates may not correspond to a global optimum.
The third method is based on statistical testing (Franses, 1994b; Nguimkeu, 2014). Franses (1994b) and Nguimkeu (2014) rearrange the Gompertz and logistic differential equations so that both yield the same term, which includes the differential terms. They replace the differential terms with forward difference terms and obtain a single regression equation. The common term, which now includes a difference term, is regarded as the dependent variable of the regression equation. They select the more appropriate model by testing whether the coefficient of a certain independent variable in the regression equation is zero. Testing based on a forward difference equation is generally different from testing based on the exact solution of a differential equation as a growth curve model because the dynamics of differential and difference equations generally differ. For example, solutions of forward and central difference equations can show chaotic behavior (Li & Yorke, 1975; Ushiki, 1982) and thus differ from the exact solution of the differential equation, which is a sigmoid function. Furthermore, these methods are specific to model selection between the Gompertz and logistic curve models.
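A small numerical sketch may make the contrast between difference schemes concrete. It assumes the logistic differential equation in the form dL/dt = (r/k) L (k - L) and compares a forward difference with a semi-implicit scheme whose right-hand side mixes indices n and n+1. The semi-implicit form used here is an illustrative assumption, not a transcription of the paper's equations, and the function names and the coarse step size are hypothetical choices.

```python
import numpy as np

def forward_difference(L0, k, r, delta, steps):
    """Forward difference: L[n+1] - L[n] = delta * r * L[n] * (1 - L[n] / k)."""
    values = [L0]
    for _ in range(steps):
        current = values[-1]
        values.append(current + delta * r * current * (1.0 - current / k))
    return np.array(values)

def semi_implicit(L0, k, r, delta, steps):
    """Assumed semi-implicit scheme:
    L[n+1] - L[n] = delta * r * L[n] * (1 - L[n+1] / k), solved for L[n+1]."""
    values = [L0]
    for _ in range(steps):
        current = values[-1]
        values.append((1.0 + delta * r) * current / (1.0 + delta * r * current / k))
    return np.array(values)

k, r, L0 = 100.0, 0.8, 0.1
delta = 3.5  # deliberately coarse time step, so delta * r = 2.8

fwd = forward_difference(L0, k, r, delta, 30)
imp = semi_implicit(L0, k, r, delta, 30)

print("forward difference, largest value:", fwd.max())
print("semi-implicit scheme, largest value:", imp.max())
```

With this coarse step the forward difference overshoots the saturation level and oscillates irregularly, whereas the semi-implicit sequence stays monotone and bounded by k, as a sigmoid curve should.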
The fourth method is based on the behavior of estimated parameters (Satoh & Matsumura, 2019). Satoh and Matsumura (2019) mathematically analyzed the saturation level estimated with the Gompertz curve model, as an inappropriate model, when the data lie on the exact solution of the logistic curve model, and proved that it strictly monotonically decreases as the data size increases, i.e., as time elapses. Also, Satoh (2019) mathematically analyzed the saturation level estimated with the logistic curve model, as an inappropriate model, when the data lie on the exact solution of the Gompertz curve model, proved that it strictly monotonically increases as the data size increases, and showed that this property is conserved for non-homogeneous Poisson process (NHPP) (Çinlar, 1975) data of the Gompertz curve model. As an application of these properties (Satoh & Matsumura, 2019), the behavior of the estimated saturation level helps us select the more appropriate model between the Gompertz and logistic curve models. This model selection method is also specific to these two models.
This paper proposes a model selection method to select among growth curve models that have the same number of parameters, which is not particular to the logistic and Gompertz curve models. It uses a measure of mean relative squared error and regression equations from integrable difference equations for growth curve models. The regression equations perfectly reproduce the parameter estimates.
The remainder of this paper is organized as follows. Section 2 proposes the model selection method. For comparison, regression equations based on central difference equations are introduced, too. Section 3 verifies the proposed model selection method with data of exact solutions of growth curve models. Section 4 verifies it with six actual datasets. Finally, Section 5 concludes the paper with a summary.

Model selection method
A model selection method based on goodness of fit is proposed for the case in which the alternative models have the same number of parameters in their regression equations. The proposed method uses a measure and integrable difference equations and is composed of the following steps: (1) prepare alternative growth curve models; (2) obtain integrable difference equations for the growth curve models; (3) obtain regression equations from the integrable difference equations; (4) obtain estimates of the parameters in the integrable difference equations through regression analysis; (5) obtain estimates of the data using the estimated parameters if the estimated parameters meet their conditions; (6) calculate measure C using the estimates; and (7) select as the most appropriate model the one that has the smallest value of the measure.

Measure
The measure for the proposed method (measure C) (Satoh & Yamada, 2001) is the mean relative squared error given in Equation (1), where $N$ denotes the number of available data points, $X_n$ the cumulative number of actual data up to discrete time $n$, and $\hat{X}_n$ the $n$-th value estimated with $N$ data points by an integrable difference equation. Although error is usually evaluated as the mean squared error, the mean squared error is not suitable for determining the most appropriate model because it is significantly affected by the absolute values of the data. As a result, it gives too much weight to the later stage, where the data values are larger. In contrast, measure C, as the mean relative squared error, is affected not by the absolute values of the data but by the ratio between the data and the estimates. As a result, measure C gives the same weight to every stage.
A combination of measure C and integrable difference equations is effective for a model selection although measure C itself is not new and cannot by itself be a measure for model selection as explained in Sect. 3.
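The weighting argument can be illustrated with a short sketch. It assumes that the mean relative squared error normalizes each residual by the actual data value; the exact form of the paper's Equation (1) is not reproduced in this text, so this normalization, the function names, and the toy numbers are all assumptions made for illustration.

```python
import numpy as np

def measure_c(actual, estimated):
    """Mean relative squared error (assumed form: residuals divided by actual values)."""
    actual = np.asarray(actual, dtype=float)
    estimated = np.asarray(estimated, dtype=float)
    return float(np.mean(((actual - estimated) / actual) ** 2))

def mean_squared_error(actual, estimated):
    """Ordinary mean squared error, for comparison."""
    actual = np.asarray(actual, dtype=float)
    estimated = np.asarray(estimated, dtype=float)
    return float(np.mean((actual - estimated) ** 2))

# Toy cumulative data with the same 10% relative error at every stage.
actual = np.array([1.0, 10.0, 100.0])
estimated = 1.1 * actual

c_value = measure_c(actual, estimated)
mse_value = mean_squared_error(actual, estimated)

# Share of the total squared error contributed by the last (largest) point.
squared_errors = (actual - estimated) ** 2
late_share = squared_errors[-1] / squared_errors.sum()

print("measure C:", c_value)
print("MSE:", mse_value)
print("share of squared error from the last point:", late_share)
```

In this toy case every point contributes equally to the relative measure, while almost all of the ordinary squared error comes from the last, largest data value.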

Integrable difference equations
Integrable difference equations have been widely investigated (Hirota, 1979, 2000; Hirota & Takahashi, 2003; Satoh, 2000). An integrable difference equation is obtained through Hirota's bilinear formalism and the discretization of a bilinear equation with gauge invariance (Hirota, 2000; Hirota & Takahashi, 2003). This method is composed of three steps: (i) a given differential equation is transformed into a bilinear equation by a dependent variable transformation; (ii) the bilinear equation is discretized while preserving gauge invariance; and (iii) the discrete bilinear equation is transformed into a discrete nonlinear equation in the ordinary form by an associated dependent variable transformation (Hirota, 2000; Hirota & Takahashi, 2003).
The proposed model selection method uses integrable difference equations for growth curve models that have the same number of parameters. As examples of growth curve models with the same number of parameters, the logistic and Gompertz curve models are introduced. The logistic curve is symmetric about its point of inflection, whereas the Gompertz curve is asymmetric about its point of inflection. Their integrable difference equations are introduced below.

Logistic curve model
The logistic curve model is described using the following differential equation:

$$\frac{dL(t)}{dt} = \frac{r}{k}\,L(t)\,\bigl(k - L(t)\bigr), \tag{2}$$

where $L(t)$ is the cumulative number up to time $t$. By integrating Equation (2) and assuming that $L(0) = \frac{k}{1+m}$, $L(t)$ is written as

$$L(t) = \frac{k}{1 + m e^{-rt}},$$

where $k$, $r$, and $m$ are constant parameters. Parameter $k$ represents an upper limit of demand, parameter $r$ represents the speed of growth, and parameter $m$ represents the shift of the logistic curve.
Integrable difference equations for the logistic curve model were proposed by Skellam (1951), Morisita (1965), and Hirota (1979). Skellam (1951) and Morisita (1965) discretized Equation (2) as Equation (6), which has an exact solution. Hirota (1979) discretized Equation (2) as Equation (8) and gave its exact solution. The right-hand sides of Equations (6) and (8) include terms with both index $n$ and index $n+1$, whereas the right-hand side of the forward difference equation has only terms with index $n$. Satoh and Yamada used Equations (6) and (8) for forecasting. Their regression equation for estimating the parameters is Equation (11). Parameters $k$, $r$, and $m$ are estimated from the regression coefficients, where $\hat{k}$, $\hat{r}$, $\hat{m}$, $\hat{A}$, and $\hat{B}$ are estimates of $k$, $r$, $m$, $A$, and $B$. The same estimates $\hat{k}$ and $\hat{r}$ are obtained for any value of the time interval $\delta$ because $\delta$ is not used in Equation (11) and $Y_n$ in Equation (11) is independent of $\delta$ in Equations (6) and (8).
Estimated parameters $\hat{k}$, $\hat{r}$, and $\hat{m}$ have to meet conditions (18), (19), and (20), where $L_1$ is the first datum.
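The claim that a regression equation can reproduce the parameters exactly when the data lie on the logistic curve can be checked numerically. The sketch below does not transcribe the paper's Equation (11), which is not reproduced in this text. Instead it uses a linear relation that follows directly from the closed-form logistic solution sampled at a constant interval $\delta$: $L_{n+1}/L_n = A + B\,L_{n+1}$ with $A = e^{r\delta}$ and $B = -(e^{r\delta} - 1)/k$. This relation, the function names, and the way $m$ is recovered from the first datum are assumptions made for illustration.

```python
import numpy as np

def logistic(t, k, r, m):
    """Closed-form logistic curve k / (1 + m * exp(-r * t))."""
    return k / (1.0 + m * np.exp(-r * t))

def fit_logistic_exact(data, delta=1.0):
    """Linear regression of L[n+1]/L[n] on L[n+1] (assumed regression form)."""
    data = np.asarray(data, dtype=float)
    y = data[1:] / data[:-1]
    x = data[1:]
    slope, intercept = np.polyfit(x, y, 1)  # y = intercept + slope * x
    k_hat = (1.0 - intercept) / slope       # saturation level
    r_hat = np.log(intercept) / delta       # continuous-time growth rate
    m_hat = k_hat / data[0] - 1.0           # from L(0) = k / (1 + m)
    return k_hat, r_hat, m_hat

# Exact solution data with the parameter values quoted later in the text.
n = np.arange(22)
data = logistic(n, k=100.0, r=0.8, m=999.0)

k_hat, r_hat, m_hat = fit_logistic_exact(data)
print("k_hat:", k_hat)
print("r_hat:", r_hat)
print("m_hat:", m_hat)
```

Because the relation is exactly linear for data on the logistic curve, ordinary least squares recovers the parameters up to floating-point error, with no iterative nonlinear fitting.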

Gompertz curve model
The Gompertz curve model is described by differential equations, the first of which is Equation (21), where $G(t)$ is the cumulative number up to time $t$. By integrating either equation and assuming that $G(0) = ka$, $G(t)$ can be written as

$$G(t) = k\,a^{b^{t}},$$

where $a$, $b$, and $k$ are parameters whose constant values are estimated by using regression analysis. Parameter $k$ represents the upper limit because $G(t) \to k$ as $t \to \infty$. Satoh (2000) proposed a Gompertz integrable difference equation, which has an exact solution. For comparison, a forward difference equation of Equation (21) is also introduced. The regression equation for the integrable difference equation (Satoh, 2000) is Equation (27). Using Equation (27), we can estimate parameters $k$, $a$, and $b$, where $\hat{k}$, $\hat{a}$, $\hat{b}$, $\hat{A}$, and $\hat{B}$ are the estimated values of $k$, $a$, $b$, $A$, and $B$.
$Y_n$ in Equation (27) is independent of the time interval $\delta$ because $\delta$ is not used in Equation (27).
Hence, the same estimates of $\hat{k}$, $\hat{a}$, and $\delta \log \hat{b}$ are obtained whatever value of $\delta$ we choose (Satoh, 2000).
Estimated parameters $\hat{k}$, $\hat{a}$, and $\hat{b}$ have to meet conditions (32), (33), and (34), which include $e^{-1} < \hat{b}^{\delta} < 1$, where $G_1$ is the first datum.
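An analogous numerical check is possible for the Gompertz curve. The sketch below does not transcribe the paper's Equation (27), which is not reproduced in this text. It uses a linear relation implied by the closed form $G(t) = k\,a^{b^{t}}$ sampled at a constant interval $\delta$: $\log G_{n+1} = A + B \log G_n$ with $B = b^{\delta}$ and $A = (1 - b^{\delta}) \log k$. This relation and the function names are assumptions made for illustration.

```python
import numpy as np

def gompertz(t, k, a, b):
    """Closed-form Gompertz curve k * a ** (b ** t)."""
    return k * a ** (b ** t)

def fit_gompertz_exact(data, delta=1.0):
    """Linear regression of log G[n+1] on log G[n] (assumed regression form)."""
    log_data = np.log(np.asarray(data, dtype=float))
    slope, intercept = np.polyfit(log_data[:-1], log_data[1:], 1)
    k_hat = np.exp(intercept / (1.0 - slope))  # fixed point of the linear map
    b_hat = slope ** (1.0 / delta)             # slope equals b ** delta
    a_hat = data[0] / k_hat                    # from G(0) = k * a
    return k_hat, a_hat, b_hat

# Exact solution data with the parameter values quoted later in the text.
n = np.arange(26)
data = gompertz(n, k=100.0, a=0.01, b=0.5)

k_hat, a_hat, b_hat = fit_gompertz_exact(data)
print("k_hat:", k_hat)
print("a_hat:", a_hat)
print("b_hat:", b_hat)
```

Here the regression slope estimates $b^{\delta}$ as a whole, which matches the remark in the text that $\delta \log \hat{b}$, rather than $\hat{b}$ alone, is what the data determine.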

Regression equations based on central difference equations
For comparison, regression equations based on central difference equations are introduced for the logistic and Gompertz curve models.

Logistic curve model
To obtain the regression equation, we rewrite Equation (2) as $Y = A + BX$, where $Y = \frac{1}{L(t)}\frac{dL(t)}{dt}$, $X = L(t)$, $A = r$, and $B = -\frac{r}{k}$. Replacing the derivative with a central difference, we then obtain the regression equation, where $t_n = n\delta$. Here, $\delta$ is a constant time interval.
Given regression coefficients $\hat{A}$ and $\hat{B}$, which are the values of $A$ and $B$ estimated through regression analysis, the estimates $\hat{k}$, $\hat{r}$, and $\hat{m}$ of parameters $k$, $r$, and $m$ can be obtained. These estimates depend on the time interval $\delta$ because Equation (43) depends on $\delta$.

Gompertz curve model
To obtain the regression equation needed to estimate the values of the parameters, Equation (21) is rewritten as Equation (48). Equation (48) is then discretized to obtain the regression equation, in which $A = \log\bigl((\log a)(\log b)\bigr)$. Given regression coefficients $\hat{A}$ and $\hat{B}$, which are the values of $A$ and $B$ estimated through regression analysis, parameters $a$, $b$, and $k$ can be estimated. This estimation depends on the time interval $\delta$ because $Y_n$ in Equation (51) depends on $\delta$. Because any value can be chosen for $\delta$, the estimates depend entirely on the specific value chosen.
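The dependence on $\delta$ can be seen in a short experiment for the logistic case. The sketch assumes that the central difference regression uses $Y_n = (L_{n+1} - L_{n-1}) / (2\delta L_n)$ and $X_n = L_n$, so that $\hat{r} = \hat{A}$ and $\hat{k} = -\hat{A}/\hat{B}$. This discretization is an assumption consistent with the definitions $Y = (1/L)\,dL/dt$, $X = L$, and $A = r$ above, not a transcription of the paper's Equation (43), and the function names are hypothetical.

```python
import numpy as np

def logistic(t, k, r, m):
    """Closed-form logistic curve k / (1 + m * exp(-r * t))."""
    return k / (1.0 + m * np.exp(-r * t))

def fit_logistic_central(data, delta):
    """Regression built on a central difference approximation of (1/L) dL/dt."""
    data = np.asarray(data, dtype=float)
    y = (data[2:] - data[:-2]) / (2.0 * delta * data[1:-1])
    x = data[1:-1]
    slope, intercept = np.polyfit(x, y, 1)  # y = intercept + slope * x
    r_hat = intercept                       # A = r
    k_hat = -intercept / slope              # B = -r / k
    return k_hat, r_hat

k, r, m = 100.0, 0.8, 999.0

# The same exact curve over the same time span, sampled at two intervals.
t_coarse = np.arange(0.0, 21.5, 1.0)
t_fine = np.arange(0.0, 21.25, 0.5)
k_coarse, r_coarse = fit_logistic_central(logistic(t_coarse, k, r, m), delta=1.0)
k_fine, r_fine = fit_logistic_central(logistic(t_fine, k, r, m), delta=0.5)

print("delta = 1.0:", k_coarse, r_coarse)
print("delta = 0.5:", k_fine, r_fine)
```

Even though the input lies exactly on the logistic curve, neither run returns the true parameter values, and the two runs disagree with each other, which is the sensitivity to $\delta$ described above.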

Exact solution data of logistic curve model
The parameters used to generate the logistic exact solution data were the same as those Satoh and Yamada used: $k = 100$, $r = 0.8$, and $m = 999$. These data have their point of inflection at $t^* = 8.63$, where $L(t^*) = 50$. The time interval of 1 was also the same as that Satoh and Yamada used. Four datasets were analyzed (Satoh & Yamada, 2001): data up to (A-i) the first three data points ($n = 0, 1, 2$), (A-ii) the data point just before the point of inflection ($n = 0, 1, \ldots, 8$), (A-iii) the data point just after the point of inflection ($n = 0, 1, \ldots, 9$), and (A-iv) the saturation level ($n = 0, 1, \ldots, 21$). All estimated parameters met their conditions (18), (19), (20), (32), (33), and (34) for all $n$.
Table 1 shows the values of measure C with estimates using integrable difference equations (logistic and Gompertz) and central difference equations (logistic and Gompertz) (Satoh & Yamada, 2001). The logistic integrable difference equation matched the data completely: it reproduces the data values as estimates when the exact solution is used as the input data. Thus, the values of measure C were exactly zero. The values of measure C with estimates by the Gompertz integrable difference equation were positive for all cases. As a result, the proposed method made the correct judgment. Moreover, the difference in the values of measure C between the Gompertz and logistic integrable difference equations increased monotonically as the data size increased, so the more appropriate model was determined more clearly as the data size increased.
With the central difference equations, however, the Gompertz model was selected as the more appropriate model on the basis of measure C even though it was the inappropriate model for all cases (A-i, …, iv). Measure C with estimates by the logistic central difference equation both increased and decreased as the data size increased. In contrast, measure C with estimates by the Gompertz central difference equation increased as the data size increased.
This increase in measure C seems desirable for model selection because the Gompertz curve model is the inappropriate model and the selection needs to become clearer as the data size increases. However, all four values of C for the logistic central difference equation were larger than the largest value of C for the Gompertz central difference equation. Thus, measure C with central difference equations cannot be used for model selection. Table 2 shows the estimated parameter $k$, which is a parameter common to the logistic and Gompertz models (Satoh & Yamada, 2001). The logistic integrable difference equation completely reproduces the parameter value $k$ of the exact solution data for all cases. The $k$ values estimated by the Gompertz integrable difference equation were larger than those estimated by the logistic integrable difference equation and monotonically decreased as the data size increased, as Satoh and Matsumura (2019) proved. The $k$ values estimated by the logistic central difference equation were more accurate than those estimated by the Gompertz central difference equation, although measure C with estimates by the central difference equations indicated that the Gompertz model was the more appropriate model.
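The selection procedure on exact logistic data can be sketched end to end as follows. The regression forms are assumptions derived from the closed-form logistic and Gompertz curves (regressing $L_{n+1}/L_n$ on $L_{n+1}$, and $\log G_{n+1}$ on $\log G_n$), the measure is assumed to normalize residuals by the actual data, and the check of the parameter conditions is omitted because those conditions are not reproduced in this text. The numbers printed by this sketch are therefore not expected to match Table 1.

```python
import numpy as np

def measure_c(actual, estimated):
    """Mean relative squared error (assumed form: residuals divided by actual values)."""
    return float(np.mean(((actual - estimated) / actual) ** 2))

def logistic_estimates(data):
    """Fit the assumed logistic regression and return the fitted curve values."""
    slope, intercept = np.polyfit(data[1:], data[1:] / data[:-1], 1)
    k_hat = (1.0 - intercept) / slope
    m_hat = k_hat / data[0] - 1.0
    n = np.arange(len(data))
    # The intercept estimates exp(r * delta), so exp(-r * delta * n) = intercept ** (-n).
    return k_hat / (1.0 + m_hat * intercept ** (-n))

def gompertz_estimates(data):
    """Fit the assumed Gompertz regression and return the fitted curve values."""
    log_data = np.log(data)
    slope, intercept = np.polyfit(log_data[:-1], log_data[1:], 1)
    k_hat = np.exp(intercept / (1.0 - slope))
    a_hat = data[0] / k_hat
    n = np.arange(len(data))
    # The slope estimates b ** delta, so b ** (delta * n) = slope ** n.
    return k_hat * a_hat ** (slope ** n)

# Exact logistic data: k = 100, r = 0.8, m = 999, n = 0, ..., 21.
n = np.arange(22)
data = 100.0 / (1.0 + 999.0 * np.exp(-0.8 * n))

scores = {
    "logistic": measure_c(data, logistic_estimates(data)),
    "gompertz": measure_c(data, gompertz_estimates(data)),
}
selected = min(scores, key=scores.get)  # smallest measure C wins

print(scores)
print("selected model:", selected)
```

The logistic score is zero up to rounding error because the fitted curve passes through every data point, so any positive score for the competing model is enough to decide the selection.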

Exact solution data of Gompertz curve model
The parameters used to generate the Gompertz exact solution data were the same as those Satoh (2000) used: $k = 100$, $a = 0.01$, and $b = 0.5$. These data inflected at the point where $n^* = 1.443$ and $G(n^*) = 36.79$. The time interval of 1 was also the same as that Satoh (2000) used. Three datasets were analyzed (Satoh & Yamada, 2001): data up to (B-i) just before the point of inflection ($n = 0, 1, 2$), (B-ii) just after the point of inflection ($n = 0, 1, 2, 3$), and (B-iii) the saturation level ($n = 0, 1, \ldots, 25$). All estimated parameters met their conditions (18), (19), (20), (32), (33), and (34) for all $n$.
Table 3 shows the values of measure C with estimates using integrable difference equations (logistic and Gompertz) and central difference equations (logistic and Gompertz) (Satoh & Yamada, 2001). The Gompertz integrable difference equation matched the data completely: it reproduces the data values as estimates when the exact solution is used as the input data (Satoh, 2000). Thus, the values of measure C were exactly zero. The values of measure C with estimates by the logistic integrable difference equation were positive for all cases (B-i, ii, iii). As a result, the proposed method made the correct judgment. As with the logistic exact solution data, the difference in the values of measure C between the Gompertz and logistic integrable difference equations increased monotonically as the data size increased, so the more appropriate model was determined more clearly as the data size increased.
The model was also correctly selected on the basis of measure C with estimates by the central difference equations for all cases of the Gompertz exact solution data. Measure C with estimates by the central difference equations thus selected the Gompertz model as the more appropriate model for both exact solution datasets. The values of measure C for the Gompertz central difference equation monotonically decreased as the data size increased.
However, those for the logistic central difference equation also monotonically decreased as the data size increased. The values of measure C need to increase as the data size increases because the logistic model is the inappropriate model and the selection needs to become clearer as the data size increases. Thus, measure C with estimates by the central difference equations is not reliable even though it made the correct judgment for all cases. Table 4 shows the estimated parameter $k$, which is a parameter common to the logistic and Gompertz models (Satoh & Yamada, 2001). The Gompertz integrable difference equation completely reproduces the parameter value $k$ of the exact solution data for all cases. The $k$ values estimated by the logistic integrable difference equation were smaller than those estimated by the Gompertz integrable difference equation and monotonically increased as the data size increased, as Satoh proved. The k values estimated by the Gompertz central difference equation were more accurate than those estimated by the logistic central difference equation as measure C with

Verification with actual data
The proposed model selection was verified between the logistic and Gompertz models for actual datasets.

Measure C
Determining the more appropriate model at an earlier stage is desirable because forecasting at an earlier stage is more valuable. Thus, the proposed model selection was applied to the first n data points for each dataset to evaluate at how early a stage it was able to determine the more appropriate model.
Values of measure C were calculated for the first $n$ data points in datasets L1, L2, and L3 as shown in Figure 3(a-c). The proposed model selection determined that the logistic model was more appropriate than the Gompertz model for dataset L1 except for $n = 8, 9, 10$, dataset L2 except for $n = 4, 5, 6$ and $n = 12, 13, 14$, and dataset L3 for all $n$. The logistic model was actually more appropriate for all available data ($n = 13$, $18$, and $26$ for datasets L1, L2, and L3) as shown in Figure 1(a-c).
Moreover, the proposed model selection also determined that the Gompertz model was more appropriate than the logistic model when the available data were the first 8, 9, and 10 data points in dataset L1 and the first 4, 5, 6, 12, 13, and 14 data points in dataset L2. Both models were compared with actual data using the first 9 data points in dataset L1 in Figure 4(a) and the first 13 data points in dataset L2 in Figure 4(b) because the difference between the two values of measure C was the largest for $n = 8, 9, 10$ in dataset L1 and for $n = 12, 13, 14$ in dataset L2. The Gompertz model looks more appropriate than the logistic model, in agreement with the results of the proposed model selection.
Values of measure C were calculated for the first $n$ data points in datasets G1, G2, and G3 as shown in Figure 5(a-c). The proposed model selection determined that the Gompertz model was more appropriate than the logistic model for dataset G1 except for $n = 6, 7, 8, 9$, dataset G2 except for $n = 7, 8, 9, 10, 11$, and dataset G3 except for $n = 4, 7, 8, \ldots, 25$. The Gompertz model was actually more appropriate for all available data ($n = 19$, $25$, and $64$ for datasets G1, G2, and G3) as shown in Figure 2(a-c).
On the other hand, the proposed model selection also determined that the logistic model was more appropriate than the Gompertz model when the available data were the first 6, …, 9 data points in dataset G1, the first 7, …, 11 data points in dataset G2, and the first 8, …, 25 data points in dataset G3. Both models were compared with actual data using the first 7 data points in dataset G1 in Figure 6(a), the first 10 data points in dataset G2 in Figure 6(b), and the first 14 data points in dataset G3 in Figure 6(c) because the difference between the two values of measure C was the largest for $n = 6, \ldots, 9$ in dataset G1, $n = 7, \ldots, 11$ in dataset G2, and $n = 8, \ldots, 25$ in dataset G3. The logistic model looks more appropriate than the Gompertz model, in agreement with the results of the proposed model selection.

Stability of estimated saturation level
The estimated saturation level (parameter $k$) for actual data changes as the number of available data points increases, even though it never changes for exact solution data. However, if parameter $k$ is estimated by the more appropriate model, the estimated values should be more stable than those estimated by the less appropriate model as the number of data points increases. Thus, if the range of $k$ estimated by the model that the proposed model selection identifies as more appropriate is smaller than the range estimated by the less appropriate model, the proposed model selection has determined the more appropriate model correctly. Table 5 shows, for all actual datasets, the range of estimated $k$ in the last stage, that is, the stage in which the more appropriate model was fixed because the results of the proposed model selection no longer changed. The values of the range were calculated using the first $n = 11, 12, 13$ data points for dataset L1, the first $n = 15, \ldots, 18$ data points for dataset L2, the first $n = 17, \ldots, 26$ data points for dataset L3, the first $n = 10, \ldots, 19$ data points for dataset G1, the first $n = 13, \ldots, 25$ data points for dataset G2, and the first $n = 26, \ldots, 64$ data points for dataset G3. The range of parameter $k$ estimated by the more appropriate model was smaller than that estimated by the less appropriate model as shown in Table 5. Therefore, the proposed model selection determined the more appropriate model correctly.
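The range computation can be sketched as follows. The sketch uses synthetic exact logistic data rather than the paper's actual datasets, so it only illustrates the mechanics: estimate $k$ from the first $n$ data points for a sequence of $n$ values and compare the spread of the estimates between the two models. The regression forms are the same assumptions as in the earlier sketches (derived from the closed-form curves), and the window of prefix lengths is a hypothetical choice.

```python
import numpy as np

def k_logistic(data):
    """Saturation level from the assumed logistic regression of L[n+1]/L[n] on L[n+1]."""
    slope, intercept = np.polyfit(data[1:], data[1:] / data[:-1], 1)
    return (1.0 - intercept) / slope

def k_gompertz(data):
    """Saturation level from the assumed Gompertz regression of log G[n+1] on log G[n]."""
    log_data = np.log(data)
    slope, intercept = np.polyfit(log_data[:-1], log_data[1:], 1)
    return float(np.exp(intercept / (1.0 - slope)))

def estimate_range(estimator, data, sizes):
    """Range (max - min) of the estimates obtained from the first s points, s in sizes."""
    estimates = [estimator(data[:s]) for s in sizes]
    return max(estimates) - min(estimates)

# Synthetic exact logistic data: k = 100, r = 0.8, m = 999, n = 0, ..., 21.
n = np.arange(22)
data = 100.0 / (1.0 + 999.0 * np.exp(-0.8 * n))

sizes = range(12, 23)  # use the first 12, 13, ..., 22 data points
range_logistic = estimate_range(k_logistic, data, sizes)
range_gompertz = estimate_range(k_gompertz, data, sizes)

print("range of k, logistic model:", range_logistic)
print("range of k, Gompertz model:", range_gompertz)
```

On noise-free data the appropriate model gives an essentially constant estimate of $k$, while the estimate from the other model drifts as points are added. With noisy actual data both ranges would be positive, and the comparison of their sizes is what Table 5 reports.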

Conclusion
A model selection method has been proposed to select among growth curve models that have the same number of parameters. The key of the model selection method is to use integrable difference equations for the growth curve models and a measure of the mean relative squared error between actual data and estimates. The integrable difference equations have exact solutions that lie on the exact solutions of the growth curve models and perfectly reproduce the parameter values from exact solution data. The ordinary forward or central difference equations cannot be used for model selection because they cannot reproduce the parameter values even from exact solution data; they produce an error between the estimates and the exact solution data. The measure of the mean relative squared error between actual data and estimates is suitable for model selection because it gives the same weight to every stage. The mean squared error, which is commonly used, gives too much weight to the later stage, where the data values are larger, so it is not suitable for model selection. The proposed model selection method has been verified with six actual datasets, with the logistic and Gompertz curve models chosen as the alternative growth curve models. Three datasets conform well to the logistic curve model, and the others to the Gompertz curve model. For actual data, the more appropriate model changes with the number of available data points because actual data include noise. The proposed model selection determines the more appropriate model for the available data points. Parameters estimated by the more appropriate model should be more stable than those estimated by the less appropriate model as the number of available data points increases. The range of the saturation level parameter estimated by the model that the proposed model selection identifies as more appropriate is smaller than that estimated by the less appropriate model. Therefore, the proposed model selection determined the more appropriate model correctly.

Figure 6. Case of Gompertz model as more appropriate model.

Funding
The author received no direct funding for this research.