Optimal Release Time Estimation of Software System using Box-Cox Transformation and Neural Network

The determination of the release time for a new software product is one of the most critical issues in designing and controlling software development processes. This paper presents a technique to predict the optimal software release time using a neural network. In our approach, a three-layer perceptron neural network with multiple outputs is used, where the underlying software fault count data are transformed into approximately Gaussian data by means of the well-known Box-Cox power transformation. The prediction of the optimal software release time, which minimizes the expected software cost, is then carried out with the neural network. Numerical examples with four actual software fault count data sets are presented, in which we compare our approach with conventional Non-Homogeneous Poisson Process (NHPP)-based Software Reliability Growth Models (SRGMs).


Introduction
Software testing is an important activity in software development processes and plays an imperative role in determining the quality of software. Moreover, due to fierce competition in the market, software seldom survives over a long lifetime without change and constantly faces version upgrades. At the same time, newly added features increase the complexity of the software, which leads to software faults if they are not evaluated appropriately. A main concern of practitioners in software process management is therefore to determine when to stop software testing and release the software to the market or users. The determination of the optimal timing to stop software testing is called the optimal software release problem, which is directly related to software testing costs. The total testing cost is reduced if the volume of software test work is smaller, but after release the debugging cost per fault in the operational phase is more significant than that in the testing phase. On the contrary, the higher the desired software reliability, the longer the testing period, which in turn accumulates a higher testing cost. Hence, it is important to find an appropriate software release time taking account of the expected software cost. Many authors have discussed in the past when to stop software testing and release the software for use. The recent trend shows that quantitative software reliability assessment processes and efficient software testing methodologies are becoming a firm prerequisite for software developers. Towards achieving this prerequisite, a great number of SRGMs have been developed in the literature (see Goel and Okumoto, 1979; Yamada et al., 1983; Ohba, 1984; Littlewood, 1984; Goel, 1985; Abdel-Ghaly et al., 1986; Lyu, 1996; Achcar et al., 1998; Cai, 1998; Gokhale and Trivedi, 1998; Pham, 2000; Ohishi et al., 2009). Many researchers have also studied the optimal software release policies.
Okumoto and Goel (1980) derived the optimal software release time for their NHPP-based software reliability growth model (SRGM) such that the software reliability attains a certain requirement level. Pham and Zhang (1999) proposed a software cost model with warranty and risk costs. Leung (1992), Dalal and McIntosh (1994), Zhang and Pham (1998), Pham and Zhang (1999), and Pham (2003) made influential contributions to a number of optimization problems under different modeling assumptions. They focused on a completely known stochastic point process, such as an NHPP-based SRGM, to estimate the optimal software testing time. However, no single SRGM fits every software fault count data set, and the formulation of the optimal software release problem strongly depends on accurate statistical estimation of the software fault count process. Kaneishi and Dohi (2013), Xiao and Dohi (2013), and Saito and Dohi (2015) introduced new non-parametric assessment methods for the software fault count process in testing. The disadvantage of these methods, however, is their failure to predict the long-term behavior of the software fault count process; therefore, they are not applicable to actual software release problems. Recently, Saito et al. (2016) employed a non-parametric method to predict approximately the optimal software release time by minimizing the upper or lower bound of the expected software cost.
In this paper, a neural network approach with fault count data is used to estimate the optimal software release time, which minimizes the relevant software cost. The well-known data transformation technique called the Box-Cox power transform (see Box and Cox, 1964) is applied to transform data with an arbitrary probability law into approximately Gaussian data, and thereby to transform the underlying software fault prediction problem into a nonlinear Gaussian regression problem with a Multilayer Perceptron (MLP). An MLP is a feed-forward artificial neural network model in which the input data are mapped onto a set of appropriate outputs. The MLP exploits a supervised learning technique called back propagation. Back propagation is a common learning technique for training Artificial Neural Networks (ANNs) and is used in conjunction with an optimization method such as gradient descent. The algorithm repeats a two-phase cycle: forward propagation and weight update. ANNs have been applied to many problems, including software fault prediction, but there is no efficient way to determine the best neural network architecture. The numbers of input and output neurons may be determined from physical requirements in many cases. On the other hand, the number of hidden layers and hidden neurons significantly influences the prediction performance of MLP feed-forward neural networks. Nevertheless, the MLP is widely applicable to statistical identification, function approximation, and time series forecasting.
The idea of data transformation is rather simple, yet useful to handle non-Gaussian data. Tukey (1957) introduced a family of power transformations in which an arbitrary transformation parameter is involved. Box and Cox (1964) proposed maximum likelihood as well as Bayesian methods for estimating this parameter, and derived the asymptotic distribution of the likelihood ratio to test hypotheses about the parameter. The main contribution by Box and Cox (1964) is two-fold: the first is to calculate the profile likelihood function and to acquire an approximate confidence interval from the asymptotic property of the maximum likelihood estimator; the second is to ensure that the probability model is fully identifiable in the Bayesian approach. Dashora et al. (2015) used the Box-Cox power transformation for data-driven prediction models with a single output neuron in an MLP and analyzed the intermittent stream flow of the Narmada river basin in the context of neural computation; the model was compared against the MLP with seasonal autoregressive integrated moving average and Thomas-Fiering models. Sennaroglu and Senvar (2015) estimated process capability indices in industry and compared the Box-Cox power transformation with weighted variance methods in the Weibull distribution model in terms of the process variation on the product specifications. In these ways, the potential applicability of the Box-Cox power transform is acknowledged in several research fields. Begum and Dohi (2016a) proposed a one-stage look-ahead prediction method with an MLP using software fault count data; the method is adequate to predict the cumulative number of software faults in practice. Dohi et al. (1999) considered an optimal software release problem by means of MLP-based software fault prediction, and were able to estimate the optimal software release time.
They proposed an interesting graphical method to directly find an optimal software testing time which minimizes the relevant software cost. However, their model dealt only with software fault-detection time data, and failed to make multi-stage look-ahead predictions. The idea of using a Multiple-Input Multiple-Output (MIMO) architecture came from Park et al. (2014), who proposed a refined ANN approach to predict the long-term behavior of the software fault count process with grouped data. Begum and Dohi (2016b, 2016c) imposed the plausible assumption that the underlying fault count process obeys the Poisson law with an unknown mean value function, and proposed to utilize three transforms from Poisson count data to Gaussian data: the Bartlett transform (1936), the Anscombe transform (1948) and the Fisz transform (1955). They also proposed applying an MIMO type of MLP to an optimal software release problem with grouped data. Recently, Begum and Dohi (2017) proposed another MLP-based software fault prediction method with the Box-Cox power transformation.

NHPP-Based Software Reliability Modeling
Assume that the software system test starts at time t = 0. Let N(t) denote the cumulative number of software faults detected by time t. Then the counting process {N(t), t >= 0} is said to be a nonhomogeneous Poisson process (NHPP) with intensity function phi(t; theta) if the following conditions hold:

Pr{N(t + h) - N(t) = 1} = phi(t; theta) h + o(h),
Pr{N(t + h) - N(t) >= 2} = o(h),   (1)

where o(h) is a higher-order term of the infinitesimal time h, and the intensity function phi(t; theta) denotes the instantaneous fault-detection rate per fault. In the above definition, theta is the model parameter (vector) included in the intensity function. Then the probability mass function (p.m.f.) of the NHPP is given by

Pr{N(t) = k} = {Lambda(t; theta)^k / k!} exp(-Lambda(t; theta)),  Lambda(t; theta) = Integral_0^t phi(s; theta) ds,   (2)

where Lambda(t; theta) is the mean value function. The identification problem of the NHPP reduces to a statistical estimation problem of the unknown model parameter theta if the mean value function Lambda(t; theta) or the intensity function phi(t; theta) is known. Parametric statistical estimation methods can then be applied when the parametric form of the mean value function or the intensity function is given. The maximum likelihood method and/or the EM (Expectation-Maximization) algorithm are commonly used to estimate the mean value function. The parameter estimation tool SRATS summarizes eleven parametric NHPP-based SRGMs (see Table 1); the SRGM with the smallest AIC (Akaike Information Criterion), i.e., the model that best fits the past observations of software fault counts among the eleven models, is automatically selected in SRATS.
Suppose that realizations x_i of N(t_i) (i = 1, 2, ..., n) are observed up to the observation point t_n. We estimate the model parameter theta by means of the maximum likelihood method. The log-likelihood function for the fault count data (t_i, x_i) (i = 1, 2, ..., n) is given by

LLF(theta) = Sum_{i=1}^{n} (x_i - x_{i-1}) ln[Lambda(t_i; theta) - Lambda(t_{i-1}; theta)] - Lambda(t_n; theta) - Sum_{i=1}^{n} ln{(x_i - x_{i-1})!},   (3)

where Lambda(t_0; theta) = 0, t_0 = 0 and x_0 = 0 for simplification. By maximizing Equation (3) with respect to the model parameter theta, we can predict the future value of the intensity function or the mean value function at an arbitrary time t_{n+l} (l = 1, 2, ...), where l denotes the prediction length. In parametric modeling, the prediction at time t_{n+l} is easily done by substituting the estimated model parameter into the time evolution Lambda(t; theta), where the unconditional and conditional expectations of the cumulative number of faults at an arbitrary future time t_{n+l} are given by

Lambda(t_{n+l}; theta)   (4)

and

x_n + Lambda(t_{n+l}; theta) - Lambda(t_n; theta),   (5)

respectively. If the parametric form of the mean value function or the intensity function is unknown, the identification problem of an NHPP becomes much more difficult. A limited number of non-parametric approaches have been developed by Kaneishi and Dohi (2013), Saito and Dohi (2015), etc. However, those approaches can deal only with fault-detection time data and do not work for long-term prediction with grouped fault count data. In addition, the wavelet-based method of Xiao and Dohi (2013) can handle grouped data, but fails to make long-term predictions. To overcome these problems, we use a fundamental MLP for long-term software fault prediction.
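As a sketch of this estimation step, the grouped-data log-likelihood of Equation (3) can be maximized numerically. The snippet below fits the exponential (Goel-Okumoto type) mean value function Lambda(t) = a(1 - e^{-bt}) to synthetic fault count data; the model choice, the synthetic data and all numerical settings are illustrative assumptions, not the experimental setup of this paper.

```python
import numpy as np
from scipy.optimize import minimize

def neg_loglik(params, t, x):
    """Negative grouped-data NHPP log-likelihood (constant factorial term dropped)
    for the Goel-Okumoto mean value function Lambda(t) = a*(1 - exp(-b*t))."""
    a, b = params
    if a <= 0.0 or b <= 0.0:
        return np.inf
    Lam = a * (1.0 - np.exp(-b * t))
    inc = np.diff(np.concatenate(([0.0], Lam)))   # Lambda(t_i) - Lambda(t_{i-1})
    dx = np.diff(np.concatenate(([0.0], x)))      # x_i - x_{i-1}
    return -(np.sum(dx * np.log(inc + 1e-300)) - Lam[-1])

# Synthetic grouped fault counts from a known model (a = 100, b = 0.1).
rng = np.random.default_rng(1)
t = np.arange(1.0, 31.0)
true_inc = np.diff(np.concatenate(([0.0], 100.0 * (1.0 - np.exp(-0.1 * t)))))
x = np.cumsum(rng.poisson(true_inc)).astype(float)

res = minimize(neg_loglik, x0=[x[-1], 0.05], args=(t, x), method="Nelder-Mead")
a_hat, b_hat = res.x   # maximum likelihood estimates of (a, b)
```

With the fitted parameters, the unconditional prediction (4) is simply `a_hat * (1 - np.exp(-b_hat * t_future))`.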

MLP Architecture
An Artificial Neural Network (ANN) is a computational model inspired by the structure and functions of biological neural networks. An ANN has several advantages, but one of the most significant is that it can be trained from past observations. The simplest ANN has three interconnected layers. The first layer consists of input neurons, which send data to the second (hidden) layer; the hidden layer in turn sends its outputs to the third layer. The hidden neurons have no communication with the external world, so the output layer of neurons sends the final output to the external world. The problem is how to choose an appropriate number of hidden neurons in the neural computation. Here, we study an MIMO type of MLP with only one hidden layer. As in Section 2, suppose that n software fault count data (t_i, x_i) (i = 1, 2, ..., n) are observed at the observation point t_n. Our concern is the future prediction of the cumulative number of software faults at time t_{n+s} (s = 1, 2, ..., l).
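As a minimal illustration of this three-layer feed-forward structure, the sketch below computes one forward pass of an MIMO network with sigmoid activations; the layer sizes, weights and input values are toy assumptions chosen for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(7)
n_in, m_hid, l_out = 8, 3, 2                 # toy sizes: inputs, hidden neurons, outputs
W1 = rng.uniform(-1.0, 1.0, (m_hid, n_in))   # input -> hidden connection weights
W2 = rng.uniform(-1.0, 1.0, (l_out, m_hid))  # hidden -> output connection weights

y = rng.uniform(0.0, 1.0, n_in)              # transformed fault count inputs (toy values)
z = sigmoid(W2 @ sigmoid(W1 @ y))            # l simultaneous outputs of the MIMO MLP
```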

First Phase: Data Transformation
The simplest MLP with only one output neuron can be regarded as a nonlinear regression model in which the explanatory variables are randomized by Gaussian white noise. In particular, the output data of the MLP are implicitly assumed to be realizations of a nonlinear Gaussian model. By contrast, the fault count data take integer values. Hence, the original data must be transformed into Gaussian data in advance. Xiao and Dohi (2013) used such a pre-processing step, which is common in wavelet shrinkage estimation, where the underlying data follow the Poisson law.
In the same way, we apply the Box-Cox power transformation technique proposed by Box and Cox (1964) to map random data with an arbitrary probability law to Gaussian data. As mentioned above, this method finds an appropriate exponent to transform the data into a "normal shape". Table 2 presents the Box-Cox power transformation and its inverse formula:

y_i = (x_i^lambda - 1) / lambda (lambda != 0),  y_i = ln x_i (lambda = 0),

with inverse x_i = (lambda y_i + 1)^{1/lambda} (lambda != 0), x_i = exp(y_i) (lambda = 0), where x_i denotes the cumulative number of software faults detected by the i-th (i = 1, 2, ..., n) testing day. Then we have the transformed data y_i by means of the Box-Cox power transformation. The transformation parameter lambda indicates the power to which all data are raised and has to be adjusted; since lambda should be close to optimal, pre-experiments are essential before the time series prediction. Let y_i (i = 1, 2, ...) and z_s (s = 1, 2, ..., l) be the inputs and outputs of the MIMO type of MLP, respectively. The prediction of the cumulative number of software faults is then given by the inverse of the data transform. Fig 1 depicts the architecture of the back-propagation-type MIMO MLP, where n is the number of software fault count data and l is the prediction length. We suppose that there is only one hidden layer with m (= 1, 2, ...) hidden neurons in our MIMO type of MLP.
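Assuming SciPy is available, this transformation and its inverse can be sketched as follows; the fault counts below are toy values, and `scipy.stats.boxcox` selects lambda by maximum likelihood.

```python
import numpy as np
from scipy import stats

# Hypothetical cumulative fault counts over ten testing days (strictly positive).
x = np.cumsum([3.0, 5.0, 4.0, 6.0, 2.0, 7.0, 5.0, 4.0, 3.0, 6.0])

y, lam = stats.boxcox(x)   # transformed data and the MLE of the parameter lambda

# Inverse transform of Table 2: x = (lambda*y + 1)^(1/lambda), or exp(y) if lambda = 0.
x_back = np.exp(y) if lam == 0 else (lam * y + 1.0) ** (1.0 / lam)
```

The round trip `x_back` recovers the original counts, which is the property exploited when mapping the MLP outputs back to the fault count scale.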

Second Phase: Training the Neural Network
In Fig 1, n denotes the number of input neurons in the input layer, m the number of hidden neurons, and l the number of output neurons. We assume that all the connection weights (w_ij from the input layer to the hidden layer, w_js from the hidden layer to the output layer) are initially given by uniformly distributed pseudo-random variates. In our MIMO type of MLP, the outputs (z_1, ..., z_l) can be calculated from the transformed inputs (y_1, ..., y_n) if these weights are completely known. However, the common BP algorithm cannot train all of the weights, because the input patterns feeding the last l input neurons are not available at the training stage; as a result, the common BP algorithm has to be modified for the long-term prediction scheme.

(i) Long-Term Prediction Scheme
In Fig 2, we show the structure of our prediction scheme. For the long-term prediction, we assume n > l without any loss of generality. In order to predict the cumulative number of software faults for l testing days beyond the observation point t_n, the prediction has to be made at the point t_n. This implies that only the weights attached to the first n - l input neurons, together with the hidden-to-output weights, can be estimated with the training data observed during (0, t_n], and that the remaining lm weights from the last l input neurons to the m hidden neurons are not trained at time t_n. We call these the non-estimable weights in this paper. Moreover, the prediction uncertainty rises as the prediction length l becomes longer, since the number of non-estimable weights becomes greater. In this scheme, the Box-Cox transformed data (y_1, ..., y_{n-l}) with given lambda are used as the inputs of the MIMO MLP, and the remaining data (y_{n-l+1}, ..., y_n) are used as the teaching signals in the training phase.

(ii) Common BP Algorithm
Back Propagation (BP) is a common method for training a neural network. The BP algorithm is the well-known gradient descent method that updates the connection weights so as to minimize the squared error between the network output values and the teaching signals. There are two special bias units, which always take the unit value; they provide the bias inputs to the hidden neurons and the output neurons, respectively. For the value y_i coming out of the i-th input neuron (i = 1, 2, ..., n - l), let w_ij in [-1, +1] be the connection weight from the i-th input neuron to the j-th hidden neuron, where w_0j and w_0s denote the bias weights for the j-th hidden neuron and the s-th output neuron, respectively, in the training phase with i = 0, 1, ..., n - l, j = 0, 1, ..., m and s = n - l + 1, n - l + 2, ..., n. Each hidden neuron calculates the weighted sum of its inputs,

h_j = w_0j + Sum_{i=1}^{n-l} w_ij y_i.   (6)

Since there is no universal method to determine the number of hidden neurons, we vary m in the pre-experiments and choose an appropriate value. After calculating h_j for each hidden neuron, we apply the sigmoid function f(h_j) = 1 / (1 + exp(-h_j)) as the threshold function in the MIMO MLP. Since each output neuron receives the summed and weighted outputs of the hidden neurons, the s-th output (s = n - l + 1, n - l + 2, ..., n) in the output layer is given by

o_s = f(w_0s + Sum_{j=1}^{m} w_js f(h_j)),   (7)

where w_js is the weight connecting the j-th hidden neuron to the s-th output neuron, and the output value of the network in the training phase, o_s, is again computed through the sigmoid f. In the BP algorithm, the error is propagated from the output layer back to the hidden layer by updating the weights, where the error function is defined by

E = (1/2) Sum_{s=n-l+1}^{n} (o_s - d_s)^2,   (8)

with the prediction values o_s and the teaching signals d_s (= y_s) observed during (0, t_n]. Next we give an overview of the BP algorithm.
It updates the weight parameters so as to minimize the SSE between the network output values o_s (s = n - l + 1, n - l + 2, ..., n) and the teaching signals d_s, where each connection weight is adjusted by gradient descent according to its contribution to the SSE in Equation (8). The momentum alpha and the learning rate eta are tuned to adjust the weight updates and the convergence speed in the BP algorithm, respectively. Since these are the most important tuning parameters in the BP algorithm, we carefully examine them in pre-experiments; in this paper we set alpha = 0.25-0.90 and eta = 0.001-0.500. The connection weights are then updated by

w_js <- w_js + eta delta_s f(h_j) + alpha Dw_js,   (9)
w_ij <- w_ij + eta delta_j y_i + alpha Dw_ij,   (10)

where Dw_js and Dw_ij denote the previous weight changes, and delta_s and delta_j are the output gradients of the s-th output neuron and the j-th hidden neuron, defined by

delta_s = (d_s - o_s) o_s (1 - o_s),   (11)
delta_j = f(h_j)(1 - f(h_j)) Sum_{s} delta_s w_js,   (12)

respectively. Similarly, the updated bias weights for the hidden and output neurons are given by

w_0j <- w_0j + eta delta_j + alpha Dw_0j,  w_0s <- w_0s + eta delta_s + alpha Dw_0s.   (13)

The above procedure is repeated until the desired output accuracy is achieved.
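A minimal NumPy sketch of this momentum-augmented delta rule, applied to a single toy training pattern, might look as follows; the layer sizes, data and hyperparameter values are assumptions for illustration, not the settings used in the experiments.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, m_hid, l_out = 6, 4, 2                 # toy layer sizes
eta, alpha = 0.3, 0.5                        # learning rate, momentum (toy values)
W1 = rng.uniform(-1, 1, (m_hid, n_in + 1))   # input -> hidden (extra column = bias)
W2 = rng.uniform(-1, 1, (l_out, m_hid + 1))  # hidden -> output (extra column = bias)
dW1 = np.zeros_like(W1)
dW2 = np.zeros_like(W2)

y = rng.uniform(0.0, 1.0, n_in)              # transformed inputs (toy pattern)
d = np.array([0.2, 0.8])                     # teaching signals (toy pattern)

for _ in range(5000):
    yb = np.append(y, 1.0)                   # append the bias unit
    h = sigmoid(W1 @ yb)                     # hidden activations
    hb = np.append(h, 1.0)
    o = sigmoid(W2 @ hb)                     # network outputs
    delta_o = (d - o) * o * (1.0 - o)        # output-layer gradients
    delta_h = h * (1.0 - h) * (W2[:, :-1].T @ delta_o)  # back-propagated hidden gradients
    dW2 = eta * np.outer(delta_o, hb) + alpha * dW2     # momentum-augmented updates
    dW1 = eta * np.outer(delta_h, yb) + alpha * dW1
    W2 += dW2
    W1 += dW1
```

After training, the outputs `o` closely match the teaching signals `d` for this pattern.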

Last Phase: Long-Term Prediction
Once the weights are estimated with the training data observed during (0, t_n], we still need the remaining non-estimable weights to make a prediction through the BP-trained network. Unfortunately, since these cannot be trained with the information available at time t_n, we give their values by uniform pseudo-random variates in the range [-1, +1]. With these random connection weights, the outputs serving as predictions of the cumulative number of software faults, (z_1, ..., z_l), are calculated by replacing Equations (6) and (7) with

h_j = w_0j + Sum_{i=1}^{n} w_ij y_i   (14)

and

z_s = f(w_0s + Sum_{j=1}^{m} w_js f(h_j)),   (15)

respectively, for i = 1, 2, ..., n, j = 1, 2, ..., m and s = 1, 2, ..., l. Note that the resulting output is based on a single sample generated from one set of uniform pseudo-random variates. In order to obtain the prediction of the expected cumulative number of software faults, we generate M sets of random variates and take the arithmetic mean of the M predictions of (z_1, ..., z_l),

z_s_bar = (1/M) Sum_{k=1}^{M} z_s^(k),   (16)

where M = 1,000 was confirmed to be sufficient in our preliminary experiments. In other words, the prediction in the MIMO type of MLP reduces to a combination of BP learning and a Monte Carlo simulation on the connection weights.
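This Monte Carlo step can be sketched as follows; the trained weights are replaced by random stand-in values and all sizes are toy assumptions. Each draw resamples only the untrained (non-estimable) block of input-to-hidden weights, and the outputs are averaged over the draws.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(42)
n_obs, n_miss, m_hid, l_out = 5, 2, 4, 2     # toy sizes; last n_miss inputs untrained
W_tr = rng.uniform(-1, 1, (m_hid, n_obs))    # stand-ins for the BP-trained weights
W2 = rng.uniform(-1, 1, (l_out, m_hid))      # stand-ins for hidden -> output weights
y = rng.uniform(0.0, 1.0, n_obs + n_miss)    # all transformed inputs at prediction time

M = 1000                                     # number of Monte Carlo samples
z_bar = np.zeros(l_out)
for _ in range(M):
    W_miss = rng.uniform(-1, 1, (m_hid, n_miss))   # fresh draw of non-estimable weights
    h = sigmoid(W_tr @ y[:n_obs] + W_miss @ y[n_obs:])
    z_bar += sigmoid(W2 @ h)
z_bar /= M                                   # arithmetic mean of the M predictions
```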

Optimal Software Release Decision
Assume that the system test of a software product starts at t = 0 and terminates at t = t_{n+l}. Let t_q be the software lifetime or the upper limit of the software warranty period, where the interval (t_{n+l}, t_q] denotes the operational period of the software after release. When the software fault count data x = {x_1, ..., x_n}, the cumulative numbers of detected faults at times t_i (i = 1, 2, ..., n), are observed at time t_n (0 < t_n <= t_q), the model parameter theta is estimated with these data in the parametric NHPP-based SRGMs. Define the following cost components:

c_1: the debugging cost per fault in the testing phase,
c_2: the debugging cost per fault in the operational phase (c_2 > c_1),
c_3: the testing cost per unit time.

Then the expected total software cost incurred when the software is released at time t_{n+l} is given by

C(l) = c_1 Lambda(t_{n+l}; theta) + c_2 [Lambda(t_q; theta) - Lambda(t_{n+l}; theta)] + c_3 t_{n+l}.   (17)

Note that Lambda(t_q; theta) is independent of l and can be evaluated via Eq. (5). When the observation point is t_n, l corresponds to the remaining testing length. Since minimizing C(l) is equivalent to maximizing (c_2 - c_1) Lambda(t_{n+l}; theta) - c_3 t_{n+l}, our optimization problem essentially reduces to

max_{l >= 0} [Lambda(t_{n+l}; theta) - c_3 t_{n+l} / (c_2 - c_1)].   (18)

When SRGMs are assumed, the mean value function Lambda(t; theta) can be estimated from the underlying data. In the context of neural computation, on the other hand, the output of the MLP in Eq. (16) is regarded as an estimate of Lambda(t; theta); hence the output sequence (z_1, z_2, ..., z_l) corresponds to Lambda(t_{n+1}; theta), Lambda(t_{n+2}; theta), ..., Lambda(t_{n+l}; theta) in the MIMO type of MLP. Dohi et al. (1999) formulated an essentially similar but somewhat different dual problem to Eq. (18). Consequently, the optimal release time is the point with the maximum vertical distance between the predicted curve Lambda(t; theta) and the straight line c_3 t / (c_2 - c_1) in the two-dimensional plane, so the existence of a finite optimal software release time can be checked graphically. In fact, when the optimum coincides with the observation point t_n, it is optimal not to continue testing the software product any longer. In other words, the optimization problem considered here is a prediction problem of the function Lambda(t; theta).
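The graphical rule of Eq. (18), i.e., maximizing the vertical distance between the predicted mean value curve and the straight line c_3 t / (c_2 - c_1), can be sketched numerically as follows; the cost values and the mean value curve are toy assumptions, not estimates from the paper's data sets.

```python
import numpy as np

c1, c2, c3 = 1.0, 2.0, 1.0                # illustrative cost parameters (c2 > c1)
t = np.arange(0.0, 101.0)                 # candidate release times (testing days)
Lam = 500.0 * (1.0 - np.exp(-0.05 * t))   # a predicted mean value curve (toy)

# Maximize the vertical distance between Lambda(t) and the line c3*t/(c2 - c1).
obj = Lam - c3 * t / (c2 - c1)
t_opt = t[np.argmax(obj)]
```

Because the objective here is unimodal, the maximizer coincides with the point where the tangent of the curve becomes parallel to the straight line.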

Numerical Experiments
We give numerical examples of predicting the optimal software release time with our neural network approach, where four data sets, DS1-DS4, consisting of software fault count (grouped) data are used for the analysis (see Lyu (1996)). In these data sets, the length of the software testing period and the total number of detected software faults are given by (62, 133), (41, 351), (46, 266) and (109, 535), respectively. Finding the desired output via the BP algorithm requires much computational cost for the gradient descent. The initial guesses of the weights w_ij, w_js, w_0j and w_0s are given by uniform random variates in the range [-1, +1]. The total number of iterations in a BP run is 1,000 and the convergence criterion on the minimum error is 0.001, the same as in our previous paper (Begum and Dohi, 2016a). In our experiments, it turned out that the search range of the transformation parameter lambda should be [-3, +2]. To evaluate the performance in predicting the optimal software release time via the MIMO type of MLP or an SRGM, we define the error criterion as the relative deviation of the predicted optimal release time from the a posteriori optimal testing time, where the a posteriori optimal testing time is calculated by finding the point t_s (s = 0, 1, 2, ..., q) which maximizes Lambda(t_s; theta) - c_3 t_s / (c_2 - c_1), with t_q the end point of the whole data set. This error criterion measures the predictive performance when all the data x_1, ..., x_n, x_{n+1}, ..., x_q are given. In Table 3, we present the a posteriori optimal release times for DS1-DS4 when l = 4, c_1 = c_3 = 1 and c_2 varies in the range 2-10.
Since the a posteriori optimal release time in Table 3 is not sensitive to the cost parameter c_2, we focus on only two cases hereafter: c_2 = 2 and c_2 = 10. In Tables 4 and 5 we give the prediction results on the optimal software release time for DS1 with c_2 = 2 and c_2 = 10, respectively, where the predicted optimal release time and its associated prediction error are calculated at each observation point (the 50%-90% points of the whole data). In these tables, the bold numbers indicate where the MIMO type of MLP is the best prediction model in comparison with the best SRGM among the eleven models in Table 1. In the MIMO type of MLP, we compare the Box-Cox power transformation results with the non-transformed case (Normal) and with the best SRGM. In the SRGM column, we denote the best SRGMs in terms of prediction performance (in the sense of minimum average relative error) and estimation performance (in the sense of minimum Akaike Information Criterion; AIC) by P and E, respectively, where the capability of a prediction model is measured by the Average relative Error (AE),

AE = (1/l) Sum_{s=1}^{l} RE_s,

where RE_s, the relative error at the future time t_{n+s}, is given by

RE_s = |x_pred_{n+s} - x_{n+s}| / x_{n+s},

with x_pred_{n+s} the predicted and x_{n+s} the observed cumulative number of faults. We regard the prediction model with the smaller AE as the better prediction model. Tables 4 and 5 show that our approaches provide smaller prediction errors than the common SRGMs at almost all observation points (50%-90%).

Table 4. Prediction results of optimal software release time for DS1 with c_2 = 2
Table 5. Prediction results of optimal software release time for DS1 with c_2 = 10

In contrast, at the 80% observation point the non-transformed case (Normal) showed small errors (see Table 5).
Even in these cases, it should be noted that the best SRGM with the minimum AIC is not always the best SRGM with the minimum AE. This fact indicates that one cannot know in advance which SRGM is best for judging when to stop software testing.
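The AE criterion can be sketched in a few lines; the observed and predicted fault counts below are toy values, not taken from DS1-DS4.

```python
import numpy as np

# Toy values: observed vs. predicted cumulative fault counts at t_{n+1}, ..., t_{n+l}.
x_true = np.array([120.0, 126.0, 131.0, 133.0])
x_pred = np.array([118.0, 125.0, 134.0, 138.0])

re = np.abs(x_pred - x_true) / x_true    # relative error at each future time
ae = re.mean()                           # average relative error (AE)
```

A smaller `ae` indicates a better prediction model under this criterion.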
Tables 6 and 7 give the optimal software release times for c_2 = 2 and c_2 = 10 with DS2. In the latter phase of software testing, i.e., at the 80%-90% observation points, SRGMs such as txvmin, txvmax and lxvmin could offer smaller errors than our MIMO type of MLP (see Table 6). Similarly, Tables 8-11 provide the optimal software release times for c_2 = 2 and c_2 = 10 with DS3 and DS4, respectively. In most cases our MIMO type of MLP could give the best timing of software release throughout all testing phases. Throughout our numerical experiments, we observe that as the cost parameter c_2 increases, the resulting optimal software release time also increases.

Table 6. Prediction results of optimal software release time for DS2 with c_2 = 2
Table 7. Prediction results of optimal software release time for DS2 with c_2 = 10
Table 8. Prediction results of optimal software release time for DS3 with c_2 = 2
Table 9. Prediction results of optimal software release time for DS3 with c_2 = 10
Table 10. Prediction results of optimal software release time for DS4 with c_2 = 2
Table 11. Prediction results of optimal software release time for DS4 with c_2 = 10

Conclusion
Almost all software is complex in nature and has a huge testing scope. Finding every defect in the software is possible in principle, but the testing would then be endless, and testing cycles continue until a decision is made on when to stop. It is therefore even more challenging to reach a stopping decision that minimizes the relevant expected cost. We have given illustrative examples with four real software fault data sets to derive estimates of the optimal software release time, and performed a sensitivity analysis at each observation point. The numerical examples showed that our MIMO type of MLP could give accurate optimal software release times in all testing phases. Additionally, the method can help project managers decide when to stop software testing for market release at an earlier time. Future work includes developing a tool to support the optimal software testing decision by automating the BP learning and the determination of the hidden layer size.