Software reliability growth models: A comparison of linear and exponential fault content functions for study of imperfect debugging situations

ABSTRACT: The software testing process aims at building confidence in software before its use in real-world applications. The reliability of a software system is always of importance, and as errors are detected and corrected during testing, the reliability of the software grows. Many formulations, in the form of Software Reliability Growth Models (SRGMs), have been proposed to model this growth, and many of them are based on the Non-Homogeneous Poisson Process (NHPP) framework. In this paper, a parallel comparison of the performance of two proposed SRGMs is carried out, considering linear and exponential fault content functions for the study of imperfect debugging situations. The performance of the proposed models is compared with some well-known existing software reliability models, and the proposed models are validated on real-world datasets. Three goodness-of-fit criteria, namely mean square error, predictive-ratio risk and predictive power, are used to carry out the performance comparison of the models. Using these criteria on six actual failure datasets, it is concluded that the proposed Model-2, which consistently outperforms Model-1, fits the actual failure data better and has better predictive power than the other considered SRGMs for at least two datasets.

ABOUT THE AUTHOR Javaid Iqbal received his BSc in Mathematics from the University of Kashmir, and a Master's in Computer Applications from the same university in 2004. He completed his PhD at the University of Kashmir in 2014. His research interests are software engineering, software reliability engineering and reliability modeling, and software performance engineering. He is a member of ACM, CSI and IETE.

PUBLIC INTEREST STATEMENT
Software reliability is one of the prime quality concerns: reliable software is dependable. Software testing helps to improve the reliability of software. This paper explores the assumption of imperfect debugging under two possibilities, i.e. linear and exponential fault content functions, and demonstrates, using real-world datasets, that under the given assumptions the proposed model with an exponential fault content function is better suited to the given datasets. Researchers in this field will benefit from the comparison of imperfect debugging situations carried out in this paper.

Introduction
The scope and importance of having highly reliable software solutions deployed in every walk of life is an accepted and acknowledged fact. Evaluation of software before its deployment is a must, and reliability is an important quality characteristic of any software. Software reliability is defined as the probability that software will provide failure-free operation in a fixed environment for a fixed interval of time (Musa, Iannino, & Okumoto, 1987). The reliability of software applications is measured using mathematical formulations called software reliability growth models (SRGMs), which describe the error-detection and correction processes. SRGMs find application in many practical areas, including safety-critical systems such as the Space Shuttle's on-board systems software (Schneidewind & Keller, 1992) and weapon systems (Carnes, 1997).
Over the past three decades, many SRGMs have been formulated under the Non-Homogeneous Poisson Process (NHPP) modeling framework. Many researchers have used NHPP-based SRGMs to capture the reliability growth of software through the processes of testing and debugging. These models include the Goel and Okumoto model (Goel & Okumoto, 1979), the Yamada and Ohba model (Yamada, Ohba, & Osaki, 1983) and the Ohba model (Ohba, 1984). All such models rest on underlying assumptions. The general NHPP model is based on the following assumptions: (1) the testing/debugging process follows an NHPP, and (2) a debugging procedure is invoked immediately when a software error is detected. Researchers adopt realistic assumptions to make a model better fit the error-detection and correction process; for example, it is more realistic to consider imperfect debugging than perfect debugging. Goel and Okumoto (1979) proposed an exponential SRGM. The Goel-Okumoto model is one of the earliest NHPP models for software reliability and has been used extensively in the literature. It has a concave-shaped mean value function and cannot accommodate learning effects very well. As a result, many modifications and generalizations of the G-O model have been proposed: Yamada and Ohba (1983) proposed the delayed S-shaped SRGM, while Ohba (1984) proposed the inflection S-shaped SRGM to allow for learning effects. These models (Goel & Okumoto, 1979; Yamada et al., 1983; Ohba, 1984) assume perfect debugging. The learning phenomenon has been considered in Yamada, Tokuno, and Osaki (1992), Pham and Zhang (2003) and elsewhere under the imperfect debugging assumption. The learning phenomenon and imperfect debugging are two important aspects of many NHPP models.
In this study, imperfect debugging is modeled through an increasing function, with linear and exponential behaviors assumed separately for the sake of comparison. This function is called the total fault content (TFC) function. The motivation behind this paper is to study imperfect debugging through a parallel comparison of the performance of the proposed software reliability growth models, which incorporate linear and exponential fault content functions, and then to compare the results with some well-known existing models using real datasets. Indeed, linear and exponential fault content functions have been studied extensively to model imperfect debugging situations. It may be noted that the difference between perfect and imperfect debugging lies in the nature of the TFC function, i.e. the total number of faults: perfect debugging has a constant TFC, whereas imperfect debugging has a dynamic TFC that is a function of testing time.
The rest of the paper is organized as follows. In Section 2 the model development is discussed: the mean value functions of the two models are derived, many well-known existing models are tabulated, the sources of the six datasets used in this study are listed, and the estimated parameter values are tabulated for these six datasets. In Section 3, numerical examples are given to illustrate the usefulness of these models and to carry out the performance comparison of the models using three comparison criteria. Finally, the discussion and the conclusion of the analysis are presented in Sections 4 and 5, respectively.

Model development
The solution of the following differential equation, which relates the failure intensity to the fault content, is used to represent a general class of NHPP models (Pham, 2007; Pham, Nordmann, & Zhang, 1999):

dm(t)/dt = r(t) [a(t) − m(t)]    (1)

where a(t) is the time-dependent fault content function, representing the total number of faults in the system including the initial faults and the faults introduced due to imperfect debugging; r(t) is the time-dependent fault detection rate function, representing faults detected per unit time; and m(t) is the mean value function.
The general solution of differential Equation (1) is represented by the following equation:

m(t) = e^(−∫_{t0}^{t} r(s)ds) [ m0 + ∫_{t0}^{t} a(τ) r(τ) e^(∫_{t0}^{τ} r(s)ds) dτ ]    (2)

where m(t0) = m0 is the marginal condition of (2) and t0 is the time at which debugging starts.
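As a sanity check, the general solution can be verified numerically: integrating Equation (1) directly should reproduce the closed form obtained from Equation (2). A minimal Python sketch, using the special case of constant a(t) = a and r(t) = b (which reduces to the Goel-Okumoto mean value function) and illustrative parameter values of our own choosing:

```python
import math

# Numerical check of the general NHPP relation dm/dt = r(t) * (a(t) - m(t)).
# For the special case a(t) = a (constant) and r(t) = b (constant), the
# solution of Equation (2) with m(0) = 0 reduces to the Goel-Okumoto
# mean value function m(t) = a * (1 - exp(-b*t)).

def solve_mvf(a_func, r_func, t_end, steps=100000):
    """Integrate dm/dt = r(t)*(a(t) - m(t)) with m(0) = 0 (Euler scheme)."""
    dt = t_end / steps
    m, t = 0.0, 0.0
    for _ in range(steps):
        m += dt * r_func(t) * (a_func(t) - m)
        t += dt
    return m

a, b = 120.0, 0.15          # illustrative parameter values (assumptions)
numeric = solve_mvf(lambda t: a, lambda t: b, t_end=10.0)
closed = a * (1 - math.exp(-b * 10.0))
print(round(numeric, 3), round(closed, 3))  # the two values agree closely
```

The same integrator works for any choice of a(t) and r(t), which is what the two proposed models exploit.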
With this modeling methodology available in the literature, our models are developed under the common assumption of an increasing time-dependent fault detection rate function r(t) that exhibits S-shaped behavior to capture learning effects, represented by the following logistic equation:

r(t) = α / (1 + η e^(−αt))

where α represents autonomous errors and η represents the learning factor.
However, as per the intent of this paper, we consider two cases of imperfect debugging. In the case of Model-1, a linear time-dependent fault content function a(t) is assumed (Yamada et al., 1992):

a(t) = a(1 + βt)

where a is the initial fault content and β is the constant fault introduction rate. As is evident, this imperfect debugging situation considers that the number of faults introduced grows in proportion to the testing time. In the case of Model-2, an exponential time-dependent fault content function is assumed (Yamada et al., 1992):

a(t) = a e^(βt)

where a is the initial fault content and faults are introduced at an exponential rate with testing time.
These two behaviors of the TFC function have been used extensively to represent imperfect debugging situations in the literature on NHPP models. Substituting the assumed r(t), and each behavior of a(t) in turn, into the generalized mean value function of Equation (2), with t0 = 0 and m(0) = 0, we arrive at the following mean value functions for our models:

Model-1 (MVF): m1(t) = e^(−∫_0^t r(s)ds) ∫_0^t a(1 + βτ) r(τ) e^(∫_0^τ r(s)ds) dτ

Model-2 (MVF): m2(t) = e^(−∫_0^t r(s)ds) ∫_0^t a e^(βτ) r(τ) e^(∫_0^τ r(s)ds) dτ

Table 1 summarizes the proposed models and other well-known existing NHPP models. Table 2 presents the datasets by label and lists their sources.
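Under the assumptions above, the two mean value functions can be evaluated numerically by integrating dm/dt = r(t)[a(t) − m(t)] for each TFC choice. In the sketch below, the logistic form of r(t) and all parameter values are illustrative assumptions, not the paper's fitted estimates. Since e^(βt) ≥ 1 + βt for all t ≥ 0, Model-2's fault content dominates Model-1's at every time point, so its mean value function is larger:

```python
import math

# Sketch of the two proposed mean value functions, obtained by integrating
# dm/dt = r(t) * (a(t) - m(t)) with m(0) = 0.  The logistic r(t) and all
# parameter values are illustrative assumptions, not fitted estimates.

def mvf(a_func, alpha=0.3, eta=4.0, t_end=20.0, steps=20000):
    r = lambda t: alpha / (1 + eta * math.exp(-alpha * t))  # S-shaped r(t)
    dt = t_end / steps
    m, t = 0.0, 0.0
    for _ in range(steps):
        m += dt * r(t) * (a_func(t) - m)
        t += dt
    return m

a0, beta = 100.0, 0.02                        # assumed initial faults / rate
m1 = mvf(lambda t: a0 * (1 + beta * t))       # Model-1: linear TFC
m2 = mvf(lambda t: a0 * math.exp(beta * t))   # Model-2: exponential TFC
print(m1 < m2)  # exponential fault introduction yields a larger m(t)
```

The ordering m1(t) < m2(t) for t > 0 holds for any positive parameter choices, by the comparison principle for the linear ODE in Equation (1).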
The parameters estimated for some of the models listed in Table 1 are presented in Table 3 for the six datasets used.
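As a hedged illustration of how such parameter estimates are obtained, the sketch below fits the Goel-Okumoto mean value function m(t) = a(1 − e^(−bt)) to cumulative failure counts by least squares. The data are synthetic and the crude grid search is a stand-in for a proper nonlinear least-squares routine applied to the real datasets of Table 2:

```python
import math

# Synthetic cumulative failure counts generated from m(t) = 100*(1 - e^(-0.2 t));
# real studies fit against the six actual failure datasets instead.
data = [(t, 100 * (1 - math.exp(-0.2 * t))) for t in range(1, 21)]

def sse(a, b):
    """Sum of squared errors of the Goel-Okumoto MVF against the data."""
    return sum((a * (1 - math.exp(-b * t)) - m) ** 2 for t, m in data)

# Crude grid search over (a, b); in practice one would use a nonlinear
# least-squares solver rather than a grid.
best = min(((a, b) for a in range(50, 151)
            for b in [i / 100 for i in range(5, 51)]),
           key=lambda p: sse(*p))
print(best)  # → (100, 0.2), recovering the generating parameters
```

The same procedure, applied to each model's mean value function, yields parameter tables such as Table 3.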

Numerical applications
In this section, using the comparison criteria of mean square error (MSE), predictive-ratio risk (PRR) and predictive power (PP), a parallel comparison of our imperfect debugging models is carried out. The models are also compared to some well-known pre-existing models. Six sets of actual software failure data obtained from real projects have been used. Table 4 lists the comparison criteria used in this study. In the formula notation, m_i is the total number of errors cumulated between 0 and t_i, m(t_i) is the estimated number of errors at t_i obtained from fitting the mean value function of the proposed model, n is the number of observations and k is the number of parameters (Table 4).
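To make the notation concrete, the three criteria can be implemented directly from their standard definitions (Pham, 2007). The observed and fitted values below are illustrative placeholders, not taken from the six datasets:

```python
# The three goodness-of-fit criteria from their standard definitions
# (Pham, 2007); for each, a smaller value indicates a better fit.

def mse(pred, obs, k):
    """Mean square error with (n - k) degrees of freedom."""
    n = len(obs)
    return sum((p - o) ** 2 for p, o in zip(pred, obs)) / (n - k)

def prr(pred, obs):
    """Predictive-ratio risk: deviations scaled by the estimate m(t_i)."""
    return sum(((p - o) / p) ** 2 for p, o in zip(pred, obs))

def pp(pred, obs):
    """Predictive power: deviations scaled by the observation m_i."""
    return sum(((p - o) / o) ** 2 for p, o in zip(pred, obs))

obs = [10, 25, 38, 50, 58]    # cumulative failures (illustrative)
pred = [12, 24, 40, 49, 60]   # fitted m(t_i) values (illustrative)
print(mse(pred, obs, k=2), round(prr(pred, obs), 4), round(pp(pred, obs), 4))
```

Note the asymmetry of PRR and PP: PRR penalizes underestimation more heavily, while PP penalizes overestimation, which is why both are reported alongside MSE.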
For each of these comparison criteria, a smaller value on the same dataset indicates a better-fitting model. Tables 5-7 present the criterion values (MSE, PRR and PP) and ranks in a value (rank) format. From here onwards, we use the shorter model names listed in Table 1; the corresponding references are also available in Table 1 and are not repeated.

Label/reference/dataset and description:

[1] Zhang and Pham (1998)/Failure data of Misra system. Documented in Misra (1983) and also available in Zhang and Pham (1998), the dataset summarizes the number of failures per 1-h execution time interval, recorded over 25 h with 136 cumulative failures.

[2] Hossain and Dahiya (1993)/Failure data of NTDS system. Documented in Hossain and Dahiya (1993); the core of NTDS is the development of software for a real-time, multi-computer system and consists of 38 different project schedules, each module supposed to have three phases: production, test and user. The failure data of this software report time-between-failures in days. A total of 26 failures were found during the production phase and 5 during the test phase.

[3] Pham and Zhang (2003)/Failure data of Tandem software. Documented in Wood (1996); comprises four datasets obtained from four major releases (Releases 1-4) of software products at the Tandem Computers company (Zhang, Teng, & Pham, 2003). The Release-1 dataset comprises a total of 100 faults recorded over 20 weeks. The actual numbers of CPU hours used are also documented, totaling 10,000 CPU hours. To avoid confidentiality issues, the number of faults was normalized to the range 0-100, and the testing-effort consumption was proportionately translated into the range [0, 10,000] (Lin & Huang, 2008; Wood, 1996).

[4] Bai, Hu, Xie, and Ng (2005)/Failure data of wireless data service system. Documented in Jeske and Zhang (2005), where its main functions, routing of voice channels and signaling messages to the relevant radio resources and processing entities, are also described. This dataset comprises three datasets (Releases 1-3). R-1 had a life cycle of 13 months in the field with a cumulative exposure time of 58,633 system-days, recording a total of 33 failures, of which 19 were unique. R-2 also had a life cycle of 13 months in the field with a cumulative exposure time of 167,900 system-days, recording a total of 115 failures, of which 71 were unique. R-3 consists of failures recorded during feature testing and load testing; exact specifications of this release can be found in Jeske and Zhang (2005). However, R-3 field failure data for its first 6 months of deployment are recorded as 19 observations, with 22 cumulative observed failures and a cumulative exposure time of 64,390 system-days.

Discussion of results
As can be seen from Table 5, on the basis of the MSE criterion, Model-2, which uses the exponential fault content function, outperforms Model-1, which uses the linear fault content function. In fact, Model-2 performs best among all considered models for datasets 3 and 4 and ranks third for datasets 1 and 5. Rank 4 is the worst Model-2 obtains, for datasets 2 and 6. This means that, in terms of MSE values, Model-2 provides significantly better estimation than any other model for datasets 3 and 4.
The second comparison criterion used is PRR. The PRR values for Model-2 outrank those of Model-1, as can be seen from Table 6. Here also, Model-2 outranks all other models for datasets 3, 4 and 5; it ranks second for dataset 6 and third for dataset 1. However, for dataset 2, Model-2 performs worst. Model-1 ranks third for datasets 2 and 4.
On the basis of the PP values in Table 7, Model-2 predicts better than all other models for four of the six datasets used, i.e. datasets 3, 4, 5 and 6, and predicts third best for the remaining two. This is indicative of the significant predictive power of Model-2 over the rest of the models. Model-1 ranks second for dataset 6 and third for dataset 4. For the purpose of the parallel comparison of Model-1 and Model-2, it may be noted that Model-2 always outranks Model-1 on the given six datasets. Therefore, for the datasets used in this study, it is observed that the exponential imperfect debugging situation of Model-2 is the better-suited choice.

Conclusion
We present two new software reliability growth models obtained by considering two imperfect debugging situations, represented by linear and exponential fault content functions. The mean value functions of some well-known existing models and of the two proposed models have been used to estimate parameters for six datasets. Three comparison criteria, MSE, PRR and PP, have been used to present a parallel comparison between the two proposed models and against some well-known existing models.
Based on the analysis of the models using these comparison criteria on six actual failure datasets, it is concluded that the proposed Model-2, which always outperforms Model-1, fits the actual failure data better and has better predictive power than the other considered SRGMs for at least two datasets (datasets 3 and 4). In fact, Model-2 has better predictive power for four of the six datasets considered in this study. This indicates that, for the given assumptions, failure datasets and comparison criteria, Model-2 performs better than the other models on each comparison criterion for at least two datasets, which is significant enough to validate the models.
However, future research needs to be carried out on extending the existing forms of imperfect debugging, or on finding new forms that better represent realistic imperfect debugging situations.