SDE-Based Unified Scheme for Developing Entropy Prediction Models for OSS

Today, software must be modified continually to meet user requirements. Incorporating these modifications entails extensive changes to the source code, and over a period of time these changes make the software complex. Broadly, three types of code changes occur in the source code, namely bug repair, feature enhancement and addition of new features, and these changes introduce uncertainty into the bug removal rate. In this paper, these uncertainties have been modeled explicitly using a three-dimensional Wiener process that captures the three types of fluctuation, and an entropy prediction modeling framework with a unified approach is proposed. The analytical solution of the governing equation is obtained using Itô's process. The models are fitted on three real-life projects, namely Avro, Hive and Pig, of Apache open source software (OSS). The experimental findings show that the present models exhibit accurate estimation results and strong predictive power.

Keywords: Complexity of code changes (CCC), Differential equation (DE), Distribution, Entropy, Fluctuations, Open source software (OSS), Unification.


Introduction
Software's role in the smooth functioning and productivity enhancement of almost every sector cannot be denied. Advancements in the computing capabilities of devices have been one of the significant reasons for the introduction of software into time-critical systems. OSS has brought major breakthroughs in development trends for large-scale software systems. For such software, a prototype is developed and then evolves through voluntary contributions across the globe. The development pattern followed by OSS does not rigidly comply with the standards described under the traditional software engineering lifecycle (Bhatt et al., 2017). With the enormous rise in software demand, the IT industry has become highly competitive. Quality is the most critical characteristic that ensures the sustainability of software in the market, and reliability is an important metric for quantifying a system's quality. In past decades numerous studies have been carried out in the direction of reliability modeling and assessment (Kapur et al., 2011a).
Open Source Software (OSS) differs characteristically from closed source software. Features such as irregular volunteer participation, lack of hierarchical management, absence of delivery schedules and lack of rigid SDLC principles distinguish it from closed source systems. OSS development begins with an idea or concept, and a working prototype is produced by a few developers or a small team. The core system with limited functionality is then released over the internet along with its source code for further enhancement and refinement through global volunteer participation. The most distinguishing feature relative to closed source software is that OSS is released with very little functionality testing; the user population that adopts such systems therefore becomes the deciding factor in its reliability growth. The OSS keeps evolving through the contributions of volunteers across the world.
The OSS community generally follows a layered architecture that groups users into various categories, such as core members, active developers, peripheral developers, bug fixers, bug reporters and passive users, from the core to the periphery. Users are assigned participation levels, which may change with time and contributions. People start working with the software, and when they find some glitch in its workflow they report it as a failure and the fault is identified. After fault identification, developers try to remove the fault.
Software source code is enhanced to satisfy users' needs. The huge number of changes makes the software complex over a period of time; these changes increase the complexity of the code, which in turn leads to the introduction of bugs. Changes are made continuously in the software to remove bugs, enhance features and implement new functionalities (Deepika et al., 2017). Open source software is built by contributors from diverse communities and keeps moving rapidly. Code complexity is one of the major attributes determining the quality and reliability of software. Many contributors make changes in the source code to produce quality software in limited time, and it becomes difficult to keep track of all the changes committed to the code, which makes the source code complex. Contributors interact with each other through discussion forums. Three types of code changes occur in the source code, namely bug repair, feature enhancement and addition of new features. Bugs are generated in software mainly due to miscommunication or lack of communication among active users, frequently changing requirements, early release pressures, programming errors and increasing software complexity.
The remainder of the study is structured as follows: Section 2 presents a detailed review of past literature. Section 3 describes the modeling framework along with the notations and assumptions used for formulating the models. Section 4 presents the parameter estimation and comparison criteria results for the proposed models. Finally, the last section summarizes the key results and provides concluding remarks, followed by the references.

Literature Review
Numerous studies have been carried out to assess the reliability of OSS systems. Tamura and Yamada (2008) investigated reliability assessment by combining it with neural networks. Some of the latest SRGMs proposed to assess the reliability growth of software systems while incorporating practical phenomena are Chatterjee and Shukla (2016), Li and Pham (2017) and Zhu and Pham (2017). Stochastic differential equation-based models have been proposed for OSS, and their application to the optimal version update problem was discussed by Tamura et al. (2008). Li et al. (2011) presented another model to predict the optimal version update time for OSS. Yang et al. (2016) included the processes of fault detection and correction and performed reliability modeling for multi-release OSS. Many other studies have modeled reliability for OSS: Rahmani et al. (2009, 2010) and Zhou and Davis (2005) studied OSS reliability models experimentally and also performed comparative analyses.
The frequent changes in the source code of open source software make the software complex and more error-prone. The CCC has been quantified and modeled using an information theory-based measure called entropy. In the literature, researchers have worked on entropy, and it is widely exercised in software reliability. It was Shannon (1948, 1951) who laid the foundation of information theory and discussed measures for the efficiency of a communication channel and entropy. The CCC metric has also been used for the maintenance of software (Kafura and Reddy, 1987). In an early article, Fenton and Neil (1999) used the detection rate to predict defects. Following them, Graves et al. (2000) contributed work related to product measures. Several researchers proposed approaches to resolve fault potential (Khoshgoftaar et al., 1999; Leszak et al., 2002; Arisholm and Briand, 2006). Others discussed bug prediction across various projects in the absence of a single universal metric (Nagappan and Ball, 2005). Moser et al. (2008) presented a comparative study of change and code metrics. Further researchers compared the history-of-complexity metric with product measures for bug prediction (Hassan, 2009; Kamei et al., 2010).
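For intuition, the entropy measure used in this line of work is Shannon's, computed over the probability that a change in a period touches a given file; the file-change counts below are hypothetical and purely illustrative:

```python
import math

def change_entropy(changes_per_file):
    """Shannon entropy (in bits) of the distribution of code changes
    across files in one period: higher entropy means changes are
    scattered over many files, i.e., a more complex change process."""
    total = sum(changes_per_file)
    probs = [c / total for c in changes_per_file if c > 0]
    return -sum(p * math.log2(p) for p in probs)

# Changes concentrated in one file -> low entropy (about 0.82 bits)
print(change_entropy([10, 1, 1]))
# Changes spread evenly over three files -> maximum entropy log2(3) bits
print(change_entropy([4, 4, 4]))
```

A release period whose commits are spread uniformly over many files thus registers a high CCC, while a period of focused changes registers a low one.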
Later on, D'Ambros et al. (2010, 2012) presented a study of decay-based models. Singh and Chaturvedi (2012) then suggested an entropy-based model for bug prediction utilizing support vector regression, and further proposed a mathematical model to study the diffusion of the CCC. Chaturvedi et al. (2013) predicted the release time of software using the CCC. The researchers later extended this work and discussed predicting the next year's expected bugs from the current year's CCC.
Along with several other measures of entropy, Arora et al. (2014) considered bug prediction models using different types of measures. Chaturvedi et al. (2014) used measures of entropy in the prediction of CCC. A defect prediction model was studied to estimate how many defects would arise during the improvement of software lines on the basis of entropy (Jeon et al., 2014). Singh et al. (2015) proposed three models, including software reliability growth models and CCC-based models, to predict bugs in software. Recently, Anand et al. (2019a) characterized the CCC into two aspects, feature improvement and bug fixing, for OSS. The present work defines three types of fluctuations, namely bug fix, improvement and addition of new characteristics. The proposed framework can handle any distribution function and is thus an important step towards the unification of SDE-based models that depend on specific distribution functions.

Model Development
In this section, the emphasis is laid on describing the proposed entropy-based models. Section 3.1 lists the set of notations and assumptions that have been used.

Assumptions
At the initial time t = 0, no changes have been made to any file and the CCC (entropy) is zero, i.e., H(0) = 0.
Using the above assumption, and to capture the uncertainty, we propose unified models for the CCC diffused into the software over a period of time. Stochastic differential equations have an inbuilt tendency to cater to irregular fluctuations that cannot be represented by an ordinary differential equation; such an equation describes Brownian motion and can be solved using Itô calculus. In line with the software reliability literature, the following linear differential equation is used to model the diffusion rate of the CCC (Kapur et al., 2011b):

    dH(t)/dt = k(t) [a − H(t)]    (1)

where H(t) denotes the entropy diffused in the code by time t, a its eventual total, and k(t) the relative diffusion rate of errors being detected/removed. In practice, k(t) might not be known in its exact sense, and fluctuations may arise from environmental factors. Such uncertainty can be accounted for by considering an associated "noise" term as follows (Singhal et al., 2019):

    dH(t)/dt = [h(t) + noise(t)] [a − H(t)]    (2)

where h(t) = f(t)/(1 − F(t)) is the hazard rate, with F(t) a distribution function and f(t) its density, representing the random external factor. The exact behaviour of the noise is difficult to determine; only the distribution function it might follow can be identified in advance, while h(t) itself is time dependent and non-random in nature.
The noise is taken proportional to Gaussian white noise:

    dH(t)/dt = [h(t) + s γ(t)] [a − H(t)]    (3)

where γ(t) is the factor that portrays the Gaussian white noise and s accounts for the magnitude of the irregular fluctuations, with dZ(t) = γ(t) dt defining the Wiener process Z(t). Before we proceed, it is important to understand the basic definition and properties of the Wiener process (Brownian motion): Z(t) is in essence a series of normally distributed random variables indexed by time points, whose variances increase to reflect that it is more uncertain to predict the value of the process after a longer period of time.
The properties of the Wiener process {Z(t), t ≥ 0} are: (i) Z(0) = 0 and Z(t) is a continuous function of t; (ii) increments over non-overlapping intervals are independent; (iii) Z(t + Δt) − Z(t) ~ N(0, Δt), so that E[Z(t)] = 0 and Var[Z(t)] = t.
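These properties can be illustrated with a direct simulation (a minimal sketch; the step count and seed are arbitrary choices):

```python
import random

def wiener_path(T=1.0, n=1000, seed=42):
    """Simulate a standard Wiener process Z(t) on [0, T]:
    Z(0) = 0 with independent increments Z(t + dt) - Z(t) ~ N(0, dt)."""
    random.seed(seed)
    dt = T / n
    z, path = 0.0, [0.0]
    for _ in range(n):
        z += random.gauss(0.0, dt ** 0.5)  # standard deviation sqrt(dt)
        path.append(z)
    return path

path = wiener_path()
print(path[-1])  # Z(1); since Var[Z(1)] = 1, values of order 1 are typical
```

Averaging many such endpoints would show the sample mean near 0 and sample variance near T, matching properties (ii) and (iii).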
Equation (3), being driven by white noise, is interpreted as a stochastic differential equation of Itô type:

    dH(t) = { f(t)/(1 − F(t)) − s²/2 } [a − H(t)] dt + s [a − H(t)] dZ(t)

Taking expectations on both sides and writing G(t) = E[H(t)] gives

    dG(t) = { f(t)/(1 − F(t)) − s²/2 } [a − G(t)] dt + E[ s (a − H(t)) dZ(t) ]    (7)

Using the ideology of the Itô integral from Anand et al. (2018a, 2018b, 2019a, 2019b), the last component of Eq. (7) vanishes, that is, E[s (a − H(t)) dZ(t)] = 0: a non-anticipating function is statistically independent of the future increments of the Wiener process, so the expected value of such a product vanishes.
Therefore DE (7) can be written as

    dG(t) = { f(t)/(1 − F(t)) − s²/2 } [a − G(t)] dt    (8)

whose solution, with G(0) = 0, is

    G(t) = a [ 1 − (1 − F(t)) exp(s² t / 2) ]    (9)

This is the distribution-based mean value function using the SDE. The complexity of the software system obliges the developer to move from purely NHPP-based modeling to a more realistic setting that accounts for various uncertain factors during the testing process. The behaviour of the testing process, which is largely uncertain in nature, is influenced by factors such as testing effort, testing skill, efficiency and methods. To capture these uncertain aspects, various researchers have considered a one-dimensional Wiener process; however, changes in software source code are unavoidable, and the source code is frequently modified to meet users' extensive requirements (Kapur et al., 2011a). The code change process refers to the study of patterns of source code modifications. These changes occur due to bug repair (BR), feature enhancement (FE) and addition of new features (NF). In the present work, the corresponding uncertainties are represented by the magnitudes s1 (BR), s2 (FE) and s3 (NF) respectively (Singh and Sharma, 2014).

Bug repair
BR refers to the changes or modifications made in the code due to the fixing of a bug.

Features enhancement
FE refers to the changes made due to cosmetic improvements such as formatting, alignment, justification and comments.

Addition of new features
NF refers to the changes made in the code due to the incorporation of new features or new components.
Hence, Eq. (9) can be extended to reflect a three-dimensional Wiener process with the three types of fluctuations, giving the mathematical structure (Tamura and Yamada, 2014):

    dH(t) = { f(t)/(1 − F(t)) − (s1² + s2² + s3²)/2 } [a − H(t)] dt + s1 [a − H(t)] dZ1(t) + s2 [a − H(t)] dZ2(t) + s3 [a − H(t)] dZ3(t)    (10)
The solution of Eq. (10), under the seed value H(0) = 0, is

    H(t) = a [ 1 − (1 − F(t)) exp( −s1 Z1(t) − s2 Z2(t) − s3 Z3(t) ) ]

and, taking expectations and using E[exp(−si Zi(t))] = exp(si² t / 2), the expected entropy is

    G(t) = E[H(t)] = a [ 1 − (1 − F(t)) exp( (s1² + s2² + s3²) t / 2 ) ]    (11)

Now, using the solution process G(t) in Eq. (11), several entropy prediction models can be derived for different distributions. To the best of our knowledge, this is the first attempt to model the complexity of code changes together with its irregularity in a unified manner. Earlier proposals have studied these two aspects separately and in different fields; here they are studied jointly through the unified approach.
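Assuming the unified mean value function takes the standard SDE-SRGM form G(t) = a[1 − (1 − F(t))·exp((s1² + s2² + s3²)t/2)], it can be sketched as a single function parameterized by an arbitrary distribution F; the exponential F and all numeric values below are illustrative assumptions, not fitted estimates:

```python
import math

def entropy_mean(t, a, F, s1, s2, s3):
    """Unified expected entropy G(t) = a * [1 - (1 - F(t)) * exp(s_sq * t / 2)],
    where s_sq = s1^2 + s2^2 + s3^2 collects the fluctuation magnitudes for
    bug repair, feature enhancement and addition of new features."""
    s_sq = s1 ** 2 + s2 ** 2 + s3 ** 2
    return a * (1.0 - (1.0 - F(t)) * math.exp(s_sq * t / 2.0))

# Illustrative choice: exponential distribution F(t) = 1 - exp(-b t), b = 0.3
b = 0.3
F_exp = lambda t: 1.0 - math.exp(-b * t)
print(entropy_mean(10.0, 60.0, F_exp, 0.1, 0.1, 0.1))
```

Plugging a different cumulative distribution function into `F` yields each of the proposed models below without rederiving the solution.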

Proposed Model I
In this proposed model, it is assumed that the diffusion of entropy follows the Weibull distribution, which is much used in reliability engineering. Due to its versatile nature, it can take on the characteristics of other types of distributions; in particular, the Weibull distribution is a flexible generalization of the exponential distribution. Hence, in expression (11), F(t) follows the Weibull distribution

    F(t) = 1 − exp(−b t^κ)

where κ is the shape parameter (or slope) and b is the rate at which the complexity of code changes is diffused into the code. Substituting this F(t) into Eq. (11), the proposed model for entropy prediction is

    G(t) = a [ 1 − exp( −b t^κ + (s1² + s2² + s3²) t / 2 ) ]    (12)

The preceding equation represents the expected CCC at any given time t.
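Model I can be written directly as code; the parameterization assumes the Weibull form F(t) = 1 − exp(−b·t^κ) given above, and the numeric values in the example are illustrative rather than the fitted estimates of Table 2:

```python
import math

def weibull_entropy(t, a, b, kappa, s1, s2, s3):
    """Proposed Model I: Eq. (11) with F(t) = 1 - exp(-b * t**kappa),
    giving G(t) = a * [1 - exp(-b * t**kappa + s_sq * t / 2)]."""
    s_sq = s1 ** 2 + s2 ** 2 + s3 ** 2
    return a * (1.0 - math.exp(-b * t ** kappa + s_sq * t / 2.0))

# Illustrative parameters: a = 60 total entropy, b = 0.05, kappa = 1.5
for month in (0, 6, 12):
    print(month, weibull_entropy(month, 60.0, 0.05, 1.5, 0.05, 0.05, 0.05))
```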

Proposed Model II
It is considered that F(t) in Eq. (11) follows the Rayleigh distribution, which arises as the magnitude of a two-dimensional random vector whose coordinates are identically distributed:

    F(t) = 1 − exp(−b t² / 2)

Substituting this F(t) into Eq. (11), the entropy prediction model becomes

    G(t) = a [ 1 − exp( −b t²/2 + (s1² + s2² + s3²) t / 2 ) ]    (13)

Eq. (13) describes the expected entropy prediction based on the unified approach.
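Model II follows the same pattern; the Rayleigh parameterization F(t) = 1 − exp(−b·t²/2) is the form assumed here, and the parameter values are illustrative only:

```python
import math

def rayleigh_entropy(t, a, b, s1, s2, s3):
    """Proposed Model II: Eq. (11) with the Rayleigh distribution
    F(t) = 1 - exp(-b * t**2 / 2), giving
    G(t) = a * [1 - exp(-b * t**2 / 2 + s_sq * t / 2)]."""
    s_sq = s1 ** 2 + s2 ** 2 + s3 ** 2
    return a * (1.0 - math.exp(-b * t ** 2 / 2.0 + s_sq * t / 2.0))

print(rayleigh_entropy(10.0, 60.0, 0.02, 0.05, 0.05, 0.05))
```

The t² term in the exponent gives the Rayleigh model its characteristic S-shaped (slow start, then rapid) diffusion curve.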

Proposed Model III
In proposed model III, it is assumed that the cumulative distribution function follows the normal distribution, a continuous probability distribution for a real-valued random variable with a bell-shaped density, also called the Gaussian distribution. The general form of its probability density function is

    f(t) = (1 / (σ √(2π))) exp( −(t − μ)² / (2σ²) )

where the parameter μ is the mean or expectation of the distribution (and also its median and mode), σ is its standard deviation and σ² its variance, so that F(t) = Φ((t − μ)/σ) with Φ the standard normal CDF. Substituting this F(t) into Eq. (11), we get

    G(t) = a [ 1 − (1 − Φ((t − μ)/σ)) exp( (s1² + s2² + s3²) t / 2 ) ]    (14)

Eq. (14) describes the expected entropy prediction for the normal distribution.
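Model III needs the normal CDF, which can be expressed through the error function without any external library; parameter values are again illustrative:

```python
import math

def normal_cdf(t, mu, sigma):
    """Phi((t - mu) / sigma), the normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf((t - mu) / (sigma * math.sqrt(2.0))))

def normal_entropy(t, a, mu, sigma, s1, s2, s3):
    """Proposed Model III: Eq. (11) with F(t) = Phi((t - mu) / sigma), giving
    G(t) = a * [1 - (1 - Phi((t - mu)/sigma)) * exp(s_sq * t / 2)]."""
    s_sq = s1 ** 2 + s2 ** 2 + s3 ** 2
    return a * (1.0 - (1.0 - normal_cdf(t, mu, sigma)) * math.exp(s_sq * t / 2.0))

print(normal_entropy(12.0, 60.0, 8.0, 3.0, 0.05, 0.05, 0.05))
```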

Data Description and Analysis
For validation of the above proposal, data analysis has been carried out on Apache open source software data sets. Data Set 1 (abbreviated DS-1) consists of 62.68 units of entropy observed over 17 months from the Avro project. Similarly, Data Set 2 (DS-2) and Data Set 3 (DS-3) consist of 68.58 and 56.79 units of entropy observed over 18 and 15 months from the Hive and Pig projects respectively. All three data sets are shown in Table 1. The parameters of the models have been estimated with the nonlinear regression method using the Statistical Analysis System (SAS)/ETS user's guide 9.1 (2004). The performance of the models is judged by their capability to fit the data (goodness of fit) and to predict the future behaviour of the entropy (as shown in Figures 1, 2 and 3). The parameter estimation and comparison criteria results for DS-1, DS-2 and DS-3 for all models under consideration are given in Tables 2, 3 and 4 respectively.
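The estimation itself was performed with the SAS nonlinear regression routine; purely to illustrate the idea of nonlinear least squares, a crude grid search over synthetic data can be sketched as follows (the exponential curve, grid ranges and data are all assumptions of this sketch, not the paper's procedure or data):

```python
import math

def sse(params, data):
    """Sum of squared errors for the curve H(t) = a * (1 - exp(-b t))."""
    a, b = params
    return sum((h - a * (1.0 - math.exp(-b * t))) ** 2 for t, h in data)

def grid_fit(data):
    """Crude nonlinear least squares by exhaustive grid search."""
    best_params, best_err = None, float("inf")
    for a in range(40, 81):            # candidate a in 40..80
        for j in range(1, 51):         # candidate b in 0.01..0.50
            b = 0.01 * j
            err = sse((float(a), b), data)
            if err < best_err:
                best_params, best_err = (float(a), b), err
    return best_params, best_err

# Synthetic monthly cumulative entropy generated from a = 60, b = 0.2
data = [(t, 60.0 * (1.0 - math.exp(-0.2 * t))) for t in range(1, 18)]
params, err = grid_fit(data)
print(params)  # recovers a = 60, b = 0.2 on this noise-free data
```

Real estimation uses gradient-based nonlinear regression rather than a grid, but the objective being minimized is the same sum of squared errors.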

Comparison Criteria for Proposed Models
Table 5 gives the results of the comparison criteria for the developed models, calculated using the sum of squared errors (SSE), mean squared error (MSE), root mean squared error (RMSE) and coefficient of determination (R²). From the analysis in Table 5, it can clearly be seen that the comparison criteria values are quite significant, i.e., the obtained values lie in an acceptable range: the R² value approaches 1, which means the proposed Model I is able to capture the entropy prediction well. Figures 1-3 also provide a way to analyze the predictive capability of the models graphically, showing that the computed values correspond closely to the open source software data.
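The four criteria can be computed as follows (the observed and predicted values in the example are illustrative, not the Table 5 data):

```python
def fit_criteria(observed, predicted):
    """Goodness-of-fit criteria: SSE, MSE, RMSE and R^2."""
    n = len(observed)
    mean_obs = sum(observed) / n
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    return {
        "SSE": sse,
        "MSE": sse / n,
        "RMSE": (sse / n) ** 0.5,
        "R2": 1.0 - sse / sst,  # closer to 1 means a better fit
    }

print(fit_criteria([10.2, 18.5, 24.9, 30.1], [10.0, 18.9, 24.5, 30.6]))
```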

Conclusion
In the present work, a generalized framework for deriving several entropy prediction models based on SDEs of Itô's type has been presented. Three basic causes of change in the code structure, namely bug fixing, feature improvement and new feature introduction, have been studied for the diffusion process. The proposed framework is capable of handling any general distribution function and employs a three-dimensional Wiener process to incorporate the corresponding irregular fluctuations.