Pattern Trees for Fault-Proneness Detection in Object-Oriented Software

Problem statement: This study introduces an application of the pattern tree classification technique to object-oriented software quality estimation and explores the fault prediction accuracy of pattern trees. Approach: The similarity measures and fuzzy aggregations employed in the pattern tree technique were used to generate tree models for fault detection in software modules. Experiments were performed on two datasets, KC1 and KC3, obtained from NASA's Metrics Data Program. Pattern tree models were built using metrics from these object-oriented software datasets. Results: The AND/OR, OWA and WA aggregations were selected for pattern tree induction. Pattern tree models built using the RMSE similarity measure produced higher accuracy than those built using the other similarity measure tried. Conclusion: On the datasets studied, pattern tree models achieved higher fault prediction accuracy than fuzzy decision trees and were structurally less complex.


INTRODUCTION
Advances in distributed object technologies have dramatically affected the development of distributed software applications. In particular, the time needed to provide new distributed services is decreasing because applications are no longer built from scratch (Denaro et al., 2003). Object-oriented technology greatly eases software redevelopment. A new issue is how to build quality into a system and how to measure and improve software quality during both development and redevelopment. Object-oriented design plays a pivotal role in software development because it determines the structure of the software solution (Khan et al., 2006). Software quality estimation is a key factor in developing a software system. High-assurance software systems depend on the stability and reliability of the underlying software. In an actual project environment, however, the goal of software quality estimation is intangible. Quality cannot be checked directly in the product; it must be planned from the beginning. The failure rate of software is high in the early stages of the software development life cycle because of undiscovered errors or faults. Software faults are a common cause of complexity in modern systems. Software faults are the defects that cause a software failure in an executable product (Khoshgoftaar and Seliya, 2002).
A lack of quality in the design process can make a correct implementation impossible. If faults in object-oriented software modules are not found early, they become very costly to fix later, which lowers the quality of the end product. Finding faults at an early stage increases the quality of the end product and prevents ripple effects from changes made later in the software development life cycle. It is therefore wise to isolate faults as early as possible, in the design phase, and the estimation of software quality has become an important factor in software development. Software metrics have become essential in software engineering for several reasons, among them quality assessment and reengineering. In the field of software evolution, metrics can be used to identify stable or unstable parts of software systems (Lanza and Ducasse, 2002). Software metrics are a necessary step toward quality and reliability (Wang et al., 1997).
The decision tree is one of the simplest modeling techniques used in software quality estimation (Ishrat et al., 2009) and one of the most widely used practical methods for inductive inference (Mitchell, 1997). Software quality estimation models have been built using various decision tree techniques. Khoshgoftaar and Seliya (2002) and Wang et al. (1997) applied regression tree algorithms to software fault prediction. Khoshgoftaar and Seliya (2002) and Khan et al. (2006) also applied classical decision tree algorithms such as C4.5, CART and S-Plus to software quality estimation. These models effectively minimized software failures and improved the reliability of the software systems. Classical decision trees and ensemble techniques (Ishrat et al., 2009) and the fuzzy decision tree technique (Ishrat et al., 2010) have been applied to build quality estimation models for object-oriented software data.
Pattern trees: Like decision trees, pattern trees are an effective tool for classification applications. A novel induction method has been proposed that builds pattern trees by means of similarity measures and different aggregation operators (Huang and Gedeon, 2006). A pattern tree represents the pattern of the data that belong to one class. In a binary context, a data sample that matches a given pattern tree is classified into the class that the pattern tree represents. In a fuzzy context, the match between a data sample and a pattern tree is not a crisp yes or no; instead, a truth value in the range [0, 1] reflects how confidently the sample should be assigned to the class that the pattern tree represents (Huang, 2007). A pattern tree is a tree that propagates fuzzy terms using different fuzzy aggregations. Each pattern tree represents a structure for one output class, which is located at the top as the root of the tree. The pattern tree induction methods are based on similarity measures and fuzzy aggregations. Note that the fuzzy terms of the input variables form the leaf nodes of a pattern tree, while the internal nodes are fuzzy aggregations. When a new data sample is tested against a pattern tree, evaluation starts from the bottom leaves and travels to the top. It finishes with a truth value indicating the degree to which the sample belongs to the output class of the pattern tree. The output class with the maximal truth value is chosen as the predicted class (Huang and Gedeon, 2006).
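The bottom-up evaluation described above can be sketched in a few lines of Python. This is a minimal illustration, not the tool used in the study: the `Leaf` and `Node` classes, the `evaluate` and `predict` helpers, and the fuzzy term names (`loc_high`, `cbo_low` and so on) are all hypothetical, and the two example trees are hand-built rather than induced.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union


@dataclass
class Leaf:
    term: str  # name of a fuzzy term of an input variable (hypothetical naming)


@dataclass
class Node:
    aggregate: Callable[[float, float], float]  # fuzzy aggregation, e.g. min or max
    left: "Tree"
    right: "Tree"


Tree = Union[Leaf, Node]


def evaluate(tree: Tree, memberships: Dict[str, float]) -> float:
    """Propagate membership values from the leaves up to the root."""
    if isinstance(tree, Leaf):
        return memberships[tree.term]
    return tree.aggregate(evaluate(tree.left, memberships),
                          evaluate(tree.right, memberships))


def predict(trees: Dict[str, Tree], memberships: Dict[str, float]) -> str:
    """Return the class whose pattern tree yields the maximal truth value."""
    return max(trees, key=lambda label: evaluate(trees[label], memberships))


# One hand-built tree per output class (illustrative only).
trees = {
    "faulty": Node(max, Leaf("loc_high"), Leaf("cbo_high")),     # OR
    "non_faulty": Node(min, Leaf("loc_low"), Leaf("cbo_low")),   # AND
}
sample = {"loc_high": 0.7, "loc_low": 0.1, "cbo_high": 0.4, "cbo_low": 0.5}

print(evaluate(trees["faulty"], sample))      # truth value for class "faulty"
print(evaluate(trees["non_faulty"], sample))  # truth value for class "non_faulty"
print(predict(trees, sample))
```

Here the sample reaches a truth value of 0.7 for the faulty tree and 0.1 for the non-faulty tree, so the faulty class is predicted.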

Similarity measures:
Let A and B be two fuzzy sets (Zadeh, 1965) defined on the universe of discourse U. The fuzzy similarity (Chao et al., 1996) between them can be defined as:

$$S(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

where ∩ and ∪ denote a t-norm operator and a t-conorm operator, respectively. Usually, the MIN (∧) and MAX (∨) operators are used. According to the definition, 0 ≤ S(A, B) ≤ 1. In practice, this measure can be computed as:

$$S(A, B) = \frac{\sum_{j=1}^{m} \left[\mu_A(x_j) \wedge \mu_B(x_j)\right]}{\sum_{j=1}^{m} \left[\mu_A(x_j) \vee \mu_B(x_j)\right]}$$

where $x_j$, $j = 1, \ldots, m$, are the crisp values discretized in the variable domain, and $\mu_A(x_j)$ and $\mu_B(x_j)$ are the fuzzy membership values of $x_j$ for A and B.

An alternative similarity definition, proposed by Huang and Gedeon (2006) and Huang et al. (2008) for pattern tree construction, is the Root Mean Square Error (RMSE) based fuzzy set similarity. The Root Mean Square Error of fuzzy sets A and B can be computed as:

$$RMSE(A, B) = \sqrt{\frac{\sum_{j=1}^{m} \left(\mu_A(x_j) - \mu_B(x_j)\right)^2}{m}}$$

The RMSE based fuzzy set similarity can then be defined as:

$$Sim(A, B) = 1 - RMSE(A, B)$$

The larger the value Sim(A, B) takes, the more similar A and B are.
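As a rough illustration, both measures can be computed over membership vectors sampled at the same m crisp points. The sketch below assumes the MIN/MAX form of the first measure and takes the RMSE based similarity as 1 − RMSE; the function names and the example vectors are hypothetical.

```python
import math


def jaccard_similarity(mu_a, mu_b):
    """Sum of pointwise minima divided by sum of pointwise maxima."""
    num = sum(min(a, b) for a, b in zip(mu_a, mu_b))
    den = sum(max(a, b) for a, b in zip(mu_a, mu_b))
    # Assumption: two all-zero membership vectors are treated as identical.
    return num / den if den > 0 else 1.0


def rmse_similarity(mu_a, mu_b):
    """One minus the root mean square error between the membership vectors."""
    m = len(mu_a)
    rmse = math.sqrt(sum((a - b) ** 2 for a, b in zip(mu_a, mu_b)) / m)
    return 1.0 - rmse


# Membership values of A and B at four discretized points (made-up numbers).
mu_a = [0.0, 0.5, 1.0, 0.5]
mu_b = [0.0, 0.5, 0.5, 0.5]

print(jaccard_similarity(mu_a, mu_b))
print(rmse_similarity(mu_a, mu_b))
```

For these two vectors both measures evaluate to 0.75; in general they differ, which is why the choice of measure can change which tree is induced.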
Fuzzy aggregations: Fuzzy aggregations are logic operators applied to fuzzy membership values or fuzzy sets. They fall into three sub-categories, namely t-norms, t-conorms and averaging operators such as Weighted Averaging (WA) and Ordered Weighted Averaging (OWA) (Huang and Gedeon, 2006; Yager, 1988). In fuzzy set theory, triangular norms (t-norms) and triangular conorms (t-conorms) are extensively used to model the logical operators AND and OR. The basic t-norm and t-conorm pairs, which operate on two fuzzy membership values a and b, a, b ∈ [0, 1], are shown in Table 1. Although these aggregations are shown for a pair of fuzzy values only, they can also be applied to multiple fuzzy values because they are associative.
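The following sketch shows how a pairwise t-norm or t-conorm extends to several membership values through associativity. The three pairs used here (MIN/MAX, algebraic product/sum, and the Łukasiewicz pair) are common textbook choices and are an assumption; they are not necessarily the pairs listed in Table 1. The helper name `aggregate` is hypothetical.

```python
from functools import reduce


# Common t-norm / t-conorm pairs on membership values a, b in [0, 1].
def t_min(a, b):
    return min(a, b)


def s_max(a, b):
    return max(a, b)


def t_product(a, b):
    return a * b


def s_algebraic_sum(a, b):
    return a + b - a * b


def t_lukasiewicz(a, b):
    return max(0.0, a + b - 1.0)


def s_lukasiewicz(a, b):
    return min(1.0, a + b)


def aggregate(op, values):
    """Fold a two-argument aggregation over a non-empty sequence of values.

    Because t-norms and t-conorms are associative, the grouping of the
    pairwise applications does not affect the result.
    """
    return reduce(op, values)


values = [0.5, 0.25, 0.75]
print(aggregate(t_min, values))            # fuzzy AND over three values
print(aggregate(s_max, values))            # fuzzy OR over three values
print(aggregate(t_product, values))
print(aggregate(s_algebraic_sum, values))
```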
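A WA operator of dimension n is a mapping $WA: \mathbb{R}^n \rightarrow \mathbb{R}$ that has an associated weight vector $w = (w_1, \ldots, w_n)^T$, with $w_i \in [0, 1]$ and $\sum_{i=1}^{n} w_i = 1$, such that:

$$WA(a_1, \ldots, a_n) = \sum_{j=1}^{n} w_j a_j$$

An OWA operator (Yager, 1988) of dimension n is a mapping $OWA: \mathbb{R}^n \rightarrow \mathbb{R}$ that has an associated weight vector w satisfying the same constraints, such that:

$$OWA(a_1, \ldots, a_n) = \sum_{j=1}^{n} w_j b_j$$

where $b_j$ is the jth largest element of the collection $\{a_1, \ldots, a_n\}$.

The fundamental difference between OWA and WA aggregation is that OWA does not associate a weight $w_i$ with a particular element; rather, a weight is associated with a particular ordered position of the elements. Which aggregation should be used depends mainly on the relationship between the criteria involved (Huang and Gedeon, 2006).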
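The difference between the two operators is easy to see in code: WA pairs each weight with a fixed argument, whereas OWA sorts the arguments in descending order before weighting. The sketch below is a minimal illustration with hypothetical function names; the weights are assumed to lie in [0, 1] and sum to one.

```python
def wa(weights, values):
    """Weighted averaging: weight i is tied to argument i."""
    return sum(w * a for w, a in zip(weights, values))


def owa(weights, values):
    """Ordered weighted averaging: weight j is tied to the j-th largest argument."""
    ordered = sorted(values, reverse=True)
    return sum(w * b for w, b in zip(weights, ordered))


weights = [0.5, 0.25, 0.25]
values = [0.25, 1.0, 0.5]

print(wa(weights, values))
print(owa(weights, values))

# Special cases of OWA: all weight on the first position gives MAX,
# all weight on the last position gives MIN.
print(owa([1.0, 0.0, 0.0], values))
print(owa([0.0, 0.0, 1.0], values))
```

With these numbers WA gives 0.5 and OWA gives 0.6875, because OWA moves the largest value (1.0) into the position that carries the weight 0.5.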

Datasets:
The empirical software datasets used in the case study were taken from the NASA IV&V Facility Metrics Data Program, a freely available repository website. This repository contains software metrics and associated error data. The two datasets used, KC1 and KC3, each contain a set of software metrics and an additional attribute called fault, which indicates whether a module is faulty or not. The fault-prone modules constitute only a small portion of the datasets (NASA, 2008). Each case in these datasets belongs to one of two classes, faulty or non-faulty. Each dataset contains a different number of software metrics. The metrics in the datasets were taken as the independent variables. The dependent variable is the class of the module, faulty or non-faulty. Table 2 describes the datasets. For all datasets, a simple fuzzification method based on three evenly distributed trapezoidal membership functions for each input variable, i.e., each metric from the datasets, is used to transform the crisp values into fuzzy values (Huang, 2007). Each dataset is divided into a training set and a test set.
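A fuzzification step of this kind might look as follows. The text does not give the breakpoints of the three trapezoids, so the evenly spaced layout below (plateaus at 0 to 0.2, 0.4 to 0.6 and 0.8 to 1 of the normalized range), the term names `low`, `medium` and `high`, and the function names are all assumptions made for illustration.

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership with feet at a and d and plateau on [b, c]."""
    if b <= x <= c:
        return 1.0
    if a < x < b:
        return (x - a) / (b - a)
    if c < x < d:
        return (d - x) / (d - c)
    return 0.0


def fuzzify(value, lo, hi):
    """Map a crisp metric value to memberships in three fuzzy terms.

    lo and hi are the minimum and maximum of the metric (for example over the
    training set); values outside that range are clamped. The breakpoints are
    an assumed, evenly spaced layout, not taken from the paper.
    """
    t = 0.0 if hi == lo else min(1.0, max(0.0, (value - lo) / (hi - lo)))
    return {
        "low": trapezoid(t, 0.0, 0.0, 0.2, 0.4),
        "medium": trapezoid(t, 0.2, 0.4, 0.6, 0.8),
        "high": trapezoid(t, 0.6, 0.8, 1.0, 1.0),
    }


print(fuzzify(0, 0, 10))   # fully "low"
print(fuzzify(3, 0, 10))   # partly "low", partly "medium"
print(fuzzify(5, 0, 10))   # fully "medium"
print(fuzzify(12, 0, 10))  # clamped, fully "high"
```

With this layout the three memberships of any value sum to one, which keeps the fuzzy terms of a metric comparable across its range.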
Pattern tree induction method: Assume a dataset has n input variables $A_i$, i = 1, 2, …, n, and one output variable X. Further assume that each input variable has m fuzzy linguistic terms denoted $A_{ij}$, j = 1, 2, …, m, and that the output variable has k fuzzy or linguistic terms denoted $X_j$, j = 1, 2, …, k. That is, each data point is represented by a fuzzy membership value vector of dimension (nm + k). The task is to build k pattern trees for the k output classes (fuzzy or linguistic terms). The induction of a pattern tree, say for class $X_0$, is described in the algorithm shown in Fig. 1. The induction of the other pattern trees follows the same principle.
In the initialization, the set of primitive trees P is constructed, in which each fuzzy term $A_{ij}$, i = 1, …, n, j = 1, …, m, is used to construct a primitive pattern tree. The primitive tree that has the highest similarity to the output class term $X_0$ is then selected as the initial candidate tree $C_0$. Here P denotes a set of trees, in contrast to a single tree such as $C_0$. The subscript zero in $C_0$ indicates that the tree has zero depth. In the induction, aggregation is attempted between the previous candidate tree $C_{k-1}$ and any primitive tree S in the primitive tree set P, using any aggregation ψ drawn from the aggregation set Ψ. When ψ = WA or ψ = OWA, the weights that make the aggregated term most similar to the class term are used. A constraint is imposed on the aggregation: the primitive tree S cannot be a subtree of the candidate tree $C_{k-1}$, which prevents a primitive tree from being used in the aggregated tree more than once. Among all aggregated trees, the one that has the highest similarity to the class term $X_0$ is selected as the current candidate tree $C_k$, which is one level deeper than the previous candidate tree $C_{k-1}$. If the candidate tree has reached the pre-defined depth d, or if the new candidate tree $C_k$ has a lower similarity to $X_0$ than the previous one $C_{k-1}$, the induction stops and the tree with the highest similarity is returned as the optimal tree. In this algorithm, an aggregation always takes place between a candidate tree and a slave primitive tree. The aggregated trees thus always have a single fuzzy term as the right child of each internal node. Trees of this kind are called simple pattern trees. In contrast, pattern trees that are not subject to this constraint are referred to as general pattern trees (Huang, 2007).
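The induction loop for a simple pattern tree can be sketched as below. This is an illustrative reading of the procedure, not the algorithm of Fig. 1 verbatim: trees are represented only by a printable expression and their membership vector over the training samples, the RMSE based similarity is assumed to be 1 − RMSE, and the WA and OWA weights are chosen by a coarse grid search over interior weights, which is an assumption (the text only says that the best weights are used). All names and the toy data are hypothetical.

```python
import math


def rmse_similarity(u, v):
    """Sim = 1 - RMSE between two membership vectors of equal length."""
    return 1.0 - math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)) / len(u))


def aggregation_set(steps=10):
    """AND/OR plus two-argument WA and OWA over a grid of interior weights."""
    ops = [("AND", min), ("OR", max)]
    for i in range(1, steps):
        w = i / steps
        ops.append((f"WA[{w:.1f}]", lambda a, b, w=w: w * a + (1 - w) * b))
        ops.append((f"OWA[{w:.1f}]",
                    lambda a, b, w=w: w * max(a, b) + (1 - w) * min(a, b)))
    return ops


def induce_simple_pattern_tree(terms, target, max_depth=3):
    """terms: fuzzy term name -> membership vector; target: class memberships."""
    ops = aggregation_set()
    # Initialization: the primitive tree most similar to the class term.
    root = max(terms, key=lambda name: rmse_similarity(terms[name], target))
    expr, values, used = root, terms[root], {root}
    best_sim = rmse_similarity(values, target)
    for _ in range(max_depth):
        challenger = None
        for name, vec in terms.items():
            if name in used:  # a primitive tree may be used only once
                continue
            for op_name, op in ops:
                agg = [op(a, b) for a, b in zip(values, vec)]
                sim = rmse_similarity(agg, target)
                if challenger is None or sim > challenger[0]:
                    challenger = (sim, f"{op_name}({expr}, {name})", agg, name)
        # Stop when no primitive is left or the best new tree is less similar.
        if challenger is None or challenger[0] < best_sim:
            break
        best_sim, expr, values = challenger[:3]
        used.add(challenger[3])
    return expr, best_sim


# Toy training data: three fuzzy terms over four modules (made-up values).
terms = {
    "m1_high": [0.9, 0.8, 0.1, 0.2],
    "m2_high": [0.2, 0.9, 0.1, 0.1],
    "m1_low": [0.1, 0.2, 0.9, 0.8],
}
target = [1.0, 1.0, 0.0, 0.0]  # membership of each module in class "faulty"

expr, sim = induce_simple_pattern_tree(terms, target)
print(expr, round(sim, 4))
```

On this toy data the loop starts from `m1_high`, improves the similarity by OR-ing in `m2_high`, and then stops because every aggregation with the remaining term lowers the similarity.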

RESULTS AND DISCUSSION
The experiments were carried out using the KC1 and KC3 datasets. The aim is to estimate the quality of the object-oriented software by predicting faults in the software modules. Pattern tree models were built using all the software metrics from the two datasets. Of the aggregations described above, AND/OR, OWA and WA were selected for pattern tree induction. The RMSE and Jaccard similarity measures were tried on both datasets, and RMSE produced the more promising results. The maximum depth d is set to 3 and the candidate tree level is 2. The performance on both datasets is shown in Table 3. In Fig. 2, FTerm0 and FTerm1 are the fuzzy terms associated with their respective input variables, i.e., the metrics. The ovals are input variables, and the number inside each oval identifies the metric that participated in pattern tree induction. In Fig. 3, the numbers inside the ovals likewise identify the corresponding metrics. The performance of the proposed application is evaluated and compared with the fuzzy decision tree models (Ishrat et al., 2010). The prediction accuracy of the pattern trees and the fuzzy decision trees is shown in Table 4. It can be observed that the pattern trees performed consistently on both datasets. The pattern trees yield higher classification accuracy than the fuzzy decision trees, and their structural complexity is lower.

CONCLUSION
This study has proposed a new application of pattern trees, a tree-based classification technique that makes use of different aggregations, including t-norms and t-conorms, for quality estimation of object-oriented software. Like decision trees, pattern trees are found to be an effective tool for classification applications. The pattern tree induction methods are based on similarity measures such as RMSE and on fuzzy aggregations such as OWA and WA. The pattern trees were generated for fault prediction in software modules using all the metrics from the datasets. The pattern trees built using the RMSE similarity measure produced the best results and performed consistently on both datasets. The comparison with fuzzy decision trees shows that pattern trees can obtain higher classification accuracy and are less complex in structure.
Notation: $x_j$, $j = 1, \ldots, m$, are the crisp values discretized in the variable domain; $\mu_A(x_j)$ and $\mu_B(x_j)$ are the fuzzy membership values of $x_j$ for A and B; $b_j$ is the jth largest element of the collection $\{a_1, \ldots, a_n\}$.

Fig. 1: Induction of simple pattern tree

Data preprocessing: These datasets were preprocessed into a format acceptable to the pattern tree software tool before they were used in the experiments.

Table 1: Basic t-norm and t-conorm pairs

Table 2: Datasets used in the experiments

Table 3: Prediction accuracy of pattern tree

Table 4: Prediction accuracy of pattern tree and fuzzy decision tree