Study of Data Mining Algorithms Using a Dataset from the Size-Effect on Open Source Software Defects

This article evaluates the quality of data mining algorithms in terms of accuracy and time consumption. In order to identify the best algorithm among the classification and clustering algorithms considered, the WEKA program is used to test each algorithm on a real dataset drawn from a study of the effect of size on defect proneness in open source software. The Mozilla product is adopted as the example of open source software; the dataset used in this paper is the output of that study, which found a significant relationship between software size and defect proneness. This dataset serves as the input to WEKA for comparing the data mining algorithms. We use Naive Bayes, Decision Tree J48 and Lazy K-Star for classification, and Expectation-Maximization (EM) and Simple K-Means for clustering. The findings demonstrate how the algorithms differ in accuracy and in the time needed to reach a result, and further confirm that the effect of software size on defect proneness is significant. Finally, the experiments are conducted in WEKA with the aim of finding the best algorithm in terms of accuracy and time consumption, on which to depend for classification and clustering tests.


Introduction:
According to Fitzgerald (2006), open source software (OSS) is software whose source code is readily available for anyone to study, modify and redistribute for any purpose. The software must be covered by a license in which the copyright holder grants these rights. Its development usually takes place in a public, collaborative manner, the most vivid example of the open-source development model, and contributors can access and build on the modules created in the current work to date. Using an instance-based learner, a misclassification error rate of less than 5.4% was achieved.
A. The corpus: The corpus consists of 15,545 instances, each carrying 6 attributes together with a binary indicator of whether a defect was fixed. The meanings of the attributes are as follows:
1. Id: a distinct numeric identifier allocated to each C++ class.
2. Start: a time slightly later than the time of the modification from which the observation was derived.
3. End: either the time of the following modification, the end of the observation period, or the time of deletion.
4. Event: set to 1 when a defect is fixed at the time indicated by End.
5. Deletion of a class is treated as reaching the end of an observation, with Event set to 1 if the class is deleted for corrective maintenance.
6. Size: a time-dependent covariate whose column holds the number of source lines of code of the C++ class at the Start time.
7. State: initially set to 0, it changes to 1 after the class undergoes an event and thereafter remains 1.
The tool used for our classification experiments was the Weka toolkit; at the start of the tests, the evaluation strategy was five-fold cross-validation with default parameters, so as to obtain a wide-ranging overview.
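As an illustration of the five-fold cross-validation protocol used in these experiments, the following sketch (in Python rather than Weka's Java; the function name is ours) builds the train/test index splits over the corpus's instance count:

```python
import random

def five_fold_splits(n_instances, seed=0):
    """Shuffle instance indices and divide them into 5 folds; each fold
    serves once as the test set while the other 4 form the training set."""
    idx = list(range(n_instances))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::5] for i in range(5)]
    splits = []
    for i in range(5):
        test = folds[i]
        train = [j for k, fold in enumerate(folds) if k != i for j in fold]
        splits.append((train, test))
    return splits

# One (train, test) pair per fold over the 15,545 corpus instances:
splits = five_fold_splits(15545)
```

Each instance appears in exactly one test fold, so the five test folds together cover the whole corpus.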
B. Classification Algorithms
1. Lazy K-star: K* is an instance-based classifier: the class of a test instance is determined by the classes of the training instances similar to it, as measured by some similarity function. It differs from other instance-based learners in that it uses an entropy-based distance function.
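K*'s entropy-based distance is involved; as a minimal sketch of the instance-based idea only, the following substitutes plain Euclidean distance for it (the function name and toy data are illustrative, not from the Mozilla corpus):

```python
import math

def nearest_neighbor_predict(train_X, train_y, x):
    """Assign x the label of its most similar training instance.
    Similarity here is Euclidean distance, a simplification of
    K*'s entropy-based transformation distance."""
    best = min(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    return train_y[best]

# Two toy training instances labelled by defect proneness:
label = nearest_neighbor_predict([[0, 0], [10, 10]], ["clean", "defect"], [9, 8])
```

Here the query point lies closest to the second training instance, so it inherits that instance's label.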

2. Naïve Bayes Classifier:
Naïve Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong independence assumptions between the features (Wang and Li, 2015). Studied extensively since the 1950s, the approach remains a popular tactic for text classification, the problem of judging which class a document belongs to, with word frequencies as the features (Puga and Altman, 2015). With appropriate pre-processing it is competitive with more advanced methods such as support vector machines, and it also finds applications in automated medical diagnosis. The naïve Bayes model is a conditional probability model: given a problem instance to be classified, represented by a feature vector x = (x_1, ..., x_n), it assigns to the instance probabilities P(C_k | x_1, ..., x_n) for each of K possible classes C_k. The problem with this formulation is that it becomes infeasible when the number of features n is large. Applying Bayes' theorem, the conditional probability can be rewritten as

P(C_k | x) = P(C_k) P(x | C_k) / P(x).

Since the denominator P(x) does not depend on C_k, it is effectively constant; the numerator equals the joint probability P(C_k, x_1, ..., x_n).

3. Decision Tree:
Dai and Ji (2014) argued that, in probability analysis, the decision tree is a critical decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes (Wu et al., 2015), as well as utilities and resource costs. The decision tree is therefore one way of displaying an algorithm. In most cases, decision trees are applied to operations research problems, in particular in decision analysis, to identify a strategy most likely to reach a goal.
In appearance, a decision tree resembles a flowchart: every internal node represents a test on an attribute (for example, whether a coin flip comes up heads or tails), every branch represents an outcome of that test, and each leaf node holds a class label, the decision taken after considering every attribute along the path. The paths from root to leaf represent the classification rules. When carrying out decision analysis, the decision tree and the closely related influence diagram are both applied as visual and analytical decision support tools, in which the expected values (the anticipated utility of the competing alternatives) are computed. Normally, a decision tree contains three kinds of nodes:

1. Decision nodes: usually represented by squares.
2. Chance nodes: usually represented by circles.
3. End nodes: usually represented by triangles.
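As a minimal sketch of the flowchart structure described above (internal nodes test an attribute, branches carry the outcomes, leaves hold the class label), the following hand-built tree over hypothetical size and state attributes is illustrative rather than a model learned from the Mozilla data:

```python
def classify(node, instance):
    """Walk the tree from the root: at each internal node, test the
    stored attribute against a threshold and follow the matching
    branch; a leaf is reached when the node is a plain class label."""
    while isinstance(node, dict):
        side = "left" if instance[node["attr"]] <= node["threshold"] else "right"
        node = node[side]
    return node

# Hand-built tree: attribute 0 is a hypothetical size, attribute 1 a state flag.
tree = {"attr": 0, "threshold": 500,
        "left": "clean",
        "right": {"attr": 1, "threshold": 0,
                  "left": "clean", "right": "defect-prone"}}

result = classify(tree, [800, 1])  # a large class that has seen an event
```

The path taken (size above 500, state above 0) is exactly one root-to-leaf classification rule.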
In most cases, decision trees are applied in operations research; furthermore, they are useful when carrying out analysis to locate the strategy most likely to result in a specific goal. In practice, when decisions must be taken online with incomplete knowledge, the decision tree should be paralleled by a probability model.

C. Clustering Algorithms
Clustering is mostly used in exploratory data analysis and mining, where it is a common technique of statistical data analysis. It is also used in other fields, such as machine learning, computer vision, information retrieval and bioinformatics. Specifically, cluster analysis is the task of grouping objects so that objects in the same cluster are more similar to each other than to those in other clusters.

K-Means:
K-means clustering is a method of vector quantization. The method originated from signal processing and is popular for cluster analysis in data mining. The objective of k-means clustering is to partition the observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as the prototype of the cluster. The exercise culminates in a partitioning of the data space into Voronoi cells.
One of the problems involved is that the task is computationally difficult (NP-hard).
Nevertheless, efficient heuristic algorithms are commonly used and converge quickly to a local optimum. These resemble the expectation-maximization (EM) algorithm for mixtures of Gaussian distributions through the iterative refinement approach employed by both. In addition, both use cluster centers to model the data; however, k-means tends to find clusters of comparable spatial extent, while the expectation-maximization mechanism permits clusters to have different shapes.
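The iterative refinement shared by k-means and EM can be sketched in a few lines. This is Lloyd's algorithm in plain Python with a naive first-k initialization, run on made-up points rather than the Mozilla data:

```python
import math

def kmeans(points, k, iters=10):
    """Lloyd's algorithm: repeatedly assign each point to its nearest
    centroid, then move each centroid to the mean of its points."""
    centroids = [list(p) for p in points[:k]]  # naive init: first k points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            [sum(coord) / len(cl) for coord in zip(*cl)] if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters

pts = [[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [8.2, 7.9]]
centroids, clusters = kmeans(pts, 2)
```

On these four points the algorithm settles into two clusters of two points each, with the centroids at the cluster means.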

Results and Experiments:
A. Use of Classification Algorithms:
1. Lazy K-star: After applying the K-star algorithm, the results are as shown in Fig. 4, Fig. 5 and Table 1.

2. Naive Bayes Classifier:
The Naive Bayes Classifier results are as indicated in Fig. 6, Fig. 7 and Table 2.

3. Decision Tree:
When the Decision Tree J48 classifier is applied, the results appear as in Fig. 8, Fig. 9, Fig. 10 and Table 3.

B. Use of Clustering Algorithms:
1. Expectation-Maximization (EM): After applying the EM clustering algorithm, the results are as shown in Fig. 11 and Table 4.

Results and Discussion:
In the experiments performed on the Mozilla open source software dataset, the Lazy K-star results are represented in Fig. 4, Fig. 5 and Table 1. Fig. 4 shows the correctly and incorrectly classified instances in the tested data (15,545 instances): 92% were classified correctly and 8.0% incorrectly, with 5-fold cross-validation as the test mode, as Table 1 shows. The second algorithm's (Naive Bayes) results are shown in Fig. 6, Fig. 7 and Table 2. Fig. 6 shows the correctly and incorrectly classified instances (15,545 instances): 68.6% correct and 31.3% incorrect, again under 5-fold cross-validation, as Table 2 shows. Fig. 7 shows the distribution of the 6 attributes between negative (0) and positive (1) for the classifier errors.
Compared with the first algorithm, the performance of the second declined, because of its lower accuracy (68.6% correct) and greater time consumption (0.06 seconds). The third algorithm's (Decision Tree) results are shown in Fig. 8, Fig. 9 and Table 3. Fig. 8 shows the correctly and incorrectly classified instances (15,545 instances): 94.6% correct and 5.3% incorrect, with 5-fold cross-validation as the test mode, as Table 3 shows. Fig. 9 shows the distribution of the 6 attributes between negative (0) and positive (1) for the classifier errors. In terms of accuracy the Decision Tree algorithm performed well, but with more time consumption (0.44 seconds).
The second part of the experiments applied the clustering algorithms in order to show the similarities and differences within the data. The EM results are represented in Fig. 11 and Table 4. Fig. 11 visualizes the cluster assignments for the Mozilla data (15,545 instances): cluster 0 (74% similar) and cluster 1 (26% differing), with evaluation on the training data as the test mode, as Table 4 shows. The time consumption was very long compared to the previous algorithms (102 seconds). The K-Means results are represented in Fig. 12 and Table 5. Fig. 12 visualizes the cluster assignments (15,545 instances): cluster 0 (57% similar) and cluster 1 (43% differing), with evaluation on the training data as the test mode, as Table 5 shows. In terms of time consumption, this algorithm's performance was good compared with EM (0.19 seconds).
Finally, the following tables (Table 6 and Table 7) demonstrate the performance of the algorithms relative to each other in terms of time consumption and accuracy.

Conclusion:
In this article, tests were carried out by applying the dataset to all five algorithms in the WEKA application. The findings demonstrated, as Table 6 shows, that the best classification algorithm is Decision Tree J48, with an accuracy rate of 94.6% and 0.44 seconds needed to build the model; accordingly, it represented the best of the classification algorithms tested. Although the accuracy achieved by Decision Tree J48 was good, the Lazy K-star algorithm performed well in both time consumption and accuracy (Table 6), so for very large data Lazy K-star would be the best choice among the classification algorithms. The clustering algorithms, EM and Simple K-Means, produced the results listed in Table 7. In terms of time consumption, the Simple K-Means algorithm performed better than EM, so K-Means would be the better option of the two. Among the methods used, the Naive Bayes classifier rests on Bayes' theorem, which gives the posterior probability

P(A|B) = P(B|A) P(A) / P(B).

The method also assumes that the features B_i are conditionally independent given the class A, so that P(B|A) = ∏_i P(B_i|A).
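The factorization just restated, a class prior multiplied by per-feature likelihoods, can be sketched from scratch for binary features. The Laplace smoothing choice and the toy data below are ours, not Weka's implementation:

```python
from collections import Counter

def naive_bayes_predict(X, y, x):
    """Score each class C by P(C) * prod_i P(x_i | C), with the
    per-feature probabilities estimated from (X, y) under Laplace
    smoothing; features are assumed binary."""
    class_counts = Counter(y)
    n = len(y)
    best_class, best_score = None, -1.0
    for c, count in class_counts.items():
        rows = [X[i] for i in range(n) if y[i] == c]
        score = count / n  # the prior P(C)
        for j, v in enumerate(x):
            matches = sum(1 for r in rows if r[j] == v)
            score *= (matches + 1) / (count + 2)  # Laplace-smoothed P(x_j | C)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Toy binary dataset: feature 0 perfectly tracks the class label.
pred = naive_bayes_predict([[1, 1], [1, 0], [0, 1], [0, 0]], [1, 1, 0, 0], [1, 1])
```

Because the denominator P(B) is the same for every class, comparing the numerators alone is enough to pick the most probable class.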