Comparison of Two New Data Mining Approaches with Existing Approaches

This study examines two data mining approaches for uncertain data and describes the implementation of both algorithms for fault diagnosis in software systems. We compare the application of these two approaches with four classical data mining approaches in software system fault diagnosis. The performance of each approach is measured by sensitivity, specificity, accuracy and runtime, and the best approach is identified through a comparative study. On a data set of 1080 samples, the test results show that the fuzzy incomplete approach achieves a sensitivity of about 95.0%, a specificity of about 94.32%, an accuracy of about 94.54% and a runtime of 0.41 sec. Taking all the performance measures together, the fuzzy incomplete approach performs best, followed by the decision tree and the support vector machine, and then by logistic regression, the statistical approach and the neural network, in that order. The results of this study offer a new way of thinking about data mining and a suitable choice of approach.


INTRODUCTION
With the rapid growth of measurement data in engineering applications and the involvement of human operators, the uncertainty of the information in the data has become more prominent and the relationships among the data more complex. How to mine potentially useful information from data that are plentiful, fuzzy, disordered, unsystematic and subject to strong interference, so as to support real-time and effective engineering applications, is a problem that urgently needs further study.
Data mining is a process of selecting, exploring and modeling large amounts of data to discover previously unknown rules and relations, with the purpose of obtaining clear and useful results for the owner of the database (Giudici et al., 2004). Data mining has spread quickly and its scope of application has widened day by day (Giudici et al., 2004; Liang, 2006; Zhang et al., 2008; Chen et al., 2008). Liang (2006) provided several data mining algorithms and some applications in engineering. Zhang et al. (2008) and Chen et al. (2008) introduced three data mining algorithms in medical applications. However, the data mining industry in China was still at an initial stage of development and domestic industries basically did not have their own data mining systems. Some data mining algorithms are now relatively mature, as shown in Balzano and Del Sorbo (2007) and Wolff et al. (2009). With the decision tree algorithm based on CHAID, rules generated by Scenario can be applied to an unclassified data set to predict which records will have promising results. Scenario's decision tree algorithm is very flexible: it lets the user choose to split on any variable or to split according to statistical significance. Graphical analysis of the raw data can be carried out using line charts, histograms and scatter plots. Liang (2006) listed several main developers of data mining software.
This study introduces two new data mining approaches and uses them, together with other classical supervised-learning data mining techniques, to learn and classify 1080 data samples. It validates the feasibility and effectiveness of the new approaches and compares the performance of all the approaches with one another, in order to select the best mining approach for fault diagnosis in software systems. The performance of each approach is evaluated in terms of sensitivity, specificity and accuracy.

FUZZY INCOMPLETE APPROACH
The approach relies mainly on the positive region and the reduction; for their definitions, see Pawlak (1982).
The fuzzy incomplete approach consists of three procedures, given as follows.
Firstly, the incomplete reduction algorithm is as follows.
Input: a set of condition attributes $C = \{a_1, a_2, \ldots, a_n\}$ and a set of decision attributes $D = \{d\}$.
Output: an attribute reduction $RED(\Omega)$.
Step 1: Compute the $C$-positive region of $D$, $POS_C(D)$.
Step 2: For an attribute $a_i \in C$, remove it to obtain the subset of condition attributes $C \setminus \{a_i\}$. Then compute the $C \setminus \{a_i\}$-positive region of $D$, $POS_{C \setminus \{a_i\}}(D)$.
Step 3: If $POS_{C \setminus \{a_i\}}(D) = POS_C(D)$, the attribute $a_i$ is unnecessary relative to the decision attribute $D$; assign $C = C \setminus \{a_i\}$ and go to Step 2. Otherwise, the attribute reduction is $RED(\Omega) = C$.
Secondly, define $\alpha_i(k)$ as the deviation between the measured attribute and the necessary attribute, for example a norm or a covariance of the error. If the normal membership function is chosen, the similarity degree $d_i(k)$ of the deviation under the normal operating condition at time $k$ is:
[equation missing in the source]
where $0 < b \le 1$ is a constant to be determined. As a membership degree, $d_i(k)$ lies between 0 and 1. After $d_i(k)$ is obtained, the similarity vector between the standard value of the necessary attribute and the measured value of the real state in the normal operating condition is also obtained and labeled $M_i(l)$, i.e.:
[equation missing in the source]
Based on the definition of the fuzzy synthetic function, the synthetic similarity degree of the deviation from time 1 to time $l$ is defined as:
[equation missing in the source]
Thirdly, to judge the similarity between the standard value and the measured value, let $H_0$ and $H_1$ be the following events:
$H_0$: the similarity degree is greater than a certain threshold value, so the attribute is a necessary attribute.
$H_1$: the similarity degree is not greater than the threshold value, so the attribute is an unnecessary attribute.
Let the threshold parameter be $\xi$. According to experience and testing, $\xi$ lies between 0.5 and 1. If the synthetic similarity degree is greater than $\xi$, then $H_0$ is accepted, i.e., the attribute is a necessary attribute; otherwise $H_1$ is accepted. Therefore, the test data whose similarity satisfies this condition are retained and the data that do not satisfy it are removed.
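The incomplete reduction algorithm at the start of this section (Steps 1 to 3) can be illustrated with a minimal Python sketch. It assumes a complete decision table given as a list of dictionaries and uses the classical Pawlak positive region; the function names `positive_region` and `reduce_attributes`, the attribute names and the toy table are hypothetical and are not taken from the study.

```python
from collections import defaultdict


def positive_region(rows, cond_attrs, decision):
    """Indices of rows whose equivalence class under cond_attrs
    has a single decision value, i.e. POS_C(D) for C = cond_attrs."""
    classes = defaultdict(list)
    for idx, row in enumerate(rows):
        # Rows with identical values on cond_attrs are indiscernible.
        classes[tuple(row[a] for a in cond_attrs)].append(idx)
    pos = set()
    for members in classes.values():
        if len({rows[i][decision] for i in members}) == 1:
            pos.update(members)
    return pos


def reduce_attributes(rows, cond_attrs, decision):
    """Drop every attribute whose removal leaves the positive region unchanged."""
    full_pos = positive_region(rows, cond_attrs, decision)  # Step 1
    reduct = list(cond_attrs)
    for attr in list(cond_attrs):
        trial = [a for a in reduct if a != attr]             # Step 2
        if positive_region(rows, trial, decision) == full_pos:
            reduct = trial                                   # Step 3: attr is unnecessary
    return reduct


# Hypothetical toy decision table: the decision depends only on a2.
rows = [
    {"a1": 0, "a2": 0, "a3": 1, "d": "normal"},
    {"a1": 0, "a2": 1, "a3": 1, "d": "fault"},
    {"a1": 1, "a2": 0, "a3": 0, "d": "normal"},
    {"a1": 1, "a2": 1, "a3": 0, "d": "fault"},
]
print(reduce_attributes(rows, ["a1", "a2", "a3"], "d"))
```

The result depends on the order in which attributes are tried, so this sketch returns one reduction rather than all of them. Handling of missing attribute values, which the name of the approach suggests, is not covered here.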

MINING OF UNKNOWN PARAMETERS CALLED A STATISTICAL APPROACH
This study mines unknown parameters by using the characteristics of their statistical distribution. Because many random variables in practical problems obey (or approximately obey) a normal distribution, this study focuses on mining the unknown parameters of a normal population. Let $X_1, X_2, \ldots, X_n$ be a sample of size $n$ from a normal population $N(a, \sigma^2)$.
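As an example, consider mining whether the mean is equal to a known value. The mining problem is:
$$H_0: a = a_0, \qquad H_1: a \neq a_0$$
Here $a_0$ is a known mean and $\sigma^2$ is unknown. The unknown $\sigma^2$ is handled as follows. When $H_0$ holds, the statistic $U = \frac{\bar{X} - a_0}{\sigma / \sqrt{n}}$ obeys $N(0, 1)$, but it cannot be computed because $\sigma$ is unknown. Replacing $\sigma$ by an estimate computed from the sample gives the statistic
$$T = \frac{\bar{X} - a_0}{S} \sqrt{n - 1}$$
where $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$ and $S^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2$. According to statistical theory, when $H_0$ holds,
$$T \sim t(n - 1)$$
So, for a given significance level $\alpha$ ($0 < \alpha < 1$), the critical value $t_\alpha$ of the $t$ distribution with $n - 1$ degrees of freedom can be obtained from the $t$-distribution table such that
$$P(|T| > t_\alpha) = \alpha$$
For given sample observations $x_1, \ldots, x_n$, we calculate the value
$$t = \frac{\bar{x} - a_0}{s} \sqrt{n - 1}$$
The mining method is: if $|t| > t_\alpha$, then $H_0$ is rejected; otherwise $H_0$ is accepted. This method is called the T-mining method.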
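The T-mining rule can be sketched in a few lines of Python. The sketch assumes that the statistic is the usual one-sample t statistic written with the divisor-$n$ sample variance, that the rule is two-sided ($|t| > t_\alpha$), and that the critical value is read from a t table and passed in by the caller. The function name `t_mining` and the sample values are hypothetical.

```python
import math


def t_mining(sample, a0, t_alpha):
    """Return (t, reject) for H0: mean == a0 with unknown variance.

    t_alpha is the two-sided critical value of the t distribution with
    len(sample) - 1 degrees of freedom, taken from a t table.
    """
    n = len(sample)
    if n < 2:
        raise ValueError("at least two observations are required")
    mean = sum(sample) / n
    # Sample variance with divisor n (assumption); the sqrt(n - 1) factor
    # below compensates for it.
    s2 = sum((x - mean) ** 2 for x in sample) / n
    if s2 == 0:
        raise ValueError("sample variance is zero; t is undefined")
    t = (mean - a0) / math.sqrt(s2) * math.sqrt(n - 1)
    return t, abs(t) > t_alpha


# Hypothetical observations; 2.776 is the tabulated two-sided critical
# value for alpha = 0.05 and 4 degrees of freedom.
t_value, reject = t_mining([1.0, 2.0, 3.0, 4.0, 5.0], a0=0.0, t_alpha=2.776)
print(t_value, reject)
```

Passing the critical value as an argument mirrors the table lookup described in the text and keeps the sketch free of third-party dependencies.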

COMPARISON OF THE NEW AND SEVERAL EXISTING DATA MINING APPROACHES
Criteria of performance index: The confusion matrix is used to calculate the classification accuracy. Taking two-class classification as an example, the confusion matrix is shown in Table 1.
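As a minimal illustration of how sensitivity, specificity and accuracy follow from a two-class confusion matrix, the Python sketch below uses the standard definitions. This is an assumption: the study's own formulas are not stated in the text. The function name `classification_metrics` and the counts in the usage example are hypothetical and are not results from the study.

```python
def classification_metrics(tp, fn, fp, tn):
    """Sensitivity, specificity and accuracy from a 2x2 confusion matrix.

    tp: positives classified as positive   fn: positives classified as negative
    fp: negatives classified as positive   tn: negatives classified as negative
    """
    total = tp + fn + fp + tn
    if total == 0 or (tp + fn) == 0 or (tn + fp) == 0:
        raise ValueError("each true class needs at least one sample")
    return {
        "sensitivity": tp / (tp + fn),   # share of true positives detected
        "specificity": tn / (tn + fp),   # share of true negatives detected
        "accuracy": (tp + tn) / total,   # share of all samples classified correctly
    }


# Hypothetical counts, for illustration only.
print(classification_metrics(tp=40, fn=10, fp=5, tn=45))
```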
Experiment and comparison: Chen et al. (2008), Aburrous et al. (2010) and Khalifelu and Gharehchopogh (2012) reported that the forecast accuracy of the decision tree approach was higher than that of other approaches and that its standard deviation was smaller. However, experimental validation shows that the fuzzy incomplete approach introduced in this study performs better on these measures than the decision tree approach. The results are presented below.
The test results of the two data mining approaches, i.e., the performance of the fuzzy incomplete and statistical approaches on 10 groups of samples of historical measured data on software aging, are shown in Table 2. The test results of the other approaches can be obtained similarly and are omitted here.
We experiment with the 10 groups of data to compare the two new approaches with the existing approaches given by Liang (2006), Chen et al. (2008), Aburrous et al. (2010) and Khalifelu and Gharehchopogh (2012). For simplicity, only the test results for one sample set are given here, as shown in Table 3.
For the experiment on the 10 groups of data, the accuracy results of every approach are shown in Fig. 1 and 2.
In this study, six indexes, namely sensitivity, specificity, forecast accuracy, error classified rate, missed classified rate and runtime, are used to compare the performance of the six data mining approaches. Fig. 1 and 2 and Table 3 show that the fuzzy incomplete approach has the highest sensitivity and forecast accuracy in every group. Its average forecast accuracy is also slightly higher than that of the other approaches and its runtime is the shortest. Moreover, in the tests on small sample sets, the standard deviations of the forecast accuracy, sensitivity and specificity of the fuzzy incomplete approach over the 10 groups, and the means of its error classified rate and missed classified rate, are all smaller than those of the other approaches, which indicates that its forecast results are relatively stable. Therefore, the forecast model established by the fuzzy incomplete approach performs slightly better overall than the other models, so the fuzzy incomplete approach is the preferred approach. It is followed, in order, by the decision tree, support vector machine, logistic regression, statistical approach and neural network.

CONCLUSION
This study uses six data mining approaches to test 10 groups of data whose sample sizes include 192, 296, 526, 929 and 1080. The best performance of each approach in terms of sensitivity, specificity, accuracy, error classified rate, missed classified rate and running time is selected for comparison, in order to find a suitable approach for research on the aging characteristics of software systems. The test results show that the fuzzy incomplete approach is the best, followed by the decision tree, the support vector machine, logistic regression and the statistical approach; the neural network is the worst. The comparative study shows that the fuzzy incomplete approach is more suitable for discriminating the characteristics of aging detection in software systems.
$S^{*2} = \frac{n}{n-1} S^2$ approximates $\sigma^2$ better, so it is natural to use $S^* = \sqrt{S^{*2}} = \sqrt{\frac{n}{n-1}}\, S$ to replace the parameter $\sigma$ in $U$. Thus, the statistic $T = \frac{\bar{X} - a_0}{S} \sqrt{n - 1}$ is obtained.
$N$ denotes the total number of samples, $N_c$ the number of correctly classified samples, $P_c$ the correct classified rate, $P_e$ the error classified rate and $P_m$ the missed classified rate. Then, obviously, there exists the relation: [equation missing in the source]

Fig. 1: The accuracy and the mean of 6 approaches

Table 1: The confusion matrix of performance index

Table 2: Test results of performance of fuzzy incomplete and statistical approaches

Table 3: The performance indexes of each approach