Performance Analysis of CHAID Algorithm for Accuracy

The decision tree algorithm belongs to the supervised-learning branch of machine learning and is one of the common techniques of data mining. CHAID is one of the common decision tree algorithms; it can be used for prediction as well as for classification, and it is mainly applied in market analysis, risk prediction, and other fields. The second part of this paper presents the core idea of the CHAID algorithm, the specific steps of its classification process, and the principle and formula of the chi-square test on which CHAID mainly relies, together with the steps of the calculation. The third part describes the branching principles of the CHAID algorithm and of other decision tree algorithms, and analyses the accuracy of the CHAID algorithm. This allows us to choose a relatively good decision tree algorithm according to the specific data situation and, when applying the CHAID algorithm, to take countermeasures against the factors that affect its accuracy, which is more conducive to obtaining a more accurate result.


I. INTRODUCTION
Since the 1990s, with the rapid development of information technology, the application of database systems has become more and more widespread. At the same time, database technology has entered a completely new stage: from managing only simple data in the past, it has moved to managing a wide variety of complex data such as images, videos, audio, graphics, and electronic files generated by various devices, and the amount of data has grown larger and larger.
In this era of abundant information, the vast amount of data may not only fail to bring us benefits but may also bring many negative effects. The most important factor is that effective information is difficult to extract: too much meaningless data inevitably creates distance from the information we need and causes the loss of meaningful knowledge. This is what John Naisbitt calls the "information-rich but knowledge-poor" dilemma.
With the original functions of the database system, people could not discover the relationships and rules implied in the data, nor predict future trends based on the existing data; methods for uncovering the hidden value behind the data were lacking. For this reason, there was an urgent need for a technology that could analyze large amounts of information more deeply, discover and capture the hidden value, and make seemingly useless data useful. In this scenario, data mining technology emerged. Machine learning is one of the important methods by which data mining is implemented, and with the continuously growing data mining and analysis needs of various industries, the status of machine learning keeps rising [1].
Machine learning is a multi-disciplinary field involving statistics, probability theory, algorithmic complexity theory, and other disciplines. Decision trees, Bayesian learning, artificial neural networks, and random forests are all research directions of machine learning. One of the most commonly used machine learning methods is the decision tree algorithm.
Decision trees are an effective way to generate classifiers from data and represent the most widely applied class of logical methods. The decision tree algorithm is a common technique in data mining that can be used to classify the analyzed data or to make predictions [2]. The chi-squared automatic interaction detector (CHAID) is one of the typical decision tree algorithms [3].
In 1980, Kass first proposed the chi-squared automatic interaction detector (CHAID), a tool used to discover relationships between variables; it is a decision tree technique based on an adjusted significance test (Bonferroni correction) [4].
The CHAID algorithm is a sensitive and intuitive segmentation method. It divides the respondents into several groups according to the relationship between the predictor variables and the dependent variable, and then divides each group into further subgroups. Dependent variables are usually key indicators, such as level of use or purchase intention. A dendrogram is displayed after each run of the program: at the top is the set of all respondents, below it are subsets of two or more branches, and the CHAID classification is based on a dependent variable [5].
CHAID can be used for prediction as well as classification, and to detect interactions between variables [6]. In practice, CHAID is often used in direct marketing to select consumer groups, predict their response, and examine how some variables affect others [7], while other early applications were in medical and psychiatric research [8]. It has also been applied to engineering project cost control, financial risk warning, and the analysis of fire-call reception and handling [9].
This paper mainly studies the CHAID algorithm, compares it with other commonly used decision tree algorithms such as CART and ID3, and analyses its accuracy [10].

II. CHAID ALGORITHM AND THE CHI-SQUARE TEST
The core idea of the CHAID (chi-squared automatic interaction detection) algorithm is to divide the samples optimally according to the given target variable and the selected feature indices (i.e., the predictor variables), grouping the contingency table and judging automatically according to the significance of the chi-square test.
Field selection in the CHAID algorithm is performed using the chi-square test.

A. Classification process of the CHAID algorithm
First, the target variable for classification is selected, and the candidate predictor variables are then cross-classified with the target variable to produce a series of two-dimensional classification tables.
The chi-square value of each two-dimensional classification table is calculated and the P-values are compared; the two-dimensional table with the lowest P-value is taken as the best initial classification table, and its categorical variable becomes the first-level variable of the CHAID decision tree.
Based on the best initial classification table, the target variable continues to be classified to obtain the second- and third-level variables of the CHAID decision tree.
The process is repeated until the P-value is greater than the preset significance level alpha, or until all variables have been classified, at which point the classification stops.
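The split-selection step described above can be sketched in plain Python. This is a simplified illustration, not the full CHAID procedure: it ranks candidate predictors by the raw chi-square statistic (equivalent to ranking by P-value only when the tables share the same degrees of freedom), and it omits CHAID's category merging and Bonferroni adjustment; all function and variable names are illustrative.

```python
from collections import Counter

def chi_square_stat(table):
    """Pearson chi-square statistic of a contingency table (list of rows)."""
    row_totals = [sum(r) for r in table]
    col_totals = [sum(c) for c in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

def contingency_table(xs, ys):
    """Cross-classify two categorical sequences into a two-dimensional table."""
    x_levels, y_levels = sorted(set(xs)), sorted(set(ys))
    counts = Counter(zip(xs, ys))
    return [[counts[(x, y)] for y in y_levels] for x in x_levels]

def best_split(predictors, target):
    """Pick the predictor whose cross-table with the target has the
    largest chi-square statistic (the most significant relationship)."""
    best_name, best_stat = None, -1.0
    for name, values in predictors.items():
        stat = chi_square_stat(contingency_table(values, target))
        if stat > best_stat:
            best_name, best_stat = name, stat
    return best_name, best_stat
```

For example, with a perfectly informative `age` column and an uninformative `income` column, `best_split` selects `age`; a real CHAID implementation would then repeat the same selection inside each child node until no split is significant at level alpha.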

B. Introduction to the chi-square test
Because the CHAID algorithm mainly relies on the chi-square test, the chi-square test is introduced below.
1) The concept and significance of the chi-square test: The chi-square test measures the deviation between the theoretical (expected) values and the actual (observed) values of a statistical sample. The degree of deviation determines the size of the chi-square value: the smaller the deviation, the smaller the chi-square value, and conversely, the larger the deviation, the larger the chi-square value; if the actual values are exactly equal to the theoretical values, the chi-square value equals 0.
2) The basic idea of the chi-square test: The chi-square test is a commonly used hypothesis test based on the chi-square distribution.
First, assume that the hypothesis H0, "the expected frequencies do not differ from the observed frequencies", holds. Under this premise, the chi-square value χ² of the theoretical and actual values is calculated. The probability that hypothesis H0 holds under the current statistical sample can then be determined from the chi-square distribution and the degrees of freedom; the figure shows part of a chi-square probability table. If the P-value is small, the probability that H0 holds is small and H0 should be rejected, indicating a significant difference between the theoretical and actual values; if the P-value is large, H0 cannot be rejected, and no significant difference between the theoretical values and the actual situation can be claimed.
3) Formula of the chi-square test:

χ² = Σ_{i=1}^{k} (A_i − E_i)² / E_i, with E_i = n · p_i.

In this formula, χ² is the chi-square value obtained from the actual and theoretical values, k is the number of cells in the two-dimensional table, A_i is the actual (observed) value of cell i, E_i is the expected value of cell i, n is the total number of samples, and p_i is the expected probability of cell i, so that E_i = n (the total number of samples) × p_i (the expected probability of cell i).
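The formula can be written directly as a small function. The goodness-of-fit form below uses E_i = n · p_i exactly as in the formula; the die-roll counts are made up purely for illustration.

```python
def chi_square(observed, probs):
    """Chi-square value: sum over cells of (A_i - E_i)^2 / E_i,
    where E_i = n * p_i (n = total sample size, p_i = expected probability)."""
    n = sum(observed)
    return sum((a - n * p) ** 2 / (n * p) for a, p in zip(observed, probs))

# A fair six-sided die rolled 60 times has E_i = 10 for every face;
# the observed counts here are illustrative.
value = chi_square([8, 12, 9, 11, 10, 10], [1 / 6] * 6)  # close to 1.0
exact_fit = chi_square([10] * 6, [1 / 6] * 6)            # close to 0.0
```

As the text above states, a perfect agreement between observed and expected values drives the statistic to 0, and larger deviations make it grow.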

4) Steps of the chi-square test: First, assuming that H0 holds, determine the degrees of freedom (degrees of freedom = (rows − 1) × (columns − 1), where rows and columns are the numbers of rows and columns of the two-dimensional table). Then obtain the theoretical frequencies by maximum likelihood estimation, substitute them into the formula, and solve. Assume, as in the example above, that H0 is "whether or not one wears makeup has no relationship with gender".
Maximum likelihood estimation yields the expected values E_i; for example, E_1 = 100 × 110 / 200 = 55, where 100 is the number of men actually surveyed and 110/200 is the proportion of all respondents who wear makeup, from which the expected number of men who wear makeup is estimated. In the table, there is a significant difference between the values obtained by maximum likelihood estimation (inside the parentheses) and the actual values (outside the parentheses).
Substituting into the formula, the result χ² > 10.828 means that the P-value is below 0.001, so the null hypothesis can be rejected: with 99.9% probability, wearing makeup is significantly associated with gender.

C. The field selection procedure of the CHAID algorithm
After understanding the chi-square test procedure, consider the field selection process of the CHAID algorithm. CHAID selects fields using the chi-square statistic, which in effect tests whether two categorical fields are independent of each other. A numerical field is automatically discretized into a categorical field before analyzing whether it is related to the target field. A larger chi-square value indicates a more significant relationship; a smaller value indicates a weaker one. In the table above, the rows are the income levels and the columns indicate whether a computer was bought.
To calculate the chi-square statistic, first obtain the observed frequencies from the data and then compute their expected values; the statistic is the sum over all cells of the squared difference between the two divided by the expected value.
The first table shows the actual data, derived from the observed data statistics.
The second table gives the expected values: treating the two fields as independent, each expected value is obtained by multiplying the corresponding row and column probabilities and then multiplying by the total count.
The third table gives the cell contributions (observed minus expected, squared, divided by expected). After summing, χ² equals 0.57 and the corresponding P-value is about 75%, indicating that the relationship between the two fields is relatively weak.
Calculating the chi-square statistics of all candidate features and comparing them, the age feature has the largest value, χ² = 3.54667, indicating that age is most closely related to whether a computer is bought; therefore the age variable is chosen as the splitting variable of the decision tree to produce the leaf nodes of the next level.
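The three-table workflow above (observed counts, expected counts, summed contributions) is exactly what `scipy.stats.chi2_contingency` computes in one call. The income-versus-purchase counts below are made up for illustration; they are not the paper's original table.

```python
from scipy.stats import chi2_contingency

# Rows: high / medium / low income; columns: bought / did not buy a computer.
# These counts are illustrative, not the original data.
table = [[20, 10],
         [15, 15],
         [10, 20]]

stat, p, dof, expected = chi2_contingency(table, correction=False)
# dof = (3 - 1) * (2 - 1) = 2, and every expected cell here is 30 * 45 / 90 = 15.
```

With these made-up counts the statistic is χ² ≈ 6.67 with P ≈ 0.036, so at the 5% significance level income and purchase would be judged significantly related; `expected` is the second ("expectation") table from the procedure above.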

III. COMPARISON OF DECISION TREE ALGORITHMS AND ACCURACY ANALYSIS OF THE CHAID ALGORITHM
At present, the three most commonly used decision tree algorithms are CHAID, CART, and ID3 (including the later C4.5, and even C5.0).
CHAID algorithm: The CHAID algorithm has a long history. Following the principle of local optimization, CHAID uses the chi-square test to select the independent variable that most affects the dependent variable. Because that independent variable may have many different categories, the CHAID algorithm generates as many branches as the variable has categories, so the CHAID decision tree is a multi-way tree.
Scope of application of CHAID: The CHAID method is optimal when the predictor variables are categorical. For continuous variables, CHAID automatically divides them into 10 segments, but there may be omissions. When the predictor variables are demographic variables, researchers can quickly identify the market characteristics of the different segments, avoiding the difficulty of combining and checking cross-tabulation tables.
Accuracy analysis of the CHAID algorithm: The CHAID algorithm uses the statistical chi-square test, which gives its branching calculations a sound mathematical and theoretical basis, so its credibility and accuracy are relatively high.
On this basis, the CHAID algorithm uses pre-pruning, i.e., pruning a branch before it is divided while the decision tree is being generated. Pre-pruning not only reduces the training-time and testing-time overhead of the CHAID decision tree but also reduces the risk of overfitting. On the other hand, some divisions rejected by pre-pruning might not improve generalization performance by themselves, or might even cause a temporary decrease in it, while subsequent divisions based on them could have led to a significant improvement; pre-pruning therefore carries a risk of underfitting. Hence, if the amount of pruning is kept within a good range when the data are sufficient and the variables are mostly categorical, the underfitting risk of the CHAID algorithm will be further reduced and its accuracy further improved.
CART algorithm: For the CART (classification and regression tree) algorithm, the segmentation logic is the same as for CHAID: the division at each level is based on testing and selecting among all independent variables. However, the test criterion used by CART is not the chi-square test but an impurity index such as the Gini coefficient (Gini). The biggest difference between the two is that CHAID adopts the principle of local optimization, that is, the nodes are independent of each other: after a node is determined, the subsequent growth process is carried out entirely within that node. CART, on the other hand, focuses on global optimization by using a post-pruning method, which lets the tree grow as much as possible and then goes back to trim it.
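For contrast with CHAID's chi-square criterion, the Gini impurity that CART minimizes can be computed in a few lines. This is a minimal sketch with illustrative names, not a full CART implementation.

```python
def gini(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_split(left, right):
    """Weighted Gini impurity of a CART-style binary split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# A maximally mixed two-class node has impurity 0.5.
mixed = ["buy", "buy", "no", "no"]
```

Here `gini(mixed)` is 0.5, while the clean cut `gini_split(["buy", "buy"], ["no", "no"])` is 0.0; CART chooses, at each node, the binary cut that most reduces this weighted impurity.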
Post-pruned decision trees generally retain more branches than pre-pruned ones. In general, a post-pruned decision tree has a very small risk of underfitting, and its generalization performance is often superior to that of a pre-pruned tree. However, because post-pruning is carried out only after the complete decision tree has been generated, and must examine every non-leaf node of the tree from the bottom up, its training-time cost is much larger than that of an unpruned or pre-pruned decision tree.
If there are missing values in an independent variable, CART will look for a surrogate value to replace the missing value, while CHAID treats the missing values as a separate category.
CART builds a binary tree while CHAID builds a multi-way tree. CART selects the best binary cut at each branch, so a variable is likely to be used multiple times at different depths of the tree; CHAID splits one variable into multiple statistically significant branches at a time, so the tree grows faster, but the support of the sub-nodes decreases more rapidly than with CART, and the tree more quickly approaches a bloated and unstable state.
Therefore, once the number of categories in the data set grows beyond a certain point, the accuracy of the CHAID algorithm drops considerably compared with the CART algorithm. The number of features can be reduced during data cleaning by removing data irrelevant to the target, so as to improve the accuracy of the CHAID algorithm.

ID3 (including the later C4.5 and C5.0) algorithm: The ID3 (Iterative Dichotomiser) algorithm dates from the same period as CART. Its distinguishing feature is its variable selection criterion: based on the information-gain measure, it selects the attribute with the highest information gain as the splitting attribute of the node, so that the resulting partition requires the minimum information to classify the records, which is also an idea of division purity. C4.5 can be understood as a development of ID3; the main difference is that C4.5 uses the information gain ratio instead of ID3's information gain measure. The reason for this replacement is that the information gain measure has a disadvantage: it tends to favor attributes with a large number of values. As an extreme example, splitting on Member_Id makes every Id its own purest group, but such a division has no practical significance. The information gain ratio adopted by C4.5 overcomes this disadvantage by adding a split-information term that normalizes (constrains) the information gain. C5.0 is the latest version; compared with C4.5, it uses less memory, builds a smaller rule set, and is more accurate.
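The Member_Id pathology and C4.5's fix can be seen numerically. The sketch below (illustrative names, standard library only) computes information gain and gain ratio on a four-record toy set: splitting on a unique Id and splitting on a genuinely informative attribute both achieve the maximum information gain, but the gain ratio penalizes the Id split.

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def info_gain(parent, children):
    """ID3's criterion: parent entropy minus weighted child entropy."""
    n = len(parent)
    return entropy(parent) - sum(len(ch) / n * entropy(ch) for ch in children)

def gain_ratio(parent, children):
    """C4.5's criterion: information gain normalized by split information."""
    n = len(parent)
    split_info = -sum(len(ch) / n * math.log2(len(ch) / n) for ch in children)
    return info_gain(parent, children) / split_info

labels = ["yes", "yes", "no", "no"]
id_split = [["yes"], ["yes"], ["no"], ["no"]]   # one branch per Member_Id
good_split = [["yes", "yes"], ["no", "no"]]     # a meaningful binary attribute
```

Both splits have an information gain of 1.0 bit, so ID3 cannot distinguish them; the gain ratios, however, are 0.5 for the Id split versus 1.0 for the meaningful one, which is exactly the normalization effect described above.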

IV. CONCLUSION AND FUTURE WORK
The decision tree algorithm belongs to the supervised-learning branch of machine learning and is a commonly used technique in data mining. It can be used to classify the analyzed data and also for prediction. Common algorithms include CHAID, CART, ID3, C4.5, C5.0, and so on. The second part of this paper studied the core idea of the CHAID decision tree algorithm, the specific steps of its classification process, and the principle and formula used in its branching process. The third part compared the CHAID decision tree algorithm with other commonly used decision tree algorithms such as CART and ID3, examining their branching principles, their advantages and disadvantages, and the accuracy of the CHAID algorithm. In its branching method, CHAID uses the chi-square test together with pre-pruning; CART uses the Gini coefficient (Gini) together with post-pruning; ID3 uses a measure based on information gain; and C4.5 and C5.0 adopt the information gain ratio. This paper provides a basic understanding of these algorithms, so that we can choose a relatively good decision tree algorithm for data mining according to our specific data and, when applying the CHAID algorithm, take countermeasures against the factors that affect its accuracy, which is more conducive to obtaining a more accurate result.
In the next stage of research, I will use concrete data to implement these decision tree algorithms and examine the accuracy of the CHAID algorithm on different data. From the experimental results we will be able to compare the differences between these decision tree algorithms, the impact of different data on the accuracy of the CHAID algorithm, and the differences in accuracy among the algorithms. When choosing a decision tree algorithm according to the specific situation of the data, we will then have a clearer and more intuitive understanding, together with a better grasp of the accuracy analysis of the CHAID algorithm.