Comprehensive Decision Tree Models in Bioinformatics

Purpose Classification is an important and widely used machine learning technique in bioinformatics. Researchers and other end-users of machine learning software often prefer to work with comprehensible models where knowledge extraction and explanation of reasoning behind the classification model are possible. Methods This paper presents an extension to an existing machine learning environment and a study on visual tuning of decision tree classifiers. The motivation for this research comes from the need to build effective and easily interpretable decision tree models by so called one-button data mining approach where no parameter tuning is needed. To avoid bias in classification, no classification performance measure is used during the tuning of the model that is constrained exclusively by the dimensions of the produced decision tree. Results The proposed visual tuning of decision trees was evaluated on 40 datasets containing classical machine learning problems and 31 datasets from the field of bioinformatics. Although we did not expected significant differences in classification performance, the results demonstrate a significant increase of accuracy in less complex visually tuned decision trees. In contrast to classical machine learning benchmarking datasets, we observe higher accuracy gains in bioinformatics datasets. Additionally, a user study was carried out to confirm the assumption that the tree tuning times are significantly lower for the proposed method in comparison to manual tuning of the decision tree. Conclusions The empirical results demonstrate that by building simple models constrained by predefined visual boundaries, one not only achieves good comprehensibility, but also very good classification performance that does not differ from usually more complex models built using default settings of the classical decision tree algorithm. In addition, our study demonstrates the suitability of visually tuned decision trees for datasets with binary class attributes and a high number of possibly redundant attributes that are very common in bioinformatics.


Introduction
Decision trees are one of the most popular classification techniques in data mining [1]. One of the main reasons for this is decision trees' ability to represent the results in a simple decision tree format which is easy to interpret for experts, as they can see the structure of decisions in the classifying process. The basic idea of the decision tree format is to construct a tree whose leaves are labeled with a particular value for the class attribute and whose inner nodes represent descriptive attributes. Given an inner node N, the children of N correspond to different possible values of the associated descriptive attribute. Once a decision tree is built, determining the class value for a new instance is achieved by following a path from the root to a leaf according to the values of the descriptive attributes of the instance. The class value assigned will be that labeling the leaf. Following this process one can easily extract classification rules that can be readily be expressed so that humans can understand them. In addition to their simplicity, building decision trees is often a less time consuming classification process compared to other classification techniques [2], and decision tree rules can be directly used as statements in a database access language (e.g. SQL).
Decision trees can be built with several different approaches where the most popular are C4.5 [3] and CART [4]. Due to their popularity, decision trees have been applied to different research fields including bioinformatics [5,6], medicine [7] and image classification [8]. In addition, several commercial products use decision trees for knowledge discovery, predictive analysis and other purposes. For instance, KnowledgeSeeker [9] offers business intelligence software for customer analytics and marketing analytics.
From the knowledge discovery perspective, the ability to track and evaluate every step in the decision-making process is one of the most important factors for trusting the decisions gained from data-mining methods. Examples of such techniques are decision trees that possess an important advantage in comparison with competitive classification methods -i.e., the symbolic representation of the extracted knowledge. Decision trees, along with rulebased classifiers, represent a group of classifiers that perform classification by a sequence of simple, easy-to-understand tests whose semantics are intuitively clear to domain experts [10]. Although current state-of-the art classifiers (e.g. Support Vector Machines [11]) or ensembles of classifiers (e.g. Random Forest [12] or Rotation Forest [13]) significantly outperform classical decision tree classification models in terms of classification accuracy or other classification performance metrics, they are not suitable for knowledge discovery process.
When decision trees are used in knowledge discovery, one should usually include domain experts in the analysis process. Therefore, in most cases the final decision trees will be presented to domain experts for evaluation of extracted knowledge -i.e., rules that can be derived from a decision tree. In such cases the complexity of decision trees, which is usually measured as the number of nodes or the number of rules that can be extracted from a tree, is of high importance and can influence the evaluation of the discovered knowledge by domain experts [14]. Decision tree complexity has been studied in terms of reducing the complexity and maintaining or improving the accuracy at the same time. Bohanec and Bratko [15] studied the difference between pruning a decision tree for better approximation of the target concept and pruning the decision tree to make it practical for communication and understanding by the user. Their study focused on developing algorithms for obtaining the smallest pruned decision trees that represent concepts within some chosen accuracy. Oates and Jensen [16] studied the influence of database size on decision tree complexity. They demonstrated that the tree size strongly depends on the training set size. Therefore, many approaches that are based on removing training instances prior to tree construction [17,18,19] could result in smaller trees just because of the training set reduction.
Different visual representations of decision trees like the classical node-link diagrams [20,21], treemaps [22,23], concentric circles [24,25], and many others have been proposed in the past. A major consideration in evaluation of decision trees is also how efficiently they use screen space to communicate the tree information [26]. Through application of decision trees to different fields of research and their use in open source and commercial software for machine learning and data mining, it has been demonstrated that end-users still prefer node-link diagrams although their space covering is not optimal. Huysmans et al. [27] observe that currently, most research focuses on improving the accuracy or precision of predictive models and comparatively little research has been undertaken to increase their comprehensibility to the analyst or end-user. They empirically investigated suitability of decision tables, (binary) decision trees, propositional rules, and oblique rules in environments where interpretability of models is of high importance. The results showed that users prefer decision tables, followed by decision trees to other compared knowledge representations, but authors admitted that only inexperienced users were included in the study.
A multi-criteria approach to evaluation of decision trees that also includes size of the built decision trees was proposed by Osei- Bryson [28]. It aims to make the data mining process simpler for data mining project teams, especially when they have to evaluate significant number of decision trees. The proposed project uses three measures to evaluate appropriateness of the decision trees: stability, simplicity and discriminatory power. Simplicity, or equivalently complexity is further divided in the number of rules that can be extracted from the tree and the average length of the extracted rules.
Due to their popularity and a need to build simple decision trees with as little effort as possible, this paper proposes a novel method called Visual Tuning of Decision Trees (VTDT). This method helps data analysts in building effective decision tree representations with spending less time on setting and tuning the parameters of decision tree induction algorithm when compared to classical methods. From the analyst's perspective it is very important that the produced representation of the decision tree allows effective communication with end-users (i.e. customers) or domain experts in cases of decision tree applications in research. In addition, from our own and experience of our colleagues, we know that, although we live in a digital age, we still meet a lot of experts in different domains who prefer to have the final decision tree printed out on a sheet of paper. The result of the VTDT method is a decision tree that can be printed out on a single page or displayed on a computer screen without the need for scrolling or zooming. It is also important to take care of the decision trees that would be too pruned when using the default parameters of decision tree induction method. One could also call this type of decision tree induction ''one-button decision trees'' as there is no need to tune the parameters and build multiple decision trees anymore.

Methods
The proposed method in this paper presents an automated tuning process for the widely used C4.5 decision tree, which was developed by Quinlan [3]. More precisely, it focuses on C4.5's implementation in the Weka machine learning framework [29], where it is referred to as J48.

Tuning the Parameters
There are multiple settings that can influence the size of the generated decision tree. Two types of pruning are available -i.e., subtree replacement and subtree raising. Subtree raising uses a technique where a node may be moved upwards towards the root of the tree, replacing other nodes along the way during a process of pruning. In general, subtree raising is computationally more complex than subtree replacement where the nodes in a decision tree can be replaced by leafs. Another setting influencing the pruning process is confidence factor that represents a threshold of allowed inherent error in data while pruning the decision tree. By lowering the threshold one is applying more pruning and consequently generates more general models. To obtain simpler models where leafs contain higher number of samples, it is possible to set the minimal number of objects in a single leaf. This setting can also be used in tuning to achieve simpler and smaller decision  trees. The final setting that can be used to tune the visual outlook of the tree is called binary splits selection. This setting forces the splitting of nodes to only two branches instead of multiple splits. The default J48 decision tree in Weka uses pruning based on subtree raising, confidence factor of 0.25, minimal number of objects is set to 2, and nodes can have multiple splits. To allow automated tuning in Weka, a package called Visually Tuned J48 (VTJ48, available at http://ri.fzv.uni-mb.si/vtj48/) was developed during this study. All parameters, mentioned in the previous paragraph, are automatically tuned in VTJ48 to allow the so called ''one-button data mining''. However, it is possible to change the default values for dimensions of the resulting window that represent boundaries of the VTJ48 decision tree. Default values for maximal dimensions of the decision tree are set to 12806800 pixels corresponding to the Widescreen eXtended Graphics Array (WXGA) video standard. The aspect ratio of this resolution is 16:10 (1.60) and comes very close to aspect ratio of A4 paper dimensions (approx 1.41). The chosen dimensions can also be displayed on most computer monitors in use today.
Although it would be possible to use the original Weka source code to display decision trees, some adaptations to original decision tree visualization methods had to be done to allow better covering of space for nodes and leaves. In comparison to classical Weka decision tree visualization, we changed the shape of internal nodes to allow more space on both sides of nodes. Additionally, we reduced the height of the trees with reduction of the vertical distance between nodes by 50%.
Tuning of parameters in VTJ48 is done using adapted binary search where confidence factor of pruning is optimized until highest acceptable value of this parameter is found. Boundaries for confidence factor optimization are set at 0 and 0.5 (starting value in VTJ48 and the maximal allowed setting in J48). In cases where initial confidence factor tuning cannot build an acceptable decision tree, binary splits are turned on. This step usually significantly reduces horizontal dimensions of the tuned decision tree. Tuning of confidence factor is done once again. In rare cases, where binary splits are not enough, VTJ48 tries to increase minimal number of objects in leaves. This parameter (m) is increased from 2 until m,n in steps of m 2 , where n is number of all samples in the training set. More extensive search could have been chosen, but in such case one should expect a significant increase in the time complexity of the tuning process. The pseudocode in Figure 1 describes the reduction of the tree size process as implemented in VTJ48.
In rare cases the default settings of VTJ48 algorithm will produce an extremely small tree consisting of just one or even without splitting nodes. Therefore, in cases of decision trees with only one or two leaves, an approach using unpruned decision tree is used. With confidence factor set to 0.5 such tree will usually grow over the predefined boundaries. This time a linear hillclimbing approach is used to increase the minimal number of objects in leaves, because there is no need to tune confidence factor in an unpruned decision tree.

Experimental Settings
By reducing the size and complexity of decision trees to fit the predefined screen resolution or paper size one is expecting significantly lower classification accuracy, especially in initially  Protein solubility is an important protein property since low protein solubility can lead to several diseases [31] or affect isolation of proteins from complex mixtures [32]. Several attempts to classify and predict protein solubility have been made [33][34][35][36]. To assess our method, we used the eSol database (available at http:// tp-esol.genes.nig.ac.jp/) which includes information about protein solubility of the entire ensemble of E.coli proteins. The database contains 1,625 proteins, out of which 782 are insoluble and 843 are soluble proteins. We calculated 21 feature datasets for each of these proteins as shown in Table 2. These numeric features have shown to be influential in protein solubility prediction in previous works, where: -the feature datasets 1-18 contain mono-, di-and tri-mers using 7 different alphabets, -the feature dataset 19 contains 4 sequence-computed features, i.e., molecular weight, sequence length, isolectric point and GRAVY index, -the feature dataset 20 contains features used in [33], and -the feature dataset 21 combines all features from the previous datasets.
GEMLeR datasets used in this study were created by selecting one out of 9 groups of samples in so called one-versus-all binary classification setting. Unsupervised highest variance filter was chosen to avoid the so called ''selection bias'' when reducing the number of attributes by eliminating the measurements with extremely low variance. Samples consisting of original 54,681 expression measurements from Human Genome U133 Plus 2.0 Array GeneChip were reduced to 10,935 (20%) gene expression measurements that represent attributes of 9 datasets.
2.2.4. Performance Evaluation. Different measures were observed for J48 decision tree using default settings and VTJ48 decision tree on all datasets. Basic size related measures like width and height of decision tree in pixels, number of leaves and number of nodes were calculated for each decision tree on each dataset. Additionally, Classification accuracy (ACC) and area under ROC curve (AUC) were calculated using 20 runs of 10-fold cross-validation on all datasets to observe differences in classification performance.

Results
To evaluate the proposed method we compared the classification performance and size of the classical C4.5 trees (J48) with the visually tuned C4.5 trees (VTJ48). Initially, the tests were performed on 40 datasets from the well-known machine learning repository. In addition, the tests were done on two types of datasets where decision trees can be applied in the field of bioinformaticsi.e., 21 protein solubility datasets and 9 gene expression analysis datasets. As expected, in most cases, the original J48 decision tree vastly exceeded the predefined display resolution of 12806800 pixels (Table 3 and Table 4). In some extreme cases the width of the decision tree exceeded the predefined dimension by more than 10fold (letter, audiology, soybean). However, decision trees of this size and high number of classes are inappropriate for extraction of rules and presentation to end-user. Altogether, in the UCI datasets evaluation, there are 30 datasets where VTJ48 optimized a decision tree by reducing the number of leaves to fit into predefined dimensions. In 8 cases VTJ48 produced decision trees with more leaves than the original J48 method. Increase of the tree size occurred in cases when there were only one or two leaves produced using default settings of J48, pruning was automatically turned off in VTJ48 resulting in more complex decision trees. In case of protein solubility datasets, there were 20 datasets where the complexity of the tree was reduced and only one case where it increased. Similar changes in tree complexity were observed in gene expression problems, where complexity increased only in 2 out of 10 datasets. Observing the complexity of built decision trees one should also note that VTJ48 starts the tuning process of confidence factor at 0.5, whereas J48 starts at 0.25 resulting in more complex VTJ48 decision trees that still fit into predefined visual boundaries.

Classification Performance on UCI Data
Accuracy and AUC (Table 5) were used for evaluation of classification performance, although due to the high number of multiclass datasets, it is debatable whether accuracy is the right measure for classification performance. As suggested in [38], the Wilcoxon signed ranks test was used to assess statistical significance of difference in performance and complexity of the decision tree. Comparing accuracy using win/draw/lose record one can observe J48 wins on 21 datasets, while VTJ48 managed to outperform J48 on 16 datasets. Statistical significance testing shows that J48 significantly outperforms VTJ48 in accuracy (p = 0.022), while there is no significant difference in results of AUC (p = 0.766). As already mentioned, one should be cautious when interpreting the results above, since accuracy is not a well suited performance measure in cases of unbalanced multi-class datasets. Therefore we did another test where only 16 binary class datasets were used and found out that there are no statistically significant differences present (p = 0.320). Table 3 demonstrates a big difference in decision tree size (number of leaves) comparing J48 to VTJ48 decision trees. Table 4 shows the average decision tree dimensions for the protein datasets including the average number of leaves. It can be noticed that the size was reduced on the majority of the feature datasets. The only exceptions are the DimersClustEm14 and TrimersHydro datasets, on which the tree size increased. Table 6 shows accuracy and AUC for the evaluation of classification performance on protein datasets. Since all these datasets present a binary classification problem, accuracy and ACC are more appropriate measurements when compared to the UCI datasets. Again, the Wilcoxon signed ranks test was used to assess statistical significance of difference in performance and complexity of the decision tree. When observing the accuracy win/draw/lose record, one can notice that J48 wins on 5 datasets, while VTJ48 managed to outperform J48 on 15 datasets. The results are similar for AUC where J48 wins on 5 datasets, while VTJ48 wins on 16 datasets. Table 7 shows the average decision tree dimensions for the 9 GEMLeR datasets including the average number of leaves. In  comparison to protein and UCI datasets it is evident that gene expression problems do not create very large tree, therefore the reduction in size, when VTJ48 is used is not that big. Table 8 shows accuracy and AUC for the evaluation of classification performance for the GEMLeR datasets and also demonstrates that the performance actually increases if we use simpler (i.e., smaller) decision tree models.

Classification Performance on Bioinformatics Data
Statistical significance testing was done on all 20 cross-validation run results for 30 bioinformatics datasets together using Wilcoxon signed-rank test. In case of accuracy (p = 0.002) and AUC (p = 0.001) the VTJ48 trees significantly outperformed J48 trees. Although we did not expect such significant differences between results in favor of VTJ48, it is obvious that VTJ48 is well suited for datasets with binary class attributes and a high number of possibly redundant attributes.

Examples of Large Decision Trees
In this section we demonstrate two examples, each of them with two decision trees built on a single dataset. The first tree in each example is the result of J48 using default settings, and the second tree is the result of VTJ48.The dataset in the first example is the letter dataset from the UCI repository. This dataset contains 26 class values which represent 26 capital letters in the English alphabet. The character images were based on 20 different fonts and each letter within these 20 fonts was randomly distorted to produce 20.000 instances with 16 attributes. Fig. 2 shows the original tree and the visually tuned tree. One can notice the extremely complex original decision tree, which is the result of the high number of classes. Since the visually tuned tree does not cover all the possible classes, it cannot achieve competitive classification accuracy.
The second example presents two decision trees built on the protein solubility dataset with all features (Figure 3). In this case the accuracy and AUC were both improved significantly, when the size of the decision tree model was reduced. This is possible due to binary class attribute that still allows effective trees that are much smaller than the original pruned J48 trees.
In addition, to demonstrate the most significant rules from both decision trees in Figure 3, we extracted top 5 rules according to their support in training set ( Table 9). The numbers at the end of mono-, di-and tri-mer attribute names (e.g. the number 34 in MonomersClustEm17_34) distinguish attributes inside different alphabets. It can be observed that J48 produces more complex rules with a higher number of conditions which use attributes from more different alphabets. On the other hand, top 5 rules from VTJ48 tree cover much more samples (70.6%) than top 5 rules derived from J48 (36.6%). It is evident that low error rates on the training set do not guarantee good classification performance on the test set. We can once again conclude that in most cases, at least in protein solubility domain, more complex trees result in overfitting to training samples.

User Study
To test the effectiveness of the VTJ48 method in terms of usability, a Weka package was developed implementing the visually constrained tree building algorithm. An experiment to compare the duration of building decision trees using the J48 and VTJ48 Weka packages in Weka Explorer was set up. Three different datasets from the UCI repository (balance-scale, credit-g, and splice) were chosen based on their complexity where the need for tuning the tree models is more likely to be necessary. Fourteen master students, all enrolled in a Bioinformatics program, were recruited to take part in the experiment. After a brief introduction to the VTJ48 Weka package, the participants were given the datasets and were asked to build a comprehensible decision tree from each dataset using both, J48 and VTJ48 methods. Additionally, the participants were instructed to optimize each decision tree to fit to a single computer screen to allow optimal comprehensibility. In the case of J48 classifiers, this meant tuning the binary splits, minimal number of objects, and pruning parameters. In the case of VTJ48, this simply meant setting the desired resolution parameters. The duration from the start of the tree building process to the point when the decision tree was displayed on a single screen, was stored for further analysis. Figure 4 clearly shows that tree building times were shorter for VTJ48 method on all datasets.
In order to test the statistical significance of the obtained results, the Wilcoxon signed-rank test was chosen to compare the distributions of tree building times and accuracies for different tree building methods. Each of the tests was assessed at a significance level of 95%. The medians of tree building times for the J48 and VTJ48 methods were significantly different for two datasets (balance-scale: p = 0.002, credit-g: p = 0.020). Tree building times were not significantly different for the splice dataset (p = 0.396), however, the mean tree building time was still 33.14 seconds shorter for the VTJ48 method.

Discussion
This study focused on evaluation of decision tree performance when useful and comprehensible decision trees are needed. The  Table 9. Top 5 rules with the highest support in All Features extracted from J48 and VTJ48 decision trees.  evaluation was done on 30 datasets from the following two areas in bioinformatics: protein solubility classification and gene expression analysis. More precisely, strict boundaries for width and height of built decision trees were set to produce more comprehensible trees. It is important to note that VTDT approach only helps the end-user in tuning the decision tree building parameters and does not propose a novel decision tree building algorithm. Although this paper presents the automated visual tuning on C4.5 decision trees, it would be possible to adapt the VTDT principles to any other decision tree building algorithm that requires tuning of parameters to achieve optimal results. By tuning the parameters, without interfering with the internal decision tree building process and constraining the tuning only by the dimensions of the decision tree, the bias of influencing the classification performance is avoided.
The results of our study confirmed there is no statistically significant difference in predictive performance between the decision trees built using default values and the ones that were built using the proposed process of visual tuning. Moreover, when AUC is observed, visually tuned models, that are usually also much simpler than large default models, performed better on majority of datasets. This is especially true for most of the protein and gene expression datasets, where the performance improvements were significant. However, it has to be noted that a larger sample of datasets would be needed to draw more reliable conclusions. Based on these results, one could conclude that simpler models usually produce at least comparable results if not better. This has also been shown in many other studies related to the Occam's razor theory [39,40]. However, there are also studies that demonstrate the contrary -i.e., growing the trees will improve the classification performance [41]. To sum it up, it all depends on how a simple model is defined. In the case of VTDTs, we should probably state that if the model is simple enough (i.e., fits into our predefined visual boundaries), it will produce good or even better results than most of the more complex models. Unfortunately, the proposed decision tree tuning suffers from the high time complexity in comparison to classical decision tree that is built only once. However, as shown by the the user study, it still saves a lot of time in comparison to manual tuning and fitting of the decision tree to desired dimensions.
In this paper we evaluated the visual tuning strategy only on C4.5 decision trees. From the research and also from the practical usability point of view, it would be important to extend this study and consequently develop a Weka package that would allow simultaneous tuning of different decision tree models (e.g. CART [4]). Some of the areas where visual tuning could also be applied are comprehensible ensembles of classifiers or variations of decision tree models (e.g, Alternating decision trees [42]). These models combine boosted decision stumps in a structure where visual constraints could be beneficial for the end-user in different areas of bioinformatics.