Application of biclustering algorithm to extract rules from labeled data

Purpose – For many pattern recognition problems, the relation between the sample vectors and the class labels is known during the data acquisition procedure. However, finding the useful rules or knowledge hidden in the data is both important and challenging. Rule extraction methods are very useful in mining the important and heuristic knowledge hidden in the original high-dimensional data. They can help us construct predictive models with few attributes of the data, so as to provide valuable model interpretability and shorter training times. Design/methodology/approach – In this paper, a novel rule extraction method based on a biclustering algorithm is proposed. Findings – To choose the most significant biclusters from the huge number of detected biclusters, a specially modified information entropy calculation method is also provided. It will be shown that all of the important knowledge is in practice hidden in these biclusters. Originality/value – The novelty of the new method lies in the fact that the detected biclusters can be conveniently translated into if-then rules. It provides an intuitively explainable and comprehensive approach to extracting rules from high-dimensional data while keeping high classification accuracy.


Introduction
Gene expression is the process by which information from a gene is used in the synthesis of a functional gene product. These products are often proteins, but in non-protein coding genes such as ribosomal RNA, transfer RNA or small nuclear RNA genes, the product is a functional RNA. Gene expression data form a data matrix that represents the expression levels of different genes under specific conditions simultaneously. Each element is a real number, which is often the logarithm of the relative abundance of the mRNA of the gene (Madeira and Oliveira, 2004). Usually, in the data matrix, genes are arranged along the rows, while the columns represent different times or different environmental conditions.
As an array contains tens of thousands of probes, a microarray experiment can accomplish many genetic tests in parallel. The acquired gene expression data are a typical kind of high-dimensional data. Because of redundant features and noise, the huge number of features or attributes greatly reduces the predictive power and interpretability of the models applied to analyze the data. Finding the useful rules or knowledge hidden in the data is both important and challenging. Feature selection is usually a necessary step to facilitate further processing, which is especially true for high-dimensional data (Gorzałczany and Rudziński, 2017). In machine learning and statistics, attribute selection is the process of selecting the smallest subset of informative attributes for classification, rule extraction and other applications (Shrivastava and Barua, 2015).
Rule-based expert systems are often applied to classification problems in various application fields, such as fault detection, biology and medicine (Dahal et al., 2015; Shrivastava and Barua, 2015). In Roubos et al.'s (2003) study, the authors showed that compact, accurate and interpretable fuzzy rule-based classifiers can be obtained from labeled observation data. To achieve this, an iterative approach for developing fuzzy classifiers was proposed. The initial model was derived from the data; subsequently, feature selection and rule-base simplification were applied to reduce the model, while a genetic algorithm was used for parameter optimization. Moreover, researchers have proposed different optimization-based methods, such as ant colony optimization and particle swarm optimization, to extract rules (Chen et al., 2015; Indira and Kanmani, 2015).
Support vector machines (SVMs) are learning systems based on statistical learning theory and exhibit good generalization ability on different kinds of real data sets (Han et al., 2015). As the study of SVMs has progressed, the SVM has gradually become a leading machine learning technique and has been applied in a wide range of areas such as bioinformatics, pattern recognition and text classification (Shi et al., 2015). Researchers interested in this topic can easily access many free software packages and toolboxes. However, the results given by SVMs are usually difficult to explain. In safety-critical or medical applications, an explanation capability is an absolute requirement. A rule extraction method based on SVMs was proposed in Núñez et al.'s (2002) study. The authors introduced an SVM plus prototypes procedure for rule extraction, which gives SVMs an explanation ability. Once the decision function had been determined by means of an SVM, a clustering algorithm was used to determine prototype vectors for each class. These points were combined with the support vectors using geometric methods to define ellipsoids in the input space with minimum overlap between classes, which were later translated into if-then rules.
One important analysis task for microarray data concerns the simultaneous identification of groups of genes that show similar expression patterns across specific groups of experimental conditions (Wang et al., 2014; Maulik et al., 2015). Most of the time, it is not the sample vectors as a whole that show strong coherence with each other, but the elements at some specific positions among different sample vectors that show local similarity (Valarmathi et al., 2015). Besides classical clustering methods such as hierarchical clustering, biclustering has in recent years become a popular approach to analyzing biological data sets, and a wide variety of algorithms and analysis methods have been published (Czibula et al., 2015; Shinde and Kulkarni, 2016; Indira and Kanmani, 2015).
Such applications can be addressed by a biclustering process whose aim is to discover biclusters (Cheng and Church, 2000). A bicluster is a subset of genes and conditions of the original expression matrix in which the selected genes present a coherent behavior under all the experimental conditions contained in the bicluster. In other words, the data in the same bicluster show a high degree of local similarity. The difference between a bicluster and a submatrix is that all biclusters are submatrices, but only those submatrices whose row or column vectors satisfy some kind of linear relation are treated as biclusters. Biclustering algorithms are data processing algorithms that find those submatrices of the original data matrix that show local similarity. This technology has found numerous applications in research and applied areas such as biology, drug discovery, toxicological studies and disease diagnosis (Alon et al., 1999; Alizadeh et al., 2000; Golub et al., 1999; Pomeroy et al., 2002).
However, the number of biclusters lying in the data, their sizes and the spatial relations among them are completely unknown and strongly data dependent (Rabia et al., 2016). In Kaiser and Leisch's (2008) study, the authors introduced an R package that contains a collection of biclustering algorithms, preprocessing methods for two-way data, and validation and visualization techniques for bicluster results. In Amela et al.'s (2006) study, the authors provided the Biclustering Analysis Toolbox, BicAT, as a software platform for clustering-based data analysis that integrates various biclustering and clustering techniques within a common graphical user interface. Furthermore, BicAT provides different facilities for data preparation, inspection and post-processing, such as discretization, filtering of biclusters according to specific criteria and gene pair analysis for constructing gene interconnection graphs. The toolbox is described in the context of gene expression analysis but is also applicable to other types of data. The authors compared different biclustering techniques with each other with respect to the biological relevance of the clusters as well as other characteristics such as robustness and sensitivity to noise (Shi et al., 2015; Maulik et al., 2015).
Once the biclusters have been detected by applying the biclustering algorithm, the problem is how to translate them into the corresponding rules. In fact, this can be easily implemented in combination with the data discretization schemes. As each bicluster is a submatrix, the row and column numbers that it covers are known. As each experimental condition can be treated as an attribute, and all the column numbers of the bicluster can be used as the prerequisites of a rule, a bicluster detection procedure can also be regarded as an attribute selection process. This kind of rule extraction provides a comprehensive and interpretable approach compared with the other methods while keeping high classification accuracy.

The proposed method
For many pattern recognition problems, the relation between the samples and the classes is known during the acquisition procedure of the data. This kind of data is called labeled data. Suppose a bicluster B is composed of the row number set $\{i_1, i_2, \ldots, i_m\}$ and the column number set $\{j_1, j_2, \ldots, j_n\}$ of a labeled data matrix D. Then the function $U(B) = \{i_1, i_2, \ldots, i_m\}$ is defined as the set of row numbers of D in which the elements of B lie, with $|U(B)| = m$; the function $W(B) = \{j_1, j_2, \ldots, j_n\}$ is defined in the same way for the column numbers. The position of a bicluster B in the original data matrix D is determined by U(B) together with W(B).
Usually, a behavioral similarity shown by the data in the same class is called a rule. A rule is applicable only to the sample vectors in the same class. A similarity of sample vectors that spans the class boundary cannot be regarded as knowledge for distinguishing data among different classes. This means that biclustering results on labeled data are meaningful only with respect to the class labels. Biclustering D directly, without considering the labels of the sample vectors, does not help to find the biclusters that will eventually be translated into rules.
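The notation above can be illustrated with a minimal Python sketch. The class name `Bicluster`, the helper `submatrix` and the small matrix `D` are hypothetical and chosen only for illustration; indices are 0-based here, unlike the 1-based row and column numbers in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bicluster:
    rows: frozenset  # U(B): row numbers of D covered by B
    cols: frozenset  # W(B): column numbers of D covered by B


def U(b):
    """Return the row number set of bicluster b."""
    return b.rows


def W(b):
    """Return the column number set of bicluster b."""
    return b.cols


def submatrix(D, b):
    """Extract the elements of D located by U(b) together with W(b)."""
    return [[D[i][j] for j in sorted(b.cols)] for i in sorted(b.rows)]


# Hypothetical discretized data matrix: 3 samples, 4 attributes.
D = [
    [1, 2, 3, 1],
    [1, 2, 2, 1],
    [3, 2, 1, 1],
]
B = Bicluster(rows=frozenset({0, 1}), cols=frozenset({0, 1, 3}))
values = submatrix(D, B)
```

Here `len(U(B))` plays the role of |U(B)| = m and `len(W(B))` the role of |W(B)| = n.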

Data discretization
The flowchart of the new rule extraction method is illustrated in Figure 1. It is mainly composed of four sequential procedures: data discretization, biclustering, bicluster significance evaluation and rule translation based on the discretization schemes. Data discretization is a technique that partitions continuous attributes into a finite set of adjacent intervals to generate attributes with a small number of distinct values (Kurgan and Cios, 2004). Discretization algorithms have played an important role in data mining and knowledge discovery (Tsai et al., 2008). They not only produce a concise summarization of continuous attributes that helps experts understand the data more easily but also make learning more accurate and faster (Oliveira, 1999).
Assuming that a dataset consists of M examples and S target classes, a discretization algorithm discretizes the continuous attribute a in this dataset into n intervals $\{[d_0, d_1], (d_1, d_2], \ldots, (d_{n-1}, d_n]\}$, where $d_0$ is the minimal value and $d_n$ is the maximal value of attribute a. Such a discrete result is called a discretization scheme on attribute a. This discretization scheme should keep a high interdependency between the discrete attribute and the target class, so as to avoid changing the distribution of the original data.
As introduced before, each column of D can be considered as an attribute or feature, no matter what real physical meaning it has. If a sample vector $V \in D$ has value $V(a)$ with respect to the attribute a, then the discretized value $V_D(a)$ of V on a is determined by the discretization scheme. For example, if it is known that $V(a) \in (d_i, d_{i+1}]$, $i = 0, 1, \ldots, n-1$, then after discretization the value of $V_D(a)$ will be $i + 1$.
The wine data contain the chemical analysis of 178 wines produced in the same region in Italy but derived from three different cultivars. The problem is to distinguish the three different types based on 13 continuous attributes derived from chemical analysis: alcohol, malic acid, ash, alcalinity of ash, magnesium, total phenols, flavanoids, nonflavanoid phenols, proanthocyanins, color intensity, hue, OD280/OD315 of diluted wines and proline (Roubos et al., 2003). In Figure 2, the original wine data with all 13 attributes are shown. The wine data are also a kind of high-dimensional data, even though their dimensionality is lower than that of real gene expression data. They have been widely used in machine learning research on topics such as attribute selection, pattern recognition and rule extraction. The discretized wine data with all 13 attributes are shown in Figure 3. The data discretization schemes are listed in Table I, where the method proposed in Tsai et al.'s (2008) study is applied.
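The mapping from a continuous value to its discretized value can be sketched as follows. This is only an illustration: the function name `discretize` and the cut points are hypothetical and are not the values of Table I. The sketch assumes that the minimal value $d_0$ itself belongs to the first interval.

```python
from bisect import bisect_left


def discretize(value, cut_points):
    """Map value to i + 1 when value lies in (d_i, d_{i+1}].

    cut_points is the list [d_0, d_1, ..., d_n] of a discretization
    scheme; d_0 itself is assigned to the first interval (assumption).
    """
    d = cut_points
    if not d[0] <= value <= d[-1]:
        raise ValueError("value lies outside the discretization scheme")
    # Search only among the interior boundaries d_1 .. d_{n-1};
    # bisect_left keeps a value equal to d_{i+1} in interval i + 1.
    return bisect_left(d, value, 1, len(d) - 1)


# Hypothetical scheme with n = 3 intervals: [0, 1], (1, 2], (2, 3].
scheme = [0.0, 1.0, 2.0, 3.0]
levels = [discretize(v, scheme) for v in (0.0, 0.5, 1.0, 1.5, 2.0, 3.0)]
```

Using `bisect_left` rather than `bisect_right` makes each interval closed on the right, which matches the half-open intervals $(d_i, d_{i+1}]$ in the text.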

Criterion for rule extraction
The symbol $B_j|_{\omega}$, $j = 1, 2, \ldots, m$, is used to represent the biclusters detected within the sample vectors belonging to the same class $\omega$. Then $|U(B_j|_{\omega})|$ and $|W(B_j|_{\omega})|$ are two very important factors that can be used to evaluate the significance of the bicluster $B_j|_{\omega}$. We expect the number of rules to be as small as possible, and this requirement has a twofold meaning. First, the smallest number of rules means that the nature of the data has been well grasped by the rules. Second, the rules should constrain all of the vectors within the same class $\omega$. In other words, the rules should apply to all of the vectors in the same class. The above analysis gives the first rule selection criterion, stated as equation (1). In equation (1), the subscript variable m is the number of biclusters within the same class $\omega$ that have been used to extract the corresponding rules. Undoubtedly, the value of m should be as small as possible.
Usually, a huge number of biclusters can be detected. The significance of each detected bicluster must be evaluated by calculating its information entropy. Based on this, all these biclusters can be sorted in decreasing order of significance. Applying the first rule selection criterion, the minimum number m can be determined. It must be pointed out that, since every bicluster corresponds to a rule, the minimum number of rules that express the knowledge hidden in the class $\omega$ well is also m. Here, we also draw an important conclusion: the least number of rules providing 100 per cent recognition accuracy is data-dependent. We also give a way to determine its exact value. Moreover, when the number of rules is fixed, the new algorithm provides a distinct and convenient way to find the rules while assuring the maximum accuracy.
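One possible reading of this selection step is sketched below, under stated assumptions. The biclusters of one class are assumed to be already sorted by decreasing significance, and biclusters are taken in that order until every sample of the class is covered. The function name `select_biclusters` and the row sets are hypothetical, and this greedy pass is an illustrative simplification that does not in general guarantee the smallest possible m.

```python
def select_biclusters(ranked_rows, class_rows):
    """Pick biclusters in ranked order until the class is covered.

    ranked_rows: list of row-number sets U(B_j), sorted by decreasing
                 significance (assumed to be computed elsewhere).
    class_rows:  set of row numbers of all samples in the class.
    Returns the indices of the chosen biclusters and the rows covered.
    """
    covered = set()
    chosen = []
    for index, rows in enumerate(ranked_rows):
        new_rows = (rows & class_rows) - covered
        if new_rows:  # keep only biclusters that cover new samples
            chosen.append(index)
            covered |= new_rows
        if covered == class_rows:
            break
    return chosen, covered


# Hypothetical class with five samples and four ranked biclusters.
ranked = [{0, 1, 2}, {1, 2}, {3, 4}, {4}]
chosen, covered = select_biclusters(ranked, {0, 1, 2, 3, 4})
m = len(chosen)  # number of rules needed for this class in the sketch
```

If the loop ends with `covered` smaller than `class_rows`, the detected biclusters cannot explain every sample of the class, which corresponds to a recognition accuracy below 100 per cent.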
The rule extraction method introduced above is carried out within each class separately. Whether the combination of rules from different classes can be applied to the whole data must be considered carefully. If each column of the data D is treated as an attribute, then it usually takes different values. The information entropy corresponding to each attribute a within the same class $\omega$ can be calculated from the values that a takes in that class. If all of the sample vectors have the same value on the attribute a, then a is of the least importance.

Bicluster significance evaluation
How to select the significant biclusters among the huge number of detected biclusters is very important. Given a data matrix D, each row vector of D can be considered as a sample vector. Assume that the number of sample vectors in D is N and that these samples belong to different classes $\omega_i$, $i = 1, 2, \ldots, M$. Let $N_i$ be the number of samples in class $\omega_i$; then $p_i$, the probability that a sample belongs to the class $\omega_i$, can be estimated by $N_i/N$. The expected information entropy provided by D takes the standard form:

$$E(D) = -\sum_{i=1}^{M} p_i \log_2 p_i$$

If an attribute A takes k values $\{a_1, a_2, \ldots, a_k\}$, then the whole sample set can be partitioned into k subsets $S_1, S_2, \ldots, S_k$ using only attribute A. Let $N_{ij}$ be the number of samples in the subset $S_j$ that belong to the class $\omega_i$; then the information entropy of the result classified by attribute A is defined as:

$$I(A) = -\sum_{j=1}^{k} \frac{|S_j|}{N} \sum_{i=1}^{M} p_{ij} \log_2 p_{ij}$$

where $p_{ij} = N_{ij}/|S_j|$ is the probability that a sample lying in the subset $S_j$ belongs to the class $\omega_i$. The whole information gain acquired by attribute A is:

$$\mathrm{Gain}(A) = E(D) - I(A)$$

Assume that a bicluster B is found when biclustering D. The more rows B covers and the fewer attributes B has, the more meritorious B is. Based on this, we define a weight to indicate the importance of the information provided by B.
Here, $|U(\omega)|$ is the number of samples belonging to the class $\omega$ in which the bicluster B is found. Equation (3) only shows how to calculate the information entropy for one attribute. As each bicluster B satisfies $|W(B)| \geq 2$, we have to take the information entropy of $|W(B)|$ attributes into account. For any two different attributes $A_1$ and $A_2$, if $I(A_1) < I(A_2)$, then the values taken by $A_1$ are more regular than the values taken by $A_2$. When these two attributes are applied to classify an unknown input sample, the result based on $A_1$ will be more accurate than that based on $A_2$. Considering all of the aforementioned analysis, we define an index to evaluate the significance of the bicluster B.
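The entropy quantities discussed in this section can be sketched in Python as follows. The sketch uses the standard Shannon entropy with base-2 logarithms, which is an assumption about the form intended here; the function names and the toy labels are hypothetical. The bicluster weight and the final significance index are not reproduced, as their exact form is not available in the text.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy of the class labels, with p_i estimated by N_i / N."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def attribute_entropy(values, labels):
    """Entropy after splitting the samples by the values of one attribute.

    Each subset S_j contributes its own class entropy weighted by |S_j| / N.
    """
    n = len(labels)
    subsets = {}
    for value, label in zip(values, labels):
        subsets.setdefault(value, []).append(label)
    return sum(len(s) / n * entropy(s) for s in subsets.values())


def information_gain(values, labels):
    """Entropy of the whole data minus the entropy after the split."""
    return entropy(labels) - attribute_entropy(values, labels)


# Toy data: four samples, two classes, two discretized attributes.
labels = ["w1", "w1", "w2", "w2"]
regular = [1, 1, 2, 2]    # separates the classes perfectly
irregular = [1, 2, 1, 2]  # carries no class information
gain_regular = information_gain(regular, labels)
gain_irregular = information_gain(irregular, labels)
```

In this toy example, the attribute with the lower split entropy is the more regular one, in line with the comparison of $I(A_1)$ and $I(A_2)$ above.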

Translation of biclusters to rules
Assume there is a bicluster $B_j|_{\omega}$ lying in the class $\omega$. The bicluster $B_j|_{\omega}$ can be conveniently translated into the corresponding rule with the help of the data discretization schemes. As there are $|W(B_j|_{\omega})| = n$ attributes in $B_j|_{\omega}$, the translated rule has n antecedents, which are related to the attributes $a_{j_1}, a_{j_2}, \ldots, a_{j_n}$, respectively. Here, the attribute $a_{j_1}$ is used for explanation. Given the data discretization scheme on attribute $a_{j_1}$, and knowing how the data discretization works, the discretized value $j_1$ taken by the attribute $a_{j_1}$ means that the original value of $a_{j_1}$ before discretization belongs to the range $(d_{j_1-1}, d_{j_1}]$. In this way, the first antecedent of the rule is: if $a_{j_1} \in (d_{j_1-1}, d_{j_1}]$. Continuing this processing up to the attribute $a_{j_n}$ determines the full description of the rule.
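The translation step can be sketched as follows, assuming that every column of the bicluster holds a single discretized value shared by all of its rows. The function name `bicluster_to_rule`, the attribute names and the cut points are hypothetical and are not taken from Table I.

```python
def bicluster_to_rule(levels, schemes, class_label):
    """Translate a bicluster into an if-then rule string.

    levels:  {attribute name: discretized value shared by the bicluster rows}
    schemes: {attribute name: [d_0, d_1, ..., d_n]} discretization schemes
    """
    antecedents = []
    for name, level in levels.items():
        d = schemes[name]
        low, high = d[level - 1], d[level]
        # The first interval also contains the minimal value d_0 (assumption).
        left = "[" if level == 1 else "("
        antecedents.append(f"{name} in {left}{low}, {high}]")
    return "IF " + " AND ".join(antecedents) + f" THEN class = {class_label}"


# Hypothetical schemes and a bicluster covering two attributes.
schemes = {
    "alcohol": [11.0, 12.5, 13.5, 15.0],
    "hue": [0.4, 0.9, 1.8],
}
rule = bicluster_to_rule({"alcohol": 3, "hue": 1}, schemes, class_label=1)
print(rule)
```

Each antecedent is read directly from the scheme: a discretized value j on an attribute becomes the interval between the cut points $d_{j-1}$ and $d_j$.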

Computation example
The well-known wine data are used as the experimental data to illustrate the feasibility and effectiveness of the proposed method. There are 59 samples in class $\omega_1$, 71 samples in class $\omega_2$ and 48 samples in class $\omega_3$, for a total of 178 samples. Each sample vector has 13 numerical attributes whose values differ noticeably from each other. Compared with real gene expression data, the wine data have a smaller dimensionality, but both are numerical data with many attributes (Wang et al., 2014). Using the wine data saves a lot of time in verifying the feasibility of the proposed algorithm without changing the nature of the research data. It also facilitates comparison among different research results.
Here, the data discretization method proposed in Tsai et al.'s (2008) study is applied, and the discretization schemes are listed in Table I. The discretized wine data are illustrated in Figure 4. As there are three classes, the discretized data within each class are shown as a separate picture. According to the data discretization scheme, each of the 13 attributes is discretized into at most three intervals, which means the discretized data are composed of only three different numbers: 1, 2 and 3. Each number is represented by a colored square for intuitive illustration. The whole processing follows the procedures shown in Figure 1.
As these three biclusters together cover 173 sample vectors out of the 178 sample vectors, the three extracted rules offer a recognition accuracy of 97.19 per cent.
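The reported accuracy follows from the sample counts given in this section, as the short check below illustrates. The variable names are hypothetical; the numbers are those stated in the text.

```python
# Class sizes of the wine data as stated in the text.
class_sizes = {"w1": 59, "w2": 71, "w3": 48}
total = sum(class_sizes.values())

# Number of sample vectors covered by the three selected biclusters.
covered = 173

# Recognition accuracy in per cent: covered samples over all samples.
accuracy = 100 * covered / total
print(f"{accuracy:.2f} per cent")
```

The five uncovered sample vectors are the ones that none of the three rules explains.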

Conclusions
Rule extraction methods, which try to find the useful knowledge hidden in high-dimensional data, are very useful. The so-called rules are, in practice, the coherent similarity that some sample vectors in the data show with each other. Because the data in the same bicluster are closely related to each other, the transition from bicluster to rule has a natural consistency. As the elements of a bicluster lie in the original data and can be conveniently translated into a corresponding rule, the results of the new method have good explanation ability. The new method differs from other methods in that it applies a biclustering algorithm to discover the locally similar biclusters that exist in the original data matrix.
Among the large number of detected biclusters, a specially modified information entropy calculation method is provided to evaluate the significance of each one. All of the biclusters can then be sorted in decreasing order according to their information entropy values. In this way, the most significant biclusters within the samples of each class can be selected to extract the rules. We will try to deal with more types of data and compare the results with those in the existing literature. Processing of real gene expression data is ongoing and will be presented in future work.