rpartOrdinal : An R Package for Deriving a Classiﬁcation Tree for Predicting an Ordinal Response

This paper describes an R package, rpartOrdinal , that implements alternative splitting functions for ﬁtting a classiﬁcation tree when interest lies in predicting an ordinal response. This includes the generalized Gini impurity function, which was introduced as a method for predicting an ordinal response by including costs of misclassiﬁcation into the impurity function, as well as an alternative ordinal impurity function due to ? that does not require the assignment of misclassiﬁcation costs. The ordered twoing splitting method, which is not deﬁned as a decrease in node impurity, is also included in the package. Since, in the ordinal response setting, misclassifying observations to adjacent categories is a less egregious error than misclassifying observations to distant categories, this package also includes a function for estimating an ordinal measure of association, the gamma statistic.


Introduction
For many high-throughput genomic studies, the phenotype to be predicted is ordinal.Some examples of ordinal responses include the more recently advocated method for evaluating response to treatment in target tumor lesions, known as the response evaluation criteria in solid tumors (RECIST) method, with ordinal outcomes defined as complete response > partial response > stable disease > progressive disease.Moreover, most histopathological measures are ordinal, such as scoring methods for liver biopsy specimens from patients with chronic hepatitis, including the Knodell hepatic activity index, the Ishak score, and the METAVIR score.Statistical methods such as adjacent category, proportional odds, and continuation ratio models (?) are traditionally used when modeling an ordinal response, though they fail for high-throughput genomic datasets when the number of covariates, , exceeds the number of observations, .
An alternative class prediction method, classification trees (CTs), is capable of predicting a response when the << (?).Suppose independent observations to be classified are characterized by a -dimensional vector of predictors = ( 1 , 2 , . . ., ) and each observation falls into one of classes.Let denote the class with = 1 representing observations in class 1, = 2 representing class 2,. .., and = representing class .When deriving a CT, all observations start together in the root node, .Then, for predictors 1, 2, . . ., , the optimal split is determined, where optimality is defined as that split resulting in the largest decrease in node impurity.
For node , the optimal split divides the observations to the left and right descendent nodes, and , respectively, and the proportion of cases in each of the classes within these nodes are called the node proportions, that is, ( | ) for = 1, . . ., such that ( 1 | ) + ( 2 | ) + . . .+ ( | ) = 1.For nominal response classification, the within-node impurity measure most commonly used is the Gini criterion (?), defined as (1) This is the default impurity function in the R programming environment rpart package (??) for predicting a nominal class response.However, use of this impurity function does not take advantage of the additional information present when the response is ordinal.To that end, the generalized Gini impurity function (?), which factors in ( | ), the cost of misclassifying a class observation as belonging to class , has been proposed for ordinal response prediction.
Another proposed ordinal impurity function for deriving an ordinal response classification tree based on a measure of nominal-ordinal association (?) that does not require the assignment of costs of misclassification is where ( | ) = ∑ =1 ( | ) (?).A splitting method for ordinal response prediction that is not impurity-based is the ordered twoing method (?).Though this method was described in the seminal book by ? and has been implemented in the CART Software by Salford Systems (??), it has not yet been implemented in R. Ordered twoing proceeds by reformulating the ordinal response as a vector of dichotomous responses, where for each observation , the ℎ dichotomous response is taken to be For node and dichotomous response , the split that maximizes over the covariates is taken to be the best split for that dichotomous response .Subsequently, the split associated with the dichotomous response * = arg max ( , , ) is then selected for splitting node .In this paper, we describe the rpartOrdinal R package, which implements ordered twoing, the generalized Gini, and the ordinal impurity splitting 1 bwt > 3500 2 3000 < bwt ≤ 3500 3 2500 < bwt ≤ 3000 4 bwt ≤ 2500 Table 1: Ordinal response levels for low birthweight (Category).
methods.These splitting methods should be considered for use when deriving an ordinal response classification tree.
For nominal response prediction, misclassification rates are often examined as a means for assessing the performance of the classifier.For ordinal response prediction problems, it may be of more interest to estimate the gamma statistic as an ordinal measure of association between the observed and predicted responses as a means for gauging the success of ordinal classification.Briefly, the association between two ordinal variables and can be estimated by the gamma statistic (?), where given the cross-tabulation matrix of and having rows and columns, the number of concordant pairs for cells (1, 1) to ( − 1, − 1) is given by Similarly, the number of discordant pairs for cells (1, 2) to ( − 1, ) is given by We then let = ∑ −1 such that the gamma statistic of ordinal association is defined as An R package, rpartOrdinal, that implements the described ordinal splitting methods as well as a function for estimating the gamma statistic as an ordinal measure of association is available for download from the Comprehensive R Archive Network at http://cran.r-project.org/package=rpartOrdinal.

Illustrative datasets 2.1. Low birthweight dataset
The lowbwt dataset was downloaded from ftp://ftp.wiley.com/public/sci_tech_med/logistic/ and includes birthweight and associated risk factors measured on 189 women as described by (?).For illustrative purposes, an ordinal response variable (Category) will be derived from birthweight as defined in Table ?? and added to the lowbwt dataset.
In addition to this ordinal response, the dataset includes variables listed in

Gene expression in B-cell acute lymphocytic leukemia
As an example using a high-througput genomic dataset, the acute lymphoblastic leukemia ALL dataset was downloaded from the Bioconductor experiment repository http://www.bioconductor.org/packages/release/data/experiment/html/ALL.html, which includes gene expression microarray data for 128 patients, 95 with B-cell and 33 with T-cell leukemia (??).Patients with B-cell acute lymphocytic leukemia (B-ALL) are staged according to whether or not the leukemic cells express different antigens (e.g., CD19, HLA-DR, CD10) or immunoglobins (e.g., surface immunoglobin or cytoplasmic immunoglobins).The stages are ordered, such that B1 represents early pre-B ALL (do not express CD10); B2 represents disease more advanced than B1 as CD10 is expressed, but not yet meeting criteria stages B3-B4; B3 represents common ALL, where the CD10 antigen is expressed but still lacking IgM; and B4 represents pre-B ALL, where CD10 and IgM are expressed.B-ALL stage is clinically important as it is one of several factors used to plan treatment.Among the 95 B-ALL patients in the publicly available dataset, 90 were staged (19 B1, 36 B2, 23 B3, and 12 B4 patients).Note that this dataset requires more memory than is typically available on a standard Windows PC.The B-ALL dataset was therefore analyzed using a MacBook Pro laptop (Mac OS X 10.5.7) having 8GB RAM.

Implementation
The rpartOrdinal package was written in the R programming environment (?) and depends on the rpart package (?).Currently, rpart includes methods for deriving regression, classification, and survival trees.Due to the method= option in rpart, users can define their own splitting methods for use in conjunction with the rpart function.A user defined method passed to the method= option must be a list consisting of three functions named eval, split, and init.
Since previous research comparing the ordinal splitting methods to traditional methods for single trees and for bootstrap aggregating classification trees has demonstrated that when the response to be predicted is ordinal, an ordinal splitting method is usually preferred (??), we have implemented three ordinal splitting methods, namely, ordered twoing, ordinal impurity, and generalized Gini for use in conjunction with rpart.

Ordered twoing
The ordered twoing splitting criteria in Equation ?? has been implemented as a callable method in rpart.Here we derive an ordinal classification tree for predicting the ordinal response in the low birthweight dataset, Category, using ordered twoing by additionally specifying method=twoing.Such a CT may be useful for exploring factors related to poor neonatal outcomes.
R> plot(otwoing.rpart)R> text(otwoing.rpart,pretty = TRUE) The additional pretty = TRUE argument to the text function does not abbreviate factors as alphanumeric characters if they appear in the tree.Alternatively, the post() function can be used for producing postscript files containing the tree topology which more extensively labels the splits and predicted class for each node.

Ordinal impurity function
As with the twoing function, the ordinal impurity function in Equation ?? has been implemented as a callable method in rpart.That is, within the rpart function, the user should specify method=ordinal to fit an ordinal response classification tree using Equation ??.
R> ordinal.rpart<-rpart(Category ˜age + lwt + race + smoke + ptl + ht + ui + ftv, data = lowbwt, method = ordinal) R> plot(ordinal.rpart)R> text(ordinal.rpart,pretty = TRUE) The ordered twoing and ordinal tree topologies are very similar with exception that node 38 is split by lwt in the ordinal tree whereas this same node is split by age in the ordered twoing tree, with the descendent node 77 splitting variable also differing.

Generalized Gini impurity
The generalized Gini impurity function in Equation ?? has been implemented in this package by allowing the user to specify a loss.matrixparameter in the optional parms argument within the rpart call.The loss.matrix parameter accepts either "linear" or "quadratic" for using either linear or quadratic loss, respectively.The specific syntax for the low birthweight example using the linear loss follows.

Gamma statistic
The ordinal.gamma function estimates the gamma statistic, which is a measure of the strength of the association of the cross-tabulation of two ordinal variables.The following example replicates Table 2 R> ordinal.gamma(job.satis,income) [1] 0.2211009 Returning to the ordinal classification trees for the low birthweight dataset, it may be of interest to estimate the gamma statistic as an ordinal measure of association between the observed and predicted ordinal responses.However, estimating the gamma statistic as an ordinal measure of association for the training data will not provide useful information regarding how well the predictor may generalize when presented with new data.Therefore, cross-validation methods may be used.The following code was used to perform five-fold cross-validation where the observations included in the ℎ fold are stored in the ℎ component of groups.
Ordinal impurity 0.446 Ordered twoing 0.436 Generalized Gini with linear cost of misclassification 0.345 Generalized Gini with quadratic cost of misclassification 0.402 Table 3: Five-fold cross-validation estimate of the gamma ordinal association measure between the observed and predicted ordinal response for the low birthweight dataset.

Gene expression in B-ALL
Here we demonstrate application of the ordinal classification methods for predicting B-ALL stage.
R> library("rpartOrdinal") R> library("ALL") R> data("ALL") The class object ALL is an ExpressionSet, developed by the Bioconductor project (?) as a container for high-throughput genomic datasets.This object includes both a × matrix gene expression data, where represents the number of probesets (i.e., genes) interrogated by the high-throughput assay and represents the number of samples processed, and an × data frame of phenotypic data, where again represents the number of samples and represents the number of phenotypic variables.The gene expression matrix can be extracted from the ALL object using the exprs() extractor function while the phenotypic data can be accessed using the pData() extractor function.As described in section 2.2, the ALL object includes gene expression and phenotypic data for 128 patients, 95 with B-cell and 33 with T-cell leukemia.In this example, we will restrict attention to only those patients with B-cell leukemia who were also staged as either B1, B2, B3, or B4.The pData(ALL) object includes a vector BT which stores the type (B or T) and stage (1, 2, 3, or 4) of disease and can be used for subsetting the ALL object.The following line of code was used to restrict the dataset to patients staged as B1, B2, B3, or B4.Further information on the phenotypic variables stored in the ALL object can be obtained by issuing ?ALL.
R> stage<-factor(pData(BALL)$BT,levels=c("B1","B2","B3","B4"),ordered=TRUE) Ordinal CTs predicting disease stage may be useful for exploring genetic mechanisms that lead to B-ALL progression.Prior to fitting a CT, a data frame consisting of the ordinal outcome stage and the transposed × gene expression matrix must be constructed.
R> Bcell<-data.frame(t(exprs(BALL)),stage) Once the data frame has been constructed, ordinal classification trees may be fit using syntax similar to the lowbwt example.The following syntax was used to fit CTs using ordered twoing, ordinal, generalized Gini with linear loss, and generalized Gini with quadratic loss, respectively.Interestingly, all four methods split the root node using the same variable and cutpoint.For ordinal classification, it may be of interest to use five-fold cross-validation to estimate the gamma statistic as an ordinal measure of association between the observed and predicted ordinal responses.The following code was used for performing five-fold cross-validation.R> V <-5 R> n <-length(Bcell$stage) R> leave.out<-trunc(n/V) R> o <-sample(1:n) R> groups <-vector("list", V) R> for (j in 1:(V -1)) { jj <-(1 + (j -1) * leave.out)The gamma ordinal association measure for each of the four splitting methods for the B-ALL gene expression dataset are listed in Table ??.

Summary
Herein we have described the rpartOrdinal package which works in conjuction with the rpart package in the R programming environment.The package provides methods for fitting a CT when the response is ordinal.We note that another R package, party (?), can also be used to derive an ordinal conditional inference tree, where the variable selected for splitting a given node is determined using an inferential test (?).These methods may prove useful when the dataset to be analyzed includes an ordinal response and the number of covariates exceeds the sample size.In such situations, traditional ordinal response methods such as proportional odds models cannot be fit.Secondary data analyses are a natural and desirable by-product from publicly available databases such as Gene Expression Omnibus.In the high-throughput genomic setting, most attention has been focused on classification algorithms for dichotomous responses.We believe analysts will find the rpartOrdinal useful particularly when modeling ordinal responses for high-dimensional datasets.

Figure 4 :
Figure 4: CT for lowbwt using generalized Gini with quadratic cost of misclassification.

Table 2 :
Table ??.Description of covariates included in the low birthweight dataset.