dawai : An R Package for Discriminant Analysis With Additional Information

The incorporation of additional information to discriminant rules is receiving increasing attention as the rules including this information perform better than the usual rules. In this paper we introduce an R package called dawai, which provides the functions that allow to define the rules that take into account this additional information expressed in terms of restrictions on the means, to classify the samples and to evaluate the accuracy of the results. Moreover, in this paper we extend the results and definitions given in previous papers (Fernández et al. 2006, Conde et al. 2012, Conde et al. 2013) to the case of unequal covariances among the populations, and consequently define the corresponding restricted quadratic discriminant rules. We also define estimators of the accuracy of the rules for the general more than two populations case. The wide range of applications of these procedures is illustrated with two data sets from two different fields such as biology and pattern recognition.


Introduction
The incorporation of additional information, often available in applications, to multivariate statistical procedures through order restrictions is receiving increasing attention during the last years as it allows to improve the performance of the procedures.Good examples of this trend are the papers by Rueda et al. (2009), Fernández et al. (2012) and Barragán et al. (2013) where this information is used to improve statistical procedures for circular data applied to cell biology, ElBarmi et al. (2010) where the information is used for estimating cumulative incidence functions in survival analysis, Ghosh et al. (2008) where it is used to make inferences on tumor size distributions, or Davidov and Peddada (2013) where a test for multivariate stochastic order is applied to dose-response studies.
In this work, we deal with the incorporation of additional information to discriminant rules.Discriminant analysis is a well-known technique, first established by Fisher (1936), used in many science fields to define rules that allow to classify samples into a small number of populations based on a sample of observations whose population is known, usually called training set.To our best knowledge, the first paper considering additional information under the usual equal covariances assumption, which leads to linear discriminant rules, is Long and Gupta (1998).However, that paper provided limited results for the case of two populations with simple order restrictions and identity covariances matrices only.In a series of papers, the rules appearing in that initial paper have been improved, first to deal with more general types of information expressed in terms of cones of restrictions and general covariance matrices in Fernández et al. (2006) and later to the case of more than two populations in Conde et al. (2012).The robustness of the rules has also been studied in Salvador et al. (2008) and good estimators of the performance of the rules (which is an essential issue in discriminant analysis) have been provided in Conde et al. (2013).From now on, we will refer to these rules as restricted rules as the additional information is incorporated through restrictions on the populations means.
The purpose of the present paper is two fold.The first is to introduce the dawai package, programmed in R environment, which can be downloaded from http://cran.r-project.org/web/packages/dawai/.This package provides all the functions needed to take advantage of the rules that incorporate additional information.The functions in the package allow to define the restricted rules, to classify the samples and to evaluate the accuracy of the results.The second contribution of this paper is the extension of the ideas given in previous papers from the case of equal covariances in the different populations to the case of unequal covariances among the populations and consequently the definition of the corresponding restricted quadratic discriminant rules, and also the definition of estimators of the accuracy of the rules for the general case where more than two populations appear in the problem.
In Section 2 we describe the statistical problem and the methodology of Fernández et al. (2006), Conde et al. (2012) and Conde et al. (2013), which we extend to the above mentioned situations.In Section 3 we introduce the dawai package and explain some details about the main user-lever functions it includes.The wide range of applications of the dawai package is illustrated in Section 4 using two data sets coming from two different fields such as biology and pattern recognition.Some concluding remarks are provided in Section 5.

Discriminant analysis with additional information
We consider a finite number k ≥ 2 of distinct populations of items Π 1 , . . ., Π k .Each item is assumed to belong to one and only one of the populations.Let Z be a categorical variable identifying the population and let X = (X 1 , . . ., X p ) be the p-dimensional vector of predictors.Denote also as P XZ the joint distribution of (X, Z), and as P j the distribution of X in population Π j with density function f j , j = 1, . . ., k.The classical discrimination problem deals with the classification of an observation U = (U 1 , . . ., U p ) , whose origin is unknown, into one of those populations.If we consider a 0-1 loss function and a priori probability π j for the population Π j , j = 1, . . ., k, it is well known that the optimal classification rule, also called Bayes rule, is given by: In applications, the density functions f j , j = 1, . . ., k, are unknown although there is sample information available.This sample information is contained in the so-called training sample given by a set of items for which both the predictors values and the correct population they belong to are registered.We represent the training sample as where n is the items sample size, Y i is the value that vector X takes at the i-th item in the sample and Z i is the population the i-th item belongs to.Then, a classification rule is an application R n : {R p × {1, . . ., k}} n × R p → {1, . . ., k}, that assigns a new observation U ∈ R p for which the population is unknown to one of the k populations, R n (M n , U ) ∈ {1, . . ., k}.
From now on, we assume that π j = 1 k , j = 1, . . ., k (the case of unequal a priori probabilities is a trivial extension).If we further assume multivariate normality, i.e., P j ∼ N p (µ j , Σ), j = 1, . . ., k, where µ j = (µ j1 , . . ., µ jp ) is the mean of vector X in population Π j , j = 1, . . ., k, and Σ is a common covariance matrix, then the optimal classification rule (the one with lowest expected loss) may be written as: Unfortunately, this rule cannot be used in practice as the mean vectors µ j , j = 1, . . ., k, and the common covariance matrix Σ are unknown.However, as we have a training sample these parameters may be estimated using respectively the sample vectors means Y j and the pooled sample covariance matrix S, where n j = n l=1 I (Z l =j) is the sample size of population Π j , j = 1, . . ., k, and n = k j=1 n j .As this estimated rule, obtained plugging the estimators in the initial rule, is linear in the predictors, it is usually known as linear discriminant rule or Fisher's rule: If the covariance matrices are not assumed to be equal, i.e., P j ∼ N p (µ j , Σ j ), j = 1, . . ., k, the optimal rule can be written as: Again, if we replace in this rule the unknown parameters µ j and Σ j by their corresponding estimators Y j and S j = 1 , we obtain a rule that depends on the predictor in a quadratic way and it is therefore known as the quadratic discriminant rule:

Restricted discriminant rules
In the introduction we referred applications where it is usual that some additional information is available.In many of these cases the information can be written as inequality restrictions among the population means.In the literature these restrictions are usually represented by a polyhedric cone (cf.Robertson et al. 1988 or Silvapulle andSen 2005).In our case, our pk-dimensional populations means will belong to a cone where the q vectors a j ∈ R pk , j = 1, . . ., q, are determined by the restrictions imposed on the means.
Polyhedral cones are widely used in restricted inference literature, because they cover the most interesting cases from a practical standpoint.Among these cones, those representing order relations among the means are especially interesting.For example, it is not unusual to know that the observations from one of the populations, for example, Π 1 (which may be the control population in a medical study), take, in mean, lower values that those coming from any of the other populations for a subset L ⊆ {1, . . ., p} of predictor variables.In the usual restricted statistical terminology, we can say that there is a "tree order" among the populations means of the variables in L. In this case, we can write Another usual situation appears when it is known that there is an increase in the means of a subset L of predictors (for example, due to increasing severity level in an illness study).This is known as a "simple order" among the populations means of the variables in L, and may be represented in R pk using the cone (5)

Restricted linear discriminant rules
As mentioned above, in this case we assume Σ j = Σ, j = 1, . . ., k. Fernández et al. (2006) deal with this situation when the number of populations is k = 2.They propose a family of classification rules whose expected loss (total probability of misclassification) is lower than that of the linear discriminant rule (1).These rules are based on the use of additional information to obtain alternative estimators of the vector means.The generalization to the k > 2 populations case appears in Conde et al. (2012).These alternative estimators are defined via an iterative procedure whose convergence is shown in Fernández et al. (2006) and that is described here for completeness.
Consider the pk square matrix .
Definition 1 (Conde et al. 2012) For γ ∈ [0, 1], let µ γ be the limit value, when m −→ ∞, of the following iterative procedure: is the projection of Y ∈ R pk onto the cone C using the metric given by the matrix S −1 * , and The computation of the projection of a vector onto a polyhedral cone can be carried out using the lsConstrain.fitmethod contained in ibdreg (Sinnwell and Schaid 2013) R package.Figure 1 shows, in R 2 , the cones C and C P and the estimators defined when γ = 1 and Σ = I, exposing the need for an iterative procedure when C is an acute cone.
For more details on these restricted linear rules and their properties the reader is referred to Fernández et al. (2006) and Conde et al. (2012).

Restricted quadratic discriminant rules
The problem of incorporating additional information to classification rules when the covariance matrices cannot be assumed to be equal has not, to the best of our knowledge, been considered in the literature up to this date.This problem is important as equality among covariance matrices cannot be assumed in many applications.In this subsection, we extend the ideas appearing in the definition of restricted linear discriminant rules to the unequal covariances case, thus defining the corresponding restricted quadratic discriminant rules.The main novelty is the definition of the appropriate projection matrix for the computation of the restricted estimators.We have checked that the matrix that correctly extends the restricted linear discriminant rules is . Consequently, in this case for each γ ∈ [0, 1] the dawai: An R Package for Discriminant Analysis With Additional Information.
estimator µ γ = µ γ 1 , . . ., µ γ k of µ = (µ 1 , . . ., µ k ) ∈ C is obtained using the iterative procedure described in Definition 1, replacing the matrix S −1 * by S −1 * * .Again, the estimators of the means µ γ j and the covariances matrices S j are plugged into the original rule (2) to obtain the restricted quadratic discriminant rules R q (γ) : The computational complexity of these rules is clearly not significantly higher than that of the restricted linear discriminant rules, and the simulations studies we have performed show that, as we will see in the applications appearing in Section 4, this rule improves the corresponding unrestricted quadratic discriminant rule.
The rlda and predict.rldafunctions in R package dawai allow to define restricted linear discriminant rules and to classify samples, respectively.The rqda and predict.rqdafunctions are the corresponding versions for performing restricted quadratic discrimination.

True error rate estimation
From the applications point of view, the evaluation of the classification rule for a given training sample is even more important than the expected loss of the rule.The true error rate, E n , of the rule R n is the probability of misclassification of the rule given the training sample, i.e., It is well known that the best way of estimating the true classification error of a classification rule is the use of an independent and large enough sample, usually called test sample.However, in practice it is common that the sample size is not large enough to split the sample into a training and a test sample as that would decrease the efficiency of the rule.For this reason, the estimation of E n for the usual rules such as for example Fisher's linear rule (1), the quadratic discriminant rule (2), the nearest neighbors rules (Cover and Hart 1967) or random forest rules (Breiman 2001), is a widely studied topic in the literature.Parametric and non-parametric estimators of E n have been proposed and non-parametric estimators based on resampling have shown a good performance for the above mentioned rules.Schiavo and Hand (2000) summarizes the work made on this topic until that date.More recent references are, for instance, Steele and Patterson (2000), Wehberg and Schumacher (2004), Fu et al. (2005), Molinaro et al. (2005), Kim and Cha (2006), Kim (2009) or Borra and Di Ciaccio (2010).Conde et al. (2013) propose four new estimators of E n specific for the restricted linear discriminant rule R l (γ) for k = 2 populations.Two of them, BT 2 and BT 3, are generated from the leave-one-out bootstrap (LOOBT , see Efron 1983).The other two, BT 2CV and BT 3CV , are cross-validation after bootstrap (BCV , see Fu et al. 2005) versions of BT 2 and BT 3 respectively.In the following, we describe the generalization of these estimators to k > 2 populations and to restricted quadratic discrimination cases.This is the second theoretical novelty appearing in this paper.As happened when the restricted discriminant rule was extended to the more than 2 populations case in Conde et al. (2012), the extension is not immediate as the appropriate parameter spaces and projection matrices have to be considered.Following the arguments already appearing in Efron (1979) original paper or in Boos (2003), the underlying idea in the definition of the new estimators of the true error rate is that the "bootstrap world" should mirror the "real world".We present two proposals: the first one is to modify the restrictions cone, the second one is to adapt the training sample.

The BT2 and BT2CV estimators
Assume that the additional information is written as in (3).Let us denote as C the following random cone generated by the sample mean vectors Y = Y 1 , . . ., Y k : i.e., the cone determined by the restrictions verified by the sample means.
The true error rate estimator BT 2 of the restricted linear or quadratic classification rules (R l (γ), R q (γ)) is computed as follows.A bootstrap training sample

The BT3 and BT3CV estimators
The true error rate estimator denoted as BT 3 is based in adapting the original training sample, instead of modifying the cone C like in BT 2, as follows.
Assume that the original training sample M n = {(Y i , Z i ), i = 1, . . ., n} does not verify the restrictions, i.e., Y / ∈ C. For any γ ∈ [0, 1], let µ γ j be the restricted estimator of µ j obtained in Definition 1. Now, we transform the original training sample in such a way that the new sample means belong to C. The transformed training sample is {(W i , Z i ), i = 1, . . ., n}, where for i = 1, . . ., n and j = 1, . . ., k.
In this way W = W 1 , . . ., W k , W j = 1 n j n l=1 W l I (Z l =j) , j = 1, . . ., k.Now, the estimator denoted as BT 3 is computed in a similar way to that of BT 2 but replacing the original training sample by the transformed one and C by C.
The BT 3CV estimator is the cross-validation after bootstrap version of BT 3.These four estimators of the true error rates of the restricted linear and quadratic discriminant rules can be obtained with err.est.rlda and err.est.rqdafunctions of the R package dawai, respectively.

Package dawai
We start this section giving some background on R packages for performing discriminant analysis.We then explain some details about the functions of this package.

Related packages
As discriminant analysis is a well-known and widely used technique there are many packages in R for performing discriminant analysis.The basic procedures are in package • rrlda (Gschwandtner et al. 2013): Robust Regularized Linear Discriminant Analysis.
Since none of the existing packages for discriminant analysis are applicable for performing discriminant analysis under restrictions, in this article we introduce the package "discriminant analysis with additional information", with the acronym dawai.
Our package depends on boot (Ripley 2013) for bootstrapping, ibdreg (Sinnwell and Schaid 2013) for computing the projection of a vector onto a polyhedral cone with lsConstrain.fit,and mvtnorm (Genz et al. 2013) for computing multivariate normal densities.These packages should be installed before loading dawai.

Functions provided in dawai
The R package dawai consists of a total of six functions, three for each of the two restricted discrimination analysis situations: equal or unequal covariances in the populations.The three functions for each case are: one to define the rules that take into account the additional information expressed in terms of restrictions on the populations means and to classify the samples in the training set; a second one which predicts the populations of new samples using the previously defined rule; and, finally, a third one which can evaluate the accuracy of the rules associated to the training set.
The R help files provide the definitive reference.Here we explain some details about the main user-level functions and their specific arguments.
The first function for linear discrimination is the the rlda function that can be used to build restricted linear classification rules with additional information expressed as inequality restrictions on the populations means, using the methodology developed in Fernández et al. (2006) and Conde et al. (2012).It creates an object of class "rlda": rlda(x, grouping, subset = NULL, resmatrix = NULL, restext = NULL, gamma = c(0, 1), prior = NULL, ...) or rlda(formula, data, subset = NULL, resmatrix = NULL, restext = NULL, gamma = c(0, 1), prior = NULL, ...) Argument resmatrix collects the additional information on the means vectors, as follows: resmatrix • (µ 1 , . . ., µ k ) ≤ 0. Each of the rows of resmatrix corresponds to each of the q restrictions included in the model (3).Obviously, the number of columns of resmatrix is the total number of parameters kp, with the first p columns corresponding to the mean values of the predictors for the first population and so on.
The purpose of restext is to make easier the specification of the two most usual cones of restrictions, the tree order (4) and the simple order (5) cones.The first element of restext must be either "s" (simple order) or "t" (tree order), the second element must be either "<" (increasing componentwise order) or ">" (decreasing componentwise order), and the rest of the elements must be numbers from the set {1, . . ., p}, separated by commas, specifying among which variables the restrictions hold.
Argument gamma is the vector of values (in the unit interval) used to determine the restricted rules R l (γ).
The second function predict.rldaclassifies multivariate observations contained in a data frame newdata using the restricted linear classification rules defined in an object of class "rlda": predict.rlda(object,newdata, prior = object$prior, gamma = object$gamma, grouping = NULL, ...) Finally, the third function err.est.rldaestimates the true error rate of the restricted linear classification rules defined in an object x of class "rlda", using the methodology developed in Conde et al. (2013) and in Section 2.2. of this paper: err.est.rlda(x,nboot = 50, gamma = x$gamma, prior = x$prior, ...) Argument nboot is the number of bootstrap samples used.
The rqda, predict.rqdaand err.est.rqdafunctions are the corresponding versions of the rlda, predict.rldaand err.est.rldafunctions for performing restricted quadratic discrimination.The rqda function builds restricted quadratic classification rules using the methodology developed in Section 2.1.The predict.rqda function classifies multivariate observations with restricted quadratic classification rules.Finally, the err.est.rqdafunction estimates the true error rate of restricted quadratic classification rules using the methodology developed in Section 2.2.
Examples to illustrate these functions are provided in Section 4.

Applications
There is a wide range of applications of the dawai package, which we illustrate in this section using two data sets coming from two different fields such as biology and pattern recognition.

Biological application
In patient care, as for example in cancer treatment, an important component is the correct classification of the patient into one of the disease stages.The disease stages correspond to increasingly advanced levels of the disease, so it is reasonable to expect the mean values of some variables to increase or decrease with the severity of the illness.This is the case of primary biliary cirrhosis (PBC), an autoimmune liver disease causing liver inflammation and a gradual destruction of the intrahepatic bile ducts found within the liver.PBC is a progressive disease, with four sucessive stages as time passes (Scheuer 1967).
The data set we will use now, called pbc, is in survival R package, taken from Therneau and Grambsch (2000).This data set is from the Mayo Clinic trial in PBC of the liver conducted between 1974 and 1984, and it has 418 cases and 20 variables.
We will use this data set to exemplify the restricted linear discriminant rules.We consider three variables as predictors (p = 3), Bili, Albumin and Platelet (the amounts of serum bilirubin (mg/dl) and serum albumin (g/dl) and platelet count, respectively), and three populations (k = 3), joining the original stages 1 and 2 into one so that the classes have enough elements to split the sample into training and test data sets of reasonable sizes, as seen below.
We transform logarithmically the values of the explicative variables so that the variables are approximately normally distributed.
Finally, we estimate the true error rate from the training sample with nboot = 50 (the default value).
We can also compare these results with the error rates for some standard unrestricted classifiers such as LDA (MASS package) and Random F orest (randomForest package).For γ = 1, the test error rates for the restricted linear rules are 7.74% lower than for LDA (51.83946) and 15.38% lower than for Random Forest (56.52174).

Pattern recognition application
As an example of pattern recognition, we will use the data set contained in dawai package called Vehicle2.
R> library("dawai") R> data("Vehicle2") This data set is a subset from the Vehicle data set, available in the R package mlbench and taken from the UCI Repository Of Machine Learning Databases (http://www.ics.uci.edu/~mlearn/MLRepository.html), originally gathered in Siebert (1987).The purpose of the data set is to study how to distinguish 3D objects from a 2D image, i.e., how to classify a given silhouette as viewed from a camera from different angles and elevations into one of four types of vehicle, using a set of features extracted from the silhouette.The vehicles used were a double-decker bus, a Cheverolet van, a Saab 9000 and an Opel Manta 400, with the expectation that the bus, the van and either one of the cars would be readily distinguishable, but it would be more difficult to distinguish between the cars.
Vehicle2 is a data frame with 846 observations on 5 variables, four numerical and one nominal defining the class of the objects, i.e., the vehicle.The variables are Skew.maxis(skewness about minor axis), Kurt.Maxis (kurtosis about major axis), Holl.Ra (quotient hollows area / bounding polygon area), Sc.Var.maxis(quotient 2nd order moment about minor axis / area) and Class.
We will use this data set to exemplify the restricted quadratic discriminant rules.
We consider the four variables as predictors (p = 4) and the four available populations (k = 4).
R> data <-Vehicle2[, 1:4] R> grouping <-Vehicle2$Class R> levels(grouping) [1] "bus" "opel" "saab" "van" R> levels(grouping) <-c(4, 2, 1, 3) We have "ordered" the populations in terms of the vehicle size.It could be reasonable to think that the means of the first three variables decrease with the vehicle size (in fact, this ordering is verified by the whole data set), so let us suppose the following restrictions on the means: As this is a classical simple order on these predictors and all orderings are decreasing, we easily specify these restrictions by restext = "s>1,2,3".
We split the data set into a randomly selected training set and test set, fixing a seed in order to get the same results as the reader.Now we can build the restricted quadratic discriminant rules: R> obj <-rqda(data, grouping, subset = trainsubset, restext = "s>1,2,3") R> obj
Again, if we compare these results with the error rates for some standard unrestricted classifiers such as QDA (MASS package) and Random Forest, for γ = 1 the test error rates for the restricted quadratic rules are 7.98% lower than for QDA (33.9172) and 14.78% lower than for Random Forest (36.6242).
In both examples we can see that there is a significant improvement with respect to usual methods in statistical practice that do not take into account the additional information given by the restrictions.

Conclusions
In this paper the R package dawai has been presented.The package provides the functions needed to define linear or quadratic classification rules under order restrictions, to classify the samples and to evaluate the accuracy of the rules.
We have also extended in this paper the definitions given in previous papers (Fernández et al. 2006, Conde et al. 2012, Conde et al. 2013) from the case of equal covariances in the different populations to the case of unequal covariances among the populations and consequently defined the corresponding restricted quadratic discriminant rules.Another novelty is the definition of estimators of the accuracy of the rules for the general more than two populations case, for restricted linear and quadratic discriminant rules, thus completing the procedures presented in those three previous papers.
Though we have illustrated the proposed methodology using examples from biology and pattern recognition, the software can obviously be applied to a wide range of contexts such as medical image analysis, drug discovery and development, optical character and handwriting recognition, document classification, credit scoring. . .We expect the software described to be useful for researchers working in any of those fields.

Figure 1 :
Figure 1: Examples of the iterative procedure for mean vector estimation for an acute (a) and a non-acute cone (b).
n}, b = 1, . . ., B, are obtained from M n .For each bootstrap training sample we define the bootstrap version of the estimator of µ = (µ 1 , . . ., µ k ) that we denote as µ * b γ (with γ ∈ [0, 1]), as the limit when m → ∞ of the following iterative procedure similar to the one considered in Definition 1.Let µ (0)b γ = Y and µ (m)b γ = p A µ (m−1)b γ |C − γp A µ (m−1)b γ |C P , m = 1, 2, . . ., where matrix A is equal to S −1 * for the restricted linear discriminant rule and equal to S −1 * * for the restricted quadratic discriminant rule.Now, we denote as R * b l (γ) and R * b q (γ) the bootstrap versions of the classification rules R l (γ) and R q (γ), respectively.For each of the B bootstrap rules we classify the observations in the original training sample that do not belong to the corresponding bootstrap sample M * b n .The true error rate estimator BT 2 is the proportion of observations wrongly classified.The BT 2CV estimator is the BCV (Fu et al. 2005) version of BT 2. For each of the B bootstrap training samples, let CV b be the true error rate estimator obtained using the crossvalidation method on sample M * b n .Then BT 2CV = 1 B B b=1 CV b .

•
MASS (Ripley et al. 2011): Support Functions and Datasets for Venables and Ripley's MASS.Some more recent packages including new features and discrimination in specific conditions are • mda (Hastie et al. 2013): Mixture and flexible discriminant analysis.
n} is a size n randomly obtained (with replacement) sample from the original training sample (i.e., P((Y * i The apparent error rates suggest that the classes are not completely separated in the training sample.This is a situation, usual in practice, where the restricted rules are expected to perform better than the unrestricted ones, as we will see.Now we consider the test set, containing the observations in data not present in the training set, and classify them.As we know the classes that the observations in the test set belong to, we can estimate the true error rate.