Joint variable selection and classification with immunohistochemical data.

To determine if candidate cancer biomarkers have utility in a clinical setting, validation using immunohistochemical methods is typically done. Most analyses of such data have not incorporated the multivariate nature of the staining profiles. In this article, we consider modelling such data using recently developed ideas from the machine learning community. In particular, we consider the joint goals of feature selection and classification. We develop estimation procedures for the analysis of immunohistochemical profiles using the least absolute selection and shrinkage operator. These lead to novel and flexible models and algorithms for the analysis of compositional data. The techniques are illustrated using data from a cancer biomarker study.


Introduction
The development of high-throughput assays such as mass spectrometry and gene expression microarrays has led to the generation of large numbers of candidate biomarkers in current medical research. However, while such results and signatures tend to provide great potential for disease prognosis, translating the discovery into a clinically useful biomarker requires more investigation. An important first step typically is to validate the finding using so-called immunohistochemical staining patterns.
In immunohistochemical studies, staining patterns of the biomarker are measured across a variety of samples. One common way this is done is using a tissue microarray. 1 In this scheme, the "spots" on the glass slide represent tumor cores from different patients, and an antibody for the protein of interest is applied to the slide. The staining patterns are then correlated with patient characteristics. In most instances, the staining is assessed by a pathologist, who assigns a score on an ordinal scale, with larger values corresponding to higher levels of staining. Note that this structure is quite different from an expression microarray, in which the spots are individual genes and proteins of interest, while what is hybridized to the slide is a single sample.
Typically, an analysis of such data requires the creation of a univariate score measuring staining intensity for each sample. Then the score is associated with clinical outcomes using standard testing and regression methods. For example, if the clinical outcome is binary, one could use a parametric or non-parametric two-sample test for association. Alternatively, one could fit a linear regression of staining intensity on clinical outcome or a logistic regression of the outcome on staining intensity.
As noted by Etzioni et al 2 the staining of tumor samples to the antibody is not completely homogeneous. What is available for certain scoring systems is a multivariate profile of percent of tumor cells staining at each level of the various scoring categories. As an example, we consider tissue microarray data from a candidate prostate cancer biomarker, LIMK1. 3 LIMK1 is a dual specificity novel serine/threonine kinase which modulates actin dynamics through inactivation of the actin depolymerizing protein cofilin.
The function of LIMK1 in reorganization of the cytoskeleton has been studied extensively during developmental defects. 4,5 Recently, a role of LIMK1 in progression and invasiveness of breast and prostate cancer has been predicted. 6,7 In this paper, we explore the status of LIMK1 staining in the nucleus and cytoplasm as it relates to aggressiveness of prostate cancer.
In this dataset, there were five staining categories, and 50 samples were profiled for LIMK1 nuclear staining. The data for five randomly chosen observations are given in Table 1. Note that by definition, the percentages within each row must add up to one. Such data are referred to as compositional data. 8 A major advance by Etzioni et al 2 was to advance the use of composition data analytic techniques for the analysis of immunohistochemical data. They propose the use of Bayesian inference for the so-called logistic normal distribution for compositional data. However, fitting their model requires customized software that is typically not available to data analysts. In this article, we discuss the model proposed by Etzioni et al 2 and show that it is in fact equivalent to a particular linear discriminant analysis model. Linear discriminant analysis (LDA), pioneered by Fisher,9 has been a popular model in the classification literature. Software for fitting LDA is available in most mainstream statistical packages, such as MINITAB, SAS and Splus/R. Thus, the first aim of the article is to show that one can in fact fit the model of Etzioni et al 2 using LDA methods. A second goal of the article is to jointly perform classification and feature selection within this class of LDA models. Such an approach would allow for the automated inclusion of informative and exclusion of non-informative categories for discriminating samples. In the context of the tissue microarray example, this means that we want to find which staining categories that are informative for predicting aggressiveness. This will be done using the lasso penalty described initially by Tibshirani, 10 in conjunction with a model selection strategy.
The structure of this article is as follows. In Results and Discussion, we outline the data structures and discuss the logistic normal model formulation of Etzioni et al 2 for analysis of immunohistochemical profiles and equivalence with LDA. We also describe the lasso algorithm of Tibshirani. 10 The methods are then applied to the motivating dataset. In the Methods section, we describe the optimal scoring algorithm for converting the classification problem of linear discriminant analysis into a regression problem. 11 This will allow for the fusion of the classification using multivariate staining profiles along with automatic lasso-based selection of staining categories that are informative. In addition, we describe methods for additional covariate adjustment and model selection within this framework.

Results and Discussion data structures and logistic normal model
We will be assuming that we have data (D i , Y i , X), i = 1, ..., n, a random sample from (D, Y, X), where D denotes the group status, Y ≡ (Y 1 , ..., Y p ) is a p-dimensional staining profile, and X is a q-dimensional vector of covariates. D will take values 0 and 1. We let s 1 , ..., s p denote the scores assigned to (Y 1 , ..., Y p ); typically, we take (s 1 , ..., s p ) to be (0, ..., p -1) or (1, ..., p). It is assumed that Y i (i = 1, ..., p) takes values in (0, 1) and that Y i . While Y is assumed to have all non-zero components, in practice zeroes do exist. We follow the recommendations of Etzioni et al 2 and add in a random noise term.

Logistic normal Distribution for Compositional Profiles
In addition, the components of Z are nonnegative. Thus, transformed vectors are multivariate measurements on the product space (0, ∞) (p-1) . The probabilistic model proposed in Etzioni et al 2 for the analysis of immunohistochemical profiles is to assume that conditional on D, where µ D is a ( p -1)-dimensional mean vector and Σ is a covariance matrix. The induced distribution for (Y 1 /Y, ..., Y P /Y) is referred to as the logistic normal distribution in the compositional data analysis literature. 8,12 While interpretation for the parameters on the transformed scale (i.e. Z) is easy, it is harder to interpret on the original scale.
In terms of analyzing immunohistochemical profiles, Etzioni et al 2 adopted a hierarchical model in which priors were placed on µ D and Σ. They then used a Markov Chain Monte Carlo (MCMC) sampling algorithm to sample from the the posterior distribution of µ D . It was used to construct a 95% credible interval for mean shifts for the log-transformed profile. An easier estimation procedure that does not require implementing a Gibbs sampling algorithm is to fit a linear discriminant analysis model to the transformed data Z. This can be performed using virtually any standard statistical software package. The estimated linear discriminants from performing the linear discriminant analysis can be used in several ways. First, they can be used as a visualization method. Second, they can serve as a data-driven summary score on which further analysis can be performed.
One also notes that (2) can be generalized to allow for proportional covariance matrices across populations. This would then necessitate fitting a quadratic linear discriminant analysis model to Z.

Lasso estimation
Shifting gears, we discuss the Least Absolute Shrinkage and Selection (LASSO) algorithm .., U n ) are continuous variables, and β is an unknown p-dimensional vector of unknown regression coefficients to be estimated. The LASSO solution is given by where λ  0 is a penalty parameter, and β j denotes the jth component of β. Tibshirani 10 showed that placing an L 1 constraint on the sum of the magnitude of the regression coefficients yielded sparsity in the estimates of β. To be specific, for certain values of λ, it is possible for the lasso estimate of β to be identically zero. Further details on the numerical algorithm implemented here is given in Methods.

LIMK1 Biomarker study
Using the procedures described above as well as in the Methods section, we now consider a real-life application. The immunohistochemical data come from a putative prostate cancer biomarker, LIM kinase 1 (LIMK1). 3 In this study, the expression profile of LIMK1 was determined using a prostate tumor tissue array comprising 50 samples from tumors at different stages of progression. The pool of samples in the array included three uninvolved prostate tissues for comparison. TNM classification of tumors in the TMA indicated that 62% patients had histories of either lymph node or distant metastasis at the time of surgery or biopsy, and 88% of the tumors had Gleason scores of 7 or above. Gleason score (GS) is an aggregate measure of the aggressiveness of the tumor. It is composed of a major and minor Gleason score, each of which is scored on a scale of one to five. We dichotomized Gleason score as less than or greater than or equal to eight. Analyses of nuclear staining are considered first. Scatterplots of the multivariate nuclear staining profiles by pairwise category comparison are given in Figure 1. To associate staining with the clinical parameters (presence of metastases, Gleason score), we used the product score, multiplying the percentage staining by the staining intensity. Boxplots of the product score for nuclear staining versus presence of metastases and Gleason score are provided in Figures 2 and 3. While Figure 2 indicates that metastatic tumors have higher nuclear staining relative to non-metastatic tumors, there is less difference in nuclear staining across the different Gleason score categories. A t-test reveals the differences corresponding to the boxplot in Figure 2 to be statistically non-significant (P = 0.32 for presence of metastases), while an analysis of variance yields the association between nuclear  staining and Gleason score to also be nonsignificant (P = 0.31).
A series of logistic classification models using the proposed methods were run; they are summarized in Table 1. Based on the analyses we find that if we use nuclear staining profile to predict presence of metastases, then all four staining categories are informative. Using the BIC criterion, there was no improvement by including presence of metastases as a covariate. On the other hand, if one wishes to use nuclear staining to predict Gleason score, then only the third staining category is informative. There is no improvement by including presence of metastases as a covariate.
Next, we considered analyses based on cytoplasmic staining intensity. Pairwise scatter-plots of cytoplasmic staining are given in Figure 4. As with nuclear staining, we used the product score for associating the immunohistochemcial profile for cytoplasmic staining with Gleason score and presence of metastases. The boxplots of the LIMK1 cytoplasmic staining product score by presence of metastases and Gleason status are given in Figures 5 and 6. Analyses analogous to those for nuclear staining reveal nonsignificant associations (P = 0.75 and P = 0.40 for presence of metastases and Gleason score, respectively).
The logistic normal models results for analysis using the cytoplasmic staining profiles are given in Table 2. For predicting presence of metastases, only the first staining category is needed. However, if we seek to adjust for Gleason score, then categories 1, 3, and 4 are needed. However, the BIC shows that the model fit worsens. If we use cytoplasmic staining to predict Gleason score, then categories 2, 3, and 4 are selected by the LASSO procedure. Including presence of metastases does not improve the model fit. Comparing across the models listed in the Table,  we see that models with fewer staining categories selected tend to have better model fit. This supports the use of the LASSO algorithm for automating variable selection while fitting the logistic normal model to the immunohistochemical profiles. This also suggests that the proposed methodology, which in effect can fit a reduced submodel of the model of Etzioni et al 2 is a better fit to the data rather than the full model of Etzioni et al. 2 conclusions In this article, we have explored a compositional data model initially proposed by Etzioni et al 2 that is applicable to the modelling of immunohistochemical biomarker data such as those which might arise from tissue microarrays. We have shown that it has a natural link with linear discriminant analysis, which has been very well-studied in the statistical literature. Consequently, the Etzioni et al model can be fit using standard software packages for LDA, after some data manipulations are performed. The inference we perform is non-Bayesian, in contrast to the Bayesian inference done by Etzioni et al. 2 We also have developed an automated variable selection procedure within the class of models by incorporating LASSO estimation procedures. The real data example shows that this automated variable selection leads to a better fit. We have also outlined a model selection strategy in which the Etzioni et al model is compared to submodels in which categories are suppressed.
While we dealt with the situation in this paper where D is binary (as did Etzioni et al), the optimal scoring algorithm can be easily extended to deal with the case where D has more than two levels. One converts D into a n × (G -1) matrix, where G is the number of groups. The regression model that is fit is then a multivariate regression in that the response is multivariate.
Scientifically, while a score-based method such as the product score provides a simple summary statistic for staining data that can then be associated with clinical parameters in tests of hypotheses and regression models, it might oversimplify the data too much. This would be especially undesirable if there is substantial within-sample staining heterogeneity. Thus, methods which explicitly account for the multivariate nature of the staining offer a useful alternative. Compositional data methods are one type of multivariate approach. What our method allows the analyst to do is (1) model the staining profiles in a multivariate manner, (2) incorporate clinical variables

Lasso-based optimal scoring algorithm
In this section, we describe our proposal, which entails developing a sparse estimator in the logistic normal model for compositional data. This is done by using the optimal scoring algorithm of Hastie et al 11 to convert the logistic normal classification problem into a regression problem. This is done in the following way: The fitted values obtained at the end of the algorithm are proportional to the linear discriminant analysis coefficients. To extend the algorithm so that we jointly perform classification and automated variable selection, we simply replace step 3 of the algorithm by LASSO estimation of the type described in Results and Discussion. Based on the algorithm, regression coefficients for each variable in X will be estimated. Those with estimated regression coefficients that are zero are considered unimportant variables or features.
We will use the algorithm of Osborne et al 13 for LASSO estimation. Let σ be the index set, a subset of {1, ..., p}. The ith component of β is non-zero if and only if I ∈ σ. The algorithm of Osborne et al 13 operates by sequentially updating the index set. Let P denote the permutation matrix that arranges the non-zero components of η as the first s components, where s is the cardinality of σ. We have that Let θ σ be the sign vector of β σ . At each step of the algorithm, β must satisfy the L 1 constraint; this can be expressed as If the constraint is active, then the optimal solution for h is given by the least square method. Let h denote the solution, and let (β* = β + h. If sign(β*) σ = θ σ , then (β* is said to be sign feasible. If (β* is not sign feasible, then the following steps are taken: 1. Find the smallest α ∈ (0, 1) such that 0 = β k + αh k for a k ∈ σ and set  β β α = + h. 2. One of two steps may be taken here. Either i. Set θ k = -θ k and recompute h. If β + h is sign feasible for the revised θ, set (β* = β + h and go to step 3. or ii. Update σ by deleting k, resetting β and θ σ accordingly, and recompute h for the revised problem. 3. Iterate between steps 1 and 2 until a sign feasible  β is obtained. Set β β * . =  Once the sign feasibility is obtained, the optimality of the candidate solution is tested. This is done by calculating where Z σ is the design matrix Z with columns corresponding to P. By definition, the ith component of v 1 is θ i for 1  i  s. If the absolute magnitude of the ith component of v 2  1 for 1  i  (ps), then β* is a solution to the lasso problem. Otherwise, the following steps are taken: 1. Find the index j such that the jth component of v 2 has the largest magnitude. 2. Update σ by adding j to it and update β σ by adding a zero as its last element and β σ by appending the jth component of sign(v 2 ). 3. Set (β* = β and iterate between steps 1 and 2. This algorithm has been implemented as an R function (www.r-project.org) by the first author and can be obtained upon request.

Covariate adjustment and model selection
An important question not addressed by Etzioni et al 2 was adjusting for other co-variates in addition to D.