Functional Clustering Algorithm for High-Dimensional Proteomics Data

Clustering proteomics data is a challenging problem for any traditional clustering algorithm. Usually, the number of samples is largely smaller than the number of protein peaks. The use of a clustering algorithm which does not take into consideration the number of features of variables (here the number of peaks) is needed. An innovative hierarchical clustering algorithm may be a good approach. We propose here a new dissimilarity measure for the hierarchical clustering combined with a functional data analysis. We present a specific application of functional data analysis (FDA) to a high-throughput proteomics study. The high performance of the proposed algorithm is compared to two popular dissimilarity measures in the clustering of normal and human T-cell leukemia virus type 1 (HTLV-1)-infected patients samples.


INTRODUCTION
A variety of mass spectrometry-based platforms are currently available for providing information on both protein patterns and protein identity [1,2]. Specifically, the first widely used such mass spectrometric technique is known as surface-enhanced laser desorption ionization (SELDI) coupled with time-of-flight (TOF) mass spectrometric detection [3,4,5]. The SELDI approach is based on the use of an energy-absorbing matrix such as sinapinic acid (SPH), large molecules such as peptides ionize instead of decomposing when subjected to a nitrogen UV laser. Thus, partially purified serum is crystallized with an SPH matrix and placed on a metal slide. Depending upon the range of masses the investigator wishes to study, there are a variety of possible slide surfaces; for example, the strong anion exchange (SAX) or the weak cation exchange (WCX) surface. The peptides are ionized by the pulsed laser beam and then traverse a magneticfield-containing column. Masses are separated according to their TOFs as the latter are proportional to the square of the mass-to-charge (m/z) ratio. Since nearly all of the resulting ions have unit charge, the mass-to-charge ratio is in most cases a mass. The spectrum (intensity level as a function of mass) is recorded, so the resulting data obtained on each serum sample are a series of intensity levels at each mass value on a common grid of masses (peaks).
Proteomic profiling is a new approach to clinical diagnosis, and many computational challenges still exist. Not only are the platforms themselves still improving, but the methods used to interpret the high-dimensional data are developing as well [6,7].
A variety of clustering approaches has been applied to high-dimensional genomics and proteomics data [8,9,10,11]. Hierarchical clustering methods give rise to nested partitions, meaning the intersection of a set in the partition at one level of the hierarchy with a set of the partition at a higher level of the hierarchy will always be equal to the set from the lower level or the empty set. The hierarchy can thus be graphically represented by a tree.
Functional data analysis (FDA) is a statistical data analysis represented by smooth curves or continuous functions µ i (t), i = 1, . . . , n, where n is the number of observations and t might or might not necessarily denote time but might have a general meaning. Here t denotes the mass (m/z). In practice, the information over µ i (t) is collected at a finite number of points, T i , thus observing the data vector y i = (y i1 , . . . , y iTi ) t . The basic statistical model of FDA is given by where t i j is the mass value at which the jth measurement is taken for the ith function µ i . The independent disturbance terms i (t i j ) are responsible for roughness in y i . FDA has been developed for analyzing functional (or curve) data. In FDA, data consists of functions not of vectors. Samples are taken at time points t 1 , t 2 , . . ., and regard µ i (t i j ) as multivariate observations. In this sense the original functional y i j can be regarded as the limit of µ i (t i j ) as the sampling interval tends to zero and the dimension of multivariate observations tends to infinity. Ramsay and Silverman [12,13] have discussed several methods for analyzing functional data, including functional regression analysis, functional principal component analysis (PCA), and functional canonical correlation analysis (CCA). These methodologies look attractive, because one often meets the cases where one wishes to apply regression analysis and PCA to such data. In the following we describe how to use the FDA tools for applying FDA and a new dissimilarity measure to classify the spectra data.
We propose to implement a hierarchical clustering algorithm for proteomics data using FDA. We use functional transformation to smooth and reduce the dimensionality of the spectra and develop a new algorithm for clustering high-dimensional proteomics data.

Serum samples from HTLV-1-infected patients
Protein expression profiles generated through SELDI analysis of sera from human t-cell leukemia virus type 1-(HTLV-1)-infected individuals were used to determine the changes in the cell proteome that characterize adult T-cell leukemia (ATL), an aggressive lymphoproliferative disease from HTLV-1-associated myelopathy/tropical spastic paraparesis (HAM/TSP), a chronic progressive neurodegenerative disease. Both diseases are associated with the infection of T cells by HTLV-1. The HTLV-1 virally encoded oncoprotein Tax has been implicated in the retrovirus-mediated cellular transformation and is believed to contribute to the oncogenic process through induction of genomic instability affecting both DNA repair integrity and cell cycle progression [14,15]. Serum samples were obtained from the Virginia Prostate Center Tissue and body fluid bank. All samples had been procured from consenting patients according to protocols approved by the Institutional Review Board and stored frozen. None of the samples had been thawed more than twice.
Triplicate serum samples (n = 68) from healthy or normal (n 1 = 37), ATL (n 2 = 20), and HAM (n 3 = 11) patients were processed. A bioprocessor, which holds 12 chips in place, was used to process 96 samples at one time. Each chip contained one "QC spot" from normal pooled serum, which was applied to each chip along with the test samples in a random fashion. The QC spots served as quality control for assay and chip variability. The samples were blinded for the technicians who processed the samples. The reproducibility of the SELDI spectra, that is, mass and intensity from array to array on a single chip (intra-assay) and between chips (interassay), was determined with the pooled normal serum QC sample ( Figure 1).

SELDI mass spectrometry
Serum samples were analyzed by SELDI mass spectrometry as described earlier [16]. The spectral data generated was used in this study for the development of the novel FDA.

Hierarchical clustering using functional data analysis
We propose to implement a hierarchical clustering algorithm for proteomics data using FDA, which consists of detecting hidden group structures within a functional dataset. We apply a new dissimilarity measure to the smoothed (transformed) proteomics functions µ i . Then we develop a new metric that calculates the dissimilarity between different curves produced by protein expression. The development of metrics for curve and timeseries models was first addressed by Piccolo [17] and Corduas [18]. Heckman and Zamar proposed a dissimilarity measure δ HZ for clustering curves [19]. Their dissimilarity measure considers curve invariance under monotone transformations. Let Λ i = {λ (i) 1 , λ (i) 2 , . . . , λ (i) mi } be the collection of the estimated points where the curve µ i (t) has a local maximum and let m i be the number of maximals per observation or per sample (i) · δ HZ is defined as where This measure is powerful for regression curves which are mainly monotone. On the other hand, Cerioli et al [20] propose a dissimilarity measure δ C extending the one proposed by Ingrassia et al [21]. Cerioli's dissimilarity δ C is defined by Both dissimilarity measures show good performance for time-series data. Dissimilarity δ C does not involve all the indices m i of the smoothed curve. It also uses the shortest distance between curves by involving few data points obtained by FDA smoothing.
A flexible dissimilarity measure is the one that may combine the characteristic of both measures δ HZ and δ C . This means that a potential dissimilarity measure should use the collected estimated points of the original curve obtained from FDA so that no information is lost and should work on different type of smoothed curves without using the monotonicity restriction.
In this sense, we propose a functional-based dissimilarity δ B measure which uses the rank of the curve proposed by Heckman and Zamar and generalizes Cerioli et al dissimilarity measure as follows: Obviously, d ii = 0 and d il = 0, if µ i and µ l have the same shape (T i = T l ). We can adjust the formula above to obtain a dissimilarity measure that satisfies symmetry, by taking δ B as our proposed dissimilarity measure: We used three powerful hierarchical methods to derive clusters or patterns using δ B and we compare the performance of δ B to δ C and δ HZ . The hierarchical algorithms we used are (1) Pam which partitions the data into different clusters "around their medoids," (2) Clara which works as in "Pam." Once the number of clusters is specified and representative objects have been selected from the sub-dataset, each observation of the entire dataset is assigned to the nearest medoid [22]. The sum of the dissimilarities of the observations to their closest medoid is used as a measure of the quality of the clustering. The subdataset for which the sum is minimal, is retained. Each sub-dataset is forced to contain the medoids obtained  Figure 3. Clustering proteomics data with Diana. from the best sub-dataset until then. (3) Diana is probably unique in computing a divisive hierarchy, whereas most other software for hierarchical clustering is agglomerative. Moreover, Diana provides the divisive coefficient which measures the amount of clustering structure found. The Diana-algorithm constructs a hierarchy of clustering starting with one large cluster containing all n observations. Clusters are divided until each cluster contains only a single observation. At each stage, the cluster with the largest diameter is selected [22].

Functional data transformation reduces the dimensionality of the spectra
The spectral data were collected from proteomics analysis of a total number of serum samples (n = 68) including healthy or normal (n 1 = 37), ATL (n 2 = 20), and HAM (n 3 = 11) patients. The dataset is represented by an n × p matrix X, where p = 25, 196 is the number of variables (peaks) measured on each sample and n = 68 is the number of samples (patients). Any clustering algo- rithm on a datum (68 × 25, 196) will fail because of the singularity of the covariance matrix (n < p) and it will be difficult in manipulating matrices with 68 rows and 25, 196 columns which has 1.7133 × 10 6 elements. This problem would not be raised for heuristic-based (ie, pairwise similarity-based) clustering algorithms.
To reduce the dimensionality of the spectral data, we applied FDA by fitting a P-spline curve µ i (t) to each sample y i . P-splines satisfy a penalized residual sum of squares criterion, where the penalty involves a specified degree of derivation for µ i (t). For example, cubic splines functions are P-splines of second order, penalizing the second derivative of µ i (t). P-splines curves of order 3 penalize the third derivative of µ i (t). P-splines curves of order 4 lead to an estimate of µ i (t) with continuous first and second derivatives. We choose here to fit a P-spline curve of order 4 ( Figure 2). The fitting step is performed by fixing the number of degrees of freedom that are implicit in the smoothing procedure [23].
The next step performed on the smoothed curves is to find the landmarks or indices T i . We collected the first derivative of µ i (t), say µ i (t), using a smoothing P-spline function available in R. Those derivatives are crucial at determining the cut-off points or indices of µ i (t). We performed this step by computing an approximate 95% pointwise confidence interval for the first derivative of µ i (t) [24]. When the lower limit of this interval is positive, we have the confidence that µ i (t) will be increasing. When the upper limit of this interval is negative, we have the confidence that µ i (t) will be decreasing. Inside the interval, when the derivative changes from negative to positive, we have an optimal value which is a minimum. When the derivative changes from positive to negative, we have an optimal value which is a maximum. The maximum is set, for convenience, as the largest value of µ i (t) in that interval. In this study, we restricted the choice of indices to maximal values. Let Λ i = {λ (i) 1 , λ (i) 2 , . . . , λ (i) mi } be the collection of the estimated points where the curve µ i (t) has a local maximum and let m i be the number of maximals per observation or per sample (i). Consequently, dissimilarity measure is calculated to derive the dissimilarity matrices of size (n × n) for all samples using the maximum values.

Clustering spectral data using functional data analysis
The application of functional data transformation led to the reduction of the dimensionality of the spectra to half. The size of mass indices become 12, 598. To cluster the reduced data, we calculated the three dissimilarity matrices M δC , M δB , and M δHZ . It appears that an unusual sample (patient 11) hides a possible pattern that we are trying to discover. Figure 3 shows a clustering dendrogram of the data using Diana approach. Pam and Clara gave the same results. This suggests that sample 11 would be important for further investigation.
When we removed observation 11, we detected a fewer fuzzy patterns with δ C (Figure 4), δ HZ (Figure 5), and δ B (Figure 6). To be more specific, we investigated clusters proposed by δ C and δ HZ . A large number of clusters were proposed by both approaches (about 10 clusters). This strange result might be caused by the monotonicity as- sumption when using δ HZ or the loss of information when using δ C . For δ B , we provided the dendogram of the data using Diana approach (Figure 7). Three clusters were apparent. One well-separated cluster and two overlapped ones. For δ HZ and δ C , no structure was apparent which confirms the limitations of both dissimilarities as explained before.
To check the performance of our method, we calculated the confusion matrix between the predicted clusters and the clinical clusters using Diana (Table 1) and Clara ( Table 2). We find that 3 patients out of 11 were misclassified for cluster 1 (HAM), 6 out of 20 were misclassified for cluster 2 (ATL), and 3 out of 37 were misclassified for cluster 3 (normal). Ham and ATL shared the majority of the misclassified observations which makes sense since both groups gather patients with a disease caused by the same retrospective virus. The error rate of misclassification for both clusters (HAM and ATL) is about 20%. For normal patient, the error rate of misclassification is about 8%. The total rate of misclassification is about 16%.
When we used Clara-based hierarchical cluster algorithm with δ B , the classification result has dramatically been improved (Figure 8). The error rate of misclassification is reduced to 7%.The error rate of misclassification between HAM and ATL is about 9%, 5% of normal patients was misclassified. This result shows that a hierarchical δ B dissimilarity algorithm based on minimizing the dissimilarity of observations to their closest medoid performs better than a divisive hierarchical clustering algorithm based on δ B .

DISCUSSION
Cancer biomarkers can be used to screen asymptomatic individuals in the population, assist diagnosis in   suspected cases, predict prognosis and response to specific treatments, and monitor patients after primary therapy. The introduction of new technologies to the proteome analysis field, such as mass spectrometry, have sparked new interest in cancer biomarkers allowing for more effective diagnosis of cancer by using complex proteomic patterns or for better classification of cancers, based on molecular signatures, respectively. These technologies provide wealth of information and rapidly generate large quantities of data.
Processing the large amounts of data will lead to useful predictive mathematical descriptions of biological systems which will permit rapid identification of novel therapeutic targets and diseases biomarkers.
Clustering and analyzing proteomics data has been proven to be a challenging task.
Proteomics data are provided usually as curves or spectra with thousand of peaks. A clustering algorithm based on a matrix of n observations (n samples which is usually small) and p peaks (p variables which is usually a large number) will be unsuccessful. A matrix of size (n p) will be singular and any method based on a matrix M (n × p) will not be robust enough and will induce errors. A clustering algorithm based on a well-chosen dissimilarity matrix (n × n) is more appropriate and more robust given the relatively moderate size of the matrix.
The use of a smoothing function for the spectra performs better for time series or for monotonic curves. We have previously successfully applied this smoothing function to large-scale proteomics data [25].
The application of Euclidean or Mahalanobis distances for instance may not perform well for this proteomics dataset, since those distances usually successfully applied to a typical data with specific expression, spherical or ellipsoidal (normally distributed data). A new dissimilarity measure has to involve other criteria such as the wealth of data points for each observation and the parallel nature expressed by the proteomics curve (or time series). On the other hand, a robust dissimilarity measure may perform badly on a curve with large data points or peaks.
Functional smoothing of proteomics expression profiles or spectra has proven to be very helpful. This has allowed us to minimize the number of peaks to retain only the ones that passed the performance of the FDA smoothing. In this study, after using FDA, we succeeded in retaining 50% of the smoothed peaks. The FDA with the dissimilarity measure δ B shows better performance by comparison to δ C and δ HZ known to perform well along with FDA on times-series data or on monotonic curves.
The two remaining difficulties that naturally arose are (1) to find meaningful peaks that can be used to provide better discrimination between the clusters, (2) to propose the optimal number of clusters instead of choosing them a priori. The model selection criteria might be useful to answer those questions. In fact, model selection scores use two components for selecting the number of variables and the number of clusters in a given density-based cluster analysis. The first term is the lack of fit generally proportional to the likelihood function. The second term is the penalty term (complexity term). For such proteomics dataset, we propose to use the sum of the negative δ B dissimilarity measure between all the observations to their closest medoids as a lack of fit function. The penalty term might be simple to derive but biased using AIC and BIC, for example, or it can be more difficult to derive if one used a more robust method such as information complexity-based criteria.

ACKNOWLEDGMENT
This work was supported by the SRGP Award by the College of Business, University of Tennessee in Knoxville, by the Leukemia Lymphoma Society, and the National Institutes of Health.