Analysis of proteomics data: Impact of alignment on classification

The Fisher Rao curve registration is used for curve alignment. Quality of registration is carefully studied using zoomed views. A related linear warp is considered, and seen to give somewhat inferior performance. Alignment is also seen to give large improvements in the ultimate classification problem.


Fisher Rao alignment and linear warping
The Fisher Rao domain warping approach proposed by Srivastava et al. (2011) does a good job in aligning the marked spike features in the proteomics data, presented by Koch et al. (2014). The resulting aligned functions and the warping functions are shown in the top two panels of Figure 1 respectively, colored by sample type. The numbers in the top left plot show the location of the marked spikes. The vertical ordering of the numbers is determined by the relative height of the peaks of the intensity function. It is seen that these landmarks are very well lined up after Fisher Rao alignment.
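To make the relation between a function, its warping function and its aligned version concrete, the following minimal sketch applies a given warp to a discretized curve and computes the square-root velocity function (SRVF), q = sign(f') sqrt(|f'|), on which the Fisher Rao framework is built. It does not perform the optimization over warps. The spike, the grid and the warp `gamma` are hypothetical toy choices, not the proteomics data, and the helper names are ours.

```python
import numpy as np

def srvf(f, t):
    """Square-root velocity function q = sign(f') * sqrt(|f'|) on grid t."""
    df = np.gradient(f, t)
    return np.sign(df) * np.sqrt(np.abs(df))

def apply_warp(f, t, gamma):
    """Evaluate the warped curve f(gamma(t)) by linear interpolation."""
    return np.interp(gamma, t, f)

def l2_norm(y, t):
    """L2 norm of y on grid t, using the trapezoidal rule."""
    y2 = y ** 2
    return np.sqrt(np.sum(0.5 * (y2[1:] + y2[:-1]) * np.diff(t)))

# Toy curve: one Gaussian spike centred at 0.5 (hypothetical data).
t = np.linspace(0.0, 1.0, 1001)
f = np.exp(-0.5 * ((t - 0.5) / 0.02) ** 2)

# Hypothetical warping function: increasing, with gamma(0)=0 and gamma(1)=1.
gamma = t ** 2
f_warped = apply_warp(f, t, gamma)

# The spike moves from t = 0.5 to the point where gamma(t) = 0.5.
peak_before = t[np.argmax(f)]
peak_after = t[np.argmax(f_warped)]

# Warping acts on the SRVF as (q o gamma) * sqrt(gamma'), which preserves
# the L2 norm up to discretization error.
q = srvf(f, t)
q_warped = np.interp(gamma, t, q) * np.sqrt(np.gradient(gamma, t))
norm_q, norm_q_warped = l2_norm(q, t), l2_norm(q_warped, t)
```

The norm preservation in the last lines is the property that makes the distance between SRVFs invariant to a common warping of both curves.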
The Fisher Rao warping functions (top right in Figure 1) are approximately linear, especially on the interval between the two vertical dashed lines. To investigate how well an exactly linear transformation would work for these data, we replaced the segments on this interval with linear approximations. In particular, for each original warping function, a linear regression model is fitted to the warping function values between the two dashed lines, and the resulting linear function is substituted for the corresponding segment of the original warping function. See the bottom right panel for the linearized warping functions. The corresponding aligned functions are shown in the bottom left panel, where both the curves and the landmarks look similar to those from the Fisher Rao alignment (top left). Note that the color pattern (i.e. the order of the function values) at some landmarks, such as Spikes 9 and 11, becomes slightly different after the linearized warping. This is because, computationally, these functions are discretized at a limited number of time points.
We zoomed in at each marked spike to further compare the performance of the original Fisher Rao alignment with that of the new alignment using the linearized warping functions. For most of the marked spikes, the original Fisher Rao alignment is better than the linearized alignment. An example is shown in the top panels of Figure 2. However, for a few marked spikes, such as Spike 7 (bottom panels), the linearized alignment (right) may perform better than the original Fisher Rao alignment (left). It is seen from the two plots that the linearized approach aligns Spike 7 of the red samples (with low intensity at Spike 7) better than the Fisher Rao approach, while for the other samples the Fisher Rao approach does a better job. This explains why, for this data set, simple linear methods, e.g. Bernardi et al. (2014), can give reasonable results, although the Fisher Rao results are slightly better.
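The linearization step can be sketched as follows, assuming each warping function is stored as a vector of values on a common grid. The warp `gamma` and the interval end points `lo` and `hi` (standing in for the two dashed lines) are hypothetical, and `linearize_warp` is an illustrative helper, not code from the original analysis.

```python
import numpy as np

def linearize_warp(t, gamma, lo, hi):
    """Replace gamma on [lo, hi] by its least-squares line on that interval.

    Values of gamma outside [lo, hi] are left untouched.
    """
    inside = (t >= lo) & (t <= hi)
    # Ordinary least-squares fit of gamma against t, restricted to the interval.
    slope, intercept = np.polyfit(t[inside], gamma[inside], deg=1)
    gamma_lin = gamma.copy()
    gamma_lin[inside] = slope * t[inside] + intercept
    return gamma_lin, slope, intercept

# Hypothetical warping function: close to the identity, with a mild wiggle.
t = np.linspace(0.0, 1.0, 501)
gamma = t + 0.03 * np.sin(2 * np.pi * t)

gamma_lin, slope, intercept = linearize_warp(t, gamma, lo=0.2, hi=0.8)

# Largest change introduced by the linearization on the grid.
max_change = np.max(np.abs(gamma_lin - gamma))
```

Because only the segment between `lo` and `hi` is replaced, the linearized warp can have small jumps at the two end points; with a positive fitted slope it remains increasing within the interval.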

Classification of responders vs. non-responders
We show that the Fisher Rao alignment greatly improves data visualization and classification of the responders to chemotherapy against the non-responders.
The PC score scatter plots before and after the Fisher Rao alignment are displayed in the left two panels of Figure 3. The symbols differentiate the responders (crosses) from the non-responders (circles). The first plot shows overlap among different samples before alignment, while in the second plot the samples are better clustered (replications of the same biological sample are much closer to each other) and better separated. To further investigate the difference between the responders and the non-responders, we projected the data onto the Distance-Weighted Discrimination (DWD) direction (Marron et al., 2007) that separates these two classes. The right two panels of Figure 3 show the corresponding DWD scores before and after the alignment. It is seen that the two classes are much better separated after alignment, and the distributions of the two subpopulations are more Gaussian. These visual improvements brought by the Fisher Rao approach are quantitatively studied in Table 1. In particular, the clustering of the two classes was studied using the SWISS permutation tests (Cabanski et al., 2010), and the mean difference between these two classes was studied using DiProPerm (Direction Projection Permutation) t-tests based on the DWD directions (see Wei et al. (2013) for details). The resulting p-values are listed in the table. It is seen that both data clustering and classification are greatly improved after the Fisher Rao alignment, which is consistent with the previous discussion of Figure 3. Table 1 also shows results from the linearized alignment. It is seen that the Fisher Rao warping exceeds the linearized warping in both clustering and classification of the data. Neither of these is statistically significant, but that is not surprising given the very small sample size available.
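The logic of the two quantitative summaries can be illustrated with a short sketch on simulated data. Two simplifications are made and should be read as assumptions: the SWISS score is taken to be the ratio of within-class to total sum of squares, and the DWD direction is replaced by the much simpler mean-difference direction, so this is a DiProPerm-style test rather than the exact procedure used for Table 1. All function names and the toy data are hypothetical.

```python
import numpy as np

def swiss_score(X, labels):
    """Within-class sum of squares divided by total sum of squares.

    Smaller values indicate tighter clustering of the classes.
    """
    total = np.sum((X - X.mean(axis=0)) ** 2)
    within = sum(
        np.sum((X[labels == c] - X[labels == c].mean(axis=0)) ** 2)
        for c in np.unique(labels)
    )
    return within / total

def md_direction(X, labels):
    """Unit mean-difference direction (a simple stand-in for DWD)."""
    w = X[labels == 1].mean(axis=0) - X[labels == 0].mean(axis=0)
    return w / np.linalg.norm(w)

def t_statistic(scores, labels):
    """Two-sample (Welch) t-statistic of the projected scores."""
    a, b = scores[labels == 1], scores[labels == 0]
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return (a.mean() - b.mean()) / se

def diproperm_pvalue(X, labels, n_perm=500, seed=0):
    """Direction-projection-permutation test.

    The direction is re-estimated for every permutation of the labels, so
    the null distribution accounts for overfitting in high dimensions.
    """
    rng = np.random.default_rng(seed)
    observed = t_statistic(X @ md_direction(X, labels), labels)
    null = np.empty(n_perm)
    for i in range(n_perm):
        perm = rng.permutation(labels)
        null[i] = t_statistic(X @ md_direction(X, perm), perm)
    p_value = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, p_value

# Simulated data: 10 samples per class, 50 features, shift in 5 features.
rng = np.random.default_rng(1)
labels = np.repeat([0, 1], 10)
X = rng.normal(size=(20, 50))
X[labels == 1, :5] += 3.0

score = swiss_score(X, labels)
observed, p_value = diproperm_pvalue(X, labels, n_perm=200)
```

Re-estimating the direction inside the permutation loop is the essential point: in high dimensions almost any labelling can be separated along some direction, so the observed statistic must be compared with statistics obtained in the same way from relabelled data.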
Finally, we investigate which peptides (or spikes) play an important role in classifying the responders against the non-responders. For example, in the top left panel of Figure 1, at Spike 3, the red/orange numbers (i.e. responders) are perfectly separated from the blue/cyan numbers (i.e. non-responders). That is, the reference peptide 3 is important in classification and is less prevalent in responders. Peptides 7 (small in responders) and 8 (large in responders) are also important, each with only one misclassified number. In order to identify all of the potentially important peptides, Figure 4 shows the loading plot together with the reference spikes in the aligned functions. Peptides with large absolute loadings are important in classifying the responders. Note that the important peptides 3, 7 and 8 discussed above correspond to prominent peaks/valleys in the loading plot. On the other hand, some reference peptides, such as 1 and 14, do not contribute much to the classification, as their loadings are close to 0. It is also seen that some unmarked peptides turn out to be important in classifying the responders, with prominent peaks/valleys in the loading plot, such as the big negative spike between reference peptides 7 and 8. Further study of these peptides should be considered.
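Reading off the important positions from a loading vector amounts to ranking its entries by absolute value, as in the following minimal sketch. The loading values are made up for illustration and are not those of Figure 4; the sign convention (positive meaning larger in responders) is likewise an assumption.

```python
import numpy as np

def top_loadings(direction, k=3):
    """Return (index, loading) pairs for the k largest absolute loadings."""
    order = np.argsort(-np.abs(direction))[:k]
    return [(int(i), float(direction[i])) for i in order]

# Hypothetical loading vector over 8 grid positions, scaled to unit length.
loadings = np.array([0.01, -0.05, -0.62, 0.10, 0.02, -0.45, 0.58, 0.03])
loadings = loadings / np.linalg.norm(loadings)

top = top_loadings(loadings, k=3)
top_indices = [i for i, _ in top]
```

Entries with loadings near zero, such as the first position in this toy vector, contribute little to the projection and hence to the separation of the classes.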