A glance at the applications of Singular Spectrum Analysis in gene expression data

In recent years Singular Spectrum Analysis (SSA) has been used to solve many biomedical issues and is currently accepted as a potential technique in quantitative genetics studies. Presented in this article is a review of recent published genetics studies which have taken advantage of SSA. Since Singular Value Decomposition (SVD) is an important stage of this technique which can also be used as an independent analytical method in gene expression data, we also briefly touch upon some areas of the application of SVD. The review finds that at present, the most prominent area of applying SSA in genetics is filtering and signal extraction, which proves that SSA can be considered as a valuable aid and promising method for genetics analysis.


Introduction
Nowadays there exists a large amount of datasets in the field of genetics and expression measurement and there are many different methods and techniques for analysing these datasets [1][2][3]. However, it has been widely accepted that the major difficulties in working with such data no longer is the validity of expression measurements, but the reliability of inferences from the data as the achieved data is difficult to understand without the use of proper analytical tools [4] and if research result is obtained from an inappropriate model it can never be translated into a valid scientific context [5].
Historically, such data have been analysed using parametric methods [6,7]. However, constraining pre-assumptions needed for parametric approaches led towards the growing popularity of nonparametric methods. Recently it has been concluded that nonparametric techniques can be used as an alternative approach for analysing genetics data because of their inherent nature [8], and accordingly the applications in biomedical and genetics fields have expanded.
Among many non-parametric methods, Singular Spectrum Analysis (SSA) is a relatively new approach which has proven to be very successful. SSA has already transformed itself into a standard tool in the analysis of biomedical, mathematical, geometrical and several other time series [9][10][11] and recently it has also been applied in genetics which has illustrated its strong potential for such studies [12,13].
The emergence of SSA is usually associated with the work by Broomhead in 1986 [14]. However, the ideas of SSA were independently developed in Russia and in several groups in the UK and USA. Since after, several papers on the methodology and applications of SSA have been published (see, for example, [15,16]). An introduction to this technique can be identified in the paper by Elsner  Tsonis [17] and a comprehensive description with several examples of theoretical and practical aspects of SSA can also be found [8,18].
The main advantages of the SSA technique in the field of genetics can be attributed to its signal extraction and filtering capabilities [19], batch processing of a set of similar series [20] and derivation of an analytical formula of the signal [21]. The application of SSA for noise reduction in microarrays, and signal extraction in gene expression data has received more interest. The reason underlying the significant interest in the SSA technique's filtering capabilities are due to the fact that genetics data is often characterised by the existence of considerable noise, filtering this noisy data is considered as one of the most arduous tasks when analysing genetics data [22,23].
For example, microarray is a very useful method for acquiring quantitative data in genetics and researchers today are conducting most of their studies using this method. The main advantage of microarray is the capability of studying thousands of genes simultaneously. However, microarray data usually contains a high level of noise, which can reduce the performance of the results [24].
This article categorises and summarises almost all recently published articles associated with the application of SSA in genetics.
Presented below, is a short description of SSA technique in doing so we mainly follow [9] where a more detailed description is made available. Moreover, the R package for this technique including decomposition, forecasting and gap filling for univariate and multivariate time series can be downloaded via [25] Consider a set of genetics observations in a series of Y N = (y 1 , . . ., y N ) with length of N. After choosing a window length L where (2 ≤ L ≤ N − 1), we construct the L-lagged vectors X j = (y j , . Now X is our multivariate data with L characteristics and K observations. The columns X j of X, are the vectors, positioned in an L-dimensional space R L . Define the matrix XX T : SVD of XX T gives us the collections of L eigenvalues ( 1 ≥ · · · ≥ L ≥ 0) and the corresponding eigenvectors U 1 . . . U L where U i is the normalised eigenvector corresponding to the eigenvalue i (i = 1, . . ., L). A group of r (with 1 ≤ r < L) eigenvectors determines an r-dimensional hyperplane in the L-dimensional space RL of vectors X j . By choosing the first r eigenvectors U 1 , . . ., U r , then the squared L 2 -distance between this projection and X is equal to L j=r+1 j . Based on the SSA process, the L-dimensional data are projected onto this r-dimensional subspace and the final diagonal averaging gives us an appropriate approximation of the first one dimensional series.
The remainder of this paper is organised as follows. In the following section we present a review of papers involving the application of SSA and SVD 1 on signal extraction and noise reduction. The SSA based on minimum variance and a hybrid modelling approach combining SSA and autoregressive (AR) model are also discussed in depth in that section. Section 3 provides theoretical developments leading to what is termed as "two-dimensional SSA" and the paper ends with some conclusions in Section 4.

Signal extraction and filtering
In this section we identify existing applications of SSA for signal extraction and noise filtering in genetics.
The first such application is reported in 2006 where SSA was used for signal extraction of Drosophila melanogaster's gene 1 The SVD algorithms used in this paper are either based on Hankel or Teoplitz matrix. expression profile [19]. The idea of using SSA for signal extraction was then followed in an approximately similar study in [26] which led to an improved result. By 2008 a more technical study conducted on the methodology of signal extraction from the noisy Bicoid (Bcd) protein profile in Drosophila melanogaster was presented in [21]. The problem under investigation in that study was complicated by two facts: (i) the data contained outliers and (ii) that the data was exceedingly noisy and the noise consisted of an unknown structure. The author examined two approaches for reconstructing signal more precisely: the use of small window length and improvements to separability by adding a constant to the series.
In addition, the activation of the hunchback (hb) gene in response to different concentrations of Bcd gradient was studied in [27] and SSA was applied for filtering two kinds of noise; experimental noise and the noise caused by variability in nuclear order [27].

SSA based on minimum variance
Recently, a modified version of SSA was examined for filtering and extracting the bcd gene expression signal [28] and the results illustrated that SSA based on minimum variance can significantly outperform the previous methods used for filtering noisy Bcd [28].
SSA based on minimum variance mainly relies on the concept that by dividing the given noisy time series into the mutually orthogonal noise and signal+noise components, an enhanced estimation of the signal can be achieved. Thus, after performing SVD, by adapting the weights for different obtained singular components, an estimation of the Hankel matrix X, will be achieved which in turn is corresponds to a filtered series.
A short description of the SSA based on minimum variance is given below. For more details, see [28,29].
Let us begin with the Singular Value Decomposition (SVD) of the trajectory matrix X: where W is the diagonal matrix of the weights to be determined. The SVD of the matrix X can be written as: where U 1 ∈ R L×r , 1 ∈ R r×r and V 1 ∈ R K×r . Now, the issue is in selecting the weight matrix W. If we represent the SVD of the Hankel matrix related to the signal as S, by considering different criteria in choosing this matrix, different estimation of S can be achieved. The LS Estimate of S is the current widely used approach in selecting the weight matrix W. This approach is based on the idea of removing the noise subspace but keeping the noisy signal uncorrelated in the signal+noise subspace. However, the accuracy of this estimator is mainly dependent on the estimation of the signal rank r since selecting singular values in LS follows a binary approach. Although in using this estimator, considering any assumptions is not needed.
In MV Estimate of S proposed by Hassani in [29], removing the noise components in the signal + noise subspace has been improved considerably. However, to obtain the MV estimate considering some assumptions is essential (for more details see [29]).
Let us now consider the weight matrix W of the LS and MV estimates: where It should be noted that, U 1 and V 1 , of LS and MV estimates are the same, but the singular values are different. Fig. 1 shows the signal extracted by SSA-MV in [28] along with fitted trend obtained by Synthesis Diffusion Degradation (SDD) model as the most widely accepted model in analysing Bcd gradient. As it appears SSA-MV yields more promising result for Bcd signal extraction.

SSA combined with AR model
Among many applications of the microarray technique, the study of rhythmic cellular processes has been considered as an important application. Rhythmic cellular processes are mainly regulated by different gene products, and can be measured by using multiple DNA microarray experiments. Having a group of gene experiments over a time period, a time series of gene expression related to the rhythmic behaviour of that specific gene will be achieved. Several studies on extracting the regulatory information from time-series microarray data and detection of cyclic and nonuniformly sampled short time series of gene expressions can be found in [30][31][32]. The characteristics of the rhythmic gene expression is as follows [33]: the number of time points and cycles related to a profile is usually very few. For example, the 14 time point elutriation observations may be in constitution of just one cell-cycle [20], this data set usually contains many missing values which need to be determined [34], the intervals spaced between time points are not equal and the gene expression data is extremely noisy [33].
In 2008, Du et al. implied SSA for analysing microarray results for extracting the dominant trend from noisy expression profiles and reducing the effect of noise [12]. The authors have also proposed a new procedure for analysing the periodicity of the transcriptome of the intraerythrocytic developmental cycle (IDC) by combining autoregressive (AR) model and SSA. This combination enabled them to successfully identify almost 90% of genes (4496 periodic profiles) in P. falciparum, which was a considerable achievement in terms of detecting 777 additional periodic genes in comparison to the results in [35]. Fig. 2 depicts the improvement yielded by using SSA in analysing periodicity. As shown in this figure, due to the removal of noise using the SSA, spurious peaks are eliminated from the spectrum.
According to [33], the periodic profiles can be detected by combining SSA and AR as follows: • At the first step SSA is used for the reconstruction of each expression data. For this aim only those expression profiles with sum of first two eigenvalues over the sum of all eigenvalues greater than 0.6 are selected for reconstruction. • The second step is devoted to the calculation of the AR spectrum, frequency f i at peak value point and the ratio of the power in f i Regions of Interest (ROI) of the reconstructed profiles achieved in the first step. It should be noted that to obtain the ratio between the power of the signal within the frequency band [f i−1 , f i+1 ] and the total power we follow S = power i /power total . • If the obtained power ratio S is larger than 0.7, the corresponding profile would be classified as periodic [33].
Presented below is a summary of the theory combining SSA and AR based on [39]: Let the microarray data be g × N matrix (g N), where g is the number of gene expression profiles and N corresponds to the number of samples. Time series gene expression Y N = (y 1 , . . ., y N ) can then be written in a form of an AR (p) model by forwardbackward linear prediction as follows [39]: y i = ˛1y i−1 + ˛2y i−2 + · · · + ˛py i−p (i = p + 1, . . ., N). (5) Using the forward prediction linear system the AR coefficients (˛1, ˛2, . . ., ˛p) can be estimated and gene expression can be recognised as a linear system. In an AR (p) model, p has to be set in a way that the system models the desired trend and avoids redundant data such as noise. By choosing m gene expression profiles, the number of linear equations is equal to m × 2(N − p).
As mentioned above, the SSA technique is often applied prior to spectral analysis, as a filtering method, in order to achieve a better accuracy. This is done by ignoring the small singular values in the reconstruction stage. Removing the noise component from the original noisy signal Y N yielding the noise reduced seriesŶ N which can consequently be used for estimating the AR coefficients ˛i (i = 1, . . ., p), and ultimately estimated ˛i are obtained using Yule-Walker equations [39].

SVD
Although SVD is an stage of SSA technique, it has also been used independently as a very useful and applicable tool for analysing the microarrays data (see for example, [40]) in which the challenging problem of eliminating cross hybridization in real-time microarray data was considered. Vikalo et al. in [40] evaluated multiple techniques such as SVD for separating the components of the compound signal, to find a better estimation of the amounts of both hybridising and cross-hybridising targets.
However, the microarray technology has become increasingly affordable, the data is still very complicated to analyse. In 2010 Rau et al. in [41] presented an algorithm to infer the structure of gene regulatory networks using an empirical Bayes estimation procedure for the hyperparameters of a linear feedback statespace model and used SVD of the block-Hankel matrix for model selection. This approach enabled them to significantly reduce the computation time required for running the algorithm, as it eliminates the need to run the algorithm over a wide range of values for the hidden state dimension. Moreover, as discussed in [42] the numerical computations in SVD even for a huge data set such as microarray data is not time extensive. In that study a new approach for ranking genes is presented which is based on the genes degree of regulation and the authors showed that the ranking of genes according to regulation is genetically more meaningful rather than using the absolute expression or variation over time and this method can be a valuable aid in finding the regulatory pathways and networks. In that work SVD was performed to the block-Hankel matrix of observation autocovariances estimated from the gene expression data, and the number of singular values related to the large magnitude was performed as an estimate for the best state space dimension. The singular values of the estimated Hankel autocovariance matrix were calculated and standardised to a 0-1 scale. The number of singular values of magnitude larger than the threshold was used as an estimate for the state space dimension.
Based on the mentioned studies, SSA procedure is a flexible technique for the signal extraction and gene expression degradation modelling. The feasibility of capturing the signal from segmentation genes profile in Drosophila embryos and studying the rhythmic behaviour of a specific gene suggest that this technique may be of general use in evaluating other expressional systems. Thus, this method can be a valuable aid in analysing spatially inhomogeneous noisy data.

Application of two dimensional SSA
The 2D-SSA approach has been used to process two-dimensional scalar fields [8]. The first difference between 2D-SSA and SSA is about the window length. In 2D-SSA analysis we need to choose two different values for the window length L (L 1 , L 2 ), whilst in univariate SSA we only require to select one window length. Note that if L 2 = 1, then 2D-SSA is equivalent to SSA. By choosing L 2 = M, the interaction among different series is taken into account [43]. For more information see [44].
It is worthy to mention that the gene expression can be traced either in just anterior-posterior (AP) axis or both AP and dorsoventral (DV) axis which the latter case is the subject to 2D-SSA. The data points used in 2D-SSA study attributes to the intensity levels for the positions along both AP and DV axis and are considered as a sequenced series.
Bcd is a transcriptional regulator of downstream segmentation genes where the alteration in Bcd gradient shifts the downstream patterns [45]. However, it has been accepted that in the embryos, zygotic gene products are considerably more precisely positioned than the gradients related to maternal genes, which indicates an embryonic error depletion process [46]. In 2011 the anterior-posterior (AP) segmentation of Drosophila was studied in [47] to determine how gene regulation dynamics control noise. In this research the activation of the hb gene by the Bcd protein gradient in the anterior part was studied by modelling the noise observed in hb regulation using a chemical master equation approach. For solving this model, the MesoRD software package has been used which mainly follows a stochastic approach [48] and the results indicate that Hb output noise is mostly dependent on the transcription and translation dynamics of its own expression, and that the multiple Bcd binding sites located in the hb promoter also improve pattern formation [49].
Noise in that study is calculated by: where data is background-removed intensity of an energid (nucleus plus surrounding cytoplasm), trend is the signal extracted using 2D SSA (AP and DV) and m indicates the positions. This measurement is obtained for the activated region (15-45% EL). In 2012 Golyandina et al. used 2D SSA to measure betweennucleus variability (noise) seen in the gradient of Bcd in Drosophila embryos [50]. In that paper, they measured the noise using 2D SSA for comparing the results of fixed immunostained embryos with live embryos with fluorescent bcd (bcd-GFP).
It should be noted that using SSA for signal extraction gives the ability of using both dimensions (AP and DV) which leads to a more reliable result as in this case more detailed information regarding the data has been considered by the technique to give the results.

Conclusion
The aim of this paper was devoted to review the applications of SSA in genetics studies. Previous research has shown that the SSA technique is very effective in signal extraction and noise filtering in genetics data.
Theoretical developments presented here as two-dimensional SSA and SSA based on minimum variance have also enabled the researchers to achieve enhanced results and a better estimation of extracted signal. As a non-parametric method SSA has given us very promising results which are more reliable than those obtained by other methods. However, SSA has not revealed its full potential yet, areas like optimising the embedding dimension and change point detection are still open to pursue.