Bicoid Signal Extraction with a Selection of Parametric and Nonparametric Signal Processing Techniques

The maternal segmentation coordinate gene bicoid plays a significant role during Drosophila embryogenesis. The gradient of Bicoid, the protein encoded by this gene, determines most aspects of head and thorax development. This paper seeks to explore the applicability of a variety of signal processing techniques at extracting bicoid expression signal, and whether these methods can outperform the current model. We evaluate the use of six different powerful and widely-used models representing both parametric and nonparametric signal processing techniques to determine the most efficient method for signal extraction in bicoid. The results are evaluated using both real and simulated data. Our findings show that the Singular Spectrum Analysis technique proposed in this paper outperforms the synthesis diffusion degradation model for filtering the noisy protein profile of bicoid whilst the exponential smoothing technique was found to be the next best alternative followed by the autoregressive integrated moving average.


Introduction
Morphogens are molecules which determine a cell's destiny in a concentration-dependent mode by governing the pattern of tissue development and the position of various specialized cell types within a tissue in the process of morphogenesis [1][2][3]. A classic example of morphogens is bicoid (bcd), which is the first known morphogen identified by Nsslein-Volhard in 1988 [1] and encodes a homeobox transcription factor (in what follows, the italic lower-case bcd represents either the gene or mRNA and Bcd refers to protein). bcd is localised at the anterior end of the egg during the oogenesis [2] and translation of bcd begins after fertilization. Consequently, Bcd distributes along the anterior-posterior (AP) axis of the egg, forming a concentration gradient [2]. Such diffusion of Bcd by regulating the production of the anterior structures determines the position and size of head and thorax of an adult fruit fly (http:// highered.mcgraw-hill.com).
Several computational models have been published for Bcd gradient over the last three decades (see, for example, [3]). However, as the Bcd profile achieved by fluorescence antibodies technique is highly volatile, some proposed models, such as the simple synthesis diffusion degradation (SDD) model, only exhibited limited performance [4,5]. They fail to clearly explain some characteristics of the Bcd gradient, such as protein life time and length constant [3,[6][7]. Figure 1 shows a typical example of the Bcd gradient along the egg length at cleavage cycle 14 (3), effect of noise (i.e., fluctuations visible in Figure 1) in this gradient can be seen as the high volatile pattern. An initial look at the distribution suggests Bcd follows an exponential trend. However, owing to the high volatility seen in the series, the extraction of this signal is not a simple task.
SDD, which was formulated before the identification of bcd [4,[8][9][10][11], is the most widely-accepted among the models used to explain Bcd diffusion pattern. SDD ich. As a relatively simple model, SDD follows an exponential curve [12]: where A is the amplitude, x is distance from the anterior [13], and k is the length parameter obtained by fitting an exponential model to the bcd intensity profile and computing the position at which the concentration has dropped to 1=exp of the maximal value at the anterior (at x ¼ 0) [3]. However, this model is not fully consistent with all the experimental observations. For example based on [4], if Bcd molecules diffuse along the embryo with diffusion constant D and Bcd lifetime of s, the concentration of Bcd in this model follows: @mðx; tÞ @t ¼ D @ 2 @x 2 mðx; tÞ À s À1 p mðx; tÞ þ Sðx; tÞ ð 2Þ where, x and t represent positions along the egg and time, respectively, Sðx; tÞ is a source function describing the production of Bcd molecules, mðx; tÞ is the formed concentration, and s p represents protein lifetime [14]. Nevertheless, when using this model the time needed for attaining the steady state concentration profile is much longer than the protein lifetime s, whereas the length constant k is much smaller than the length of the embryo. Moreover, pattern of Bcd expression established by any model should be flexible to different time scales, egg length, and embryos sizes [4,10]. Not only in developmental studies but also in all fields of genetic studies, signal extraction and noise reduction are regarded as important tasks since genetic data are often characterized by the existence of considerable noise. Many methods are utilized for signal extraction, such as machine learning algorithms [15,16] and different background removal techniques [17][18][19]. In this paper we evaluate the use of powerful and popular signal processing techniques which include both parametric and nonparametric methods to provide a sound extraction of Bcd signal. Our aim is to examine whether the selected signal processing models can provide a more accurate signal extraction of Bcd in comparison to SDD.
The selection of models representing both parametric and nonparametric methods is important for several reasons. Firstly, as seen below, the residuals following signal extraction in Bcd are nonstationary. Secondly, parametric models rely on the underlying assumptions of normality and stationarity, and interestingly, SDD model is parametric. Thirdly, as noted in [20], for the parametric methods, assuming stationarity for the data, linearity of the model and normality of the residuals can provide only an approximation of the true situation. Therefore, a method that does not depend on these assumptions could be very useful for modelling and extracting the signal in Bcd data. Moreover, previous applications in solving signal extraction problems were taken into account when selecting models in this study. We use the SDD model as the overall benchmark as it is the most widely accepted approach for signal extraction in Bcd. We also consider the parametric autoregressive integrated moving average (ARIMA) [21], which has been applied for signal extraction in various fields both historically and recently (see for example, [22][23][24]). In addition, autoregressive fractionally integrated moving average (ARFIMA), which is mainly recognized as a parametric method suitable for long memory processes where the decay is slower than in an exponential process [24], is included for comparison as well. Other parametric models considered are state space models such as exponential smoothing (ETS), since SDD in itself follows an exponential curve [12]. Singular spectrum analysis (SSA) technique (like neural networks, NN) is a nonparametric signal processing model and does not rely on any assumptions [25]. The SSA technique was initially evaluated for gene expression [26] and has been previously applied for signal extraction [2,6,7,[26][27][28][29][30]. Therefore, the models used in this paper include an optimized version of ARIMA [31], an ARFIMA model [31], ETS [32], a feed forward NN model [32], and SSA [33].
Gene expression can be traced either in time or space. The data points used in this study represent the intensity levels for the positions along the AP axis and are considered as a sequenced series. Therefore, one-dimensional gene expression data are used for signal extraction and the second spatial coordinate (Dorsoventral DV axis) has not been considered in this study. Moreover, it is important to note that this paper is not aimed at showing any particular technique to be universally best for modelling the Bcd signal. Instead we are mainly interested in showing how the selected signal processing techniques compare and compete against each other, and the widely accepted SDD model. Any efforts at finding a universally optimal model for this purpose would require more extensive research which considers a wide range of filtering techniques and that objective is beyond the mandate of this paper.

ARIMA
An optimized version of the ARIMA model is provided through the forecast package in R, referred to as auto:arima, and a detailed description of the algorithm can be found in [32]. In brief, the number of differences is defined as d, which may be determined using either a Kwiatkowski-Phillips-Sch midt-Shin (KPSS) test, augmented dickey fuller test, or the Phillips-Perron test. The algorithm then minimises the Akaike information criterion (AIC) to determine the values for the order of autoregressive terms p, and the order of the moving average process q. The optimal model is chosen to be the model, which represents the smallest AIC. The decision on the inclusion or exclusion of the constant c is made depending on the value of d.
To expand on the above summary, we provide the following modelling equations for ARIMA based on [33]. A nonseasonal ARIMA model may be written as: where l is the mean of ð1 À BÞ d ðy t , c ¼ lð1 À / 1 À . . . À / p Þ and B is the backshift operator. In the R software, the inclusion of a constant in a non-stationary ARIMA model is equivalent to inducing a polynomial trend of order d in the forecast function. It should be noted that when d ¼ 0, l is the mean of y t . The seasonal ARIMA model can be expressed as [32]: where UðzÞ and HðzÞ are the polynomials of orders P and Q, and e t is white noise. There is an implied polynomial of order d þ D in the forecast function, if c-0. As mentioned previously, to determine the values of p and q, the AIC of the following form is minimised: where k=1 if c-0 and 0 otherwise, and L represents the maximum likelihood of the fitted model.

ARFIMA
The general form of an ARFIMA(p,d,q) model shares the same form as an ARIMA process shown in equations (3). However, in contrast to the ARIMA models, here d is allowed to take the form of non-integer values. The ARFIMA model used here is estimated automatically using the Hyndman and Khandakar [31] algorithm explained above, and the Haslett and Raftery [34] algorithm for estimating the parameters including d. Moreover, this Hyndman and Khandakar [31] ARFIMA algorithm combines the functions of fracdiff and auto:arima to automatically select and estimate an ARFIMA model. Initially, the fractional differencing parameter is assumed to be an ARFIMA(2,d,0) model. Thereafter the data are fractionally differenced using this estimated d and an ARMA model is selected for the resulting series using auto:arima. Finally, the full ARFIMA(p,d,q) model is reestimated using the fracdiff function.

ETS
The ETS technique overcomes a limitation found in earlier ETS models that did not provide a method for easy calculation of prediction intervals [35]. The ETS model from the forecast package considers the error, trend, and seasonal components along with over 30 possible options for choosing the best ETS model via optimization of initial values and parameters using the maximum likelihood estimator and selecting the best model based on the AIC. A detailed description of ETS can be found in [32].

NN
The NN model has been successfully used for gene expression profiling, clustering and also gene identification [36][37][38]. NN model is referred to as nnetar and provided through the forecast package for R. A detailed description of the model can be found in [32] along with an explanation on the underlying dynamics. In brief, the nnetar function trains 25 NNs by adopting random starting values and then obtains the mean of the resulting predictions to compute the forecasts. NN takes the form where x t consists of p lags of y t and T denotes transpose. Then, the function w has the logistic form: This form of NN is referred to as one hidden layer feed forward NN model and is the default version in the package. However, we consider the use of multiple hidden layers in order to select the best NN model for these types of data. The nonlinearity arises through the lagged y t , entering in a flexible way through the logistic functions of (8). The number of logistic functions (k) included is known as the number of hidden nodes.
The NN model in this paper is estimated using the automatic forecasting model, nnetar which is provided through the forecast package in R. For a detailed explanation on how the nnetar model operated, see the 'Package forecast' documentation (http://cran.r-project.org/web/packages/forecast/forecast.pdf). The parameters in the NN model are selected based on a loss function embedded into learning algorithm.

SSA
SSA has been applied for extracting the signal of Bcd and other segmentation genes [2,6,26,27]. The basic SSA method consists of two complementary stages: decomposition and reconstruction; of which each stage includes two separate steps. At the first stage the Bcd is decomposed into the sum of a small number of independent and interpretable components such as a slowly varying trend and a structureless noise [33], and at the second stage the noise-free Bcd is reconstructed [39]. The SSA modelling process for Bcd is summarized below, and in doing so we mainly follow [33].
The first step is concerned with mapping a one dimensional time series Y N ¼ ðy 1 ; . . . ; y N Þ into the multi-dimensional series This process is referred to as embedding whilst the vectors X i are called L-lagged vectors. The single choice of the embedding stage is the window length L, which is an integer such that 2 6 L 6 N À 1. This step results in the trajectory matrix X, which is also a Hankel matrix and takes the form: Next we obtain the singular value decomposition (SVD) of the trajectory matrix and represent it as a sum of rank-one bi-orthogonal elementary matrices. The eigenvalues of XX T are denoted by k 1 ; . . . ; k L in decreasing order of magnitude ðk 1 P . . . k L P 0Þ and by U 1 ; . . . ; U L the orthonormal system. Set then the SVD of the trajectory matrix can be written as: where . The expansion (9) is uniquely defined if all the eigenvalues have a multiplicity of one. The process of splitting the elementary matrices X i into several groups and summing the matrices within each group is called grouping and transfusing each resultant matrix from grouping step to a less noisy series is called diagonal averaging.
Note that usually the first eigenvalue corresponds to the trend of a given dataset when using SSA. Thus we extract the first eigenvalue alone and consider the remainder as noise, and then perform diagonal averaging to transform the matrix containing the first eigenvalue into a time series which will now provide the extracted signal from Bcd and can therefore be compared with the other models.
Details on which models are parametric or nonparametric are presented in Table S1.

Simulation results
A series of simulated data are used to evaluate the performance of different techniques. We begin the simulation by considering an exponential curve drawn from the SDD model as the benchmark. In order to obtain a noisy series similar to the real one, random error e of a normal distribution with zero mean and variance r 2 e with different amplitudes were added to different parts of the series [7]. This simulation is repeated 1000 times. Finally, by fitting the different mentioned models to the noisy simulated Bcd series, the following metrics are calculated in order to measure the accuracy of signal extraction. These include the mean absolute error (MAE), mean absolute percentage error (MAPE), and the ratio of the root mean squared error (RRMSE).

MAE
100 Â y Tþh À y Tþh;i y Tþh ; ð11Þ where,Ŝ l are the estimated values of s i , obtained via an alternate model andS l are the estimated values of s i obtained through SDD and N is the series length. The alternate model outperforms the SDD method if RRMSE < 1, and performs worse than SDD if RRMSE > 1. Table 1 reports the average RMSE values attained by each model following 1000 iterations and some other descriptives relating to the performance of each model. A significant reduction in the RMSE value is achieved by SSA, confirming that these results are more accurate than those estimated by SDD and other models. Based on the RRMSE criterion, the parametric SDD model reports the worst performance in comparison to the other models considered in this simulation. SDD is outperformed by 66%, 58%, and 52% by the ETS, ARIMA, and ARFIMA, respectively. Interestingly, the nonparametric feed-forward NN model is the second worst performer and outperforms the SDD model by 45%. It should be noted that we have examined the use of multiple hidden layers and selected a NN model with two hidden layers as the most appropriate for these data based on the lowest RMSE and MAE. According to RRMSE, SSA provides the best signal extraction and is successful at outperforming the SDD model by 80%. The MAE and MAPE criteria also confirm that SSA is the best model in comparison to SDD, ARIMA, ARFIMA, ETS, and NN, whereas SDD is in fact the worst performer in this case. The minimum and maximum columns clearly indicate that there is less variation in the results reported by SSA and accordingly we can confirm that SSA is the most stable model in this case.
In order to verify the statistical significance of the simulation results, we opted for the nonparametric two-sample Wilcoxon test, which could indicate whether the RMSE values attained from two given methods via simulation actually differ in terms of the size. Our results showed a significant difference between the RMSE values obtained via SDD and all other models (P = 0.01), further confirming the validity of the results.
The superior performance portrayed by the SSA technique could be explained by several factors. First, SSA model is a specialised filtering technique with the ability of decomposing a given time series and analysing the eigenvalues to accurately identify and separate the noise from the signal. The appropriateness of the separation between signal and noise obtained via SSA was confirmed by the very small values of w-correlation, indicting that the signal and its corresponding noise are almost w-orthogonal [40]. Secondly, it is also interesting to note that the minimum and maximum errors (Table 1) reported by SSA over the 1000 simulations are significantly lower than the minimums and maximums reported by the other models. As a result, it is clear that the SSA technique is more reliable and suitable for signal extraction in Bcd as the average RMSE, MAPE and MAE values are significantly lower than those reported by the other models over the 1000 iterations (more detailed results of the simulation study are available upon request).

Bicoid data
Next, we used the 17 series (http://urchin.spbcas.ru/flyex/) presented by Alexandrov et al. [26] as the real data for further analysis. A complete explanation on the method and biological characteristics of the 17 series can be found in [30,40,41].
The expression level of Bcd protein in each Drosophila embryo was measured by using fluorescently-tagged antibodies. Such quantification relies on the assumption that the actual protein concentrations detected by the antibodies and the fluorescence intensity are linearly related to the Bcd protein concentration in the embryos. In this study nuclear intensities were obtained from a rectangle of 50% of the DV height of the embryo, centred on the AP axis. These data present the gene expression of the AP coordinate between 20% and 80% egg length, which can be considered as a sequenced series. Similar to [26], we set to extract the signal from onedimensional gene expression data, hence, the second spatial coordinate (DV axis) has not been considered.
First we seek to extract the signal in the actual data using various signal processing techniques. The examples of the output from these efforts for embryo hz29 can be found in Figure 2. It is evident that in comparison to the other models, Figure 2 Signal extraction using various signal processing techniques and SDD model Black and red colours depict the noisy series and the extracted signal, respectively. SDD, synthesis diffusion degradation.
the SSA method provides a relatively smooth signal line. In addition, SDD provides a smooth line as opposed to ARIMA, ARFIMA, and NN models. However, SDD signal appears to be least accurate based on the RMSE, MAE, and MAPE criterions. Overall, the results from the application to real data appear to be consistent with the simulation findings based on data shown in Figure 2.
A close look at Figure 2 suggests that the SDD signal line is the smoothest one out of the evaluated options. However, SDD signal extraction is also the worst fit, as it fails to accurately model the signal amidst the fluctuations, although it appears to have filtered these fluctuations out. The feedforward NN model with two hidden layers has difficulties in filtering the fluctuations to accurately capture the signal in Bcd. Similar issues exist to various extents for ARIMA and ARFIMA models (ARFIMA being relatively worse than ARIMA). In contrast, ETS and SSA are the most effective ones. Yet, scrutinization of the ETS signal extraction graph revealed that ETS line loses its smoothness to some extent at the middle stages, whereas SSA model is able to provide a smooth signal line right throughout. Therefore, based on the smoothness of the extracted signal, we conclude that SSA does indeed capture the signal in Bcd relatively better than the other methods tested. Figure 3 shows the residuals from each model following signal extraction. The residuals for the Bcd data following signal extraction are nonstationary. To verify this observation, we examined each of the residuals using the augmented Dickey-Fuller (ADF) test for unit roots. Our results showed that the residuals are in fact nonstationary (P = 0.01). Interestingly, the parametric models of ARIMA and ARFIMA are able to provide a relatively sound signal extraction for Bcd, although the data are nonstationary. ARIMA algorithm used in this paper automatically considers taking the number of differences until the series becomes stationary, whereas for ARFIMA model, we evaluated a log transformation, which worsened the signal extraction. However, residuals from SDD model do not appear to be white noise but have a clear signal. Accordingly, we tested the residuals from the parametric models for white noise using the Ljung-Box test and indicated that SDD residuals are not white noise (P = 0.01). This further explains the comparatively mediocre performance shown by SDD when applied to real data.
Finally, we evaluate the correlation between the signal and noise extracted from each model based on Pearson, Kendall and Spearman correlation coefficients in order to analyze the noise separation capabilities. A correlation coefficient close to zero means that signal and noise do not have a correlation and hence are well seperable. The correlations for all 17 datasets are reported in Table 2. It is evident that all models tested have attained a satisfactory level of separation between noise and signal with SSA providing correlations below 0.10 in 15 out of the 17 cases. Moreover the correlations reported by SSA tend to be smaller (with the exception of a few cases). These results further support the relatively sound performance of SSA in extracting the Bcd signal. The filtering capabilities displayed by SSA are indeed advantageous for signal extraction. Extracting the Bcd degradation signal from the noisy data is central to this study. We tested various models using both simulated and real data to ensure the validity of the findings. The obtained results illustrate SSA outperforms the SDD model and other methods tested in this study for filtering noisy Bcd. SSA is more flexible than the SDD for the Bcd degradation modelling, whereas ETS was the next best alternative followed by ARIMA. However, it should be noted that we used the simplest form of the SDD model. More advanced models have been developed as the solution to simple SDD model (see for example [42]), which are expected to give more reliable results in Bcd signal extraction. Moreover, we tested both parametric and non-parametric algorithms in this study, it is of note that non-parametric algorithms are not explicitly better than parametric algorithms. Their performance depends and varies on the nature of the data in question. In the case of Bcd signal extraction, we find the nonparametric SSA model outperforming the rest. Interestingly the simple parametric model of ARIMA outperforms the nonparametric NN algorithm. The poor performance of the NN model can be attributed to its proneness to overfitting.
In conclusion, our results confirm that filtering is very important for Bcd curve fitting and the SSA technique yields a promising result for Bcd analysis. Also, in comparison to other parametric and nonparametric methods evaluated in this paper, using SSA for signal extraction gives the ability of using both dimensions (AP and DV) which leads to a more reliable result. It would be insightful to consider various other filtering approaches, such as nonparametric linear filtering, wavelets, and the NN models (as described in [43,44]) for trend extraction in Bcd in order to examine any differences in performance in comparison not only to ETS and more advanced SDD models, but also to SSA, which has shown promising results.