Robust background normalization method for one-channel microarrays Tek-kanallı mikrodizinlerde sağlam ardalan normalizasyon metodu

Background: Microarray technology, aims to measure the amount of changes in transcripted messages for each gene by RNA via quantifying the colour intensity on the arrays. But due to the different experimental conditions, these measurements can include both systematic and random erroneous signals. For this reason, we present a novel gene expression index, called multi-RGX (Multiple-probe Robust Gene Expression Index) for one-channel microarrays. Methods: Multi-RGX, different from other gene expression indices, considers the long-tailed symmetric (LTS) density, covering a wider range of distributions for modelling gene expressions on the log-scale, resulting in robust inference and it takes into account both probe and gene specific intensities. Furthermore, we derive the covariance-variance matrix of model parameters from the observed Fisher information matrix and test the performance of the multi-RGX method in three different datasets. Results: Our method is found to be a promising method regarding its alternatives in terms of accuracy and computational time. Conclusion: Multi-RGX gives accurate results with respect to its alternatives, with a reduction in computational cost.


Introduction
A microarray, also known as the DNA chip, gene chip or biochip, consists of thousands of DNA sequences, called 112 Tülay Akal et al.: Robust background normalization method for one-channel microarrays probes, which are attached to a solid surface to measure the gene expressions.By collecting the amount of the mRNA transcripts measured, an approximation to the gene expression level, are turned into pictures [1].
The oligonucleotide is a type of single channel microarrays with a 25-base pair long and the Affymetrix Gene-Chip is an example of the popular oligonucleotides.In this array, we observe probes that are the known segments of particular gene sequences.Each probe contains two parts, namely, the perfect match (PM) and the mismatch (MM).The former stands for the perfect transcription of cRNA and the latter is aimed to measure the faulty signals on the arrays by changing the 13th base pair of PM.
In a microarray study, we can define three main sources of variations in signals, namely, the nonspecific hybridization, the background signal and the stray signal.They are also shortly called as the background errors.Accordingly, we can call these three types of variations as the background errors [2,3].
Thereby, the term gene expression index stands for the mathematical method to estimate the true expression level in the oligonucleotides under such errorenous signals.There are a number of indices developed to infer the signals.In this study, we propose a novel gene expression index, called multiple-probe robust gene expression index, multi-RGX, for modelling the observed gene expressions in oligonucleotides.
Among many choices, one of the most common gene expression indices is the MBEI method [4], which cannot make inference for large numbers of arrays as the calculation is based on the least square approach.On the other hand, MAS5.0 [5], suggesting a smart solution to solve the less estimated intensities in PM regarding the estimated MM, may cause bias in inference of the true signal.Similarly, RMA [6] also fails under this condition.Similar to RMA, mgMOS [7] also accepts that the MM probe intensities are treated as only coming from the background signal.But different from RMA, this model makes use of latent variables, standing for different binding affinities of probes within a specified probe set in order to describe the correlation between the PM and MM intensities.
On the other side, GC-RMA [8] is the extended version of RMA by considering the GC content in the data and by adding, the first time, the presence of true signals in the MM probes.On the contrary, in practice, it equates the fraction of the true signal in MM to zero in order to simplify the calculation.But as the computation in the Bayesian framework is demanding, multi-mgMOS [9] considers performing it via a maximum likelihood approximation under the same model of BGX [10] so that the computational cost can reduce.However, being more efficient than BGX, the calculation is still time demanding for large datasets due to the application of the Bayesian method.Accordingly, FGX [2] is proposed to decrease the underlying cost by fully using the maximum likelihood approach in inference of model parameters.But, it is based on a strict normality assumption and the estimates are found for the probe mean of each gene.
Another recent method is called the Hook calibration method [11] which is a singlechip calibration alternative taking mismatched probes as the internal reference.Also, RMX [12] tries to find optimal robust estimators for multiple raw signals by making use of the Kolmogorov distance which is non-parametric.On the other hand, Lahti et al. [13] propose an online learning algorithm, called RPA which gives advantage to update probe-level parameters when the observations comes up.However, in modeling, it still uses the RMA method resulting in the same drawback in the description of the MM probes.
Hereby, in this study, we suggest a new gene expression index, which includes both gene and probe specific intensities, alike FGX and RGX, to model true signals in oligonucleotides.In this method, we again use the longtailed symmetric family (LTS) like RGX in order to keep the robustness.The robust estimators are less sensitive to outliers in the data and tolerate departures from target distributions compared to RGX.In inference of the model parameters, we implement the modified maximum likelihood method, and get explicit and close form for each estimator.Moreover, we derive the variances and covariances of model parameters via the observed Fisher information matrix.From the analyses by different datasets, we conclude that the multi-RGX method is promising in terms of accuracy and computational demand and can be a helpful tool for the application in computational biology, bioinformatics and biomolecular engineering.

Multiple-probe robust gene expression index: multi-RGX
As described in the previous section, the major aim of our study is to suggest a new gene expression index.Moreover, we aim that the new index can estimate the true signals more accurately, since we model not only gene specific intensities, but also both gene and probe intensities.Finally, multi-RGX covers a wider range of distributions for the signals, hence, does not ignore any information in the data.Accordingly, it suggests a robust method to model the true signals.
Hereby, in this novel model, called as multi-RGX (multiple-probe robust gene expression index), we assume that the intensities of the PM and MM probes on the logarithmic scale have the density from the LTS, similar to the idea of RGX.On the other hand, different from this, it does not follow an iterative process in the optimization of the likelihood function, rather, it works with the explicit estimates of the desired parameters.
In a standard Affymetrix chip, accepting that there are m probes, and in each probe, there exist n genes coming from the PM and MM probes, it is assumed that both PM and MM have true signals in such a way that PM covers the complete true signals, whereas, MM includes only a fraction of them.To define the amount of the underlying true signals in MM, a fraction term p is defined.Hereby, the intensities coming from the PM probes are distributed as LTS with a mean containing a constant term and another term, which stands for the true signal changing per gene.Additionally, its variance is constant over genes and probes.Indeed, this assumption may be considered strong in modelling.However, from the application of this assumption in FGX and RGX indices, it has been found that it is competitive in terms of the accuracies of the signals with respect to the gene specific variance as BGX implements.Moreover, from alternative modellings via the gene specific variance, it has been observed that none of the estimators can have explicit solutions under a gene specific variance [14], and the estimates can be obtained via optimization steps which cause an increase in the computational time.On the other hand, different from PM, MM has a mean containing the same constant term and the part of the true signal.Hence, in modelling of the PM and MM intensities for each probe j (j = 1, …, m) and each gene i (i = 1, …, n), we consider the following distributional assumption on the log-scale: PM LTS( , ) and MM LTS( , ), where S i shows the true signal for the gene i and μ H denotes the constant background signal.Furthermore, σ 2 indicates the constant variance and p stands for the fraction value within the range of zero to one that implies the ratio of the true signal in the MM probes.Thereby, the corresponding likelihood of the model in Equation ( 1) is found via where v ≥ 4 and k = 2v − 3.
Then, the first derivatives of the function lnL are taken with respect to each parameter as stated below: But it is observed that the underlying partial derivatives cannot produce unique solutions due to the repeated nonlinear functions in their formulas.Therefore, we suggest to use the modified maximum likelihood approach (MMLE) [15] that is based on approximating the nonlinear functions via the first order Taylor expansions around the ith population quantile t (i) of the student-t distribution with the (2v − 1) degrees of freedom.In the calculation, we apply the order statistics and define the partial derivations of the objective log-likelihood function in Equation ( 2) in terms of small linear sub-pieces.Moreover, it is found that the estimators based on MMLE are asymptotically efficient and equivalent to MLE [15,16].
114 Tülay Akal et al.: Robust background normalization method for one-channel microarrays In this study, we observe that the nonlinearity in the partial derivatives is caused by the following expressions such that Then, by approximating the common nonlinear functions 2 ( ) 1 their first-order Taylor expansions can be derived by 2 1 and .
Finally, by setting Equations ( 7)- (10) to zero, we get the optimal MML estimates of parameters.
As a result, ˆH µ is found as: On the other side, from Equation ( 9), we obtain the estimate of the true signal for each gene i as: However, by substituting Equation (12) into Equation ( 11), we can define ˆH µ in an alternative way as follows: . ( ˆ1) where The estimate of the common fraction term p is computed by four roots and the optimal one which maximizes the log-likelihoods is found as the following equation: Finally, from Equation (10), we can get MMLE of σ by: Adjusting the degree of freedom as 2nm − (n + 2):

Observed Fisher information matrix and estimators for variances and covariances
The variance of the asymptotic maximum likelihood estimator can be found from the Fisher information matrix [17] and as stated previously that MMLE is asymptotically equivalent to it, resulting in the same feature of estimators.
Accordingly, the Fisher information matrix I is generated from the second partial derivatives of the loglikelihood function with respect to each parameter as stated below: , where i, k = 1, …, n, present the gene and j = 1, …, m, shows the probe indicator, respectively.Moreover, l stands for the loglikelihood function lnL.Then, the I matrix is derived via In order to find the variance-covariance matrix, we partition the above matrix into the four sub-matrices.The details of this partition can be found in [18] and its application can be seen in [2,3].Once I is reorganized with respect to its sub matrices, the variances of the model parameters can be derived from simultaneous solutions of partial derivatives from the reorganized I.For an illustration, we show the variance of ˆH µ as follows:

Application via a real dataset
In the analysis, we use a benchmark Affymetrix spike-in dataset.This dataset consists of 11 spike-in genes coming from four bacterial ancestor genes and 59 arrays with 10,864 probe pairs.For the analysis, the 16 spike-in probe pairs are taken.Here, the gene expression values of the individual cRNA fragments, which are hybridized to U95A GeneChip arrays at the same concentration are analyzed under 14 different concentration levels [0.0, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0, 256.0, 512.0, and 1024.0 pM (picoMolar)].Accordingly, every gene is described by 14 probes in each array, resulting in 224 observations per array.
Finally, for the analysis, we use the R programme language with our original codes for all calculations.
Hereby, in the estimation of the model parameters for each array, we check the possible shape parameters for the LTS distribution from 2 to 40 with the jump size 0.05.In the computation, we find that most of the array can be optimally defined under the shape parameter 40 or close to 40, which practically indicates the normal distribution.
As presented in Figure 1A, while the level of concentrations increases, a nonlinear pattern is seen in the structure of the estimated signals, like other methods.Figure 1B shows the associated graph via all other methods.Furthermore, the average estimated signals via multi-RGX versus nominal log-concentrations are plotted for each gene and the graph is reported in Figure 2.
Then to evaluate the performance in accuracy for each level of concentrations, we regress the spike-in genes which have the same concentration according to the distinct log-concentrations and apply the simple linear regression method.The computed R 2 is called as the signal detect R 2 and its associated slope term is named as the signal detect slope.Then, we compute the R 2 values and slope terms for each gene separately and later calculate the mean of both values individually.We call this measure as simply R 2 .Hereby as seen in Table 1, the signal detect R 2 , signal detect slope and R 2 are found as 0.79, 1.04 and 0.92, respectively.From the results, it is seen that R 2 is comparable with respect to its alternatives.Yet, under all concentrations simultaneously, the array indicates relatively more linear relation between estimated signals and concentrations on the nominal log-scale on the average, similar to multi-mgMOS, which has one of the perfection results for this type of modelling.But different from multi-mgMOS, our novel index is fully frequestist, i.e. based on the maximum likelihood approach.On the other   hand, it indicates a stronger linearity between the relation of signals and concentrations with respect to the FGX and RGX outputs.This can be seen another advantage of multi-RGX regarding other choices in the analyses.Besides, in order to assess the computational demand for alternative approaches, we compare the time for calculation via multi-RGX with other records in Table 2.It is seen that multi-RGX is significantly faster than BGX and multi-mgMOS.But with respect to the time for FGX, it is relatively slower.The reason is that we need to compute the optimal concomitant of probe sets, which converge to the true order in two or three iterations and this calculation causes extra time that is observed in tabulated values.Moreover, the calculation of FGX is based on the mean probe level, on the contrary, the computation of multi-RGX depends on both probe and gene specific values, which complicate the calculation.

Application via simulated dataset
To assess the quality of our novel gene expression index, when the data are far from normal or have outliers, we simulate two datasets from 10,000 Monte Carlo runs.Each Monte Carlo run is conducted for 10 genes under 20 probe pairs, resulting in 200 observations for every generated array.In the first data, we generate the PM and MM values under the normal distribution and in the second data, we produce the values under the mean shifted normal density.Then, we mix these two sets in order to get two location-mixture datasets with different ratios of outliers.Hereby, the mathematical description of the first set is presented as below: and the second location mixture can be described as the following structure: where N(•,•) indicates the univariate normal with the given parameters.
Here, the first data possess a large number of outliers.On the other hand, the second dataset owns a relatively moderate number of extreme observations.Moreover, in all simulations, we assume that every gene is measured under specific concentrations.Thereby, the data are simulated under S i setting to S i = 2, …, 11, for the gene i = 1, …, 10, respectively, on the original scale for each presumed concentration level.Furthermore, μ H , p, σ are equated as 1, 0.7 and 1, in order.Finally, the shift in the location, δ, is set to δ = 10 for both datasets.
In the assessment of accuracies for the estimated results, we select three criteria: the average error (AE), mean absolute error (MAE) and the root mean square error (RMSE) [19].
In Tables 3 and 4, we list the estimated values, i.e. S i , μ H , p and σ, of FGX, RGX and multi-RGX for each gene and model parameters, in order, for the expression in Equations ( 15)-( 16), respectively, by computing the associated absolute errors.From Tables 3 and 4, we observe that there is a large amount of bias in the estimates whose cause can be explained as follows: The first dataset is generated for 10 genes on the original scale, representing very low concentrations.Then, we take the nominal log-scale of all these simulated measurements in order to calculate the estimated signals.By this way, indeed, we evaluate the performance of all three models under very low concentrations since the estimates of all indices are problematic under these levels.Furthermore, as observed in Figure 1B, FGX, RGX and multi-RGX are able to measure the amount of the noisy signals on the array when there is no effect on the gene.Accordingly, from the outcomes we can see that this amount is approximately 21 units as measured under ˆ, H µ and if we subtract this shift from all estimated signals, we can find closer estimates of the signals with respect to their true values.
Additionally, in Table 5, we present the mean absolute error (MAE) and the root mean square error (RMSE) for the estimated signals via three selected models and in Table 6, we list the real and CPU time of each index in the calculation of the estimated model parameters for these two mixture data.
From all the tabulated values, it is seen that the estimates of multi-RGX are more accurate than the ones computed via FGX and RGX.But the computational demand of multi-RGX is more than other two alternatives in the real time.However, this difference is indeed can be tolerable if we compare the results of the CPU time.This output shows that the real time of multi-RGX can be even improved with the help of an effective programming since the calculation of the estimates is almost performed under the same computer time in the end.Additionally, to evaluate the performance of each alternative in a large dataset, we extent the calculation via the simulated data by 10,000 and 20,000 genes.Finally, we compute AE, MAE and RMSE for estimated S i in which the first data have 10,000 genes (i = 1, …, 10,000) and the second data have 20,000 genes (i = 1, …, 20,000).In contrast, for the remaining model parameters p, μ H and σ, we compute directly AE, MAE and RMSE, which are unique per study.
Hereby, in the simulation we set the true value of p, μ H and σ to 0.7, 1 and 1, in order.For the true signals S i , we initially generate values according to the number of total genes ranging from 2 to 11 from uniform distributions, assuming that the genes are measured under 10 distinct concentrations similar to the previous dataset.Also, the intensities are measured on the original scale due to the fact that we aim to evaluate the performance of all three methods under the most problematic range of intensities, i.e. the worst scenario for the comparison.Then, alike other estimates, we calculate the average of all model selection criteria, i.e.AAE, MAE and RMSE, for all signals.Thereby, the average of these criteria for signals is found by using 10,000 and 20,000 values for the first and second dataset, respectively.Whereas, they are directly computed for estimates of p, μ H and σ.The results are presented in Tables 7-9 for the first large dataset (with 10,000 genes) whose mixture proportions are arranged according to Equations ( 15)-( 16), in order.Moreover, the outcomes of the second large dataset (with 20,000 genes) are listed in Tables 9 and 10 whose mixture model is generated with respect to Equations ( 15)-( 16), respectively.
From all findings with large datasets, we observe that the performance of all three indices become very close to each other.In contrast, under small numbers of genes, indicating no unique best model, we find that FGX, RGX and multi-RGX estimates work well in the bench-mark dataset and for the simulated data with less numbers of genes, multi-RGX performs better than its strong alternatives.Hence, our suggested method can be seen as a promising approach to estimate the true signals for onechannel microarray datasets.

Conclusion
In this study, we have proposed a new gene expression index for one-channel microarrays, called multi-RGX, which uses the long-tailed symmetric distribution to describe the signals and takes into account both gene and probe specific measurements.Moreover, multi-RGX generates an explicit expression for each model parameter.
In the assessment of our proposal gene expression, we have compared the accuracy and the computational demand under certain criteria.According to the results, we have observed that multi-RGX can keep a high accuracy regarding alternative gene expression indices while preserving less computational cost under small datasets.
probes in increasing magnitude for each gene i in the PM and MM standardized intensities, respectively.Accordingly, the Taylor series.Thereby, the partial derivatives of MMLE are as follows:

Figure 2 :
Figure 2: Signal versus concentrations on log-scale for Affymetrix data.
Tülay Akal et al.: Robust background normalization method for one-channel microarrays 115

Table 1 :
Signal detect R 2 , signal detect slope and average R 2 , of all methods for Affymetrix data.

Table 3 :
Estimated model parameters of the first simulated location-mixture dataset (Expression in 15) via FGX, RGX and multi-RGX with their absolute errors (AE) and true values.

Table 4 :
Estimated model parameters of the second simulated location-mixture dataset (Expression in 16) via FGX, RGX and multi-RGX with their absolute errors (AE) and true values.

Table 5 :
Mean absolute error (MAE) and root mean square error (RMSE) of FGX, RGX and multi-RGX in the calculation of the estimated signals in two simulated location-mixture datasets.

Table 6 :
Real (in seconds) and central processing unit (CPU) time of FGX, RGX and multi-RGX in the calculation of the estimated model parameters in two simulated location-mixture datasets.

Table 8 :
Average absolute error (AAE), mean absolute error (MAE) and root mean square error (RMSE) of FGX, RGX and multi-RGX in the calculation of the estimated signals S i (i = 1, …, 10,000) in the large dataset with 10,000 genes according to the two simulated location-mixture models.

Table 9 :
Absolute error (AE) of FGX, RGX and multi-RGX in the calculation of the estimated p, μ H and σ in the large dataset with 20,000 genes according to two simulated location-mixture models.

Table 10 :
Average absolute error (AAE), mean absolute error (MAE) and root mean square error (RMSE) of FGX, RGX and multi-RGX in the calculation of the estimated signals S i (i = 1, …, 20,000) in the large dataset with 10,000 genes according to the two simulated locationmixture models.

Table 7 :
Absolute error (AE) of FGX, RGX and multi-RGX in the calculation of the estimated p, μ H and σ in the large dataset with 10,000 genes according to two simulated location-mixture models.