On the Use of MDL Principle in Gene Expression Prediction

The structure and biological behavior of a cell are determined by the pattern of gene expressions within that cell. The so-called gene prediction problem refers to finding rules, or sets of possible rules, on how certain genes expressions determine the expression level of a given target gene. In this paper, we investigate the gene prediction problem and propose the use of new predictors, selected according to the minimum description length (MDL) principle. We compare the use of Boolean predictors, ternary predictors and perceptron predictors. We resort to MDL as a tool for selecting the proper size of the prediction window. MDL is also well suited for comparing predictors having different complexities. We show that the best description can be achieved by the Boolean and ternary predictors, since they obtain better fitting of the data with a lower complexity of the model. To illustrate the comparison, both synthetic and experimental data are used.


Gene prediction problem
The concept of gene expression was introduced four decades ago with the discovery of messenger RNA, when the theory of genetic regulation of protein synthesis was described [5]. Each human cell contains approximately three billion base pairs, which encode about 30 000 to 40 000 genes. In any given cell only a small fraction of these genes is being actively transcribed. Genes carry the essential information for protein synthesis. Cell metabolism, in both healthy or malignant states, is governed by complex interactions between genes. A gene may have a varying level of expression in the cell at a certain time. Many biological crucial processes can be probed by measuring the gene transcription levels, and models of regulatory pathways can be inferred from these measurements [10].
The availability of cDNA microarrays [11] makes it possible to measure simultaneously the expressions level for thousands of genes. The huge amount of data provided by cDNA microarray measurements is explored in order to answer fundamental questions about gene functions and their inter-dependencies, and it is hoped to provide answers to questions like what is the type of the disease affecting the cells and what are the best ways to treat it. Large-scale study of gene expression is expected to provide the transition from structural to functional genomics [4], where knowing the complete sequence of a genome is only the first step in understanding how it works.
As a first step in deciphering complex biological interactions at a biomolecular level it is necessary to study the microarray data in order to provide sets of possible rules on how the expressions of certain genes determine the expression level of a target gene.
Gene expression data obtained in microarray experiments may often be discretized as ternary data, the values 1,0,-1 carrying the meanings of overexpressed, normal, and repressed, respectively, which are the needed descriptors when defining regulatory pathways [2].
The basic problem addressed in this paper is the processing of ternary valued cDNA microarray data in order to find groups of genes, or factors, which are very likely to determine the activity of a target gene, as opposed to the oversimplified model of one (gene) factor influencing a single (gene) target. Previous solutions to this problem were given by using perceptron predictors, see [2], [6], which are shown to provide more reliable information than the determination of dependencies based solely on the correlation coefficient.
Our approach to prediction seeks for flexible classes of models, with good predictive properties, but also considers the complexity of the models as a penalizing factor, the balance between the two modelling aspects being set by the MDL principle.

Modelling and data compression
Is a model telling cheaply enough the data generation mechanism? If yes, it will be cheaper to use the model and its residuals to describe the data instead of using the data in raw form.
A well established fact is the ability of the "codelength" to be a universal yardstick to measure the complexity of models. The MDL principle [1] [7][8] [9] considers the following axiom as basis to the theory of modeling: given the data and the model class, select the model in the model class which achieves the shortest codelength for the data and the model. The codelength necessary for encoding the data may be given by a two part code: first encode the model parameters, second encode the model residuals.
By applying compression techniques we seek for the (universal) code giving the shortest codelength of the available data. Modelling (data analysis) seeks for a description of the degree of dependency of one gene given the other genes and the conditions in the available data.
We postulate a parametric model for the dependency g = f (g 1 , . . . , g n ), fit its parameters to the available data, and find the codelength for sending the parameters and the residuals. We then claim that a dependency is present according to the reduction in the codelength obtained by a given model class. If different model classes are consistently signaling dependencies, they are very likely to be rules of the nature.

Data available
The available data is organized in a matrix where row i is the set of measurements at instant i, and the column j (j = 1, . . . , n) represents the activity of gene g j . The entries M (i, j) take values in the set {−1, 0, 1}. In Table 1 we list the experimental data set, the same as the one used in [6].
If we had good estimates of the parameters of the stochastic model that corresponds to the data we could immediately derive estimates for the coding costs. However, collecting counts to evaluate frequencies, and from these to evaluate probabilities, is not a very safe method, because of the modest number of occurrences of the events of interest for us. Lacking the ability of accurate probability estimation, the entropies will not be good measures of average coding cost (which is the basic interpretation making entropy a measure of information content). Therefore it becomes natural to evaluate the coding length of a sequence by explicitly using a coding method.

Encode without a model
We take as example the case of T = 30 measurements, which we use in our experiments. Encoding "in clear" the column for gene g requires N = log 2 (3 30 ) = 47.55 bits. Other codes for the target gene alone may be used, but we find it convenient to make no assumption about the distribution of −1, 0, 1 at this point.

Encoding the residuals
When we use a predictive model we have to decide on a code for the residuals. At a first glance, the residuals may be more difficult to code than the original sequence, since the dynamic range is now expanded to {−2, −1, 0, 1, 2}. However, there is an obvious reason why residuals may be easier to code: a good prediction will make the sequence of residuals to contain the symbol 0 many times. Using arithmetic coding, or simply run length coding, will produce, in the case of good predictors, a smaller codelength than that of the original sequence. However we use even a simpler code for prediction residuals, described next.
The residuals will be encoded in such a way to penalize the errors. For each error we send in clear the actual (correct) value, using the code "0" = 01, "1" = 10, " − 1" = 11, the code 00 signaling the end of the residual sequence. After the actual value, we also transmit the location of the error using log 2 31 = 5 bits. An error will therefore be transmitted using 7 bits.

Comparison of prediction capabilities of several predictors
We consider here three predictor classes for ternary valued gene expressions: hard Boolean predictor, hard ternary predictor and finally, perceptron predictor. For all of them we discuss the predictor parameter estimation and the predictor order selection based on MDL principle. We are in the case of ternary valued data, with the ternary values {0, 1, 2} (if x ∈ {−1, 0, 1}, first use the mapping x ← x + 1 to transform to the set {0, 1, 2}).

Hard Boolean predictor specification and design
Denote x = [x 1 , . . . , x n ] the prediction window, having dimension n, and x i ∈ {0, 1, 2}. Define the thresholded vectors The prediction is defined asŷ where f (·) is a Boolean function with n variables.
To design the Boolean predictor we quantize to two intervals the conditional expectation in the binary planes, and assign f * ( We consider the two part code: first we encode the model "parameters" (we have to specify the function f * (·)), and secondly, we need to encode the residuals (prediction errors). The model cost for encoding the function f * (·) is if we assume a uniform apriori distribution over all Boolean functions. Other selections are also possible, i.e. to favor Boolean functions which are more active (e.g. depending on the absolute difference between number of ones and zeros in f ). Therefore if there are N e nonzero prediction errors, the overall coding length will be the model length plus the prediction error length: By starting from a data set of several possible gene factors, we can select the best combination by designing sequentially the predictors with prediction window sizes n=2, 3, 4, 5 and computing L(n) for each predictor. The MDL principle selects as optimal predictor size that n for which L(n) is minimum.

Ternary predictor specification and design
The optimal Ternary predictor is found by quantizing to three intervals the conditional expectation:ŷ . ., which needs 2 · 3 n bits, therefore the cost of encoding the ternary model is C M = 2 · 3 n bits, the same for all models, irrespective of our data.
However, we note that we do not know the optimal values h * (x) for those prediction windows x which we have not seen.
2. Therefore, in the second alternative the cost of the model depends on the actual data, since we have to specify to the encoder the optimal predictions only for the seen prediction windows. Denoting n x the number of different prediction windows found in the data set, the cost of encoding the model becomes C M = 2n x bits, and it is upper bounded by 2T where T is the number of measurements. This variant is used in our experimental determinations. Note that a similar variant can be used for the Boolean predictor.
The prediction errors will be encoded the same way as in the case of Boolean prediction. Therefore if N e nonzero prediction errors are obtained with the Ternary predictor, the code length for the prediction residuals is 7N e + 2.
The overall coding length will be the model length plus the prediction error length: Apparently the ternary predictor is defined only for the prediction windows seen in the experiment. There is a obvious way to use the observed data to define h * (x) for those prediction windows x we have not seen. One simple way is to consider h * (x) = h * (x ) where x is a seen prediction window, with size smaller than x. However the point with the current gene expression problem is that our goal is to identify outstanding gene dependencies, given the observed data, which does not make necessary to specify the predictor for unseen windows.

Perceptron predictor specification and design
The perceptron is found by quantizing to three intervals the best linear combination of samples in the predictor window:ŷ The parameters w and w 0 can be found using the Perceptron algorithm [6].

Perceptron predictor: Description length
For ternary valued data, the numbers of all distinct perceptrons with two or three inputs are known [12]. Using a perceptron model with two input genes requires C M = log 2 471 = 8.88 bits. A perceptron model with three input genes requires C M = log 2 85629 = 16.38 bits.

An artificial example
We use first an artificial example, where, by knowing the true state of the nature, we will be able to check the success of our procedure.
We assume to have 30 measurements, at times i = 1, . . . , 30, of 40 variables (factors) x 1 (i), . . . , x 40 (i) and 1 target y(i), and suppose the measurements are quantized to three levels: 0, 1, 2. The number of sequences of 30 symbols with ternary values is 3 30 = 2 · 10 14 , exceeding hugely the 40 sequences we have in the experiment. Suppose there is one function The data was generated by randomly choosing 40 sequences of 30 measurements, from all possible 2 · 10 14 ternary sequences, with an uniform prior on all sequences. The target function f * : {0, 1, 2} 4 → {0, 1, 2} was selected in the following way: for any input window (j 1 , j 2 , j 3 , j 4 ), the value f * (j 1 , j 2 , j 3 , j 4 ) was sampled from the random variable with values 0, 1, 2 and mass p = [ 1 3 , 1 3 , 1 3 ]. Then the target gene was constructed as the exact function y(i) = f * (x 1 (i), x 2 (i), x 3 (i), x 4 (i)). Therefore the true window for the target is [x 1 (i), x 2 (i), x 3 (i), x 4 (i)], referred for short as [1,2,3,4]. In the following we show that Boolean and Ternary predictors can be used in conjunction with the minimum description length to select prediction window candidates.
Target alone The complexity of the target (no conditioning on other genes) is evaluated by three methods: (a) a priori uniform distribution on all sequences, giving L(y) = − log 2 3 30 = 48 bits; (b) adaptive arithmetic coding, giving the value L(y) = 47 bits; (c) Move to Front coding (which favors correlated sequence), giving the length L(y) = 52 bits.
Since the target was generated as an uncorrelated sequence of ternary values, Move to Front gives rather different results than the first two methods, but we can safely assume that the complexity of the target alone is in the range 47-52 bits.
Window size=2 No predictor using only two genes can achieve a description length lower than the complexity of the target. In the left Table 2 we list the best models of order two, those giving a descriptive length less than 72, which makes 17 windows in total. The "correct" window (which is of order four, and contains gene 1 , gene 2 , gene 3 , and gene 4 ) has a "trace" in the list of the best models, occupying the second position with the window (gene 1 , gene 3 ).

Window size=3
The predictors using three genes becomes more successful in describing the target. In the middle Table 2 we list the best models of order three, those giving a descriptive length less than 55, which makes 28 windows in total.

Window size=4
We list the best models of order four (those having length less than 49, which makes 23 windows in total). In bold we show the "true" model, which was found in the 19'th position, ranked out of 91390 possible prediction windows. Observe that several variations of the true window show up frequently as winners in the final list.

A comparison of the three classes of models in a "12 gene" experiment
We consider here a comparison of the three predictors under MDL criterion, by using the experimental data from [6], listed in Table 1. The target gene is AHA.
In Table 3 we present the description length values, and also the number of errors, obtained with the best predictors in each of the three classes, for various sizes of the prediction window, when we restrain only to the set of genes (RCH1,p53,PC-1). For the Boolean predictor, the best description length is obtained by the model HB(RCH1, p53). Appending PC-1 to the prediction window does not improve the descriptive power of the model. The same conclusion can be drawn with Hard ternary predictors, and perceptrons. The MDL criterion makes therefore a clear and consistent selection of the best prediction window, while the coefficient of determination c d [6] simply shows a better fit of the model to the data when increasing the window size. The problem with c d is that it expresses only the fit to the training data, to avoid overtraining one needs another (validation) data set. MDL is a method which implicitly penalizes too large models, being shown to be a consistent order estimator, while several cross-validation methods are not.
We extend now the experiment to predicting the values of the target AHA based on combinations of two, three, or four of the genes in the first 12 columns of Table 1. In Tables 4, 5, 6 we show the best predictors for each window size. The best overall predictor is indicated to be AHA = HB(p53, AT F 3, RCH1), which has the Boolean function f (x 1 , x 2 , x 3 ) = x 1 x 2 + x 2 x 3 , and the description length L = 10, much lower than that of any other window size, or combination of genes. By comparing the description length of predictors of different sizes, we can decide now that most of the "good" predictors of order 3 or 4 are worse than corresponding predictors of lower order e.g. all predictors of size three with description length 24 are actually worse than the predictor of order two AHA = HB(p53, RCH1), which has description length 20.
We are capable therefore to organize the huge number of "good predictors" in Tables 4, 5, 6 and discard all those overly complex, to conclude that the relevant predictors in this experiments, in the light of the data. are only: HB(p53, RCH1), HB(p53, AT F 3, RCH1), and HT (p53, AT F 3, BCL3).   Table 4: "12 Gene" Experiment The Description length of the best predictors with window size =2. Length