BClass : A Bayesian Approach Based on Mixture Models for Clustering and Classiﬁcation of Heterogeneous Biological Data

Based on mixture models, we present a Bayesian method (called BClass ) to classify biological entities (e.g. genes) when variables of quite heterogeneous nature are analyzed. Various statistical distributions are used to model the continuous/categorical data commonly produced by genetic experiments and large-scale genomic projects. We calculate the posterior probability of each entry to belong to each element (group) in the mixture. In this way, an original set of heterogeneous variables is transformed into a set of purely homogeneous characteristics represented by the probabilities of each entry to belong to the groups. The number of groups in the analysis is controlled dynamically by rendering the groups as ’alive’ and ’dormant’ depending upon the number of entities clas-siﬁed within them. Using standard Metropolis-Hastings and Gibbs sampling algorithms, we constructed a sampler to approximate posterior moments and grouping probabilities. Since this method does not require the deﬁnition of similarity measures, it is especially suitable for data mining and knowledge discovery in biological databases. We applied BClass to classify genes in RegulonDB , a database specialized in information about the transcriptional regulation of gene expression in the bacterium Escherichia coli . The clas-siﬁcation obtained is consistent with current knowledge and allowed prediction of missing values for a number of genes. BClass is object-oriented and fully programmed in Lisp-Stat . The output grouping probabilities are analyzed and interpreted using graphical (dynamically linked plots) and query-based approaches. We discuss the advantages of using Lisp-Stat as a programming language as well as the problems we faced when the data volume increased exponentially due to the ever-growing number of genomic projects.


Introduction
The exponential growth of raw biological information represents an unprecedented challenge for biologists and bioinformaticians.Striking breakthroughs in biotechnology currently allow sequencing an average bacterial genome (the total DNA within an organism) in a matter of days.Since 1995 when the first complete sequence of a free-living bacterium Haemophilus influenzae was obtained (Flaischmann and et al 1995) to November 2004, there are already more than 200 complete genomes available, sampling the three domains of life, at The National Center for Biotechnology information (NCBI, http://www.ncbi.nih.gov), and more than 140 are still under way.Furthermore, high throughput experimental approaches have been developed to measure simultaneously the expression of all genes within an organism, both at the level of messenger RNA (Brown and Botstein 1999) and protein content (Anderson et al. 2000;Dutt and Lee 2000); such approaches have led to the emergence of the fields of transcriptomics and proteomics respectively.Statistical interpretation of transcriptome results is not a straightforward endeavor since different growth conditions, different RNA extraction procedures and different microarray-building systems hamper comparisons between experiments.Pitfalls, challenges and limits of these approaches have been adequately discussed elsewhere Danchin and Sekowska (2000).Experiments focused on individual biological systems, which involve a few genes, continue to appear in the literature and a substantial part of the resulting information is available in specialized databases.The inescapable consequence is that the pace at which raw experimental data is generated has greatly exceeded our ability to extract insightful knowledge from such information.For example, it is not trivial to infer from sets of coexpressed genes under a given experimental condition, which genes are corregulated, what are the regulatory proteins that regulate them, and even much more complicated, what is the network of regulatory interactions that produces the observed expression patterns.More robust methods and enhanced computational tools are required to come to grips with this problem by integrated analyses of heterogeneous data types.
We focused on the problem of clustering and knowledge discovery in RegulonDB, a biological database specialized in information about operon organization and transcriptional regulation of gene expression in the bacterium Escherichia coli (Salgado et al. 2004).The specific goal is to build a statistical framework that, hopefully, will allow uncovering previously unknown relationships among genes, given the diverse attributes that describe them.In our strategy we use a classical mixture model to tackle the problem of multivariate, heterogeneous classification and clustering.We call the resulting software BClass.Let X = (x iv ) be a (non-relational) database where x iv represents attribute v of gene (entry) i; i = 1, 2, . . ., n and v = 1, 2, . . ., C. Let also x i be the ith row of X; all the attributes for gene i.By heterogeneous databases we mean that the x iv 's may be very different in nature: continuous, discrete, categorical, etc. See Section 6 for an application example of BClass to RegulonDB.
A variety of clustering algorithms based on dissimilarity or distance measures are currently available.Methods for clustering and classification of heterogeneous, mixed, variables include converting variables to homogeneous types, analyzing variables separately and finding a weighted average of standardized dissimilarity measures (see, for example, Kaufman and Rousseau 1990;Gordon 1981).Regarding this latter method, many authors have suggested procedures to properly define the weights in a joint dissimilarity measure.However, as yet, there is the caveat that no standard method has been widely accepted and much heuristics is here needed (see Everitt 1993).As an alternative, mixture models have the advantage of not relying either on distance measures nor on building (or attempting to build) a statistical model of the data analyzed (see Everitt 1993;McLachlan and Basford 1988) The now well known software AutoClass of Cheeseman and Stutz (1996) uses mixtures for clustering and classification in complex databases; it is largely based on the techniques presented by Titterington et al. (1985).AutoClass uses expected maximization (EM) approximations to find, mainly, maximum a posteriori (MAP) estimators (point estimators).In this paper, we build a mixture model for X using the assumption of conditional independence across elements in the mixture and attributes; this assumption has been shown both empirically and theoretically to be highly effective (Hand and Keming 2001).We embarked on doing a full Bayesian analysis using Markov Chain Monte Carlo (MCMC, see for example Gamerman 1997) methods to approximate complete posteriors, specifically, all grouping probabilities.This means that, in particular, given the number of groups or components in the mixture (J), we obtain a matrix P = (p ij ) where p ij is the posterior probability for entry i to belong to group j, j = 1, 2, . . ., J. The whole process may be regarded as a transformation of X to P , where now the "attributes" of each entry are its grouping probabilities p ij 's.These new "attributes" are perfectly homogeneous.We then may use, as posterior interpretation tools, simple clustering techniques on P to find clusters.Therefore, both groups and posterior grouping probabilities (p ij ) are well defined mathematical objects, and "Clusters" are rather intuitive, and more interpretable, in terms of the database under consideration.Overall, we follow AutoClass philosophy of "automatic" classification, that is, setting the least number of parameters and relying on reasonable default values.
The problem of deciding the appropriate number of components in a mixture has been studied by many authors.However, there are only a handful of full Bayesian approaches that state a prior and calculate a posterior for the number of components J in the mixture.The difficulty here is that, as J varies, the dimension of the model changes, and standard MCMC approaches cannot handle chains of variable dimension.Philips and Smith (1996) propose using "jump diffusions", while Richardson and Green (1997) make an application in mixture models of the now popular "Reversible Jump" MCMC, developed by Green (1995).Stephens (1997) proposes yet another approach, based on point-process simulation, to deal with variable number of components.These strategies approach the problem in a full Bayesian setting being applications of general methods for variable dimension models and tend to become quite elaborate.Here, we tailored a procedure that allows handling different number of components, yet keeping J fixed.This is achieved using a vector of indicator ("dimensionality") variables that designate the "alive" and the "dormant" components.Using an alternative prior for the mixing probabilities that prefers parsimonious parameterizations, we construct a Markov Chain Monte Carlo (MCMC) relying on standard Gibbs sampling and Metropolis-Hastings algorithms (see Besag et al. 1995).It is worth noting that our method is suited for any set of component distributions, whereas the aforementioned approaches have been explored mainly for Normal mixtures.
The paper is organized as follows.In Section 2 we explain our model and the general form of the hierarchical structure.Specific priors for various distributions of gene attributes are explained in Section 3 and in Section 4 we discuss the posterior analysis of the MCMC output.In Section 5 we describe our general strategy to implement the system in Lisp-Stat.
In Section 6 we apply BClass to the analysis of RegulonDB (Salgado et al. 2004) and some comparisons are made with the software AutoClass.Finally, a general discussion of the paper is given in Section 7.

The model
We use the standard mixture model with allocation variables as explained, for example, in Richardson and Green (1997).However, we add a vector of indicator variables to control the "alive" and the "dormant" groups in the mixture.That is, given a set of (unknown) parameters h, the database X has distribution where π j ≥ 0, J j=1 π j = 1, are the mixing probabilities, f (X | θ j ) are the within-group (component) distributions, and the indicator variables φ j = 0, 1 are the "dimensionality variables".Thus if φ j = 0, then group j is dormant (not considered), and if φ j = 1, group j is alive.We consider the number of components J to be fixed to a suitable large number and by integrating out the φ j 's we tackle the problem of "finding" the number of components in the mixture (the components that remained "alive").We then have that h = (π, φ, θ), where π = (π 1 , π 2 , . . ., π J ), φ = (φ 1 , φ 2 , . . ., φ J ) and θ = (θ 1 , θ 2 , . . ., θ J ).
We assume that all the component distributions f (x i | θ j ) belong to the same parametric family.We also assume conditional independence among components and attributes in the following sense, where θ jv are the parameters needed for component j and attribute v.The above assumptions mean that we consider all attributes as conditionally independent within each group.This is a fairly reasonable assumption since, once all other sorts of variability are discarded, the internal group variability may be expected to behave as random, non-correlated, error.As supporting evidence, the studies of Hand and Keming (2001) presented an empirical and theoretical performance analysis of the assumption of conditional independence among attributes.They conclude that, although at first glance naive and most likely incorrect, The independence Bayes model seems often to perform surprisingly well.(Hand and Keming 2001, p. 395).Part of the reason is that less parameters are needed to be estimated, thus outperforming models that consider interaction parameters, specially for several attributes.Furthermore, generally speaking, little is known about the databases studied and thus simple models need to be used (this approach is also taken in AutoClass and has given promising results, see Cheeseman and Stutz (1996).In order to consider nonindependent attributes in the database, we could use multivariate distributions with some correlation structure.The generalization of the techniques developed here to use multivariate attributes is relatively straightforward (see Fraley and Raftery 1998, as an example using normal variables only).However, in this paper we take f (x iv | θ jv ) to be univariate, thus we write x iv , a scalar.For clarification, say we have C = 3 attributes x i = (x i1 , x i2 , x i3 ) for each gene i, and that these are Normal, Multinomial with four levels and Poisson.In such a case we would have f Certainly, in general, we do not know which group each gene belongs to.Thus, for gene i, the group this gene belongs to is given by the latent allocation variable J i = 1, 2, . . ., J. Given these allocation variables, we have where J is the vector of J i 's.That is, given J , x i is drawn from its corresponding group J i .
For a given J our parameters are (J , π, φ, θ).We assume that f (J , π, φ, θ) = f (J , π, φ)f (θ) and a priori, ), and we assume that P [φ j = 1] = α independently (α will be taken equal to 1 2 , non-informative).Therefore, to construct our prior we are using the decomposition We take ) and the definition of f (θ jv ) is left for Section 3.Alternatively, φ and π may be considered not independent a priori by, for example, taking , and making φ j to depend on π j , somehow.However, this will only be relevant for the case when prior information is availbale to distinguish the π j 's, which is not the case for f (π).An influence diagram (see, for example, Richardson and Green 1997) for our model is presented in Figure 1.
We leave the more mathematical aspects of defining f (π) and the design of the Markov Chain Monte Carlo algorithm for the Appendix.We now turn to consider the distributions for the attributes in the data base.

Distributions for different attributes
In this section we present distributions for specific types of variables that are commonly used in biological databases.However, it should be clear from the type of hierarchical modeling used that any other type of distribution may also be considered, provided a default (proper) prior is stated and sampling from its unknown parameters is properly described.Proper priors are needed since we could encounter empty groups and in such a case we would need to sample from the prior itself, during the MCMC iterations.As far as the MCMC sampler is concerned, no further adjustments are required and the sampling scheme for the rest of the parameters remains the same.Moreover, the prediction section below regarding missing or unobserved values, is completely general and not solely restricted to the distributions presented herein.

Normal heterocedastic
With respect to Normal variables we have θ jv = (µ j , λ j ) and x iv | J i = j, µ, λ ∼ N (µ j , λ j ) where µ j is the mean and λ j the precision (inverse of the variance).Let also µ be the vector of µ j 's and λ the vector of λ j 's.Richardson and Green (1997) take µ j ∼ N (µ 0 , λ 0 ) and λ j ∼ Ga(α, β) independently for all j and µ 0 , λ 0 and α fixed to some data dependent values.They consider a further hierarchy in the parameters by letting β ∼ Ga(g, h) with g and h fixed (Stephens 1997, also follows this approach), to construct a quasi-non-informative yet proper prior.We follow the same approach here.We fix the following parameters to , where a = min x iv and b = max x iv .Richardson and Green (1997) and Stephens (1997) point out that these values convey the belief that the "λ j 's are similar, without being informative about their absolute size".To update µ j , λ j and β, we simulate from their full conditionals (Gibbs kernel).That is, a N (µ n j , λ n j ), where λ n j = λ 0 +n j λ j and µ 2 ) for λ j and a Ga(g + Jα, h + J j=1 λ j ) for β.

Normal homocedastic
We consider a homocedastic Normal variable, for which x iv | J i = j, µ, λ ∼ N (µ j , λ), with the same precision λ across all groups.It is likely that in clustering problems we would prefer this variable, instead of a heterocedastic one (see Richardson and Green 1997) since it will tend to split the range of the data in more or less uniform intervals.Again, we take µ j ∼ N (µ 0 , λ 0 ) a priori, and f (λ) ∝ λ −1 as a prior for λ.To update µ j we simulate from N (µ n j , λ n j ), where λ n j = λ 0 + n j λ and λ from a Ga( 1 2 n, 1 2 J j=1 J i =j (x iv − µ j ) 2 ) (the full conditionals), which are always proper although the prior for λ is not.As above, we take µ 0 = a+b 2 (the data range mid point) and λ 0 = 1 (b−a) 2 , a large precision.

Multinomial
Regarding Multinomial attributes (eg.categorical), we assume that x iv = 1, 2, . . ., L (that is, we translate labels into a numeric scale).The conjugate prior is a Dirichlet; however, in this case, there is not a unique standard reference prior, with Di(1/L, . . ., 1/L) being a common non-informative prior used.From simulated studies, where only one Multinomial variable is considered, we have seen that a Di(1/2nL, . . ., 1/2nL) leads to a mixture where only L components are effectively used and each component in fact takes only one level.We thus consider the latter as our default (sample size dependent) prior and simulate the vector of Multinomial probabilities in group j from its full conditional (Gibbs kernel), which is a Di(1/2nL+n j1 , 1/2nL+n j2 , . . ., 1/2nL+n jL ), where n jl = |{i : x iv = l, J i = j}| (the current Multinomial counts for group j).

Poisson
A basic procedure to deal with some types of ordinal variables is to use a Poisson model.We assume that x iv = 0, 1, . ... Given x iv ∼ P o(λ j ) the standard reference prior is f (λ j ) ∝ λ −1/2 j , which is not proper.Following the hierarchical priors used for the Normal case, we take λ j ∼ Ga(α, β), fix α = 1 and take f (β) ∝ β −1 (as a reference prior for β).We then simulate from the full conditional of λ j which is a Ga(α + J i =j x iv , β + n j ) and we also simulate β from its full conditional (Gibbs kernels), that is a Ga(Jα, J j=1 λ j ).

Prediction
In the context of MCMC, finding predictive distributions for missing (unobserved) attributes is straightforward.Simply, a missing attribute x iv is taken as an unknown parameter and included in the Markov Chain simulation.From Section 2 we see that the full conditional of x iv is f (x iv | θ J i v ), that is, sampling from the model used for attribute v.A "predictive" sample is then obtained for x iv , which can be used either to approximate its predictive distribution or some point estimate like its predictive expectation.

Posterior analysis
As in any complex Bayesian analysis, an important problem is analyzing posterior information.BClass produces approximations for the posterior grouping probabilities P = (p ij ) = P (J i = j | X), a n × J matrix, and posterior expectations for π and θ, E[ As explained earlier one can regard the grouping probabilities as homogeneous attributes.For small n, it is possible to calculate a distance matrix (with simple Euclidean distance) and plot a dendrogram.However, even for moderate n (= 500), the dendrograms are difficult to read since most branches overlap.An alternative and simpler method is to plot the points a 1 p i1 + a 2 p i2 + • • • + a J p iJ where the a j 's are equally spaced along the unit circle.Thus the grouping probabilities for each entry i (p i1 , p i2 , . . ., p iJ ) are mapped within the unit circle and similar grouping probabilities are plotted as nearby points (the contrary is not necessarily true though, and thus some care is needed in the interpretation of the resulting plots).These plots, hereafter referred to as "archipelago" plots, are used in the example to visually recognize clusters.We worked around the problem of cluster definition by plotting the grouping probabilities, say 10 times, with the order of the groups randomized.When these plots are linked, clusters can be reliably defined as points located nearby in all the plots-false clusters always split apart when permuting the order of the groups.
With respect to the dimensionality variables, we only keep track of k (l) = J j=1 φ (l) j , the number of "alive" groups at pass l.Using the sample k (l) we approximate probabilities for the number of groups used.This is also done in the example.

BClass implementation in Lisp-Stat
BClass is fully implemented in Lisp-Stat for several reasons: (1) Lisp programming is remarkably straightforward; (2) it allows to focus on the problem at hand without much concern about the technical innards of the language; (3) most functions are vectorized; and (4) the Lisp-Stat's dynamic graphical-linking capabilities facilitate tremendously the exploratory analysis of both the input data and final results.
BClass was implemented following an object-oriented strategy.Every supported statistical model (i.e.Poisson, multinomial, normal homocedastic and normal heterocedastic) consists of a prototype that can be used to instantiate as many objects as necessary.Therefore, before running BClass, a preliminary analysis is required so as to decide which distribution type should be used to model each attribute.
The system includes a comprehensive tutorial plus a demo script to illustrate the readers how to load their own data into BClass, fine tune internal parameters (i.e. the number of groups J in the mixture, the number of iterations for the MCMC algorithm, etc.), how to start the classification process, plus how to save or reload the BClass output in order to postpone or continue a particular analysis.Two approaches (plot-based and query-based) are combined to define, interpret, and evaluate the quality of the resulting clusters, which are formed by observations sharing similar grouping probabilities.The system is open source code, it consists of 16 script files, the tutorial and the demo.BClass is available free of charge from the journal's web page and from the sites http://www.cimat.mx/~jac/softwareand http://www.ccg.unam.mx/amedrano/BClass.

Case study
In order to understand the instructive example here presented several, biological concepts are necessary.First, DNA is a linear sequence of four types of concatenated building blocks called nucleotides or bases: adenine (A), thymine (T), guanine (G) and cytosine (C).Structurally, DNA is a macromolecule with two complementary strands arranged in a double helix, each strand being read in opposite directions (forward or reverse) by the cellular machinery.Second, a gene is a DNA fragment that encodes the necessary information to produce proteins, which in turn are responsible for carrying out most of the cellular processes indispensable for sustaining life.Genes may be physically positioned in any of the two DNA strands and their length is the sum of As, Ts, Cs, and Gs in their sequence.Third, a bacterial chromosome is a circular self-replicating DNA sequence that contains in a linear array all or most of the genes.Fourth, transcriptional regulation refers to the molecular mechanism responsible for strategically expressing the genes required by the cell in order to survive environmental changes, reproduce, metabolize nutrients, etc. Fifth, the protein product of genes can be classified in several types, here we will deal only with four of them: enzyme (to accelerate biochemical reactions), regulator (to control whether or not the protein product of other genes will be manufactured), transport (to take nutrients, toxins, etc in and out of the cell), and leader (specific sequences at the beginning of some genes that may function in targeting reactions or regulation).Sixth, gene function in this example refers to metabolic processes in which the genes participate (e.g.amino acid biosynthesis, energy production, cell division, etc.).These definitions, albeit sufficient, are by no means complete.
RegulonDB (Salgado et al. 2004) has information on transcriptional regulation for more than 2300 genes.We selected a set of 435 genes for which all attributes are completely described, with the exception of a few missing values (see below).We used the following attributes for the analysis: The DNA strand (forward or reverse), gene length in base pairs (bp), position within the chromosome (in minutes), gene type (enzyme, leader, regulator, transporter and miscellaneous), gene function (20 functional classes) and regulation mode (positive, negative or dual).These attributes, in turn, are modeled as Multinomial with 2 levels, Normal homocedastic, Normal homocedastic, Multinomial with 5 levels, Multinomial with 20 levels and Multinomial with 3 levels, respectively.There are 3 genes that have an unclassified or unknown function, and 11 genes with an unknown mode of regulation.Following the technique explained in Section 3.5 we predicted the values for these missing data.We considered a mixture of (J =)30 components.
Given the complexity of the multidimensional model entertained and the fact that Lisp-Stat is an interpreter, the convergence of the MCMC chain turned out to be slow.We took quite a long burn-in of 100,000 sweeps followed by a run of 50,000 sweeps.The π j 's were sorted and sampling was carried out every 5 sweeps.With the current BClass implementation in Lisp-Stat, these calculations took nearly 9 hours in a Intel Xeon processor (2.4 GHz) running Linux Red Hat 7.3.In this analysis we concentrated on calculating the grouping probabilities, that is, on obtaining the matrix P = (p ij ).The archipelago plot for P (see Section 4) is shown in figure 2.
As explained above, using the sample k (l) we approximate probabilities for the number of groups used, see figure 3. We see that the most likely number of components is 29, and thus we have some confidence that we are using an appropriate number of groups.
So far, we have analyzed more than 15 clusters arising from the archipelago plots.For illustration, however, we present a brief analysis of the cluster shown in figure 2. This cluster has 13 genes.We see that all genes are regulated positively (their expression needs to be activated), all but two have a miscellaneous type, all are on the reverse strand, 12 genes participate in central intermediary metabolism (function 3) and the function of one gene is unknown.For this latter gene, phnQ, there is a predictive probability of more than 0.98 that its product is involved in central intermediary metabolism (function 3), as the rest of the genes in the cluster.Concerning their position within the chromosome the genes form two clusters: 12 genes form one cluster between minutes 92.93 and 93.11 and one gene, known as gcvH, is in minute 65.68.Regarding the size of the genes, they are in the range 309 to 1137 bp (small to average size), with the smallest gene being gcvH.In bacteria, a substantial fraction of genes is co-transcribed (expressed at the same time) defining the so-called "operons".All but gcvH belong to an operon containing 15 genes.This is the largest known operon in E. coli as most operons contain 2 or 3 genes.As far as we know, there is no reported evidence that these genes are functionally linked.Further theoretical and experimental research should be directed at studying the biological meaning of the association between the former gene and the latter operon.A preliminary transcriptome analysis of 4 growth conditions (i.e.heat shock, osmotic shock, minimal media, and IPTG) indicates that none of these genes change their expression significantly, so potential coexpression under other non-essayed growth conditions cannot be completely ruled out.

Cluster
Besides the gene phnQ mentioned above, genes yhdG and dsdX have a predictive probability of nearly 1 that their product is involved in transport (function 14).Regarding the missing values for the type of regulation, gene treB has a probability of more than 0.99 of having a dual regulation type.The rest of the missing values analyzed do not have high predictive probabilities.It might be interesting to confirm these predictions with further experimental research.
For the sake of comparison, AutoClass was also applied to RegulonDB, for exactly the same genes and variables as above.Since AutoClass relies on MAP (point) estimates it only reports the most likely group each entry belongs to.AutoClass found 7 groups after 10,000 iterations (a minimum of 50 is recommended), however, none of these groups could be identified as one single cluster in the archipelago plot produced by BClass.Moreover, the groups produced by AutoClass were split into several well compacted more interpretable clusters by BClass.Indeed, it is not surprising that we can extract more significant information out of BClass since we do have access to the whole vector of grouping probabilities for each entry and not just the most likely group.More comprehensive evaluations are needed to fully address the issue of comparing the performance and capabilities of each software.

Discussion
Here we present a general tool for Bayesian classification of genetic databases using mixture models.The methodology is "open" in the sense that, in principle, any set or combination of quantitative genetic attributes may be analyzed, with few new technicalities to be taken care of.
As explained in Section A.2, without any identifiability constraint, the posterior distribution has J! symmetric components.If a MCMC sampler is run without such constraints, samples will be drawn from any of these components.The output of such sampler should be analyzed with much care, avoiding mixing samples from different components.By post-processing the MCMC output, Stephens (1997) proposes two methods to "relabel" a sample from a mixture model.Stephens' second method could be applied here since it is independent of the type of component distributions used.However, this method post-processes the probabilities of the full conditionals for each J i in every MCMC iteration.This involves keeping s matrices of size n × J, where s is the MCMC sample size; a formidable task even for moderate n.The ordering constraint in the π j 's followed in this paper does avoid some of the label switching problems, provided that the groups do not have similar π j 's.The label switching problem has just recently been addressed, and thus further research efforts are required to find a simpler on-line solution.
The number of components used in the mixture is controlled by the indicator dimensionality variables φ j 's, while the entropy prior on the mixing probabilities enable the method to obtain parsimonious parametrizations.We use standard Metropolis-Hastings, which leads to reasonable simple algorithms, in contrast with the complexity of methods proposed elsewhere (Philips and Smith 1996;Richardson and Green 1997;Stephens 1997).Moreover, our approach is independent of the families of component distributions used, making unnecessary any fine tuning of details for each new distribution introduced.
As discussed above, the rationale behind BClass offers several advantages, particularly in the case of genetic databases, compared to other clustering techniques.These ideas should be useful and robust for applications in diverse areas of research such as satellite imaging, social sciences, medicine etc. besides biology.
As far as we know, mixture models have been applied to biology at the level of morphological traits (na and noz 1998) but not to mine databases in molecular biology.Bayesian statistics has been certainly used in Bioinformatics in learning strategies to identify common motifs in related sequences (Neuwald et al. 1995), in sequence alignment (Zhu et al. 1998), phylogenetics (Lewis and Swofford 2001), as well as in transcriptome analysis (Baldi and Long 2001).Given the explosion and growth of biological databases (see the January issues of the journal Nucleic Acids Research where biological databases are presented), it is reasonable to foresee the increasing importance of mixture models as a tool for clustering analysis and biological knowledge discovery.A well known example of the contribution of clustering techniques to molecular biology is the determination of 3 general codon usage classes in E. coli (Médigue et al. 1991).These major codon usage groups have been observed in other bacteria as well, which fueled the development of improved computational methods for gene prediction (Borodovsky et al. 1995).More recently, the emergence of post-genomic experimental tools has generated an outburst of large data sets of expression profiles for all genes within complete genomes.The virtues of clustering analysis in functional genomics have been clearly illustrated by Eisen et al. (1998).Currently, clustering techniques are routinely used in transcriptome analysis to discover genes that might contribute to disease, identify potential drug targets, and sets of corregulated genes that will play an essential role in the eventual characterization of complete genomic regulatory networks (see D 'Haeseleer et al. 2000).Even though expression values are homogeneous, the methodology here presented permits an integrated analysis through the inclusion of other additional gene attributes.For instance, BClass is able to analyze simultaneously transcriptome data, functional categories, regulation modes, position within the chromosome and/or other biological attributes that could strategically improve the search for sets of co-regulated genes.Future work with BClass shall illustrate the extent to which heterogeneous classification impinges upon integrative genomic analyses.
The application of BClass to a data subset of RegulonDB is presented here as a preliminary example.However, it was apparent that the current list-based implementation of BClass in Lisp-Stat, albeit functional, is not adequate to analyze large data sets because the code is interpreted.Given that the time required for the execution of the MCMC algorithm is a function of the number of parameters and the number of observations, we have re-implemented the whole system using an array-based approach and a commercial version of common Lisp (Allegro), in order to compile the source code and obtain a fast binary stand-alone application.
Here we concentrated on efficiency issues to minimize time-consuming bottlenecks in the code.However, although the overall BClass speed was significantly increased, it is not fast enough to handle thousands of entries in a few minutes, often requiring several hours or even days.Given this situation, we are currently working on a parallel version of BClass that will be able to run much faster.

A.2. MCMC
It is routine in almost any modern Bayesian analysis to use Markov Chain Monte Carlo (MCMC) methods to approximate posterior densities; for a review of the subject see Besag et al. (1995).We design our MCMC sampler in the following way.We simulate in turn the parameters θ jv , for j = 1, 2, . . ., J, v = 1, 2, . . ., C from their full conditionals.The allocation variables J i 's are simulated using a Metropolis step and we simulate π and φ in a single stage using a Metropolis-Hastings step.
It is easy to see that the full conditionals for θ jv are proportional to J i =j f (x i | θ jv )f (θ jv ).This means that θ jv will be drawn as if from a posterior distribution using the likelihood J i =j f (x i | θ jv ) and prior f (θ jv ).We will concentrate on conjugate forms for f (θ jv ) and, in general, θ jv will be easy to simulate.There is, however, the difficulty that improper priors may not be used (see Roeder and Wasserman 1997).The definition of f (θ jv ) for specific type of variables is left for Section 3.
With respect to the allocation variables J , we make a proposal J i for J i with P (J i = j) ∝ φ j (that is, uniformly from the alive components).This represents a Metropolis (symmetrical) proposal and it is not difficult to see that its acceptance probability is In our experience, both with simulated and real data, we have seen that this Metropolis step has better mixing than simulating directly from the full conditional of J i , with the added benefit that only two evaluations of the likelihood are involved.
We construct proposals for π and φ to be accepted or rejected through a Metropolis-Hastings acceptance probability.Let n j = |{i : J i = j}|.We use a Dirichlet Di(n 1 +1, n 2 +1, . . ., n J +1) to simulate a proposal π = (π 1 , π 2 , . . ., π J ) for π.At a first step we ignore the ordering imposed on the π j 's.Due to alternative relabelings, the posterior distribution will have J! symmetric components.Given a simulated value h * from the posterior, a relabeling of h * may be considered to be a simulated value from the posterior.We therefore run the MCMC sampler and after a burn-in, when samples may be viewed as drawn from the posterior, we relabel the components to have the correct ordering in the π j 's.After the burn-in we relabel the samples every 5 or 10 passes and use only those samples both to ensure the correct ordering in the π j 's and avoid correlation.Thus in what follows we take the prior for π as in (1) ignoring the ordering constraint.

Figure 1 :
Figure 1: Directed acyclic graph for our model.

Figure 2 :Figure 3 :
Figure 2: Archipelago plot for the 435 genes analyzed from the RegulonDB E. coli gene database.