Increasing Power by Sharing Information from Genetic Background and Treatment in Clustering of Gene Expression Time Series

Clustering of gene expression time series gives insight into which genes may be co-regulated, allowing us to discern the activity of pathways in a given microarray experiment. Of particular interest is how a given group of genes varies with different conditions or genetic background. This paper develops a new clustering method that allows each cluster to be parameterised according to whether the behaviour of the genes across conditions is correlated or anti-correlated. By specifying correlation between such genes,more information is gain within the cluster about how the genes interrelate. Amyotrophic lateral sclerosis (ALS) is an irreversible neurodegenerative disorder that kills the motor neurons and results in death within 2 to 3 years from the symptom onset. Speed of progression for different patients are heterogeneous with significant variability. The SOD1 G93A transgenic mice from different backgrounds (129Sv and C57) showed consistent phenotypic differences for disease progression. A hierarchy of Gaussian isused processes to model condition-specific and gene-specific temporal co-variances. This study demonstrated about finding some significant gene expression profiles and clusters of associated or co-regulated gene expressions together from four groups of data (SOD1G93A and Ntg from 129Sv and C57 backgrounds). Our study shows the effectiveness of sharing information between replicates and different model conditions when modelling gene expression time series. Further gene enrichment score analysis and ontology pathway analysis of some specified clusters for a particular group may lead toward identifying features underlying the differential speed of disease progression.


Introduction
The dynamic behaviour or analysis of time series data in particular clusters is important for exploring and understanding gene networks. In many conventional time series models, one key requirement is data with regular intervals. Gene expression experiments data with regular intervals might be less informative or may not be optimal from a statistical perspective or even may not be cost effective for various reasons. A model designed to obtain data with regular intervals may not elicit as much information as a method designed to collect pertinent special temporal features. Again, in many cases multiple biological replicates are available when the same experiments are repeated multiple times. For these cases simply considering only one experiment or taking the mean values from different replicates may not be the best solution. Interesting information might be discarded while dealing only with one data set or with their mean values. Amyotrophic lateral sclerosis (ALS) is a diverse neurodegenerative disorder with around 10% of familial cases and the remaining sporadic. The disease is currently irreversible from onset and heterogeneous with variable severity in terms of speed of progression of the disease course. Injury and cell death of motor neurons in the brainstem, spinal cord and motor cortex are the main reasons of this relentlessly progressive disorder [Haverkamp et al.,1995, Ferraiuolo et al., 2011Peviani et al.,2010, andA. Brockington et al., 2013]. Among the familial ALS [fALS] 20% is caused by mutation in the Cu/ZnSuperoxide Dismutase1 (SOD1) gene. The median survival of this lethal disorder is less than 5 years, only 20% patients live longer than 5 years and less than 10% patients survive more than 10 years from the symptom onset [Beghi et al., 2011, R. A. Saccon,et al., 2013. The speed of disease progression is not clear from the biological basis. Even in fALS, affected members clearly show the clinical het-erogeneity in terms of site of onset, age and progression rate of the disease. [Camu et al., 1999] reported the presence of potential gene modifiers and pathways that particularly affect the disease phenotype. Mutation in the SOD1 gene notably characterized the distinctive nature by intrafamilial and interfamilial variabilities in the phenotype. Many of the clinical and pathological features of human ALS can be replicated very well by transgenic mice. These murine models also mimic the human disease and show the heterogeneity in the disease progression for the clinical phenotype. These variabilities may be related with expression levels of mutant SOD1 protein or specific SOD1 mutations [Turner et al., 2003].
In a previous study [Pizzasegola, 2009 ;Alrashid and Al-Aaraji, 2015]it was reported that disease progression is much faster in 129Sv mice with the survival of 129±5 days, while the C57 mouse strain can survive 180±16 days. Both the 129Sv and C57 carry the same copy numbers of human mutant SOD1 and express the same amount of mutantSOD1 G93A messenger RNA in the spinal cord. [Marino et al., 2015] reported about the differences in protein quality control of these mouse models in terms of speed of progression of the disease course. The aim of our paper is to specify the significantly different genes that may affect the speed of ALS progression by building a new model. Gaussian process is used and here a coregionalization principle is introduced while developing the kernel of the Bayesian hierarchical Gaussian process model. There might be some degree of temporal continuity between different replicates of various strains. So, the kernel designed considering coregionalization model will consider the shared information between the replicates and strains. We used programming language python based tool GPy 1 , to develop our model. Later we optimized these models and compared them based on likelihood scores and select the best. We investigated the clusters obtained from our model. We have calculated the enrichment scores  for every cluster using the tool DAVID 2 (Database for Annotation, Visualization and Integrated Discovery)  and identified clusters which have very high enrichment scores. We carried out further analysis on some clusters with high enrichment score and demonstrated some interesting characteristics in their dynamic behaviour at the four time stages (pre-symptom, onset, symptom and end-stage) of disease course. Our functional annotation clustering and pathway analysis reveal some interesting information for a group of genes which might have some functionality for the speed of propagation of ALS particularly with reference to this specific type of mouse model.

Related Work
Gene expression time series data has been used extensively over the last few decades and implemented for insilico experiments to investigate various fundamental biological processes. Among the many processes examined, some of the notable examples are cell cycle, cell signalling [Barenco et al., 2006], regulatory activity [Sanguinettiet al., 2006], and developmental process [Tomancaket al., 2002]. Gaussian process has applied to gene expression time series widely with several aims and analyses, such as transcription factor target identification [Honkelaa et al., 2010], inference of RNA Polymerase transcription dynamics [Maina et al., 2014], and ranking differentially expressed time series [Kalaitzis and Lawrence, 2011]. Hierarchical models can significantly improve the inference in the Bayesian statistical problems [Gelman et al., 2004]while dealing with multiple relatedgroups of data allowing exchange of information. Inference on the whole structure of data is always preferable thanpartial independent structure. Estimating replicate time shifts were proposed by [Liu et al., 2010], where they used Gaussian process regression with uncertain measurement of mRNA expression. This method requires a large number of variables optimization.Previously, [Nget al., 2006;Medvedovic, 2004] used clustering method to model replicates using a hierarchical structure. Both of the model computes the replicate variance as multivariate Gaussian around some gene-specific mean. In a clustering application Gaussian process regression could be useful for parsimonious temporal inference. Temporal covariance of genes within a cluster can be designed by adding a hierarchical layer, again covariance between multiple biological replicates can be constructed considering one more hierarchical layer [Hensman et al., 2013]. Whilst GPs also overcome the requirement of evenly spaced time points for time expression data.

Gaussian Process Definition
A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution [Rasmussen and Williams, 2006].It is a continuous stochastic process and defines probability distributions for functions. It can be also viewed as a random variable indexed by a continuous variable: chosen from a random function variables , with corresponding indexed inputs . In Gaussian processes, variables from these random functions are normally distributed and as a whole can be represent as a multivariate Gaussian distribution: (1) where is the mean and is covariance of Gaussian distribution . The Gaussian distribution is over vectors but the Gaussian process is over functions. We need to define the mean function and covariance function for a Gaussian process prior. If is a real process, a Gaussian process is completely defined by its mean function and covariance function given in equation 2 and equation 3 respectively. Usually the and the covariance function are defined as where represents the expected value. We denote the Gaussian process as (4) The covariance matrix is constructed from the covariance function and .

Gaussian Process Regression
Gaussian process regression can be done using the marginal and conditional properties of multivariate Gaussian distribution. Let's consider that we have some observations f of a function at observation point . Now we wish to predict the values of that function at observation points , which we are representing by . Then the joint probability of f and can be obtained from where the covariance matrix has elements derived from the covariance function such that the element of is given by [ ] [ ] The conditional property of a multivariate Gaussian is used to perform regression the. The conditional property can be represented by ( | ) (6) In ideal case the observations is noise free but in practice it is always corrupted with some noise. Let's consider is the corrupted version of . If we consider this noise as Gaussian noise then we can write , where is the variance of the noise and I is the identity matrix with appropriate size and marginalise the observation . Then the joint probability of and can be represented by . (7) Regression with Gaussian process can be seen as a Bayesian method. From the knowledge of a prior over a function we proceed to a posterior and this happens in a closed from of equation 6. To construct the covariance function still we need to consider the hyperparameters. The most efficient and commonly used selection technique for hyperparameters in Gaussian process is maximum likelihood. If we consider all the hyperparameters, and lin to a vector , then we can use gradient methods to optimize p (y| ) with respect to . The Log likelihood is given by

Hierarchical Gaussian Process
Our gene expression time series came from four different strain and there are four biological replicates. So for every individual gene wecanincorporate these in a hierarchical fashion. Let denotes gene expression of gene in the biological replicates and biological strain. Measurements were made at four different times and collected into a vector . The data for strain's gene is denoted by and .
( ) and ̂ [ ] and representsthe hyperparamters forthe covariance function and . The structure of the covariance matrix for two genes n hyperparameters for and are given by otherwise.
While designing different kernels k we have used Coregionalization model. Here we constructed a hierarchical GP [11] based model to analyse the gene expression time series data collected from four mouse models with different genetic background (129Sv and C57 with transgenic and non-transgenic). We also considered their replications (four in our case) and build a covariance matrix based on their shared information and the time points were pre-symptom, onset, symptom and end stage of the disease course.

Kernel Design with Coregionalization
Gaussian process models have been used already to capture structure in the data arising from temporal correlation. Our innovation is to realise that there is actually additional correlation structure relating to the genetic background of the organism (in our case, the mice strains) and the status as control/experiment (in our case the presence or absence of the SOD1 mutation). By acknowledging such structure in thecovariance matrix, we can increase the power of our method. Standard approaches force each of these conditions to be fully independent. Our model allows the correlation structure to be learned. Our formalism for introducing correlations across conditions and strains is the coregionalization principle [Alvarez and Lawrence, 2011] that originates in geostatistics [Wackernagel, et al., 2003]. Coregionalization matrices allow us to share the information between the replicates and strains. In machine learning language, this approach is some-times known as 'multi-task learning' where eachcondition and strain is assumed to be a different task. However, in statistical terms it is simply a multi-variate regression or amultiple output model.An appropriate general model that can capture the dependencies between all the data points and conditions is known as the linear model of coregionalization (LMC) is a model where output is a linear combination of independent random functions. (A detail explanation of the coregionalization model is available at [Alvarez and Lawrence, 2011;Alvarez et al., 2012]). If we can consider our problem with a set of output functions for input domain, then output function of LMC can be expressed as ∑ Here the interpretation is that are a set of functions that each share the same covariance function (one can think of them as some form of underlying latent processes that determine system behaviour). The parameters represent the relationship between a given latent function, q and an observed condition and or strain. If we consider there can be several different covariance functions associated with separate latent sets then equation 15 is expressed as and the cross covariance function between in terms of the function is given by For the so-called homotopic case Lawerence, 2011,Wackernagel, 2003] the covariance matrix for the joint process f can be rewritten as a sum of Kroneckerproducts,finally, we can write the covariance as ∑ ∑ where represents Kronecker product, and is the coregionalization matrix. The positive semi-definite covariance functions of the latent processes, can be chosen from wide range of covariance functions. Here we used a combination of exponentiated quadratic kernel (also known as squared exponential or RBF kernel) to describe the properties of the function which underlay each cluster. We used a white noise kernel in additive form to deal with the noise of the process. The experimental conditions of acquisition of gene expression measurements cannot be ideally controlled, so the measurements could be corrupted by noise, incorporated either at the biological origin or introduced in the measurement process. Figure 1 shows a simple representation of the coregionalization kernel in the input space and the representation of an optimized kernel where we considered only 100 genes.

Clustering
Our aim was to discover groups of genes that were exhibiting the same functionalbehavior across times and conditions. Our coregionalization approach allows us to cluster these sub groups through a mixture of Gaussian process models: each component is a function over time, genetic background and condition. Partitioning genes into clusters can be done by some using our Gaussian process prior over the functions and a Dirichlet process prior for the mixing coefficients. This can be achieved through Gibbs sampling [Dunson et al., 2010], but this can be slow in practice. A potentially improved model was proposed by [Hensman et al., 2013], where they consider the structure of covariance across the gene and separately across replicates. The use a variation lower bound for model inference. Each gene is placed in an individual cluster and later merged with a greedy selection process to maximize the log marginal likelihood of time series data. Hyperparameters are optimized when no merges are possible to improve the overall marginal likelihood. Then expectation maximization algorithm is used with new covariance function

Microarray Data Analysis:
We used the Affymetrix data from [Nardoet al., 2013]. In this experiment spinal cord tissues were obtained from C57 and 129Sv transgenic SOD1 G93A mice and age-matched non-transgenic littermates at the presymptomatic, the early symptomatic (onset) stage, symptomatic and end stage. The transcription profiles of laser captured motor neurons isolated from the lumbar ventral spinal cords of the rapid progressor (129Sv SOD1 G93A ), slow progress (C57 SOD1 G93A ) mice at four stages of the disease (presymptomatic, onset, symptomatic, end stage) and respective non-transgenic littermates were generated using the murine GeneChip Mouse Genome 430 2.0 Plus (Affy MOE4302). We used Bioconductorpacakge Puma [Pearson et al., 2009] to extract the point estimates of gene expression levels from the GeneChipAffymetrix data.

Select differentially expressed genes:
All the gene expression time series data extracted from Affymetrix data might not be differentially expressed and filtering out the requisite genes is obvious. Considering the temporal nature of data using Gaussian process [Kalaitzis et al., 2011, Alrashid andAlarajii, 2015] can be used to analyse the time series gene expression and filter the quiet or inactive genes from the differentiallyexpressed ones. In addition, identifying genes that have a good signal-to-noise ratio (SNR) is also used to filter down the total number of genes that need further analysis. We can rank the genes by the ratio of the mean replicatewise variance to the variance of the replicate-wise means. In our analysis, we used a combination of both of the approach. First, we made the initial ranking of the gene expressions (45, 037 genes for our case) using method of [Kalaitzis et al., 2011] and then we use the SNR to choose 10, 000 genes for further analysis. Before the filtering the gene expression levels of each replicates were normalized to zero-mean over all the samples.

Cluster analysis:
In the previous analysis stage, we derived 10,000 genefrom the total of 45,037 probe sets which were more dynamically differentially expressed. We applied our own hierarchical Gaussian process cluster model on these genes and collected the results for further experiments. Figure 2 shows a small part of our result. For any individual graph, along xaxis the four time stages are pre-symptom, onset, symptom and end-stage. We have used four different colours (yellow, red, green and blue) to separate four mouse strains (129Sv SOD1,129Sv Ntg,C57 Ntg and C57 SOD1 respectively). Any individual cluster contains a number of genes which might be biologically associated or co-regulated and we mention the number of the genes belong to that cluster at the corner of the plot. In the plot, a solid line rep-resents posterior mean function and shaded area represents 95% confidence interval. We have found a total of 203 different clusters with a variety of number of genes. Many of the clusters indicated different dynamic behaviour of the gene set. Many of the clusters were attractive for further analysis but that is beyond the scope this study. We included some examples in the Figure 3. We have limited our consideration to the clusters where the strain 129Sv SOD1 (yellow color in our representation) has different characteristics and focussed our consideration for further analysis.

Enrichment score analysis:
A typical biological process is regulated with a group of genes. If we apply a high throughput screen technology then the co-functioning genes are very much more likely to appear together with a higher potential (or enrichment) score. These logical reasons instigate the analysis of a gene list or group of genes moving from individual gene oriented view. The enrichment score is a quantitative measure derived from some wellknown statistical methods like Binomial probability, hyper-geometric distribution, Chisquare, Fisher's exact test. In a previous study, [Huanget al., 2009] reported about 68 Bioinformatics tools to compute the enrichment score and grouped them in three major categories. DAVID [Huanget al., 2009] is a widely used tool developed based on Fisher's Exact and extensively used for singular enrichment analysis (SEA) and modular enrichment analysis (MEA). We used DAVID on our clusters of genes to calculate the enrichment score for individual clusters. Figure 4 shows the result. Whilst in an analysis a group of genes with an enrichment score of 1.3 can be considered as a threshold value to decide whether this list of genes is enriched or not, here for our 203 clusters we have found at least 15 clusters have an enrichment score of 2. Pathway analysis:. Pathway analysis allows us to gain an insight of the underlying biology of the differentially expressed genes. Path-way analysis can reduce the complexity and increase the explanatory power where high-throughput sequencing and gene profiling are used to investigate whether a gene or a list of gene have any roles for a phenotype or a given phenomena. It is also used for the analysis of gene ontology, physical interaction networks, inference of pathways from expression and sequence data, and further comparisons. In a given condition it can identify the pathway by correlating information with a pathway knowledge base. We identified some clusters (which were deemed interesting in the earlier stages) and performed gene ontology enrichment analysis (one example is given at Table 1) and pathway analysis on individual clusters. We identified one of our clusters (cluster197; Figure 3) which was selected at the earlier stage for further analysis and has a relatively high enrichment score (2.16) which is related with ALS. In previous study [Brockingtonet al., 2013] reported about SOD1 related genes and ALS. One of the SOD1 related genes, Derlin 1, can accumulate with other misfolded proteins and cause the neuron death and belongs to our chosen cluster. Figure 5 shows the pathway analysis that we have found for one of our cluster using tool DAVID. We have also found some other genes from the same cluster are responsible for neuron death and related with some other neural disorder like parkinson.  (top to bottom, left to right: cluster 01 to 08, 16 to 23, 31 to 38 and 46 to 53). Along x-axis the four time stages are pre-symptom, onset, symptom and end-stage (all the data points together formed like solid yellow vertical lines). Four different colours yellow, red, green and blue are representing four mouse strains 129Sv SOD1,129Sv Ntg, C57 Ntg and C57 SOD1 respectively. Number at the corner indicates number of genes belong to this cluster.
Solid line represents a posterior mean function and shaded area represents 95% confidence interval.

Conclusions
We have performed genome-wide analysis to cluster genes systematically and analyse the rationale behind the variation in the speed of propagation for ALS. Our particular innovation was to include the condition and genetic background of the organisms within the underlying functional component of our clusters. This ensured that sub-groups where the underlying expression behaved similarly were more likely to cluster together. The hierarchical Gaussian process we used considers multiple replicates. For validation, we have used a widely acceptable gene ontology and functional annotation tool to validate our clusters and their characteristics obtained from our model. We found a number of clusters are highly enriched. Characteristics curve and enrichment scores analyse helped us to narrow down our search and lead toward finding the lists of genes or clusters which could be involved in the speed of disease propagation. Our pathway analysis found a gene which is known to be involved in the disease process. Here we started with whole genome set and ended with a single gene. This finding leads us to conclude that the model we have developed based on Gaussian process can cluster the genes successfully and they are very much informative. These clusters can be useful for further analysis. Even the model we have developed using hierarchical Gaussian process will be useful to investigate other biological activity where clustering is required.

Acknowledgments
We would like to thank Giovanni Nardo, Raffaele Iennaco, NicoloFusi, Marianna Marino, Maria C. Trolese, Laura Ferraiuolo, Pamela J. Shaw and Caterina Bendotti (authors of [Nardo et al., 2013]) who shared their data and allowed us to do further analysis. We would also like to acknowledge Ministry of Science and Information & Communication Technology, Bangladesh for funding Muhammad Arifur Rahman, we