Time-series RNA-seq analysis package (TRAP) and its application to the analysis of rice, Oryza sativa L. ssp. Japonica, upon drought stress
Introduction
Biological processes involve a set of genes that interact dynamically and differentially in specific conditions. Some of the genes are activated and suppressed in condition specific manners and gene activation/suppression affects regulation of other genes directly or indirectly. Thus understanding biological processes from transcriptome (whole genome gene expression) data is a very important analysis task. One of the most effective methods to understand biological processes is to match up genes that are activated and suppressed in terms of curated pathway databases such as KEGG [1] or DAVID [2]. Techniques for measuring transcriptome have rapidly changed from microarray to sequencing as the cost for sequencing cost decreases. The high throughput RNA sequencing (RNA-seq) technique has advantages over microarray technique such as the decreased noise level and more replicable results [3], compared to microarrays. For this reason, the number of time-series RNA-seq data sets in the public domain has increased dramatically over the past few years [4]. The main issue with RNA-seq is to handle sequencing data that is much bigger than microarray data. When transcriptome data is measured over time, analysis of RNA-seq data requires to consider the time dimension. Thus, analysis of whole genome transcriptome data is a quite involved task. Two major hurdles are:
- •
A number of tools developed for microarray data are not suitable for the RNA-seq data.
- •
Analysis of RNA-seq data requires use of multiple tools in a pipeline. Especially analysis of time series RNA-seq data is very complicated, thus only bioinformatics experts can construct such pipeline.
To address these challenges, we developed a comprehensive package, Time-series RNA-seq Analysis Package (TRAP), for analyzing time series transcriptome data. There have been many packages developed for analyzing time-series gene expression data [3]. The most widely used technique is to identify DEGs. Packages for finding DEGs from time-series data include SAM [5], LIMMA [6], EDGE [7], maSigPro [8] and BETR [9]. However, most of them are developed for microarray data and they assume that gene expression follows normal distribution. Users with RNA-seq data, therefore, should perform additional conversion process or use the certain type of distribution (e.g., Poisson or Binomial) which needs further validation. Some of the package does not support time-series analysis for RNA-seq data, some have constraints on the number of replicates and some are out dated, not compatible with current operating systems. Another important analysis topic is clustering genes that have similar time-series expression patterns. Several clustering packages are available. CAGED [10] uses autoregressive equations, GQL [11] uses Hidden Markov Model, STEM [12] uses expression profiles, TimeClust [13] uses stochastic models, DynaMiteC [14] uses impulse models, and PESTS [15] uses various features selected by user.
Finding DEGs and clustering is the first step for identifying genes that may have an important role in relation to phenotypes. However, the analysis needs to go at least one step further to extract biological implication from the gene list. The most widely used method for this additional analysis step is pathway analysis. Through research on biological process over several decades, we now have significantly accumulated knowledge about the biological pathways, genes and proteins, and their interactions. The pathway knowledge is available as public repositories such as Kyoto Encyclopedia of Genes and Genomes (KEGG) [1] or Gene Ontology [16]. By combining the pathway information and a list of significant genes from the experimental results, inference of active pathways is possible and this method is called knowledge base-driven pathway analysis [17]. Most of the packages for DEGs and clustering introduced above do not include pathway analysis functions. Instead, there have been studies of inferring pathways from time-series expression data using prior knowledge and DBN [18], Granger causality and Minimum Spanning Tree [19], Ordinary Differential Equation [20]. Albeit these studies have given us important algorithms and findings for pathway inference, they are not of practical use, as the source code or software for their analysis are not available and the analysis result do not include pathways from public databases.
On the other hand, several knowledge base driven pathway analysis algorithms using public databases have been developed over time and they include Over-representation Analysis (ORA) and Pathway Topology (PT)-based analysis [17]. ORA [21] evaluates pathways by using DEGs and ORA uses GO database to show biological process underlying the phenotype. However, the biological process from ORA does not use specific biological pathway information such as KEGG. In addition, most of the packages are web-based or Windows-based applications which are not suitable for big data sets such as RNA-seq data. The PT-based approach utilizes the pathway network and interactions in addition to the gene expression values. SPIA [22], one of the PT algorithm, is originally developed as a component of the BioConductor package for human dataset but additional steps are needed to be used for other species.
This article describes a Time-series RNA-seq Analysis Package (TRAP), integrating all necessary tasks such as finding DEGs, clustering and pathway analysis for time-series data. TRAP is an easy-to-use and comprehensive package since it automatically performs a series of analysis steps from mapping short reads to pathway analysis. The key features of TRAP are as follows:
- •
TRAP is a comprehensive package that puts together the state of the art technologies for time series gene expression data analysis such as ORA and SPIA for pathway analysis and gene expression change labeling for clustering.
- •
TRAP is designed for RNA-seq data and it performs all necessary analysis steps automatically from mapping RNA-seq short reads to the reference to pathway analysis to visualization of pathway interactions.
- •
TRAP extends pathway analysis methods, ORA and SPIA, to time series analysis and estimates statistical values for combined dataset by an advanced metric.
- •
TRAP is successfully used to analyze time series RNA-seq data for investigating biological mechanisms for the drought resistant rice.
Fig. 1 illustrates the analysis procedure of TRAP. TRAP reads the user configuration file that includes a list of sample names, location of the sequenced reads, and user-defined thresholds for selecting DEGs. After preprocessing, a series of analysis steps of TRAP as described in a white box of Fig. 1 are executed. For each sample, paired reads are mapped to the reference genome by Tophat [23]. Using the mapping results, TRAP estimates gene expression levels by Cufflinks [24] according to the gene annotation file from Rice Annotation Project Database (RAP-DB) [25]. As an output of Cufflinks, gene expression levels measured in FPKM (Fragments Per Kilobase of exon per Million fragments mapped) unit are stored in the text files. TRAP combines the gene expression values with KEGG pathway information to perform three types of analysis: one time point pathway analysis, time-series pathway analysis and time-series clustering. The pathway information in KEGG database is described in the form of KEGG Markup Language (KGML) file and the user can define a new pathway making his own KGML file. The output of the pathway analysis is two text files and an image file. The text files contain a list of DEGs and a list of pathways with their P-values and other statistics. The image file is a graph-representation of the result for easy interpretation. Clustering result includes two text files, a list of genes and a list of significant pathways for each cluster. Tools, database, and algorithms used in TRAP are listed in Table 1.
The following is the TRAP input and output structure.
- 1.
Input
Before staring TRAP, user must fill in config.txt file to help TRAP read the following information. If user already has Tophat or Cufflinks results, the optional fields can be ignored.
Sample pair name
Sample names in control-treated order. Each line corresponds to one time point.
Sample name and file path
Sample names and their sequencing data file path in the server. If it is the paired-end sequencing data, place the white space between the path.
Tophat path (optional for Tophat)
Tophat path on the server.
Reference genome path (optional for Tophat)
Reference genome path on the server.
Cufflinks path (optional for Cufflinks and Cuffdiff)
Cufflinks path on the server.
Annotation file path (optional for Cufflinks)
Provided by TRAP (for Oryza sativa L. ssp.Japonica). The file should be in a readable format for Cufflinks, e.g., gtf/gff.
Gene name conversion file path
Provided by TRAP (for Oryza sativa L. ssp.Japonica). If the user-defined annotation file is used and the gene names in the annotation file is different from those in KEGG database, the conversion file is needed.
DEG and clustering cutoff
- 2.
Output The output files are stored in TRAP_result folder.
Pathway analysis results
One_time_genes.txt: A list of DEGs from each time point.
One_time_pathways.txt: A list of pathways and their size, P-values, status from one time point analysis.
Time-series_genes.txt: A list of DEGs from time-series analysis.
Time-series_pathways.txt: A list of pathways and their size, P-values, status from time-series analysis.
Clustering results
Clustering_genes.txt: A list of clusters and genes in the clusters.
Clustering_pathways.txt: Pathway analysis result for each cluster.
Section snippets
One time point pathway analysis
The first analysis in TRAP is finding significant pathways for each time point. TRAP takes gene expression values of a certain time point as an input and executes two analysis methods to the values: ORA and SPIA. As a result, it provides a list of DEGs and a list of pathways with the P-values from two methods. This is repeated for Lt time points independently.
The analysis starts with selecting DEGs from the genes. We define DEGs as genes which have significantly different gene expression value
Results and discussion
To evaluate how good TRAP is, we used rice (Oryza sativa L. Japonica nipponbare) mRNA-seq data generated by Illumina sequencing. The dataset compares drought-resistant AP2/EREBP transgenic rice samples with normal nontransgenic rice samples sequenced at three time points, 0, 1, and 6 h, after applying drought stress. Two AP2/EREBP transcription factors are amplified in rice by using overexpression vectors OsCc1:AP37 and OsCc1:AP59, respectively, resulting in increased drought resistance [28].
Conclusion
This study introduced TRAP, a comprehensive package for analysis time series RNA-seq data that performs pathway and clustering analysis. TRAP implements ORA and SPIA methods for analyzing pathways with and without interactions between genes. Besides providing results for each time point, TRAP considers all time points, i.e., time series, and estimates a single pathway-level statistic value using modified versions of ORA and SPIA. Gene expression change labeling used for clustering enables users
Acknowledgments
This work is supported by a grant from the Next-Generation BioGreen 21 Program (No. PJ009037022012), Rural Development Administration, Republic of Korea; the Bio & Medical Technology Development Program of the National Research Foundation (NRF) funded by the Ministry of Science, ICT & Future Planning (2012M3A9D1054622).
References (38)
- et al.
Leaf membrane lipids and drought tolerance in young coconut palms (Cocos nucifera l.)
Eur. J. Agron.
(1997) - et al.
Effects of water stress on the molecular species composition of polar lipids from Vigna unguiculata L. leaves
Plant Sci.
(1990) - et al.
Superoxide dismutase an enzymic function for erythrocuprein (hemocuprein)
J. Biol. Chem.
(1969) - et al.
Enhanced drought tolerance of transgenic rice plants expressing a pea manganese superoxide dismutase
J. Plant Physiol.
(2005) - et al.
Salt tolerance of transgenic rice overexpressing yeast mitochondrial Mn-SOD in chloroplasts
Plant Sci.
(1999) - et al.
Kegg: kyoto encyclopedia of genes and genomes
Nucleic Acids Res.
(2000) - et al.
Systematic and integrative analysis of large gene lists using david bioinformatics resources
Nature Protocols
(2008) - et al.
Studying and modelling dynamic biological processes using time-series gene expression data
Nature Rev. Genet.
(2012) - et al.
Ncbi geo: archive for functional genomics data sets 10 years on
Nucleic Acids Res.
(2011) - et al.
Significance analysis of microarrays applied to the ionizing radiation response
Proc. Natl. Acad. Sci.
(2001)