Strategies to avoid drowning in the deep sequencing data flood

The enormous technological progress in the field of functional genomics during the last 15 years had a significant impact on animal sciences. With the development of Next Generation Sequencing it became feasible to analyze genomes and transcriptomes within short time frames and affordable costs. One major challenge of this rapid development is to manage the data flood and to perform data analysis and integration in an optimal manner. This review provides some information about a typical analysis pipeline for RNASequencing (RNA-Seq) data and a strategy for the analysis of small RNA-Seq data derived from species with poor annotation for non-coding RNA genes. Furthermore, problems regarding gene annotation in livestock species and their possible implications for data analysis and interpretation are discussed. Despite of not yet solved problems and challenges with respect to data analysis and integration the approaches in the field of functional genome analysis opened up new ways to try to understand the complex trait fertility.

The tremendous technological advances in the field of functional genomics during the last decades had a strong impact on research in animal sciences.This is reflected, e.g., by a dramatic increase of the number of publications containing respective keywords (Fig. 1).The first wave started with the broad application of DNA microarrays end of the nineties, and a similar rise is observed for studies using RNA sequencing (RNA-Seq).With respect to livestock the increase of the number of published transcriptome studies showed a shift of two to three years.
The development of Next Generation Sequencing (NGS) facilitated the analysis of genomes and transcriptomes in an extremely short time at affordable costs (Goodwin et al., 2016).With the newest instruments for the generation of so-called "short reads" (up to 2X 150 bp, Illumina HiSeq 4000) it is currently possible to obtain up to 1.5 Tera bases corresponding to 5 billion reads per run or 12 genomes or 100 transcriptomes or 180 exomes per instrument run which takes 3.5 days.Furthermore, Third Generation sequencers deliver extremely long reads and can be used to sequence full-length RNA molecules (messenger as well as long non-coding RNAs) or to bridge longer repetitive genomic sequences to fill the gaps of the current versions of genome sequence assemblies (Goodwin et al., 2016).But also in the field of proteome analysis, the techniques have advanced, particularly mass spectrometric methods.Improvement has been achieved mainly with respect to sensitivity and quantification (Zhang et al., 2014a, b).Furthermore, NGS techniques have been refined in order to analyze tiny amounts of RNA or DNA.Whereas early RNA-Seq library preparation protocols needed starting material (total RNA) in the microgram range, modern standard protocols start from 100 ng of total RNA.Special protocols were developed to perform RNA-Seq even for a few or single cells such as oocytes and early embryos but also with parts of neuronal cells (Liu et al., 2014;Hrdlickova et al., 2016;Marr et al., 2016).
With this rapid development, particularly for NGS, a big challenge came up with respect to data analysis, interpretation, and integration (Rajasundaram and Selbig, 2016;Sun and Hu, 2016;Suravajhala et al., 2016).More and more data sets are generated for the analysis of gene expression at the level of RNA and proteins as well as for the genome-wide identification of sequence variants correlating with the trait fertility (Bauersachs, 2014;Bauersachs and Wolf, 2015).The combination of data from genome-wide association studies (GWAS) or quantitative trait locus (QTL) studies with corresponding data derived from gene expression analyses has a great potential to improve the understanding of the trait fertility with respect to the effects of sequence variations on gene expression regulation.A number of attempts to integrate these data have been performed for cattle (Pimentel et al., 2011;Minten et al., 2013;Moore et al., 2016).
In addition to the classical gene products mRNA and protein also non-coding RNA molecules are investigated which mainly have a role in regulation of gene expression (Bidarimath et al., 2014;Kotaja, 2014).Particularly, microRNAs (miRNAs), short non-coding regulatory RNAs, play a major role in the regulation of gene expression mainly at the level of repression of translation of specific target mRNAs as well as mRNA degradation (Krol et al., 2010).The expression of miRNAs in endometrium and in the embryo/conceptus has already been investigated in a number of studies (Ponsuksili et al., 2014;Krawczynski et al., 2015a, b).
For various reasons, such as not well standardized data analysis pipelines, incomplete genome sequence assemblies for livestock species, and incomplete gene annotation the analysis and the comparability of different data sets is complicated.This Anim.Reprod., v.13, n.3, p.153-159, Jul./Sept.2016 is even more complicated if omics data has been generated in different labs using various technological platforms (Bauersachs, 2014).To solve these problems will be one of the main tasks for future research if the scientific community is interested in exploiting the potential of omics studies and in a real progress in the field, i.e., the understanding of fertility as a complex trait.

Typical data analysis pipeline and statistical analysis
A typical data analysis pipeline for RNA-Seq data comprises several steps starting from the obtained sequence reads (Fastq files).Usually, the sequence reads are first trimmed based on quality scores (e.g. with Trimmomatic), i.e., bases with low quality at the ends (mainly found at the 3' end) are removed.Since RNA-Seq libraries often contain a certain percentage of cDNA inserts shorter than the read length, some reads run into the adapter sequence which has to be removed using a respective tool.To get information for the quality of the sequence data, Fastq files are checked before and after processing steps (e.g.FastQC) to ensure that all files have a comparable quality and to identify potential sequencing artifacts.After these data processing steps, the remaining reads are usually mapped to a reference genome or a transcriptome.The first is usually performed by the use of a spliced read mapper, e.g., Tophat2 (Kim et al., 2013) or HISAT (Kim et al., 2015).After assigning the sequence reads to a specific location in the genome the reads are counted for each exon, transcript or each gene.This can be performed on the basis of available gene annotation from NCBI or Ensembl.Alternatively, the data itself can be used to complement existing gene annotation using tools like Cufflinks or StringTie (Trapnell et al., 2012;Pertea et al., 2015).In the first years of RNA-Seq data analysis most of the tools were only available in command line mode running on Linux systems.With the integration into the Galaxy platform, a web browserbased genome analysis tool (Blankenberg et al., 2010), complex large-scale analyses can be performed without informatics or programming expertise (Giardine et al., 2005).Finally, these steps result in a read count table that is used for analysis of differential gene expression.Widely used tools for the analysis of read count data and the identification of differentially expressed genes (DEG) are the BioConductor R packages EdgeR (Robinson et al., 2010) and DESeq2 (Love et al., 2014).Since a local installation of Galaxy on a LINUX server is necessary to analyze bigger data sets such as RNA-Seq data an alternative way is to do the complete analysis of RNA-Seq data by the use of R and BioConductor on a desktop computer (Anders et al., 2013).

Analysis of small RNA-Seq data sets with special adaptation to poorly annotated species
For the analysis of small RNA-Seq data sets a modified analysis pipeline is needed compared to the basic analysis pipeline for RNA-Seq data since the resulting reads represent, at least in theory, the entire sequence of a small ncRNA.The typical processing steps of the FastQ files starting with quality control up to adapter clipping are similar.However, the use of spliced mappers like TopHat2 (Kim et al., 2013) or HISAT (Kim et al., 2015) for the analysis of smallRNA-Seq data sets is not appropriate because small RNAs are usually neither spliced nor found in coding regions of annotated genes.This leads to the necessity of a different mapping and sequence annotation strategy.For example, mapping to a reference genome using the Burrows-Wheeler Alignment tool (BWA; Li and Durbin, 2009) to map against a reference genome or NCBI BLAST (Altschul et al., 1997) for short sequences which is also available in Galaxy (Blankenberg et al., 2010) are suitable options.The BWA aligner works best for well annotated genomes where almost all short ncRNAs are known.So, the obtained sequences are just mapped to the corresponding genes and miRNA sequences (canonical and isomiRs) can be easily analyzed with tools like miRDeep2 (Friedlander et al., 2012).Because the BLAST algorithm is too slow for the analysis of too high numbers of sequence comparisons the number of unique sequences found in smallRNA-Seq libraries have to be appropriately filtered, e.g., based on a counts per million (CPM) cut-off, to reduce the number of sequences from hundreds of thousands or even millions to several thousand.This filtering removes at the same time sequences without biological relevance or sequences which are very likely to be the result of sequencing artifacts.An example for this data analysis strategy is shown in Fig. 2.
A challenging problem for livestock species including pig and cattle is the rather low number of annotated small ncRNAs, which complicates the use of BWA for mapping and miRDeep2 for identification of miRNAs.Furthermore, small RNA libraries usually contain also many other small RNAs in addition to miRNAs, such as fragments of ribosomal RNAs (rRNA), transfer RNAs (tRNA), small nucleolar RNAs (snoRNA), small nuclear RNAs (snRNA), and Piwi-associated small RNAs (piRNAs) in case of germline cells (Cole et al., 2009).Although the prediction of novel miRNAs can be performed by the use of miRDeep2 (Friedlander et al., 2012), the annotation of sequence fragments derived from other RNA molecules is more difficult.In contrast in humans, a great variety of ncRNAs is known compared to other mammalian species.This information can be used to improve annotation of small RNA data from other species since many of these RNAs are highly conserved.The use of BLASTn-short (local installation in Galaxy) for sequence comparison to all available sequences for RNA molecules of the target species and the inclusion of ortholog information derived from well annotated species from different annotation sources significantly improves the annotation of identified sequences found in the small RNA-Seq results to 80-90%, depending on the species and the sample type (Jochen Bick, 2016; ETH Zurich; personal communication).The consideration of the frequent occurrence of sequences representing isoforms of miRNAs (isomiRs; Krawczynski et al., 2015a;Zhang et al., 2016) can further improve sequence annotation.IsomiRs result from imprecise and alternative cleavage during the pre-miRNA processing and post-transcriptional modifications.The isomiRs show different miRNA stability, sub-cellular localization, and target selection (Zhang et al., 2016).Since post-transcriptional modification during miRNA processing also leads to the addition of nucleotides not matching to the genome sequence those isomiRs cannot be easily mapped using BWA and/or miRDeep2.Using the annotation strategy based on BLASTn searches following statistical data analyses can be performed including various types of small ncRNAs or miRNAs only.Furthermore, based on the attempt to annotate as much as possible of the obtained sequences, percentages of read counts in relation to the total number of read counts can be calculated for individual types of ncRNAs.This can also help to identify technical or biological outliers in a data set.
Figure 2. Workflow for data analysis of small RNA-Seq data performed by the use of Galaxy tools.The workflow goes from left to the right and includes processing of sequence files, generation of a read count table, and annotation of the obtained sequences.Fastq: sequence files derived from Illumina sequencer; FastQC: tool for quality control of fastq files; Trimmomatic and FastqMcf: tools for processing fastq files; BLAST: Basic Local Alignment Search Tool.Anim.Reprod., v.13, n.3, p.153-159, Jul./Sept.2016

Gene annotation in livestock
The efforts to sequence the human genome (Lander et al., 2001) were extremely high and cost alone the US tax payer almost three billion dollars.In addition to the genome sequence itself large projects were performed to sequence full-length cDNAs from mRNA derived from almost all human tissues (Wiemann et al., 2001) to obtain information about transcribed regions in the genome, gene structures, and transcript isoforms.Meanwhile, genomes have been sequenced also for livestock species (Elsik et al., 2009;Wade et al., 2009;Groenen et al., 2012).However, gene annotation for these species is still based in large part on the comparison to human or mouse orthologous genes.In addition, different annotation pipelines, e.g., NCBI and Ensembl, provide gene models which show sometimes substantial differences making the correct assignment of genes annotated with different pipelines the same genomic locus difficult.The corresponding information for the assignment of genes in Entrez Gene to genes in Ensembl is incomplete (NCBI->Ensembl) or incorrect (Ensembl->Entrez Gene).This is a serious problem if useful information such as ortholog annotation found in one database should be assigned to gene IDs of the other database.
For the functional annotation and downstream bioinformatics analysis the use of gene IDs of livestock species is not optimal and leads to information loss.The reason for that is incomplete annotation, i.e., many genes still do not have the official gene symbol and are not assigned to functional annotation databases such as Gene Ontologies (Ashburner et al., 2000) and KEGG pathway database (Kanehisa and Goto, 2000).To avoid this loss of information the putative human ortholog information can be used.One resource for ortholog information is for example EnsemblCompara (Vilella et al., 2009;Pignatelli et al., 2016).In order to combine information derived from different databases provided by the NCBI and Ensembl we are developing a Mammalian Ortholog and Annotation database (MOA-Db) integrated in the Galaxy platform in our group (Jochen Bick, 2016; ETH Zurich; unpublished results).A schematic overview of the MOA-Db is shown in Fig. 3.

Conclusions
The development of functional genomics approaches opened new ways to improve our understanding of the complex trait fertility.After the first wave of enthusiasm it is becoming more and more evident that there is a number of big challenges in the context of data analysis and integration.The main problems are inherent in missing standards for data analysis pipelines, integration of different kinds of data sets, bias in data sets related to different laboratories, protocols, and the use of different platforms.Furthermore, a particular challenge is the integration of results from different omics approaches, such as genome, transcriptome, proteome, and metabolome analysis.A major obstacle for the integration of omics data sets is the existence of a plethora of different databases and corresponding identifiers as well as incomplete, inconsistent, and not coordinated gene annotations, e.g., when comparing genome annotation at NCBI and Ensembl.In addition, an insufficient and/or erroneous gene or protein annotation leads to a significant loss of information and in the worst case to wrong data interpretation.Despite of all these problems and challenges, the development of the new sequencing technologies and the foreseeable even more exciting developments with respect to functional genomics technologies promise a new era of research in the animal sciences.

Figure 3 .
Figure 3. Development of a Mammalian Ortholog and Annotation database (MOA-Db).Based on information derived from public databases and available RNA-Seq data sets an annotation database is built for a number of mammalian species including gene annotation from NCBI and Ensembl as well as ortholog information.The ortholog relationships are based on information extracted from databases such as EnsemblCompara as well as on crosswise global BLAST comparisons of all transcripts annotated at NCBI for each species.