From plant genomes to phenotypes

Recent advances in sequencing technologies have greatly accelerated the rate of plant genome and applied breeding research. Despite this advancing trend, plant genomes continue to present numerous difficulties to the standard tools and pipelines, not only for genome assembly but also for gene annotation and downstream analysis. Here we give a perspective on tools, resources and services necessary to assemble and analyze plant genomes and link them to plant phenotypes.


Introduction
The last decade has seen tremendous progress in the field of plant genomes, which began with the model plant, Arabidopsis thaliana, whose genome was published in 2000 (Arabidopsis Genome, 2000). This was followed shortly by the genomes of the first crop plants, rice and poplar, and since then both crop and non-crop plants from diverse clades have been sequenced and assembled (Bolger et al., 2014c) (http://www.plabipd.de/portal/sequence-timeline). Until recently, these sequencing projects were carried out by large genome sequencing consortia bringing together expertise in many different fields. The next generation sequencing era gradually enabled individual labs or small consortia to undertake whole plant genome assembly projects (Pucker et al., 2016). Further improvements in long-read technologies help bridge repeat regions, a major obstacle to completing a genome, which previously required complicated mate-pair libraries. Still, despite these tremendous advances, many plant genomes contain particular complexities which make genome assembly and analysis difficult (Claros et al., 2012).
Given that the "$1000 human genome" has been called a reality by Illumina, one might wonder why some plants remain so resistant to having their genomes assembled (Fig. 1).
One of the primary issues with plants is the highly repetitive nature of many plant genomes, which presents a major inherent problem for the assembly process. A recent study has highlighted this issue by systematically comparing over 40 plant genomes with over 60 vertebrate genomes using an unbiased k-mer based approach and showed the high repeat content in plants. Though this issue has been greatly reduced by long-read technologies, these reads are still unable to span the large tandem repeat regions found in many plant genomes. This problem is further exacerbated in many cases by the sheer size of plant genomes, which in extreme cases can reach tens of gigabases. In addition, plant genomes frequently have high ploidy levels which result either from duplications, as seen in the autopolyploid potato genome, or from a combination of genomes from different species, which gives rise to allopolyploid species such as wheat, Camelina and rape seed. As an example, the ca. 17 Gbp wheat genome features three subgenomes (International Wheat Genome Sequencing, 2014), each of which contains a set of homoeologous genes (genes with purportedly the same ancestor from different genome complements). Last but not least, many plants are self-incompatible (Fujii et al., 2016), thereby resisting attempts to achieve a homozygous state, whereas the genome assembly process is greatly simplified when there is only a single allele per locus to assemble. These genomic features combine to complicate the seemingly trivial task of producing a high quality genome assembly.
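To make the idea of a k-mer based view of repeat content concrete, the following is a minimal sketch rather than the method of the study mentioned above. It assumes a single-stranded sequence held in memory and a small k; the function name `repeated_kmer_fraction` is hypothetical, and real analyses typically count canonical k-mers on both strands with dedicated k-mer counters.

```python
from collections import Counter


def repeated_kmer_fraction(sequence: str, k: int = 5) -> float:
    """Return the fraction of k-mer positions whose k-mer occurs more than once.

    Illustrative only: no reverse complements, no handling of N bases.
    """
    sequence = sequence.upper()
    # Enumerate every overlapping k-mer in the sequence.
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    if not kmers:
        return 0.0
    counts = Counter(kmers)
    # Sum the occurrences of all k-mers that are seen at least twice.
    repeated = sum(n for n in counts.values() if n > 1)
    return repeated / len(kmers)
```

A perfectly periodic sequence gives a value of 1.0, while a sequence in which every k-mer is unique gives 0.0; highly repetitive genomes sit closer to the former.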
Plants are extremely versatile organisms featuring many hundreds of thousands of metabolites occurring throughout the plant kingdom and probably more than ten thousand metabolites per species (De Luca and St Pierre, 2000; Fernie, 2007). Although many of these metabolites share common pathways, there are still a large number of genes involved overall. In some cases, divergent genes which have largely retained sequence homology need to be separately annotated due to their involvement in different pathways and thereby different metabolites.
Furthermore, plant sciences feature several communities with specific needs and interests. As an example, a researcher working on sugar beet (Dohm et al., 2014) or carrots (Iorizzo et al., 2016) will be more interested in below ground organs and their development than a researcher working on tomatoes or barley. However, a common ground is that the plants are (also) looked at from a breeding perspective in order to improve plant yield and/or resilience. This process is greatly expedited by the development of statistical analysis and model based analysis of plant genetic (genomic) and phenotypic data (Hammer et al., 2006) as well as the maturation and development of plant phenotyping technology (Fiorani and Schurr, 2013).
Within this review we attempt to shed light on the analysis of plant genomes, describe current problems as well as how plant genomes can be best leveraged in conjunction with high throughput phenotyping to accelerate selective plant breeding.
We provide a detailed list of tools which can be used in the process of genome assembly, annotation and linking it to phenotypic plant data.

De novo genome assembly
Plant de novo genome assembly is notoriously difficult (Claros et al., 2012), mainly due to the problems mentioned above. This has prompted the development of tools which can cope with these difficulties, some of which also serve the wider scientific community. A notable example of this is the raw data preprocessing tool 'Trimmomatic' http://www.usadellab.org/cms/?page=trimmomatic (Bolger et al., 2014b), which was developed due to the necessity for a highly efficient adapter trimmer during a plant genome sequencing project and has since been widely adopted by the whole scientific community due to its flexibility, speed and efficiency. After read data preprocessing, error correction of reads is frequently carried out, which can be done by the assembler tool itself, as in the case of Canu https://github.com/marbl/canu (Koren et al., 2017) or SOAPdenovo http://soap.genomics.org.cn/soapdenovo.html (Luo et al., 2012), by using separate tools, or by using graph based analysis (Joppich et al., 2015). The latter graph based approach demonstrated that such an analysis can correct reads which are missed by other typical frameworks which employ k-mer counting methods.
The assembly of sequencing data is often undertaken as a 'trial and error' approach. The available tools each have their own strengths and weaknesses, and multiple runs with optimized parameters are often necessary to produce the best assembly. The source of the input data can also determine which assembler is used. Three main aspects differentiate the assemblers: (1) the underlying algorithm used, (2) how aggressive the tool is at dealing with false positives/negatives and (3) the heuristics which are employed by the assembler. These aspects combine to produce vastly different results which are not guaranteed to be correct. Determining how to assess the correctness of the output from an assembler is also challenging and has been covered by the Assemblathon challenge (DOI: 10.1186/2047-217X-2-10).
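When comparing the output of several assembler runs, one widely used summary statistic is the N50: the contig length at which contigs of that length or longer contain at least half of the assembled bases. The sketch below is illustrative and the function name `n50` is an assumption; note that N50 measures contiguity only and says nothing about whether the assembly is correct.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (0 for an empty input)."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest contig downwards until half of all bases are covered.
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0
```

For contigs of length 2, 3, 4, 5 and 6 the total is 20, so the two longest contigs (6 + 5 = 11) already cover half of it and the N50 is 5.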
Assemblers which have so far been used in projects associated with plant genomes include commonly used short read assemblers such as ABySS http://www.bcgsc.ca/platform/bioinfo/software/abyss (Simpson et al., 2009), DISCOVAR (de novo) https://software.broadinstitute.org/software/discovar/blog/ (Weisenfeld et al., 2014) and Velvet https://www.ebi.ac.uk/~zerbino/velvet/ (Zerbino and Birney, 2008). Long reads have been assembled using assemblers such as Canu, SMARTdenovo https://github.com/ruanjue/smartdenovo and Falcon https://github.com/PacificBiosciences/FALCON as well as hybrid assemblers such as SPAdes http://bioinf.spbau.ru/spades (Bankevich et al., 2012). Additionally, a modified version of SOAPdenovo was used to assemble the genome of a wild tomato (Bolger et al., 2014a). Even commercial software, such as the CLC assembly cell or NRGene, can sometimes be helpful for plant genome assemblies (Bauer et al., 2017; International Barley Genome Sequencing et al., 2012). The wide variety of assembler software used reflects the diversity encountered when assembling plant genomes. This is aided by assembly strategies relying on a very large set of genetic markers guiding the assembly process (Hirsch et al., 2016) and experimental techniques which provide large scale proximity information (Kalhor et al., 2011; Mascher et al., 2017). As assemblers and sequencing technologies improve further, we can expect a greater number of high quality genome assemblies from large and complex plant genomes. These are expected to pose new challenges for downstream bioinformatic processing in regard to management, annotation, comparative analyses and visualization.

Genome annotation
Assembling a high quality genome is indeed an arduous task, but is still only the first step to providing meaningful biological data. Gigabases of nucleotide data without any form of interpretation are useful only in very niche scientific pursuits. To fully realize the value of a genome assembly, it must undergo a process known as annotation. This process can be further divided into structural annotation, whereby the structures of genomic features are delineated, and functional annotation, whereby functions are ascribed to these structures.
The main structures of interest in genomes are typically genes, which provide the core instructions on which all organisms depend. Typical gene finding tools such as AUGUSTUS http://augustus.gobics.de/ (Stanke and Waack, 2003) and/or the plant adapted MAKER-P http://www.yandell-lab.org/software/maker-p.html (Campbell et al., 2014) will annotate a genome with or (sometimes) without external evidence. Despite recent efforts to make these tools more automated and user friendly, as with BRAKER1 http://exon.gatech.edu/braker1.html (Hoff et al., 2016), obtaining optimal results still requires a certain level of expertise. Incorporating extrinsic RNAseq evidence generally yields superior results, but care needs to be taken not to overlook real genes (false negatives) or mispredict genes (false positives).
Indeed, due to the ever increasing demand for gene finding, GCBN (the German Crop BioGreenformatics Network) at HMGU is developing a fully automated pipeline for plant gene finding.
Another annotation strategy exploits the fact that closely related plant species often retain gene order, i.e. synteny, which can be leveraged for gene finding and prediction purposes. This strategy relies on structured databases or tools such as those offered by CoGe (https://genomevolution.org/coge/) (Tang et al., 2015). For example, CoGe offers "SynMap" to compare syntenic regions and "SynFind" to find homologs. Another particularly successful example is the GenomeZipper, which was developed originally for barley (Mayer et al., 2009) and has not only been used for other Triticeae such as rye (Martis et al., 2013) but has even been adopted by other groups for more exotic species such as the mulberry tree (He et al., 2013).
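The notion of retained gene order can be illustrated with a deliberately simplified sketch. It is not the algorithm used by SynMap, SynFind or the GenomeZipper: it assumes one chromosome per species, a precomputed one-to-one ortholog table, and it only reports runs of genes that are strictly adjacent and in the same orientation in both species. All names (`collinear_blocks`, `min_genes`, the gene identifiers) are hypothetical; real tools tolerate gaps, inversions and many-to-many relationships.

```python
def collinear_blocks(order_a, order_b, orthologs, min_genes=3):
    """Find runs of genes that are adjacent and in the same order in both genomes.

    order_a, order_b: gene identifiers in chromosomal order for species A and B.
    orthologs: dict mapping a gene in A to its ortholog in B.
    """
    pos_b = {gene: i for i, gene in enumerate(order_b)}
    blocks, current = [], []
    prev_b = None
    for gene_a in order_a:
        gene_b = orthologs.get(gene_a)
        idx_b = pos_b.get(gene_b)
        if idx_b is not None and prev_b is not None and idx_b == prev_b + 1:
            # The ortholog directly follows the previous one: extend the run.
            current.append((gene_a, gene_b))
        else:
            # The run is broken: keep it if long enough, then start a new one.
            if len(current) >= min_genes:
                blocks.append(current)
            current = [(gene_a, gene_b)] if idx_b is not None else []
        prev_b = idx_b
    if len(current) >= min_genes:
        blocks.append(current)
    return blocks
```

Genes of the newly sequenced species that fall inside such a block but lack an annotation are natural candidates for knowledge transfer from the reference species.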
Another major task in genome structural annotation involves the identification of transposons and other repetitive elements. This is particularly relevant for plants as many (large) genomes are extensively composed of transposable elements and/or their relics. These have been calculated at 3.7 Gbp (81%) of mainly retrotransposons in the most recent 4.6 Gbp assembly of barley. In order to address this challenge, one can detect some of them de novo using e.g. the GCBN-IPK developed K-masker (Schmutzer et al., 2014) or the REPET package https://urgi.versailles.inra.fr/Tools/REPET (Flutre et al., 2011), but this needs to be complemented by tapping into existing plant data projects. For that purpose GCBN-HMGU collects, curates and classifies plant transposable elements in a partially automated large scale approach to complement the more animal focused REPBASE database http://www.girinst.org/repbase/ (Bao et al., 2015). As GCBN is continuously annotating new genomes, more and more data is fed to this database. This now leaves GCBN-HMGU as the last plant specific provider, as the TIGR plant repeat database (Ouyang and Buell, 2004) was discontinued in 2017 due to lack of funding. Transposable elements, which were once dismissed as merely invasive items, are experiencing a resurgence in interest as evidence accumulates to show that plants sometimes domesticate transposons or transposon promoters (Bolger et al., 2014a). Such domestication has been shown to cause divergence in the regulation of flowering time (Lutz et al., 2015), and transposable elements also play an important role in genome and epigenome modification by providing a means of variability (Maumus and Quesneville, 2014).
Once the structure of the genes has been identified in a genome, it is then necessary to ascribe function to these genes. Whilst it may appear that the function of many genes has already been discerned, one has to keep in mind that many plant genes currently have no function that can be attributed, simply due to the lack of knowledge. This holds true even for the highly researched model plant Arabidopsis thaliana (Bolger et al., 2017). A first approach for functional annotation can be to ascribe function based on sequence similarity. This can be as straightforward as performing a BLAST search against a similar species. In the case of Triticeae, one could use the GCBN-IPK Blast server http://webblast.ipk-gatersleben.de/, a well annotated resource for barley and rye which integrates many different datasets, such as exome capture specific sets, which are commonly not accessible.
Generally, a comprehensive annotation pipeline will integrate similarity searches and domain architecture analysis along with other available data, as discussed in detail in Bolger et al. (2017). GCBN-FZJ currently provides the Mercator annotation pipeline http://www.plabipd.de/portal/mercator-sequence-annotation, which was developed specifically for plants (Lohse et al., 2014). This pipeline is currently being reworked to enhance performance both in terms of speed and accuracy and incorporates sensitive detection techniques using a manually curated knowledge base (http://www.plabipd.de/portal/mercator-ii-alpha-version-). This online pipeline additionally classifies genes according to function and provides a simple, easy to interpret human readable annotation. Thus GCBN is complementing automated resources for gene functional annotation such as BLAST2GO (Conesa and Gotz, 2008), which specifically uses GO terms and is a generalist tool, KAAS (Moriya et al., 2007), which focuses on KEGG pathways, and the plant focused TRAPID (Van Bel et al., 2013). Here, BLAST2GO performs well for plants; however, to unlock its full potential, a license is required for faster analysis. Differences and particular advantages of the different tools have recently been reviewed (Bolger et al., 2017).
For specialized annotation needs, there are many databases dedicated to particular protein families or functions such as the ARAMEMNON database http://aramemnon.uni-koeln.de/ (Schwacke et al., 2003), an extensive data resource which focuses on data pertaining to plant membrane proteins. This multi-species database provides a comprehensive resource including sequences, topological predictions and subcellular localization predictions. The database also maintains and frequently updates functional descriptions of proteins based on publications.

Genome re-sequencing
Scientists or breeders often work on species which already have their genomes sequenced. Whilst this resource is sufficient for many endeavors, there are many situations which warrant re-sequencing of a species. Assemblies which are created from re-sequencing a plant are typically performed using the existing assembly as a guide or reference. This form of reference-based assembly is considerably easier than a de novo genome assembly but can still present numerous difficulties.
The challenges presented by plant genomes, such as large size, heterozygosity and ploidy levels, remain relevant for reference-based assemblies, albeit to a lesser extent than for de novo assembly. Ploidy however remains the greatest challenge, since it results in multiple copies of the same genes which may need to be distinguished during data analysis. In the case of allopolyploid species, e.g. rape seed, this is of particular importance as the genes at the corresponding loci may have functionally diverged. Distinguishing between the genes at corresponding loci in autopolyploids is usually of lower importance given that the genome duplication event resulted in (nearly) identical copies of the genes, and thus it is seldom the case that different alleles perform very diverged functions. The typical alignment and variant calling pipelines depend on (i) mapping a read to the correct genomic region, (ii) determining whether there is a variant present in this region and (iii) in the case of autopolyploid species such as potato, being able to call beyond simple two-allelic combinations when considering single nucleotide polymorphisms (SNPs). In the case of allopolyploid species, the genes from the corresponding loci are sufficiently divergent to require analysis as separate genes. It should also be noted that plant genomes do not typically have the high number of validated SNPs which are available to scientists working on the human genome.
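Calling beyond two-allelic combinations in an autopolyploid amounts to estimating allele dosage, i.e. how many of the chromosome copies carry the alternative allele. The following is a minimal sketch under stated assumptions, not the model of any particular variant caller: reads are treated as independent draws, a single symmetric sequencing error rate is assumed, and the names `call_dosage` and `error_rate` are hypothetical. For a tetraploid such as potato the possible dosages are 0 to 4.

```python
import math


def call_dosage(ref_reads: int, alt_reads: int, ploidy: int = 4,
                error_rate: float = 0.01) -> int:
    """Return the most likely number of alternative allele copies (0..ploidy).

    Uses a binomial likelihood; the binomial coefficient is the same for every
    dosage and is therefore omitted. error_rate must lie strictly between 0 and 1.
    """
    best_dosage, best_loglik = 0, -math.inf
    for dosage in range(ploidy + 1):
        true_fraction = dosage / ploidy
        # Probability that a read shows the alternative allele, allowing for errors.
        p_alt = true_fraction * (1 - error_rate) + (1 - true_fraction) * error_rate
        loglik = alt_reads * math.log(p_alt) + ref_reads * math.log(1 - p_alt)
        if loglik > best_loglik:
            best_dosage, best_loglik = dosage, loglik
    return best_dosage
```

With 30 reference and 10 alternative reads the alternative fraction is 0.25, which is best explained by a single alternative copy among four (simplex); a diploid caller would only be able to report this site as heterozygous.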
The use of multiple pipelines to overcome these problems has been evaluated in a study in which the consensus results from three different variant callers on rape seed were compiled. In another preliminary study, eight variant calling tools were compared across an even larger collection of pipelines to ascertain which performs best on the maize crop (Muraya et al., 2015). Training courses are available on this topic with the course material available online (http://www.plabipd.de/portal/workshop). Furthermore, GCBN-FZJ is currently comparing complete pipelines, commencing with adapter trimming using Trimmomatic (Bolger et al., 2014b), through alignment using bowtie http://bowtie-bio.sourceforge.net/ and BWA http://bio-bwa.sourceforge.net/, followed by SNP calling programs. This data is then compared to data stemming from independent sources.
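One simple way to compile a consensus from several variant callers is a majority vote over the reported sites. The sketch below is an assumption about how such a consensus could be formed, not the procedure used in the studies cited above; variants are represented as hypothetical `(chromosome, position, ref, alt)` tuples and `min_support` is the number of callers that must agree.

```python
from collections import Counter


def consensus_variants(call_sets, min_support=2):
    """Return variants reported by at least min_support of the given callers.

    call_sets: one iterable of (chrom, pos, ref, alt) tuples per caller.
    """
    support = Counter()
    for calls in call_sets:
        # A caller counts once per variant, even if it lists the variant twice.
        support.update(set(calls))
    return sorted(v for v, n in support.items() if n >= min_support)
```

Raising `min_support` to the number of callers yields the strict intersection, which trades sensitivity for a lower false positive rate.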

Phenotypes
A major goal behind plant genomics is to understand and predict the effect that changes to the genome have on phenotypes. While plant selection and crossing to manipulate phenotypes was practiced well before the days of Mendel, revealing the underlying genetic code has provided scientists with a new toolbox to further refine and expedite this process. Given the recent surge of genomic data, there is currently concern among the scientific community that this genomic revolution is outpacing the availability of phenotyping data, especially if the latter is not shared in a useful manner (Zamir, 2013).
Genomic data has a major role to play in crop genetic improvements and breeding programs. However, considerable gain can only be achieved by tightly coupling genomic discovery to plant phenomics (Cobb et al., 2013). While many applications for high-throughput and minimally-invasive phenotyping methods are being developed for both controlled environment and field experiments (Fahlgren et al., 2015; Fiorani and Schurr, 2013), data analysis remains a challenge. This is due to many factors, including differences in plant growth environment as well as issues of data storage and the common ontologies and standards which had been lacking. These factors combine to limit the potential of many tools due to the low interoperability of the data and tools. To overcome these shortcomings, the minimal information standard on plant phenotype data (MIAPPE, Minimum Information About a Plant Phenotyping Experiment) was proposed (Krajewski et al., 2015) and a first implementation was developed (Cwiek-Kupczynska et al., 2016) using ISA-Tab (Investigation/Study/Assay tab-delimited), a framework used to collect and communicate complex metadata. The MIAPPE checklist consists of attributes that can be classified within the following sections: general metadata; timing and location of experiments; biosource; environment (aerial, soil); treatments; experimental design; sample collection, processing, and management; and observed variables. When appropriate, publicly available ontologies are indicated for each section as recommended terminology.
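A checklist of this kind lends itself to a simple automated completeness check before a dataset is submitted. The sketch below is illustrative only: the section keys mirror the checklist sections listed above but are hypothetical identifiers, not official MIAPPE or ISA-Tab field names, and the function `missing_sections` is an assumption.

```python
# Hypothetical keys, one per checklist section named in the text above.
MIAPPE_SECTIONS = (
    "general_metadata",
    "timing_and_location",
    "biosource",
    "environment",
    "treatments",
    "experimental_design",
    "sample_collection_processing_management",
    "observed_variables",
)


def missing_sections(metadata: dict) -> list:
    """Return the checklist sections that are absent or empty in the metadata."""
    return [section for section in MIAPPE_SECTIONS if not metadata.get(section)]
```

An experiment description for which this returns an empty list covers every section at least nominally; whether the content of each section is adequate still needs curation.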
Plant phenotyping is the quantitative appraisal of traits from a given genotype in a given environment and experiment, which range from scalar (plant height) and multi-value (chemical or transcriptional) to image-based (pictures) and include both directly measured attributes and those derived from analysis, e.g. leaf area from shoot images. These heterogeneous data present problems not only for analysis, but additionally for long-term access in a useful manner once the results have been published, as standards are only emerging slowly (Cwiek-Kupczynska et al., 2016; Krajewski et al., 2015). To tackle this issue, GCBN-IPK has developed the Plant Genomics and Phenomics Research Data Repository (PGP) https://edal.ipk-gatersleben.de/repos/pgp/ (Arend et al., 2016a) as an infrastructure to comprehensively publish plant research data. This covers cross-domain datasets which are not being published in hitherto developed central repositories for reasons of data volume and/or data domain. This includes data such as plant phenotyping and microscopy images, incomplete genomic data, genotyping data, visualizations of morphological plant models, mass spectrometry data as well as software code and related documents. PGP is based on the e!DAL data publication infrastructure (Arend et al., 2014). Using this infrastructure, a reference experiment comprising multiple data domains is described using ISA-Tab and published in PGP as a part of a research article (Arend et al., 2016b; Junker et al., 2014). All semantic and technical documentation, measured parameters, protocols and references to ontologies are manually described using the ISA-Tab format. The dataset is published as DOI:10.5447/IPK/2016/7. All raw files of such ISA-Tab formatted data publications are stored in the PGP repository. PGP has been used to publish 115 DOIs which refer to more than 156,000 files.
The PGP repository is accepted as a data repository for the Nature Publishing Group and is registered in re3data.org, OpenAIRE and DataCite, three of the main meta repositories for research data.

Data analysis
Data analysis is often the most overlooked task during the planning of experiments, but has the potential to provide the highest returns on effort investments. It is indeed true that the quality of data gathered will massively influence the quality of the outcome, but it is equally true that even the best quality data is unlikely to surrender insights without appropriate data analysis.
Expression data for plants is ubiquitous in almost all public resources, ranging from microarray to RNASeq data. Indeed the plant community is well served with resources which have collated this data, such as GENEVESTIGATOR https://genevestigator.com/ (Zimmermann et al., 2004, 2008), which is one of the most highly cited gene expression resources in plant biology, albeit at a monetary cost for full use. Other resources which are offered for free include BAR http://bar.utoronto.ca/ (Winter et al., 2007), which mostly focuses on model organisms, and the RNASeqExpressionBrowser (http://pgsb.helmholtz-muenchen.de/plant/RNASeqExpressionBrowser/index.jsp) offered by GCBN (Nussbaumer et al., 2014). The latter is an open source web interface featuring expression data for barley and bread wheat and is used mainly by wet lab scientists to facilitate data interpretation for their genes of interest. These resources allow users to query for potential candidate genes and provide immediate answers to questions such as: Is my favorite gene up-regulated under drought conditions? Or does this gene react to pathogens?
In cases where users generate their own expression data, analysis and interpretation of this data requires a more hands-on approach. For these analyses, GCBN hosts the RobiNA http://mapman.gabipd.org/robin (Lohse et al., 2010, 2012) and MapMan downloadable tools http://mapman.gabipd.org/ (Jaiswal and Usadel, 2016; Urbanczyk-Wochniak et al., 2006), which provide complete solutions for expression analysis. RobiNA allows users to analyze microarray and RNAseq data by providing a graphical user interface to model the experimental design and thus perform appropriate analysis. MapMan is a platform-independent application which allows the analysis and biological interpretation of plant omics data by mapping genes, metabolites and proteins onto metabolic, regulatory and developmental pathways. Unlike similar available tools for the GO ontology, MapMan attempts to (i) minimize the redundancy in overviews, i.e. in large diagrams, genes are usually shown only once and (ii) to accurately access and annotate the pathways based on the same framework that Mercator uses.
On the next level, these expression compendia can be used to make new inferences about gene-gene interaction. Indeed, the collation of expression datasets to derive co-expression networks, thus allowing simple gene based queries to find genes behaving similarly, is a tried and tested service used by the plant community (Usadel et al., 2009). Moreover, GCBN is maintaining CSBDB.de, which is the first published plant co-expression database (Steinhauser et al., 2004), and also applies novel algorithms developed at the Max Planck Institute of Molecular Plant Physiology to glean the best possible information out of expression data (Mutwil et al., 2010). Further data analysis tools offered by GCBN include a tool to compute measures of association and functional inference in the form of Corto http://www.usadellab.org/cms/index.php?page=corto. This approach of 'guilt by association' has been further developed within GCBN and used to successfully predict seed attributes in the model species Arabidopsis thaliana (Vasilevski et al., 2012; Voiniciuc et al., 2015a, 2015b). Currently, efforts are underway by GCBN to apply these techniques to rape seed and other Brassicaceae, leveraging the model Arabidopsis. From the whole plant transcriptomics perspective, users are sometimes confronted with the problem of identifying the transcriptional response of a mutant isolated from an informed candidate gene screen looking e.g. for drought specific mutants. However, it rarely happens that the transcriptomic response of a mutant exactly matches the intended profile. Often, one observes a mixed response which consists of the primary response to e.g. drought and possibly a secondary one to other environmental factors. Sometimes one might even encounter an unexpected transcriptomic response.
For such cases, the research community working on human data has developed novel algorithms to compare data sets to large gene expression compendia in what is called "physiospace" (Lenz et al., 2013). An individual data set is compared to a large compendium of data after transformation; GCBN has adapted this approach for plant data to unravel transcriptomic responses.
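The 'guilt by association' queries described in this section can be sketched as a ranking of genes by the correlation of their expression profiles with that of a bait gene. This is a minimal illustration and not the implementation of CSBDB, Corto or any other tool named here: it assumes a small in-memory dictionary of expression profiles measured over the same samples, uses the Pearson correlation as the measure of association, and the names `coexpressed_genes` and `min_r` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long profiles (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def coexpressed_genes(expression, bait, min_r=0.8):
    """Return (gene, r) pairs with r >= min_r relative to the bait, best first.

    expression: dict mapping gene identifier to a list of expression values.
    """
    bait_profile = expression[bait]
    hits = [(gene, pearson(bait_profile, profile))
            for gene, profile in expression.items() if gene != bait]
    return sorted((h for h in hits if h[1] >= min_r), key=lambda h: -h[1])
```

Genes of unknown function that rank highly against a well characterized bait become candidates for involvement in the same process, which then needs experimental confirmation.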
In order to facilitate plant genome comparison, GCBN-HMGU has developed a tool called CrowsNest which leverages conserved gene order between species. CrowsNest http://pgsb.helmholtz-muenchen.de/plant/crowsNest/index.jsp allows the user to explore syntenic relationships between species at different levels of detail, ranging from whole genome comparisons over chromosomes down to single genes. These connections are especially valuable for knowledge transfer from well characterized genes of reference species to as yet uncharacterized genes of newly sequenced species.
The aforementioned generic plant resources are complemented by specialized databases and services. GCBN-FZJ has developed a specialized database for large plant enzyme gene families, and another example is DroughtDB http://pgsb.helmholtz-muenchen.de/droughtdb/ (Alter et al., 2015) at GCBN-HMGU. DroughtDB contains a manually curated set of drought stress genes from model species (Arabidopsis and rice) that have an experimentally verified function in drought tolerance. It interconnects them between nine species, including maize and barley, via computed orthology. This resource allows breeders to easily check candidate genes or explore candidate genetic regions for the occurrence of these curated genes.
Finally, it is of utmost importance, especially for crops, to be able to link genotypes to phenotypes. Whilst GCBN is not involved in the development of the underlying statistics, it employs state of the art tools such as fastLMM https://www.microsoft.com/en-us/research/project/fastlmm/ (Lippert et al., 2011) and helps in interpreting such datasets by defining genes underlying QTL (Millet et al., 2016).
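Defining the genes underlying a QTL is, at its simplest, an interval overlap query against the structural annotation. The sketch below assumes the annotation is available as a list of hypothetical `(name, chromosome, start, end)` tuples with inclusive coordinates; the function name `genes_in_interval` is an assumption, and it is not part of fastLMM or any other tool mentioned here.

```python
def genes_in_interval(genes, chrom, start, end):
    """Return names of genes overlapping the interval chrom:start-end (inclusive).

    genes: iterable of (name, chromosome, gene_start, gene_end) tuples.
    """
    return [name for name, g_chrom, g_start, g_end in genes
            if g_chrom == chrom and g_start <= end and g_end >= start]
```

The resulting candidate list can then be prioritized using the functional annotation and expression resources described in the preceding sections.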

Discussion
As high throughput sequencing and phenotyping technologies mature, it might be expected that the development of new bioinformatics services and pipelines becomes redundant once optimal solutions have been achieved. The reality is however the opposite, as the increasing use of these technologies warrants new techniques and algorithms to deal with this ever expanding mass of data. Even from a simple storage point of view, providing sustainable fast access to large data requires considerable expertise. Integration of data, especially heterogeneous phenotyping data, into existing or new data structures frequently needs to be performed before any data analysis can be carried out. This step additionally depends on the definition of standards, without which phenotyping data is impossible to compare and integrate. As the distinction between wet-lab scientist and data scientist becomes increasingly blurred, given that all aspects of research require some degree of data handling and analysis, training in fundamental bioinformatics techniques is imperative.
Within Germany, GCBN is uniquely situated to undertake these services. At the European level, this ties into the ELIXIR project, where the GCBN plant side will likely strengthen ELIXIR and complement its activities. Internationally, the US CyVerse infrastructure project, which started out as iPLANT (Merchant et al., 2016), is a similar endeavor. Certain aspects of the setup are however different, as CyVerse mainly re-uses tools developed by the community, such as Trimmomatic from GCBN, in the CyVerse pipelines and allows them to be run on large infrastructures, whereas GCBN actively maintains and develops its own services for the needs of wet lab biologists. Therefore, these two initiatives complement each other well.

Conclusions
The impact which the genomics revolution has made on plant science is undeniable. Within a decade of the first published plant genome, sequencing has become a staple of most plant laboratories. As these technologies further improve and their cost decreases, the need for bioinformatics analysis and tools will clearly follow this trend. GCBN is already providing bioinformatics solutions not only to genomics data, but also to phenotyping data. It is after all the overarching goal of many scientists to use genomics data to predict and manipulate plant phenotypes. This goal requires extensive bioinformatics analysis on both genomics as well as the phenotyping data, a target GCBN is working on achieving.