One thousand plant transcriptomes and the phylogenomics of green plants

Green plants (Viridiplantae) include around 450,000–500,000 species1,2 of great diversity and have important roles in terrestrial and aquatic ecosystems. Here, as part of the One Thousand Plant Transcriptomes Initiative, we sequenced the vegetative transcriptomes of 1,124 species that span the diversity of plants in a broad sense (Archaeplastida), including green plants (Viridiplantae), glaucophytes (Glaucophyta) and red algae (Rhodophyta). Our analysis provides a robust phylogenomic framework for examining the evolution of green plants. Most inferred species relationships are well supported across multiple species tree and supermatrix analyses, but discordance among plastid and nuclear gene trees at a few important nodes highlights the complexity of plant genome evolution, including polyploidy, periods of rapid speciation, and extinction. Incomplete sorting of ancestral variation, polyploidization and massive expansions of gene families punctuate the evolutionary history of green plants. Notably, we find that large expansions of gene families preceded the origins of green plants, land plants and vascular plants, whereas whole-genome duplications are inferred to have occurred repeatedly throughout the evolution of flowering plants and ferns. The increasing availability of high-quality plant genome sequences and advances in functional genomics are enabling research on genome evolution across the green tree of life.


Statistical parameters
When statistical analyses are reported, confirm that the following items are present in the relevant location (e.g. figure legend, table legend, main text, or Methods section).

The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement
An indication of whether measurements were taken from distinct samples or whether the same sample was measured repeatedly
The statistical test(s) used AND whether they are one- or two-sided. Only common tests should be described solely by name; describe more complex techniques in the Methods section.
A description of all covariates tested
A description of any assumptions or corrections, such as tests of normality and adjustment for multiple comparisons
A full description of the statistics including central tendency (e.g. means) or other basic estimates (e.g. regression coefficient) AND variation (e.g. standard deviation) or associated estimates of uncertainty (e.g. confidence intervals)
For null hypothesis testing, the test statistic (e.g. F, t, r) with confidence intervals, effect sizes, degrees of freedom and P value noted

Software and code
Policy information about availability of computer code

Data collection

Data analysis
Transcripts were assembled using the SOAPdenovo-Trans assembler (version of 2012-04-05). NCBI BLAST, TransRate, CEGMA6 and BUSCO were used to assess assembly quality. Translations were performed using TransPipe and Genewise 2.2.2. Gene and species tree estimates used RAxML v. 8.1.17, FastTree-2 v. 2.1.5 and ExaML v. 3.0.14; ASTRAL-II v. 5.0.3 was used to estimate species trees. Scripts for postprocessing of trees (DiscoVista) are available at https://github.com/smirarab/1kp. Genome duplications were investigated using the DupPipe, PAML and MAPS pipelines, including the GuestTreeGen program within GenPhyloData (https://bitbucket.org/barkerlab/maps). Analysis of gene family expansions included HMMER v3.1b2 and scripts available at https://github.com/GrosseLab/OneKP-gene-family-evo.

For manuscripts utilizing custom algorithms or software that are central to the research but not yet described in published literature, software must be made available to editors/reviewers upon request. We strongly encourage code deposition in a community repository (e.g. GitHub). See the Nature Research guidelines for submitting code & software for further information.
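To make the gene-tree-to-species-tree step of the data analysis more concrete, the following is a minimal sketch of how per-gene RAxML runs might be chained into a single ASTRAL run. It is not the authors' pipeline: the substitution model (`GTRGAMMA`), the random seed, the jar file name and all file names are assumptions chosen for illustration, and the functions only build command lines without running anything.

```python
from pathlib import Path


def raxml_command(alignment, run_name, model="GTRGAMMA", seed=12345):
    """Build a RAxML v8 command line for one gene alignment.

    The model and seed are illustrative assumptions, not values from the study.
    """
    return ["raxmlHPC", "-s", str(alignment), "-n", run_name,
            "-m", model, "-p", str(seed)]


def collect_gene_trees(tree_files, out_path):
    """Concatenate gene trees into one file, one Newick string per line.

    ASTRAL reads its input gene trees in this layout. Empty files are skipped.
    Returns the number of trees written.
    """
    count = 0
    with open(out_path, "w") as out:
        for tree_file in tree_files:
            newick = Path(tree_file).read_text().strip()
            if newick:
                out.write(newick + "\n")
                count += 1
    return count


def astral_command(gene_trees, species_tree, jar="astral.jar"):
    """Build an ASTRAL command line; the jar name is a hypothetical path."""
    return ["java", "-jar", jar, "-i", str(gene_trees), "-o", str(species_tree)]
```

Each command list could be passed to `subprocess.run` once the corresponding tool is installed. Keeping command construction separate from execution makes the steps easy to inspect and log before any long-running job starts.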

April 2018
Data
Policy information about availability of data
All manuscripts must include a data availability statement. This statement should provide the following information, where applicable:

All studies must disclose on these points even when the disclosure is negative.

Study description
Gene and species phylogenies were estimated in order to infer relationships across the green tree of life (Viridiplantae), the timing of genome-scale duplication events and the timing of gene family expansions.

Research sample
RNA was isolated from young vegetative tissue from 1342 samples representing 1147 species across all major subclades of Viridiplantae, glaucophytes (Glaucophyta) and red algae (Rhodophyta), and used to generate RNA-seq reads and assemblies.

Sampling strategy
Samples were collected as available in living collections. Species were chosen for RNA-seq with a priority to maximize taxonomic diversity across Viridiplantae and outgroups.

Data collection
RNA samples were derived from vouchered material in living collections as described in Table 1.

Timing and spatial scale
Samples were collected as available. No attempt was made to control for environmental variation.

Data exclusions
RNA samples exhibiting evidence of contamination were excluded from phylogenetic analyses. Contamination was diagnosed through BLAST comparisons to ribosomal RNA and plastid gene databases.
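As an illustration of how a BLAST-based contamination screen of this kind could work, the sketch below parses BLAST tabular output (`-outfmt 6`), keeps the best-scoring hit per transcript, and flags a sample when too many best hits fall outside the expected lineage. The lineage lookup table, the 10% threshold and the function names are hypothetical; the source does not state the criteria that were actually applied.

```python
def best_hits(blast_lines):
    """Return {query_id: (subject_id, bitscore)} keeping the top-bitscore hit.

    Expects BLAST tabular output (-outfmt 6): 12 tab-separated columns with
    the subject id in column 2 and the bitscore in column 12.
    """
    best = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best


def foreign_fraction(best, subject_lineage, expected_lineage):
    """Fraction of queries whose best hit is outside the expected lineage.

    Subjects missing from the (hypothetical) lookup table count as foreign.
    """
    if not best:
        return 0.0
    foreign = sum(1 for subject, _ in best.values()
                  if subject_lineage.get(subject) != expected_lineage)
    return foreign / len(best)


def is_contaminated(blast_lines, subject_lineage, expected_lineage,
                    max_foreign=0.1):
    """Flag a sample; the 10% default threshold is an assumption."""
    fraction = foreign_fraction(best_hits(blast_lines), subject_lineage,
                                expected_lineage)
    return fraction > max_foreign
```

A sample sequenced from a fern, for example, would be flagged if more than the chosen fraction of its rRNA or plastid transcripts matched moss or algal reference sequences best. In practice the threshold would need tuning against samples of known identity.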