Access to RNA-sequencing data from 1,173 plant species: The 1000 Plant transcriptomes initiative (1KP)

Abstract Background The 1000 Plant transcriptomes initiative (1KP) explored genetic diversity by sequencing RNA from 1,342 samples representing 1,173 species of green plants (Viridiplantae). Findings This data release accompanies the initiative's final/capstone publication on a set of 3 analyses inferring species trees, whole genome duplications, and gene family expansions. These and previous analyses are based on de novo transcriptome assemblies and related gene predictions. Here, we assess their data and assembly qualities and explain how we detected potential contaminations. Conclusions These data will be useful to plant and/or evolutionary scientists with interests in particular gene families, either across the green plant tree of life or in more focused lineages.

angiosperms, and gymnosperms. Importantly, our selection criteria eschewed the model organisms and crop species where other plant sequencing efforts have historically been concentrated.

Major papers describing the project have been published elsewhere [1,2]. This Data Note describes the sequence data set and provides additional details on the sample and sequence processing as well as quality assessments of these data.

were provided by a global network of collaborators who obtained materials from a variety of sources, including field collection of wild plants, greenhouses, botanical gardens, laboratory specimens, and algal culture collections. To ensure an abundance of expressed genes, we preferred live growing cells, e.g. young leaves, flowers, or shoots, although many samples were also from roots, or other tissues. Because of the sample diversity, we did not attempt to define specific standards on growth conditions, time of collection, or age of tissue. For more details, see the supplemental methods in the capstone paper [1].

RNA extraction

Given the biochemical diversity of these samples, no one RNA extraction protocol was appropriate for all samples. Most samples were extracted using commonly known protocols or using commercial kits. For complete details of the many specific protocols used, please see Appendix S1 of Johnson et al. [3] and Jordon-Thaden et al. [4]. Depending on the sample, RNA extractions might have been done by the sample provider, a collaborator near the provider, or the sequencing lab (BGI-Shenzhen

Purity and contamination

especially the fact that axenic cultures are not a viable option in most instances, ensures that there will always be some contamination of the plant tissue by other environmental nucleic acids.
These can reasonably be expected to include bacterial, fungal, and insect species that live in and on the plant tissues, and more rarely, from contact with larger species such as frogs, mice, birds and humans.

For most analyses, these minor contaminants are not expected to matter, as only the most abundant of such contaminants will be present in sufficient quantities to assemble. In many cases, they are also sufficiently diverged from the intended species that they can be easily recognised as non-plant genes. Unfortunately, this is not always the case. Some analyses are further protected by looking at the whole of the available transcriptome, whereby the many genes from the target species will overpower a few contaminants. Single gene family analyses do not have this advantage and must rely on other methods to reject non-plant genes.

Another possibility is significant contamination during sample processing when plant RNA is transferred between adjacent samples, or when whole samples are accidentally mislabelled.

We tried to guard against these problems by several analyses, one of which compared the assembled sequences by BLASTn to a reference set of nuclear 18S rRNA sequences from the SILVA SSU rRNA database (http://www.arb-silva.de) [13]. The BLASTn alignment to an assembly with the lowest expectation-value is taken to indicate the assembly has a similar taxonomic origin as the reference sequence. However, alignments of less than 300 bp or expectation-values above 1E-9 often align to several distantly related species and were ignored.

For most samples we found an 18S sequence most similar to a SILVA sequence from the same taxonomic family as the expected sample species. This is not true for all our samples, and may indicate a failure to assemble the 18S sequence, limitations in the taxonomic identification from the BLASTn results, or mislabelling of the sample.
In a few cases, additional (and possibly contaminant) 18S sequences were found. Because the 18S rRNA sequence is highly expressed, we expect that this method is likely to be sensitive to low levels of contamination. In a few cases, the taxonomic irregularities were judged sufficiently severe that samples were excluded from various analyses.
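The filtering rule described above (ignore alignments shorter than 300 bp or with expectation-values above 1E-9, then take the lowest expectation-value per scaffold) can be illustrated with a minimal Python sketch. It assumes the BLASTn results were written in the standard 12-column tabular format (`-outfmt 6`); the function name `best_18s_hits` and the constant names are hypothetical and are not taken from the 1KP pipeline.

```python
import csv

# Thresholds quoted in the text; the names are illustrative only.
MIN_ALIGN_LEN = 300   # alignments shorter than this (bp) are ignored
MAX_EVALUE = 1e-9     # alignments with a larger expectation-value are ignored


def best_18s_hits(blast_lines):
    """Return {scaffold: (reference, evalue, length)} for the best usable hit.

    `blast_lines` is any iterable of tab-separated BLAST tabular lines with
    the default columns: qseqid, sseqid, pident, length, mismatch, gapopen,
    qstart, qend, sstart, send, evalue, bitscore.
    """
    best = {}
    for row in csv.reader(blast_lines, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        scaffold, reference = row[0], row[1]
        length = int(row[3])
        evalue = float(row[10])
        # Discard short or weak alignments, as described in the text.
        if length < MIN_ALIGN_LEN or evalue > MAX_EVALUE:
            continue
        # Keep the hit with the lowest expectation-value for each scaffold.
        if scaffold not in best or evalue < best[scaffold][1]:
            best[scaffold] = (reference, evalue, length)
    return best
```

The taxonomic family of each retained reference sequence would then be compared with the family expected for the sample; that lookup depends on the SILVA taxonomy files and is not shown here.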

The accompanying data include two accessory files containing details of this SILVA-based SSU validation for each sample. The first lists whether the sample is overall judged to be validated as containing the expected taxon, and whether it had alignments to any other plant sequences (described as "worrisome contamination"). The second file, more detailed, lists each scaffold identified as being 18S-like sequence, and which reference sequence it matched against.

Pairwise Cross-contamination of Assemblies

Cross-contamination between the datasets was identified by using a genome-scale sequence search pipeline, adapted from previous studies [14][15][16]. Briefly, each pair of assemblies (nucleotide) was compared and a threshold identity level established, above which sequences are likely to be contamination between the pair. While best for identifying technical contamination between libraries (e.g. due to mixing of RNA samples), this technique could also detect other biological contamination events (e.g. contamination of pairs of libraries with common commensal organisms). An additional search step, using the entire 1KP sequence library, identified the probable evolutionary origin of each sequence.

The pairwise comparison used LAST v. 963 [17] with the --cR01 option, and the respective matches were grouped and ordered by similarity. To avoid artifactually excluding sequences between closely related species, which may have very high degrees of similarity [13], pairs of libraries from the same family, along with pairs of libraries separated by two or fewer branches in the consensus 1KP multigene phylogeny, were excluded from the searches [2].
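The pair-selection rule above can be sketched in Python as follows. This is only an illustration under stated assumptions: the phylogeny is represented as a hypothetical child-to-parent dictionary, "branches" is interpreted as the number of edges on the path between two tips, and the names `branches_between` and `pairs_to_compare` are invented for this example rather than taken from the published pipeline.

```python
from itertools import combinations


def branches_between(a, b, parent):
    """Count edges on the path between tips a and b.

    `parent` maps each node to its parent node; the root has no entry.
    """
    # Record the distance from tip a to each of its ancestors.
    dist_from_a = {}
    node, steps = a, 0
    while node is not None:
        dist_from_a[node] = steps
        node = parent.get(node)
        steps += 1
    # Walk up from tip b until a shared ancestor is reached.
    node, steps = b, 0
    while node not in dist_from_a:
        node = parent.get(node)
        steps += 1
        if node is None:
            raise ValueError("tips are not in the same tree")
    return steps + dist_from_a[node]


def pairs_to_compare(libraries, family, parent, max_branches=2):
    """List library pairs that are distant enough to be searched."""
    kept = []
    for a, b in combinations(sorted(libraries), 2):
        if family[a] == family[b]:
            continue  # same family: excluded from the search
        if branches_between(a, b, parent) <= max_branches:
            continue  # two or fewer branches apart: excluded
        kept.append((a, b))
    return kept
```

Under this interpretation, two sister tips are exactly two branches apart and are therefore skipped, in line with the intent of not flagging genuinely similar sequences from close relatives.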

The expected distribution of the matched sequence identities has a maximum at the pairwise identity reflecting the evolutionary distance between the two species [15,16]. In contrast, a cross-contaminated pair should contain many sequences of near 100% similarity, and the similarity value which has the first minimum number of sequences below this level (i.e. the first inflexion point in a curve plotting the total number of sequences of each percentage similarity value) can be used as a threshold for discriminating contaminating sequences [15,16]. The code is available at https://github.com/Plant-and-diatom-genomics-IBENS-Paris/Decontamination-pipeline.
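As a rough illustration of the thresholding idea (not a reproduction of the published pipeline, which should be consulted at the repository above), the following Python sketch bins pairwise identities by whole percentage points and scans downward from 100% for the first local minimum. The whole-percent binning, the tie-handling on flat stretches, and the function name `contamination_threshold` are assumptions made for this example.

```python
from collections import Counter


def contamination_threshold(identities, top=100):
    """Return the first local minimum below `top` in the identity histogram.

    `identities` is an iterable of percent identities (0-100) for the
    matched sequences of one library pair. Returns None if no minimum
    is found (for example, when the input is empty).
    """
    # Histogram of matches per whole-percent identity bin (assumed binning).
    counts = Counter(int(x) for x in identities)
    previous = counts.get(top, 0)
    for pct in range(top - 1, 0, -1):
        current = counts.get(pct, 0)
        below = counts.get(pct - 1, 0)
        # A bin no higher than its upper neighbour and strictly lower than
        # its lower neighbour marks the end of the near-100% peak.
        if current <= previous and current < below:
            return pct
        previous = current
    return None
```

Matches with identities above the returned value would then be treated as candidate contaminants for that pair, subject to the follow-up search against the other assemblies.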

The output of this analysis is pairs of apparent orthologs whose sequence similarities are higher than the cut-off in one or both libraries, i.e. potential contamination. To discriminate donors and recipients in each contaminant pair, each of these potential contaminants was searched against all the non-contaminant assemblies by BLASTn, using the option -max_target_seqs 3 [18]. Queries with at least one of the three best alignments against a sequence from the same family, or from a taxon separated by fewer than two branches within the 1KP tree [2], were excluded from the list of potential contaminants, whereas sequences that yielded best hits exclusively against more distantly related taxa were verified as potential contaminants. Clean and contaminant FASTA sequence files for each library are available in the accompanying data.

We assessed the quality of each assembled scaffold using Transrate [20], which detects several classes of common assembly errors and assigns a quality score to each scaffold. Users of the data may choose to omit those portions of the assembly judged as low-quality when doing their own analyses.

As with all RNA-seq data, some genes are more highly expressed than others. While the CEGMA and BUSCO gene sets are intended to demonstrate the completeness of the transcriptomes, they are sensitive to the expression of these genes. Not all these genes will be expressed in the sample's tissues at sufficiently high levels to be assembled. A plot of the number of assembled scaffolds vs. the fraction of the three gene sets found in the assembled scaffolds shows an increase in the gene fractions found as the number of assembled scaffolds increases (Fig. 2).

The authors declare that they have no conflicting interests, and that they believe that all the plant tissues were collected in accordance with applicable regulations and laws.