Long-read transcriptome data for improved gene prediction in Lentinula edodes

Lentinula edodes is one of the most popular edible mushrooms in the world and contains useful medicinal components such as lentinan. The whole-genome sequence of L. edodes has been determined with the objective of discovering candidate genes associated with agronomic traits, but experimental verification of gene models with correction of gene prediction errors is lacking. To improve the accuracy of gene prediction, we produced 12.6 Gb of long-read transcriptome data of variable lengths using PacBio single-molecule real-time (SMRT) sequencing and generated 36,946 transcript clusters with an average length of 2.2 kb. Evidence-driven gene prediction on the basis of long- and short-read RNA sequencing data was performed; a total of 16,610 protein-coding genes were predicted with error correction. Of the predicted genes, 42.2% were verified to be covered by full-length transcript clusters. The raw reads have been deposited in the NCBI SRA database under accession number PRJNA396788.


Value of the data
The whole-genome sequence of L. edodes has been determined with the objective of discovering candidate genes associated with agronomic traits [1], but experimental verification of gene models with correction of gene prediction errors is lacking.
PacBio long-read transcriptome data integrated with Illumina short-read RNA-Seq data can enhance the accuracy of gene prediction with error correction and support experimental verification.
Our data will strengthen genome-wide analyses of L. edodes by contributing to the identification of targeted genes associated with a trait, transcriptome profiling, and comparative genomics.

Data
A total of 5,285,247 long reads producing 12.61 Gb of sequence data were generated from three RNA libraries of the monokaryotic B17 strain of L. edodes that were size-selected for lengths of <2 kb, 2-3 kb, and 3-6 kb (Table 1). Those reads were clustered into 36,946 transcripts with a cumulative length of approximately 82.1 Mb and an average length of 2.2 kb (Fig. 1). Based on exon-intron boundary information generated by aligning the PacBio long-read (12.6 Gb) and Illumina short-read (3.36 Gb) [1] RNA-Seq data to the draft genome sequence of L. edodes [1], a total of 16,610 protein-coding genes were predicted with error correction (Tables 2 and 3). Of those genes, 1344 were newly identified. The transcriptome data supported 92.9% of the predicted gene models (Fig. 2). Moreover, 7005 gene models (42.2%) were verified to be covered by full-length transcript clusters. Homology-based searches indicated that 76.2% of the predicted genes had homology with known genes. Functional annotations were tentatively assigned for 38.3% of these genes. GFF files and annotations of gene models for L. edodes are provided in the Supplementary data (Supplementary material 1 and 2).
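The summary figures reported above can be cross-checked with simple arithmetic. The following is a minimal sketch in Python that uses only the numbers given in this section. The variable names are illustrative and are not part of the dataset, and the mean raw read length is a value derived here rather than one stated in the text.

```python
# Values reported in the Data section (variable names are illustrative).
total_long_reads = 5_285_247
total_long_read_bases = 12.61e9      # 12.61 Gb
transcript_clusters = 36_946
cumulative_cluster_length = 82.1e6   # approximately 82.1 Mb
predicted_genes = 16_610
full_length_covered_genes = 7_005

# Mean raw read length in bp (derived here; not stated in the text).
mean_read_length = total_long_read_bases / total_long_reads

# Mean transcript cluster length in kb.
mean_cluster_length_kb = cumulative_cluster_length / transcript_clusters / 1e3

# Percentage of gene models covered by full-length transcript clusters.
full_length_pct = 100 * full_length_covered_genes / predicted_genes

print(f"mean read length: {mean_read_length:.0f} bp")
print(f"mean cluster length: {mean_cluster_length_kb:.1f} kb")
print(f"full-length coverage: {full_length_pct:.1f}%")
```

The last two values agree with the reported average cluster length of 2.2 kb and the reported 42.2% of gene models covered by full-length transcript clusters.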

RNA extraction and PacBio SMRT transcriptome sequencing
Total RNA from the monokaryotic B17 strain of L. edodes, which was cultured in potato dextrose broth liquid medium for 10 days at 25°C, was extracted using an RNA extraction kit (iNtRon Biotech, Seoul, Korea). cDNA was obtained from the RNA and was size-selected into fractions with the following length ranges: 1-2 kb, 2-3 kb, 3-6 kb, and >6 kb. SMRTbell template libraries were created from the obtained cDNAs for sequencing on the PacBio RS II system, as recommended by Pacific Biosciences (Palo Alto, U.S.A.). The templates were sequenced via polymerase binding using the DNA/Polymerase Binding Kit P6 v2 primers.
For gene prediction in the genome of L. edodes, AUGUSTUS [2] was used to perform de novo prediction with prior gene models trained using GeneMark-ET [3] and exon-intron boundary information predicted by RNA and protein sequence alignments. To generate transcriptome-based evidence, TopHat [4] and GMAP [5] were used for short- and long-read RNA-Seq alignments, respectively. To generate protein-based evidence, homologous protein sequences were collected from the NCBI non-redundant (NR) database and aligned using Exonerate [6]. Predicted genes were searched against the UniProt and NCBI NR databases using BLASTP [7] with a cut-off E-value of 1×10⁻¹⁰. Protein domains were also searched using InterProScan [8] and then assigned to Gene Ontology (GO) terms.
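To illustrate how an E-value cut-off of this kind can be applied to homology search results, the following is a minimal sketch in Python. It assumes that the BLASTP hits are available in the standard 12-column tabular format (BLAST+ output format 6, in which the E-value is the 11th column), which the text does not specify. The function name best_hits, the inclusive treatment of the cut-off, and the example records are hypothetical and do not come from the dataset.

```python
import csv
import io

EVALUE_CUTOFF = 1e-10  # cut-off stated in the text


def best_hits(tabular_text, cutoff=EVALUE_CUTOFF):
    """Return {query_id: (subject_id, evalue)} for the lowest-E-value hit
    per query among hits with E-value <= cutoff (inclusive by assumption).

    Assumes the default 12-column BLAST tabular layout:
    qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore
    """
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject, evalue = row[0], row[1], float(row[10])
        if evalue > cutoff:
            continue  # hit is weaker than the cut-off
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best


# Made-up records for illustration only (hypothetical IDs and scores).
example = "\n".join([
    "gene1\tprotA\t85.0\t200\t30\t0\t1\t200\t1\t200\t1e-50\t300",
    "gene1\tprotB\t60.0\t180\t72\t0\t1\t180\t5\t184\t1e-12\t90",
    "gene2\tprotC\t35.0\t90\t58\t1\t10\t99\t3\t92\t1e-05\t45",
])

hits = best_hits(example)
print(hits)
```

In this made-up example, gene1 keeps its strongest hit (protA), while gene2 is left without a hit because its only alignment has an E-value above the cut-off.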