Dataset of full-length transcriptome assembly and annotation of Apocynum venetum using PacBio Sequel II

Apocynum venetum, a member of the Apocynaceae, is widely distributed across salt-barren zones, desert steppes, and alluvial flats of the Mediterranean area and Northwestern China. It has long been used in traditional Chinese medicine because of its anti-inflammatory, antioxidative, antihypertensive, anticancer, and bactericidal effects. However, the lack of genetic information on Apocynum venetum is an obstacle to understanding its stress resistance and medicinal function. This work aimed to generate a full-length transcriptome of Apocynum venetum using Pacific Biosciences (PacBio) Single Molecule Real-Time (SMRT) sequencing technology. A total of 18,524 unigenes were obtained, of which 18,136 were successfully annotated. The raw data were uploaded to the SRA database under BioProject ID PRJNA650225. These data provide a basis for further exploration of the molecular mechanisms underlying stress resistance and medicinal function in Apocynum venetum.


Specifications
Subject: Plant Science
Specific subject area: Full-length Transcriptomics
Type of data

Value of the Data
• The full-length transcriptome data of Apocynum venetum, generated with Single Molecule Real-Time (SMRT) sequencing technology, provide an important reference for the scientific community to understand the molecular mechanisms and physiological functions of Apocynum venetum.
• Full-length transcripts will be useful for gene discovery, characterization and cloning.
• The data will be useful for the genetic improvement of Apocynum venetum or other plants.
• Researchers can use their own bioinformatics algorithms to further process and analyze the original sequence data.

Data Description
Statistics on the transcripts and unigenes of the full-length transcriptome of Apocynum venetum are provided in Table 1. Sequencing generated 47.4 GB (35,351,576 reads) of clean data, which have been deposited in the SRA database (PRJNA650225). A total of 18,524 unigenes were obtained, and 18,136 unigenes were successfully annotated using the NR, GO, Swiss-Prot, KOG and KEGG databases (Table 2). Among them, 18,126 full-length unigenes were annotated through the NR database, and the species with the highest proportion of homologous sequences was Coffea arabica (36.67%) (Fig. 1).
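The annotation coverage implied by these counts can be checked with a few lines of Python. This is only a worked arithmetic sketch: the variable names are illustrative, the counts are those reported above, and reading the 18,136 figure as "annotated in at least one database" is an assumption.

```python
# Counts reported in the Data Description.
total_unigenes = 18_524
annotated_any = 18_136   # assumption: annotated in at least one database
annotated_nr = 18_126    # annotated against NR


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100.0 * part / whole, 2)


annotation_rate = percent(annotated_any, total_unigenes)
nr_rate = percent(annotated_nr, total_unigenes)
unannotated = total_unigenes - annotated_any

print(f"Annotated in any database: {annotation_rate}%")
print(f"Annotated in NR: {nr_rate}%")
print(f"Unannotated unigenes: {unannotated}")
```

By these counts, roughly 98% of the unigenes received an annotation, and nearly all annotated unigenes had an NR hit.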

Sample collection
The seeds of Apocynum venetum were collected from the Xinjiang Uygur Autonomous Region and identified by associate researcher Jiang Li. Thirty-day-old Apocynum venetum plantlets (10 plants) cultured on WPM medium were pooled for RNA extraction.

Total RNA extraction and library construction
Total RNA was isolated from Apocynum venetum samples. The integrity, quality and concentration of the RNA samples were checked by agarose gel electrophoresis and a NanoDrop 2000. The library was constructed after the samples passed quality control. The main steps were as follows:
1) Total RNA was reverse transcribed into cDNA using a Clontech SMARTer™ PCR cDNA Synthesis Kit, which is optimized for preparing high-quality, full-length cDNAs. Because eukaryotic mRNA carries a poly(A) tail at its 3′ end, primer A, which contains oligo(dT), pairs with the poly(A) tail and primes reverse transcription. Primer B was then added to the end of the reverse-transcribed full-length cDNA. The resulting full-length cDNA was amplified by PCR, purified with PB magnetic beads, and quantified with a Qubit 2.0.
2) BluePippin was used to select cDNA fragments longer than 4 kb. The cDNA fragments were amplified by PCR again, and the full-length cDNA was purified with PB magnetic beads.
3) The full-length cDNA was end-repaired and ligated to SMRT dumbbell adapters.
4) Exonuclease was used to digest fragments that had not been ligated to adapters, and the product was purified again with PB magnetic beads to obtain the sequencing library.
5) The finished library was quantified accurately with a Qubit 2.0, and its size was checked with an Agilent 2100. Sequencing was performed once the library size passed quality control.

Transcriptome sequencing and assembly
Qualified libraries were sequenced on the PacBio Sequel II platform [1, 2]. Subreads were obtained from the raw sequencing reads using the official PacBio SMRT Link pipeline (version 7.0.0.63985; parameters -min_passes 3, -min_length 50, -max_length 15,000), and Circular Consensus Sequence (CCS) reads were extracted from the subreads BAM file. Using IsoSeq, CCS reads were classified as full-length (FL), full-length non-chimeric (FLNC), or non-full-length (NFL) based on the cDNA primers and the poly(A) tail signal. The FLNC reads were then clustered with the Iterative Clustering and Error correction (ICE) tool to generate cluster consensus isoforms. Finally, the NFL reads were used to polish the consensus sequences, yielding high-quality sequences for subsequent analysis [3, 4]. To obtain a final set of non-redundant transcript sequences, CD-HIT (version 4.8.1; parameters -c 0.95, -G 0, -aL 0.00, -aS 0.99, -AS 30) was used to merge highly similar sequences and remove redundant sequences from the high-quality transcripts. Diamond (version 0.9.24; parameters --more-sensitive, -k 10, -e 1e-5) was used to align the sequences to the various databases; the protein with the highest sequence similarity was selected, and its functional information was used for annotation. TransDecoder (version 5.5.0; parameters -G universal, -S, -m 100) was used to identify candidate coding sequence (CDS) regions within the transcript sequences.
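As a rough illustration of how the reported tool parameters fit together, the Python sketch below assembles command lines for the redundancy removal, alignment, and CDS prediction steps without running them. Only the parameter values come from the text. The file names, the choice of the cd-hit-est executable (nucleotide mode), the diamond blastx subcommand, and the use of TransDecoder.LongOrfs are assumptions made for the example, not details confirmed by the authors.

```python
import shlex
from typing import List

# Hypothetical file names; the source does not state them.
POLISHED_FASTA = "polished_hq.fasta"
UNIGENE_FASTA = "unigenes.fasta"
NR_DB = "nr.dmnd"
HITS_TSV = "unigenes_vs_nr.tsv"


def cd_hit_cmd(in_fasta: str, out_fasta: str) -> List[str]:
    """Redundancy removal with the reported CD-HIT thresholds.

    Assumption: nucleotide transcripts, hence the cd-hit-est executable.
    """
    return ["cd-hit-est", "-i", in_fasta, "-o", out_fasta,
            "-c", "0.95", "-G", "0", "-aL", "0.00",
            "-aS", "0.99", "-AS", "30"]


def diamond_cmd(query: str, db: str, out: str) -> List[str]:
    """Protein database search with the reported DIAMOND settings.

    Assumption: translated search (blastx) of transcripts against proteins.
    """
    return ["diamond", "blastx", "--more-sensitive",
            "-k", "10", "-e", "1e-5",
            "-q", query, "-d", db, "-o", out]


def transdecoder_cmd(transcripts: str) -> List[str]:
    """Candidate ORF extraction with the reported TransDecoder settings.

    Assumption: the options were passed to TransDecoder.LongOrfs.
    """
    return ["TransDecoder.LongOrfs", "-t", transcripts,
            "-G", "universal", "-S", "-m", "100"]


pipeline = [
    cd_hit_cmd(POLISHED_FASTA, UNIGENE_FASTA),
    diamond_cmd(UNIGENE_FASTA, NR_DB, HITS_TSV),
    transdecoder_cmd(UNIGENE_FASTA),
]

# Print the commands for inspection instead of executing them.
for cmd in pipeline:
    print(shlex.join(cmd))
```

Building the commands as argument lists keeps each parameter explicit and makes it straightforward to pass them to subprocess.run once the tools and input files are actually available.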

Functional annotation of full-length transcriptome sequences
The NR [5] database was used for homology searches (E-value threshold of 1e-5). GO [6] and KOG [7, 8] were used to annotate gene function. Swiss-Prot [9] and Pfam [10] were used to classify transcripts. KEGG [11] annotation, run with the parameters -species ko and -e 1e-5, was used to compare and annotate transcripts.
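The E-value filtering and best-hit selection described here can be sketched in Python as follows. The sketch assumes the alignment results are in the 12-column BLAST-style tabular format and picks the best hit per query by bit score; both are assumptions for illustration, since the text does not specify the output format or the ranking criterion. The records in the example are made up.

```python
import csv
import io
from typing import Dict, Iterable

# Column order of BLAST-style tabular output (12 standard fields).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

EVALUE_CUTOFF = 1e-5  # threshold reported in the text


def best_hits(lines: Iterable[str],
              cutoff: float = EVALUE_CUTOFF) -> Dict[str, Dict[str, str]]:
    """Keep, per query, the highest-bit-score hit among hits passing the cutoff."""
    best: Dict[str, Dict[str, str]] = {}
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < len(COLUMNS):
            continue  # skip blank or malformed lines
        hit = dict(zip(COLUMNS, row))
        if float(hit["evalue"]) > cutoff:
            continue  # hit is not significant enough
        current = best.get(hit["qseqid"])
        if current is None or float(hit["bitscore"]) > float(current["bitscore"]):
            best[hit["qseqid"]] = hit
    return best


# Made-up records: two hits for unigene_1, one weak hit for unigene_2.
example = "\n".join([
    "unigene_1\tprotA\t85.0\t200\t30\t0\t1\t600\t1\t200\t1e-50\t300",
    "unigene_1\tprotB\t90.0\t150\t15\t0\t1\t450\t1\t150\t1e-40\t250",
    "unigene_2\tprotC\t40.0\t50\t30\t0\t1\t150\t1\t50\t0.01\t30",
])

hits = best_hits(io.StringIO(example))
for query, hit in hits.items():
    print(query, hit["sseqid"], hit["evalue"], hit["bitscore"])
```

In this toy input, unigene_2 is dropped because its only hit has an E-value above the threshold, which mirrors how unigenes without a significant match remain unannotated.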

Funding
This work was supported by the Foundation of State Key Laboratory of Desert and Oasis Ecology, Xinjiang Institute of Ecology and Geography, Chinese Academy of Sciences (G2018-02-07), and the Fundamental Research Funds for the Central Universities (2572020DY17).