Full-length transcriptome analysis of Adiantum flabellulatum gametophyte

Ferns are important components of plant communities on earth, but their genomes are generally very large, with many redundant genes, making whole genome sequencing of ferns prohibitively expensive and time-consuming. This means there is a significant lack of fern reference genomes, making molecular biology research difficult. The gametophytes of ferns can survive independently, are responsible for sexual reproduction and the feeding of young sporophytes, and play an important role in the alternation of generations. For this study, we selected Adiantum flabellulatum as it has both ornamental and medicinal value and is also an indicator plant of acidic soil. The full-length transcriptome sequencing of its gametophytes was carried out using PacBio three-generation sequencing technology. A total of 354,228 transcripts were obtained, and 231,705 coding sequences (CDSs) were predicted, including 5,749 transcription factors (TFs), 2,214 transcription regulators (TRs) and 4,950 protein kinases (PKs). The transcripts annotated by non-redundant protein sequence database (NR), Kyoto encyclopedia of genes and genomes (KEGG), eukaryotic ortholog groups (KOG), Swissprot, protein family (Pfma), nucleotide sequence database (NT) and gene ontology (GO) were 251,501, 197,474, 193,630, 194,639, 195,956, 113,069 and 197,883, respectively. In addition, 138,995 simple sequence repeats (SSRs) and 111,793 long non-coding RNAs (lncRNAs) were obtained. We selected nine chlorophyll synthase genes for qRT-PCR, and the results showed that the full-length transcript sequences and the annotation information were reliable. This study can provide a reference gene set for subsequent gene expression quantification.


INTRODUCTION
Ferns are an ancient group of land plants. Their origins can be traced back to 380 million years ago (Schneider et al., 2004). Ferns are the second largest group of vascular plants (Sessa et al., 2014). At present, more than 11,000 species have been recorded (The Pteridophyte Phylogeny Group, 2016). They are important components of plant communities on Earth.
The genomes of ferns are generally large, with an average size of 12 Gb, and the largest can even reach 148 Gb (Hidalgo et al., 2017a;Hidalgo et al., 2017b). Ceratopteris thalictroides is a model plant of ferns with a genome size of 11.25 Gb, which is more than 80 times the size of the genome of Arabidopsis thaliana (135 Mb;Marchant et al., 2019). Adiantum flabellulatum belongs to the genus Adiantum of Pteridaceae, whose genome size is unknown. The smaller genomes of Adiantum caudatum and Anogramma leptophylla in Pteridaceae are 3.78 Gb and 3.0 Gb, respectively (Kuo & Li, 2019). Because the genes of a fern are highly redundant, whole genome sequencing of ferns is both expensive and time-consuming. The genomes of Azolla filiculoides and Salvinia cucullata are relatively small at 0.75 Gb and 0.26 Gb, respectively (Li et al., 2018). They are currently the only two ferns with whole genome sequences. Due to the lack of reference genomes, the molecular biology research of ferns has somewhat stagnated.
The gametophytes of ferns are very small, but they can survive independently. They are responsible for sexual reproduction, feeding young sporophytes, and are indispensable in the process of alternation of generations. A. flabellulatum is mainly distributed in China, Japan, and Vietnam. Cao, Huang & Wang (2010) observed the gametophyte development process of A. flabellulatum by microscope. Wang et al. (2018) found that plant hormone brassinosteroids (BRs) could regulate the gametophyte development of A. flabellulatum.
For plants without reference genomes, full-length transcripts can effectively provide a reference gene set for gene expression quantification. Jia et al. (2020), for example, used PacBio single-molecule real-time (SMRT) sequencing technology to analyze the full-length transcriptome of Rhododendron lapponicum, and obtained a large number of full-length transcripts, which facilitated gene expression quantification. Wu et al. (2020) analyzed the sequencing data of the full-length transcriptome of endangered species Populus wulianensis, which provided a scientific foundation for an in-depth study of its gene functions. In this study, the full-length transcripts of A. flabellulatum gametophytes were obtained using PacBio three-generation sequencing technology, which can provide a reference gene set for subsequent gene expression quantification and is of great significance to help further the research of gametophyte development in ferns.

Plant materials
The spores of A. flabellulatum were collected from the rubber forest in Ma'an mountain, Danzhou City, Hainan Province, China (109.519 • E, 19.501 • N). The mature spores were sterilized and inoculated in 1/4 MS medium to form young gametophytes after germination. Young gametophytes with diameters of 1-2 mm were selected and transferred into new 1/4 MS medium. These young gametophytes were then placed in illuminations of 0, 10, 500, 1,000 and 10,000 lx, respectively, and cultured at 25 • C for 12 days. Three biological replicates were set in the above 5 illuminations, respectively, with a total of 15 samples. Approximately 200 µg of gametophytes from each sample were selected for RNA extraction.

RNA extraction, Library construction and Sequencing
Total RNA (more than 40 ng/µL) was extracted from the above 15 samples using the CTAB method, and the mRNAs were purified and mixed to construct a cDNA library (Yu et al., 2021). Raw reads were obtained on the PacBio Sequel platform. After reads of insert (ROI) were generated from raw reads, clustering and correction (Chin et al., 2013) were performed using the Quiver algorithm to obtain high-quality full-length consensus reads and then the transcripts were obtained after redundancy removal (using the Cd-hit program, Li & Godzik, 2006). RNA extraction, library construction and sequencing were performed by the Beijing Genomics Institute (BGI).

Quantitative real-time polymerase chain reaction
In order to verify the reliability of the full-length transcript sequences and the annotation information, we selected nine chlorophyll synthase genes and used the method applied by Cai et al. (2021) for qRT-PCR (Table S1). Each of the four treatments (0 lx, 10 lx, 500 lx and 10,000 lx) were detected with three biological replicates for a total of 12 samples. Three parallel tests were performed for each sample, and the isoform_55882 (Ubiquitin C) was used as a reference gene. Relative quantification was performed using the 2 − Ct method (Pfaffl, 2001;Cai et al., 2021).

PacBio SMRT sequencing data
In this study, a PacBio ISOSeq library was established, and the full-length transcriptome sequencing was performed on the PacBio Sequel platform. The total base of polymerase reads was 69.87 Gb, and the total number of reads was 923,898. The average length, the longest sequence and the N50 of polymerase reads were 75,625.26 bp, 303,495 bp and 129,782 bp, respectively (Fig. 1A). After removing the adapters, 16,221,787 subreads were obtained, with the total base of 68.72 Gb. The average length, the longest sequence and the N50 of subreads were 4,236.18 bp, 218 714 bp and 4,762 bp, respectively (Fig. 1B).  After subreads from the same circular molecule were combined into circular consensus sequences (CCSs), a total of 809,367 CCSs were obtained. The average length and average quality were 4,934 bp and 0.97, respectively (Fig. 1C). The CCSs were split into full-length, non-full-length, chimeric and non-chimeric sequences according to whether 5 primer, 3 primer and ploy(A) tail were detected. There were 2,762,613 full-length non-chimeric (FLNC) sequences, with an average length of 998 bp and an average quality of 0.98 (Fig. S1).
The FLNC sequences from the same copy were grouped into clusters, and the Arrow algorithm (Hong et al., 2020) was used to correct the error. After removing the low-quality sequences, 2,608,705 consensus reads were obtained with an average quality of 0.99 (Fig. S2). After removing redundancy, 354,228 transcripts were finally obtained (Table 1, Fig. 1D). The CCS and transcript data have been uploaded to the NCBI Sequence Read Archive (SRA) database (https://www.ncbi.nlm.nih.gov/sra), and the accession numbers are PRJNA774117 and PRJNA733457, respectively.

CDS prediction
CDS prediction was performed on transcripts, and a total of 231,705 CDSs were obtained. The total length was 194,628,012 bp, and the N50 was 1,056 bp (Table 2). There were 164,535 CDSs with a length of 100-1,000 bp, accounting for 71.0% of the total. There were 57,009 CDSs of 1,000-2,000 bp, 8,722 CDSs of 2,000-3,000 bp and 1,439 CDSs of ≥ 3,000 bp, accounting for 24.6%, 3.8% and 0.6% of the total, respectively (SRA database; Fig. 2A). The CDSs have been uploaded to the SRA database, and the accession number is PRJNA777740.

SSR detection
Through SSR detection of 354,228 transcripts, 138,995 SSRs were obtained from 81,304 transcripts, with a total length of 3,875,607 bp (Table S2). Of these, 50,482 transcripts contained one SSR locus and 30,822 transcripts contained two or more SSR loci (Table 3). Among SSR repeat types, di-nucleotide repeats (90,741) accounted for the highest proportion (65.28%). The second was the mono-nucleotide repeats (35,856) accounting for 25.8% of total repeats and the third was tri-nucleotide repeats (10,528) accounting  for 7.56% of repeats. Tetra-nucleotide, penta-nucleotide and hexa-nucleotide repeats accounted for the minority at 0.40%, 0.25% and 0.69% of repeats, respectively. Forty-one thousand three hundred sixty-eight SSRs were present in compound formation. SSRs with six repeat units (19,697, 14 17%) were the most common, followed by seven repeat units (12,575, 9.05%), eight repeat units (9,806, 7.05%) and 12 repeat units (9,272, 6.67%) ( Table 4). Among the SSRs with di-nucleotide repeats, AG/CT (68,289) was the most common type, and AT/AT (824) was the least. Among the SSRs with tri-nucleotide repeats, ATC/GAT (2,375) was the most common type and AAT/ATT (51) was the least common ( Fig. 2B). Twenty-nine thousand five hundred eighty-nine pairs of primers were designed and screened by Primer3. The primer lengths were 18-28 bp, the GC contents were 20.00-72.22%, the annealing temperatures were 57-63 • C, and the product lengths were 80-276 bp (Table S3).

Notes.
Note: the ''Intersection'' expresses the number and proportion of transcripts annotated by 7 databases together; the ''Overrall'' expresses the number and proportion of transcripts annotated by at least 1 database.

Functional annotation
Three hundred fifty-four thousand two hundred twenty-eight transcripts were annotated by seven databases. There were 251,501, 197,474, 193,630, 194,639, 195,956, 113,069 and 197,883 transcripts annotated by the NR, KEGG, KOG, Swissprot, Pfma, NT and GO databases (Table S4), respectively. Most of them were annotated by the NR database, accounting for 71.00%. The NT database annotated the least, accounting for 31.92% (Table 5). Fifty-five thousand seven hundred ninety-four transcripts were annotated by all the seven databases combined, accounting for 15.75% of the transcripts. There were 125,336 transcripts annotated by NR, KEGG, KOG, SwissProt and Pfam databases together, accounting for 35.38% (Fig. 3A).

NR annotation
The proportion of transcripts annotating to homologous species was counted from the NR annotation results. Among them, the proportions of transcripts annotating to Marchantia polymorpha, Selaginella moellendorffii, Physcomitrella patens and Picea sitchensis were 59.31%, 10.98%, 9.45% and 8.70%, respectively, and the proportion of transcripts annotating to other known species was 11.56% (Table S5; Fig. 3B).

KOG annotation
There were 193,630 transcripts annotated by the KOG database, which could be divided into 25 groups according to KOG functional classification. Among them, the transcripts belonging to ''general function prediction only'' were the most common with a total of 48,175, accounting for 24.88%. There were 26,065 ''signal transduction mechanisms'' and 23,151 ''posttranslational modifications protein turnover chaperones,'' accounting for 13.46% and 11.96%, respectively (Fig. 4B).

Quantitative real-time polymerase chain reaction
Chlorophyll is an important pigment in plant photosynthesis, and its synthesis enzyme gene expressions are regulated by light (Lee et al., 2005;Masuda et al., 2003;Matsumoto et al., 2004;Okamoto et al., 2020;Tzvetkova-Chevolleau et al., 2007). In order to verify the reliability of the full-length transcript sequences and the annotation information, we selected nine chlorophyll synthase genes according to KEGG annotation for qRT-PCR, including HemA, HemL, HemB, CHLH, CHLD, CRD, DVR, POR and CAO. The results showed that the expressions of the above nine genes increased at first and then decreased with the increase of illumination (Table S11; Fig. 9). It can be seen that the chlorophyll synthase genes in gametophyte of A. flabellulatum are also regulated by light. This also shows that the data of the full-length transcripts has strong reliability.

DISCUSSION
Based on the Illumina HiSeq sequencing platform, the transcriptomes of roots and shoots of Athyrium yokoscense (Ukai et al., 2020), young leaves of Asplenium nidus, young leaves of Asplenium komarovii (Zhang et al., 2019) and sexual reproduction and apogamy of Adiantum reniforme gametophytes (Fu & Chen, 2019) were studied, and 35,681, 89,741, 77,912, 333,352, 264,791 transcripts were obtained, respectively. The sequencing read length of the Illumina HiSeq is short, the mRNAs need to be broken into short fragments before sequencing, and the complete transcripts can be obtained only by splicing. The genomes of ferns are huge, and the genes are highly redundant, so errors inevitably occur in the splicing process. Therefore, in this study, full-length transcriptome sequencing of A. flabellulatum gametocytes was performed using PacBio SMRT technology. This technology has very long average read lengths and can read the gene sequences completely. We believe that this technology is more suitable for species with high gene redundancy, which may be one of the reasons for the large number of transcripts obtained in this study. SSRs, also known as short tandem repeats (STRs) or microsatellites (Xu, Sun & Li, 2010), play an important role in genetic development and gene expression regulation (Lawson & Zhang, 2006). Existing studies have shown that AG/CT is the dominant motif of di-nucleotide repeats in ferns, dicotyledons and monocotyledons (Jia et al., 2014;Jia et al., 2016;Kumpatla & Mukhopadhyay, 2005;Liu et al., 2016;Morgante, Hanafey & Powell, 2002;Zheng, Zhang & Wu, 2011), which is consistent with the SSR detection results of A. flabellulatum in this study. AT/AT is the second most common di-nucleotide repeat motif in dicotyledons and monocotyledons in other studies (Kumpatla & Mukhopadhyay, 2005;Zheng, Zhang & Wu, 2011), but in this experiment, AT/AT was the least common di-nucleotide repeat motif. CCG/CGG is the largest tri-nucleotide repeat motif in monocotyledons (Morgante, Hanafey & Powell, 2002;Zheng, Zhang & Wu, 2011), but is rare in dicotyledons and ferns, and the number of CCG/CGG in this study only accounted for 0.19% of the total SSRs.
The NR annotation results showed that the dominant species were S. jiangnanensis (10.98%, lycophytes/ Lycopodiopsida), M. polymorpha (11.56%, bryophytes), P. patens (9.45%, bryophytes), and P. sitchensis (8.70%, gymnosperms). A. flabellulatum belongs to Polypodiopsida (true ferns), which confirms that there is a certain relationship between Polypodiopsida and Lycopodiopsida in plant evolution (The Pteridophyte Phylogeny Group, 2016). In the history of plant evolution, ferns have a relatively close relationship with bryophytes and gymnosperms (Wickett et al., 2014), which is also why the transcripts can be annotated to bryophytes and gymnosperms. The NR database annotation results are consistent with the plant genetic relationship, so this study can provide similar reference gene sequences for other ferns.
TFs, TRs and PKs are important components of cell signal transduction, and also important elements in the gene regulatory network. They are widely present in different organisms and participate in various life activities (Zheng et al., 2016). Among about 25,000 genes in the angiosperm model plant A. thaliana, the numbers of TFs and PKs were 2,391 and 1,028, accounting for about 9.56% and 4.11%, respectively (Aglawe et al., 2012). There are about 30,000 genes in Sorghum bicolor, including 2,654 TFs, accounting for about 8.85% of the plant's genes (Baillo et al., 2019). Liaquat et al. (2021) detected 163,834 transcripts in Schima superba, and obtained 9,423 TFs, accounting for 5.75% of the transcripts. Of the 181,141 transcripts in Huperzia serrata, 3,391 (1.87%) were TFs (Yang et al., 2017). There were 101,448 transcripts of Monachosorum maximowiczii, including 1,130 TFs, accounting for 1.11% of the transcripts (Liu et al., 2016). In this study, a total of 5,749 TFs (1.62%) and 4,950 PKs (1.39%) were predicted from 354,228 full-length transcripts. It can be seen that the proportions of TFs and PKs in the gametophyte transcripts of A. flabellulatum are less than those in angiosperms, which may be related to the general characteristics of ferns or the relatively simple structure of fern gametophytes.

CONCLUSIONS
In this study, full-length transcriptome sequencing of A. flabellulatum gametocytes was performed using PacBio SMRT technology. The genomes of ferns are huge, and the genes are highly redundant, so this was a new attempt at obtaining full-length transcripts of ferns using this technology. Moreover, we predicted lncRNAs of ferns for the first time, and finally obtained 111 793 lncRNAs, accounting for 31.56% of all transcripts. This indicates that a large number of lncRNAs exist in ferns. A total of 354,228 transcripts were obtained in this study which can provide a reference gene set for subsequent gene expression quantification.
• Zixuan Wang and Min Pan analyzed the data, prepared figures and/or tables, and approved the final draft.
• Xudong Yu and Shitao Xu conceived and designed the experiments, authored or reviewed drafts of the paper, and approved the final draft.

DNA Deposition
The following information was supplied regarding the deposition of DNA sequences: The full-length transcript data is available in the SRA database: PRJNA733457.

Data Availability
The following information was supplied regarding data availability: The data is available at NCBI SRA: PRJNA774117, PRJNA733457, PRJNA777740.

Supplemental Information
Supplemental information for this article can be found online at http://dx.doi.org/10.7717/ peerj.13079#supplemental-information.