Transcriptome sequencing and SNP detection in Phoebe chekiangensis

Background Phoebe chekiangensis is a rare tree species that is only distributed in south-eastern China. Although this species is famous for its excellent wood properties, it has not been extensively studied at the molecular level. Results Here, the transcriptome of P. chekiangensis was sequenced using next-generation sequencing technology, and 75,647 transcripts with 48,011 unigenes were assembled and annotated. In addition, 162,938 putative single nucleotide polymorphisms (SNPs) were predicted and 25 were further validated using the Sanger method. Conclusion The currently available SNP prediction software packages showed low levels of correspondence when compared. The transcriptome and SNPs will contribute to the exploration of P. chekiangensis genetic resources and the understanding of its molecular mechanisms.


INTRODUCTION
Phoebe chekiangensis, which belongs to Lauraceae, is a tree species with a high economic value worldwide that is mainly distributed in south-eastern China. P. chekiangensis is the major source of the well-known wood 'Golden Phoebe'. This wood has a superb reputation for its high-quality properties, such as its strong resistance to decomposition and dense texture (Gao et al., 2016). In addition to being widely used as timber or furniture in the imperial palace over the centuries, P. chekiangensis is a suitable garden plant species because of its outstanding tree morphology. However, due to its narrow distribution and slow growth, limited research has been conducted on this species, including studies of its general genomic studies.
Single nucleotide polymorphisms (SNPs) are widely used as genetic markers in association studies to understand inter-individual differences because of their characteristics of high frequency and binary variation patterns (Collins, Brooks & Chakravarti, 1998). Compared with traditional technologies, next-generation sequencing (NGS) technologies are usually more suitable for SNP identification because of their high throughput, although many artifacts are caused by systemic or random error. Researches on SNP identification and association studies have been carried on in many species (Martin et al., 2008;Ratan et al., 2015); however, very few SNPs are available in tree species because of the limited transcriptomic and genomic resources. Additionally, more validation work on putative SNP predicted by software using molecular experimental methods are required.
Over the past few years, NGS technology has led to profound changes in genomic and genetic research, with faster sequencing rates and continually decreasing costs (Mardis, 2008). Among the currently available NGS sequencing platforms, Illumina Hiseq2000 is relatively more cost effective, and it has been widely applied in the deep sequencing of model and non-model species (De et al., 2013). Because the determination of expressed sequence tags (ESTs) is an effective method for understanding the molecular mechanisms underlying physiological and morphological traits, for the first time, we sequenced the transcriptome of P. chekiangensis using Illumina HiSeqTM 2000 platform. This will help better understand and protect this rare tree species, and may aid in revealing the genetic principles of P. chekiangensis.

Sample collection and preparation
Leaves from a mature P. chekiangensis tree were collected in Zhejiang Academy of Forestry. Then the leaves were quickly frozen in liquid nitrogen and stored at −80 • C until RNA extraction. RNA degradation and contamination was monitored on 1% agarose gels. RNA integrity was assessed using the RNA Nano 6000 Assay Kit of the Agilent Bioanalyzer 2100 system (Agilent Technologies, Palo Alto, CA, USA). RNA purity was determined using the NanoPhotometer R spectrophotometer (IMPLEN, Westlake Village, CA, USA). In addition, RNA concentration were measured using a Qubit R RNA Assay Kit in a Qubit R 2.0 Fluorometer (Life Technologies, Carlsbad, CA, USA).

Library preparation for transcriptome sequencing
In each sample, 3 µg RNA was used as input for the RNA sample preparations. NEBNext R Ultra TM RNA Library Prep Kit for Illumina R (New England Biolabs (NEB), Beverly, MA, USA) was used to generate the sequencing libraries following the manufacturers' recommendations. Briefly, with using poly-T oligo-attached magnetic beads, mRNA was purified from total RNA. Fragmentation was then performed under elevated temperatures in NEB Next First Strand Synthesis Reaction Buffer (5×). First strand cDNA was synthesized using a random hexamer primer and M-MuLV Reverse Transcriptase (RNase H-), and second strand cDNA synthesis was subsequently performed using DNA Polymerase I and RNase H. Using exonuclease/polymerase activities, remaining overhangs were converted into blunt ends. NEB Next Adaptor with hairpin loop structures were ligated to prepare for hybridization after the adenylation of 3 ends of DNA fragments. The library fragments were purified with AMPure XP system (Beckman Coulter, Beverly, MA, USA) in order to select cDNA fragments ranging from 150 bp to 200 bp. Afterwards, 3 µl USER Enzyme was used with size-selected, adaptor-ligated cDNA at 37 • C for 15 min followed by 5 min at 95 • C before PCR. The library quality was assessed on the Agilent Bioanalyzer 2100 system and PCR reaction was performed with Phusion High-Fidelity DNA polymerase, Universal PCR primers and Index (X) Primer. At last, PCR products were purified (AMPure XP system).

Clustering and sequencing
According to the manufacturers' instructions, the clustering of the index-coded samples was performed on a cBot Cluster Generation System using TruSeq PE Cluster Kit v3-cBot-HS (Illumina). After cluster generation, the library preparations were sequenced on an Illumina Hiseq 2000 platform and paired-end reads were generated.
The raw data of fastq format was firstly processed through in-house perl scripts (File S4). After removing reads containing adapters (reads containing more than 5 adapter-polluted bases were regarded as adaptor-polluted reads and would be filtered out), reads containing poly-Ns accounting for more than 5% and low quality reads (reads with the number of low quality bases (phred quality < 19) accounting for more than 15% of the total bases) from the raw data, clean data (clean reads) were subsequently obtained. At the same time, Q20, Q30, GC-content and sequence duplication levels of the clean data were calculated. All of the downstream analyses were based on clean data of high quality.
Before transcriptome assembly, we counted the clean reads number for each transcript form 5 end to 3 end, and obtained the reads distribution for overall transcripts. Transcriptome assembly was accomplished based on the left.fq and right.fq using Trinity (v2012-10-5) (Haas et al., 2013) with min_kmer_cov set to 2, and all other parameters were set as default accordingly.

Functional annotation of unigenes and quantification of gene expression levels
After assembly, the longest transcript was defined accordingly as a unigene. Then, the unigenes were annotated based on the following databases: NCBI non-redundant protein sequences; NCBI non-redundant nucleotide sequences; Pfam; Clusters of Orthologous Groups of proteins; KEGG Ortholog and GO. The coding sequences and amino acids were determined based on standard codon usage table. Unigenes which couldn't be blasted to neither database were processed by ESTScan (Iseli, Jongeneel & Bucher, 1999).
Gene expression levels were estimated by RSEM (Li & Dewey, 2011) for each sample: 1. Clean data were mapped back onto the assembled transcriptome; 2. Readcount for each gene was obtained from the mapping results.

SNP calling and SSR prediction
SNP prediction was performed using the following workflow: the clean reads were firstly aligned with the transcripts that were assembled by Trinity, and then the duplicated reads and multi-mapped reads were filtered. Subsequently, the alignment results were sorted according to the transcripts' locations. SOAPsnp (v1.03) was used for SNP calling based on the sorted data, and initial raw prediction results were obtained (Li et al., 2009b). After further filtering based on their quality values, sequencing depths and SNP separation distances, final SNP prediction results were acquired.

SNP validation
Leaves of 114 samples including 9 populations were first collected during November and December in 2011-2012 (Table 1). PCR reactions were performed using the following procedure: an initial denaturation for 5 min at 94 • C, 30 cycles of 30 s at 94 • C, 30 s at the locus-specific annealing temperature, and 40 s at 72 • C, followed by a final extension of 1 min at 72 • C. A typical 10 µl reaction contained 1× buffer, 2.5 mM MgCl 2 , 0.2 mM of each dNTPs, 0.25 µM of each primer, 0.25 U of Taq DNA polymerase (Takara, Kusatsu, Shiga, Japan) and 25 ng genomic DNA. The electronic version of this article in Portable Document Format (PDF) will represent a published work according to the International Code of Nomenclature for algae, fungi, and plants (ICN), and hence the new names contained in the electronic version are effectively published under that Code from the electronic edition alone. In addition, new names contained in this work which have been issued with identifiers by IPNI will eventually be made available to the Global Names Index. The IPNI LSIDs can be resolved and the associated information viewed through any standard web browser by appending the LSID contained in this publication to the prefix ''http://ipni.org/''. The online version of this work is archived and available from the following digital repositories: PeerJ, PubMed Central, and CLOCKSS.

Sequencing and assembly results
Illumina sequencing data from P. chekiangensis were deposited in NCBI SRA database under accession number SRP100128. Two samples were collected and sequenced, and more than 134 million raw reads were initially obtained (Hansen, Brenner & Dudoit, 2010). After the filtering procedure, 128,237,694 clean reads with 93.99% and 93.47% Q30 bases, respectively, were selected for further analyses (Table 2). Using Trinity software, 75,647 transcripts were assembled successfully with an average length of 939 bp and the N50 was 1,605 bp. More than 39,000 transcripts were longer than 500 bp, accounting for 52.81% (Fig. 1). In total, 48,011 unigenes were identified, having an average length of 761 bp, and 19,439 of them were longer than 500 bp (40.49%).

Predictions of SSRs and SNPs
A total of 48,011 transcripts were examined for SSR prediction (File S3), and 9,505 were identified with SSRs (19.80%) and 1,830 sequences were found to have more than one SSR. According to the prediction results, 11,776 SSRs were predicted and one unit repeats accounted for the highest percentage (49.03%). In addition, 736 SSRs were present in compound formations.
According to the SOAPsnp prediction results, 162,983 putative SNPs were predicted in P. chekiangensis (File S4). Among them, 77.27% were in non-coding regions, 22.73% were in coding regions, 22.60% were synonymous SNPs, and 0.13% were non-synonymous. Most unigenes have less than 10 SNPs per 1,000 bp, indicating that the SNP frequency in P. chekiangensis was relatively low (Fig. 4). Although most SNPs seem not to affect the amino acid composition, they may be closely correlated with a bias in codon usage.

Validation of SNP prediction results
To further validate the putative SNPs, 15 unigenes containing 100 putative SNP loci were selected and primers were designed (Table 4). Because of the limited samples, the putative SNPs were validated using Sanger sequencing results, with the sequences amplified by PCR. All amplified sequences were sequenced accordingly, and when the double peak phenomenon was observed at one locus, based on the sequence diagram together with the comparison results between other sequences, this site could then be defined as a SNP according to the theory of polymorphism. As a result, 25 putative SNPs were finally validated.

Assembly and annotation of P. chekiangensis unigenes
In our study, 48,011 unigenes were assembled and 29,714 unigenes were successfully annotated. Based on the distribution of homogenization results, although the sequencing depth of the 5 /3 regions were relatively lower, the overall degree of transcripts' homogenization was high, indicating that our transcriptome sequencing results could satisfy the following analyses (Fig. 5). Because limited previous studies have been reported on the molecular mechanisms of this species (Gao et al., 2016), we believe that the assembly and annotation of these unigenes, including eight unigenes involved in lignin synthesis, would be beneficial to the research on P. chekiangensis molecular mechanisms, including the exploration of its excellent wood properties. Based on our annotation results, only 61.88% of all unigenes were successfully annotated in at least one database, suggesting that nearly 40% of the unigenes were uniquely distributed in P. chekiangensis. Therefore, the large number of unigenes, together with the transcripts, could effectively increase the transcriptomic and genomic information available for this species. Additionally, the prediction of

The criteria of SNP validation in P. chekiangensis
With the appearance of the next-generation sequencing technology, this will allow for the sequencing of polymorphic genotypes on specific target areas and consequent SNP identification, and the direct sequencing of DNA segments amplified by PCR from several individuals is still one classic method to identify SNPs (Oeveren & Janssen, 2009;Portis et al., 2013;Rafalski, 2002). Since we used Sanger method to validate putative SNP results, we sequenced each single sequence from both two ends (5 end and 3 end), and then they were assembled together in order to make sure the reported variants were not in fact the product of sequencing. However, in some scarce species with high heterozygosity, such as P. chekiangensis, a few problems may not be ignored. One important issue is that when amplified segments were paired-end sequenced, the results may not be easily assembled, and false positive SNPs may easily be detected because of the interference of heterozygosity or misalignment of paralogs. As a result, single sequences in one individual may have changeable bases at one position, and this may be confusing when analyzing multi-sequence alignment results to validate SNPs. As a considerable fraction of the predicted 'SNPs' are nucleotide polymorphisms between orthologous regions in the parental haplotype of a heterozygous individual, hence the exact rate of false positive or accuracy may be difficult to determine because of probable alignment artifacts caused by misalignment. In our study, we excluded this situation in one individual and only regarded base differences derived from multi-sequences as validated SNPs Besides , only obvious base differences together with double peak phenomenon observed were regarded as validated SNPs, although the clear definition of polymorphism in a single individual remains a question.

The results varied when using different SNP prediction software
In our study, SOAPsnp (Li et al., 2009b) was selected as our SNP prediction software. However, more than 20 software packages or programs, including GATK (several versions included) (McKenna et al., 2010), SAMtools (Li et al., 2009a), SOAPsnp have been developed to predict SNP, for both transcriptomes and genomes, regardless of the de novo assembly or a reference. In addition to using SOAPsnp for P. chekiangensis SNP prediction in this research, GATK and SAMtools were also selected for comparison. The different SNP prediction software packages varied greatly in time consumption and accordance, with an