Analyzing the Nucleotide Variations within the Expressed Sequence Tags of Loblolly Pine (Pinus taeda)


Introduction
Single nucleotide polymorphisms (SNPs) are the most frequent type of variation in eukaryotic genomes. For example, SNPs occur about once per kilobase in humans [1], once every 78 base pairs (bp) in grapevine [2], and once every 43 bp in maize [3]. SNPs have proven to be useful tools in many genetic studies, including molecular breeding, population genomics, genetic diversity, and cultivar identification [4].
Pine forests are the dominant ecosystem in many regions of the world and supply important industrial materials. Pines have enormous genomes that are about 10 and 40 times larger than human and poplar genomes, respectively [5]. Given current sequencing technologies, sequencing the whole genome of pine is infeasible in the near future. Expressed sequence tags (ESTs) provide an alternative approach for functional studies of the expressed genes in pine. In recent years, large numbers of pine EST sequences, especially of loblolly pine, have been deposited in public databases (www.ncbi.nlm.nih.gov/search/EST/), providing valuable resources for analyzing the nucleotide variations within expressed genes in the pine genome.
In this paper, we detected and analyzed a large number of nucleotide variations in the expressed genes of loblolly pine using existing EST resources. Our objectives were to (1) detect nucleotide variations within expressed sequences of loblolly pine, and (2) characterize these variations.

Acquisition and assembly of EST sequences
EST sequences of loblolly pine were downloaded from the NCBI database (http://www.ncbi.nlm.nih.gov/search/EST/) on May 10, 2012. Prior to assembly, we used the program SEQMAN NGEN (v. 1.2) [6] to trim adaptors, primers, and poly-A tails and to filter by sequence quality (threshold quality score = 20). The EST sequences were then assembled using the GS De Novo Assembler program (v. 2.7) [7] with default parameters, except for the minimum overlap length (50).
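The preprocessing itself was done with SEQMAN NGEN, whose internals are not described here. The following is only a minimal Python sketch of the two ideas involved, poly-A trimming and quality filtering. The function names, the minimum poly-A run length of 10, and the use of the mean quality score as the filter statistic are all assumptions made for illustration, not settings reported in this study.

```python
import re
from typing import Sequence


def trim_poly_a(seq: str, min_run: int = 10) -> str:
    """Remove a trailing run of at least `min_run` A bases.

    `min_run` is a hypothetical threshold chosen for this sketch.
    """
    return re.sub(r"A{%d,}$" % min_run, "", seq.upper())


def passes_quality(quals: Sequence[int], threshold: int = 20) -> bool:
    """Keep a read whose mean quality score reaches `threshold`.

    Using the mean is an assumption; the text only gives the threshold (20).
    """
    return len(quals) > 0 and sum(quals) / len(quals) >= threshold
```

For instance, `trim_poly_a("ACGT" + "A" * 12)` returns `"ACGT"`, while a short terminal run such as `"ACGTAAA"` is left unchanged.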

Discovery and analysis of nucleotide variations
To detect nucleotide variations in ESTs, we used the consensus EST sequences of loblolly pine as the reference sequences. Roche gsMapper (v. 2.7) [8] was used to align individual ESTs to the reference sequences with default settings, except for minimum overlap length (50) and minimum overlap identity (96%). Following the classification used by the Single Nucleotide Polymorphism Database (dbSNP, http://www.ncbi.nlm.nih.gov/projects/SNP/), we included SNPs, short multi-base polymorphisms (MNPs), single- and short multi-base indels, and tandem repeat variations in our analysis. Candidate SNPs detected with gsMapper were filtered according to the following criteria: (1) at least four nonduplicated ESTs shared the polymorphism, and both forward and reverse ESTs carried the change if the total depth was lower than 7, and (2) no other SNPs were detected within five bp on either side of the candidate SNP.
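The two filtering criteria can be sketched in Python as follows. This is an illustrative reading of the criteria, not the script used in this study. The `Candidate` record and its field names are hypothetical, and the sketch assumes that neighbouring SNPs are checked against all raw candidates on the same contig rather than only those that passed criterion (1).

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Candidate:
    contig: str        # contig (reference) identifier
    pos: int           # position on the contig consensus
    depth: int         # total number of ESTs covering the position
    fwd_support: int   # nonduplicated forward ESTs carrying the variant
    rev_support: int   # nonduplicated reverse ESTs carrying the variant


def filter_candidates(cands: Iterable[Candidate], min_support: int = 4,
                      low_depth: int = 7, window: int = 5) -> List[Candidate]:
    """Apply criteria (1) and (2) to a collection of candidate SNPs."""
    cands = list(cands)
    kept = []
    for c in cands:
        # Criterion 1a: at least four nonduplicated ESTs share the variant.
        if c.fwd_support + c.rev_support < min_support:
            continue
        # Criterion 1b: at depth below 7, both strands must carry the change.
        if c.depth < low_depth and (c.fwd_support == 0 or c.rev_support == 0):
            continue
        # Criterion 2: no other candidate within five bp on either side.
        crowded = any(
            o is not c and o.contig == c.contig and abs(o.pos - c.pos) <= window
            for o in cands
        )
        if not crowded:
            kept.append(c)
    return kept
```

Under this reading, a candidate at depth 6 supported by four forward ESTs and no reverse ESTs is rejected, whereas the same support at depth 10 is accepted.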

EST sequence assembly and detection of nucleotide variations
Altogether, 328,662 loblolly pine ESTs were downloaded from NCBI and assembled into 12,515 contigs, with 26 ESTs per contig on average. A total of 23,881 nucleotide variations, including 22,682 (94.98%) SNPs and 1,199 MNPs, were detected in 3,697 of the assembled contigs, while 8,818 contigs contained no base changes. The proportion of contigs containing base variations was 29.54% (Table 1). The detected nucleotide variations and their characteristics are listed in Supplementary Table 1. SNPs are more common than MNPs in many plant species, including crops such as maize [9] and Brassica [10] and forest trees such as Pinus [9]. MNPs cause greater changes than SNPs in translated proteins and are therefore more likely to affect function. The predominance of SNPs may thus be a general phenomenon in coding genes.
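As a quick consistency check, the reported percentages can be recomputed from the counts given above. The short Python sketch below uses only numbers stated in the text; the variable and function names are illustrative.

```python
TOTAL_CONTIGS = 12_515
VARIANT_CONTIGS = 3_697
SNP_COUNT = 22_682
MNP_COUNT = 1_199
TOTAL_VARIATIONS = SNP_COUNT + MNP_COUNT  # 23,881


def percent(part: int, whole: int) -> float:
    """Return `part` as a percentage of `whole`, rounded to two decimals."""
    return round(100 * part / whole, 2)


snp_share = percent(SNP_COUNT, TOTAL_VARIATIONS)        # 94.98
mnp_share = percent(MNP_COUNT, TOTAL_VARIATIONS)        # 5.02
contig_share = percent(VARIANT_CONTIGS, TOTAL_CONTIGS)  # 29.54
```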
As expected, the majority of single-base variations were substitutions (22,529, 99.33% of single-base changes), while indels were relatively infrequent (153, 0.67%). In contrast, the proportions of substitutions (626, 52.21%) and indels (573, 47.79%) were similar among the 1,199 multiple-base variations. There were more transitions than transversions, because base changes between two pyrimidines or two purines are biochemically easier than changes between a pyrimidine and a purine. A high frequency of transitions was also observed in other SNP discovery studies [11]. Among the SNP substitutions, C/T transitions occurred most frequently. A higher rate of C/T transitions also occurs in other organisms [11], consistent with the experimental observation that deamination of methylated cytosine is the most common mutational event. In the coding genes of pine, G/A and G/C substitutions were more common than G/T, A/C, and A/T transversions, although the underlying mechanism could not be explained.
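The distinction between transitions and transversions follows directly from the chemical class of the two bases. A minimal Python sketch of the classification is given below; the function name is illustrative and is not part of any tool used in this study.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref: str, alt: str) -> str:
    """Classify a single-base substitution as a transition or a transversion."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt or not ({ref, alt} <= (PURINES | PYRIMIDINES)):
        raise ValueError("expected two different bases from A, C, G, T")
    # A transition stays within one class (purine<->purine or
    # pyrimidine<->pyrimidine); a transversion crosses between classes.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"
```

With this definition, C/T and G/A are transitions, whereas G/C, G/T, A/C, and A/T are transversions.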
We detected a total of 153 single-base indels, including 40, 43, 37, and 33 for A, C, G, and T, respectively. There was no significant indel bias for the different nucleotides.
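The text does not state which statistical test was used to assess indel bias. One common choice is a chi-square goodness-of-fit test against equal expected counts for the four nucleotides, sketched below in Python. The uniform expectation (which ignores the base composition of the ESTs) is an assumption of this sketch, as are the names used.

```python
from typing import Sequence


def chi_square_uniform(counts: Sequence[int]) -> float:
    """Chi-square statistic against equal expected counts in every category."""
    expected = sum(counts) / len(counts)
    return sum((observed - expected) ** 2 / expected for observed in counts)


# Single-base indel counts reported in the text.
indel_counts = {"A": 40, "C": 43, "G": 37, "T": 33}
statistic = chi_square_uniform(list(indel_counts.values()))

# Critical value of the chi-square distribution for 3 degrees of freedom
# at a significance level of 0.05.
CRITICAL_DF3_ALPHA_05 = 7.815
is_significant = statistic > CRITICAL_DF3_ALPHA_05
```

Under these assumptions the statistic is about 1.43, well below the critical value of 7.815, which agrees with the conclusion that there is no significant bias.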

Analysis of MNPs
Of the 1,199 multiple-base changes, 626 (52.21%) were substitutions, all of which caused amino acid changes. There were also 573 (47.79%) multi-base indels, whose frequency generally decreased with indel length (Figure 1), except for indels of 6, 9, 12, 15, and 18 bases, whose lengths are whole multiples of a codon. Most multi-base indels were 2-4 bases long, accounting for 67.19% of the total. This result might imply that short indels occur at a higher frequency than long indels, or that long indels are more likely to be selected against over evolutionary time. This trend was also observed in a previous study on maize [12].
As previously reported, the general trend that longer indels are less frequent is affected by indels of codon length [11,12]. Codon-length indels of 6-18 bases occurred at higher frequencies than expected, because such indels would only cause slight changes in the open reading frames of the corresponding genes. Indels may result from errors in DNA synthesis, repair, or recombination or may be caused by insertion and excision of transposable elements, which often leave behind a characteristic DNA remnant of several bases. Notably, the frequency of eight-base indels was also higher than expected. A similar scenario was observed in maize by Bhattramakki et al. [12] and was thought to be related to sequence duplication during insertion and excision of Ac/Ds transposable elements [13].
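The reasoning about codon-length indels rests on a simple rule: an indel whose length is a multiple of three adds or removes whole codons and leaves the downstream reading frame intact, whereas any other length shifts the frame. A minimal Python sketch of that rule follows; the function names are illustrative.

```python
from collections import Counter
from typing import Dict, Iterable


def indel_effect(length: int) -> str:
    """Return 'in-frame' for lengths divisible by three, else 'frameshift'."""
    if length < 1:
        raise ValueError("indel length must be a positive integer")
    return "in-frame" if length % 3 == 0 else "frameshift"


def tally_effects(lengths: Iterable[int]) -> Dict[str, int]:
    """Count in-frame and frameshift indels in a collection of indel lengths."""
    return dict(Counter(indel_effect(n) for n in lengths))
```

Indels of 6, 9, 12, 15, and 18 bases are classified as in-frame, while an eight-base indel is a frameshift, so its elevated frequency cannot be explained by frame preservation.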
Although multi-base variations were much less frequent than single-base variations, they could be used as molecular markers whose polymorphisms are much easier to detect. The multi-base changes detected in this study offer a valuable resource for developing easily detectable markers in coding genes.