Transcriptome analysis and annotation: SNPs identified from single copy annotated unigenes of three polyploid blueberry crops

Blueberry is a kind of new rising popular perennial fruit with high healthful quality. It is of utmost importance to develop new blueberry varieties for different climatic zones to satisfy the demand of people in the world. Molecular marker assisted breeding is believed to be an ideal method for the development of new blueberry varieties for its shorter breeding cycle than the conventional breeding. Simple sequence repeats (SSRs) and single nucleotide polymorphisms (SNPs) markers are widely used molecular tools for marker assisted breeding, which could be detected at large scale by the transcriptome sequencing. Here, we sequenced the leaves transcriptome of 19 rabbiteye (Vaccinium ashei Reade), 13 southern highbush (Vaccinium. corymbosum L × native southern Vaccinium Spp) and 22 cultivars of northern highbush blueberry (Vaccinium corymbosum L) by using next generation sequencing technologies. A total of 80.825 Gb clean data with an average of about 12.525 million reads per cultivar were obtained. We assembled 58,968, 55,973 and 53,887 unigenes by using the clean data from rabbiteye, southern highbush and northern highbush blueberry cultivars, respectively. Among these unigenes, 3599, 3495 and 3513 unigenes were detected as candidate resistance genes in three blueberry crops. Moreover, we identified more than 8756, 9020, and 9198 SSR markers from these unigenes, and 7665, 4861, 13,063 SNPs from the annotated single copy unigenes, respectively. The results will be helpful for the molecular genetics and association analysis of blueberry and the basic molecular information of pest and disease resistance of blueberry, and would also offer huge number of molecular tools for the marker assisted breeding to produce blueberry cultivars with different adaptive characteristics.

Introduction Blueberry is perennial flowering shrub or small tree, which comprises of about twenty members that belong to section) Cyanococcus, genus Vaccinium, and family Ericaceae [1]. The blueberry is a delicious fruit, and its fruits are famous in the world for its high anthocyanins contents, and it is listed among the top five healthful fruits (non-citrus) in North America [2,3]. The previous studies showed that blueberry anthocyanins have multiple healthful functions including retarding age-related diseases like Alzheimer's and enhancing memory [4], reducing eye strain, preventing macular degeneration, exhibiting anti-cancer activity [5,6], and reduce the risk of heart diseases [7]. Blueberry fruit is also a good raw material of sauces, juices and wine [8,9], and used as a dye because of high pigment contents [10].
In the recent decade, the blueberry production has increased significantly in the world, especially the production of new emerging countries from Asia, Oceania and South America [11][12][13]. The world production of highbush blueberry, which is a major blueberry crop, had passed the 1-billion pound in 2012 [14]. However, blueberry cultivars planted in the whole world are still mainly from North America [15], and the new blueberry producing countries have different climatic and soil conditions compared to the native blueberry producing area [16]. In order to cope with the challenges from various ecological and climatic conditions, more new widely adaptive cultivars are required for the development and growth of blueberry industry. However, blueberry is a perennial fruit crop with long juvenile period and complex ploidy genome [17][18][19]. Therefore, it required a long time to overcome these unfavorable factors and to select key traits in the breeding procession by conventional methods [20][21][22], and also needs to spend a lot of manpower and resources [23]. Modern molecular marker assisted breeding techniques and genetic engineering techniques are apt to overcome these problems and accelerate the breeding process [19].
With the advent of high-throughput sequencing technology and the development of bioinformatics analysis, genomics research has become a common method for biological laboratories. Blueberry research has entered into the genomic era with the availability of huge genomic data [24]. For example, the molecular mechanism of the cold adaptation of blueberry was studied extensively by using functional genomics methods, especially RNA-seq sequencing technique, and the gene expression analysis under the cold environment [25][26][27][28][29][30][31]. The metabolic related genes of blueberry antioxidant substances were explored by transcriptome analysis [32]. The changes in gene expression profiles of blueberry after infection with Bacillus anthracis were studied by RNA-seq technique [33]. Metabolite profiling showed transcriptional regulation of abscisic acid and flavonoids metabolism during the development of blueberry fruit [34], and candidate genes involved in fruit ripening were identified [35]. The EST sequence database of cultured blueberries (Vaccinium corymbosum) was also established in 2007 [36]. Meanwhile, a reference genome of blueberry (Vaccinium corymbosum with diploid genome) has been published, and researchers can access it by the genome Browser8.5.2 software (http:// bioviz.org/igb/). However, the above studies were only limited to an individual blueberry cultivar, and the genome information about different blueberry cultivars or populations have not been reported yet. Moreover, there are few studies about the exploitation of SSR or SNP markers and haplotype-phased genome assembly of blueberry by genotyping by sequencing (GBS) and whole genome sequencing [37][38][39].
Molecular markers are indispensable tools for marker assisted breeding. The SSR and SNP markers are two attractive and widely used because of many merits including co-dominant, reproducibility, locus-specificity, and random genome-wide distribution in many organisms [40,41]. In this genomic era, the development of SSR and SNP markers by high-throughput next-generation sequencing platform has been popular work and marker assisted breeding has also entered into the genomics era [42][43][44][45]. In the present study, we sequenced the leaves transcriptome of 19 rabbiteye blueberry cultivars, 13 southern highbush blueberry cultivars and 22 cultivars of northern highbush blueberry by using next generation sequencing technologies. Our aims were (1) to collect functional genome information about different blueberry cultivars; (2) to uncover the preliminary molecular mechanism of blueberry adaptation by mining resistance genes; and (3) to develop SSR and SNP markers to assist in the breeding and other corresponding studies about blueberry.

Ethics statement
No specific permissions were required for these locations/activities because all samples were collected from blueberry germplasm nursery of Majiang Blueberry Industry Engineering Technology Center, Guizhou, China. We collected leaves from blueberry cultivars for research, and also confirmed that the field studies did not involve any endangered or protected species.

Plant material and RNA extraction
We extracted the total RNA from the young leaves of 2-3 years old seedlings of 54 blueberry cultivars that were planted at blueberry germplasm nursery of Majiang Blueberry Industry Engineering Technology Center (Wuyangma village, Xuanwei town, Majiang county, Guizhou province, China), including 19 rabbiteye, 22 northern highbush and 13 southern highbush blueberry cultivars (S1 Table). The total RNA from young leaves of all cultivars was extracted by using the Spectrum plant total RNA kit (Sigma-Aldrich-STRN250 MSDS, USA) and strictly followed the guidelines provided by the company. High quality RNA with RIN (RNA integrity number) above 7.0 was used for RNA sequencing.

Library construction and sequencing
High quality total extracted RNAs (A260/A230 of OD value more than 2.0, A260/A280 OD value between 1.8-2.0, electrophoretic bands clear, concentration more than 50ng/μL) were used to construct the paired-end sequencing libraries, and the sequencing was done according to the sequencer provider's instructions as follow: First, the total RNA was treated with DNAse and then separated poly-A-containing mRNA from the total RNA by using poly-T-oligoattached magnetic beads. Second, the purified mRNA sequences were fragmented into approximately 300~500 base length fragments, and these mRNA fragments were used as template to synthetize the first single strand of cDNA, and then the first strand of cDNA was used as template to synthetize the second strand of cDNA. Third, the synthetized double strands were purified and quantified after carrying out the reaction of end repair, A-tailing and adapter ligation. Then the purified cDNA was enriched by a 15-cycle-PCR reaction to complete sequencing library. Finally, paired-end sequencing was conducted on Illumina HighSeq 4000 platform. Raw reads with fastq format have been deposited to NCBI and are available at genbank with ID: PRJNA511922.

Raw data filtering
We obtained the clean reads for further assembly by filtering the raw reads based on the following steps and rules: 1) removing reads containing adapters; 2) removing reads containing more than 10% of unknown nucleotides (N); 3) removing reads containing more than 50% of low quality (Q-value�20) bases.

De novo assembly
Though the genome of a highbush diploid blueberry is available (http://bioviz.org/igb/), but the sequencing coverage and the genome integrity of reference genome is very low. So we assembled the unigenes of three kinds of blueberry crops independently by using program "Trinity", a software package designed specifically for the assembling of short reads without reference genome [46]. The unigenes with a length longer than 201 bp were accounted for statistics and used for further analysis.

Annotation of unigenes
We executed basic annotations including protein functional annotation, pathway annotation, COG/KOG functional annotation and Gene Ontology (GO) enrichment analysis to predict the molecular functions of assembled unigenes. First, we used BLASTx program [47] with an E-value threshold of 1e-5 to hit against the NCBI non-redundant protein database (http:// www.ncbi.nlm.nih.gov), the Swiss-Port protein database (http://www.expasy.ch/sprot), the Kyoto Encyclopedia of Genes and Genomes (KEGG) database [48], and the COG/KOG database [49]. We obtained the protein functional annotation codes of corresponding unigenes according to the best alignment results. Then we performed GO functional annotation of unigenes by using the Blast2GO software [50], and the functional classification of unigenes was done using WEGO software [51].

Identification of resistance genes
We used all assembled unigenes to query the plant resistance genes database (PRGdb; http:// prgdb.org) with an E-value threshold of 1e-5.

Detection of SSR markers and primer designing
We used program MISA (http://pgrc.ipk-gatersleben.de/misa/) to identify SSR markers and designed corresponding primers by using following parameters: (1) motif ranged from 2 to 6 nucleotides; (2) minimum repeat units were six for 2 nucleotide repeat motifs, five for 3 nucleotide repeat motifs, four for 4-6 nucleotide repeat motifs; (3) the maximum interruption length between two SSR markers was set as 100 bp. The program Primer 3 (http://primer3.ut. ee/) was used to design primers with the following criteria: The GC contents of primer sequences were ranged from 40% to 60%, and the size of expected PCR product was ranged from 100 to 250 bp.

SNP calling
We used program tophat v2.0.14 which is built in bowtie software package (http://bowtie-bio. sourceforge.net/index.shtml) to call the original SNPs dataset by setting default parameters. To avoid the false positive mutant loci as much as possible, we filtered the original SNP dataset by following criteria: sequencing quality of SNP loci base reach to Q30, the read depth of opposite base of SNP loci reach to five, minor allele frequency of SNP loci greater than 15%, and SNP found only in annotated single copy unigenes. To identify single copy unigenes, we first executed two-two alignment of all unigenes that belong to different species by using blastp method, and the unigene pairs with E-value lower than 1e-7 of were regarded as homologous genes, and then we clustered unigenes that are homologous to each other into one gene family by running the program of OrthoMCL (http://orthomcl.org/orthomcl/). If a gene family includes only one unigene in each species, then it was regarded as a single copy unigene.

Data statistics and Unigenes assembly
We obtained about 248.26, 139.28, 288.81 million raw reads from leaves transcriptome of 19 rabbiteye, 13 southern highbush, and 22 northern highbush blueberry cultivars by using High-Seq 4000 platform, respectively. After filtering the reads containing adapters, more than 10% of unknown nucleotides and low quality bases (<Q20), 246.84, 138.47 and 286.89 million clean reads with an average of 12.99, 10.65 and 13.04 million clean reads per cultivar were generated (S1 Table). The clean reads were assembled into 45,535, 42,914 and 43,630 unigenes in rabbiteye, southern highbush and northern highbush blueberry cultivars, and the average length of three unigenes clusters were 857 bp, 873 bp and 896 bp, respectively (S2 Table).  Table). Annotation by KOG database showed that most of the unigenes in three kinds of blueberries were involved into "General function prediction only", and reached to 6204 (36.36%), 6154 (35.75%) and 5990 (36.15%), followed by the molecular function of "signal transduction mechanisms" and "posttranslational modification, protein turnover, chaperones", and the number reached to 3451 (20.23%), 3375 (19.61%), 3311 (19.98%) and 3238 (18.98%), 3313 (19.25%), 3227 (19.47%), respectively ( Table 2). According to the annotation results of KEGG database, the unigenes of three kinds of blueberries were associated with 129 metabolism pathways. The top five metabolism pathways were "Plant-pathogen interaction", "Carbon metabolism", "Ribosome", "Protein processing in endoplasmic reticulum" and "Biosynthesis of amino acids" (S5 Table). GO enrichment analysis was used for functional annotation of unigenes, and 17,751, 18,237 and 17,503 unigenes hit 94,620, 97,611 and 94,168 GO terms with an average of 5.33, 5.35 and 5.38 hits per unigene (S6 Table). The "metabolic process" was the main term of "biological process" category, and "cell" and "cell part" terms were enriched in "cellular process" category, while "catalytic activity" and "binding" were significantly enriched in the "molecular function" category (S6 Table).

Detection and statistics of R-Genes
We identified 3599, 3495 and 3513 candidate R-gene unigenes, which belong to more than 15 families in rabbiteye, southern highbush and northern highbush blueberries, respectively.

Detection of SSR markers
We identified 8756, 9020, and 9198 SSR markers from 7251, 7282 and 7518 unigenes from rabbiteye, southern highbush and northern highbush blueberry cultivars. The numbers of SSR kinds with different core motifs exhibited similar distribution patterns in three blueberry Table 2. crops, for example, two repeat motifs accounted for the majority in numbers, and reached to 5829, 6177, 6230, which accounted for 66.57%, 68.48% and 67.73% of total SSR markers in three blueberry crops, followed by 3, 4, 6 and 5 repeat type SSR motifs (Table 4). Of the all kinds of SSR markers with different motifs, "AG/CT" motif was found to be the highest proportion, which accounted for 61.96%, 63.80%, 62.85% of total SSR markers in three blueberry crops, followed by AAG/CTT motif which accounted for 8.0% of total SSR markers in three blueberry crops, while all other motifs accounted for less than 5% of total SSR markers in three blueberry crops. Most of the SSR markers were found to be suitable for sequence information to design primers (S7 Table).

Identification of SNPs
After using a strict filtering procedure, we identified 7665, 4861, 13,063 SNPs in leaf's transcriptome of rabbiteye, northern highbush and southern highbush blueberry cultivars, respectively (S8 Table). Among these SNPs, base mutants with transitions were 1.90, 1.93 and 1.93 times of transversion, and G/A, C/T mutant patterns were much higher than other mutants, and the numbers reached to 2647, 1770, 4580, and 1560, 980, 2413 that accounted for 34.53%, 36.41%, 35.06% and 20.35%, 20.16%, 18.47% of total SNPs in rabbiteye, northern highbush and southern highbush blueberry cultivars, respectively (Fig 1). The minor allele frequency of these SNPs were in the range of 0.15-0.50, and if we divided these values into seven intervals with 0.05 per interval, the minor allele frequency of most of the SNPs fall into 0.35-0.40 in rabbiteye and northern highbush blueberry, and 0.20-0.25 in southern highbush blueberry cultivars (Fig 2). The heterozygosity ratio of all the SNPs was in the range of 0.00-0.80 in three blueberry crops, and if we divided these heterozygosity values into 9 intervals with 0.10 per interval, most of the SNPs fall into the range of 0.4-0.5 in rabbiteye blueberry and northern highbush blueberry and 0.3-0.4 in southern highbush blueberry (Fig 3). Detection of SSRs and high-quality SNPs in blueberry

Discussion
In the last decade, genomics research based on high-throughput sequencing for fruit crops had made a dramatic progress, and the reference genome of more than ten fruit crops and huge RNA data have been published. These investigations have greatly promoted the studies of molecular biology, evolution genetics and breeding program of fruit crops [52][53][54][55]. Recently, the genome of blueberry (Vaccinium corymbosum with diploid genome) has been published, and researchers can access it by the genome Browser8.5.2 software (http://bioviz.org/igb/). However, the assembly integrity and sequencing coverage of this reference genome is very low [36]. Therefore, we assembled the transcriptome with a method of no reference genome to get more information about gene functions in this study. In spite of a lot of genome information or transcriptome sequences have been deposited in Genebank [24][25][26][27][28][29][30][31][32][33][34][35][36], the reports on SSR or SNP markers at large scale are limited. In this study, we developed more than 8000 SSRs and 4000 high-quality SNPs markers in three kinds of blueberry crops based on the transcriptome data, and this would offer great help for the blueberry studies about molecular genetics, molecular breeding and association analysis that mainly rely on the molecular tools. Plants have evolved a wide range of defense mechanisms to protect themselves against pathogens, and the major defense mechanisms are disease resistance which commonly mediated by semi-dominant or dominant R genes that encode receptors and detect pathogen infection either by recognition of pathogen effector molecules directly, or by recognition of effector modified host targets indirectly [56,57]. Crops are the plant groups that offer basic sources of energy and nutrition for human survival. There are number of factors that reduces the global crop yield, such as huge number of plants grown together, inadequate supply of fertilizer and water, and plants of a crop are more susceptible to a large number of pathogens, including bacteria, insects, oomycetes, and nematodes [58,59]. So, developing disease-resistant varieties by different methods, such as genetic transformation of plant resistance genes, are believed to be a good choice to protect crops from diseases, insects and pests. Identifying plant resistance genes and R-gene loci are the basic premise to assemble various resistance sources effectively and to engineer new strategies for disease resistance in agriculture [60,61]. In this study, we identified about thousands of unigenes that were homologous with R-gene that belong to more than 13 families, and this would offer the molecular information to understand the Detection of SSRs and high-quality SNPs in blueberry ecological adaption of blueberry. At the same time, these unigenes information also offer the basic molecular tools for resistance breeding.
In the past decade, single nucleotide polymorphisms (SNPs) have become a popular and conventional choice of genetic marker, especially for diploid species by high-throughput sequencing method [62][63][64][65]. However, identification of SNPs in polyploids is more challenging because of complex genome duplication events which incurring homologous SNPs (polymorphic positions occurring across subgenomes within and among individuals). SNP markers were produced at large scale by next generation sequencing platform in few polyploid species by using different methods to filter false positives [66]. For example, to filter out false positives as much as possible, the SNPs from uniquely mapping reads or the reads depth more than three have been used in transcriptome data of B. napus [67,68], and SNPs from these strategies have been successfully used for genome-wide association studies [69]. Another SNP filtering strategy was successfully used in potato by combining of read depth, quality and SNP density of transcriptome sequence [70]. Besides, a Network-Enabled Analysis Kit (UNEAK pipeline) implemented in the TASSEL-GBS software program (https://bitbucket.org/tasseladmin/tassel-5-source/wiki/Tassel5GBSv2Pipeline) has been developed and proven to be effective for the identification of SNPs in complex species such as switchgrass [71]. The conclusion drawn from above successful cases is that high-quality SNPs can be identified in even the most difficult polyploid species.
Most cultivated blueberry cultivars are polyploid, for example, lowbush blueberry is tetraploid [68], northern highbush blueberry is 2x, 4x and hexaploid (6x), and 3x and 5x are produced by hybridization [72,73], and rabbiteye blueberry is hexaploid [74]. The southern highbush and inter-highbush are also generated by crossing with northern highbush and other species, and both are polyploid [75]. To overcome the adverse effects incurred by complex genome duplication events, we further filtered out the original SNP dataset, which was generated by using program tophat v2.0.14 with default parameters. We systematically considered the status of SNP loci by sequencing quality, read depth, minor allele frequency, annotation statistics, and only from annotated single copy unigenes (a gene family includes only one unigene in a species). We believed that final SNP datasets are reliable molecular tools for the association studies and marker assisted breeding of blueberry.
Supporting information S1