Analysis of Litopenaeus vannamei Transcriptome Using the Next-Generation DNA Sequencing Technique

Background Pacific white shrimp (Litopenaeus vannamei), the major species of farmed shrimps in the world, has been attracting extensive studies, which require more and more genome background knowledge. The now available transcriptome data of L. vannamei are insufficient for research requirements, and have not been adequately assembled and annotated. Methodology/Principal Findings This is the first study that used a next-generation high-throughput DNA sequencing technique, the Solexa/Illumina GA II method, to analyze the transcriptome from whole bodies of L. vannamei larvae. More than 2.4 Gb of raw data were generated, and 109,169 unigenes with a mean length of 396 bp were assembled using the SOAP denovo software. 73,505 unigenes (>200 bp) with good quality sequences were selected and subjected to annotation analysis, among which 37.80% can be matched in NCBI Nr database, 37.3% matched in Swissprot, and 44.1% matched in TrEMBL. Using BLAST and BLAST2Go softwares, 11,153 unigenes were classified into 25 Clusters of Orthologous Groups of proteins (COG) categories, 8171 unigenes were assigned into 51 Gene ontology (GO) functional groups, and 18,154 unigenes were divided into 220 Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways. To primarily verify part of the results of assembly and annotations, 12 assembled unigenes that are homologous to many embryo development-related genes were chosen and subjected to RT-PCR for electrophoresis and Sanger sequencing analyses, and to real-time PCR for expression profile analyses during embryo development. Conclusions/Significance The L. vannamei transcriptome analyzed using the next-generation sequencing technique enriches the information of L. vannamei genes, which will facilitate our understanding of the genome background of crustaceans, and promote the studies on L. vannamei.


Introduction
Pacific white shrimp, Litopenaeus vannamei, formerly Penaeus vannamei, belongs to the Penaeidae family of decapod crustaceans, is the major species of farmed shrimps in the world [1]. Because of its great economic value and important evolutionary status, more and more studies have been focused on breeding, growth, development, immunity, genetics and evolution of L. vannamei in recent twenty years [2][3][4][5][6]. The L. vannamei genome with higher recombination rates than other genomes of closely related penaeid prawns is predicted to be 2.0 Gigabases and has not been sequenced up to now [7,8]. The absence of a fully sequenced and assembled genome hindered the studies on L. vannamei, including determination of gene functions and regulations, and establishment of novel genetic manipulation technologies.
Many transcriptome studies of L. vannamei have been carried out and a large number of expressed sequence tags (ESTs) were obtained using cDNA library and Sanger sequencing methods. By May 2012, 162,993 ESTs from many organs and tissues have been released on Genbank. These data have been used for cloning functional genes, selecting genetic molecular markers, and designing cDNA microarrays [2,[9][10][11]. However, because of the limitations of the traditional methods used for ESTs sequencing, the now available transcriptome data of L. vannamei are still insufficient for research requirements relative to the size of its genome. Moreover, the now available EST sequences have not been systematically analyzed. Up to now, only 7968 unigenes have been assembled and annotated, largely limiting the use of the ESTs sequence data.
The next-generation high-throughput DNA sequencing techniques, such as Solexa/Illumina (Illumina), 454 (Roche) and SOLiD (ABI), have been developed and growing rapidly in recent years [12]. They can sequence millions of DNA fragments simultaneously and provide Gigabases of data with high fidelity in a single machine run, greatly improving work efficiency and increasing data output [13]. The enormous advantages of these technologies make them admirably suited for genomics research, such as de novo and re-sequencing of genome, mRNA and microRNA [14][15][16][17][18]. Especially in transcriptome analysis, the usage of the next-generation sequencing techniques make it no longer necessary to establish cDNA libraries with bacteria cells as carriers, which could introduce DNA fragments deletion during the cloning process [19].
In this study, we analyzed the transcriptome of whole bodies of L. vannamei larvae using Solexa/Illumina high-throughout sequencing method, providing over 2.4 Gb data of raw sequences, which were assembled into 109,169 unigenes. We further annotated the unigenes by matching against Nr, Swissprot, Clusters of Orthologous Groups of proteins (COG), Gene ontology (GO), and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases. Part of the results of unigene assemblies and annotations were primarily verified by RT-PCR, gel electrophoresis, Sanger sequencing, and real-time PCR. The assembled and annotated unigenes provide useful information for the studies on genomes and functional genes of L. vannamei and crustaceans.

Breeding of L. vannamei
Newly hatched L. vannamei larvae from a relatively high-WSSVresistant family [20], obtained from Hengxing (Evergreen) shrimp farm in Zhanjiang, China, were fed with larvae of Chirocephalus diaphanus and bred in seawater of pH 8.0 and 30 g kg 21 salinity in a 3 m 3 indoor tank with recirculating and filtering units at 28uC.

Illumina Sequencing
One hundred L. vannamei larvae at 20 days post spawning were fasted for 24h to avoid contamination from the feed Chirocephalus diaphanous larvae and then sampled. Total RNA was extracted using the SV Total RNA Isolation System (Promega) according to the manufacturer's protocol. The total RNA concentration, quality and integrity were determined using a NanoDrop 1000 spectrophotometer and an Agilent 2100 Bioanalyzer. The total RNA with absorbance 260/280 nm Ratio of ,2.0 was chosen and treated with DNase I prior to library construction. Poly-(A) mRNA was then purified using oligo(dT) magnetic beads and fragmented by treatment with divalent cations and heat, followed by reversetranscribing into cDNA using reverse transcriptase and random hexamer-primers. The second-strand cDNA was synthesized using RNaseH and DNA polymerase I. The double-stranded cDNA was end-repaired using T4 DNA polymerase, Klenow fragment, and T4 polynucleotide kinase followed by a single (A) base addition using Klenow 39 to 59 exo-polymerase, and further ligated with an Illumina PE adapter oligo mix using T4 DNA ligase. Adaptormodified fragments with length of 200625 bp were selected by gel purification and subjected to PCR amplification as templates. After 15 cycles of PCR amplification the libraries were sequenced on an Illumina sequencing platform (GAII) and the raw reads were generated using Solexa GA pipeline 1.6.
Transcriptome De novo Assembly and Analysis Fig. 1 shows the workflow of transcriptome assembly and analysis. The raw reads of Illumina sequencing were preprocessed by removing adaptor sequences, low-quality reads (reads with ambiguous bases N), and duplication sequences, and were then assembled using the SOAP denovo software (http://soap. genomics.org.cn/soapdenovo.html) with the default settings. Firstly, the clean reads are combined by SOAP denovo based on sequence overlap to form longer fragments without N, which are called contigs, and then the reads are mapped back to contigs. Next, scaffolds were made using SOAP denovo by connecting the contigs with N to present unknown sequences between each two contigs in a same transcript. Scaffolds' gaps can be filled by pairedend reads of sequencing to get sequences with least Ns and cannot be extended on either end. Such sequences are defined as unigenes, and the following analysis are based on them [15,21,22].
RT-PCR and Real-time PCR Assays 12 annotated unigenes that may relate to embryo development were selected to be analyzed using real-time PCR and RT-PCR, and their specific primers were listed in Table 1. For real-time PCR assays, Parent prawns mated and laid eggs in a 3 m 3 tank at 28uC with seawater of pH 8.0 and 30 g kg 21 salinity. The timing of sampling was according to the embryogenesis stages, which were determined by microscopic examination and reference to previous studies [31,32]. 60 fertilized eggs per sample were collected at 0, 140, 215, 275, 480, 600 and 660 minutes postspawning (mps), respectively. The spawning and sampling experiments were repeated for three biological replicates. Total RNA of each sample was isolated with TRIzol (Invitrogen, USA) and subjected to DNase I treatment (Promega, USA) according to the manufacturer's protocols. The cDNA was synthesized with SuperScript II RT (Invitrogen, USA), and quantitative real-time PCR was triply performed using the LightCycle 480 System (Roche, Germany) with a volume of 10 ml contained 1 ml of 1:10 cDNA diluted with ddH 2 O, 5 ml of 26SYBRGreen Master Mix (Toyobo, Japan), and 250 nM of each primer. The cycling parameters were 95uC for 2 min to activate the polymerase, followed by 40 cycles of 95uC for 15 s, 62uC for 1 min, and 70uC for 1 s. Cycling ended at 95uC with 5uC/s calefactive velocity to create the melting curve. Fluorescence measurements were taken at 70uC for 1 s during each cycle. Expression levels of each gene were normalized to 18S ribosomal RNA (18S rRNA, GenBank accession number: AF186250, 18S-rRNA-qF and 18S-rRNA-qR primers: 59-CTGCGACGCTAGAGGTGAAATTC-39 and 59-AGGTTAGAACTAGGGCGGTATCTG-39). Data were calculated with the relative quantification method described by Muller et al. [33], and subjected to statistical analysis.
For RT-PCR assays, the 12 unigenes were amplified by LA Taq DNA polymerase (TaKaRa, Japan) using cDNA from L. vannamei larvae at 20 days as template. The PCR products were analyzed using agarose gel electrophoresis, and subcloned into the PMD19-T vector (TaKaRa, Japan) for Sanger sequencing.    Table 2, and the size distributions (.500 bp) of Contigs, Scaffolds, and Unigenes were showed in Fig. 2.

Illumina Sequencing and Sequence Assembly
To assess the abundance and coverage of the transcriptome data, we matched the assembled unigenes against the known EST library on Genbank. The 162,926 ESTs downloaded from NCBI were clustered and assembled using TGICL and Phrap, and 25,813 assembled EST-unigenes with mean length of 681 bp and N50 length of 756 bp were generated (http://marine.sysu.edu.cn/ Download/download/id/68). Comparisons between transcriptome unigenes and EST-unigenes were performed using BLASTn algorithm. Results were shown in Fig. 3 as a Venn chart and further detailed in Table S1. 60.1% (15,

Annotation of Unigenes
After ruling out short-length or low-quality sequences, 73,505 unigenes with a minimum length of 200 bp were selected and subjected to annotation analysis by matching sequences against Nr, Swiss-prot and TrEMBL databases using BLASTx searching with an E value,0.00001. 27,789 unigenes (37.80% of the total) can be matched in Nr database (Table S2)

COG, GO and KEGG Classification
The assembled unigene sequences were subjected to BLAST searching against GO, COG and KEGG databases, and the summary statistics of BLAST assignment was shown in Table 3.
COG is a database where orthologous gene products were classified. Every protein in COG is assumed to be evolved from an ancestor protein, and the whole database is built on coding proteins with complete genome as well as system evolution relationships of bacteria, algae and eukaryotes [27,29]. Phylogenetic classifications of the predicted CDSs of unigenes were analyzed by searching against COG database to predict and classify possible functions of the unigenes (Fig. 4). Possible functions of 11,153 unigenes were classified and subdivided into 25 COG categories (Table S5), among which the cluster for 'General function prediction only' represents the largest group (2002, 17.95% of the matched unigenes) followed by 'Translation, ribosomal structure and biogenesis' (929, 8.33%) and 'Posttranslational modification, protein turnover, chaperones' (830, 7.44%). The following categories: 'extracellular structures' (6, 0.05%), 'nuclear structure' (9, 0.08%) and 'RNA processing and modification' (71, 0.64%), represent the smallest groups.
GO is an international standardized gene functional classification system which offers a dynamic-updated controlled vocabulary and a strictly defined concept to comprehensively describe properties of genes and their products in any organism [25,26]. Based on the results of Nr annotation, the Gene Ontology (GO) annotations of unigenes were generated using the BLAST2GO program, and the GO functional classifications were performed using WEGO software to understand the distribution of gene functions of L. vannamei from the macro-level (Fig. 5). 45,601 GO term annotations corresponding to 8171 unigenes were produced and assigned into 51 functional groups and three categories, among which 22,268 were assigned in biological process category, 15,403 in cellular component category and 7930 in molecular function category (Table S6). Among the biological process category, 'cellular process' (17.65%) and 'metabolic process' (14.45%) biological regulation were most highly represented, and other unigenes were categorized into other 25 important biological process, including 'biological regulation' (7.53%), 'multicellular organismal process' (7.02%), 'localization'(6.75%), 'developmental process' (6.55%), 'regulation of biological process' (6.35%), 'cellular component organization or biogenesis' (5.64%), and so on. 11 GO functional groups were assigned into the cellular component category, among which 'cell' (32.88%) and 'cell part' (29.66%) were most highly represented. Similarly, 13 GO functional groups were assigned into the molecular function category, among which 'catalytic activity' (42.86%) and 'binding' (40.86%) were most highly represented.
The KEGG pathway database records networks of molecular interactions in the cells and variants of them specific to particular organisms. Pathway-based analysis helps us to further learn biological functions of genes [27,34,35]. To systematically analyze their inner-cell metabolic pathways and complicated biological behaviors, we classified the unigenes into biological pathways by mapping the annotated CDS sequences to the reference canonical pathways in the KEGG database (Fig. 6). 18,154 unigenes were consequently assigned to 220 KEGG pathways (Table S7), among which 2285 members assigned to 'metabolic pathways', followed by 'Spliceosome' (1007 members), 'Regulation of actin cytoskeleton' (926 members), 'Amoebiasis' (848 members), 'Pathways in Table 3. Summary statistics of L. vannamei transcriptome blast assignment.

RT-PCR and Real-time RT-PCR Assays
To primarily verify the results of assemblies and annotations, 12 assembled unigenes that are homologous to many embryo development-related genes, including three Hox family transcriptional regulators abdominal-A, abdominal-B, and homeotic antennapedia (Antp), three Wnt signaling pathway genes Wnt6, Wnt10 and beta-catenin, two Puf Family RNA-binding translational regulators pumilio (PUM) and pumilio homolog 2 (PUM2), a NF-kB family gene Dorsal, a Zn-finger transcription factor Spalt, a Polycomb group (PcG) gene extra sex combs (esc), and histone cell cycle regulation defective homolog A (HIRA), were chosen and subjected to RT-PCR and real-time PCR analyses (Table 1). Full lengths of these unigenes were amplified by RT-PCR using specific primers designed based on the assembly results. The amplicons were analyzed using agarose gel electrophoresis (Fig. 7) and Sanger sequencing (Table S8), which confirmed their lengths and sequences, suggesting the faithful assembly of transcriptome data.
The sampled L. vannamei embryos were examined under microscope to determine their embryogenesis stages (Fig. 8). To verify their annotations, expression profiles of the 12 unigenes during embryonic development were further detected using realtime RT-PCR (Fig. 9). The Strigamia maritima abdominal-A homolog unigene98122 demonstrated a periodic expression profile with the first peak of 10.14-fold increase at 215 mps and the second peak of 18.54-fold increase at 480 mps, while the Strigamia maritima abdominal-B homolog unigene10296 and Culex quinquefasciatus Antp homolog unigene54158 showed similar expression profiles with peaks at 215 mps of 6.01-fold and 11.52-fold increase levels, respectively. The Wnt6 homolog unigene16317 peaked at 215 to 275 mps with 4.50-4.04-fold, followed by a sharp decrease at 480 mps, and the Wnt10 homolog unigene18019 kept increasing after spawning and reached a peak at 600 mps with 7.89-flod increase. The unigene105360, similar to beta-catenin, the key molecule of the Wnt pathway, peaked at 215 mps with 2.38-fold and returned to baseline levels after 480 mps. The expression of the Apis mellifera PUM homolog unigene92779 was up-regulated during 0-215 mps, and then fell to a low level at 480 mps. The Saccoglossus kowalevskii PUM2 homolog uni-gene99210 exhibited a 5.49-fold increase at 215 mps, and then returned to the basal level at 275 mps where it remained unchanged. The expression of the unigene95206, the dorsal gene of Litopenaeus vannamei, was up-regulated periodically, with the first peak of 6.15-fold at 215 mps and the second peak of 6.82-fold at

Discussion
Many members of crustacean are of great economic value and important evolutionary status. Up to now, the Daphnia pulex genome is the only one sequenced in the subphylum Crustacea, phylum Arthropod [36]. Many transcriptomes of crustaceans have been analyzed using traditional Sanger sequencing and cDNA microarray method, including Eriocheir sinensis [37], Portunus pelagicus [38], Petrolisthes cinctipes [39], Penaeus monodon [40], Penaeus japonicas [41], Daphnia magna [42], Daphnia pulex [43], and so on. In recent years, the next generation sequencing methods have also been applied to analyze transcriptomes of crustaceans, such as Balanus Amphitrite [44], Euphausia superb [45], Macrobrachium rosenbergii [46] and Parhyale hawaiensis [47] using 454 sequencing, and Eriocheir sinensis [48,49] using Illumina sequencing. Comparing with the traditional methods, the next-generation high-throughput DNA sequencing techniques provide more ideal methods for transcriptome analyses with high efficiency, low cost and high data output. The development of DNA sequencing technology will facilitate the studies on crustaceans' gene background.
In this study, using the Illumina sequencing method to analyze the trancriptome of L. vannamei, more than 2.4 Gb of raw data were generated, and 882,339 contigs (.75 bp) were assembled, largely enriching the transcriptome data of L. vannamei and prompting the genome studies of crustaceans. The former studies on L. vannamei transcriptome were performed using traditional cDNA library and Sanger sequencing methods with RNA from many organs such as muscle, blood and hepatopancreas. In our study, RNA used for transcriptome analysis was exacted from whole bodies of L. vannamei larvae, covering all tissues of the species, which could include fuller transcriptional genes of L. vannamei. We compared our transcriptome data with L. vannamei EST sequences obtained from NCBI and showed that more than half of the EST sequences (60.1%) can be matched in the transcriptome data, whereas up to 85.8% of the transcriptome unigenes can not be found in the ESTs library. It suggests the transcriptome data provide abundant information besides the now available ESTs sequences.
Although providing much more data throughout than traditional Sanger sequencing method, the reading lengths of the raw data of the Illumina GAII system are quite short. Up to 79.64% of the obtained contigs are less than 100 bp. The SOAP denovo method, a relative mature technique based on the short oligonucleotide analysis package (SOAP) algorithm, was adopted to process the sequencing data, and 73,505 unigenes (.200 bp) with good quality sequences were assembled and subjected to annotation analyses. There are more unigenes showed similarities to H. sapiens than other arthropods species such as T. castaneum, D. mojavensis, and H. saltator, which are phylogenetically closer to L. vannamei than human. It is maybe because the now available information on gene background of crustaceans and arthropods is limited, and human genes have been much better studied than other species, providing sufficient gene sequences and annotations for comparison analyses. Only 31.83% of the total analyzed unigenes (84.2% of the Nr database-matched) showed similarities to D. melanogaster, a well studied model animal in the Insecta class, phylum Arthropod, maybe because the genome size of L. vannamei is almost 12 times more than that of D. melanogaster [50] and there might be somewhat different between their gene backgrounds. Further investigation should be required to determine whether protein sequences in crustaceans may have divergence from other animals. With more genes from crustaceans being studied and   more gene background information being available, unigenes of L. vannamei obtained in this study will be further annotated. Moreover, since it has been reported that there are limitations of the next-generation sequence de novo assembly, which could cause missing of the duplicated sequences [51], the possibilities of missing of the encoding sequences in the assembled unigenes might not be excluded. It might also lead to miss-matching of the L. vannamei unigenes to protein databases of other species. With the development of sequencing methods and short-read assembly algorithms, further analyses will be performed to rearrange the now available L. vannamei transcriptome data and improve their annotations.
Possible functions of the assembled unigenes were analyzed by matching to GO, COG and KEGG databases. Although only a small part of the assembled unigenes were functional annotated, the results of these three databases searching help us learn more about biological features of L. vannamei. For example, 748 unigenes can be classified into the 'Vibrio cholerae infection' KEGG pathway, such as protein disulfide isomerase (PDI, unigene15355), ADP ribosylation factor (ARF, Unigene87975) and transport protein Sec61 (Unigene100471), which have been reported that involve in Vibrio infection in arthropods [52][53][54][55]. It indicated that as an animal living in water, L. vannamei may always deal with the challenges from bacteria in water, and may have evolved complicated systems and signal pathways against infection. Interestingly, although no cancer has ever been reported in Arthropoda animals, our results showed that 843 unigenes can be classified into 'Pathways in cancer' KEGG pathway. Cancer is a disease of aberrant multicellularity, and its hallmarks are thought to be intimately associated with those of metazoan multicellularity [56,57]. It has been reported that many 'multicellularity' genes of Amphimedon queenslandica were also implicated in cancer, suggesting the remote origin of cancer and oncogenes [56,57]. The KEGG 'Pathways in cancer' annotations of these L. vannamei unigenes in this study also provided supports for the theme that cancer origin may be related to evolution of multicellularity. Further studies should be performed to confirm this hypothesis.
The 12 unigenes subjected to RT-PCR and real-time PCR analyses were chosen based on the results of annotations, and the primer-pairs were designed based on the sequences of the unigenes assembled with contigs. The products of RT-PCR were sequenced using Sanger sequencing method, which confirmed the sequences of these unigenes, suggesting the faithful results of Illumina sequencing and assemblies. Furthermore, as real-time PCR showed that the expression profiles of these assembled unigenes regularly varied following the developmental stages of embryos, we think the qualities of the annotations of the unigenes may be enough for the requirements of studies on functional genes of L. vannamei.
With the development of sequencing techniques, the data of nucleic sequences boosted every day. As more data will be obtained from other species, the assembled unigenes in this study will be further annotated and analyzed. The data of the annotated unigenes are worthy of deeper mining and further analyzing. It will facilitate our understanding of the genome background of crustaceans, and promote the studies on genetics, functional genes, and gene regulations of L. vannamei.

Supporting Information
Table S1 Comparisons between assembled transcriptome unigenes and the EST-unigenes which were assembled form EST sequences available from Genbank using TGICL and Phrap softwares. (ZIP)