Expressed Sequence Tags Analysis and Design of Simple Sequence Repeats Markers from a Full-Length cDNA Library in Perilla frutescens (L.)

Perilla frutescens is valuable as a medicinal plant, a natural medicine, and a functional food. However, comparative genomics analyses of P. frutescens are limited by a lack of gene annotation and characterization. A full-length cDNA library from P. frutescens leaves was constructed to identify functional gene clusters and candidate EST-SSR markers via analysis of 1,056 expressed sequence tags (ESTs). Unigenes were assembled, searched for homologs using the basic local alignment search tool (BLAST), and annotated with Gene Ontology (GO) terms. Primer pairs were designed for a total of 18 simple sequence repeats (SSRs). This study is the first to report comparative genomics and EST-SSR markers from P. frutescens; these resources will help gene discovery and provide an important source for functional genomics and molecular genetic research in this interesting medicinal plant.


Introduction
Perilla frutescens (L.) is a self-compatible annual herb known as the beefsteak mint plant. It is cultivated in East Asian countries, including Japan, China, and Korea, and is an economically important crop in the medicinal herb family, Lamiaceae [1]. Its seeds can be processed into foods and nutritional edible oils, and its leaves can be used as a traditional medicinal herb or as a flavoring for vegetables [2,3]. Perilla oil contains abundant polyunsaturated fatty acids (PUFAs), including linolenic (56.8%) and linoleic (17.6%) acids, and is used as a salad oil or for cooking [4,5]. The flavor and odor of perilla are caused by essential oils composed of terpenoids, including monoterpenoids and sesquiterpenoids, which are used commercially as natural fragrances or flavorings [6]. The perilla leaf contains a number of chemical variants of the volatile essential oil, classified as PA-type (mainly perillaldehyde), EK-type (elsholtziaketone), PK-type (perilla ketone), PL-type (perillene), PP-type (phenylpropanoids), and PT-type (piperitenone) [7]. Perilla has been described as an important pharmaceutical plant with anti-inflammatory, anti-allergic, and broad antioxidant functions [8,9].
Expressed sequence tags (ESTs) are fragments of expressed genes obtained by single-pass sequencing of cDNA libraries [10]. EST databases are sources of SSRs that can be developed as ortholog-specific EST-SSR markers for genotyping applications in many plant species [11-18]. As a molecular tool, EST-SSRs are highly important for population genetic studies [19]. They can serve as functional markers located in open reading frames (ORFs) or in 5′- or 3′-untranslated regions (UTRs) and may exert a phenotypic effect [20]. One advantage of EST-SSRs is that they are more transferable across closely related genera than anonymous SSRs from UTRs or noncoding sequences. Therefore, EST-SSRs are well suited to studying polymorphisms and genetic diversity [21,22]. EST-derived SSRs have been reported in various plant species, including Arabidopsis thaliana, cacao, and sugarcane [23-25]. EST-SSRs also provide a new source for genetic and evolutionary studies based on homology searches of putative SSR functions [26].
In this study, we developed a full-length enriched cDNA library from P. frutescens leaves. EST sequence analysis allowed for genome annotation and gene ontologies and the identification of EST-SSR markers for genomic tool development in this less-well-studied medicinal plant species. These results provide useful and multipurpose data for further studies on P. frutescens.

Plant Materials.
Seeds of P. frutescens were harvested at the affiliated farm of Kangwon National University (Republic of Korea) in each year that accessions were collected and were grown in pots filled with commercial soil (GFC, Hongseong, Republic of Korea) in a greenhouse with a photoperiod of 16-hour light/8-hour dark at 25 °C under well-watered conditions. Leaves were sampled for RNA isolation.

RNA Extraction, cDNA Library Construction, Plasmid DNA Extraction, and Sequencing.
Leaves were ground with a pestle in liquid nitrogen, and the ground tissue was used for RNA isolation by the Trizol method [27]. Total RNA was stored at −80 °C until use. The full-length cDNA library was constructed using the Creator SMART cDNA Construction Kit (Clontech Laboratories, CA, USA). Concentrations of isolated RNAs were measured using a NanoDrop spectrophotometer (Thermo Fisher Scientific, Wilmington, DE, USA), and the RNA was then used for first-strand cDNA synthesis. Second-strand cDNA was purified with QIAquick (Qiagen, Venlo, Netherlands), ligated into the pTripleEX vector, and transformed. Plasmid DNA was extracted using a Multiscreen Plasmid Extraction Kit (Millipore) and purified. The cDNA library was amplified using a GeneAmp PCR System 9700-384 (Applied Biosystems, CA, USA), and DNA clones were sequenced by single-pass sequencing from the 5′ ends of the cDNA.

Assembly and Annotation.
Because genome and gene information were unavailable, assembly was performed without clustering. In the pretreatment process, PHRED was used to convert peak information into a quality file and to trim low-quality bases. Vector sequences were trimmed using Cross_match (http://www.macvector.com/Assembler/trimmingwithcrossmatch.html). Chimeric clones, poly-A tails, and sequences shorter than 100 bp were removed with Seqclean. Assembly was performed with CAP3 software. Contigs were manually checked and, together with singlet reads, compiled to generate a final unigene file. In total, 412 unigene sequences, comprising 69 contigs and 343 singletons, were obtained from 1,000 ESTs.
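The read-cleaning step and the unigene tally described above can be illustrated with a minimal Python sketch. This is not Seqclean's algorithm: the function names and the `min_run` poly-A threshold are hypothetical, and only the 100 bp cutoff and the contig and singleton counts come from the text.

```python
import re

MIN_LENGTH = 100  # bp; sequences shorter than this are removed (from the text)


def trim_poly_a(seq, min_run=10):
    """Strip a trailing poly-A run of at least min_run bases.

    min_run is an assumed value chosen for illustration only.
    """
    return re.sub(r"A{%d,}$" % min_run, "", seq.upper())


def clean_reads(reads):
    """Return reads that are still >= MIN_LENGTH bp after poly-A trimming."""
    cleaned = {}
    for name, seq in reads.items():
        trimmed = trim_poly_a(seq)
        if len(trimmed) >= MIN_LENGTH:
            cleaned[name] = trimmed
    return cleaned


def unigene_count(n_contigs, n_singletons):
    """Unigenes are the union of assembled contigs and unassembled singletons."""
    return n_contigs + n_singletons


if __name__ == "__main__":
    toy = {"long_read": "ACGT" * 30 + "A" * 20, "short_read": "ACGT" * 10}
    print(sorted(clean_reads(toy)))
    print(unigene_count(69, 343))
```

With the counts reported here, 69 contigs plus 343 singletons gives the 412 unigenes stated in the text.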
Unigenes were searched against the NCBI nonredundant nucleotide (NT) and protein (NR) databases (http://www.ncbi.nlm.nih.gov/), the UniProt Swiss-Prot database (http://www.ebi.ac.uk/), and BLAST2GO (https://www.blast2go.com/) for functional annotation using the BLAST alignment tool. A sequence was considered a significant match when the BLAST probability value (E-value) was less than 1e-5, and the match with the most significant E-value was taken as the best annotation. A BLASTx search was also conducted against the UniProtKB/Swiss-Prot database (http://www.ebi.ac.uk/) using default parameters. Unigenes were further annotated with GO terms (http://geneontology.org/).
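The best-hit rule above (E-value below 1e-5, most significant hit retained) can be sketched as follows. The sketch assumes the BLAST results were saved in the standard 12-column tabular layout, in which the query ID, subject ID, and E-value are columns 1, 2, and 11; the text does not state the output format actually used, and the function name is hypothetical.

```python
E_VALUE_CUTOFF = 1e-5  # significance threshold stated in the text


def best_hits(tabular_lines, cutoff=E_VALUE_CUTOFF):
    """Map each query to its (subject, evalue) hit with the lowest E-value.

    Assumes tab-separated lines in the standard 12-column BLAST tabular
    layout: qseqid, sseqid, ..., evalue (column 11), bitscore (column 12).
    """
    best = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue = float(fields[10])
        if evalue >= cutoff:
            continue  # not a significant match
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best


if __name__ == "__main__":
    rows = [
        "u1\tsA\t90.0\t100\t5\t0\t1\t100\t1\t100\t1e-20\t200",
        "u1\tsB\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-40\t300",
        "u2\tsC\t50.0\t100\t5\t0\t1\t100\t1\t100\t0.5\t30",
    ]
    print(best_hits(rows))
```

Queries with no hit below the cutoff, such as `u2` in the toy input, are simply absent from the result and would be treated as unannotated.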

Primer Design for EST-SSR Markers.
A total of 1,000 ESTs from the 1,056 samples obtained from the cDNA library were analyzed for SSRs using TRF version 4.07b online software (http://tandem.bu.edu/trf/trf.html). SSRs that fit the following criteria were considered for primer design: a minimum length of 18 bp with minimum repetitions for di-, tri-, tetra-, penta-, and hexa-4 and 4, respectively. Primers were designed using Primer 3 (http://www.premierbiosoft.com/primerdesign/) according to the following core criteria: a primer length ranging from 18 bp to 22 bp, with 20 bp as the optimum; a product size ranging from 100 bp to 400 bp; a melting temperature between 50 °C and 62 °C, with 60 °C as the optimum; and a GC content between 40% and 60%, while avoiding mismatches, hairpin structures, and primer dimers that can cause nonspecific amplification.
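The selection criteria above can be illustrated with a short Python sketch. It is a simplified stand-in, not the TRF or Primer 3 algorithms: SSRs are found with a regular expression using only the 18 bp minimum length (per-motif repeat minimums are not modelled), and the melting temperature uses the rough Wallace rule, Tm = 2(A+T) + 4(G+C), as an assumption in place of a thermodynamic model. All function names are hypothetical.

```python
import re


def is_primitive(motif):
    """True if the motif is not itself a repeat of a shorter unit (e.g. AGAG, AA)."""
    return (motif + motif).find(motif, 1) == len(motif)


def find_ssrs(seq, min_total_len=18, motif_sizes=(2, 3, 4, 5, 6)):
    """Return (start, motif, repeat_count) for perfect repeats >= min_total_len bp."""
    seq = seq.upper()
    hits = []
    for k in motif_sizes:
        min_repeats = -(-min_total_len // k)  # ceiling division
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_repeats - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            if is_primitive(motif):  # avoid reporting (AG)n again as (AGAG)n
                hits.append((m.start(), motif, len(m.group(0)) // k))
    return hits


def gc_percent(seq):
    """Percentage of G and C bases in the sequence."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def wallace_tm(primer):
    """Rough Tm estimate (Wallace rule); an assumption, not the Primer 3 model."""
    p = primer.upper()
    return 2 * (p.count("A") + p.count("T")) + 4 * (p.count("G") + p.count("C"))


def primer_passes(primer):
    """Check the length, GC-content, and Tm windows given in the text."""
    return (18 <= len(primer) <= 22
            and 40 <= gc_percent(primer) <= 60
            and 50 <= wallace_tm(primer) <= 62)


if __name__ == "__main__":
    print(find_ssrs("TTGC" + "AG" * 10 + "CCTA"))
    print(primer_passes("ACGTACGTACGTACGTACGT"))
```

Structural checks such as hairpins and primer dimers are left to the primer design software and are not attempted here.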

cDNA Library Quality Check and Reads Assembly.
A full-length cDNA library was constructed from a mixture of P. frutescens samples. Library quality was evaluated by sequencing 96 randomly selected clones. On average, the insert size was greater than 1.2 kb. Forty-nine clones (51.04%) yielded sequencing reads above 700 bp; 11 clones (11.45%) yielded reads shorter than 500 bp. After clone quality was confirmed, a mass-scale sequencing approach was used. A total of 1,000 randomly selected clones from the cDNA library were subjected to single-orientation sequencing from the 5′ end using an ABI3730xl platform (BGI). Read lengths ranged from 420 bp to 844 bp, with an average of 632 bp (Figure 1).

GC Content by Assembly of cDNA Reads.
One thousand EST reads were obtained by trimming vector contaminants with Cross_match and eliminating chimeric clones and short sequences (less than 100 bp). EST reads were then assembled by PHRAP and CAP3 software [28,29]. Results from the CAP3 assembly indicated that the GC content of unigenes varied from 29.46% to 61.32%. Ninety-one percent of the unigenes exhibited GC content between 37.93% and 52.87% (Figure 2).
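The GC statistics reported above can be reproduced from a set of unigene sequences along the following lines. This is a minimal sketch with hypothetical function names and toy values; ambiguous bases such as N are excluded from the denominator, which is an assumption about how they were handled.

```python
def gc_content(seq):
    """GC percentage over unambiguous bases (A, C, G, T) only."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return 100.0 * gc / acgt if acgt else 0.0


def percent_in_range(values, low, high):
    """Percentage of values falling within [low, high], inclusive."""
    inside = sum(1 for v in values if low <= v <= high)
    return 100.0 * inside / len(values)


if __name__ == "__main__":
    unigenes = ["GGCCAATT", "ATATATGC", "GGGGCCCA"]  # toy sequences
    gc_values = [gc_content(s) for s in unigenes]
    print(min(gc_values), max(gc_values))
    print(percent_in_range(gc_values, 37.93, 52.87))
```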

Sequence Annotation.
Annotation of the EST library was achieved through BLAST searches. Results from the NR database showed sequence homology mainly with two species, Sesamum indicum (185 genes, 57.45%) and Erythranthe guttata (78 genes, 24.22%). The remaining genes exhibited low levels (less than 1.86%) of sequence homology (Table 2).
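A species distribution of this kind can be tabulated from the species of each unigene's top NR hit. The sketch below uses a hypothetical function name and toy input; it does not reproduce the counts in Table 2.

```python
from collections import Counter


def species_distribution(top_hit_species):
    """Return (species, count, percent) tuples sorted by descending count."""
    counts = Counter(top_hit_species)
    total = sum(counts.values())
    return [(species, n, round(100.0 * n / total, 2))
            for species, n in counts.most_common()]


if __name__ == "__main__":
    toy_hits = ["Sesamum indicum"] * 3 + ["Erythranthe guttata"]
    for row in species_distribution(toy_hits):
        print(row)
```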

EST-SSR Traits in P. frutescens.
A total of 343 unigene sequences were investigated. SSR sequences were obtained using TRF version 4.07b online software. Eighteen EST-SSR sequences were selected and analyzed following functional annotation (Table 3). Primer pairs were designed using the Primer 3 program. Expected product sizes ranged from 191 bp to 773 bp. In the future, we will perform additional classification studies through gene functions of P. frutescens using these EST-SSR primers.

Discussion
The major outcomes of this study were the construction of a full-length cDNA library from the important P. frutescens L-type (with limonene component) and the preliminary identification of 1,000 ESTs (average 632 bp in length). Genome segment quality is affected by many factors. GC content analysis revealed a distribution between 29.46% and 61.32%. An earlier study showed that a GC content of thirty to fifty percent influenced genome sequence quality in Medicago truncatula and Lotus japonicus [30]. An increase in GC content was related to the ratio of segments with matching EST data [31], consistent with findings from the human genome [32]. Gene Ontology (GO) was utilized to obtain functional information and descriptions of gene products by studying domain-specific ontologies [33]. Annotation results consisted of biological data related to stress response genes, which were classified functionally using the GO hierarchy [34]. The corresponding classifications were processed to obtain additional information on the putative functionality for the subject accession number of pepper EST data from the GO databases [35]. GO "biological process" and "molecular function" terms, generated at level 3, were annotated and associated with the number of sequences for each term, which were normalized by labeling with a GO term [36].
We also established 18 EST-SSR primers from the full-length cDNA library of P. frutescens. In Vitis vinifera, Artemisia tridentata, Panax ginseng, and Salvia miltiorrhiza, the EST-SSR motifs were generally di- and trinucleotide repeats [37-40]. However, this study revealed various penta-, hexa-, dodeca-, and tetradecanucleotide repeat motifs. This finding is in agreement with that for Scutellaria baicalensis, which contains penta- and hexanucleotide repeats [41]. Differences in repeat type may be attributed to the stringency of the SSR search criteria applied to the EST databases of various plant species. EST-SSR markers have many advantages compared with other molecular markers and can be used to study genetic diversity, evolution, comparative genomics, and gene-based associations.
Construction of a full-length cDNA library is significant for comparative genomics, genome sequence validation, and design of EST-SSR primers that display entire transcription units rather than partial gene sequences [42]. One benefit of constructing a full-length cDNA library is that it allowed us to conduct proper gene modeling while comparing other cDNA sequences in P. frutescens. The full-length cDNA sequences will be useful for annotation of the plant genome. Another advantage of EST sequencing is the increased ratio of unigenes with definitive GO categories compared with other libraries. The library built by this method included a high proportion of full-length cDNAs [42], providing a database available for P. frutescens genomics studies. This full-length cDNA library provides a wealth of knowledge about the unique EST sequences available for the P. frutescens genome and, particularly, about the addition of 5′-end sequences that are more unique and valuable for gene identification. These EST tags will be useful for functional gene annotation, analysis of splice site variations, and gene homologies as additional whole-genome sequences become available in P. frutescens.

Conclusions
Perilla frutescens is valuable as a medicinal plant, a natural medicine, and a functional food. However, comparative genomics analyses of P. frutescens are limited by a lack of gene annotation and characterization. A full-length cDNA library from P. frutescens leaves was constructed to identify functional gene clusters and candidate EST-SSR markers from 1,056 expressed sequence tag (EST) sequences. Unigenes were assembled, searched for homologs using the basic local alignment search tool (BLAST), and annotated with Gene Ontology (GO) terms. Primer pairs were designed for a total of 18 simple sequence repeats (SSRs). This study is the first to report comparative genomics and EST-SSR markers from P. frutescens, which should ease gene discovery and provide an important source for functional genomics and molecular genetic research in this interesting medicinal plant.