Distributed under Creative Commons Cc-by 4.0 Genome-wide Identification and Characterization of Gras Transcription Factors in Sacred Lotus (nelumbo Nucifera)

The GRAS gene family is one of the most important plant-specific gene families, which encodes transcriptional regulators and plays an essential role in plant development and physiological processes. The GRAS gene family has been well characterized in many higher plants such as Arabidopsis, rice, Chinese cabbage, tomato and tobacco. In this study, we identified 38 GRAS genes in sacred lotus (Nelumbo nucifera), analyzed their physical and chemical characteristics and performed phylogenetic analysis using the GRAS genes from eight representative plant species to show the evolution of GRAS genes in Planta. In addition, the gene structures and motifs of the sacred lotus GRAS proteins were characterized in detail. Comparative analysis identified 42 orthologous and 9 co-orthologous gene pairs between sacred lotus and Arabidopsis, and 35 orthologous and 22 co-orthologous gene pairs between sacred lotus and rice. Based on publically available RNA-seq data generated from leaf, petiole, rhizome and root, we found that most of the sacred lotus GRAS genes exhibited a tissue-specific expression pattern. Eight of the ten PAT1-clade GRAS genes, particularly NnuGRAS-05, NnuGRAS-10 and NnuGRAS-25, were preferentially expressed in rhizome and root. In summary, this is the first in silico analysis of the GRAS gene family in sacred lotus, which will provide valuable information for further molecular and biological analyses of this important gene family.


INTRODUCTION
The GRAS genes are plant-specific transcriptional regulators, which have been characterized in many higher plant species. The name GRAS is an acronym of the first three functionally characterized members in this family, including the gibberellic acid insensitive (GAI), repressor of ga1-3 (RGA), and scarecrow (SCR) (Bolle, 2004). The GRAS proteins contain a variable N-terminal region and a highly conserved C-terminal region with five different sequence motifs in the following order: leucine heptad repeat I (LHR I, LRI), VHIID motif, leucine heptad repeat II (LHR II, LRII), PFYRE motif and SAW motif (Pysh et al., 1999).
The GRAS proteins are usually composed of 400-770 amino acid residues (Bolle, 2004). Much research has shown that LRI, VHIID, LRII motif or the whole LRI-VHIID-LRII pattern might function as one of the DNA-binding or protein-binding regions during interaction between GRAS and other proteins. Functions of the PFYRE and SAW motifs are still not very clear, while the highly conserved residues in the PFYRE and SAW motifs indicate that these motifs are potentially related to the structural integrity of the GRAS proteins (Sun et al., 2011). In fact, the order of the conserved domains in all GRAS proteins are similar, and the regulatory specificity of the GRAS proteins seem to be determined by the length and amino acid sequences of their variable amino-termini (Tian et al., 2004). Based on studies on the model plants Arabidopsis and rice, members of the GRAS gene family are divided into eight distinct branches, namely LISCL, PAT1, SCL3, DELLA, SCR, SHR, LS and HAM, which are named after a common feature or one of their members (Hirsch & Oldroyd, 2009). However, another phylogenetic analysis grouped the Arabidopsis GRAS proteins into ten branches, including LISCL, AtPAT1, AtSCL3, DELLA, AtSCR, AtSHR, AtLAS, HAM, AtSCL4/7, DLT (Sun et al., 2011). In another study, the GRAS genes can be divided into at least 13 branches (Liu & Widmer, 2014). The identification of GRAS members in different species was also slightly different among studies. There were 32 to 34 genes identified for sacred lotus in this study. In addition, 57 and 60 were identified as GRAS genes in rice in two reports (Liu & Widmer, 2014;Tian et al., 2004). In addition, there were 21, 46, 48, 53 and 106 GRAS transcription factors identified in tobacco (Chen et al., 2015), Prunus mume (Lu et al., 2015), Chinese cabbage (Song et al., 2014), tomato (Huang et al., 2015) and Populus (Liu & Widmer, 2014), respectively.
Members of the GRAS gene family have diverse functions and exert important roles in plant development and physiological processes, such as in gibberellin acid (GA) signal transduction, root development, axillary shoot development, phytochrome signal transduction and transcriptional regulation in response to abiotic and biotic stresses. For example, the phosphomimic/de-phosphomimic RGA, a representative DELLA protein of the GRAS protein family in Arabidopsis regulates different bioactivities in GA signaling (Wang et al., 2014). AtSCL3 plays an essential role in integrating multiple signals during root cell elongation in Arabidopsis (Heo et al., 2011). The gene MONOCULM 1 (MOC1) which is highly related to rice tillering has been cloned. As a result of a defect in the formation of tiller buds, the moc1 mutant plants contain no tillers except a main culm . As a member of the PAT1 branch, the Arabidopsis Scarecrow-like 13 (AtSCL13) is a positive regulator of continuous red light signals downstream of phytochrome B (phyB), it also regulates phytochrome A (phyA) signal transduction in a phyB-independent way according to genetic evidences (Torres-Galea et al., 2006). With increasing the activity of two stress-responsive enzymes α-amylase and Superoxide dismutase (SOD) in transgenic seedlings, transgenic Arabidopsis plants overexpressing PeSCL7 from Populus euphratica Oliv exhibited enhanced tolerance to salt and drought stress due to increased activity of stress-responsive enzymes in the transgenic seedlings (Ma et al., 2010).
Sacred lotus (Nelumbo nucifera Gaertn) is a perennial aquatic herb with large showy flowers. It is widely distributed in Australia, China, India, Iran and Japan (Mukherjee et al., 2009). Sacred lotus has been cultivated as a crop in Far-East Asia for over 3,000 years. This plant is widely used for food (edible rhizomes, seeds and leaves) and medicine, and also plays an vital role in cultural and religious activities (Shen-Miller, 2002). The individual parts of sacred lotus, including leaf, stamen, stem, rhizome and seed, have various medicinal properties. For example, the embryos of seeds are used as traditional Chinese medicine in the treatment of nervous disorders, high fevers (with restlessness), insomnia and cardiovascular diseases (e.g., hypertension, arrhythmia). The leaves are mainly used for clearing heat, cooling blood, removing heatstroke and stopping bleeding. The flowers are useful in treating premature ejaculation, bloody discharges and abdominal cramps (Mukherjee et al., 2009).
The GRAS family genes have been well characterized in several plant species, but they have not been investigated in sacred lotus. In 2013, the China Antique variety of the sacred lotus had its complete genome sequenced and the draft genome was released. The final assembly covers 86.5% of the estimated 929 Mbp total genome size, and the genome was well-assembled with a scaffold N50 of 3.4 Mbp. In the recent assembly, a total of 26,685 genes were annotated in the sacred lotus genome study, of which 4,223 represented a minimum gene set for vascular plants by comparisons of the available sequenced genomes (Ming et al., 2013). The release of the whole genome sequence of sacred lotus enabled us to conduct genome-wide identification and comparative analysis of the GRAS gene family in this plant. In this study, we identified 38 GRAS genes in sacred lotus and constructed a phylogenetic tree of the GRAS genes from eight plant species. Also, we analyzed the gene structure and conserved motifs of the sacred lotus GRAS genes, performed comparative analysis for the GRAS genes from Arabidopsis, rice and sacred lotus. Furthermore, expression profiles of the sacred lotus GRAS genes were investigated in four different tissues using publically available RNA-seq data. This study provided essential information for the GRAS family genes in sacred lotus and enhanced our understanding of the GRAS family genes in plants.

Identification and characterization of the sacred lotus GRAS genes
The sacred lotus genome sequence, the annotated sacred lotus gene and protein sequences were downloaded from Lotus-DB (http://lotus-db.wbgcas.cn/, v1.0) (Wang et al., 2015). The newest HMM model for the GRAS transcription factor gene family named PF03514.11 was downloaded from the Pfam database (http://pfam.xfam.org/) (Eddy, 2011;Finn et al., 2014). The HMMER software was employed in searching for GRAS proteins in the entire protein dataset of sacred lotus with a cut-off E-value of 1e −5 using PF03514.11 as a query. The identified potential GRAS proteins were manually checked given their presence of the motifs essential for a protein to be defined as GRAS and all of them were retained for further analysis. The coding sequences and the genomic sequences of the identified sacred lotus GRAS genes were obtained in accordance with the GFF3 file specification. The genomic position of each GRAS protein on the assembled mega scaffolds of sacred lotus was also obtained based on the GFF3 file.

Phylogenetic analysis of the GRAS genes
Eight representative species were selected for comparative analysis of their GRAS proteins. The genome sequences of spruce (Picea abies), the first available gymnosperm genome assembly (Nystedt et al., 2013), was collected from the Congenie Website (http://congenie.org/). The annotated proteins of algae (Chlamydomonas reinhardtii), moss (Physcomitrella paten), fern (Selaginella moellendorffii), columbine (Aquilegia coerulea), Arabidopsis thaliana, and rice (Oryza sativa) were downloaded from the Pfam database (v10) (Goodstein et al., 2012). The GRAS genes in these species were identified using the aforementioned method. No GRAS gene was identified in algae. Given that there were two GRAS domains in two rice genes (LOC_Os12g04370, LOC_Os11g04570), only the best hit domain for each of them was selected for phylogenetic tree construction to avoid redundant amino acids.
The ClustalX2 program was used to generate the multiple sequence alignments of the GRAS proteins with the Gonnet protein weight matrix (Larkin et al., 2007). The MEGA program (v6.06) was employed to construct a maximum-parsimony phylogenetic tree using the JTT model with 500 bootstrap replicates (Tamura et al., 2013). All sites in each of the GRAS proteins were involved in the phylogenetic tree construction. To clearly distinguish the genes in this phylogenetic tree, the terms of ''moss'' and ''fern'' were added as prefixes to indicate they were genes from Physcomitrella paten and Selaginella moellendorffii, while the common names were used as prefixes for other species. The frequency of each divergent branch was displayed if its value was higher than 50%. The Adobe Illustrator software was used to clearly show the GRAS branches after classification of all proteins based on the known background information.

Gene structure and motif analysis
The gene structure was analyzed using the Gene Structure Display Server tool (http://gsds.cbi.pku.edu.cn/, v2.0) (Hu et al., 2015). The UTR sequences were gathered and were displayed in the final gene structure. The MEME software (http://memesuite.org/doc/download.html, v4.11.0) was used to search for motifs in all 38 sacred lotus GRAS proteins with a motif window length from 6 bp to 50 bp (Bailey et al., 2009). The default number of motifs to be found was set to 10, and these motifs were allowed to be distributed for any number of repetitions.

Identification of orthologous and paralogous genes
OrthoMCL (v2.0.3) (Li, Stoeckert & Roos, 2003) was used to search for orthologous, coorthologous and paralogous genes in sacred lotus, Arabidopsis, and rice using the entire GRAS protein sequences. We ran ortholog analysis using OrthoMCL with an E-value of 1e −5 for all-against-all BLASTP alignment, and a match cut-off value of 50. The orthologous and paralogous relationships were gathered, and displayed using the Circos software (http://circos.ca/) (Krzywinski et al., 2009).

Analysis of the GRAS gene expression in different tissues
Transcriptome data of leaf, petiole, rhizome internode and root of sacred lotus have been previously generated and processed. The FPKM (fragments per kilobase per million measure) representing the gene expression level of each sacred lotus gene was available in Lotus-DB (http://lotus-db.wbgcas.cn/genome_download/transcript/). We retrieved the FPKM results of all sacred lotus GRAS genes from Lotus-DB and displayed the results using the HemI program (http://hemi.biocuckoo.org/) with the maximum distance similarity metric (Deng et al., 2014).

Genome-wide identification of GRAS genes in sacred lotus
A total of 38 distinct GRAS transcription factors were identified from the sacred lotus genome (Table S2). We named these GRAS candidates NnuGRAS-01 up to NnuGRAS-38 based on their E-value ranking (Table 1 and Table S2). The length of GRAS proteins varied greatly, ranging from 74 amino acids (aa) to 807 aa. It was noteworthy that the minimum length of a typical GRAS domain is about 350 amino acids (Liu & Widmer, 2014). Based on this criterion, several of the 38 GRAS proteins that we identified in sacred lotus were exceptional. For example, the number of amino acids of NNU_15540-RA, NNU_11453-RA and NNU_26501-RA was 97, 74 and 168, respectively. The average length of coding sequences (CDS) of the GRAS genes (1,544 bp) was longer than the average of all sacred lotus genes (1,135 bp). Analysis revealed that the GRAS genes of sacred lotus contained longer exons and smaller introns.
Physical and chemical characteristics of all 38 sacred lotus GRAS proteins, including number of amino acids, molecular weight, theoretical pI, formula, instability index, aliphatic index and GRAVY, were analyzed and summarized in Table S2. The average value of theoretical pI was 5.0, suggesting that most proteins were acidic. Only two proteins, NnuGRAS-35 and NnuGRAS-38, with values of theoretical pI at 9.03 and 9.12, respectively, were considered to be alkaline. Three GRAS proteins with an instability index less than 40 were classified as stable and the rest were classified as unstable. The average aliphatic index of all proteins was 85.8, ranging from 67.68 to 108.11. We found that most sacred lotus GRAS proteins contained numerous aliphatic amino acids. The GRAVY scores of all proteins, except NnuGRAS-29 (0.005), NnuGRAS-37 (0.264) and NnuGRAS-38 (0.482), were less than zero, suggesting that these proteins were hydrophilic. Based on a comparative analysis, the physical and chemical characteristics of the sacred lotus GRAS proteins were in general similar to those of Chinese cabbage (Song et al., 2014).

The representative gene families in sacred lotus
To shed light to all genes in sacred lotus, the genome-wide identification of all possible genes in gene families were performed. A collection of 16,230 Pfam HMM models was used following abovementioned method. A total of 19,925 sacred lotus genes were identified in 4,032 gene families, and some genes were identified into several families for the boundary of some similar gene families were not totally clear. The Pkinase, PPR (Pentatricopeptide repeat), LRR or RRM (RNA recognition motif), MYB, LRRNT (Leucine rich repeat N-terminal domain), TPR (Tetratricopeptide repeat) and WD40 gene families were identified to contain most members in sacred lotus. Most of these abundant gene families were found to be repeats, transcription factors and zinc-finger containing genes. These

Phylogenetic relationship of the GRAS proteins in Planta
To shed insight into the evolution of GRAS genes in the plant kingdom, we analyzed the phylogenetic relationship of the GRAS genes from seven plant species, including moss (Physcomitrella patens), fern (Selaginella moellendorffii), a gymnosperm (Picea abies), a monocot (rice, Oryza sativa) and a representative eudicot species (Arabidopsis thaliana) and two basal eudicots (Aquilegia coerulea and Nelumbo nucifera). The Nelumbo genus, which was the solely extant genus in the family Nelumbonaceae, was one of the earliest eudicot clades (Li et al., 2014). The columbine was also included in the analysis because it belongs to another primitive family between gymnosperms and Nelumbonaceae. The number of GRAS genes in sacred lotus was 38, which was more than in Arabidopsis (32-34), but less than in Prunus mume (46), Solanum lycopersicum (53) and Chinese cabbage (48) (Huang et al., 2015;Lu et al., 2015;Song et al., 2014;Tian et al., 2004).
A previous study showed that the phylogenetic trees based on the full-length sequences or the conserved C-termini of the GRAS proteins were very similar. In both analyses, the GRAS genes were classified into eight branches (Bolle, 2004). The phylogenetic tree of all GRAS genes generated in this study was generally consistent with previous reports. In addition to eight previously reported branches, the new branches SCL3/28 was classified here based on phylogeny. AtSCL8 and AtSCL26 were not classified into any branches for which they were classified into the PAT1 branch and HAM branch due to relative similarity with other members in these two branches. According to the phylogenetic tree, the sacred lotus GRAS genes were distributed in all nine branches, with 4, 3, 4, 1, 4, 2, 7, 10 and 3 in the DELLA, SCR, SCL4/7, LS, HAM, SCL9, SHR, PAT1 and SCL 3/28 branches, respectively (Fig. 1). The SCR and SHR branches contained three (NunGRAS-14, NunGRAS-19 and NunGRAS-32) and seven sacred lotus GRAS proteins, respectively. GRAS genes of the SHR branch have been previously shown to be crucial for the development of root and shoot (Cui et al., 2007). NunGRAS-16 and NunGRAS-17 were classified into the SCL9 branch, whose members have been previously found to regulate the transcription process during microsporogenesis in Lilium longiflorum (Morohashi et al., 2003). Ten sacred lotus GRAS proteins were clustered into the PAT1 branch. PAT1 has been shown to be potentially involved in phytochrome. A signal transduction and was found to be localized to the cytoplasm in Arabidopsis (Bolle, Koncz & Chua, 2000). The DELLA branch includes four sacred lotus GRAS proteins, including NunGRAS-09, NunGRAS-11, NunGRAS-02 and NunGRAS-08. The Arabidopsis GRAS proteins of this branch were found to be associated with negative regulation of GA signaling (Wen & Chang, 2002).
We did not find GRAS genes in the green algae, which in consistent with previous results. The GRAS family transcription factors were found to be first emerged in bacteria and belong to the Rossmann fold methylatransferase superfamily. All bacterial GRAS proteins are likely to function as small-molecule methylases and some of the plant GRAS proteins could have a similar function (Zhang, Iyer & Aravind, 2012). We deduced that the GRAS genes in plants might be originated from bacterial genome. The typical plant GRAS genes first appeared in Figure 1 Phylogenetic tree of the GRAS genes from seven plant species, which was constructed based on the GRAS domains using the maximum-parsimony method. Individual species were distinguished by different shapes with different colors. moss, with 40 members identified in Physcomitrella patens. The total number of the GRAS genes in different species fell in a narrow range of 34 to 60; however, the distribution of GRAS genes in different branches was extremely uneven. In some species, the GRAS genes were mainly found in certain branches. For example, 29 of the total 52 fern GRAS genes and 13 of the total 60 rice GRAS genes were classified into the SCL9 branch, while only one columbine (35 in total) and two sacred lotus (38 in total) GRAS genes were found in the SCL9 branch (Fig. 1). The GRAS gene number of Arabidopsis thaliana in SCL9 was enlarged into six members, this may result from Gamma, Beta and Alpha whole-genome duplication which will need further research. In addition, nine spruce GRAS genes also accumulated in a sub-branch of the PAT1 branch. As columbine and sacred lotus were closely related to each other based on the phylogenetic analysis, every sacred lotus GRAS gene had close columbine GRAS genes, and most columbine GRAS genes had one or two corresponding GRAS members of sacred lotus except for Aquaca_076_00075.1 and four members in DELLA branch. In one sub-branch of the DELLA branch, there were 2, 4 and 3 fern columbine and rice GRAS genes, respectively, but no moss and spruce GRAS genes. Interestingly, most GRAS genes contained only one typical domain, but two rice genes (LOC_Os12g04370 and LOC_Os11g04570) contained two domains.

Gene structure analysis
The structural divergence of exon/intron regions played a crucial role in the evolution of gene families (Bai et al., 2011). To evaluate the possible evolution and diversity of the sacred lotus GRAS genes, we analyzed the number and location of exons, introns and UTRs of each GRAS gene (Fig. 2). The results showed that up to 73.7% (28/38) of the sacred lotus GRAS genes were intronless, which was lower than in Prunus mume (82.2%) and tomato (77.4%), but higher than in Arabidopsis (67.6%), rice (55%) and Populus (54.7%) (Abarca et al., 2014;Liu & Widmer, 2014;Lu et al., 2015;Tian et al., 2004). Only ten of the 38 sacred lotus GRAS genes had introns, ranging from one to seven. Seven genes had a single intron, two genes (NunGRAS-10 and NunGRAS-35) had three introns and NunGRAS-38 had seven introns (Fig. 2). In addition, it was notable that most GRAS members of the same branch generally showed similar exon-intron structures. Most of the paralogous pairs also shared conserved exon-intron structures. However, there were exceptions in the exon-intron structures for different GRAS members of the same branch, such as . These results suggested that one of the genes in these gene pairs might have experienced intron loss or gain events during its evolution process. The intron phase patterns of the GRAS genes in different branches and even in the same branch were totally different, which implied that the relationship of the GRAS genes among different branches was not as close as that observed in other gene families such as the Squamosa Promoter Binding Protein (SBP) gene family in Brassica rapa and the TIFY gene family in Arabidopsis (Bai et al., 2011;Tan et al., 2015). Interestingly, NnuGRAS-38 contained seven introns, while its homolog NnuGRAS-18 contained only one intron, suggesting that even homologs could have experienced gene structure diversification although the possibility of wrong gene annotation could not be ruled out. In addition, NnuGRAS-07, NnuGRAS-10, NnuGRAS-18 and NnuGRAS-35 contained long introns with a length over 4kb, which needs further validation.

Identification of conserved motifs in the sacred lotus GRAS proteins
To analyze the conserved features of the sacred lotus GRAS family proteins, we used the MEME program to identify conserved motifs in all GRAS proteins (Fig. 3A). Surprisingly we found that 23 sacred lotus GRAS proteins contained all 10 motifs, the default number of the MEME analysis, and that only six sacred lotus GRAS proteins contained less than eight motifs. The proteins of the SHR branch was the most representative one of the GRAS gene family with all of them containing 10 motifs. In addition, we found that motifs were more likely to be located in the C-terminal than in the N-terminal, supporting the notion that the C-terminal region of the GRAS proteins was more conserved than the N-terminal region (Pysh et al., 1999). The number of amino acids of these motifs ranged from 21 to 29 (Fig. 3B).

Identification of orthologous, co-orthologous and paralogous GRAS genes in sacred lotus, Arabidopsis and rice
Comparative analysis identified orthologous, co-orthologous, and paralogous GRAS genes among sacred lotus, Arabidopsis and rice using OrthoMCL (Fig. 4). In total, we identified 42 orthologous and nine co-orthologous gene pairs between sacred lotus and Arabidopsis, and 35 orthologous and 22 co-orthologous gene pairs between sacred lotus and rice. Between Arabidopsis and rice, 29 orthologous and 20 co-orthologous gene pairs were found. In addition, a total of 26 (68.4%) sacred lotus GRAS proteins were found to have corresponding paralogous proteins. This ratio was higher than that in Arabidopsis (19, 55.9%) and rice (31, 51.7%) (Table S4).
It was generally considered that orthologous genes have similar gene structure and biological function (Chen et al., 2007). Identification of orthologous genes was crucial for phylogenetic analysis since it could play a role in elucidation of gene and plant evolution. The sacred lotus genome underwent a lineage-specific whole genome duplication (WGD) event about 65 million years ago, which separated it from other eudicots prior to the Gamma genome-triplication event. The lack of the triplication event made the sacred lotus genome an excellent research material bridging eudicots and monocots  (Ming et al., 2013). We found that the number of orthologous gene pairs between sacred lotus and Arabidopsis were more than that between sacred lotus and rice. In comparison, the number of orthologous GRAS gene pairs between Chinese cabbage and Arabidopsis was 52 (Song et al., 2014), a number more than that between sacred lotus and Arabidopsis. This result suggested that Arabidopsis was genetically close to Chinese cabbage than sacred lotus.

Expression pattern of the sacred lotus GRAS genes in different tissues
We used publically available RNA-seq data of four tissues (leaf, petiole, rhizome and root) to investigate the expression profiles of the sacred lotus GRAS genes (Wang et al., 2015). Based on the maximum distance similarity metric clustering method on two dimensions, Figure 4 Comparative analysis of the GRAS genes in Arabidopsis, rice and sacred lotus. Yellow, blue and red lines indicate orthologous, co-orthologous and paralogous gene pair relationships, respectively. we found that the GRAS genes shared a similar expression pattern in leaf and petiole as well as in rhizome and root (Table S2 and Fig. 5). The GRAS genes of some branches exhibited a tissue-specific expression pattern. For example, the GRAS genes of the PAT1 branch seemed to be more abundantly expressed in rhizome and root than in leaf and petiole, especially NnuGRAS-05, NnuGRAS-10, and NnuGRAS-25. This suggests that these genes might be associated with development of the edible sacred lotus rhizome. Further functional characterization of these genes would contribute to increase the production of sacred lotus rhizome through molecular-assisted breeding. Six of the seven GRAS genes of the SHR branch (except NnuGRAS-21) showed a higher expression level in leaf and petiole than in rhizome and root. The members in the SCL 3/28 branch had a similar expression level in the four tissues investigated. The expression levels of NnuGRAS-08, NnuGRAS-16, NnuGRAS-33, NnuGRAS-36 and NnuGRAS-37 were very low in all four tissues (Fig. 5).