Annotation of glycoside hydrolases in Sorghum bicolor using a protein interaction approach

Background: Glycoside hydrolase (GH) proteins are found in a wide range of organisms, viz., Archaea, animals and plants. Members of these families are involved in diverse processes, including starch metabolism, transport, stress defense and cell wall remodeling. A large number of GH proteins have been identified in several plants, viz., Oryza sativa, Arabidopsis thaliana and Populus trichocarpa. However, the majority of these proteins in sorghum are still described as putative uncharacterized. Methods: To annotate these proteins in sorghum, we constructed a protein interaction network among 13 families of GHs. We developed a neural network based machine learning algorithm for predicting protein interactions. The algorithm takes the total average of high (≥90%) GO term semantic similarity and removes false protein interactions (<90%). Results: A total of 1,318 homologous proteins with high semantic similarity were identified from sorghum, rice and Arabidopsis. These data were used to annotate 238 putative uncharacterized GH proteins in sorghum. The identified proteins belonged to the functional categories of carbohydrate transport and metabolism and hydrolase activity. These functional categories appear to be a distinct mechanism of abiotic stress adaptation in plants. Conclusions: A novel method was developed to annotate putative uncharacterized proteins through protein interactions based on GO term semantic similarity. The proposed method will help in identifying new proteins that may aid the development of stress-resistant cereal crops and bioenergy grasses.


Background
The glycoside hydrolase (GH) superfamily is a large group of carbohydrate-active enzymes that hydrolyse polysaccharides. These enzymes hydrolyse the glycosidic bond between two or more carbohydrates [1]. The Carbohydrate-Active Enzymes (CAZy) database classifies the GH superfamily into 131 different families (http://www.cazy.org/Glycoside-Hydrolases.html) [2]. The different GH families share similar functional properties [3]. These families are found in a wide range of organisms, viz., Archaea, animals and plants. In plants, members of these families are involved in diverse processes, including starch metabolism, transport, stress defense and cell-wall remodeling. They have been implicated in plant defense against biotic and abiotic stresses [4,5]. In addition, GHs, with their catalytic activities and polysaccharide binding affinities, may have important applications in the development of biofuels [6]. Glycoside hydrolase genes have previously been catalogued for several plants, viz., Oryza sativa, Arabidopsis thaliana, Zea mays and Populus trichocarpa. In contrast, there is no functional annotation of the GH superfamily in sorghum (Release 34b; http://www.gramene.org). To improve understanding of the GH superfamily in Sorghum bicolor, we compared its proteins at the level of gene ontology [7].
The Sorghum bicolor genome has recently been sequenced [8]. The plant is an important cereal crop for food and fodder and a raw material for the production of starch, alcohol and biofuels [9]. The extensive agricultural and other uses of sorghum make it necessary to improve the resistance of this crop to biotic and abiotic stresses [10]. Genome sequencing has become one of the most important approaches in modern biological research, greatly facilitated by the availability of rapid sequencing methods. Consequently, genome sequencing projects have deposited a large amount of sequence data in the post-genomic era. The major challenge now is to provide functional annotations for the large number of proteins that have arisen from these projects [11]. However, successive experimental investigations to assign protein functions are costly and time consuming. Computational approaches have therefore made protein function prediction an attractive way to address this complex task. There are several important methods for predicting protein function from sequence and structural data. Sequence similarity based approaches are widely used for function prediction, but they are often insufficient if the similarity is not statistically significant [12]. Therefore, identifying protein-protein interactions (PPIs) and networks is important for understanding the mechanisms of biological processes at the molecular level [13]. Machine learning techniques are applied for knowledge extraction from data in several biological domains, viz., genomics, proteomics, microarrays and systems biology [14]. Thus, computational techniques are extremely helpful for predicting proteins and biological networks. Protein interaction network analysis also takes advantage of machine learning [15].
Systems biology approaches help to identify regulatory hubs in complex networks [16], where the interaction of proteins determines the outcome of most cellular processes. The protein interaction network during stress tolerance has recently been elucidated at the transcript level in Arabidopsis thaliana and Oryza sativa [17]. A gene network in sorghum responsive to water-limiting environments has also been reported [18].
The co-evolution of sequences is well documented and has been used to predict protein interactions [19]. Interacting proteins are also often predicted on the basis of co-expressed and co-regulated genes [20,21]. It has previously been reported that such genes belong to the same protein complex. Hence, semantic similarity is very helpful for evaluating the physiological significance of protein-protein interactions [22]. Semantic similarity through annotation systems such as gene ontology (GO) has been studied [23]. GO terms are organized as a directed acyclic graph (DAG) in three ontologies, viz., molecular function (MF), biological process (BP) and cellular component (CC) [7]. In our earlier work, we predicted the function of proteins using a protein-protein interaction approach [24]. Here, we report a neural network based machine learning method to annotate proteins of the GH superfamily. We describe an algorithm that computes the total average semantic similarity between GO terms. The algorithm takes the total average of high (≥90%) semantic similarity and removes false protein interactions (<90%). This approach provides a common functional annotation of the glycoside hydrolase superfamily, whose function appears to be an important mechanism of abiotic stress adaptation in sorghum.

Proteins retrieval and analysis
To identify the 238 putative uncharacterized proteins of the GH superfamily in sorghum, we constructed a protein interaction network between GH families. These proteins were retrieved manually under the unique identification number IPR017853 from the Gramene genome database (Gramene release 34b, January 2012) (Supplement Table S1). We classified these proteins into distinct clusters according to their homology search results and family. To identify potential homologs of these putative uncharacterized proteins, 1,318 homologous proteins with high semantic similarity from Sorghum bicolor (238), Oryza sativa Indica (214), Oryza sativa japonica (375), Arabidopsis thaliana (296) and Arabidopsis lyrata (195) were obtained using PSI-BLAST (http://www.ebi.ac.uk/Tools/sss/psiblast/) [25]. The homology search was carried out using PSI-BLAST with the following parameters: protein database, UniProt; E-value, 1.0e-3; matrix, BLOSUM62; gap opening, 11; gap extension, 1; scores and alignments, 1000; dropoff, 15 (default); final dropoff, 25 (default); and alignment view, pairwise with the filter active. It has been shown earlier that 43% of BLAST hits are homologous at E-values ranging from 1 to 10, and that over 99% are homologous below the threshold E-value of 1e-03 [26].
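The E-value cutoff described above can be applied to parsed search hits as a simple filter. The sketch below is illustrative only: the `Hit` record, the `filter_hits` helper and the identifiers are hypothetical and are not part of the original pipeline; only the 1.0e-3 threshold is taken from the text.

```python
from dataclasses import dataclass

E_VALUE_CUTOFF = 1.0e-3  # threshold stated in the text


@dataclass(frozen=True)
class Hit:
    """One homology-search hit (hypothetical record; field names are assumptions)."""
    query: str
    subject: str
    e_value: float


def filter_hits(hits, cutoff=E_VALUE_CUTOFF):
    """Keep only the hits whose E-value does not exceed the cutoff."""
    return [h for h in hits if h.e_value <= cutoff]


# Made-up identifiers and E-values, for illustration only.
hits = [
    Hit("query_1", "subject_A", 2e-50),
    Hit("query_1", "subject_B", 0.5),
    Hit("query_2", "subject_C", 1e-3),
]
kept = filter_hits(hits)
print([h.subject for h in kept])
```

Whether a hit exactly at the cutoff is retained is a design choice; this sketch keeps it.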

Semantic similarity measures
We developed an algorithm to calculate the GO term semantic similarity of proteins in distinct clusters. Semantic similarities were computed using the ProteInOn tool (http://xldb.di.fc.ul.pt/tools/proteinon/) with the Jiang-Conrath method [31], without ignoring annotations inferred from electronic annotation (IEA) [32]. Semantic similarity was calculated among homologous proteins in distinct clusters in sorghum. Interacting proteins are likely to share molecular functions, biological processes and locations [23]. Thus, we used the similar functional properties of each pair of proteins to predict interactions.
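For orientation, the Jiang-Conrath measure works on the information content (IC) of two GO terms and of their most informative common ancestor. The sketch below shows the standard distance, IC(a) + IC(b) - 2 IC(ancestor). The conversion of this distance into a similarity score via 1 / (1 + d) is an assumption made here for illustration; the normalization used by ProteInOn is not stated in the text, and all function names are hypothetical.

```python
import math


def information_content(term_count, total_count):
    """IC of a GO term: negative log of its annotation frequency."""
    return -math.log(term_count / total_count)


def jiang_conrath_distance(ic_a, ic_b, ic_ancestor):
    """Jiang-Conrath distance from the ICs of two terms and of their
    most informative common ancestor."""
    return ic_a + ic_b - 2.0 * ic_ancestor


def jiang_conrath_similarity(ic_a, ic_b, ic_ancestor):
    """Map the distance to (0, 1]; the 1 / (1 + d) conversion is an assumption."""
    return 1.0 / (1.0 + jiang_conrath_distance(ic_a, ic_b, ic_ancestor))


# Identical terms share themselves as ancestor, so the distance is zero.
print(jiang_conrath_similarity(2.0, 2.0, 2.0))
```

Under this conversion, two identical terms score 1.0, and the score falls as the terms diverge from their common ancestor.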

Activation method
The proposed method is an information processing model inspired by neural networks. It is configured for a specific application, here data classification, through a learning process. Structurally, the method has three layers: input, process and output (Figure 1). It takes many GO term semantic similarity values as input and classifies them in the output according to a threshold value. If the semantic similarity is greater than or equal to the threshold value (90%), the output value (a_i) is 1. Otherwise, if the semantic similarity is less than the threshold value (90%), the output value (a_i) is 0. An output value (a_i) of 1 was considered a true positive and an output value of 0 a false positive. Here, SS is the input semantic similarity and θ_i is the threshold value. The algorithm proceeds as follows. We summed the true positive semantic similarity values (≥90%) separately for the three GO ontologies, molecular function (MF), biological process (BP) and cellular component (CC), across our preferred plants, sorghum, Arabidopsis and rice. The sum of the semantic similarity values was divided by the total number of GO terms obtained for the true positive proteins (≥90%) in each case. Proteins were eliminated if they had any false positive (<90%) GO
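The thresholding and averaging steps described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, similarity values are assumed to be percentages, and a protein is assumed to be discarded as soon as one of its MF, BP or CC scores falls below the threshold.

```python
THRESHOLD = 90.0  # percent, the threshold value stated in the text


def activation(ss, theta=THRESHOLD):
    """Step activation: 1 if the similarity reaches the threshold, else 0."""
    return 1 if ss >= theta else 0


def average_true_positive(similarities, theta=THRESHOLD):
    """Mean of the values that pass the threshold, or None if none pass."""
    passed = [ss for ss in similarities if activation(ss, theta) == 1]
    return sum(passed) / len(passed) if passed else None


def keep_protein(scores_by_ontology, theta=THRESHOLD):
    """Keep a protein only if every ontology score passes the threshold."""
    return all(activation(ss, theta) == 1 for ss in scores_by_ontology.values())


# Made-up scores for one hypothetical protein.
scores = {"MF": 100.0, "BP": 95.0, "CC": 88.0}
print(keep_protein(scores))
print(average_true_positive(list(scores.values())))
```

In this toy case the protein is discarded because its CC score is below 90%, while the average over the passing values is taken from the MF and BP scores only.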

Protein interaction algorithm
This algorithm is based on the concept that proteins with high (≥90%) semantic similarity within a species are more likely to interact. The algorithm was designed following the neural network concept of machine learning and is structured as interconnected artificial neurons. It was evaluated on homologous proteins of the GH families in 13 clusters in sorghum. If the calculated semantic similarity value (T_vk) is a true positive (≥90%), then the homologous proteins within a cluster and across distinct clusters are likely to interact. In the algorithm (Equation 4), PPI denotes protein-protein interaction, a_iA and a_iB are proteins with high semantic similarity in different clusters, and Cl_1 to Cl_n are the clusters of these proteins. Each cluster consists of the proteins of one family that have high semantic similarity. The symbol '∩' denotes protein interaction.
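One way to read the rule above is that every protein whose average similarity is a true positive becomes a candidate partner of every other such protein, within and across clusters. The sketch below encodes that reading only; it is not a reconstruction of Equation 4, and the function name, the data layout and the example values are assumptions.

```python
from itertools import combinations

THRESHOLD = 90.0  # percent, the threshold value stated in the text


def predict_interactions(clusters, threshold=THRESHOLD):
    """Pair all retained proteins, within and across clusters.

    `clusters` maps a cluster name to {protein_id: average similarity (%)}.
    Returns a list of ((cluster, protein), (cluster, protein)) pairs.
    """
    retained = [
        (cluster, protein)
        for cluster, members in clusters.items()
        for protein, score in members.items()
        if score >= threshold
    ]
    return list(combinations(retained, 2))


# Made-up clusters and scores, for illustration only.
clusters = {
    "Cl1": {"P1": 100.0, "P2": 85.0},
    "Cl2": {"P3": 95.0},
}
pairs = predict_interactions(clusters)
print(pairs)
```

Here P2 falls below the threshold, so the only predicted pair links P1 in Cl1 with P3 in Cl2.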

Hit rate measure
The number of false positive proteins was subtracted from the number of true positive proteins of a cluster. The hit rate covers the proteins with a true positive score and was calculated from these counts.
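The formula itself is not reproduced in the text. One reading that is consistent with the statement that the hit rate covers the true positive proteins is the percentage of true positives among all homologs of a cluster; the sketch below assumes that reading, and the `hit_rate` helper is hypothetical. The counts used in the example are the overall totals reported in this paper (801 true positives and 517 false positives).

```python
def hit_rate(true_positive, false_positive):
    """Percentage of true positive proteins among all proteins counted.

    Assumed definition: 100 * TP / (TP + FP).
    """
    total = true_positive + false_positive
    if total == 0:
        raise ValueError("no proteins to score")
    return 100.0 * true_positive / total


# Overall counts reported in the paper, not a per-cluster value.
overall = hit_rate(801, 517)
print(f"{overall:.1f}%")
```

Per-cluster values would be obtained the same way from the counts of each cluster.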

Proteins interaction assumption
The protein-protein interaction (PPI) algorithm was examined under the assumption that proteins of a family with a high (≥90%) total average semantic similarity are likely to interact.
In the developed algorithm, the chosen threshold value (≥90%) improves the accuracy of protein interaction prediction compared with earlier methods. Previously, several protein interactions have been detected by in vitro analyses. As a result, more than 70 protein interaction domains have been described, viz., BRCT, LRR, MH1 domain proteins and BH1-BH4 proteins [33]. All of these proteins are semantically similar to their corresponding proteins. Previously identified kinase-protein interactions in rice also show high (>70%) semantic similarity [34]. Endoribonuclease proteins in RISC (RNA-induced silencing complex), viz., DICER, DORSA and ARGNT, form a highly interacting complex. DICER interacts with several partner proteins, such as TRBP in humans and R2D2 in Drosophila [35,36]. The proteins of these complexes have high (>80%) semantic similarity with their corresponding proteins in all three ontologies, MF, BP and CC. Proteins containing RNase III domains, viz., DICER1, MRPL44 and RNAsen, also show high GO term semantic similarity. Hence, we used the PPI algorithm (Equation 4) to assign interactions between proteins with high semantic similarity in distinct GH families in sorghum. In this study, proteins with high semantic similarity allowed us to illustrate interactions among a group of proteins in sorghum, which was highly beneficial for predicting common functions of different GH families.

Homologous proteins
Homology search is a powerful approach for identifying protein function [37]. Hence, unknown proteins from one plant species can be used to identify the conserved proteins in other species. In this study, all 238 putative uncharacterized proteins were classified into 13 clusters based on their homology and family (Table 1). The uncharacterized sorghum proteins of each cluster were used to find their homologous proteins in our preferred plant species, viz., sorghum, rice and Arabidopsis. A total of 1,318 homologous proteins were identified. Among them, 801 proteins were true positives (≥90%) and 517 were false positives (<90%). In the proposed method, the majority of clusters consisted mostly of true positives, which gives greater accuracy for protein interaction prediction in sorghum. Proteins with a false positive score were eliminated, and no error rate was generated by the proposed method. However, the majority of homologous proteins in cluster 8 were false positives, so its hit rate was only 6%. The proteins in cluster 10 showed a 20% hit rate. The proteins in clusters 7 and 11 showed 78% and 66% hit rates, respectively. The other clusters showed hit rates of more than 80% in sorghum with the proposed method (Table 1). The hit rate covers the proteins of a cluster with a true positive score. A high hit rate indicates high coverage of proteins, and high-coverage proteins with a true positive score are more likely to interact.
The homologous proteins counted in the true positive hit rates of the 13 distinct clusters showed high (≥90%) semantic similarity (Table 1). In evolutionary terms, homologous proteins originate with similar functions [38]. Hence, our analyses identified GH proteins with strong support from high semantic similarity. The COGnitor program revealed that the proteins in clusters 1, 2, 3, 5, 7, 9, 10 and 11 belong to the functional category (G) carbohydrate transport and metabolism. For clusters 4, 6, 8, 12 and 13, no COG records were found (Table 1). Thus, the proteins classified in the clusters with COG records share the functional category of carbohydrate transport and metabolism.

Proteins interaction of GHs
We found that the majority of sorghum GHs in the 13 distinct clusters showed high (100%) semantic similarity in terms of BP, MF and CC. In terms of MF and BP, the proteins in clusters 5, 9 and 12 have 95-100% semantic similarity and those in clusters 4 and 7 have 90-100%. In terms of CC, the homologous proteins of sorghum in cluster 8 have <90% semantic similarity. It has previously been validated that protein interactions have a defined basis in co-functional, co-expressed and co-located proteins [39]. For instance, SUC2-type transporters have been reported to interact physically with other transporters, viz., SUC3 and SUC4, in Arabidopsis [40]. Interactions of DREB2A with RING E3 ligases and of mitogen-activated protein kinases (MAPKs) with the EAR motif of a zinc finger protein in Arabidopsis have also been reported [41,42]. These proteins have high (>80%) semantic similarity with their corresponding proteins in all three ontologies, MF, BP and CC. In this study, our algorithm allowed us to construct a PPI network among GHs in sorghum.

Proteins cluster linkage
In this work, a large amount of data was analyzed for the functional identification of uncharacterized proteins in sorghum. To classify these proteins into 13 families, a data set of GO terms was built. Through the interactions of the proteins of each cluster, we also noted associations between the 13 distinct clusters based on GO terms with high semantic similarity (100%) (Figure 3). The high semantic similarity GO terms of the 13 distinct clusters, starting with biological process, are given in Table S1. These GO terms allow associations to be made between the 13 clusters. These associations between glycoside hydrolase families were therefore used to suggest a common functional annotation.
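The linkage between clusters can be pictured as a set intersection over the GO terms of each cluster. The sketch below is an illustration under that assumption; the `cluster_links` helper, the cluster names and the assignment of terms to clusters are made up, although the GO term names themselves appear in this paper.

```python
from itertools import combinations


def cluster_links(go_terms_by_cluster):
    """Return {(cluster_a, cluster_b): shared GO terms} for every linked pair."""
    links = {}
    for a, b in combinations(sorted(go_terms_by_cluster), 2):
        shared = go_terms_by_cluster[a] & go_terms_by_cluster[b]
        if shared:
            links[(a, b)] = shared
    return links


# Hypothetical assignment of GO terms to clusters, for illustration only.
go_terms_by_cluster = {
    "cluster_A": {"hydrolase activity", "cation binding"},
    "cluster_B": {"hydrolase activity"},
    "cluster_C": {"beta-amylase activity"},
}
links = cluster_links(go_terms_by_cluster)
print(links)
```

Two clusters are linked here whenever they share at least one term; a stricter rule, such as a minimum number of shared terms, would be a straightforward variation.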

Discussion
The functional identification of proteins is one of the most important tasks in the post-genomic era [43]. In this study, we identified proteins of the glycoside hydrolase (GH) superfamily in sorghum using a protein-protein interaction approach. A comparative analysis of different GH families has previously been performed in Brachypodium distachyon and Sorghum bicolor, indicating that these proteins are involved in various processes, including starch metabolism, defense and cell-wall remodeling [3]. In the present comparative study, the GHs are expected to share similar functional properties. Proteins with high semantic similarity were useful for assessing the physiological importance of protein interactions. The more similar their functions, the more likely proteins are to interact, because highly similar genes are consistently co-regulated across distantly related organisms [21]. In this study, the most frequently recurring GO terms in the predicted interactions of GH families were carbohydrate metabolic process, hydrolase activity and cation binding. These functional categories of the identified proteins are involved in the response to abiotic stresses, mainly drought and salt stress. Soluble carbohydrates and starch accumulate under normal conditions and are the main resources for energy supply during stress conditions in plants [44]. For example, a study of the effects of salinity stress on growth and carbohydrate metabolism in rice [45] showed that carbohydrate composition was altered differently by salinity stress. Carbohydrate and the accumulation of sugar are the limiting factors of growth under salinity in salt-sensitive cultivars. It has previously been reported that the expression of a large number of stress-responsive genes of carbohydrate metabolism is down- or up-regulated by the sugar status during abiotic stresses [46].
Carbohydrates, viz., sucrose, glucose, fructose and starch, accumulate under salt stress and play a major role in osmotic adjustment, carbon storage and radical scavenging [47]. It has been shown that proline, sucrose, mannitol and fructans increase during drought stress [48]. Therefore, soluble sugar has proved to be a better marker for selecting for improved drought tolerance. The accumulation of total soluble sugars and sucrose may cause a reduction in growth limitation in the leaves of salt-sensitive cultivars [45]. The GO term "beta-amylase activity" in cluster 6 is responsive to drought stress. A number of studies have demonstrated beta-amylase induction in response to abiotic stress [49,50]. Beta-amylase induction and maltose accumulation act as a compatible-solute stabilizing factor in the chloroplast stroma in response to temperature stress [49]. Enhanced beta-amylase activity during drought stress in cucumber has also been reported [51]. The effects of salt stress on carbohydrate contents and on the activity of amylases and phosphorylase in cotton varieties have been investigated [52]. It was concluded that differences in ion regulation in combination with carbohydrate metabolism contribute to the salt tolerance of cotton varieties. Another frequently recurring GO term, "hydrolase activity", is involved in cell expansion, differentiation and the response to environmental, pathogen and mechanical stress [5,53].
The identified GO term "cation binding" is involved in numerous functions, including electron transport, maintenance of charge balance and enzyme activation [54]. In the cellular component ontology, the majority of proteins belong to the cytoplasmic part, viz., cytosol, vacuole, chloroplast, plant-type cell wall, cell wall, apoplast and membrane. These terms are involved in important processes of the stress response. The identified GO term "apoplast" in clusters 1, 7 and 10 is involved in numerous plant processes, such as maintenance of tissue shape, development, nutrition, signalling, detoxification and defence. Changes in soluble apoplast composition induced under salt stress in Nicotiana tabacum plants have also been reported [55]. The cell wall apoplast has an ion and metabolite composition in mitochondria, chloroplasts and other cell compartments. The apoplast is the first compartment to encounter environmental signals in plants, and it contributes to plant development [56]. The ability to control Na+ influx into the cytoplasm is highly important in determining the plant response to salinity [57].

Conclusions
In this study, we performed functional annotation of 238 putative uncharacterized proteins of the glycoside hydrolase (GH) superfamily in sorghum using a PPI approach. The identified proteins in different GH families showed high semantic similarity to each other. The analyzed data allowed us to construct an interaction network between distinct GH families, which provided a common functional annotation. Further, we observed that the identified proteins belonged to the functional categories of carbohydrate metabolism and transport, hydrolase activity, beta-amylase activity and cation binding. These functions appear to be a distinct mechanism of abiotic stress adaptation in sorghum. The findings of this study will help in identifying new proteins that can aid the development of cereal crops and bioenergy grasses.