Alignment-free Prediction of Ribonucleases Using a Computational Chemistry Approach: Comparison with Hmm Model and Isolation from Schizosaccharomyces Pombe, Prediction, and Experimental Assay of a New Sequence

The study of type III RNases constitutes an important area in molecular biology. It is known that the pac1 + gene encodes a particular RNase III that shares low amino acid similarity with other genes despite having a double-stranded ribonuclease activity. Bioinformatics methods based on sequence alignment may fail when there is a low amino acidic identity percentage between query sequence and others with similar functions (remote homologues) or a similar sequence is not recorded in the database. Quantitative Structure-Activity Relationships (QSAR) applied to protein sequences may allow an alignment-independent prediction of protein function. These sequences QSAR like methods often use 1D sequence numerical parameters as the input to seek sequence-function relationships. However, previous 2D representation of sequences may uncover useful higher-order information. In the work described here we calculated for the first time the Spectral Moments of a Markov Matrix (MMM) associated with a 2D-HP-map of a protein sequence. We used MMMs values to characterize numerically 81 sequences of type III RNases and 133 proteins of a control group. We subsequently developed one MMM-QSAR and one classic Hidden Markov Model (HMM) based on the same data. The MMM-QSAR showed a discrimination power of RNAses from other proteins of 97.35% without using alignment, which is a result as good as for the known HMM techniques. We also report for the first time the isolation of a new Pac1 protein (DQ647826) from Schizosaccharomyces pombe, strain 428-4-1. The MMM-QSAR model predicts the new RNase III with the same accuracy as other classical alignment methods. Experimental assay of this protein confirms the predicted activity. The present results suggest that MMM-QSAR models may be used for protein function annotation avoiding sequence alignment with the same accuracy of classic HMM models.


Introduction
RNase III is a double-strand-specific ribonuclease (dsRNase) that usually makes staggered cuts in both strands of a double helical RNA, although in some cases it cleaves once in a single-stranded bulge in the helix 1,2 .The primary biological function of this system is the specific processing of rRNA and mRNA precursors [3][4][5] but it has also been implicated in other diverse phenomena such as mRNA turnover 6 , conjugative DNA transfer 7 , antisense RNA-mediated regulation and other 8,9 .For instance, Dicer and Drospha are type III RNases responsible for the generation of short interfering RNAs (siRNAs) from long double-stranded RNAs during RNA interference (RNAi).Also, the cellular processing of shRNAs shares common features with the biogenesis of naturally occurring miRNA such as cleavage by nuclear type III RNase Drosha, export from the nucleus, and processing by a cytoplasmic type III RNase Dicer, and incorporation into the RNA-induced silencing complex (RISC).1][12][13] It involves both RNase proteins in several important biological processes as for instance the function of Dicer on the vascular system regulating embryonic angiogenesis probably by processing miRNAs, which regulate the expression levels of some critical angiogenic regulators in the cell. 14ecently, RNAi has moved from a purely experimental technique to the stage of potential clinical applications such as possible use the treatment of spinocerebellar ataxia or amyotrophic lateral sclerosis 15 .Many other dsRNases have been characterized from a variety of prokaryotic and eukaryotic sources and RNase III from Escherichia coli is an archetype of this class of enzymes 6,16,17 .The RNase III family consists of a growing number of enzymes that includes at least 33 bacterial and 22 eukaryotic enzymes 18 .There have been numerous reports of dsRNase activities in eukaryotic cells, some of which exhibited properties consistent with a role in pre-rRNA processing [19][20][21] .
One of the best candidates for eukaryotic RNase III homologues is the Pac1 RNase from Schizosaccharomyces pombe [22][23][24] .The Pac1 product is derived from Schizosaccharomyces pombe pac1 + gene expression, which is also involved in the regulation of sexual development 25 , possibly through a mechanism that involves the processing of certain small nucleolar RNAs (snRNAs) 26 .Pac1 works in eukaryotes as dsRNase and shares a functional similarity to RNase III from E. coli.This fact was proved either by measuring the ability of Pac1 to degrade double-stranded RNA in vitro or by expressing pac1 + in E. coli, where it produced an activity that converted dsRNA into acid-soluble products 23 .Despite these observations the Pac1 gene product shows low homology with other RNAse III enzymes, particularly with those ones belonging to bacteria.The homology between the different RNase III enzymes varies in the range 20 to 84% depending on their evolutionary distance, suggesting a low level of primary structure conservation 27 .It has been reported that antibodies prepared against Pacl RNase have failed to react with RNase III 23 .The Pac1 gene product from Schizosaccharomyces pombe belongs to subclass II of the RNase III family, which is characterized by the presence of an N-terminal extension and includes fungal RNase III 27,28 .This contains 363 amino acids (aa) and only its C-terminal 230 residues share 25% amino acid identity with the Escherichia coli ribonuclease III 23 .
Methods based on sequence alignment have revealed a low amino acidic identity (20-40 %) for the pac1 + gene product with other typical RNases III, either isolated from bacteria or even from species that are genetically close 27,29 .However, experimental observations show Pac1 protein to be a dsRNAse enzyme.This relatively low degree of conservation probably reflects the species-specificity of RNase III, which prevents genetic complementation between members of the RNase III family 30 .
All of the facts discussed above hinder the prediction of the Pac1 gene product as an RNase III-like enzyme using computational methods based on sequence alignment.In fact, Bioinformatics methods based on sequence alignment may fail in general for cases of low sequence homology between the query and the template sequences deposited in the data base.The lack of function annotation (defined biological function) for the sequences deposited in databases and used as templates for function prediction constitutes another weakness of alignment approaches 31,32 .Recently, a group of researchers published in PROTEOMICS (2006) a review 33 on the growing importance of machine learning methods for predicting protein functional class independently of sequence similarity.In this review the authors make reference to various papers on the topic, including their own work [34][35][36][37][38][39][40][41][42][43][44][45] .These methods often use as the input 1D sequence numerical parameters specifically defined to seek sequence-function relationships.For instance, the so-called pseudo amino acid composition approach 46,47 based on 1D sequence coupling numbers has been widely used to predict subcellular localization, enzyme family class, structural class, as well as other attributes of proteins based on their sequence similarity 45, . Alteratively, some authors generalized molecular indices that are classically used for small molecules 75,76 to describe protein sequences, such as the generalization of Broto-Moreau indices by Caballero and Fernández et al. 77 .On the other hand, many authors have introduced 2D or higher dimension representations of sequences prior to the calculation of numerical parameters.This constitutes an important step in order to uncover useful higher-order information not encoded by 1D sequence parameters  .In addition, 2D graphs have been used for proteins and DNA sequences by other researchers. For exmple, Zupan and Randić used spectral-like and zigzag representations.These authors suggested an algorithm for encoding long strings of building blocks (like four DNA bases, twenty natural amino acids, or all 64 possible base triplets) using "zigzag" or "spectrum-like" representations 99 .Hydrophobic cluster analysis (HCA) constitutes another well known technique for the 2D representation of protein sequences 100 . Radić et al. ultimately approached protein representations by using 2D schemes based on nucleotide triplet codons or virtual genetic code 101 and we introduced Hydrophobicity-Polarity (HP) 2D Cartesian or lattice-like representations for proteins related to plant metabolism 93 In this work, we propose to use the Spectral Moments of a Markov Matrix (MMM) associated to a 2D-HP-graph to numerically characterize protein sequences and seek a QSAR model to predict type III RNAses without alignment.Firstly, we derived Hydrophobicity-Polarity (HP) 2D Cartesian or lattice-like representations (also called maps or graphs) for RNase III and control group protein sequences 93 .We then calculated the MMM values of order k (symbolized as SR π k ) to characterize the protein sequence.Spectral Moments for many kinds of graphs have been used before for quantitative structure-activity relationships (QSAR) studies on proteins [102][103][104][105][106][107][108][109][110][111][112] .We subsequently developed a classifier to connect protein sequence information (represented by the SR π k values) with the classification of sequences as RNAse III or not.In general, different kinds of classifiers have been used to derive protein sequence QSAR models 113,114 .We selected a Linear Discriminant Analysis (LDA), which is a simple but powerful technique [115][116][117][118][119][120][121] .The use of this MMM-QSAR model enabled us to predict a novel recombinant Pac1 (rPac1) protein as an RNase III-like enzyme from a new isolate of Schizosaccharomyces pombe.Prediction was also supported by profile Hidden Markov Model (HMM) analysis, submission to BLASTp and InterPro 122 servers and demonstrated by experimental evidence.

Computational methods.
A Markov Model (MM), also called MARCH-INSIDE, was used to codify information about 81 RNase III protein sequences belonging to prokaryote and eukaryote species downloaded from the GenBank database.Briefly, our methodology considers as states of the Markov Chain (MC) any atom, nucleotide or amino acid (aa) depending on the kind of molecule to be described 123,124 .Therefore, MM deals with the calculation of the probabilities ( k p ij ) with which the charge distribution of aa moves from any aa in the vicinity i at time t 0 to another aa j along the protein backbone in discrete time periods until a stationary state is achieved 125,126 .
Each RNase III sequence was labelled by its accession number; see Table I in the supplementary material (SM).The control group consists of 133 proteins, which were selected from 2184 high-resolution proteins in a structurally non-redundant subset of the Protein Data Bank (PDB); most of the data were published by other authors to distinguish enzymes and non-enzymes without alignment 127 (see Table II in the SM).Many researchers have demonstrated the possibility of predicting protein function from sequences 128 and we used 2D-HP graphs to encode information about RNase III amino acid sequences 93 .We then calculated for the first time the HP π k values for these graphs.As can be seen from the discussion above, we selected HP π k based on the utility of other non-stochastic Spectral Moments [103][104][105][106][107][108][109][110][111][112] as well as other MMMs and other stochastic parameters 102,[129][130][131] .
It is important to point out that this 2D graphical representation for proteins is similar to those previously reported for DNA 89,91,132 but the 20 different amino acids are regrouped into HP classes instead of using 4 types of bases.These four groups characterize the HP physicochemical nature of the amino acids as polar, non-polar, acidic or basic 133 .The 2D-HP graph for the deduced amino acid sequence of rPac1 protein, obtained from Schizosaccharomyces pombe strain 428-4-1 (uploaded by our group with accession number DQ647826), is shown in

2D-HP graph MMMs used as sequence numerical descriptors.
After the representation of the sequences we assigned to each graph a stochastic matrix 1 Π.Note that the number of nodes (n) in the graph is equal to the number of rows and columns in 1 Π but may be equal or even smaller than the number of amino acids or DNA bases in the sequence.The elements of 1 Π are the probabilities 1 p ij of reaching a node n i with charge Q i moving through a walk of length k = 1 from another node n j with charge Q j 134 : ( ) Where α ij equals 1 if the nodes n i and n j are adjacent in the graph and equal to 0 otherwise.Q j is equal to the sum of the electrostatic charges of all amino acids placed at this node.It then becomes straightforward to carry out the calculation of the spectral moments of 1 Π in order to numerically characterize the protein sequence: Where Tr is called the trace and indicates that we sum all the values in the main diagonal of the matrices k Π = ( 1 Π) k , which are the natural powers of 1 Π.The present class of MMMs encodes in a stochastic manner the distribution of the amino acid properties (charge) through all of the nodes placed at different distances in the 2D-HP lattice.Expansion of expression (2) for k = 0 gives the order zero MMM 0 ( HP π 0 ); for k = 1 the short-range MMM 1 ( HP π 1 ), for k = 2 the middle-range MMM 2 ( HP π 2 ), and for k = 3 the long-range MMMs.This extension is illustrated for the linear graph n 1 -n 2 -n 3 , which is characteristic of the sequence (Asp-Glu-Asp-Lys); please note that the central node contains both Glu and Asp: All calculations of HP π k values for protein sequences of both groups were carried out with our in-house software BIOMARKS version 1.0 ®, including sequence representation 135 .We proceeded to upload a row data table with eleven HP π k values for each sequence (k = 0, 1, 2,…10) and grouping variable RNaseIII-score = 1 (for RNAses) and -1 (for control group sequences) to statistical analysis software 136 .The overall methodology is represented schematically in order to improve the understanding of our approach (see Figure 2).

Statistical analysis. K-Means cluster analysis.
The negative group was selected from 2184 proteins with diverse functions (enzymes and non-enzymes) recorded in the PDB, as mentioned before.Our negative subset was designed according to K-Means cluster analysis (k-MCA) 137 .The method consists of carrying out a partition of the starting group made up by a non-RNase III series of proteins into several statistically representative clusters of sequences.Thus, one may select the members to conform to the negative subset from all of these clusters.This procedure ensures that the main protein classes (as determined by the clusters derived from k-MCA) will be represented in the model control group, thus allowing the representation of the entire 'experimental universe'.The spectral moment series was explored as clustering variables in order to carry out k-MCA.The procedure described above is represented graphically in Figure 3, where a cluster analysis was carried out to select a representative sample for the control group.

Linear Discriminant Analysis.
LDA forward stepwise analysis was carried out for variable selection to build up the model [115][116][117][118][119][120][121] .All of the variables included in the model were standardized in order to bring them onto the same scale.Subsequently, a standardized linear discriminant equation that allows comparison of their coefficients was obtained 138 .The square of Mahalanobis's distance (D 2 ) and Wilk's (λ) statistic (λ = 0 perfect discrimination, being 0<λ<1) were examined in order to assess the discriminatory power of the model.Pac1 protein was submitted to BLASTp to show graphically the similarity of the sequence compared to other RNases III.Each sequence presented in this study was also submitted to the InterPro server 122 in order to compare our methodology with other classical sources of predictive functional annotation.InterPro consists of a database of protein families, domains and functional sites in which identifiable features found in known proteins can be applied to unknown protein sequences.

Experimental Section
3.1 Strains and culture media.The Schizosaccharomyces pombe strain 428-4-1 was routinely grown in Yeast Extract (YEB) medium at 30 ºC during 12 h.Bacterial strain Escherichia coli DH5α was grown in Luria Broth (LB).Transformed bacteria were recovered in the same LB medium but supplemented with carbenicillin at 100 µg/mL.Media were also supplemented with bacteriological agar when required.

Total DNA extraction.
A colony from Schizosaccharomyces pombe strain 428-4-1 was inoculated in 5 mL of YEB medium and grown at 30 ºC during 12 hours until OD 600 = 0.5.From this culture, 250 µL was transferred to 50 mL of the same medium and grown overnight at the same temperature.When the OD 600 = 0.8, cells were collected by centrifugation and broken using small glass pearls.A cellular pellet was resuspended in 500 µL of sterile water at 50 °C and the extract was separated from cellular debris by centrifugation.Total DNA was purified using a total DNA extraction kit (Qiagen GmbH, Germany).Total DNA solution was measured at 260 nm in a GENESYS 10 spectrophotometer, reaching a concentration of 3.8 µg/µL.The solution was also run on agarose gel (0.8%) and high integrity was seen.
3.4 PCR amplifications.Amplification of the pac1 + gene from Schizosaccharomyces pombe was performed by standard PCR from its total DNA.The reaction mixture containing 10 ng of template, 1mM of each dNTP, 1.5 mM MgCl 2, 2 µM of each PAC5' and PAC3' primers, 1x buffer Taq Pol (Gibco BRL) and 2.5 U Taq Pol (Gibco) was completed to a total volume of 50 µL.The PCR was carried out using a thermocycler (Perkin-Elmer 2400) programmed as follows: 5 minutes initial template denaturation at 94 °C, cycle steps: 1 minutes template denaturation at 94 °C, 2 minutes primer annealing at 45 °C, 2 minutes primer extension at 72 °C for 30 cycles; plus a final extension step at 72 °C for 5 minutes 29,30,139 .PCR reaction showed a band coinciding with the size of the reported pac1 + ORF 139 .
3.5 Plasmid construction and sequencing.The PCR amplification product was purified using a GEL Band Purification kit (AmershamPharmaciaBiotech) and ligated to pMOS-Blue T-vector (AmershamPharmaciaBiotech).The ligation was transformed into electrocompetent E. coli DH5α by electroporation in 0.2 mm cuvettes using a Gene Pulser Machine (BioRad) (12.5 kV, 25 µF, 1000 ω).The transformation was plated onto LB medium supplemented with 40 µL of 20 µg/mL X-gal solution and 4 µL of isopropylthio-β-D-galactoside from a 200 µg/mL IPTG solution per plate and allowed to grow overnight at 37 ºC.White coloniespresumably carrying the recombinant pac1 gene inserted in pMOS-Blue T-vector, named pRSPac1 -were selected and plasmid DNA extracted for analysis of the cloned fragment by restriction enzymes.Sequencing of the cloned fragment was performed using an ABI 3700 sequencer (Applied Biosystems) 140 and this showed a product of 1.111 Kb.

Purification of recombinant Pac1
. A single colony of E. coli DH5α with pRSPac1 was grown overnight at 30 ºC in 5 mL of LB medium supplemented with carbencillin at 100 µg/mL.250 µL of culture was then inoculated to 250 mL of the same medium supplemented with carbenicillin (100µg/mL) and grown under the same culture conditions until OD 600 = 0.8; at this point 50 µL of 200 µg/mL IPTG solution was added to the culture.Three hours after induction, cells were harvested by centrifugation and washed with 15 mL of 50 mM tris-HCl (pH 8), 100 mM NaCl and 1 mM EDTA.Cells were collected by centrifugation and stored at -70 ºC overnight.Around 3 g of frozen cells were resuspended in 15 mL of lysis buffer (1% NP40, 0.5% sodium deoxycholate, 0.1 M NaCl, 30 mM Tris-HCl (pH 8), 1mM EDTA); 5 mM MgCl 2 and DNase1 (10 µg/mL) were added.The cell suspension was incubated on ice for 10 minutes.Inclusion bodies were collected by washing four times with lysis buffer and twice with 50 mM Tris-HCl 5 mM (pH 8), 1 mM DTT.Finally, the sample was dissolved in 5 mL of loading buffer and boiled on a water bath for 10 minutes.The total volume of extract was divided into five preparative PAGE electrophoresis samples containing 1 mL of protein extract, which were run in 12% gel.The component corresponding to 45.5 kDa recombinant Pac1 protein was visualized by staining with an aqueous solution of 0.05% Coomassie brilliant blue R250.In each case the recombinant protein was excised from polyacrylamide gel, recovered by electroelution, combined and concentrated using with a Centricon-10 (Amicon) to 0.5 mL and diluted to 1.5 mL with storage buffer to a final composition of 500 mM NaCl, 20 mM sodium phosphate (ph 7.4), 67 mM imidazole, 1 mM DTT, 1 mM EDTA and 30% glycerol.The recPac1 preparation was stored at -20 ºC 29,30,139 .
3.7 Synthesis and preparation of complementary RNA strands.The enzymatic assay of recombinant Pac1 was carried out according to the optimized conditions described by Rotondo and Frendewey 29 .In a previous experiment (data not shown) we amplified by PCR a fragment corresponding to the fourth intron of Schizosaccharomyces pombe β-tubuline from its total DNA and inserted the amplified fragment into pBluescript II KS (-) for further in vitro transcription purposes.The integrity of the amplified sequence and transcriptional fusion was tested by sequencing.We reproduced exactly the described assay to compare the activity of our recombinant enzyme with the results from other reports.This construction was used as a template for the PCR of fragments corresponding to transcriptional-fusion suitable for the synthesis of both complementary strands of dsRNA substrate for an in vitro transcription reaction.For this purpose the following primers were synthesized: a) 5'-gctcggaattaaccctcactaag↓ggaacGTAGGTTTTTTTGCTTTC-3' (T3 promoter in lower case, 5' end of the Schizosaccharomyces pombe β-tubuline fourth intron in upper case).b) 5'-ggtacctaatacgactcactatag↓ggagaCTACAGTCGTCAGTAC-3' (T7 promoter in lower case, complement of the 3' end of the Schizosaccharomyces pombe β-tubuline fourth intron in upper case).
The arrows indicate the transcription initiation site.The PCR products were purified and 50 ng of each was used to synthesize both complementary strands of the dsRNA Pac1 substrate.The transcription reactions were prepared in a final volume of 20 µL containing 40 mM Tris-HCl (pH 7.9), 6 mM MgCl 2 , 2 mM spermedine, 10 mM DTT, 0.5 mM of each ribonucleoside (AmershamPharmaciaBiotech), 50 µCi [α 32 P] UTP (800 Ci/mmol), 20 U RNAsin (Promega) and 20 U T3 or T7 RNA Polymerase (Amersham Pharmacia Biotech).In the case of the transcription reaction driven by the T3 promoter, the addition of 50 mM NaCl to the reaction mixture was required.In all cases the reactions were prepared on ice were then incubated at 37 ºC during 10 minutes.The resulting transcripts were treated with DNAse I (Promega), phenol extraction and precipitation with 2.5 V/V of absolute ethanol was carried out and the samples were stored overnight at -70 ºC.The complementary RNA strands were collected by centrifugation at 16 000 g during 10 minutes at 4 ºC.Finally, the pellets were washed with 70% ethanol, dried and re-suspended in diethyl pyrocarbonate treated with distilled water and stored at -70 ºC.
3.8 Preparation of dsRNA substrate for Pac1 enzymatic assay.Equimolar quantities of both complementary strands were mixed in diethyl pyrocarbonate and treated distilled water to give a final volume of 50 µL.The mixture was heated during 10 minutes at 100 ºC in a water bath.The whole bath was then firmly closed and placed into thermal box overnight to allow annealing of both complementary strands into the dsRNA substrate.The unpaired ends and RNA strands were removed by RNase A (Promega) treatment.The dsRNA substrate was purified (PAGE-TBE 15% gel) and stored in diethyl pyrocarbonate (DEPC) treated distilled water at -70 ºC.The substrate for the Pac1 assay consisted of 101 bp dsRNA, identical to the substrate used by Rotondo and Frendewey 29 .
3.9 Enzymatic assay of recombinant Pac1.The Pac1 assay was carried out using the following conditions: 30 mM Tris-HCl (pH7.6), 1 mM DTT, 5 mM of MgCl 2 , 10 nM of dsRNA substrate and different quantities (0, 1, 10, 100 nM) of purified recombinant Pac1 enzyme.Enzymatic reactions were completed on ice and started by the addition of 0.1V of 50 mM MgCl 2 , incubated at 30 ºC for 10 minutes and stopped by the addition of 500 µL of 5% ice-cooled TCA followed by 15 minutes on ice.The aliquots were centrifuged at 16 000 g during 5 minutes in a Spin-X filter unit (Costar).The soluble fractions (filtrate) were quantified by liquid scintillation counting.The counting data represent the amount of acid-precipitable polynucleotide phosphorus (dsRNA) substrate transformed into acid soluble cleavage products by Pac1 enzyme.The procedure was repeated three times with three repetitions per experiment 29,30,139 .

MMM-QSAR model to predict type III RNAses without alignment.
Many different parameters can be used to encode protein sequence information and further assign or predict the function or physical properties of proteins and their mutants 141,142 .The present approach involves the calculation of different sequence parameters based on MMs, which can be applied to different kinds of molecular graphs 131 including DNA, RNA and proteins 93,143 .MMs have been applied successfully to Genomics and Proteomics and represent an important tool for analyzing biological sequence data.In particular, MMs have been used for protein folding recognition 144 and the prediction of protein signal sequences 145,146 .MMs have also been applied to predict alpha turns 147 , beta turns 148 , as well as other tight turns and their types 149 .Particularly, MMs have been further used to predict the specificity of GalNActransferase 150 and cleavage sites in proteins by proteases [151][152][153][154] , greatly stimulating the development for drug design against AIDS and SARS [155][156][157][158][159][160][161][162][163] .In this work we calculated MMMs ( HP π k ) of the stochastic matrix that describe the distribution of the amino acids of the protein sequence in the 2D-HP graph.This calculation was carried out for two groups of protein sequences, one made up of RNase III-like enzymes and the other formed by heterogeneous proteins.This last group contains 133 members and these were selected as follows: Original data were submitted to k-Means cluster analysis as described previously.The k-MCA divided the data into four clusters containing 439, 684, 592 and 469 members, respectively.Selection was based on the distance from each member with respect to the cluster centre (Euclidean distance).We selected the closer cases to the centre in order to ensure the inclusion of representative members of each cluster in the control group.Depending on the cluster size, a proportional number of proteins were set; 27 cases were taken from the first cluster, 42 from the second, 36 from the third and 28 from the fourth to give a total of 133 members in the control group.We always bore in mind the principle of discriminant analysis in terms of balancing the size of the control group with respect to the RNase III group.
A simple MMM-QSAR was then developed to classify a novel sequence as RNase III or not.The best equation found for this purpose was: ( )

HP HP
The statistical parameters for the above equation were Wilk's statistic (λ = 0.18), Mahalanobis's distance (D 2 = 16.36) and error level (p-level < 0.001) 164 .This discriminant function misclassified only four cases out of 214 proteins used in both the training and validation series, reaching a high level of accuracy of 98.13%.More specifically, the model classified correctly 77/81 (95.06%) of RNase III-like enzymes and 100% of the control group.The respective classification matrices for training and cross-validation are depicted in Table 1.A validation procedure was subsequently performed in order to assess the model predictability.This validation was carried out with an external series of 20 RNase III-like proteins and a further 43 diverse proteins (see Table 1).The present model showed an average predictability of 100% for each group, which is remarkable in comparison to results obtained by other researchers on using the LDA method in QSAR studies [165][166][167][168] .These results are consistent with those obtained in our previous report, in which we used 2D coupling numbers as sequence descriptors for function annotation of plant metabolism enzymes 93 .In addition, we carried out a classification analysis with all of the proteins included.These results provide further evidence of the robustness of the results obtained.The Receiver Operating Characteristic (ROC) curve was also constructed for the training and validation series.Notably, the curve presented a pronounced curvature (convexity) with respect to the y = x line for both series (see Figure 4).This result confirms that the present model is a significant classifier, having areas of 0.99 (training) and 0.97 (validation) -i.e.markedly higher than 0.5, which is the value for a random classifier 169 .In this work we isolated, cloned and expressed a new Pac1 DNA sequence from Schizosaccharomyces pombe strain 428-4-1, its nucleotide and amino acid sequence was recorded on the GenBank database with accession number DQ647826.The theoretical prediction of its translated ORF as an RNase III-like enzyme was performed by the present alignment-independent approach instead of traditional alignment methods.The theoretical prediction of rPac1 as a double-stranded RNase was confirmed experimentally by in vitro assays.

Prediction.
Our Pac1 protein sequence was analyzed using the MMM-QSAR methodology with the aim of recognizing the rPac1 gene product as a eukaryotic RNase III homologue.The sequence was represented in a Cartesian 2D system and calculated including the whole data set.This particular case was included in the validation subset in order to make a prediction.The MMM-QSAR model even very simple (two variables) allowed the correctly classification of the rPac1 product as an RNase III-like enzyme with the maximum probability (p = 1).In order to make a graphical comparison between our methodology and alignment methods like BLASTp [170][171][172][173] , several representative RNase III protein sequences from prokaryotes and eukaryotes were selected together with rPac1 for representation in a 2D-mapping system (see Figure 5).The 2D-HP map protein representation revealed a significant separation for the groups consisting of dsRNases from prokaryotes (in dark grey) and eukaryotes (in light grey).The rPac1 protein (in black) is placed between the two groups, acting as a sort of link between the RNase III families.This representation possibly supports evolutionary relationships between double-stranded RNase protein sequences.Since the Cartesian 2D protein representation is mainly based on amino acid composition, we can highlight a major region from rPac1 matching eukaryote sequences (in light grey) and another small region that lies within the prokaryote region (in dark grey).There is also a non-matching region specific for rPac1 in Schizosaccharomyces pombe that does not exist in other eukaryotes.However, matching regions in the graph made a significant contribution to calculation of the spectral moments, thus allowing successful recognition of rPac1 as RNase III.
A BLASTp analysis was carried out on the translated rPac1 DNA sequence (see Figure 6).This method recognized successfully our query sequence as a Pac1 ribonuclease, reaching up to 98% of amino acid identity with others already recorded from Schizosaccharomyces pombe strains.Although this analysis showed lower scores (close to 80%) in comparison to other typical dsRNases, the approach still enabled protein query recognition as RNase III.With the aim of comparing different methods, it is possible to set an equivalence for the score value (80%) from BLASTp with our predicted probability, p = 1, for rPac1 to act as an RNase III-like enzyme.BLASTp also revealed low amino acid identity (< 40%) toward the C-terminal portion despite this representing the highest conserved region in the four existing RNase III subclasses.On the other hand, as mentioned previously, each sequence included in the study was submitted to InterProt.All cases (100%) from the RNAse III group matched significantly with RNase III domains (IPR000999), allowing the total recognition as dsRNases (see Tables ISM).In the case of the control group, six cases did not have InterProt identification and three of them did not have any hits reported (95, 50% of predictability) (see Table IISM).We also performed an alignment between the previously selected sequences in the figure 5 and our rPac1 product using the Clustal W program, version 1.81 (see Figure 7 and Figure 8).Alignment results coincide with those obtained in previous studies reported by other authors.The rPac1 showed low amino acid identity percentages in comparison to dsRNase sequences from other eukaryote organisms, even for those belonging to yeast-related species.Short and less frequent regions match along the protein sequences, especially toward the N-terminal region (see Figure 8).The comparison with prokaryote sequences showed a matching region toward the protein's C-terminal part, from the 170 up to 260 amino acid position.This region corresponds with the RNase III C-terminal domain (RIBOc), which is conserved in eukaryotic, bacterial and archeal RNase III and is associated with the catalytic activity.There is a significant N-terminal region in the Pac1 product that does not appear in the RNase III prokaryote family -a finding consistent with other reports (see Figure 7) 29 .The results found in this study confirm that our model does not replace classical method for protein function annotation like BLAST or InterProt service, but becomes an interesting alternative tool -especially due to its alignment-independence and simplicity.It is also important to highlight that our methodology can be considered as a good classifier, despite its simplicity, as it gives rise to a linear equation with two variables at most.Consequently, it is a useful method to perform a quick virtual screening of a representative protein database since the protein query submission to classical sequence classifiers is generally performed on a one by one basis.Thus, once the whole database has been screened and proteins having the desired function are recognized, it would be advisable to assess results obtained using our approach by other methodologies.The search for approaches that complement or improve on classical alignment tools like BLAST with information from gene ontology, RNA secondary structure prediction, partial ordering or other sources constitutes a goal of major importance [174][175][176][177][178] .
In order to compare the MMM-QSAR approach reported here with other methodologies based on MM, training and negative (non-RNAses sequences) sets were scored with a classic HMMs.Classification driven by an HMM built on the original training set resulted in an accuracy of 98.75% for the positives sequences (training set) and 96.24% for the negative sequences (see Table 1).Our query sequence DQ647826 was also successfully predicted with the maximum score by the HMM.

Experimental evidence for RNase III activity. Recombinant Pac1 protein from
Schizosaccharomyces pombe strain 428-4-1 was purified in order to measure its double-stranded RNase activity in vitro.The corresponding product size (45.5 kDa) coincided with the reported size for the native protein (see Figure 9).Double-stranded activity was measured in vitro by following the protocol described above.The unit definition for all RNase III types is the amount of enzyme able to solubilize 1 nmol of acidprecipitable radioactivity per hour. 17  The kinetic enzymatic reaction of rPac1 by monitoring dsRNA integrity (lanes 2-5) is illustrated in Figure 10.This particular results for the DQ647826 sequence were not carry out to validate the MMM-QSAR model but to shown how to use it for predicting RNase III-like protein function annotation.We recall that the validation of the MMM-QSAR model was assessed with the external prediction series as recommended for any QSAR (see previous sections). 179igure 10.Autoradiography of rPac1 enzymatic assay.To visualize the cleavage activity of dsRNA substrate generated by T3/T7 "in vitro" transcription, aliquots of enzymatic assay were taken at 2, 5 and 10 minutes, loaded in 12.5% PAGE/7M Urea followed by autoradiography.Lane 1 is pBR322 digested by Msp1, Lane 2 is the intact dsRNA substrate, Lane 3 to 5 are the results of rPac1 enzymatic activity at 2, 5 and 10 minutes of reaction at 30ºC.Lane 6 and 7 are the T3 and T7 ssRNA obtained by "in vitro" transcription too which are not degradated by RNAse activity of Pac1.

Conclusions
The work described here introduces a new approach to predict RNase type III function from protein sequences irrespective of sequence alignment.The methodology uses the MMMs associated with a 2D sequence representation as the input for an LDA classifier.This MMM-QSAR classifier successfully discriminates between RNase-like sequences and a control group.The Pac1 gene product was chosen as a representative example of a sequence with low amino acid identity compared to other enzymes with similar activity.The present methodology achieves high classification scores similar than bioinformatics tools based on sequence alignment (BLASTp) and comparable results to other predicting protein function annotation methods like InterProt and HMMs.The predictions made by the present model coincide with outcomes from experimental isolation, expression, and enzymatic activity measurement of a novel pac1 + gene sequence DQ647826 isolated from a new isolate Schizosaccharomyces pombe strain 428-4-1.The work opens up new possibilities for the use of the experience accumulated in small molecules QSAR in the field kind of alignment-independent sequence function annotation.

Figure 1 .
It is worth noting that 363 amino acids are rearranged in a 2D space compacting protein representation.Each amino acid in the sequence is placed in a Cartesian 2D space starting with the first monomer at the (0, 0) coordinates.The coordinates of the successive amino acids are calculated as follows: a) Increase by +1 the abscissa axis coordinate for an acid amino acid (rightwards-step) or: b) Decrease by -1 the abscissa axis coordinate for a basic amino acid (leftwards-step) or: c) Increase by +1 the ordinate axis coordinate for a polar amino acid (upwards-step) or: d) Decrease by -1 the ordinate axis coordinate for a non-polar amino acid (downwards-step).

Figure 1 .
Figure 1.2D Cartesian representation for amino acid sequence of rPac1 protein from Schizosaccharomyces pombe strain 428-4-1; GenBank Accession number DQ647826.Note that a node may contain more than one amino acid, which ensures graph compactness.

Figure 2 .
Figure 2. Schematic representation of the steps given in this work.

Figure 3 .
Figure 3. k-MCA procedure for control group design.

4. 2
Isolation, prediction and assay of a novel Pac1 from Schizosaccharomyces pombe strain 428-

Figure 6 .
Figure 6.BLASTp analysis for rPac1 protein sequence DQ647826.Note that the scale of scoring is progressive in darkness.Sequence names are not depicted.

Table 1 .
Classification results derived from the model for training and validation series