Mathematical Basis of Predicting Dominant Function in Protein Sequences by a Generic HMM–ANN Algorithm

The accurate annotation of an unknown protein sequence depends on extant data of template sequences. This could be empirical or sets of reference sequences, and provides an exhaustive pool of probable functions. Individual methods of predicting dominant function possess shortcomings such as varying degrees of inter-sequence redundancy, arbitrary domain inclusion thresholds, heterogeneous parameterization protocols, and ill-conditioned input channels. Here, I present a rigorous theoretical derivation of various steps of a generic algorithm that integrates and utilizes several statistical methods to predict the dominant function in unknown protein sequences. The accompanying mathematical proofs, interval definitions, analysis, and numerical computations presented are meant to offer insights not only into the specificity and accuracy of predictions, but also provide details of the operatic mechanisms involved in the integration and its ensuing rigor. The algorithm uses numerically modified raw hidden markov model scores of well defined sets of training sequences and clusters them on the basis of known function. The results are then fed into an artificial neural network, the predictions of which can be refined using the available data. This pipeline is trained recursively and can be used to discern the dominant principal function, and thereby, annotate an unknown protein sequence. Whilst, the approach is complex, the specificity of the final predictions can benefit laboratory workers design their experiments with greater confidence.


3 Background
The reliable annotation of genomic data is dependent on the assignment of function to protein sequences. Much of this information is gleaned from the clustering of these with existing functional groups. The presence of experimentally available data is invaluable to this effort, and in its absence the same has to be inferred from sequence data. This decomposition, into a superset of distinct functions of its constituent members (superfamily, family), is the most critical step of any clustering schema. A superfamily, by definition consists of sequences with poor, if any, sequence identity, with the simultaneous presence of one or more common fold(s). Consider the enzymes that belong to the iron Fe 2+ and 2-oxoglutarate (2OG) or -ketoglutarate (AKG) dependent dioxygenases (EC 1.14.11.x). The average intersequence identity of these enzymes (< 25%) , notwithstanding, the unifying features of these enzymes are the presence of a jelly-roll motif (Double strand -helix; DSBH), and the substrate hydroxylating triad of residues HX [DE]X n H (Clifton et al. 2006;Hausinger 2004;Koehntop et al. 2005). However, the chemical nature of the cognate substrate(s) of these enzymes and/or the reactions differs substantially, and can form smaller clusters (Kundu 2012(Kundu , 2015. Similarly, whilst the glycoside hydrolases (GHs 1-130; EC 3.2.1.x), comprise the larger set, plant GH9 endoglucanases can be further stratified into classes A, B, and C (Libertini et al. 2004;Lombard et al. 2014;Molhoj et al. 2002;Urbanowicz et al. 2007).
Whilst, the spatial arrangement of atoms of members of a superfamily dictates their biological role, differential function in a family of sequences can be attributed to the presence (native, acquired) or absence (native, excised) of specific sequence segments (classes A, B, and C of the plant GH9 endoglucanase family) and/or a limited number of amino acid residues (desaturases, demethylases, and chlorinating enzymes of the 2OG dependent dioxygenase superfamily). These regions are rarely silent, and can influence the behavior of the protein product(s) in vivo. Thus, while enzyme catalysis is dependent on conserved amino acids that form its active site geometry, generic proteins possess protein-protein, DNA/RNA-protein, transmembrane (TM), localization signals, and protein anchor-membrane domains that can influence its function. Despite the significant reduction in the dimensions of the superset to these smaller clusters, the unambiguous assertion of dominant function, remains challenging. For example, prevailing literature suggests that class B GH9 endoglucanases are the dominant forms of this family, far exceeding class C enzymes; a finding that is based on similarity to a few reference sequences (Buchanan et al. 2012;Montanier et al. 2010;Xie et al. 2013). Quantitative analyses of the differences between catalytically relevant segments of these enzymes, however, suggests that putative class C enzymes may approximate those of class B members (Kundu and Sharma 2016).
Sequence based classifiers of protein function can either be direct and deploy hidden markov models (HMMs), support vector machines (SVMs), and artificial neural networks (ANNs). Indirect indices of function range from domain comparison against existing databases such as the conserved domain database (CDD) of the national center for biotechnology information (NCBI), and the prediction of Mathematical Basis of Predicting Dominant Function in Protein… secondary structural elements (Cao et al. 2016;Frishman and Argos 1995;Kabsch and Sander 1983;Marchler-Bauer et al. 2015;Martin et al. 2005). SVMs, although exhaustive, mandates the presence of training sets with highly similar sequences (Cao et al. 2016;Frishman and Argos 1995;Martin et al. 2005). Profile HMMs (pHMMs), are global representations of a multiple sequence alignment (MSA), and encompass modular information using a system of threshold values. A major finding in work done previously, however, highlighted the insensitivity of the inclusion thresholds, despite, log-orders of difference in the E-values used (Kundu and Sharma 2016). ANNs, are weighted approximations of multiple inputs to a function, and introduce bias in their computations as a means of achieving convergence. The reduction, to a single output channel, implies that this value is intrinsically ill-conditioned with the final prediction depending on the quality of the input. The arguments vide supra, justify the use of multiple statistical methods to assign dominant function to a protein of uncertain function. A specific instance (prediction of enzyme catalysis) of this pipeline has been tested on available sequence data in sequenced green plants (Kundu and Sharma 2016).
The work presented here is a detailed exposition of the mathematics that underlies the observed specificity and accuracy of a generic HMM-ANN algorithm in predicting dominant probable function in an unknown protein sequence. Detailed proofs for all the steps and the derivation of the unique intervals both, theoretical and observed that encompass the ANN predictions are presented and are meant to offer mechanistic insights into the process of integrating several statistical methods as well as the rigor that may ensue. In addition, the definition, analysis, and the numerical computation of bounds of the participating sets and intervals are discussed in context of selecting suitable datasets and dictating the architecture of the ANN deployed. Additionally, interesting mathematical results based on the Lebesgue outer measure are discussed along with its biological relevance.

Algorithm and Results
Step 0 Data collation, pre-processing, and computational tools. Protein sequences with detailed and specific biochemical data (kinetics, structure, mutagenesis) are preferred for training the HMMs and the ANN, while the test dataset can comprise sequences with expression data, unannotated coding segments of sequenced genomes (open reading frames, ORFs), or sequences with putative function. An alignment generating tool (Structural Alignment of Proteins, STRAP; Clustal suite), and HMMER (downloadable or server-based) may be used for model building, analysis, database construction, and similarity studies (Finn et al. 2015;Gille et al. 2014;Sievers and Higgins 2014). A scripting language (R, PERL, Python, AWK) may be utilized to analyze the data and perform miscellaneous tasks such as tabulation and formatting. The specialized R-packages needed to implement the unsupervised (clustering; cluster, fpc) and supervised (ANN; nnet, neuralnet) machine learning tools utilized by this algorithm can be easily downloaded.

Mathematical Basis of Predicting Dominant Function in Protein…
Step 2 Define and enumerate the list of full length sequences that best represents each of these functions. These sequences {m ∈ G|m ∈ ℕ} , then constitute the training dataset for each predicted function (1 ≤ min (n) ≤ n) and must necessarily possess the recommended sequence suitability index (SSI > 1.00) (Kundu 2017). These could also be complemented with extant empirical data.
(1) Lemma The computed value nml of a pair-of-pairs (POP) of raw HMM scores is numerically equivalent to its z-score, i.e., nml ≃ z nml Proof Define nml , such that ∈ ℝ + ,  (0, 1) In the absence of an explicit assumption of a normal population, the mean ( ) and standard deviation ( ) are not independent, i.e., ∝ , It follows that ∃ nml l=|D| l=1 ; nml ∈ ℝ + ,  (0, 1) Step 4 The nm -values computed in Step 3 are then clustered, such that every cluster mean represents the centroid of a specific function � n ;1 ≤ min (n) ≤ n;n ∈ ℕ . These are then compared Since, the cluster means are derived from the sequence data their difference is expected to be trivial Rearranging the terms and differentiating w.r.t

Mathematical Basis of Predicting Dominant Function in Protein…
Consider the arbitrary terms nm , n(m+i) ∀i ≠ m If n(m+i) + > nm , ∈ ℝ + Similarly, if n(m+i) − > nm , ∈ ℝ + From Eqs. (9) and (10), Substituting this value in (Eq. 7) Step 5 Utilize the results in Step 4 in association with pre-computed values of the set of pairs-of-pairs for each sequence of each probable function ( nml ; 1 < min (n) ≤ n; 1 ≤ m ≤ |G|; 1 ≤ l ≤ |D|; n, m, l ∈ ℕ) and define the input ′ and output ′′ channels to the artificial neural network (ANN) (Kundu and Sharma 2016): ζ, Computed value of pairs-of-pairs of raw HMM scores of the mth sequence of the nth function; λ, Weighted ζ-score computed by the ANN; D, Composite set of pairsof-pairs of raw HMM scores of the mth sequence of the nth function.
Step 6 Define the intervals  n unique to each probable set of functions that an unknown protein sequence may be assigned to. These could be estimated directly or determined empirically prediction → ( �� nm ± ∧ nml ; , ∈ ℝ + ; 1 < min(n) ≤ n; 1 ≤ m ≤ |G|;1 ≤ l ≤ |D|;n, m, l ∈ ℕ) (Def. 3) (Kundu and Sharma 2016). Step 7 Define the bounds (a, b) of the search space by considering the countable union of the sequence of open and pairwise disjoint intervals (observed, expected) contained within the encompassing major interval �  [a,b] size l  [a,b] is then the outer Lebesgue measure m *  [a,b] of the encompassing interval.
b, max � n + n ; a, min � n − n ; σ n , Standard deviation of nth cluster; ′ n , Centroid of nth cluster.

Contribution of the Probability of Mapping the ANN-Prediction to a Distinct Partition
The effective prediction by the ANN of dominant function for an unknown protein sequence �� seq ∈ ℝ + is dependent on it being unambiguously mapped to a single numerical interval whose centroids approximate the cluster means for that particular function. Consider the closed and bounded interval of length l  [a,b] = |b − a| (Step 7; Eq. 17) and the following sequences of open and pairwise disjoint subintervals (Step 6): Consider the sequence ( ⊆  ) of uniquely observed open and pairwise disjoint subintervals:

Mathematical Basis of Predicting Dominant Function in Protein…
Proof Consider the aforementioned sets (, ) . Since, every probable function is modeled as an open and bounded interval with a centroid, and ℚ is dense in ℝ , we can always find an infinite number of rational numbers between any two real numbers, i.e., Rewriting these inequalities and continuing,

Relevance of Functional Constraints to Unambiguous Assignment of Dominant Function
The outlined protocol is expected to improve upon previous stratification attempts, both, in terms of biological relevance, as well as in the accuracy of predictions. The latter has been assessed in earlier work using the indices of precision (specificity) and recall (sensitivity) (Kundu and Sharma 2016). Whilst, the utility of collating biochemical data relevant to sequence clustering is unequivocal; the multitude of methods utilized imposes rigor in the schema. In particular, the use of the SSI ( Step 2) in tandem with empirical data can refine the selection of training sequences such that -value for each relevant sequence nm is within one standard deviation of the centroid for a particular cluster (| nm − � n | < n ) and may even converge (| nm − � n | → 0) (Steps 3 and 4) (Kundu 2017). The Chi squared data (Step 4), too, can be utilized to modify this selection such that an outlier sequence can be edited at this stage as well. The ratio of the input and output channels is critical to accomplishing convergence in an ANN (Step 5) with multiple outputs, as is its determination of the number of hidden layers. In contrast, despite a single output's risk at being ill-conditioned, the unbiased assignment of dominant function mandates its persistent use. Clearly, well partitioned (open, bounded, pairwise disjoint) intervals that encompass the inputs nml to the ANN are then a pre-requisite for efficacious prediction (Steps 6 and 7). The number of theoretical partitions nml ∈ ; nml → ∞ (Steps 6 and 7), notwithstanding, the analysis suggests that the cardinality of the superset (|A|) of probable functions that an unknown sequence may be partitioned into is important and must be considered (Step 1).
Consider the following functions f ∶ A → {|A|} K k=3 ;g ∶ D → {|D|} K k=3 (Step 1) (Table 1, Fig. 1c) q p , 1∕x, 1∕y ∈ ℚ (Set of rational numbers) x, y ∈ ℤ + (Set of positive integers) A, Raw HMM scores of all probable functions for a sequence; D, Composite set of pairs-of-pairs of raw HMM scores of the mth sequence of the nth function; τ, Probability of assigning a unique dominant function to a protein sequence.
Prediction of dominant function by this integrated algorithm is also likely to be constrained at the ANN stage, wherein, a larger number of hidden neurons may not result in any additional information. Extant literature from clinical medicine, agriculture, and academia, that have utilized ANN-based predictors suggests that the upper limit for neurons/nodes in a back-propagation (BP) ANN with 1∶1∶1 architecture is n ≅ 18 (Akbari Hasanjani and Sohrabi 2017;Kundu and Sharma 2016;Hawari and Alnahhal 2016;Teshnizi and Ayatollahi 2015;Shi et al. 2013;Yamamura et al. 2008;Zhou and Li 2007). This, in turn would imply a limit on the cardinality of the superset of all probable functions that a protein sequence might be expected to possess, i.e., 3 ≤ |A| ≤ 6; 0.166 ≤ ≤ 0.33 (Table 1, Fig. 1d).

Concluding Remarks
The HMM-ANN based algorithm accurately predicts dominant biological function of an unknown protein sequence. The detailed mathematical treatment of the various steps of this algorithm not only offers insights into the origins of this specificity, but also highlights the mechanism of integrating multiple methods into a generic functional algorithm. Additionally, it may assist investigators in preparing a computationally feasible superset, of putative function for their sequence(s) of interest. The algorithm itself, can be adapted with little effort, and uses publically available software and tools. The coding, when needed is trivial and can be accomplished with ease. The computations are self explanatory, lucid, and can be readily comprehended by biologists. The existence of upper and lower bounds may impose constraints on the selection of features/probable functions that could characterize a protein sequence. However, careful curation, inclusion of empirical data, and strict thresholds could go a long way in broadening the utility of this generic HMM-ANN algorithm.