Protein alignment based on higher order conditional random fields for template-based modeling

The query-template alignment of proteins is one of the most critical steps of template-based modeling methods used to predict the 3D structure of a query protein. This alignment can be interpreted as a temporal classification or structured prediction task and first order Conditional Random Fields have been proposed for protein alignment and proven to be rather successful. Some other popular structured prediction problems, such as speech or image classification, have gained from the use of higher order Conditional Random Fields due to the well known higher order correlations that exist between their labels and features. In this paper, we propose and describe the use of higher order Conditional Random Fields for query-template protein alignment. The experiments carried out on different public datasets validate our proposal, especially on distantly-related protein pairs which are the most difficult to align.


Introduction
Proteins carry out most of the work in living cells and their functions (structure, enzyme, messenger, . . .) are largely determined by their three dimensional (3D) structure which in turn is determined by the amino acid sequence [1]. Decoding the rules of these sequence-structurefunction relationships is one of the most important open problems in science, not only in order to accelerate the understanding of the molecular functions in life, but also for proteinbased biotechnologies and drug discovery [2]. Due to the high cost of the experimental methods (X-ray crystallography, nuclear magnetic resonance, . . .), the rate at which new protein sequences become available is much faster than the rate at which their structure and function are known [3]. Machine Learning techniques are bringing about new methods that fast and accurately predict the function [4,5] and structure [6,7] of proteins. In this paper we will focus on structure prediction methods which can be currently classified into one of the following two approaches [8]: 1) Free Modeling (FM) and 2) Template-Based Modeling (TBM).
Despite the great progress currently made on FM methods (mainly due to the incorporation of co-evolutionary information [2,7,9,10]), FM remains computationally expensive (particularly for long-length proteins) and most of the servers for protein structure prediction a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 respectively. A = a 1 ! a 2 . . . ! a L A denotes an alignment of length L A where each label or state a i can be M (match), I q (query insertion) or I t (template insertion). One alignment can also be represented as a sequence of alignment positions (x 1 , y 1 ) ! (x 2 , y 2 ). . . ! (x L A , y L A ) where every position indicates the query and template residue indexes on the alignment matrix, respectively. As

Probability of one alignment
The order of a CRF is equal to the maximum number of previous labels considered. Taking into account a 1st-order CRF, let C u ! v (x i , y i ) be the scoring function of a small alignment region with a label transition u ! v at alignment position i. This function depends on the query-template feature vectors qt but for the sake of simplicity we omit them. The scoring function can be expressed as the exponential of the sum of a node log-factor φ and an edge log-factor f: The log-factors relate different label transitions with the features and they can be any kind of functions. Here, they are Neural Networks as in [16] (for further details see Sec. 4.3). In total there are 12 (9 + 3) independent log-factors as we have 3 labels. Fig 1a depicts the log-factors (NNs) used to compute the scoring function of a label transition I q ! M at position (3,2). We can see that the log-factor NN I q ! M is fed by a final feature vector which is formed by the concatenation of 2 feature vectors: qt I q ð2; 1Þ and qt M (3, 2) (see Sec. 4.1). In a 1st-order CRF, the conditional probability of one alignment A given the query and template features (qt) and the parameter vector of all the log-factors (in our case NNs parameters θ) can be expressed as follows: As can be seen, it is obtained as the product of the scoring functions normalized by a factor Z (known as partition function) that allows to properly compare different alignments. In the next section we will explain how to efficiently compute Z.

Forward-backward algorithm for Z function calculation
In this section we will show how to determine the partition function Z by means of the forward-backward algorithm [8]. Since we have 9 label patterns (M ! M, M ! I q , . . ., I t ! I t ) we have to compute 3 forward and 3 backward terms (they correspond to the different suffixes and prefixes of these label patterns [24]) at every (x, y) position as follows: where u 0 (or v 0 ) of the summations goes over the 3 labels. The previous (x p , y p ) (or the next (x n , y n )) position is determined by the current (x, y) position and the v (or v 0 ) label of the scoring function C as follows: For example, given (x, y) and C M!I t , then (x p , y p ) = (x, y − 1) (and (x n , y n ) = (x, y + 1)). The partition function can then be computed as: where the initializations are: α v (0, 0) = 1 and β u (L q , L t ) = 1. The corresponding derivative of these formulas with respect to each one of the parameters θ of the log-factors, i. e. @a v ðx;yÞ @y , @b u ðx;yÞ @y and @Z @y , can be obtained as detailed in reference [8].

Training procedure
Given a set of N training alignments, the objective is to maximize the probability of observing them (maximum likelihood training [16]). In order to do so we are going to minimize the following loss function: where θ is the parameter vector of length L θ (Sec. 2.1), log p(A n |qt n , θ) is the log probability of one training alignment A n (Eq 2) and l 2 is a regularization term. We use a stochastic gradient descent with learning rate η and momentum γ in order to minimize the loss function and to find the best parameter vector. We do not split the training set in batches, so we compute the average over the full training set of the loss function derivative l (θ) (Eq 7) with respect to every parameter in order to reestimate the new parameters at every training epoch. This derivative can be obtained by means of the derivative of the scoring function and of the partition function (see Sec. 2.2).

Test
In the test stage we implement the Viterbi algorithm, which consists of a maximization and a backtracking step, in order to find the optimal global alignment (x 1 , y 1 ), (x 2 , y 2 ), . . .(x L A , y L A ) given the scoring functions C u ! v (x, y).
In the maximization step, we recursively compute 3 cumulative scores (δ v ) and 3 optimal alignment subsequences of labels (F v ) at each (x, y) position. We replace the sum operator of Eq (3) by the maximum operator as follows: Every F v (x, y) records an optimal alignment subsequence of the type u Ã ! v.
In the backtracking step, the alignment starts at the corner position (x L A , y L A ) = (L q , L t ) and at the "move" a L A with the highest cumulative score at that position: . Then we apply the following recursion: given a current position (x i , y i ) and a current move a i , the previous position (x i−1 , y i−1 ) can be determined by a i using Eq (5). The previous

Alignment with a higher order CRF
We will show now how to extend the 1st-order CRF alignment, described in Sec. 2, to a 2ndorder CRF alignment building on the HO-CRF formulation presented at [24,30]. The extension to higher orders or even to a sparse CRF alignment could be derived in a similar way.

Probability of one alignment
The scoring function of a 2nd-order CRF alignment can be expressed as: If we compare it with the 1st-order CRF scoring function (Sec. 2.1) we can now observe a new 2nd-order log-factor (χ) that models a longer alignment subsequence (see Fig 1a and 1b where the alignment subsequences of C are indicated by continuous lines). This increases the expressive power of our CRF at the expense of an increase in the number of parameters. Compared to the 1st-order CRF, not only the number of log-factors (39 = 27 + 9 + 3) has increased but also the new 2nd-order log-factors are more complex as they are fed by larger size feature vectors (see Sec. 4.1). Nonetheless, we will see in Sec. 5.1 that this increase in the number of parameters does not produce overfitting. Then, in a 2nd-order CRF, the probability of one alignment can be now expressed as follows:

Forward-backward algorithm
Since we now have 27 label patterns we have to compute 9 forward and 9 backward terms (corresponding to the different suffixes and prefixes of these label patterns [24]) at every (x, y) position as follows: The previous (x p , y p ) (or the next (x n , y n )) position is determined by the current (x, y) position and the v (or v 0 ) label of the scoring function as in Sec. 2.2. The partition function can now be computed as: where the initializations are: α u ! v (0, 0) = 1 and β t ! u (L q , L t ) = 1. The corresponding derivative of these formulas with respect to each parameter θ of the log-factors can be obtained in a similar way as in reference [8].

Training
The loss function used for the 2nd-order CRF is the same one as that employed in the 1storder CRF, see Eq (7), but now using the 2nd-order CRF probability (Eq 11) instead. The same stochastic gradient descent method is also used to find the minimum. The derivative with respect to every parameter of the loss function requires the derivatives of the partition function and of the forward terms at the corner position (see Sec. 3.2). In order to do so, we need to compute the forward and the scoring function derivatives at all positions which is quite memory demanding and time consuming due to the high number of parameters and positions (specially in the 2nd-order CRF). In order to reduce computational resources, it has been implemented in a diagonal recursive way, i. e. we compute all the derivatives of one alignment matrix diagonal at every iteration. The memory is reduced by maintaining only the previous diagonals required to compute the current diagonal. The computational time is reduced because by vectorization we parallelize the computations of diagonal elements which are independent of each other.

Test
A similar Viterbi algorithm to that described in Sec. 2.4 has been implemented for the HO-CRF but now in the maximization step, we find the 9 cumulative scores (δ u ! v ) and 9 optimal alignment subsequences F u ! v (x, y) at each (x, y) position. We then replace the sum operator of Eq (12) by the maximum operator as follows: Every F u!v (x, y) now records an optimal alignment subsequence of the type t Ã ! u ! v.
In the backtracking step, the alignment starts at the corner position (x L A , y L A ) = (L q , L t ) and at the "move" with the highest cumulative score at that position: . Then we apply the following recursion: given a current position (x i , y i ) and a current move a i−1 ! a i the previous position (x i−1 , y i−1 ) can be determined by a i using Eq (5).
The recursion finishes when (x i , y i ) is out of the alignment matrix.

Features
As in [16,18], we extract evolutionary and structural features using publicly available software. In particular, we use the program buildFeature of CNFPred to compile all the evolutionary and structural features in one file (.tgt and .tpl for the query and template proteins, respectively). In turn, buildFeature calls several external programs that are described next. PSI-BLAST [14] on the Non Redundant databases NR70 and NR90 to generate the position specific scoring matrix PSSM for the template and the position specific frequency matrix PSFM for the query. HHPred together with HHMake also on these databases to obtain the HMM query and template profiles. PDB-Tool [31] for the 3-class Secondary Structure (SS3) and Solvent Accessibility (SA) values of the template and PSIPRED [32] together with RaptorX-ACC [33] for SS3 and SA state probabilities estimation of the query. Finally, DISOPRED [34] is used to set to 0 all structural features when disordered regions are detected.
The features used for each CRF label are then the following ones: • Features for a match label (M): sequence profile similarity (PSSM-PSFM) [18], sequence similarity using BLOSUM50 [13] and SS3 and SA match score [18] of the query-template residues at the current position.
• Features for a query-insertion label (I q ): context hydropathy count [20], Relative Solvent Accessibility (RSA) and SS3 values of the template residue at the current position.
• Features for a template-insertion label (I t ): context hydropathy count, RSA estimation and SS3 estimation of the query residue at the current position.
The RSA estimation of a query residue at position x is computed based on the probabilities of being at a buried (B), medium (M) or exposed (E) state at that position as follows: where the cutoff values 10 and 42 are chosen as in RaptorX. Compared to [16,18], we are using a smaller number of features and all of them are mean and variance normalized in order to accelerate the training convergence. The features corresponding to positions out of the alignment matrix are set to zero. We have just described the composition of the feature vector associated to each type of CRF label. Then the final log-factor feature vector has to be obtained. As an example, let us consider the log-factor NN M!I q !M of Fig 1b. It is fed by the feature vectors qt M (1, 1), qt I q ð2; 1Þ and qt M (3, 2) that are concatenated obtaining a final log-factor feature vector of size 11 = 4 + 3 + 4.

Datasets
We build our dataset from the LINDAHL benchmark (see [35] for description and availability on the Web). LINDAHL contains 7438 positive pairs, i. e. protein pairs which share the same fold (structurally-related proteins). From these pairs, we randomly select 50 (12,7,31) for training, 50 (12, 2, 36) for validation and the rest 7338 (1622, 2121, 3595) for our in-house test set called POSLINDAHL. In parenthesis we indicate the number of pairs resulting at the Family, Superfamily and Fold levels. The pairs selected for the training and validation sets are limited to pairs in which query and template sequence lengths are smaller than 100 amino acids. We establish this limitation due to the training complexity of the 2nd-Order CRF alignment (Sec. 3.3 and 4.3). Each protein pair of the POSLINDAHL set shares around 16 (22,15,13)% of sequence identity (SeqID).
In addition to the POSLINDAHL test set, we will evaluate our algorithms with two more public datasets: PROSUP ( [36] for description and availability) which consists of 127 pairs (16% of SeqID) and SALIGN (test-set) ( [37] for description and availability) which consists of 200 pairs (20% of SeqID). In Sec. 5.2 we will give more details about the difficulty of these datasets by means of the TM-Score.
In all cases, the structure alignment tool DeepAlign [38] provides us the reference alignment. As we are interested in global alignment, only the region between the first and the last aligned residues provided by this tool are employed in our experiments. For the test sets, the [minimum, average, maximum] of the sequence lengths and the length difference of the pairs (|L q − L t |/max(L q , L t )%) are [4,137,527

CRF parameters
We describe next how the values of the 1st and 2nd-order CRF parameters are optimized by means of our training and validation sets. 21 factors (out of the 39 log-factors) are a Neural Network with one hidden layer of 12 neurons without bias as in [16], the remainder are just constant values which do not depend on the features. These constant log-factors correspond to the label transitions that appear in the training set with a probability smaller than 0.002 and for which there is not enough data to train them properly. In this way we reduce the number of parameters to be trained which are now L θ = 901 and 2941 for the 1st and 2nd-order CRFs, respectively. For the sake of simplicity we use the same values for the stochastic gradient descent parameters in the 1st and 2nd-order CRFs. They are: l 2 = 0.001, η = 0.01 and γ = 0.5. Fig 2 shows the loss function, and the alignment accuracy of the 1st and 2nd-order CRFs at every epoch of the training set. Each epoch of the 1st and 2nd-order CRFs takes in our computer (Intel(R) Core(TM) i7-4790 CPU 3.60GHz) around 1 and 8 minutes, respectively. The 1st and 2nd-order CRF curves cannot be directly compared as they can converge at different speed, however we can see that, in general, the 2nd-CRF performs better than the 1st-CRF. By means of the validation sets we selected our best 1st and 2nd-order CRFs models from epochs 93 and 99, respectively.

Baseline software tools
We compare our HO-CRF based alignment with the following alignment methods: • The state of the art CNFPred [16] based on Conditional Neural Fields of order 1. Compared to our 1st-order CRF, CNFPred employs a wider variety of features such as Secondary Structure with 8 states (SS8), a position specific gap probability, context-specific potentials (of size 11) and other strategies such as geometrical constraints in the Viterbi search [39]. Moreover, it has been trained on a larger dataset of 1010 protein pairs and with a more effective loss function that directly maximizes the TM-Score.
• The state of the art alignment method HHAlign [40] based on HMM profiles alignment [15]. We obtained the HMM profiles with the program HHPred of buildFeature (see Sec. 4.1) and then we ran HHAlign with the options '-loc -mact 0.1' (denoted as loc in future comparisons) and '-glob' (denoted as glob). These two options provided the best results among other tried.
• CONTRAlign [20], the first CRF alignment method proposed and based on linear factors, with its default hydropathy options. Compared to our 1st-order CRF, both use the same global Viterbi algorithm (Sec. 2.4). The only difference lies in the scoring functions (Eq 1) and, consequently, in the log-factors which are more complex in 1st-order CRF. For example, the edge log-factor f M!M (x, y) of 1st-order CRF is an NN that depends on the current (x, y) structural and evolutionary features while in CONTRAlign it is a constant value which does not depend on the current features.
• The global alignment NWAlign of MATLAB (with its default parameters) based on the Needleman-Wunsch [12] algorithm and the BLOSUM50 matrix [13]. Table 1 shows the reference dependent accuracies (%) [16,18], defined as the percentage of correctly aligned query amino acids judged by the reference alignments, for several alignment methods on different test sets. A query amino acid is correctly aligned when the template amino acid associated by the method is the same as the template amino acid associated by Dee-pAlign. For the POSLINDAHL set we show the average and the Family, Superfamily and Fold (Fa., Su., Fo) level accuracies. 1st and 2nd CRF-Align, HHAlign (loc) and (glob), and CNFPred derive their features from the same .tgt and .tpl files mentioned in Sec. 4.1 in order to have a fairer comparison. We can observe that the two best techniques are 2nd-CRF-Align and CNFPred. On one hand, 2nd-CRF-Align performs best on POSLINDAHL (51.92%), particularly at Fold level (37%) which is the most difficult set (composed of distantly-related proteins). On the other hand, CNFPred obtains the best results on SALIGN and PROSUP. If we now take all the 7665 tested proteins (POSLINDAHL + SALING + PROSUP), we find that the percentage of cases in which 2nd-CRF-Align outperforms CNFPred is 51.96% (where 219 equal results are discarded). In order to test the significance of this result we have applied a Wilcoxon signed-rank test at the 0.05 significance level. The result of the test allows us to reject the null hypothesis (i. e., that the difference between 2nd-CRF-Align and CNFPred results is zero median) with a pvalue = 4.9 × 10 −12 < 0.05, and to infer that the already mentioned percentage, in which 2nd-CRF-Align outperforms CNFPred, is statistically significant.

Results on alignment
If we carry out a similar comparison with HHAlign (loc) we obtain that the percentage of cases in which 2nd-CRF-Align outperforms HHAlign (loc) is 67.25% (p-value = 0). The poor performance of HHAlign (loc) at Fold level (17%) can be explained due to the fact that, as we mentioned in the Introduction, HHAlign heavily depends on the sequence profile which becomes sparse and not well estimated at this level. In addition, HHAlign (loc), compared to CNFPred and 2nd-CRF-Align, does not incorporate structural features which are very useful at Fold level. The comparison between 1st-CRF-Align and 2nd-CRF-Align confirms our initial hypothesis that the incorporation of a higher order scoring function can help in the alignment. Although 1st-CRF-Align shares the same CRFs architecture as CNFPred, the additional features and strategies introduced by CNFPred (and mentioned in Sec. 4.4) can explain why 1st-CRF-Align performs worse in general. In fact, if we increase our training set to 100 protein pairs and we consider a context of size 7 for the match label features, the results obtained by 1st-CRF-Align are now 49.70 (69, 45, 35) for POSLINDAHL, 65.50 for SALIGN and 64.26 for PROSUP, i. e. an absolute improvement of almost 3% on average. All of this suggests that an increase in feature and training sizes would improve the results of 1st-CRF-Align and, consequently, of 2nd-CRF-Align as well. However, we have avoided this increase on the 2nd-CRF-Align technique for two reasons. The first one is that the computational cost of the training stage would be very high (see Sec. 4.3) and the second reason is that, despite the small training set (50 proteins), our 2nd-CRF-Align technique is not overfitted to the training set (see Sec. 3.1) and obtains competitive results compared to the other state of the art methods.

Results on structure prediction
The results of the CRF based techniques of Table 1 could be biased towards the reference alignment (DeepAlign) as the other methods are not trained on it. In order to avoid this possible advantage, as also done in [16], we will as well evaluate the 3D-model of the query protein predicted from the query-template alignment. We will use the software MODELLER [41] which takes both the alignment and the template PDB file and estimates the query PDB file. The quality of this estimation is measured by the TM-Score [42]. This score compares the estimated with the true query PDB file and it has a value ranging from 0 to 1, indicating the worst and best model quality, respectively. The true query PDB file employed in the comparison is the one comprised between the first and the last aligned residues of the reference alignment provided by PDB-Tool [31]. Table 2 shows the cumulative TM-Scores (or reference independent alignment accuracies) for different methods and datasets.
Analyzing Table 2 we can reach the same conclusions as those derived from Table 1. It is interesting to point out that now the 2nd-CRF-Align technique, apart from obtaining outstanding results at Fold level, it performs as CNFPred on the POSLINDAHL (at both Family and Superfamily levels) and PROSUP datasets. From a similar Wilcoxon test we can infer that 2nd-CRF-Align outperforms CNFPred and HHAlign (loc) in 57.63% (p-value = 8.7 × 10 −107 ) and 69.38% (p-value = 0) of the cases, respectively.
The performance comparison between the 1st and 2nd-CRF-Align methods finally corroborate the benefits of using a HO-CRF for the alignment. The last row of the table (DeepAlign Protein alignment based on HO-CRFs for TBM (oracle)) shows the results that we would obtain if the reference structure alignment was used and it gives an idea of the maximum performance that we could achieve in 3D structure prediction by alignment methods. If instead of obtaining the cumulative TM-Score of DeepAlign (oracle), we compute the average per protein TM-Score, we obtain 0.57 (0.66, 0.57, 0.49) for POSLINDAHL, 0.76 for SALIGN and 0.68 for PROSUP. These numbers, together with the sequence identity percentage indicated in Sec. 4.2, characterize the difficulty of the different sets.
These results allow us to reach the following conclusion: when the protein pairs are closely related, as on SALING (TM-Score = 0.76), 2nd-CRF-Align gives competitive results but CNFPred is better. However for distantly-related pairs, as at the Fold level in POSLINDAHL (TM-Score = 0.49), the best predictor is 2nd-CRF-Align.

Conclusions
In this work, we have described how to carry out query-template alignment of proteins using a Higher Order Conditional Random Field (HO-CRF). We have based our proposal on previous developments regarding the use of first order CRFs for protein alignment [16,18,20] and the formulation of HO-CRFs [23,24,30]. The experimental results on different public datasets of structurally-related proteins have shown that there exist higher order correlations between the features and the labels of an alignment and that the use of HO-CRF makes sense, especially when the proteins of the pair are more distantly related (i. e. the most difficult tasks). Although this paper has focused on 2nd-order CRF alignment, the extension to higher order alignments would be direct. However, this extension would require an increase in the training size to avoid overfitting and, consequently, a GPU implementation in order to accelerate HO-CRFs training. In addition, the exploitation of sparsity aspects and the incorporation of co-evolution information (as in MRFAlign [10]) are suggested as promising research directions for future work.