Improving protein tertiary structure prediction by deep learning and distance prediction in CASP14

Abstract Substantial progress in protein structure prediction has been made by utilizing deep learning and residue‐residue distance prediction since CASP13. Inspired by the advances, we improve our CASP14 MULTICOM protein structure prediction system by incorporating three new components: (a) a new deep learning‐based protein inter‐residue distance predictor to improve template‐free (ab initio) tertiary structure prediction, (b) an enhanced template‐based tertiary structure prediction method, and (c) distance‐based model quality assessment methods empowered by deep learning. In the 2020 CASP14 experiment, the MULTICOM predictor was ranked seventh out of 146 predictors in tertiary structure prediction and third out of 136 predictors in inter‐domain structure prediction. The results demonstrate that template‐free modeling based on deep learning and residue‐residue distance prediction can predict the correct topology for almost all template‐based modeling targets and a majority of hard targets (template‐free targets or targets whose templates cannot be recognized), which is a significant improvement over the CASP13 MULTICOM predictor. Moreover, template‐free modeling performs better than template‐based modeling not only on hard targets but also on targets that have homologous templates. The performance of template‐free modeling largely depends on the accuracy of distance prediction, which is closely related to the quality of multiple sequence alignments. The structural model quality assessment works well on targets for which enough good models can be predicted, but it may perform poorly when only a few good models are predicted for a hard target and the distribution of model quality scores is highly skewed. MULTICOM is available at https://github.com/jianlin-cheng/MULTICOM_Human_CASP14/tree/CASP14_DeepRank3 and https://github.com/multicom-toolbox/multicom/tree/multicom_v2.0.


| INTRODUCTION
Protein structure prediction aims to computationally predict the three-dimensional (3D) structure of a protein from its one-dimensional (1D) amino acid sequence, which is much more efficient and cost-effective than the gold-standard experimental structure determination methods such as X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, and cryo-electron microscopy (cryo-EM). Computational structure prediction becomes more and more useful for elucidating protein structures as its accuracy improves. 1 Two kinds of structure prediction methods have been developed: template-based modeling and template-free (ab initio) modeling. Template-based modeling (TBM) methods first identify protein homologs with known structures for a target protein and then use them as templates to predict the target's structure. 2,3 A common approach to identifying homologous templates is based on Hidden Markov Models. 4 When no significant known template structures are identified, template-free modeling (FM) is the only viable approach to build structures from protein sequences. Traditional FM methods, such as Rosetta, 5 attempt to build tertiary structure by assembling the mini-structures of small sequence fragments into the conformation of the whole protein under the guidance of statistical energy functions. Other FM tools such as CONFOLD 6 use inter-residue contact predictions as distance restraints to guide protein folding. In the 13th Critical Assessment of Protein Structure Prediction (CASP13), AlphaFold, 7 an FM method based on deep learning distance prediction, achieved the highest accuracy on both TBM targets and FM targets. Other top CASP13 tertiary structure prediction methods such as Zhang Group, 8 MULTICOM, 9 and RaptorX 10 were also driven by deep learning and contact/distance predictions.
Inspired by the advances, our CASP14 MULTICOM system is equipped with a new deep learning-based protein inter-residue distance predictor (DeepDist 11,12 ) to generate accurate contact/distance predictions, which are used by DFOLD (https://github.com/jianlincheng/DFOLD) and trRosetta 13 to construct template-free structural models. Moreover, the template-based prediction in MULTICOM is simplified and enhanced by removing redundant template-identification tools and using deeper multiple sequence alignments (MSAs) in template search, while the template libraries and sequence databases are updated continuously. In addition, 11 new features calculated from predicted inter-residue distance/contact maps are used to predict the quality of protein models in conjunction with other features in DeepRank 9 to rank and select protein models. As a result of the improvements, MULTICOM was ranked seventh in tertiary structure prediction and third in inter-domain structure prediction in CASP14.
In CASP14, AlphaFold2, an end-to-end attention-based deep learning predictor, achieved unparalleled accuracy in predicting tertiary structures. Instead of first predicting residue-residue distances from multiple sequence alignments and then reconstructing tertiary structures from the distances, it directly predicts 3D structures from multiple sequence alignments, indicating that end-to-end prediction of tertiary structures is a new direction to be pursued in the future.

| Overview of the MULTICOM system
The pipeline of MULTICOM human and automated server predictors can be roughly divided into six parts: template-based modeling, template-free modeling, domain parsing, model preprocessing, model ranking, and final model generation as depicted in Figure 1.
When a target protein sequence is received, the template-based modeling (Figure 1, Part A) and template-free modeling (Figure 1, Part B) start to run in parallel. In the template-based modeling pipeline, MULTICOM first builds the multiple sequence alignments (MSAs) for the target by searching it against sequence databases, which are used to generate sequence profiles. Then, the sequence profiles or the target sequence are searched against the template profile/sequence library by various alignment tools (BLAST, 14 HHSearch, 15 HHblits, 4 HMMER, 16 RaptorX, 17 I-TASSER/MUSTER, 18,19 SAM, 20 PRC, 21 and so on) to identify templates and generate pairwise target-template alignments. A combined target-template alignment file is generated by combining the pairwise alignments. Structural models are built by feeding the combined alignment file into Modeller. 22 In CASP14, the MULTICOM system was blindly tested as five automated servers.
The MULTICOM-CLUSTER and MULTICOM-CONSTRUCT servers used the template-based prediction system described above, which was rather slow because it needed to run multiple sequence alignment tools. To speed up prediction, the MULTICOM-DEEP and MULTICOM-HYBRID servers only used HHSearch and HHblits in the HHsuite package as well as PSI-BLAST 23 and HMMER to build sequence profiles and search for homologous templates, which makes them much faster than MULTICOM-CLUSTER and MULTICOM-CONSTRUCT. Considering that distance-based template-free modeling can often achieve high accuracy on template-based targets, we also tested the MULTICOM-DIST server predictor, which completely skipped template-based modeling and used only template-free modeling for all the CASP14 targets.
In the newly developed distance-based template-free modeling pipeline (Figure 1, Part B), DeepMSA 24 and DeepAln 11 are used to generate two kinds of multiple sequence alignments, which are used to calculate residue-residue coevolution features that are fed into different deep neural networks of DeepDist to predict the distance map, a two-dimensional matrix representing the inter-residue distances for the target protein. For some hard targets, the MSAs generated by HHblits on the Big Fantastic Database (BFD), 25,26 which contains hidden Markov model profiles of many proteins collected from metagenome sequence databases, are also used to predict distance maps. The MSAs along with predicted distance maps are used to generate ab initio models with two different ab initio modeling tools (DFOLD and trRosetta 13 ). In MULTICOM-DEEP and MULTICOM-HYBRID, distance maps and alignments generated by DeepMSA and DeepAln were also used to select templates for template-based modeling. The detailed description of the distance-guided template-free modeling and its illustration (Figure S1) can be found in Appendix S1.
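To make the distance map data structure concrete, the following minimal sketch builds one from known coordinates. It illustrates only the matrix that a predictor such as DeepDist outputs, not the predictor itself. The function names `distance_map` and `contact_map`, the toy coordinates, and the 8 Å contact cutoff are assumptions made for illustration and are not taken from the text.

```python
import math

def distance_map(coords):
    """Return an L x L matrix of Euclidean distances between residues.

    coords: list of (x, y, z) tuples, one representative atom per residue.
    """
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]

def contact_map(dmap, cutoff=8.0):
    """Binarize a distance map: 1 where distance < cutoff (8 Å is an assumed cutoff)."""
    return [[1 if d < cutoff else 0 for d in row] for row in dmap]

# Toy three-residue example (hypothetical coordinates, in Å).
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 10.0, 0.0)]
dmap = distance_map(coords)
cmap = contact_map(dmap)
```

A contact map is therefore a lossy summary of a distance map, which is one way to see why distance-based restraints carry more information than contact-based ones.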
Domain information (Figure 1, Part C) can be extracted from the target-template sequence alignments. If no significant templates are found for a region of the sequence that is longer than 40 residues, the region is treated as a template-free (FM) domain; otherwise, it is treated as a template-based domain. The sequences of the domains are fed into the same pipeline above to build models for individual domains.
DeepRank uses residue-residue contacts predicted by DNCON2 30 as input features, but DeepRank3_Cluster uses residue-residue distances predicted by DeepDist as input features. DeepRank_con shares the same deep network with DeepRank but replaces contact predictions from DNCON2 with those from DeepDist. The three QAs also use other features including 1D structural features (e.g., predicted secondary structure, solvent accessibility) and the 3D model quality scores generated by different QA tools (e.g., RWplus, 31 Voronota, 32 Dope, 33 and OPUS 34 ).
FIGURE 1 The pipeline of MULTICOM human and server protein structure predictors
Once the QAs generate the model rankings, final models are built by model combination, domain combination or model refinement (Figure 1, Part F) from top-ranked models. For full-length targets, the top five ranked models are combined with other similar top-ranked models (maximum 20 models) to generate the consensus models. If a target has multiple domains, the top five models are generated by combining domain models using Modeller 22 or AIDA. 35 For the human prediction, if the combined models deviate substantially from the original models, refinement tools (e.g., i3Drefine 36 and ModRefiner 37 ) will be used instead to refine the top-ranked models to generate the final top five models for submission.
There are several additional differences between the human predictor and server predictors. First, the inputs for the human predictor are the server models from CASP including MULTICOM server models. For some targets, additional models generated by MULTICOM servers after the server submission deadline may be added into the model pool. Model filtering and side chain repacking are applied in the human prediction before feeding the models into the quality assessment methods. Second, in the human predictor, predicted domain boundaries are adjusted based on the top-ranked models. Third, in the human prediction, the refinement tools are applied to improve the quality of top-ranked models.

| Protein model ranking
In the MULTICOM human predictor, three main quality assessment (QA) methods (DeepRank, DeepRank_con, and DeepRank3_Cluster) are applied to model selection. The methods share similar features, including 1D features from predicted secondary structures and solvent accessibility and 3D QA scores from different QA tools (i.e., SBROD, RWplus, 31 Voronota, 32 Dope, 33 46 Pearson correlation, and ORB 47 ), which are combined with other 1D and 3D features as inputs.
All the quality assessment methods apply the same two-level network architecture. The first level of the network includes 10 neural networks trained by tenfold cross-validation to predict the GDT-TS scores of input models. The output scores are then combined with the initial input features to predict the final scores by the second-level network. DeepRank, DeepRank_con and DeepRank3_Cluster were trained and tested on the models of the previous CASP experiments before they were blindly applied to the models of the CASP14 experiment.
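The data flow of the two-level architecture can be sketched as follows. This is a structural illustration only, under the assumption that each first-level network is trained with one of the ten folds held out. The helper names `tenfold_splits` and `two_level_score` and the toy stand-in models are hypothetical, and the real networks, features, and training procedure are not reproduced here.

```python
def tenfold_splits(n_items, n_folds=10):
    """Yield (train_idx, held_out_idx) pairs; each item is held out exactly once."""
    indices = list(range(n_items))
    for k in range(n_folds):
        held_out = indices[k::n_folds]
        held = set(held_out)
        train = [i for i in indices if i not in held]
        yield train, held_out

def two_level_score(features, level1_models, level2_model):
    """Score one structural model with a two-level predictor.

    level1_models: callables mapping a feature list to a predicted GDT-TS.
    level2_model: callable mapping (level-1 scores + original features) to the final score.
    """
    level1_scores = [m(features) for m in level1_models]
    return level2_model(level1_scores + list(features))

# Toy stand-ins (hypothetical): ten "networks" that differ by a small offset,
# and a second level that simply averages the ten level-1 scores.
level1 = [lambda f, b=k: sum(f) / len(f) + 0.001 * b for k in range(10)]
level2 = lambda x: sum(x[:10]) / 10
score = two_level_score([0.4, 0.6, 0.5], level1, level2)
```

The point of the second level is that it sees both the first-level predictions and the raw features, so it can learn when the first-level scores should be trusted.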

| Model refinement and combination
To improve the quality of selected top models, four different methods (model combination, i3Drefine, ModRefiner, and TM-score based combination) are applied under different circumstances in the MULTICOM human predictor. After predicting the quality scores of the input server models, a standard protocol (Figure S2) is applied to generate the final top five models. Each top-ranked model is combined with other top-ranked models (maximum 20) that are similar to the start model (i.e., GDT-TS > 0.6) to generate a consensus candidate model.
If the GDT-TS score between the consensus model and the start model is smaller than 0.9, the consensus model is discarded, and the candidate model is generated by using i3Drefine to refine the start model. ModRefiner is used alternatively if severe structural violations (e.g., atom clashes) exist in the candidate model or its secondary structures need to be further improved.
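The decision logic of this protocol can be sketched as below. The function `build_candidate` and the callables `gdt_ts`, `combine`, and `refine` are hypothetical placeholders for the real scoring, combination, and i3Drefine steps. The sketch also assumes that similar models are selected first and then capped at 20, which is one reading of the description, and it does not model the ModRefiner fallback.

```python
def build_candidate(start, ranked_pool, gdt_ts, combine, refine,
                    sim_cutoff=0.6, max_partners=20, keep_cutoff=0.9):
    """Return a candidate model for one top-ranked start model.

    ranked_pool: other top-ranked models, best first (start itself excluded).
    gdt_ts(a, b): structural similarity in [0, 1].
    combine(start, partners): builds a consensus model (hypothetical callable).
    refine(start): stands in for i3Drefine (hypothetical callable).
    """
    partners = [m for m in ranked_pool if gdt_ts(start, m) > sim_cutoff]
    partners = partners[:max_partners]
    consensus = combine(start, partners)
    # Discard a consensus that drifted too far from the start model.
    if gdt_ts(consensus, start) < keep_cutoff:
        return refine(start)
    return consensus

# Toy setting: "models" are numbers and similarity is 1 - |a - b|.
toy_gdt = lambda a, b: max(0.0, 1.0 - abs(a - b))
toy_combine = lambda s, ps: (s + sum(ps)) / (1 + len(ps))
toy_refine = lambda s: ("refined", s)

kept = build_candidate(0.0, [0.05, 0.1, 0.9], toy_gdt, toy_combine, toy_refine)
fallback = build_candidate(0.0, [0.3, 0.35], toy_gdt, toy_combine, toy_refine)
```

In the first toy call the consensus stays close to the start model and is kept; in the second it drifts below the 0.9 cutoff, so the refined start model is returned instead.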

| Prediction of the structures of multidomain proteins
In the MULTICOM system, a domain detection algorithm based on the target-template multiple sequence alignment generated by HHSearch or HHblits is applied to identify domains for multidomain proteins. Template sequences in the alignment are filtered out by their E-value (>1), sequence length (≤40), or alignment coverage (≤0.5) for the target. If no template is left after filtering, the target is identified as a single-domain template-free target. Otherwise, further analysis is applied to the filtered alignment to identify domains. If a region of the target is not aligned with a template and has more than 40 residues, it is classified as a template-free domain. All the other regions are classified as template-based domains.
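The filtering and domain-labeling rules above can be sketched as follows. The function names, the per-residue boolean representation of the alignment, and the toy input are assumptions for illustration; how aligned positions are derived from HHSearch or HHblits output is not shown.

```python
def keep_template(e_value, length, coverage):
    """Apply the three filters: drop hits with E-value > 1, length <= 40, or coverage <= 0.5."""
    return e_value <= 1 and length > 40 and coverage > 0.5

def template_free_regions(aligned, min_len=40):
    """Return 1-based inclusive (start, end) ranges of unaligned runs longer than min_len.

    aligned: list of booleans, True where the target residue is aligned
    to at least one template that survived filtering.
    """
    regions, run_start = [], None
    # A trailing True sentinel closes an unaligned run that reaches the C-terminus.
    for pos, is_aligned in enumerate(aligned + [True], start=1):
        if not is_aligned and run_start is None:
            run_start = pos
        elif is_aligned and run_start is not None:
            if pos - run_start > min_len:
                regions.append((run_start, pos - 1))
            run_start = None
    return regions

# Toy target of 190 residues: a 50-residue gap and a 10-residue gap.
aligned = [True] * 100 + [False] * 50 + [True] * 30 + [False] * 10
fm_domains = template_free_regions(aligned)
```

In this toy case only the 50-residue gap (residues 101 to 150) is labeled a template-free domain; the 10-residue gap is too short and stays part of a template-based region.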
After splitting a multi-domain target into domains, the sequence of each domain is fed to the prediction pipeline to generate structural models and the top five models for each domain are selected.
Modeller is used by default to combine the top domain models into full-length models. AIDA is used alternatively to combine domain models when the full-length model generated by Modeller has severe clashes (i.e., the distance between any two Cα atoms is <1.9 Å) or a broken chain (i.e., the distance between any two adjacent atoms is >4.5 Å). The domain-based combination models may have good GDT-TS scores for individual domains, but low scores when they are com-
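The two geometric criteria that trigger the switch from Modeller to AIDA can be checked with a few lines of code. In this sketch the function names are hypothetical, the input is a plain list of Cα coordinates, and "adjacent atoms" is assumed to mean sequence-adjacent Cα atoms.

```python
import math

def has_clash(ca_coords, min_dist=1.9):
    """True if any pair of distinct Cα atoms is closer than min_dist Å."""
    n = len(ca_coords)
    return any(math.dist(ca_coords[i], ca_coords[j]) < min_dist
               for i in range(n) for j in range(i + 1, n))

def has_chain_break(ca_coords, max_dist=4.5):
    """True if any two sequence-adjacent Cα atoms are farther apart than max_dist Å."""
    return any(math.dist(a, b) > max_dist for a, b in zip(ca_coords, ca_coords[1:]))

def choose_combiner(ca_coords):
    """Name the tool the rule would pick for a Modeller-built full-length model."""
    return "AIDA" if has_clash(ca_coords) or has_chain_break(ca_coords) else "Modeller"

# Toy chains (hypothetical coordinates, in Å).
ok_chain = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
broken_chain = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (10.0, 0.0, 0.0)]
clashing_chain = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 1.0, 0.0)]
```

The all-pairs clash check is quadratic in chain length, which is acceptable for a single model but would call for a spatial grid if many large models had to be screened.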

| RESULTS
In CASP14, both MULTICOM human and server predictors participated in the protein tertiary structure prediction. Among 92 CASP14 "all groups" domains for tertiary structure prediction, 54 domains are classified as template-based (TBM-easy or TBM-hard) domains that have some structural templates in the Protein Data Bank (PDB) and 38 as FM or FM/TBM domains that have no templates or whose templates cannot be recognized. The MULTICOM human predictor was ranked seventh among all the 146 predictors (see Table 1 for the top 20 out of 146 predictors and their total Z-scores, average TM-scores and average GDT-TS scores) on 92 "all groups" domains (https://predictioncenter.org/casp14/zscores_final.cgi) and third among all the 136 predictors (see Table 2 for the top 20 predictors' average and total Z-scores) on 10 multidomain targets (T1030, T1038, T1052, T1053, T1058, T1061, T1085, T1086, T1094, and T1101) in the inter-domain structure prediction category (https://predictioncenter.org/casp14/zscores_interdomain.cgi). After combining multiple server predictors from the same group as one entry, MULTICOM-DEEP was ranked sixth after BAKER, RaptorX, Zhang, FEIG, and Seok groups on 58 (54 "all groups" + 4 "server only") TBM domains by the assessor's formula (Table S1). The MULTICOM-HYBRID server predictor was ranked fifth after Zhang, tFold, BAKER, and Yang groups on 38 FM or FM/TBM domains according to the assessor's formula (Table S2). The performance of our human and server prediction methods is systematically analyzed in the following sections using the official evaluation data downloaded from the CASP14 website.
tFold_human, and MULTICOM) achieved higher performance than the best server predictor, QUARK (see Table 1). The average TM-score of MULTICOM on the 92 "all groups" domains is 0.6989, substantially higher than 0.5, a threshold for a correct fold prediction.
If only the top one model per domain is considered, MULTICOM predicts the correct fold for 76 out of 92 (82.6%) domains (i.e., 98% of TBM domains and 60.5% of FM or FM/TBM domains). If the best of the top five models for each domain is considered, the success rate increases to 84.8% (i.e., 98% of TBM domains and 65.8% of FM or FM/TBM domains).
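The reported percentages can be cross-checked with simple arithmetic. In the sketch below, the per-category counts (53 of 54 TBM domains, and 23 or 25 of 38 FM or FM/TBM domains) are not stated in the text; they are inferred here as the only integer counts consistent with the reported percentages and the total of 76, and should be read as an assumption.

```python
def success_rate(n_correct, n_total):
    """Percentage of domains with a correct fold (TM-score above 0.5), to one decimal."""
    return round(100.0 * n_correct / n_total, 1)

# Inferred counts (assumption): 53/54 TBM domains in both settings,
# 23/38 FM or FM/TBM domains for top-1 models, 25/38 for best-of-five.
tbm_rate = success_rate(53, 54)
fm_top1_rate = success_rate(23, 38)
fm_best5_rate = success_rate(25, 38)
top1_rate = success_rate(53 + 23, 92)
best5_rate = success_rate(53 + 25, 92)
```

Under these inferred counts, moving from top-1 to best-of-five adds two correctly folded FM or FM/TBM domains and no TBM domains, which matches the unchanged 98% figure for TBM domains.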

| Performance of MULTICOM human predictor
TABLE 1 Top 20 predictors in CASP14 tertiary structure prediction ranked by Z-score calculated from GDT-TS
TABLE 2 Top 20 predictors in the inter-domain structure prediction ranked by Z-score based on F1 score + Z-score based on Jaccard score + Z-score based on best of contact agreement score
Figure 4 illustrates the predicted structures and
This example shows that more care needs to be taken for large multidomain proteins in template-free modeling and it is useful to incorporate some template-based distance information into the distance-based free modeling.
(Table S3). In Figure 7, the GDT-TS scores of the
MULTICOM's quality assessment failed to select good start models from the model pool. One reason causing the failure is the number of good-quality models in the model pool is low and the distribution of TM-scores of the models for these targets is highly skewed. In

| DISCUSSION
It is known that template-free (ab initio) modeling generally works much better than template-based modeling on FM targets that have no templates. To confirm this, we compared the quality of top-1 template-based models predicted by MULTICOM-HYBRID Table S5), while the other 11 domains (T1052-D1, T1052-D2, T1061-D3, T1091-D1, T1091-D2, T1091-D3, T1091-D4, T1085-D1, T1085-D3, T1086-D1, and T1086-D2) FM_Set, but the difference is not significant (P-value = .4192 according to the Wilcoxon signed rank test). Similarly, the average TM-score of the top-1 models for TBM_Set is 0.607, which is higher than 0.5514 for FM_Set, but the difference is not significant (P-value = .1795 according to the Wilcoxon signed rank test). The comparison on these two relatively small datasets (i.e., the sample size of each set = 10) seems to suggest that the existence of homologs for the TBM domains in DeepDist's training dataset may make some insignificant contribution to the increase in the prediction accuracy.
The two analyses above together indicate that the accuracy of the template-free structure prediction and contact/distance prediction for a target is largely influenced by the quality of its multiple sequence alignment and, to a lesser extent, may also be affected by whether it has some homology with the proteins used to train the distance predictor.
Although it was not clear to us prior to the CASP14 experiment that our template-free modeling would work better than the template-based modeling on both TBM and FM targets, it is interesting to see that MULTICOM-HYBRID selected a template-free model as the top-1 model for most CASP14 (TBM, FM, and FM/TBM) targets. It selected a template-based model as the top-1 model only in some cases where a very significant template was found for a target (e.g., the e-value of the best template hit < E-20). Generally, MULTICOM-HYBRID prefers FM models over TBM models regardless of the type of the targets.

| CONCLUSION
We developed the MULTICOM protein structure prediction system for the CASP14 experiment and evaluated and analyzed its performance on the CASP14 targets. We demonstrate that the distance-based template-free prediction empowered by deep learning significantly improves the accuracy of protein tertiary structure prediction.
The approach can work well on both template-free and template-based targets and therefore can be applied to elucidate the structures of many proteins without known structures in a genome. However, the quality of template-free modeling critically depends on the quality of deep learning-based residue-residue distance prediction, which in turn depends on the quality of the multiple sequence alignment. In contrast to the substantial improvement in template-free structure prediction, there is little improvement in protein model quality assessment in our CASP14 system over the CASP13 methods. The quality assessment methods using more accurate residue-residue distance prediction features did not perform better than the quality assessment method using only residue-residue contact prediction features, suggesting that better methods of using distance predictions in quality assessment are needed. Moreover, domain prediction plays an important role in both model generation and evaluation. Accurate domain prediction can help generate better tertiary structure models and select better predicted models for some multidomain targets.

ACKNOWLEDGMENT
Research reported in this publication was supported in part by two NSF grants (DBI 1759934 and IIS 1763246), an NIH grant (R01GM093123), and three Department of Energy grants (DE-SC0021303, DE-SC0020400 and DE-AC05-00OR22725).

PEER REVIEW
The peer review history for this article is available at https://publons.com/publon/10.1002/prot.26186.

DATA AVAILABILITY STATEMENT
Data sharing is not applicable to this article as no new data were created or analyzed in this study.