DEMO2: Assemble multi-domain protein structures by coupling analogous template alignments with deep-learning inter-domain restraint prediction

Abstract Most proteins in nature contain multiple folding units (or domains). The revolutionary success of AlphaFold2 in single-domain structure prediction showed potential to extend deep-learning techniques for multi-domain structure modeling. This work presents a significantly improved method, DEMO2, which integrates analogous template structural alignments with deep-learning techniques for high-accuracy domain structure assembly. Starting from individual domain models, inter-domain spatial restraints are first predicted with deep residual convolutional networks, where full-length structure models are assembled using L-BFGS simulations under the guidance of a hybrid energy function combining deep-learning restraints and analogous multi-domain template alignments searched from the PDB. The output of DEMO2 contains deep-learning inter-domain restraints, top-ranked multi-domain structure templates, and up to five full-length structure models. DEMO2 was tested on a large-scale benchmark and the blind CASP14 experiment, where DEMO2 was shown to significantly outperform its predecessor and the state-of-the-art protein structure prediction methods. By integrating with new deep-learning techniques, DEMO2 should help fill the rapidly increasing gap between the improved ability of tertiary structure determination and the high demand for the high-quality multi-domain protein structures. The DEMO2 server is available at https://zhanggroup.org/DEMO/.


Supplementary Figures
 Figure S1. Global and local templates identification  Figure S2. Sliding-window procedure for domain-template alignment search and initial model construction  Figure S3. Relationship between the eTM-score/eRMSD and the actual TM-score/RMSD to the native  Tables   Table S1. Results of full-length models generated by different methods for different categories

Supplementary Texts
Text S1. Initial full-length model generation by sliding-window based alignment The initial full-length conformations are built based on the top 10 global templates and local templates selected according to TM-score h . Since the domain alignments are performed separately, the aligned regions of domains may be far away from each other. In this case, a sliding-window based procedure is employed to recreate domain alignments so that neighboring domains have the initial structure constructed from the neighboring regions of the template. Take a protein with 2 domains shown in Figure S2 as the example, the N-terminal domain of the query is first superposed at the N-terminal of the template, where C-terminal domain is superposed at all the right-hand positions of the N-terminal domain along the template sequence, but with a maximum gap of 10 residues from N-terminal domain. Next, the superposition of N-terminal domain is shifted by one residue to the C-terminal of the template and redo the C-terminal superpositions. This procedure is repeated with the N-terminal domain sliding through all positions along the templates, where C-terminal domain is always on the right hand of the N-terminal domains. To save time, the superposition is initially performed by Kabsch RMSD rotation matrix (1) on all the positions. The top-10 alignment positions with the lowest average RMSD are selected, whose superpositions are then regenerated by the TM-score rotation matrix (2). The alignment with the highest average TM-score of the N/C-domains among all the positions is finally selected for initial model construction. Here, structural superposition without gap (instead of structural alignment with gap) is performed for each comparison of query domain and template structures. The two ending terminals of 20 residues were skipped during domain sliding to further save time.

Text S2. Hybrid energy function for DEMO2 domain structure assembly
The energy function for domain assembly of DEMO2 is a sum of the ten terms: where m and n are domain index, and is the total number of domains. The first term is the inter-domain distance map: , log , , S2 where and represent the sequence length of the m-th and n-th domain, respectively. is the distance between the i-th ( for Glycine) atom in the m-th domain and j-th atom in the n-th domain, , , is the predicted probability of the distance located in the k-th distance bin, and 1 4 is the pseudo count to offset low-probability bins. In the calculation, we only consider atom pairs with probability peak located in [2Å, 20Å], and these atom pairs with predicted probabilities >0.5 in the last bin [>20 Å], which represents a low prediction confidence in [2Å, 20Å], are excluded.
The second term is the inter-domain orientations: where represents the inter-residue θ, ω, or φ angles defined in Ref. (3), , , is the predicted probability of the angle located in the k-th angle bin. The third term is the domain-domain interface contact energy: where is the confidence score of the i-th residue and j-th residue with the C distance <18 Å. A similar potential is also used to count for cross-link restraints when they are available, where is set to 1 with the distance cutoffs taken directly from the user-input cross-link data.
The fourth term is the hydrogen bond restraints. The predicted probability distribution of angles is converted into an energy potential with a similar from as the distance energy, where the potential is described as follows: is the hydrogen angle between residue pair i and j, i.e. the angle between vector ⃑ / ⃑ / ⃑ and ⃑ / ⃑ / ⃑ , which follows a probability distribution predicted by DeepPotential, / / is the probability that the angle is located at / / . The illustration of the hydrogen bond restraints is shown in (4).
The fifth term is designed to eliminate steric clashes between domains, i.e., where 3.75 Å is set as the clash distance cutoff. The sixth term is the generic domain-domain contact energy computed by: where the scale parameter depends on the hydrophobic and hydrophilic features of the residue pairs. 0.1, if both of the residues are hydrophobic (ALA, CYS, VAL, ILE, PRO, MET, LEU, PHE, TYR, TRP); 0.01, if the two residues are hydrophilic (SER, THR, ASP, ASN, LYS, GLU, GLN, ARG, HIS); or 0.05, otherwise. This energy item is used to control the inter-domain distance, which will push the two domains together if they are two far away each other.
The seventh term is the domain-domain distance profile deduced from the templates identified by TM-align, which is calculated by: For a residue pair (i and j, with i from N-terminal domain and j from C-terminal domain), is the number of templates that satisfy the following two conditions: (1) the template has both residue i and j aligned by TM-align; (2) 0.6| | 1.5| |, where and are the indexes of the aligned residues of i and j on the template. is the C distance between the residue and in the t-th template. The eighth term is the domain boundary energy is defined as , S10 where is the C distance between two consecutive domains, and 3.8 Å is the standard length of C -C bond. The nineth term is the local domain distance restraint: where , ′ represents the distance between the i-th C atom ( ) and its corresponding atom ′ in the initial structure generated in the template superposition process, and is the length of the protein. This term is to prevent the assembly deviating too much from the orientation obtained from the template.
The last term is radius of gyration restraint, defined as where is the radius of gyration of the decoy structure, and are the maximum and minimum estimated radius of gyration, respectively.

.
(L is the query sequence length) is the statistical minimum radius of gyration based on the known multi-domain protein models in the PDB. max 7.5, 0.55 is the statistical maximum radius of gyration based on the known multi-domain protein models in the PDB, where is the number of residues of the longest helix. The weighting parameters in Eq. (S1) are determined by maximizing the correlation between total energy and RMSD to the native on the structure decoys over a training set of 425 non-redundant proteins through a improved differential evolution algorithm (5,6

Text S3. Full-length structure decoy generation using rotation angles and translation vectors
According to inter-domain rotation angles ∅, , and , the rotation matrix can be calculated by cos cos ∅ cos θ sin ∅ sin cos sin ∅ cos θ cos ∅ sin sin sin θ sin cos ∅ cos θ sin ∅ cos sin sin ∅ cos θ cos ∅ cos cos sin θ sin θ sin ∅ sin θ cos ∅ cos θ

Text S4. Accuracy estimation for DEMO2 model
The accuracy of the DEMO2 assembled model is mainly evaluated by the estimated TM-score (eTM-score), which is calculated based on the convergence of the domain assembly simulations, the confidence of the full-length templates for domain assembly, the satisfaction rate of the inter-domain distances/contacts, and the estimated accuracy of the kth individual domain model, i.e., where is the total number of full-length decoys generated in the domain assembly simulations; is the number of structure decoys with RMSD <1.5Å to the kth reported full-length model; 〈RMSD〉 is the average RMSD between these decoys and the kth reported model. These terms are employed to evaluate the degree of convergence of the domain assembly simulations. TMScore is the template score (i.e., the harmonic mean of the TM-score between the domain model and the DEMO template) of the ith full-length template utilized for the initial full-length model construction, and TMscore (=0.85) is the cutoff of TplScore used to distinguish good templates from bad templates. T is the number of predicted inter-domain distances used to guide the domain assembly; and are the distances of the tth residue pair in the predicted distance map and the reported model, respectively. These terms are applied to assess how closely the distances in the reported model match the predicted distances by DeepPotential. The fourth term accounts for the domain-domain interface satisfaction rate of the predicted interface map in the reported model, where is the number of predicted domain-domain interfaces, and , is the number of overlapped interfaces between the predicted interface map and the kth reported model. is the total number of domains, and eTM˗score is the estimated TM-score of the Dth domain model by ResQ (7). 0.065, 0.063, 0.08, 0.01, 0.96, and 0.1 are the weighting factors, which are optimized using an improved differential evolution algorithm (6) to minimize the average error between the eTM-score and the real TM-score of the decoys to the native structure on the DEMO training set with 425 non-redundant multi-domain proteins.
The eRMSD is calculated by the same terms in Eq. (S13) but with an additional term ln (L is the sequence length of the target), where the weighting factors are

Text S5. RMSD, TM-score and rTM-score
The most widely used metric for assessing the accuracy of protein structure models is the root mean squared deviation (RMSD) defined by RMSD min where L is the length of the target protein or the number of compared residue pairs, is the distance between the ith pair of compared residues in the model and native structures, and 'min' indicates the rotation matrix to minimize the root mean squared deviation of the two structures (8). Because Eq. (S14) treats the distance error ( ) with equal weight over all residue pairs, a large local error on a few residue pairs (such as those in the loop or tail regions) can result in a quite large RMSD, even though the global fold of the model is correct. This renders the RMSD value more sensitive to the local error than to the global fold of the assessed model.
TM-score (9) is a metric for evaluating the topological similarity between protein structures, which can be calculated by where is the amino acid sequence length of the target protein, is the length of the aligned residues to the native structure which can be different from , e.g., in the case threading alignment with gaps/insertions, 1.24 15 1.8 is a scale to normalize the match difference, and 'max' refers to the optimized value selected from various rotation and translation matrices for structure superposition. The value of TM-score ranges in [0,1], where 1 indicates that the two structures are identical. Stringent statistics showed that TM-score >0.5 corresponds to a similarity with two structures having the same fold defined in SCOP/CATH (10).
Because is put in the denominator of Eq. (S15), TM-score naturally weights smaller distance errors more strongly than larger distance errors. Therefore, TM-score value is more sensitive to the global structural similarity rather than to the local structural errors, compared to RMSD. Another advantage of TM-score is the introduction of the scale 1.24 15 1.8 which makes the magnitude of TM-score length-independent for random structure pairs, while RMSD is a length-dependent metric (9). Due to these reasons, our discussion of modeling results is mainly based on TM-score. Since RMSD is intuitively more familiar to most readers, however, we also list RMSD values, when needed in the manuscript.
Although TM-score is a robust scale for assessing protein fold similarity due to its sensitivity to global fold, it may not appropriately assess the orientation of multi-domain structures for some cases. For a two-domain structure (domain-1 and domain-2) with ≫ , for example, the TM-score in Eq. (S15) will be dominated by the tertiary similarity of larger domain, and therefore insensitive to the orientation and quality of the smaller domain. To overcome this issue, we introduce a new score, the reciprocal TM-score (rTM-score), defined by rTM-score 1 TM-score 1 TM-score ⋯ 1 TM-score 16 where TM-score k is the TM-score for kth domain relative to the native, under the same rotation matrix of multi-domain complex structure superposition, and N dom is the number of domains. Please note that rTM-score has a similar form as TM-score h defined in Eq (1) but they have different meaning. While in rTM-score the complex model is superposed to the native structure with all domains rotated using the same rotation matrix, the TM-score h is the harmonic mean of TM-scores of different domains that are calculated independently. Therefore, the rotation matrixes are different for different domains in the TM-score h calculation, which cannot be used to assess inter-domain orientations.
The rTM-score has the value ranging in (0,1), where a rTM-score=1 is achieved if the complex model is identical to the native structure. Compared to TM-score, rTM-score is more sensitive to the domain orientation, as it will approach 0 if only one domain is identical to the native, but the orientation is completely different (i.e., TM-score 1~1 and TM-score 2~0 for a two-domain protein). In other words, a multi-domain complex model has a high rTM-score only when both the domain tertiary structure and the relative orientation are similar to the native. Here, we consider rTM-score >0.5 as of the correct complex fold. Mathematically, this corresponds to a complex model that has both domains with the similar relative orientation and the similar folds to the native (i.e., TM-score > 0.5) (10).         Tables   Table S1. Results of full-length models generated by different methods for different categories. Bold font highlights the best results from each category.