A unified approach to sequential and non-sequential structure alignment of proteins, RNAs, and DNAs

Summary Many distantly related structure pairs exhibit structural similarities that can only be fully captured by a non-sequential alignment program. We present US-align2, a unified protocol for both sequential and non-sequential alignment of proteins and nucleic acids. On manually curated reference alignments for protein structural pairs with non-sequential relations, US-align2 achieves ≥13% higher agreement with reference alignments than existing sequential and non-sequential alignment methods. Non-sequential alignments also enabled US-align2 to have higher sensitivities in detecting RNA pairs from the same family with sequence identities <40%, obtaining ≥9% higher area under the receiver operating characteristic curve than third-party programs. The unique ability of US-align2 to parse both proteins and nucleic acids allows the method to detect protein-RNA and protein-DNA mimicries. Additionally, US-align2 performs full and semi-non-sequential alignments with at least 48% and 14% faster speed than existing programs for the same tasks, making it particularly useful for large-scale structural similarity detection.


INTRODUCTION
Structural similarities often imply functional similarities and evolutionary relatedness, and therefore many programs have been developed to perform tertiary structure alignments between pairs of macromolecules. Some of the commonly used structure alignment methods include DALI (Holm and Sander, 1993), CE (Shindyalov and Bourne, 1998), SPalign (Yang et al., 2012), and TM-align (Zhang and Skolnick, 2005) for proteins, as well as RMalign (Zheng et al., 2019), ARTS (Dror et al., 2006), and RNA-align (Gong et al., 2019) for RNAs. These programs align a pair of structures in sequential order. Mathematically, in a sequential (SQ) alignment, for any residue pair i and j from Structure A that are aligned with residues i' and j' in Structure B, respectively, where i < j, i' is always < j'.
While this sequentiality condition allows for efficient implementation by dynamic programming, it also makes the resulting alignment less relevant for pairs of molecules with a 3D architectural similarity that can only be described by a non-sequential relationship. In fact, it was previously estimated that rearrangement of fragments that can only be identified through non-sequential (NS) alignment is present in at least 17.4% of all structurally similar protein pairs (Abyzov and Ilyin, 2007). The most frequently reported NS alignments are for circular permutation, where the N-terminal portion of one protein is aligned to the C-terminal portion of another protein. More sophisticated NS cases also exist where structure fragments are swapped without circular permutation (Abyzov and Ilyin, 2007). Moreover, in certain local structure contexts, such as a comparison of binding interfaces (Brylinski, 2014) or in the case of molecular mimicry between proteins and nucleic acids (Cui et al., 2015), NS alignment is preferred over SQ alignment, as we are more interested in the similarity of overall shape regardless of sequentially.
There are two main types of algorithms for NS alignment. The first is a semi-non-sequential (sNS) alignment, which preserves the sequential order within aligned fragment pairs, usually being secondary structure elements, while regions connecting these fragments pairs are aligned non-sequentially. Typical methods in this category include GANGSTA+ (Guerler and Knapp, 2008), MASS (Dror et al., 2003), and FlexSnap (Salem et al., 2010). The second type is a fully non-sequential (fNS) alignment, also known as a sequence-order-independent alignment. In fNS alignment, the atomic structure of a protein backbone Despite previous advances in NS alignment of protein structures, many challenges remain. First, there is no NS algorithm for full-length alignment of RNAs or DNAs. Second, there is no available NS algorithm for the alignment of different biomolecular types. For example, protein and nucleic acid molecules cannot be aligned for quantitative molecular mimicry detection. Third, due to much larger search space, an NS alignment program is typically at least twice if not several times slower than a SQ alignment program that uses a similar scoring function. This makes large-scale NS alignment computationally prohibitory.
To address these challenges, we present US-align2, which performs SQ, sNS, and fNS structure alignment for both proteins and nucleic acids using a unified scoring function, i.e., the TM-score (Gong et al., 2019;Zhang and Skolnick, 2004), which is independent of molecule length. US-align2 is an extension of US-align (Zhang et al., 2022), which we previously developed for SQ alignment of proteins, nucleic acids, and macromolecular complexes. The US-align2 algorithm not only offers a faster and more accurate NS alignment with greater structure overlap than previous methods but it is also the first method for NS alignment of full-length RNAs. Additionally, US-align2 implements all functionalities of the original US-align program.

RESULTS
Non-sequential alignment of hard-to-align protein structure pairs US-align2 and eight existing programs for NS and SQ alignments were first tested on the RIPC dataset (Mayr et al., 2007) of pairwise protein structure alignment. Different from many existing datasets for reference alignments of protein structures, such as HOMSTRAD (Mizuguchi et al., 1998), FSSP (Holm et al., 1992), and SABmark (Van Walle et al., 2005), which were generated by automated protein SQ structure alignment programs, the RIPC reference alignments are manually curated. Additionally, the protein pairs from RIPC reference alignments are hard to align due to repetitions, large insertions/deletions, circular permutations, and/or conformational changes. The expert curations and specific focus on NS relation among structure pairs make the dataset ideal for testing NS methods. Similar to a previous study (Brown et al., 2016), three reference alignments were excluded as they align pairs of proteins with identical sequences. The remaining 20 reference alignments were for protein pairs with sequence identity <30%.
As per the previous study (Brown et al., 2016), performance was measured from both reference-dependent and reference-independent metrics. There are two reference-dependent metrics. The first is equivalent reference residue (EQR), which is the total number of aligned residue pairs shared by the manually curated reference alignment and the automated alignment from a structure alignment program. The second reference-dependent metric is percentage of agreement, which equals to EQR divided by the length of the reference alignment. Reference-independent metrics included the number of aligned residues (L ali ), root mean square deviation (RMSD) of aligned residues, running time, and structure overlap (SO), which is defined as the percentage of residues aligned within 3.5Å to corresponding residues in the other structure: Here, L A and L B are the sequence length of the two proteins, d i is the distance of the i-th aligned residue pair, and I[ ] is the Iverson bracket, which equals to 1 if d i <3.5 and 0 otherwise. Since SO considers both the alignment coverage and deviation at the aligned region, it is a more useful reference-independent metric than L ali and RMSD, which often conflict with each other. For example, the alignment program with the lowest RMSD (1.76 Å ) was MASS, which also had the smallest L ali (116) ( Table 1). At the other end of the spectrum was CE, which had both the greatest L ali (205) among all programs and the highest RMSD (18.35 Å ). Neither program had the best SO. The two programs with the worst SO values (CE and DALI) both performed SQ alignment, while the three programs with the highest SO values were all fNS methods (US-align2 at fNS mode, SPalignNS, and CLICK) ( Figure 1A). This is understandable, as NS alignment, especially fNS alignment, allows optimization of the alignment in a search space that is much larger than SQ Nonetheless, a high SO alignment is not necessary biologically relevant. For example, although the fNS program SAMO had a reasonable SO (55.06%), which was higher than any of the four SQ alignment programs in this benchmark, its alignment had the worst agreement (23.76%) with the reference alignments manually curated according to evolutionary and functional insights ( Figure 1B). In fact, although fNS appears to generate higher SO alignments than sNS, all three programs that agree well with the reference alignment were operating under the sNS mode rather than the fNS mode. Among them, US-align2 sNS had the highest agreement with the reference (81.21%), which is $13% higher than the second (GANGSTA+) and third place (MASS) programs. These data suggest that US-align2 sNS alignments may have higher biological relevance than US-align2 fNS alignments, even though the latter had better apparent structural overlaps.
In terms of speed, NS alignment is generally slower than SQ alignment using a similar objective function. For example, SPalignNS, which extended the original SQ alignment method SPalign for NS alignment, was almost three times slower than SPalign (Table 1). This is also true for US-align2, whose SQ alignment mode was on average 4.7 and 2.3 times faster than its fNS and sNS modes, respectively. Nonetheless, thanks to the fast heuristic alignment-superimposition iterations implemented in US-align2 (See STAR Methods), it was faster than any other programs in the same alignment mode. US-align2 fNS, sNS, and SQ used 47.6%, 14.4%, and 20.8% less time than SPalignNS, MASS, and CE, respectively, which were the second fastest programs for fNS, sNS, and SQ alignments in our benchmark. This makes US-align2 suitable for large-scale structure analysis. Note that the running time of a program depends on the hardware. For all benchmarks in this study, all programs are run single-threaded on the Yale Grace supercomputer equipped with the Intel Xeon Gold 6240 CPUs.
Detecting protein pairs from the same structure fold by non-sequential alignment To further evaluate the ability of different structure alignment program to differentiate protein pairs from the same versus different folds, we collect a large dataset of 954 protein domains from the ASTRAL40 database version 2.08, which is a non-redundant subset of the Structural Classification of Proteins-extended (SCOPe) database with pairwise sequence identity <40%. To collect this dataset, we first only kept the structure with the best resolution in each protein superfamily in the ASTRAL40 set. We then removed proteins from all SCOPe folds with less than two superfamilies, resulting in 954 proteins, each from a different superfamily, that belong to 146 SCOPe folds.  (Brown et al., 2016). This is because most reference alignments in the RIPC dataset contain a very small number of residue pairs (%10) manually confirmed by the original author based on sequence and function conservation. Therefore, the average values for reference-based evaluation can be heavily biased by a few aligned residue pairs from the short reference alignments. For each metric, the value from the best program is highlighted in bold.

OPEN ACCESS
iScience 25, 105218, October 21, 2022 3 iScience Article All-against-all alignments were performed on this dataset by US-align2 and third-party protein alignment programs. All structure pairs were then sorted in descending order of alignment scores (e.g., TM-score for US-align2 and SPscore for SPalign), except for SAMO alignments, which were sorted in ascending order of the alignment score because a lower (i.e., more negative) SAMO score indicates a higher structural similarity. Each pair was labeled positive or negative depending on whether the pair shared the same SCOPe fold or not. Any self-hit (i.e., alignment of one protein to itself) was not considered. The performance of Rfam family detection at an alignment score cutoff c was quantified by the true positive rate (TPR) and false positive rate (FPR): where true positive TP(c) and false positive FP(c) were the number of positive and negative pairs with alignment score Rc (or %c in the case of SAMO), respectively, while P and N were the total number of positive and negative RNA pairs. The receiver operating characteristic (ROC) curve could then be drawn for TPR versus FPR at all possible alignment score cutoffs (Figure 2A). The area under ROC curve (AUROC) summarized the ability of the alignment score to differentiate positive from negative protein pairs, where a perfect method would have AUROC = 1.
This benchmark shows that SQ alignment methods are usually more capable than NS methods to detect protein pairs from the same fold ( Figure 2A). Among all methods, US-align2 SQ has the best AUROC (0.929) followed closely by SPalign (0.901), although NS methods usually produce alignments with better structure overlaps ( Figure 2B). The most extreme case is the fNS method CLICK, which has the best average structure overlap of 73.58% but the worst AUROC (0.564), which is close to random results (AUROC = 0.5). These data are not surprising given that only a minority (17.4%) of pairs of topologically similar protein structures have NS relations (Abyzov and Ilyin, 2007), in contrast to RNA structure pairs where NS relations are more common as shown in a later section. iScience Article Despite having a lower precision, NS alignment can sometimes detect structural similarity that SQ alignment is not sensitive enough to identify. For example, both the cellobiose dehydrogenase and the chitin monooxygenase (SCOPe: d4qi3a1 and SCOPe: d4a02a_, respectively) share the same immunoglobulin-like beta-sandwich fold with complicated topology. US-align2 SQ is only able to align 3 out of the 7 b-strands of d4a02a_ with a poor TM-score of 0.303 ( Figure 2C). On the other hand, US-align sNS alignment produces a completely different superimposition and a much higher TM-score (0.693), which is well above the cutoff for significant structure similarity (TM-scoreR0.5). This is because the sNS alignment can discard the connectivities among the b-strands, enabling the alignment of 6 out of all 7 strands from d4a02a_ to be aligned ( Figure 2D).

Detecting related RNA pairs by non-sequential alignment
In this section, NS and SQ alignment modes of US-align2 were further benchmarked for homologous RNA detection on the Rfam dataset. To construct this dataset, sequences of all 7668 RNA PDB chains with at least one Rfam family match in Rfam database version 14.8 (Kalvari et al., 2021) were extracted from their iScience Article PDB coordinates. Sequence redundancies among the RNAs were then removed by CD-HIT-EST (Huang et al., 2010), resulting in a final set of 508 representative RNAs mapped to 126 Rfam families, where any two RNAs shared <80% sequence identity. We then repeat the same ROC analysis for all-against-all alignments on this dataset by US-align2 under fNS, sNS, and SQ modes, as well as by two existing RNA alignment programs (RMalign and ARTS). Each pair was labeled positive or negative depending on whether the pair shared at least one identical Rfam family or not.
As shown in Figure 3A, US-align2 fNS and sNS both had $1.2% higher AUROC than US-align2 SQ, which in turn had 1.4% and 12.5% higher AUROC than third-party programs RMalign and ARTS, respectively. It may appear that the differences between the AUROCs of US-align2 NS alignments and those of US-align2 SQ and RMalign are small in absolute values. This is because the majority (68.5%) of RNA pairs from the same Rfam family are close homologs with R40% sequence identities. Since the list of top alignment hits is dominated by close homologs that are easy to align by all top performing programs, almost all programs, except for ARTS, have near perfect AUROC scores that are close to 1.
To evaluate the abilities of different programs to compare distant homologs that are difficult to align, the same benchmark is repeated for the subset of 21813 RNA pairs with <40% sequence identities ( Figure 3B). More pronounced differences in performances are observed, where the AUROC of US-align2 fNS and sNS was 6.8% and 6.0% higher than US-align2 SQ, 9.5% and 8.7% higher than RMalign, and 17.3% and 16.4% higher than ARTS, respectively. These data demonstrate the utility of NS alignment for detecting related RNA pairs, especially when the sequence similarity is low.
An example of RNA pairs whose structural similarity is recognized only by NS alignment is the eukaryotic large subunit ribosomal RNA (Rfam: RF02543) from baker's yeast and Leishmania (PDB: 6em1 Chain 1 and iScience Article PDB: 6az3 Chain 2, respectively). Although both rRNAs belong to the same Rfam family and perform the same function, US-align2 SQ alignment only reports a very low TM-score (0.182, Figure 3C). RMalign also reports a very low similarity with RMscore 0.153, where RMscoreR0.5 is the cutoff for significant structural similarity (Zheng et al., 2019). The consistent failure of structural similarity detection by SQ alignment is due to the yeast rRNA being in an intermediate state during ribosome assembly (Kater et al., 2017) while the conformation of the Leishmania rRNA corresponds to the state in the final assembled ribosome (Shalev-Benami et al., 2017). Despite their conformational differences, US-align2 sNS can identify the structural similarity at TM-score 0.565 ( Figure 3D) which is well above the cutoff for a pair of structurally related RNAs (TM-scoreR0.45).
These data, however, do not suggest that NS alignment should always be preferred over SQ alignment for RNAs. Although both modes of NS alignments by US-align2 had higher TPRs than SQ programs at the high FPR region (FPRR0.1), their TPRs were below US-align2 SQ in the low FPR region (FPR<0.1) (Figure 3). This implies that although NS alignment modes of US-align2 are less precise than US-align2 SQ for the very top hits (corresponding to the low FPR region), they are more sensitive than US-align2 SQ as more hits are considered (corresponding to the high FPR region).
Even though RMalign and ARTS both performed SQ alignment rather than the more computationally expensive NS alignment, they still required 0.74 and 0.32 second per alignment, respectively. This was at least 8.0 and 3.6 times slower than US-align2 under sNS and fNS modes, which takes 0.04 and 0.09 second per alignment, respectively. Although not as fast as US-align2 SQ at 0.02 second per alignment, US-align2 sNS and fNS were still among the fastest alignment programs for RNA tertiary structures.
Case studies for molecular mimicries validated by non-sequential structure alignment The ability of US-align2 to handle both proteins and nucleic acids makes it an ideal program for identifying protein-RNA molecular mimicry. Such mimicry between a protein and a nucleic acid molecule usually implies similarity in overall shape but not consistency in sequence order. Therefore, NS alignment would be more relevant than SQ alignment for detecting such cases.
For example, the ribosome recycling factor (RRF) protein and the tRNA is a pair of known molecular mimicry (Selmer et al., 1999). Both molecules are L-shaped and both bind to the same site within the ribosome. A tRNA transports an amino acid into an mRNA-bound ribosome to enable translation, while the RRF terminates translation by releasing the mRNA from the ribosome. Under SQ mode, US-align2 poorly captured the structure similarity, with an insignificant TM-score of 0.259, where the alignment only covers 49.2% of RNA nucleotides. The two molecules were poorly superimposed visually, with a large portion of RRF extruding outside the aligned region to the tRNA ( Figure 4A). On the other hand, US-align2 NS aligned this pair of structures much better, as the superimposition placed RRF almost entirely within the envelop of the tRNA, where the alignment covers 87.1% of the nucleotides ( Figure 4B). The TM-score is 0.464, which is above the TM-score cutoff of 0.45 to consider the RNA alignment to be significant (Gong et al., 2019).
Molecular mimicry also occurs for DNA aptamers, which are DNA oligonucleotides that fold into unusual tertiary structures to recognize protein epitopes in a similar fashion as antibody-epitope binding (Ren et al., 2021). Such protein-DNA mimicries cannot be easily detected by SQ alignment. For example, US-align2 SQ reports a low TM-score of 0.285 between the antibody for hemagglutinin and the DNA aptamer of lactate dehydrogenase. The regions for epitope recognition by the antibody and the aptamer are completely unaligned in SQ ( Figure 4C). On the other hand, US-align2 NS reports a significant TM-score of 0.500 and produces superpositions that overlays the epitope recognition sites well ( Figure 4D).

DISCUSSION AND CONCLUSION
In this work, we present the US-align2 algorithm for both SQ and NS alignment. The NS alignment modes enable accurate identification of non-sequential residue correspondence between protein structure pairs and the sensitive detection of remote RNA homologs. Such sensitivities are achieved at a speed several times faster than existing programs. The unique ability of US-align2 to align proteins with nucleic acids under a unified scoring function (TM-score) enables it to quantitatively capture a protein-RNA mimicry.

STAR+METHODS
Detailed methods are provided in the online version of this paper and include the following:  1eo8 Chain A) and lactate dehydrogenase (PDB: 3zh2 Chain B), which are recognized by the antibody and the DNA aptamer, respectively. Since a nucleotide residue is much bigger than an amino acid residue in size, a nucleotide is represented by two atoms (P and C4') while an amino acid is represented by one atom (Ca) when aligning the RNA to the protein, as implemented by the US-align2 option -atom ''PC4'''. These atoms were chosen because the distance between a P and the adjacent C4' atom coincides with the distance between two adjacent Ca atoms at approximately 3.8 Å . TMscores in all panels were normalized by the length of the nucleic acid, as it is shorter than the protein.

OPEN ACCESS
iScience Article normalization. The normalization factor d 0 ensures that the TM-score is independent of protein length, and is calculated as: Statistics obtained on inter-and intra-family pairwise alignment show that TM-scoreR0.5 (Xu and Zhang, 2010) or R0.45 (Gong et al., 2019) corresponds to a pair of proteins or RNAs, respectively, sharing the same topology.
It is NP-hard to identify the superimposition (i.e., rotation and translation of one structure relative to another) that maximize the TM-score. Therefore, the TM-score superimposition given the alignment (i.e., residue level correspondence) is solved numerically by extracting all continuous fragment pairs with length L ali , L ali /2, L ali /4, . , 4. A superimposition is performed for each pair of fragments using the Kabsch algorithm (Kabsch, 1976) to minimize the RMSD. Among these superimpositions, the one corresponding to the highest TM-score is considered the optimal TM-score superimposition.
For RNAs, the secondary structure, i.e., base pairing, is assigned by US-align2 as in our previous work (Gong et al., 2019). For two nucleotides to be considered as forming base pair, they must satisfy the following three conditions. First, the distance between the pair of C3' atoms should fall within 12.5 to 15.0 Å . Second, only G:C, G:U and A:U pairs are allowed. Third, the singleton pair is excluded, i.e., if neither nucleotide pair i-1 and j+1, nor nucleotide pair i+1 and j-1 satisfy the above two criteria, nucleotide i and j are not considered paired either. Based on these three criteria, a nucleotide can be assigned to one of the three secondary structure state: unpaired, paired with an upstream base, or paired with a downstream base.

US-align2 for SQ alignment
Similar to our previous study (Zhang et al., 2022), SQ alignment of US-align2 starts with five different initial alignments, which are based on gapless sliding, secondary structure matching, half-half combination of secondary structure matching and gapless sliding, superimposition of large fragments with length L/2 and L/3, and superimposition of small fragments with length 4. For each initial alignment, a set of heuristic superimposition-alignment iterations are performed until convergency. In each iteration, a TM-score superimposition is performed based on the alignment obtained from the previous iteration. The new superimposition is used to derive a new alignment using Needleman-Wunsch (NW) global alignment (Needleman and Wunsch, 1970) with a gap penalty of À0.6 and a residue-level TM-score for aligning residue i from Structure A to residue j from Structure B: ll OPEN ACCESS