Introduction

Computational Protein Design, the task of designing proteins with specific structures or functions1, has become a powerful tool for exploring regions of sequence and topology space not yet visited by the evolutionary process2,3,4 and for discovering proteins with better properties5. It has enabled successes in membrane protein design6, enzyme design7, and beyond. As a sub-task of Computational Protein Design, Inverse Protein Folding (IPF), the problem of finding amino acid sequences that fold into a given three-dimensional (3D) structure8, is of great importance, as hosting a particular function often presupposes acquiring a specific backbone structure.

How to model and utilize residue interactions has been the focus of various IPF algorithms. In traditional methods, energy functions are designed to approximate backbone-sequence compatibility. Residue-pair interaction modeling is usually derived from databases by leveraging statistical preferences for particular residue pairs in a simplified local environment to estimate inter-residue energies5,9,10. The increasing computational complexity limits the statistical estimation of multi-residue interactions conditioned on a more fine-grained representation of the local environment10,11.

In recent years, deep learning has been widely and successfully applied to protein structure modeling and prediction12,13, owing to its ability to automatically learn complex non-linear many-body interactions from data. There have been efforts to solve IPF with deep learning4,14,15. Early methods often model protein structures as sequences of independent residues16,17 or atom point clouds4,15 and adopt a non-autoregressive decoding scheme, as illustrated in Fig. 1a. Their independence assumption prevents them from learning complex residue interactions and limits their performance. Some recent works use proximity graphs to represent protein structures, where residues are nodes and residue interactions are directly modeled as edges. Typically, a masked encoder-decoder architecture with autoregressive decoding is used (shown in Fig. 1b)18,19,20,21. Recently, a similar decoding scheme was proposed in ABACUS-R (shown in Fig. 1c)22. This method assumes all neighboring residue types are known when decoding a central residue; starting from a random initial sequence, it updates residues recursively based on their neighborhoods until convergence. However, the dependency on previous predictions has proven prone to error accumulation23,24: noisy residue information is introduced into the context and propagated through the graph structure, whereas recovering a target residue would be easier and more accurate if more high-quality residue interactions were available and utilized.

Fig. 1: Different ways of utilizing inter-residue features.

a Non-autoregressive decoding scheme. b Autoregressive decoding scheme. c Decoding scheme proposed in ABACUS-R. d Our proposed method.

We summarize the above issues as the selection and utilization of high-quality residue interactions. To address them, we propose ProRefiner, a protein sequence design model. The model is tasked with BERT25-like sequence inpainting conditioned on protein structures. Specifically, we mask random residues on sequences during training; the model takes the masked partial sequences and backbone structures as input and learns to reconstruct the whole sequence. During inference, the input partial sequence to ProRefiner can be constructed in two ways. In partial sequence design scenarios, where only some residues need to be designed, the remaining residues naturally serve as an oracle partial input sequence. In entire sequence design settings, we introduce an entropy-based residue selection technique to utilize predictions from existing models: given a sequence generated by an inverse folding model, referred to as the base model, we include residues with highly confident predictions in the partial sequence and mask less valuable, low-confidence predictions. Here we use entropy to approximate the base model's confidence in its predictions. In our experiments, the precision among the residues with the lowest 10% entropy is around 99%. By masking out high-entropy residues, a significant amount of noise can be effectively removed from the input residue environment. ProRefiner then generates the whole sequence in one shot based on the denoised partial sequence. Compared to previous left-to-right sequence design models, ProRefiner learns to exploit global residue interactions by training with partial sequence input. Its one-shot generation manner, together with the proposed residue selection technique, ensures higher-quality residue interactions and faster generation. It can be used as an add-on module to refine the results of existing methods.

ProRefiner’s model architecture is a stack of memory-efficient global graph attention layers, as shown in Fig. 2. The attention mechanism has proven effective in modeling global dependencies in sequential data26. However, adapting attention to the graph domain is challenging. The attention mechanism calculates attention weights between any two nodes based on their features; for graphs, this requires storing and manipulating a square matrix whose size equals the number of nodes, which neglects the sparsity of graph structures and raises the memory complexity to quadratic in the node count, posing scalability issues27,28. Some methods circumvent this by confining attention within node neighborhoods, losing the global view that makes attention powerful18,29,30.

Fig. 2: The model architecture of ProRefiner.

A partial sequence, either given or constructed from a base model’s generation, and the backbone structure are encoded to obtain the graph features. Several memory-efficient global graph attention layers are employed to propagate the graph features and learn global residue interactions. Finally the whole sequence is generated in one shot.

Moreover, these methods do not fully utilize edge features: edge features only contribute to the attention computation and are neither updated nor used to update node features18,27,29,31. However, edge features have been proven critical in protein structure modeling20. To address these limitations, we aim to design an attention-based model tailored for graphs that (1) is memory efficient, (2) maintains a global view of dependencies, and (3) fully incorporates edge features.

In particular, a k-nearest neighbor graph is constructed from the backbone structure. Informative node and edge features are extracted (detailed in the Methods section) and sent to a stack of memory-efficient global graph attention layers. In each layer, every residue node globally attends to all other residues. An attention score is calculated from both node and edge features to determine the amount of information that a target residue gathers from another residue. For residue pairs that are not directly connected by an edge, a learnable pseudo-edge feature is used for attention calculation; each layer learns a separate pseudo-edge feature that is shared by all non-existing edges. The attention scores are then used to weight and sum node and edge features to produce updated node features, and edge features are in turn updated from the new node features. Finally, the model generates the sequence from the node features of the last layer in one shot. This memory-efficient global graph attention layer allows for global residue attention while eliminating the need for fully connected graph construction, by learning pseudo-edge features. Residues are thus able to leverage global interactions and whole-structure features.

Our experimental results demonstrate that our method is effective in both entire sequence design and partial sequence design settings. In particular, we validate ProRefiner on the task of single-point mutant design of Transposon-associated transposase B, a special case of partial sequence design where only one residue can be modified. ProRefiner successfully identified six variants with improved gene editing activity among the 20 mutants it recommended.

Results

ProRefiner is trained on the CATH v4.2 training set, which contains 18,204 structures18.

Entire sequence design

ProRefiner can serve as an add-on module to refine the sequences designed by existing base models. We demonstrate this application on entire sequence design. We experiment with the following recent inverse protein folding models.

  • GVP-GNN19 trained on the same training set as ProRefiner. We use the official codebase and default parameters provided by19 to train and evaluate the model.

  • ProteinMPNN20 trained on selected PDB structures clustered into 25,361 clusters. We use the pretrained model weights of the version with 48 edges and 0.20 Å noise.

  • ProteinMPNN-C with the same architecture as ProteinMPNN but trained on the same training set as ProRefiner for fair comparison.

  • ESM-IF121 trained on CATH v4.3 training set with 16,153 structures and 12 million additional structures predicted by Alphafold213. We use the pretrained model weights released by the official codebase.

We conduct experiments on the following three benchmarks.

  • CATH. CATH v4.2 dataset18 is a standard dataset for IPF training and evaluation. We evaluate on its test split of 1120 structures.

  • TS50. TS50 is a benchmark set of 50 protein chains proposed by17. It has been used by a number of previous works15,32,33. There are four structures shared between TS50 and CATH.

  • Latest PDB. We collect the latest published structures in PDB to validate the model’s ability to generalize to new structures. We select protein structures released after 01/01/2022 with a single chain of length less than 500 and resolution < 2.5 Å, resulting in 1975 protein structures. There are no structures that overlap between Latest PDB and the other two benchmarks.

We report two metrics on all benchmarks: sequence recovery and native sequence similarity recovery (nssr)34. A pair of residues is considered similar and contributes to the nssr score if their BLOSUM62 score35 is greater than 0. Compared with recovery, which only considers residue identity, nssr takes residue similarity into account and provides a finer-grained comparison between two sequences. Additionally, we report the perplexity metric in Supplementary Table 1.
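To make the two metrics concrete, the following minimal Python sketch (ours, not the authors' evaluation code) computes recovery and nssr for a pair of aligned sequences; it assumes Biopython is available for the BLOSUM62 matrix and that both sequences use one-letter amino acid codes.

```python
# Illustrative sketch of sequence recovery and nssr; assumes Biopython (>= 1.75) is installed.
from Bio.Align import substitution_matrices

BLOSUM62 = substitution_matrices.load("BLOSUM62")

def recovery_and_nssr(native: str, designed: str):
    """Fraction of identical residues, and fraction of similar residues (BLOSUM62 score > 0)."""
    assert len(native) == len(designed)
    identical = sum(a == b for a, b in zip(native, designed))
    similar = sum(BLOSUM62[a, b] > 0 for a, b in zip(native, designed))
    return identical / len(native), similar / len(native)

# Example: a phenylalanine -> tyrosine substitution counts toward nssr but not recovery.
print(recovery_and_nssr("ACDEFG", "ACDEYG"))  # (0.83..., 1.0)
```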

In Table 1, we report the median recovery rates and nssr scores of ProRefiner with different base models, in comparison to the base models themselves. Among all the base models, ESM-IF1 achieves the best performance, highlighting the effectiveness of data augmentation. ProRefiner consistently achieves high recovery rates and nssr scores even with relatively poor base models, demonstrating its ability to refine the input partial sequences. Moreover, when partial sequences of higher quality are available, ProRefiner outperforms ESM-IF1, which uses 12 million additional structures for augmentation. Entropy-based residue selection is used to remove noise in the base models' predictions before refinement by ProRefiner. To validate its effectiveness, we remove the selection operation and feed the entire predicted sequence to ProRefiner. The resulting recovery rates on CATH are plotted in Fig. 3a. Removing entropy-based selection results in a large drop in sequence recovery, especially when the base model's recovery is low. This result supports the idea that noise in the input residue context can significantly limit sequence generation quality and lead to suboptimal designs. Using entropy-based selection to filter out low-quality residue predictions is an effective strategy for improving sequence recovery, and the fact that the drop is more pronounced when the base model's recovery is low suggests that the selection operation is particularly important when the input sequence information is less reliable.

Table 1 Median sequence recovery rates and nssr scores of ProRefiner with different base models on three benchmarks
Fig. 3: Model performance analysis on entire sequence design.

a Sequence recovery with and without entropy-based residue selection on CATH dataset (n = 1120 structures). The inner box plots show the first quartile, median and the third quartile. Whiskers in box plots extend to the most extreme data point that lies within 1.5 times the inter-quartile range (IQR) from the nearest quartile. b Confusion matrix of ProRefiner + ESM-IF1 on CATH dataset (n = 1120 structures). c, d Sequence recovery breakdown to hydrophilic and hydrophobic residues and residues on different secondary structures on CATH dataset (n = 1120 structures). The box plots show the first quartile, median and the third quartile. Whiskers in box plots extend to the most extreme data point that lies within 1.5 times IQR from the nearest quartile. e, f The predicted structures of sequences designed by ProRefiner + ESM-IF1 and ESM-IF1 for two structures (2KCD and 3A57), compared to the native ones. Alphafold2 is employed to fold the designed sequences. Source data are provided as a Source Data file.

In Fig. 3b, we show the confusion matrix of ProRefiner + ESM-IF1 on CATH. A darker cell means that a larger portion of the residues of the native type is predicted to be the corresponding type on the horizontal axis. It can be observed that the residue types the model tends to confuse are also physicochemically similar types, such as ILE vs VAL and GLU vs LYS. A two-sided Mantel test36 also shows that the confusion matrix is highly correlated with the BLOSUM62 amino acid substitution matrix35 (p value = 0.0001). In Fig. 3c and d, we break down the sequence recovery on CATH by amino acid type and secondary structure (H stands for 3₁₀ helix, α helix and π helix, E for isolated beta-bridge residue and strand, and C for bend, turn and coil). Our method shows improvement in the recovery of both hydrophilic and hydrophobic residues, with slightly greater improvement seen for residues located on bends, turns or coils.

We assess whether the designed sequences can fold into the target backbones by predicting their structures with Alphafold2. Fig. 3e and f show the results on proteins with PDB codes 2KCD (https://doi.org/10.2210/pdb2KCD/pdb) and 3A57 (https://doi.org/10.2210/pdb3A57/pdb) respectively. We observe that the sequences refined by ProRefiner are predicted to fold into structures more similar to native ones than those designed by ESM-IF1, as evidenced by higher TM-scores37 and lower root-mean-square deviation (RMSD).

Partial sequence design

For partial sequence design, ProRefiner fills in the unknown residues based on the partial input, without the need for base models. To ensure a fair comparison, we evaluate ProRefiner alongside GVP-GNN and ProteinMPNN-C, which are trained on the same training set as ProRefiner. Both are autoregressive models; they implement partial sequence design by replacing decoded residues with the given fixed amino acids, when available, during autoregressive decoding. Additional results for the ESM-IF1 model on partial sequence design can be found in Supplementary Table 2. We evaluate the models on the following two benchmarks.

  • EnzBench. EnzBench is a standard sequence recovery benchmark consisting of 51 proteins38. Design algorithms are required to recover the native residues in protein design shells while all other residues are fixed. The benchmark is designed to test an algorithm's ability to model protein binding and overall stability.

  • BR_EnzBench. BR_EnzBench34 aims to test an algorithm's ability to remodel the chosen protein structure. It randomly selects 16 proteins from EnzBench and uses the Backrub server39 to create an ensemble of 20 near-native conformations for each protein. To further increase the design difficulty, all residues on the design shell are mutated to alanine, and the conformations are then energy-minimized.

When evaluated on EnzBench and BR_EnzBench, the identities of residues outside the design shells are fixed and available to the models. Recovery rates and nssr scores for residues on the design shells are reported in Table 2. ProRefiner achieves the highest recovery and nssr on both benchmarks. We further analyze the recovery rate on EnzBench for different amino acids and secondary structures, as shown in Fig. 4a and b. ProRefiner significantly improves the recovery of both hydrophobic and hydrophilic residues and surpasses other models on all secondary structures. We employ Alphafold2 to fold the designed sequences and compare the recovered design shell structures. Fig. 4c shows the designed structures of 1Y52 (https://doi.org/10.2210/pdb1Y52/pdb) and 1Y2U (https://doi.org/10.2210/pdb1Y2U/pdb), which have 19 and 13 designable residues respectively. Design shell residues are shown as atoms. ProRefiner better recovers the design shell structures, with higher TM-scores and lower RMSD. More results and discussion on structure recovery can be found in Supplementary Tables 3–5.

Table 2 Median sequence recovery rates and nssr scores on EnzBench and BR_EnzBench
Fig. 4: Model performance analysis on partial sequence design.

a, b Sequence recovery breakdown to hydrophilic and hydrophobic residues and residues on different secondary structures on EnzBench dataset (n = 51 structures). The box plots show the first quartile, median and the third quartile. Whiskers in box plots extend to the most extreme data point that lies within 1.5 times IQR from the nearest quartile. c The predicted structures of sequences designed by ProRefiner and ProteinMPNN-C for two structures (1Y52 and 1Y2U), compared to the native ones. Design shells are plotted in atoms. Alphafold2 is employed to fold the designed sequences. Source data are provided as a Source Data file.

Application on transposon-associated transposase B

Transposon-associated transposase B (TnpB) is thought to be the ancestor of Cas12, the type V CRISPR effector40,41. TnpB (408 amino acids) in the D. radiodurans ISDra2 element has been demonstrated to function as a hypercompact programmable RNA-guided DNA endonuclease42, and its miniature size is suitable for adeno-associated virus-based delivery. However, TnpB exhibits moderate gene editing activity in mammalian cells, limiting its therapeutic application.

We aim to improve the editing activity of TnpB through the design of single-point mutations. We treat the design of a single-point mutation as partial sequence design, where only one residue is designable and all others are fixed. With the empirical intuition that a more positively charged surface might improve activity, we restrict the mutation target to the most positively charged amino acid, arginine (R), and restrict the candidate mutation sites to surface residues. We leverage sequence design models to compute a quality score for every candidate site, as illustrated in Fig. 5a. For each candidate site, we mask that site in the native sequence to obtain the input partial sequence, and use the wild-type backbone predicted by Alphafold213 as the input backbone structure. The model then predicts the identity of the masked site as a probability distribution over all amino acid types. We expect that a model that effectively learns residue interactions will assign higher probability to the types that are more compatible with the given residue context. Hence, we take the predicted probability for R as a measure of mutant stability. Furthermore, we consider the distance between the Cα of the site and the center of the predicted binding site43, as mutation sites close to the binding site are empirically more likely to bring improvements. The two scores are combined into a quality score measuring how likely a mutation site is to yield a stable and improved mutant. All candidate sites are ranked by their quality scores and the top 20 are taken as the recommended mutation points.
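The per-site scoring can be sketched as follows (our illustration, not the authors' released code): `predict_masked_distribution` is a hypothetical stand-in for any masked-prediction model such as ProRefiner, and the weighted combination of the two terms and the distance transform are assumptions, since the exact fusion is not specified above.

```python
# Hypothetical sketch of the per-site quality score for TnpB arginine scanning.
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residue types

def quality_score(site, sequence, ca_coords, binding_center, alpha=0.5):
    """Combine the model probability of R at the masked site with proximity to the binding site.
    predict_masked_distribution is a hypothetical (L, 20) model call; alpha is an assumed weight."""
    partial = sequence[:site] + "X" + sequence[site + 1:]        # mask only this site
    probs = predict_masked_distribution(partial)                 # hypothetical model interface
    p_arg = probs[site][AMINO_ACIDS.index("R")]                  # compatibility of arginine with the context
    dist = np.linalg.norm(ca_coords[site] - binding_center)      # C-alpha distance to binding-site center
    proximity = 1.0 / (1.0 + dist)                               # assumed monotone transform of distance
    return alpha * p_arg + (1.0 - alpha) * proximity             # assumed combination of the two terms

# Candidate surface sites would then be ranked by quality_score and the top 20 selected.
```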

Fig. 5: The procedure and results of TnpB single-point mutation design.

a The process of computing the quality score of one target mutation site on the TnpB WT sequence. b The mutation sites recommended by ProteinMPNN-C and ProRefiner are marked in red and blue respectively. c The improvement of variants recommended by ProteinMPNN-C and ProRefiner in indel activity relative to TnpB WT. All statistical analyses were performed on n = 3 biologically independent experiments and data are shown as the mean ± SD of three biological replicates. d Indel formation at the on-target and off-target sites observed for TnpB WT and TnpB K84R. Off-target sites are chosen following50. All statistical analyses were performed on n = 3 biologically independent experiments and data are shown as the mean ± SD of three biological replicates with actual values overlaid. P values were derived by a two-sided Student's t-test with ** denoting P < 0.01 and *** P < 0.001. The exact P values from top to bottom are 0.00188, 0.00104, 0.00792, 0.000214, 0.000508 and 0.000207. Source data are provided as a Source Data file.

Our proposed ProRefiner and ProteinMPNN-C are employed for mutant design following the above procedure. The 20 mutation points recommended by each model are displayed in Fig. 5b. To test TnpB variant activity in human cells (HEK293T), plasmids encoding the TnpB variants fused with N- and C-terminal nuclear localization signal (NLS) sequences and a reRNA construct targeting an EMX1 site in human genomic DNA (gDNA) were transiently transfected into HEK293T cells. After 96 h, gDNA was extracted and analysed by sequencing for the presence of insertions and deletions (Indels) at the targeted cleavage sites. CRISPResso2 is used to analyse Indels, with the following parameters: a minimum of 80% homology for alignment to the amplicon sequence, a quantification window of 20 bp, and ignoring substitutions to avoid false positives. Experiments show that 6 arginine substitutions designed by ProRefiner exhibit more than 1.2-fold improvement in indel activity relative to TnpB WT, compared with 3 designed by ProteinMPNN-C. Results are given in Fig. 5c. Additionally, the off-target activity of the variant with the highest activity, TnpB K84R, is compared with that of TnpB WT. As shown in Fig. 5d, the increase in activity leads to a degree of non-specific cleavage as expected, which may compromise the nuclease's specificity.

This experiment demonstrates that the proposed ProRefiner is effective at modeling residue interactions within a structural environment and generating sequences that best fit a given 3D context. It can be used in combination with other property measures to redesign existing proteins and improve their stability or other qualities that depend on protein stability.

Discussion on model designs

We examine several key designs in the ProRefiner model architecture. Our investigation reveals that the introduced global attention layers can learn and exploit meaningful residue interactions. Specifically, we identify the residues to which a target residue pays the most attention, indicated by the highest attention scores. We observe that many important residue interactions are well learned and represented by the attention operation. Fig. 6a–c provide examples for three typical chemical bonds in protein structures, where the target residues are shown in blue and the three residues with the highest mean attention scores are shown in orange. For example, in the case of HIS 9 on 2KCD (https://doi.org/10.2210/pdb2KCD/pdb), LEU 5 is among its most attended residues; HIS 9 forms a hydrogen bond with LEU 5 on the α helix, and this interaction is well learned by the attention layers. Similarly, ILE 70 on the sheet heavily attends to ASN 54, with which it forms a hydrogen bond. For T4-lysozyme (https://doi.org/10.2210/pdb1LYD/pdb), ASP 70 forms a hydrogen bond with LEU 66 and a salt bridge with HIS 31, and both residues are among its most attended residues. The attention operation also captures a disulfide bond between CYS 99 and CYS 94 of human Ero1-alpha (https://www.uniprot.org/uniprotkb/Q96HE7/entry).

Fig. 6: Analysis on model design and ablation study.

a Hydrogen bonds between two target residues in blue (HIS 9, ILE 70 on 2KCD) and their most attended residues in orange. b A salt bridge and a hydrogen bond between ASP 70 of T4-lysozyme and its most attended residues. c A disulfide bond between CYS 99 and CYS 94 of human Ero1-alpha (Q96HE7). d Models' median recovery rate on the CATH dataset (n = 1120 structures), with respect to different percentages of residues masked on sequences generated by the base model ESM-IF1. e Runtime and GPU memory usage of ProRefiner and the model without pseudo-edge features, denoted ProRefiner - PsFeat. Data are obtained from 16 independent runs and the plots show the mean with the 95% confidence interval of the mean, estimated from 1000 bootstrap samples. f Recovery rates and nssr scores of ProRefiner and ProRefiner - PsFeat on CATH (n = 1120), EnzBench (n = 51) and BR_EnzBench (n = 320). ESM-IF1 is used as the base model for inference on CATH. The inner box plots show the first quartile, median and the third quartile. Whiskers in box plots extend to the most extreme data point that lies within 1.5 times IQR from the nearest quartile. Source data are provided as a Source Data file.

We further validate the effectiveness of two key components of ProRefiner: the global attention mechanism and the partial sequence input. Two models are trained for the ablation study: (1) a model without the global attention view, where residues only attend to their graph neighbors and pseudo-edge features are therefore not used; and (2) a model without partial sequence input during training, where all residues are masked. We compare the recovery rates of the ablated models on CATH (using base model ESM-IF1), EnzBench and BR_EnzBench. Results are given in Table 3. Removing either component results in a drop in model performance, with p value < 0.05 by a paired two-sided t-test on all datasets, indicating a significant improvement from introducing the global attention view and the partial sequence input. Notably, the model without partial sequence input exhibits a significantly larger performance degradation on BR_EnzBench, indicating that when input structures are not accurate, the ability to utilize sequence information becomes more important. We also test the robustness of these models to noise in the input residues. For sequences generated by the base model ESM-IF1, we mask different percentages of residues with the highest entropy and plot each model's median recovery rate on the CATH benchmark in Fig. 6d. The model trained without partial sequence input cannot utilize the input partial sequence and thus exhibits relatively stable performance. The recovery rates of the other two models first increase as more noisy residues are removed, and drop slightly when too many residues are masked and less sequence information is left. The model with a global attention view consistently outperforms the model with a local view, demonstrating a better ability to leverage the input residue context, as residue information is available to every node, not just nearby ones.

Table 3 Median recovery of ProRefiner (the last row) and two ablated models, either without global attention view or without partial sequence input

Finally, we compare the proposed memory-efficient global attention layers against the original global attention layers. We use ProRefiner - PsFeat to denote the model that uses vanilla global attention layers without learning pseudo-edge features. Figure 6e illustrates the runtime and GPU memory utilization of the two models, measured over 16 independent runs. ProRefiner exhibits linear time and memory complexity, whereas ProRefiner - PsFeat incurs quadratic complexity due to the construction of fully connected graphs. The predictive performance of the two models is presented in Fig. 6f. We experiment on the CATH benchmark with base model ESM-IF1 for entire sequence design, and on the EnzBench and BR_EnzBench benchmarks for partial sequence design. The results indicate that while ProRefiner shows slightly greater performance variance on certain benchmarks, its overall performance remains comparable to that of ProRefiner - PsFeat. Additionally, we compare the predictive performance of the two models on other tasks and report the results in Supplementary Table 6.

Discussion

In this work, we attempt to take a step towards better modeling and learning of inter-residue interactions within protein structures by proposing a method for inverse protein folding. We develop a two-pronged approach that incorporates a residue selection technique and a memory-efficient global graph attention model, which work jointly to achieve effective selection and utilization of high-quality residue interactions. Our experiments demonstrate that the proposed ProRefiner is able to capture meaningful inter-residue bonds and achieve high sequence recovery on several protein design benchmarks. We also apply the model to redesign TnpB and successfully discover six mutants with enhanced editing activity. Our results highlight the potential of our method to facilitate the design of proteins with improved functional properties. Additionally, the memory-efficient graph attention module proposed herein provides an efficient means of modeling graph-structured data where global dependencies are critical. Potential future research directions include the application of this module to other protein-related tasks and the examination of other biomolecules.

Methods

Graph representation of proteins

A protein structure is represented as a proximity graph \(\mathcal{G}=(\mathcal{V}, \mathcal{E})\), where \(\mathcal{V}=\{v_1, v_2, \ldots, v_N\}\) denotes the residue nodes and \(\mathcal{E}=\{e_{ij}\}\) denotes the directed edges from vj to vi, where residue vj is among the k = 30 nearest neighbors of vi in terms of Cα distance. Each node vi has the following structural features:

  • \(\sin\) and \(\cos\) values of dihedral angles;

  • unit vectors from the previous and next residues on sequence to vi in terms of Cα position.

Each edge eij has the following features:

  • Gaussian radial basis function encodings of interatomic distances between N, Cα, C, O and a virtual Cβ, and an encoding of the sequence distance i − j20;

  • unit vector from vj to vi in terms of Cα position.

We then employ two geometric vector perceptron layers19 to embed the structural features.

For sequence features, at training time we randomly mask 70% of the residues in the native sequences. Among the remaining 30% of residues whose identities are available to the model, we randomly select 3% and replace their identities with random amino acid types. During inference, the input partial sequences are provided as-is, without any masking or replacement. The partial sequences are then embedded as node features, with masked residues of unknown identity treated as a special unknown type. The sequence node features are concatenated with the structural node features. The resulting node features are denoted as \(\mathbf{H}^{0}\in \mathbb{R}^{N\times d}\), where \(\mathbf{h}_{i}^{0}\in \mathbb{R}^{d}\) denotes the feature of vi. The resulting edge features are denoted as \(\mathbf{E}^{0}\in \mathbb{R}^{N\times k\times d}\), where \(\mathbf{E}_{i}^{0}\in \mathbb{R}^{k\times d}\) is the features of the k neighbors of vi and \(\mathbf{e}_{ij}^{0}\in \mathbb{R}^{d}\) denotes the feature of the edge from vj to vi.
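A minimal sketch of the graph construction is given below (ours, for illustration); only the Cα-based neighbor search and the Gaussian RBF distance encoding are shown, and the RBF centers, bin count and distance range are assumptions.

```python
# Sketch of k-NN graph construction and Gaussian RBF distance encoding (assumed hyperparameters).
import torch

def knn_edges(ca: torch.Tensor, k: int = 30):
    """ca: (N, 3) C-alpha coordinates -> indices and distances of the k nearest neighbors per residue."""
    dist = torch.cdist(ca, ca)                                # (N, N) pairwise C-alpha distances
    dist.fill_diagonal_(float("inf"))                         # exclude self edges
    knn_dist, knn_idx = dist.topk(k, largest=False, dim=-1)   # k nearest neighbors of each residue
    return knn_idx, knn_dist                                  # both (N, k)

def rbf_encode(d: torch.Tensor, d_min: float = 0.0, d_max: float = 20.0, n_bins: int = 16):
    """Gaussian radial basis encoding of distances; one feature per bin center."""
    centers = torch.linspace(d_min, d_max, n_bins)
    width = (d_max - d_min) / n_bins
    return torch.exp(-(((d.unsqueeze(-1) - centers) / width) ** 2))  # (..., n_bins)

# Usage: idx, d = knn_edges(ca_coords); edge_rbf = rbf_encode(d)  # (N, k, n_bins) edge features
```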

Memory-efficient global graph attention model

Attention was first introduced in the Transformer model26. Let \(\mathbf{H}\in \mathbb{R}^{N\times d}\) denote the d-dimensional features of the input sequence with length N. The self-attention module updates the input features according to the following equations:

$$\text{SelfAtten}(\mathbf{H})=\mathbf{A}\mathbf{V},$$
(1)
$$\mathbf{A}=\text{Softmax}\left(\frac{\mathbf{Q}\mathbf{K}^{T}}{\sqrt{d_K}}\right),$$
(2)
$$\mathbf{K}=\mathbf{H}\mathbf{W}_{K},\quad \mathbf{Q}=\mathbf{H}\mathbf{W}_{Q},\quad \mathbf{V}=\mathbf{H}\mathbf{W}_{V},$$
(3)

where \(\mathbf{W}_{K}\in \mathbb{R}^{d\times d_K},\ \mathbf{W}_{Q}\in \mathbb{R}^{d\times d_K},\ \mathbf{W}_{V}\in \mathbb{R}^{d\times d_V}\) are parameters to map H to keys, queries and values.

There have been attempts to employ the Transformer architecture for learning on graphs, with nodes treated as sequence tokens. To utilize the global view provided by the original self-attention on graphs with edge features, previous works generally incorporate edge features into the attention matrix A:

$$\mathbf{A}=\text{Softmax}\left(f\left(\frac{\mathbf{Q}\mathbf{K}^{T}}{\sqrt{d_K}},\ \phi(\mathbf{E})\right)\right),$$
(4)

where \(\mathbf{E}\in \mathbb{R}^{N\times N\times d}\) is the d-dimensional edge features between each pair of nodes, ϕ estimates the correlations of node pairs from edge features, which could be a linear transformation27 or a more sophisticated function31, and f is an aggregation function, which could be element-wise addition27,31 or multiplication27. These methods have two limitations. First, to construct the edge feature matrix E, they require fully connected graphs as input, and the memory complexity becomes \(\mathcal{O}(N^{2})\). Second, the edge features are not fully leveraged: they are only involved in the attention computation and cannot be updated, nor can they be used to update node features.
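For contrast, a toy version of this dense-edge attention (Eq. (4) with a linear ϕ and additive f, our choice for illustration) makes the quadratic memory cost explicit: it needs an (N, N, d) edge tensor even when the underlying graph is sparse.

```python
# Toy Eq.-(4)-style attention over a fully connected graph; O(N^2) memory in the edge tensor.
import torch

def dense_edge_attention(H, E_full, W_Q, W_K, w_phi):
    """H: (N, d) node features, E_full: (N, N, d) dense edge features,
    W_Q, W_K: (d, d_k) projections, w_phi: (d,) linear phi mapping each edge to a scalar bias."""
    d_k = W_K.shape[1]
    node_term = (H @ W_Q) @ (H @ W_K).T / d_k ** 0.5     # (N, N) node-node term
    edge_term = E_full @ w_phi                            # (N, N) edge term, additive aggregation f
    return torch.softmax(node_term + edge_term, dim=-1)   # attention matrix A
```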

The ProRefiner model is composed of a stack of L memory-efficient global graph attention layers. In each layer, nodes can globally attend to all other nodes, and edge features between node pairs serve as additive attention bias terms. For non-existing edges, one solution is to convert arbitrary graphs to fully connected graphs before entering the model, after which Equation (4) could be used. This could be done by setting k to a large enough number or by using a fixed masking value/vector for non-existing edges, as in previous works27. This operation increases the memory complexity from \(\mathcal{O}(N\times k)\) to \(\mathcal{O}(N^{2})\). To avoid the conversion, we design a learnable pseudo-edge feature in each layer. Let Hl and El denote the input features of the lth layer. The attention is computed as follows:

$$\mathbf{A}^{l}=\text{Softmax}\left(\frac{\mathbf{Q}^{l}(\mathbf{K}^{l})^{T}}{\sqrt{d}}+\mathbf{B}^{l}\right),$$
(5)
$$\mathbf{B}_{ij}^{l}=\begin{cases}\left(\mathbf{w}_{B}^{l}\right)^{T}\mathbf{e}_{ij}^{l} & j\in \mathcal{N}_{i}\\ \left(\mathbf{w}_{B}^{l}\right)^{T}\boldsymbol{\beta}^{l} & j\notin \mathcal{N}_{i}\end{cases},$$
(6)
$$\mathbf{K}^{l}=\mathbf{H}^{l}\mathbf{W}_{K}^{l},\quad \mathbf{Q}^{l}=\mathbf{H}^{l}\mathbf{W}_{Q}^{l},$$
(7)

where \(\mathbf{B}^{l}\) is the attention bias, \(\mathbf{W}_{K}^{l}\in \mathbb{R}^{d\times d},\ \mathbf{W}_{Q}^{l}\in \mathbb{R}^{d\times d},\ \mathbf{w}_{B}^{l}\in \mathbb{R}^{d}\) are parameters, and \(\boldsymbol{\beta}^{l}\in \mathbb{R}^{d}\) is the pseudo-edge feature in layer l. Learning a pseudo-edge feature for each layer is more adaptive and flexible than using one fixed masking value across all layers and provides a better approximation to using fully connected graphs.

The attention score Al is then used to aggregate node features as well as edge features. Node features are weighted and summed as in the vanilla self-attention. Edge features are aggregated with normalized weights and concatenated with aggregated node features. Finally, a linear layer is employed to map the concatenated feature to dimension d:

$$\hat{\mathbf{h}}_{i}^{l}=\left[\sum_{j}\mathbf{A}_{ij}^{l}\mathbf{V}_{j}^{l}\ \Big\Vert\ \sum_{j\in \mathcal{N}_{i}}\frac{\mathbf{A}_{ij}^{l}}{\gamma_{i}^{l}}\mathbf{e}_{ij}^{l}\right]\mathbf{W}_{N}^{l},$$
(8)
$$\mathbf{V}^{l}=\mathbf{H}^{l}\mathbf{W}_{V}^{l},$$
(9)

where \(\mathbf{W}_{V}^{l}\in \mathbb{R}^{d\times d},\ \mathbf{W}_{N}^{l}\in \mathbb{R}^{2d\times d}\) are parameters, \(\Vert\) denotes the concatenation operation, and \(\gamma_{i}^{l}\) is a normalization term that normalizes the sum of edge weights to 1.

Then a residual connection and layer normalization are applied to output the final updated node features:

$$\mathbf{H}^{l+1}=\text{LayerNorm}\left(\hat{\mathbf{H}}^{l}+\mathbf{H}^{l}\right).$$
(10)

The edge features will then be updated as follows, with \(\mathbf{W}_{E}^{l}\in \mathbb{R}^{3d\times d}\) being parameters:

$$\hat{\mathbf{e}}_{ij}^{l}=\left[\mathbf{h}_{i}^{l+1}\ \Vert\ \mathbf{e}_{ij}^{l}\ \Vert\ \mathbf{h}_{j}^{l+1}\right]\mathbf{W}_{E}^{l},$$
(11)
$$\mathbf{E}^{l+1}=\text{LayerNorm}\left(\hat{\mathbf{E}}^{l}+\mathbf{E}^{l}\right).$$
(12)

Note that we adopt this naive edge feature update for its empirical effectiveness and implementation simplicity. However, it cannot ensure the triangle inequality on distances13. Incorporating a more sophisticated edge update method that enforces triangle inequality constraints could be a promising future direction.

To leverage edge features under the global attention mechanism, compared with \(\mathcal{O}(N^{2})\) for previous works, our memory-efficient global graph attention only needs \(\mathcal{O}(N\times k+L)\) additional memory, and therefore allows designing longer sequences.
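The following PyTorch sketch (written by us for illustration, with a single attention head and no dropout) implements one such layer following Eqs. (5)-(12); the neighbor-index layout and the handling of the pseudo-edge bias are simplifying assumptions rather than the released implementation.

```python
# Illustrative single-head memory-efficient global graph attention layer (Eqs. 5-12).
import torch
import torch.nn as nn

class MemoryEfficientGlobalAttentionLayer(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.W_Q = nn.Linear(d, d, bias=False)      # Eq. (7)
        self.W_K = nn.Linear(d, d, bias=False)
        self.W_V = nn.Linear(d, d, bias=False)      # Eq. (9)
        self.w_B = nn.Linear(d, 1, bias=False)      # maps an edge feature to a scalar bias, Eq. (6)
        self.beta = nn.Parameter(torch.zeros(d))    # learnable pseudo-edge feature for non-existing edges
        self.W_N = nn.Linear(2 * d, d, bias=False)  # Eq. (8)
        self.W_E = nn.Linear(3 * d, d, bias=False)  # Eq. (11)
        self.norm_h = nn.LayerNorm(d)
        self.norm_e = nn.LayerNorm(d)

    def forward(self, H, E, nbr_idx):
        """H: (N, d) node features, E: (N, k, d) edge features,
        nbr_idx: (N, k) long tensor listing the neighbors j of each node i (edge j -> i)."""
        N, k, d = E.shape
        Q, K, V = self.W_Q(H), self.W_K(H), self.W_V(H)
        logits = Q @ K.T / d ** 0.5                                   # (N, N), Eq. (5)
        # Attention bias (Eq. 6): pseudo-edge bias everywhere, overwritten where real edges exist.
        bias = self.w_B(self.beta).expand(N, N).clone()
        bias.scatter_(1, nbr_idx, self.w_B(E).squeeze(-1))
        A = torch.softmax(logits + bias, dim=-1)
        # Node update (Eqs. 8 and 10): global node messages plus gamma-normalized edge messages.
        node_msg = A @ V
        A_edge = torch.gather(A, 1, nbr_idx)
        A_edge = A_edge / (A_edge.sum(dim=-1, keepdim=True) + 1e-9)
        edge_msg = (A_edge.unsqueeze(-1) * E).sum(dim=1)
        H_new = self.norm_h(H + self.W_N(torch.cat([node_msg, edge_msg], dim=-1)))
        # Edge update (Eqs. 11 and 12) from the two endpoint node features.
        h_i = H_new.unsqueeze(1).expand(N, k, d)
        h_j = H_new[nbr_idx]
        E_new = self.norm_e(E + self.W_E(torch.cat([h_i, E, h_j], dim=-1)))
        return H_new, E_new
```

Only the (N, k) neighbor indices and one d-dimensional pseudo-edge vector per layer are stored beyond the node features, which is where the \(\mathcal{O}(N\times k+L)\) additional memory comes from.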

The output node features from the last layer are mapped to the distribution over 20 residue types through a linear layer with parameter \(\mathbf{W}_{P}\in \mathbb{R}^{d\times 20}\):

$$\mathbf{p}_{i}=\text{Softmax}\left(\mathbf{h}_{i}^{L}\mathbf{W}_{P}\right).$$
(13)

Negative log-likelihood loss is used during training.

Entire sequence design with base model

In the entire sequence design setting, we use an entropy-based residue selection method to construct the partial input sequence. Suppose \(\mathbf{p}_{i}^{b}\) is the probability distribution of vi predicted by a base model. We compute the entropy \(en_{i}^{b}\) of the distribution \(\mathbf{p}_{i}^{b}\):

$$en_{i}^{b} = \mathbb{E}\left[-\log \mathbf{p}_{i}^{b}\right].$$
(14)

Residues with the lowest entropy are selected and retained, while the others are masked. To account for the varying ability of different base models to recover native sequences, we select a different percentage of residues depending on the base model being used. The percentage of residues chosen for each base model is determined by the recovery rate on the validation split of CATH v4.2: we experiment with percentages ranging from 5% to 50% for each base model and select the one yielding the highest recovery rate. Specifically, we choose 10% for GVP-GNN, 10% for ProteinMPNN, 15% for ProteinMPNN-C, and 35% for ESM-IF1. These percentages are roughly correlated with the sequence recovery performance of each base model.

The partial sequence is fed into ProRefiner to obtain the probability predictions \(\mathbf{p}_{i}\) with entropy \(en_{i}\). Finally, the predictions from the base model and ProRefiner are weighted by their entropies and fused together:

$$\hat{\mathbf{p}}_{i}=\frac{\exp(-en_{i})}{\exp(-en_{i})+\exp(-en_{i}^{b})}\,\mathbf{p}_{i}+\frac{\exp(-en_{i}^{b})}{\exp(-en_{i})+\exp(-en_{i}^{b})}\,\mathbf{p}_{i}^{b}.$$
(15)

The final predicted residue type is the argmax of \(\hat{\mathbf{p}}_{i}\).
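A compact sketch of this inference procedure (ours, not the released code) is shown below; it assumes both models output (N, 20) probability tensors, uses a generic keep ratio in place of the per-base-model percentages above, and uses index 20 as a hypothetical mask token.

```python
# Sketch of entropy-based residue selection (Eq. 14) and entropy-weighted fusion (Eq. 15).
import torch

MASK_TOKEN = 20  # hypothetical index for the special "unknown" residue type

def entropy(p: torch.Tensor) -> torch.Tensor:
    """Per-residue entropy of (N, 20) probability distributions (Eq. 14)."""
    return -(p * torch.log(p + 1e-9)).sum(dim=-1)

def build_partial_sequence(p_base: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Retain only the base model's lowest-entropy predictions; mask the rest."""
    en_b = entropy(p_base)
    n_keep = max(1, int(keep_ratio * p_base.shape[0]))
    keep_idx = torch.topk(-en_b, n_keep).indices                  # lowest-entropy residues
    partial = torch.full((p_base.shape[0],), MASK_TOKEN, dtype=torch.long)
    partial[keep_idx] = p_base[keep_idx].argmax(dim=-1)
    return partial                                                # input to ProRefiner

def fuse(p_refiner: torch.Tensor, p_base: torch.Tensor) -> torch.Tensor:
    """Entropy-weighted fusion of the two predictions (Eq. 15); final type is the per-residue argmax."""
    w = torch.softmax(torch.stack([-entropy(p_refiner), -entropy(p_base)]), dim=0)  # (2, N)
    p_hat = w[0].unsqueeze(-1) * p_refiner + w[1].unsqueeze(-1) * p_base
    return p_hat.argmax(dim=-1)
```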

Experiment details of TnpB design

Plasmid vector construction. The TnpB gene was codon-optimized for expression in human cells and the optimized sequence was synthesized for vector construction (Sangon Biotech). The final optimized sequence was inserted into a pST1374 vector, which contained a CMV promoter and nuclear localization signal sequences at both the \(5^{\prime}\) and \(3^{\prime}\) termini. reRNA sequences were synthesized and cloned into a pGL3-U6 vector. Spacer sequences (EMX1: \(5^{\prime}\)-ctgtttctcaggatgtttgg-\(3^{\prime}\)) were cloned in by digesting the vectors with the BsaI restriction enzyme (New England BioLabs) for 2 h at 37 °C. The resulting vector constructs were verified through Sanger sequencing to ensure accuracy.

TnpB engineering. The construction of TnpB mutants was carried out through the use of site-directed mutagenesis. PCR amplifications were performed using Phanta Max Super-Fidelity DNA Polymerase (Vazyme). The PCR products were then ligated using 2X MultiF Seamless Assembly Mix (ABclonal). Ligated products were transformed into DH5α E. coli cells. The success of the mutagenesis was confirmed through Sanger sequencing. The modified plasmid vectors were purified using a TIANpure Midi Plasmid Kit (TIANGEN).

Cell culture and transfection. HEK293T cells were maintained in Dulbecco's modified Eagle medium (Gibco) supplemented with 10% fetal bovine serum (Gemini) and 1% penicillin-streptomycin (Gibco) in an incubator (37 °C, 5% CO2). HEK293T cells were transfected at 80% confluency, at a density of approximately 1 × 10⁵ cells per well of a 24-well plate, using ExFect Transfection Reagent (Vazyme). For indel analysis, 500 ng of TnpB plasmid plus 500 ng of reRNA plasmid were transfected into each well.

DNA extraction and deep sequencing. The genomic DNA of HEK293T cells was extracted using QuickExtract DNA Extraction Solution (Lucigen). Samples were incubated at 65 °C for 60 minutes and 98 °C for 2 minutes. The lysate was used as a PCR template. The first-round PCR (PCR1) was conducted with barcoded primers to amplify the genomic region of interest using Phanta Max Super-Fidelity DNA Polymerase (Vazyme). The products of PCR1 were pooled in equimolar amounts and purified for the second round of PCR (PCR2). The PCR2 products were amplified using index primers (Vazyme) and purified by gel extraction for sequencing on the Illumina NovaSeq platform. The specific barcoded primers used in PCR1 are listed in Table 4.

Table 4 The barcoded primers used in PCR1

Statistics and reproducibility

No statistical method was used to predetermine sample size, since the methods are evaluated on the full CATH test set, TS50, Latest PDB, EnzBench and BR_EnzBench datasets. For the Latest PDB dataset, we excluded multi-chain structures and structures with length greater than 500 or resolution > 2.5 Å. No data were excluded from the analysis for the other benchmarks. All genome editing experiments were performed with at least three biological replicates. All attempts at reproducibility were successful and standard deviations were within the expected ranges. Randomization and blinding are not applicable since we are not making a comparison between different groups.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.