Designing real novel proteins using deep graph neural networks

Protein structure and function are determined by the arrangement of the linear sequence of amino acids in 3D space. Despite substantial advances, precisely designing sequences that fold into a predetermined shape (the "protein design" problem) remains difficult. We show that a deep graph neural network, ProteinSolver, can solve protein design phrased as a constraint satisfaction problem (CSP). To sidestep the considerable issue of optimizing the network architecture, we first develop a network that is able to accurately solve the related and straightforward problem of Sudoku puzzles. Recognizing that each protein design CSP has many solutions, we train this network on millions of real protein sequences corresponding to thousands of protein structures. We show that our method rapidly designs novel protein sequences and perform a variety of in silico and in vitro validations suggesting that our designed proteins adopt the predetermined structures. One Sentence Summary: A neural network optimized using Sudoku puzzles designs novel protein sequences that adopt predetermined structures.

moderate success as of yet without any experimental validation (17-20). However, the number of sequences that share these structural templates is many orders of magnitude larger (about 70,000,000 sequences map to the CATH superfamilies), reflecting the fact that the protein design problem is inherently underdetermined, with a relatively large solution space. Thus, a suitable deep neural network trained on these sequence-structure relationships could potentially outperform previous models in solving the protein design problem.
The adjacency matrix is commonly used to represent protein folds; it is an N×N matrix consisting of the distances (in Å) between residues which are spatially close and thus interacting (we used a distance threshold of 12 Å); such pairs of residues are subject to constraints (e.g., in such pairs, two positive charges are usually not well tolerated). A given protein structure corresponding to one distance matrix can be formed by many different homologous sequences; these sequences all satisfy the constraints imposed. Such solutions to the constraint satisfaction problem are given to us by evolution and are available in sequence repositories such as Pfam or Gene3D (21). While the rules of this CSP for any specific fold are easily found by simply comparing sequences from one such repository and can be given as position weight matrices (PWMs), often represented as sequence logos, it has thus far not been possible to deduce general rules, i.e., those that would connect any given fold or adjacency matrix with a set of sequences. Here we use a graph neural network to elucidate these rules. The graph in this case is made up of nodes (corresponding to amino acids), with edges being the spatial interactions between amino acids that are represented in the distance matrix. The edges thus represent the constraints that are imposed on the node properties (amino acid types).
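As an illustration, the thresholded distance matrix described above can be computed directly from per-residue coordinates. The sketch below is not the authors' code: the coordinates are toy stand-ins for Cα positions, and only the 12 Å cutoff is taken from the text.

```python
import numpy as np

def contact_map(coords, cutoff=12.0):
    """Build a thresholded distance matrix from per-residue coordinates.

    coords: (N, 3) array of residue positions (e.g. C-alpha atoms), in Angstrom.
    Returns an (N, N) matrix holding the pairwise distance where two residues
    are within `cutoff` of one another, and 0 elsewhere (no contact).
    """
    diff = coords[:, None, :] - coords[None, :, :]  # (N, N, 3) displacement vectors
    dist = np.sqrt((diff ** 2).sum(-1))             # (N, N) pairwise distances
    contacts = dist <= cutoff
    np.fill_diagonal(contacts, False)               # a residue is not its own neighbour
    return np.where(contacts, dist, 0.0)

# Toy example: 4 "residues" spaced 5 Angstrom apart along a line.
# Residues 0 and 3 are 15 Angstrom apart, beyond the 12 Angstrom threshold.
coords = np.array([[0.0, 0, 0], [5.0, 0, 0], [10.0, 0, 0], [15.0, 0, 0]])
A = contact_map(coords)
```

The resulting matrix is symmetric, as expected for an undirected constraint graph.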
bioRxiv preprint doi: https://doi.org/10.1101/868935. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.

As there had been relatively little work on using graph neural networks to solve CSPs (22, 23), it was quite difficult to find an optimal network architecture for this relatively complex problem. Hence, we first engineered and optimized a graph neural network model to solve the related and well-defined problem of Sudoku (see Fig 1). Sudoku is a straightforward form of a CSP (24). Our architecture corresponds to a graph neural network; node attributes (in Sudoku, integers from 1-9) are embedded in a 128-dimensional space and later subjected to edge convolution taking into account edge indices (see Methods and Fig 1). We generated 30 million Sudoku puzzles of above-average difficulty, and we trained the network to solve those puzzles by minimizing the cross-entropy loss between network predictions and the actual missing numbers (Fig 2A). Our final graph neural network correctly predicts 72% of the missing numbers in a single pass through the network, and close to 90% of the missing numbers if we pass the input through the network multiple times, each time adding as a known value the single prediction from the previous iteration in which the network is most confident (Fig 2B). Similar accuracy is achieved on an independent test set containing puzzles from an online Sudoku puzzle provider (Fig 2C).
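The Sudoku constraint graph itself is easy to reproduce. The sketch below (an illustration, not the authors' code) enumerates, for each of the 81 cells, the cells sharing a row, column, or 3×3 box; counting directed edges plus one self-loop per cell is one way to arrive at the 1701 edges cited in the Methods (81 × 20 directed constraints + 81 self-loops).

```python
def sudoku_edges():
    """Directed constraint edges for a 9x9 Sudoku, plus one self-loop per cell.

    Two cells are constrained (cannot hold the same digit) if they share a
    row, a column, or a 3x3 box. Cells are numbered 0..80, row-major.
    """
    edges = []
    for a in range(81):
        ra, ca = divmod(a, 9)
        for b in range(81):
            rb, cb = divmod(b, 9)
            same_row = ra == rb
            same_col = ca == cb
            same_box = (ra // 3, ca // 3) == (rb // 3, cb // 3)
            if a == b or same_row or same_col or same_box:
                edges.append((a, b))
    return edges

edges = sudoku_edges()
# Each cell has 8 row + 8 column + 4 extra box neighbours = 20 constraints,
# so 81 * 20 = 1620 directed edges, plus 81 self-loops, gives 1701 edges.
```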
After optimizing the network architecture and parameters for the well-defined problem of solving Sudoku puzzles, we applied the same network to protein design. We compiled a dataset of 72 million unique Gene3D domain sequences from UniParc (25) for which a structural template could be found in the PDB (26), and we trained the network by providing as input a partially-masked amino acid sequence and an adjacency matrix adapted from the structural template, minimizing the cross-entropy loss between network predictions and the identities of the masked amino acid residues (Fig. 1). The training and validation accuracies achieved by the network reach a plateau after around 100 million training examples (about 6 epochs), with a training accuracy of ~22% and a validation accuracy of ~32% when half of the residues are masked in the starting sequence (Fig. 2D). The training accuracy is lower than the validation accuracy because, while the training dataset has no restriction on the similarity between sequences and structural templates, for the validation dataset we included only those sequences for which a structural template with at least 80% sequence identity to the query could be found. Reconstruction accuracy is considerably lower than for Sudoku (see Fig 2A-C), as was expected: the Sudoku CSP has a single well-defined solution, so an accuracy approaching 1 is possible. By contrast, each protein adjacency matrix can be adopted by many different sequences. The achieved accuracy of 30%-40% roughly corresponds to the common level of sequence identity within a protein fold (16). Evaluating our network by having it reconstruct sequences from our validation dataset, we note that even when removing the majority of the residues, a large proportion of the sequences are reconstructed with a sequence identity around 35% (see Fig 2F).
It should be noted that reconstruction accuracy much higher than this is not to be expected (as sequences can vary substantially while retaining the same fold) and is indeed not a meaningful measure of algorithm performance.
We next asked whether network scores for single mutations can be predictive of whether a mutation is stabilizing or destabilizing; presumably, a destabilizing mutation would also disrupt some of the constraints in the CSP and would thus be scored unfavourably by our graph neural network. We indeed find a relatively strong correlation (Pearson R: 0.42) with experimentally measured mutations from Protherm (27) (see Fig S1). Classical and statistical methods such as Rosetta, FoldX or Elaspic have long been in use for this problem and perform quite well, and indeed somewhat better than our network (28-30) (Rosetta obtaining Pearson R: 0.56 in our hands, see Fig S1). However, it should be noted that our network was not optimized for this particular task (and was never trained on any ΔΔG data) and indeed would not be expected to perform exceptionally well at it, since many or even most mutations in the data set would still allow the sequence to satisfy the CSP (and the protein to fold). We also note that evaluation using our network is vastly faster than classical methods (see below). As our graph network is able to reconstruct sequences with missing residues and score mutations with relatively high accuracy, we sought to use the network to generate entire sequences for a given structure (adjacency matrix). This corresponds to de novo protein design: designing a novel sequence for a given fold. To this end, we chose four protein folds that had been left out of the training set and cover the breadth of the CATH hierarchy. We used an A*-like algorithm to obtain sequences that score highly with our network (see Methods). In conventional protein design, both A* and dead-end elimination, as well as Markov-chain Monte Carlo sampling approaches to obtain sequences scored by a classical forcefield, have been applied (31, 32). However, this has remained a difficult problem, in part due to the great computational cost involved. Due to many factors, including the fact that ProteinSolver does not require rotamer optimization to evaluate a single sequence, evaluation of our network is many orders of magnitude faster than current protein design methods (see Table S1).
For the four folds (mainly α, serum albumin; mainly β, PDZ3 domain; mainly β, immunoglobulin; α+β, alanine racemase), we generated 200,000 sequences each using A* search. We find that the sequences tend to fall within 37-44% sequence identity to the native sequence, in line with the sequence identity within the respective protein families (see Fig S2-S5). As no information about the sequences was provided to the algorithm, this suggests that it was able to learn significant features of protein structure. We also find that the predicted secondary structure of our sequences (33) matches the target secondary structures almost exactly (see Fig S2-S5).
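Sequence identity here is the usual fraction of identical residues at aligned positions; for designs generated against a single template it can be computed without an explicit alignment, since all sequences share the template's length. A minimal illustration (not the authors' evaluation code; the sequences are made up):

```python
def sequence_identity(a: str, b: str) -> float:
    """Fraction of positions with identical residues; assumes pre-aligned,
    equal-length sequences (true for designs generated on one template)."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Toy example: 6 of 10 positions match, i.e. 60% identity.
identity = sequence_identity("MKTAYIAKQR", "MKSAYLAEQG")
```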
We next chose the top 20,000 (10%) sequences, as scored by our network, for each of the 4 folds for further evaluation. First, we used QUARK (34), a de novo (not template-based) structure prediction algorithm, to obtain structures for our sequences; for all four folds the obtained structures match the initial PDB structure (the one corresponding to the distance matrix used to generate the sequences) almost exactly (see Fig 3A-D). We evaluate the quality of the reconstruction using a number of computational metrics, including Modeller molpdf and Rosetta REU. In all cases, the scores are in the same range as or better than the target structure, suggesting that the sequences that our network generated de novo indeed fold into the shape corresponding to the adjacency matrix (see Fig 3E, 3F). For a final computational evaluation, we ran 100 ns molecular dynamics simulations on each structure, including the target structures. We find that in most cases, the target structure and the structures predicted for the designed sequences show similar root mean square fluctuation in the MD simulation, again suggesting that our sequences fold into the predetermined structure (see Fig 4A-D).
Finally, we chose two designed sequences and the corresponding sequences of the target structures (for the albumin and racemase structural templates) to evaluate experimentally. We ordered each sequence and expressed it as a His-tagged construct for Ni-NTA affinity purification. After purification, we evaluated the secondary structure of each protein using circular dichroism (CD) spectroscopy. Each secondary structure element has a distinctive absorbance spectrum in the far-UV region, thus similar folds should present similar absorbance spectra. We find that both sequences show a CD spectrum very similar to that of the native sequences. Taken together with the rest of the evidence from molecular dynamics, Modeller and Rosetta assessment scores, this strongly suggests that these sequences adopt the predetermined fold (see Fig 4E and 4F). This is particularly striking in the case of the albumin template, where the spectra are indistinguishable (Fig 4E). The designed sequence from the racemase template displayed a considerable loss of solubility compared to the sequence from the target structure. Although this made its characterization challenging, we were able to obtain a clear spectrum by combining a low ionic strength buffer (10 mM Na-phosphate, pH 8) with a 10 mm cuvette.
While the resulting CD spectrum is somewhat different from the target (Fig 4F), this may be due to technical issues resulting from low solubility or to a more dynamic nature of the designed protein (consistent with the molecular dynamics in Fig 4C); the spectrum clearly corresponds to a folded helical structure consistent with the predetermined fold.
We present ProteinSolver, a graph neural network approach to protein design that shows remarkable accuracy in generating novel protein sequences that fold into a predetermined fold. Previous approaches were hampered by enormous computational complexity, in particular when taking backbone flexibility into account. Our approach sidesteps this problem and delivers accurate designs for a wide range of folds. It is expected that it would be able to generate sequences for completely novel, imagined protein folds. As a neural network approach, its evaluation is many orders of magnitude faster than classical approaches, which will enable the exploration of vastly more potential backbones.
Funding acquisition: PMK. Competing interests: The authors are in the process of obtaining a patent on the described technology. Data and materials availability: All code and data will be publicly available on GitHub.

Table S1
Figures S1-S6
Figure 1: This allowed us to devise and optimize our solver graph neural network architecture. This architecture embeds both edge and node attributes in an X-dimensional space and, given the edge indices (i.e., the graph), subjects them to a two-layer edge convolution. This architecture then efficiently solves Sudoku puzzles and generates accurate protein sequences.

Figure 2E-F: Sequence identity between generated and reference sequences in cases where 0%, 50%, or 80% of the reference sequences are made available to the network.

Figure 4: Computationally, we performed 100 ns molecular dynamics simulations with both the protein in the target structure and our de novo designs (A-D). In all four cases, the designs show fluctuation in molecular dynamics comparable to the original target proteins, indicating that they are of comparable stability and thus stably folded. (E, F) We also experimentally expressed two of our designs and tested their structure using circular dichroism spectroscopy. We find that they are folded proteins in solution and that they adopt the same (E) or a very similar (F) structure as the protein that supplied the target structure.

Supplemental Materials
Online Methods

Data preparation
The websites which host the code and data that were used to generate these datasets are listed in Supp. Table S1. In brief, we downloaded Gene3D domain sequences from UniParc (1).

We first turn to Sudoku, a problem for which the output of the network can easily be verified. We treat Sudoku puzzles as graphs having 81 nodes, corresponding to squares on the Sudoku grid, and 1701 edges, corresponding to pairs of nodes that cannot be assigned the same number (see Figure 1). The node attributes correspond to the numbers entered into the squares, with an additional attribute to indicate that no number has been entered; the edge indices correspond to the 1701 pairs of nodes that are constrained such that they cannot have the same number; and the edge attributes are empty, because all edges in the graph impose identical constraints.
We generated 30 million solved Sudoku puzzles using the sugen program (5), which first generates a solved Sudoku grid using a backtracking grid-filler algorithm, and then randomly removes numbers from that grid until it produces a Sudoku puzzle with a unique solution and the requested difficulty level. The graph neural network was trained to reconstruct the missing numbers in the Sudoku grid by minimizing the cross-entropy loss between predicted and actual values. Throughout training, we tracked the accuracy that the network achieves on the training dataset (Figure 2A, blue line) and on the validation dataset (Figure 2A, orange line), which contains 1000 puzzles that were excluded from the training dataset. After trying multiple different network architectures and different hyperparameters, we found that the highest validation accuracy is achieved using a graph neural network with 16 graph convolution residual blocks, 162-dimensional node and edge attribute embeddings, EdgeConv layers containing a two-layer feedforward network with 256 neurons in the hidden layer, ReLU nonlinearities, and batch normalization that is applied before, rather than after, the outputs of the residual blocks are added to the input channels. The accuracy that the network achieves on the training dataset is close to what it achieves on the validation dataset, indicating that the network is not overfitting (Figure 2A-B). The test accuracy of the trained network was evaluated using a dataset comprising 30 curated Sudoku puzzles collected from http://1sudoku.com (6, 7) (Figure 2C).
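The data flow of a single EdgeConv-style pass (the layer type named above) can be sketched in plain NumPy. This is an illustration only: the real network stacks 16 residual blocks of such layers with a learned two-layer MLP, batch normalization, and learned embeddings, whereas this toy uses one random-weight linear map with a ReLU; the exact message construction in ProteinSolver may differ.

```python
import numpy as np

def edge_conv(node_feats, edge_index, edge_feats, W):
    """One EdgeConv-style pass: for each directed edge (src -> dst), build a
    message from [x_dst, x_src - x_dst, e], transform it (a linear map plus
    ReLU standing in for the MLP), and max-aggregate messages per target node.

    node_feats: (N, F) node attributes
    edge_index: (E, 2) array of (source, target) node indices
    edge_feats: (E, Fe) edge attributes
    W:          (2*F + Fe, F_out) weight matrix
    """
    src, dst = edge_index[:, 0], edge_index[:, 1]
    msg_in = np.concatenate(
        [node_feats[dst], node_feats[src] - node_feats[dst], edge_feats], axis=1
    )
    msg = np.maximum(msg_in @ W, 0.0)              # linear map + ReLU
    out = np.full((node_feats.shape[0], W.shape[1]), -np.inf)
    for e, j in enumerate(dst):                    # max over incoming messages
        out[j] = np.maximum(out[j], msg[e])
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                        # 4 nodes, 8 features each
ei = np.array([[0, 1], [1, 0], [1, 2], [2, 3]])    # directed edges
e = rng.normal(size=(4, 2))                        # 2 edge features per edge
W = rng.normal(size=(8 * 2 + 2, 16))
h = edge_conv(x, ei, e, W)                         # (4, 16) updated node features
```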
After optimizing the network architecture for solving Sudoku puzzles, we applied the same network to protein design, which is a less well-defined problem than Sudoku and for which the accuracy of predictions is more difficult to ascertain. We treat proteins as graphs, where nodes correspond to the individual amino acids and edges correspond to shortest distances between pairs of amino acids, considering only those pairs of amino acids that are within 12 Å of one another. The node attributes specify the amino acid type, with an additional flag to indicate that the amino acid type is not known, while the edge attributes include the shortest distance between each pair of amino acids in Cartesian space and the number of residues separating the pair of amino acids along the protein chain. For each domain sequence in our training dataset, we marked approximately half of the amino acids in the sequence as missing, and we trained the network to reconstruct the masked amino acids by minimizing the cross-entropy loss between predicted and actual values (Figure 2D). The edge attributes corresponding to each training example were passed into the network without modification. Throughout training, we tracked the accuracy with which the network can reconstruct amino acid sequences from our training and validation datasets where, in both cases, 50% of the amino acids were marked as missing (Figure 2D), and training was terminated once the validation accuracy appeared to have reached a plateau. We evaluated the trained network by examining how well it can reconstruct sequences from our test dataset using edge attributes alone, without any sequence information (Figure 2E), and by measuring the correlation between differences in network outputs for wild-type and mutant residues and the change in the Gibbs free energy of folding (ΔΔG) associated with mutations (8) (Supp. Figure S1A).
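The training objective described above, cross-entropy between predicted amino-acid distributions and the true identities computed only over masked positions, can be written compactly. The sketch below is a NumPy illustration of the loss itself, not the actual training code (which operates on logits produced by the graph network and typically uses 20 amino-acid classes plus an "unknown" flag on the input side).

```python
import numpy as np

def masked_cross_entropy(logits, targets, mask):
    """Mean cross-entropy over masked positions only.

    logits:  (L, 20) unnormalized scores, one row per residue position
    targets: (L,) integer amino-acid classes in [0, 20)
    mask:    (L,) boolean, True where the residue was hidden from the network
    """
    z = logits - logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]  # per-position NLL
    return nll[mask].mean()

rng = np.random.default_rng(1)
logits = rng.normal(size=(6, 20))
targets = rng.integers(0, 20, size=6)
mask = np.array([True, False, True, True, False, True])  # ~half masked
loss = masked_cross_entropy(logits, targets, mask)
```

A useful sanity check: with uniform (all-zero) logits, the loss equals ln 20 ≈ 3.0, the entropy of a uniform guess over 20 amino acids.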
In the case of sequence reconstruction, we observe a bimodal distribution in sequence identities between generated and reference sequences, with a smaller peak around 7% sequence identity, corresponding to generated sequences that share little or no homology with the reference sequences, and a larger peak around 37% sequence identity, corresponding to generated sequences that share substantial homology with the reference sequences (Figure 2E). While the fraction of designs falling into the second category increases when we provide the network with some of the original residues (Figure 2F), there are still some cases where the generated and the reference sequences share little homology despite 80% of the original residues being given as input (Figure 2F).
In the case of ΔΔG prediction, we found that the network can predict the ΔΔG associated with mutations with a Spearman's correlation coefficient of 0.426, despite never having been trained specifically for this task (Supp. Figure S1A).

Solving puzzles and generation of novel sequences
We used two different strategies to generate or complete CSPs: one-shot generation and incremental generation. In one-shot generation, we pass the input with the missing labels through the network only once, accepting for every node the label assigned the highest probability by the network. In incremental generation, we pass the input through the network once for every missing label. At each iteration, we accept the label for which the network has the most confident prediction, and we treat that label as known in the subsequent iteration.
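The incremental strategy can be sketched with a stand-in predictor. Here `predict` is a hypothetical placeholder for a network call returning a probability distribution over labels for every position; the loop fills one position per pass, always the one with the most confident prediction.

```python
import numpy as np

def incremental_generation(seq, predict):
    """Fill missing labels (None) one at a time, most-confident-first.

    seq:     list of labels, with None marking missing positions
    predict: callable mapping the partial sequence to an (L, K) array of
             per-position label probabilities (stands in for the network)
    """
    seq = list(seq)
    while any(s is None for s in seq):
        probs = predict(seq)
        missing = [i for i, s in enumerate(seq) if s is None]
        # Accept only the single most confident prediction on this pass.
        best = max(missing, key=lambda i: probs[i].max())
        seq[best] = int(probs[best].argmax())
    return seq

# Toy predictor: position i "prefers" label i % 3, with confidence growing
# for later positions, so missing positions are filled from right to left.
def toy_predict(seq):
    L, K = len(seq), 3
    probs = np.full((L, K), 0.1)
    for i in range(L):
        probs[i, i % K] = 1.0 + i
    return probs / probs.sum(axis=1, keepdims=True)

result = incremental_generation([None, None, 0, None], toy_predict)
```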
In order to generate a library of highly-probable solutions, we used a strategy similar to A* search (9), which allows us to achieve a compromise between the accuracy provided by exhaustive breadth-first search and the speed provided by greedy search. As with the incremental generation strategy described above, each time we pass the input through the network, we select a single node for which the network is making the most confident predictions. However, instead of accepting for that node the label with the highest probability, we accept all labels that have a probability above a threshold of 0.05, and we add the inputs partially filled with the accepted labels to a priority queue. Next, we extract from the priority queue a partially-filled input with the highest priority, extend it by one iteration, and place the results back onto the priority queue. This process is repeated until we generate the desired number of solutions.
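The search loop can be sketched with Python's `heapq`. This is an illustrative reconstruction under the stated rules (keep every label above the 0.05 threshold, re-prioritize partial solutions, pop the best first), not the authors' implementation: the per-position distributions here are a fixed toy table standing in for repeated network calls, positions are expanded left-to-right rather than most-confident-first, and log-probabilities are used so priorities add.

```python
import heapq
import math

# Toy per-position label probabilities (stand-in for network predictions;
# the real method recomputes these for every partial input).
PROBS = [
    {"A": 0.7, "B": 0.2, "C": 0.1},
    {"A": 0.5, "B": 0.5},
    {"A": 0.9, "B": 0.06, "C": 0.04},
]
P_STAR = 0.8       # expected max probability of future choices
THRESHOLD = 0.05   # keep any label at least this probable

def astar_generate(n_solutions):
    """Yield up to n_solutions complete label assignments, best-first.

    Priority of a partial solution with k accepted labels:
        (product of accepted p_i) * P_STAR ** (positions remaining),
    stored as a negative log so heapq's min-heap pops the best first.
    """
    heap = [(-len(PROBS) * math.log(P_STAR), ())]    # empty assignment
    out = []
    while heap and len(out) < n_solutions:
        neg_log, partial = heapq.heappop(heap)
        if len(partial) == len(PROBS):
            out.append("".join(partial))
            continue
        pos = len(partial)                           # next position to fill
        base = neg_log + math.log(P_STAR)            # one fewer future position
        for label, p in PROBS[pos].items():
            if p >= THRESHOLD:
                heapq.heappush(heap, (base - math.log(p), partial + (label,)))
    return out

solutions = astar_generate(3)
```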
The priority P_j, which is assigned to every partially-filled input placed on the priority queue, is defined by Equation 1. Here, p_i is the probability assigned by the network to the label accepted at iteration i, and p* is the expected maximum probability that the network will assign to labels that are accepted in the subsequent iterations. p* is a tunable parameter which controls how much the algorithm resembles greedy search vs. exhaustive breadth-first search. If p* is set to 0, the algorithm is equivalent to the incremental generation algorithm described above, where we immediately accept the most probable output from the previous iteration. On the other hand, if p* is set to 1, the algorithm is equivalent to exhaustive breadth-first search, where we use the network to extend every possible label at a given iteration before moving on to the next iteration.
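Equation 1 itself is not reproduced in the text above; a form consistent with the description, with n the total number of positions and k the number of labels accepted so far, would be:

```latex
P_j \;=\; \Big(\prod_{i=1}^{k} p_i\Big)\,\big(p^{*}\big)^{\,n-k}
```

This is a hedged reconstruction, not a verbatim copy of Equation 1. It reproduces the stated limiting behaviour: as p* approaches 0, the (p*)^(n-k) factor increasingly favours deeper partial solutions, recovering greedy incremental generation, while at p* = 1 the factor vanishes and partial solutions are ranked purely by their accepted probabilities, approaching exhaustive breadth-first search.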
On a single Titan Xp GPU, generation of 100,000 sequences for an average protein domain takes about 30 minutes.

Molecular dynamics
All water and ion atoms were removed from the structures with PDB codes 1N5U, 4Z8J, 4UNU, and 1OC7, corresponding to an all-α protein, a mainly-β protein, an all-β protein and a mixed α-β protein, respectively. The structural models for the generated sequences were produced using Modeller, with the PDB files described above serving as templates. Using TLEAP in AMBER16 (10) and the AMBER ff14SB force field (11), the structures were solvated by adding a 12 nm³ box of explicit water molecules (TIP3P). Next, Na+ and Cl- counter-ions were added to neutralize the overall system net charge, and periodic boundary conditions were applied. Following this, the structures were minimized, equilibrated, and heated over 800 ps to 300 K, and the positional restraints were gradually removed. Bonds to hydrogen were constrained using SHAKE (12), and a 2 fs time step was used. The particle mesh Ewald (13) algorithm was used to treat long-range interactions.

Experimental validation
Protein purification

All constructs were synthesized by Invitrogen GeneArt Synthesis (ThermoFisher) as Gene Strings. The construct design included flanking BamHI and HindIII restriction sites for subcloning into an N-terminal 6xHis-tagged pRSet A vector. All genes were codon optimized for expression in E. coli using the GeneArt suite.
Figure S1. Correlations between changes in the Gibbs free energy of folding associated with mutations (ΔΔG) for the Protherm dataset [7]. (A) Predictions made by the ProteinSolver network. (B) Predictions made by Rosetta's Cartesian protocol and the beta_nov15 energy function; details are provided for reference in Table S3.
.000 generated sequences. The secondary structure of the reference structure is displayed above the LOGO.
The secondary structure of the reference structure is displayed above the LOGO.
Figure S4. Generating sequences that fold into a protein resembling the structure of immunoglobulin, a protein containing mainly beta sheets. .000 generated sequences. The secondary structure of the reference structure is displayed above the LOGO.
Interestingly, the chosen sequence from the albumin template contains 4 potential disulfide bonds (cysteine pairs), whereas the albumin template used is known to have only 3 of these bonds (PDB 1n5u). The designed 1n5u ran in an SDS-PAGE under oxidising conditions as a single band. Furthermore, mass spectrometry analysis by electrospray ionization (ESI) of the molecular weight (MW) of designed 1n5u showed a loss of 8 Da relative to the theoretical MW, consistent with the loss of 8 protons. Altogether, this indicates that a new disulfide bond not present in the albumin structural template was efficiently inserted into the designed sequence.
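The arithmetic behind this inference is simple: forming each disulfide bridge removes two hydrogen atoms (about 1.008 Da each), so four bonds predict a loss of roughly 8 Da, whereas the three bonds of the template would predict only about 6 Da. A quick check:

```python
H_MASS = 1.008  # average mass of a hydrogen atom, in Da

def disulfide_mass_loss(n_bonds):
    """Mass lost when n_bonds disulfide bridges form (2 H atoms per bond)."""
    return n_bonds * 2 * H_MASS

loss_4_bonds = disulfide_mass_loss(4)  # ~8.06 Da, matching the observed ~8 Da
loss_3_bonds = disulfide_mass_loss(3)  # ~6.05 Da, what 3 template bonds predict
```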