From sequence to function through structure: Deep learning for protein design

The process of designing biomolecules, in particular proteins, is witnessing a rapid change in available tooling and approaches, moving from design based on physicochemical force fields to the fast generation of plausible, complex sequences via end-to-end differentiable statistical models. To achieve conditional and controllable protein design, researchers at the interface of artificial intelligence and biology leverage advances in natural language processing (NLP) and computer vision, coupled with advances in computing hardware, to learn patterns from growing biological databases, curated annotations thereof, or both. Once learned, these patterns can provide novel insights into mechanistic biology and the design of biomolecules. However, navigating and understanding the practical applications of the many recent protein design tools is complex. To facilitate this, we 1) document recent advances in deep learning (DL) assisted protein design from the last three years, 2) present a practical pipeline that goes from de novo-generated sequences to their predicted properties and web-powered visualization within minutes, and 3) leverage it to suggest a generated protein sequence that might be used to engineer a biosynthetic gene cluster producing a molecular glue-like compound. Lastly, we discuss challenges and highlight opportunities for the protein design field.


Introduction
Proteins take part in nearly every process of life, controlling a wide variety of functions. This functional versatility and the fact that proteins are nanoscopic, biodegradable materials have motivated tremendous efforts toward designing artificial proteins for medical and industrial applications. In fact, human-designed proteins are widely used across medicine, agriculture and manufacturing, constituting the active form of four of the five top-selling pharmaceuticals in 2021 [1]. To cite a few more examples, protein engineers have improved the thermostability of a malaria invasion protein for use as a vaccine antigen [2] and improved the activity of an enzyme capable of hydrolyzing polyethylene terephthalate (PET), providing a new, green, and scalable route for plastic waste recycling [3]. Despite the clear objective of designing functional properties, most protein design and engineering strategies have traditionally leveraged a structure-first approach [4], i.e. either by re-engineering known proteins or by designing novel stable structures that could then be further tweaked to target desired functions, e.g. binding to other molecules [5]. The reason why the concomitant design of sequence, structure, and function has remained so challenging is the computational intractability of the problem [6]: traditionally, protein design has been tackled as a mathematical optimization, where an algorithm, such as Monte Carlo [7], searched for the global minimum of a multidimensional physicochemical energy function [8].
In recent months, however, an explosion of methods leveraging advances in machine learning has provided a fresh alternative for de novo protein design, including the design of long (i.e., with many amino acids) functional proteins. Motivated by the enormous success of structure prediction methods [9][10][11][12] and the recent availability of large putative protein structure databases [12,13], some works exploit a structure-first approach to designing new folds and proteins [14,15]. In contrast, others operate sequence-first, by training large generative language models on vast sequence databases [16][17][18][19]. Advancing the field by exploiting different modalities (sequence, structure and even function) is fundamental, as no single modality may be able to explain all cellular phenomena necessary for the design of biologics [20].
For comprehensive reviews of protein engineering or machine learning methods for protein research, we refer to the works by Yang et al. [21] and Defresne et al. [22]. Yet, given the fundamental advances to protein design within a brief period of time, this manuscript attempts to provide an overview of recent work using AI, with a focus on the last three years. To showcase the practical uses of the work presented, we engineer a pipeline for the generation of de novo protein sequences selectable for tailored properties, which may benefit the protein design community, and make use of this novel pipeline to discover sequences that may generate natural products.
In particular, in the following:
1. We describe recent advances in protein design, namely those shifting from a physics-based function paradigm to one that uses deep learning architectures for sequence and structure generation. We aim to give practitioners a waymark to novel tools and their intended uses.
2. We offer a novel pipeline capable of generating protein sequences with tailored properties. In short, we couple ProtGPT2 [17], which generates de novo sequences, with ProtT5 [23], which predicts properties from them in order to discriminate sequences by desired functions.
3. We dig into the pipeline by presenting a use case for the selection of factory proteins with the predicted ability to produce natural products.
4. We discuss challenges to the design of marketable proteins with controllable properties.

Moving from physicochemical functions to deep neural networks in protein research
The de novo design of proteins was traditionally approached as an optimization problem in which an algorithm searches for the global minimum of an energy function [24]. Such a function must evaluate an astronomical number of sequences for a given backbone, quickly leading to an NP-hard problem that required turning to heuristic algorithms and static pairwise potential energy functions to limit computational complexity [25,26]. These approaches met with enormous success, producing a myriad of de novo generated proteins over the last 20 years [27]. These designs have remarkably evolved from often short, alpha-helical peptides [28] and bundles [29] to complex multi-domain architectures [30]. Much has been said about physicochemistry-based protein design; we refer readers to these comprehensive reviews [5,8,27].
More recently, however, deep learning (DL) approaches have provided a new avenue for protein design research by showing high accuracy in prediction tasks. Highly publicized progress came from DeepMind's AlphaFold [31] in December 2018 at CASP13 [32]: a multi-step pipeline incorporating DL that attempted to solve the decades-long problem of predicting protein 3D structure from sequence. Its successor, AlphaFold 2 [10], an end-to-end engineered solution, generated even more excitement due to its incredible ability to accurately predict protein structures from sequences in-silico [33,34]. End-to-end techniques refer to the case where a single model learns an underlying mathematical function that maps an input to a complex output. These solutions are of particular interest in the protein research realm as they capture relationships that are often hard to model explicitly with other techniques, for example the relationship between an input sequence and an output 3D structure [35,36].
In parallel to AlphaFold, natural language processing (NLP) methods were leveraged to learn novel protein representations by learning the protein language, offering an alternative route to the explicit extraction of evolutionary information from multiple sequence alignments (MSAs), historically done by collecting statistics on the co-evolution of residues within MSAs [25]. These models achieved an understanding of proteins by tasking multi-million-parameter DL architectures with solving millions of cloze tests (see Supplement 4) over large protein sequence datasets, allowing them to encode statistics of protein sequences without supervision on physicochemical or evolutionary relationships [23,37]. While at first protein language model (pLM) representations (shorthand: embeddings) did not outperform traditional exploitations of direct physicochemical and evolutionary knowledge [38,39], quick advances in mechanisms to process embeddings led to competitive predictors of protein function and structure [11,[40][41][42][43][44][45][46][47][48][49][50]. It is thus becoming clear that pLM embeddings are rich inputs to downstream prediction methods of protein function and structure, competing with those that exploit MSAs. However, what is encoded in embeddings, including how much evolutionary information as defined in MSAs (i.e., residue co-evolution) is implicitly captured, remains a subject of debate, even in light of the correlation between MSAs and pLM embeddings in accuracy for 3D structure prediction [41].
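The cloze objective mentioned above can be sketched in a few lines: a random subset of residues is replaced by a mask token and the model is trained to recover the originals. The masking fraction and mask token below are illustrative choices, not the scheme of any specific pLM.

```python
import random

def make_cloze_example(seq, mask_frac=0.15, mask_token="<MASK>", seed=None):
    """Mask a random subset of residues; the training target is to predict
    the original residue at each masked position (illustrative scheme)."""
    rng = random.Random(seed)
    n_mask = max(1, int(len(seq) * mask_frac))
    positions = rng.sample(range(len(seq)), n_mask)
    tokens = list(seq)
    targets = {i: tokens[i] for i in positions}  # what the model must recover
    for i in positions:
        tokens[i] = mask_token
    return tokens, targets

tokens, targets = make_cloze_example("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", seed=7)
```

Training on millions of such examples is what lets a pLM encode sequence statistics without any explicit physicochemical or evolutionary supervision.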
While these pLMs excel at embedding representations of input sequences, other types of pLMs (discussed below) are now capable of directly generating new sequences. With increasing experimental validation of "black box" DL methods, the protein design field now has a unique opportunity to generate sequences, interpret DL representations, and build pipelines aiding various stages of protein design and drug development, from initial library generation to refinement and optimization.
The deep learning era of protein sequence and structure generation

Previous reviews have focused on describing the types of neural network architectures used in protein design [22,51]. We focus instead on the type of problem these methods attempt to solve: fixed-backbone design (Fig. 1, Panel 1), structure generation (Fig. 1, Panel 2), sequence generation (Fig. 1, Panel 3), and concomitant structure and sequence design, most often via hallucination (Fig. 1, Panel 4).
We discuss several DL methods (bold and underlined to match rows in Table 1) from the last three years, focusing on those not using Potts models [52,53], which have been reviewed and analyzed elsewhere [53][54][55]. A virtual version of this table can be found at https://github.com/hefeda/design_tools, while another comprehensive list of DL methods for protein design is available at https://github.com/Peldom/papers_for_protein_design_using_DL.
The first category of methods we describe focuses on solving the traditional protein design problem, i.e., finding a sequence that optimally adopts a desired backbone (Fig. 1, Panel 1). The performance of these methods is usually evaluated by native sequence recovery (NSR), i.e., the percentage of wild-type amino acids recovered by the design method for a given input backbone. While this metric has limitations, given that the identity percentage does not necessarily correlate with expression or functional levels [56], it is nevertheless a convenient measure of how well a method recapitulates wild-type sequences. Some of the first attempts came from SPIN [57] and SPIN2 [58], which leveraged three-layered fully-connected neural networks (FNNs) to learn from structural features embedded as a 1-dimensional (1D) tensor representing backbone torsion angles, local fragment-derived profiles, and global energy-based features. While SPIN and SPIN2 achieved an NSR of 30 % and 34 %, respectively, they suffered from information loss due to the 1D input representation and the shortcomings of FNNs in encoding local and global context. Their successor, SPROF [59], tried to remedy this information loss by leveraging a 2D input and a 2D convolutional neural network (CNN), which brought NSR to 39.8 %. 2D CNN models excelled for years in computer vision tasks as they handle local and global information better than FNNs [60]. In the protein context, this information may comprise spatial and geometric features of 3D structure, which can be extracted in an unsupervised fashion through CNNs, given that protein 3D structures can be proxied accurately (up to symmetry) by 2D representations such as contact or distance maps. For instance, ProDCoNN [61] learned to reconstruct sequences by representing discretized atomic environments through voxelized 18 Å-cubic boxes, leading to an NSR of over 45 %.
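The NSR metric itself is straightforward to compute once designed and wild-type sequences are aligned; a minimal sketch (the toy sequences are illustrative, not drawn from any benchmark):

```python
def native_sequence_recovery(designed: str, wild_type: str) -> float:
    """Percentage of positions at which the designed sequence
    recovers the wild-type residue (sequences must be aligned)."""
    if len(designed) != len(wild_type):
        raise ValueError("sequences must be aligned to equal length")
    matches = sum(d == w for d, w in zip(designed, wild_type))
    return 100.0 * matches / len(wild_type)

# Example: 6 of 8 residues recovered -> 75.0 % NSR
print(native_sequence_recovery("MKTAYIAK", "MKTAYLAR"))  # 75.0
```

In practice NSR is averaged over many target backbones, which is why the single-structure percentages reported by SPIN, SPROF, and successors are comparable across papers.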
The tool by Anand et al. [14] designed side-chain rotamers in addition to recovering sequence identity at a given input backbone position, reaching an NSR of 57 %; three of its designs were validated with X-ray crystallography. Meanwhile, deep learning researchers attempted to solve training challenges of CNNs, such as the elevated computational cost associated with increasing network depth. For instance, DenseNet [62] expanded on the concept of residual connections, i.e. wiring a layer's input to all subsequent residual blocks, which inspired DenseCPD [63]. Other attempts encoded more information in the input to CNNs, e.g. CNN protein landscape [64], which among other features encoded side-chain atoms, partial charges and solvent accessibility, reaching 60 % NSR, and TIMED [65], which additionally included reimplementations of several CNN-based protein design methods.
Protein 3D structures can also be represented by graphs where nodes are residues (or atoms) and edges represent structural proximity. Graph neural networks (GNNs) directly harness such graph representations.

[Fig. 1. Leveraging deep learning for different protein design goals. We classify DL approaches in (1) models that try to solve the traditional protein design problem of inverse folding, i.e. find sequences that fold into a desired structure, (2) models capable of generating structure-encoding objects, like contact or distance maps, (3) models that learn to generate protein sequences, or (4) models that concomitantly model sequence and structure to generate either. Categories 1-4 are cross-linked in Table 1.]

Most of the early work in the second category of methods, generating structure-encoding objects (Fig. 1, Panel 2), came from the Po-Ssu Huang lab, which initially trained generative adversarial networks (GANs) on contact maps used as input to a convex optimization algorithm (the alternating direction method of multipliers, ADMM) to recover 3D coordinates [78]. Later, the network was improved in Anand2 to learn from distance maps and included a learned coordinate recovery module replacing ADMM [79]. In these methods, sequences were generated from the designed backbones using Rosetta [80]. GANs suffered from a lack of satisfactory accuracy in the coordinate recovery process and from the loss of resolution when inputting structures as either contact or distance maps, which led to missing biochemical features and unrealistic designs [81]. IG-VAE [81] addressed some of these shortcomings by training a variational autoencoder (VAE) that directly generates 3D coordinates of backbone atoms for class-specific immunoglobulin proteins. The VAE implemented in Lai et al. [82] instead outputs a conformational ensemble of protein structures.
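The 2D structure proxies used by these methods are simple to derive from coordinates: a contact map is just a thresholded pairwise distance map. A minimal numpy sketch over Cα positions (the 8 Å cutoff is a common convention, used here as an assumption):

```python
import numpy as np

def contact_map(ca_coords: np.ndarray, cutoff: float = 8.0) -> np.ndarray:
    """Binary residue-residue contact map from an (L, 3) array of
    C-alpha coordinates: 1 where pairwise distance < cutoff."""
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))       # (L, L) pairwise distance map
    return (dist < cutoff).astype(np.uint8)

# Toy 3-residue chain: residues 0-1 and 1-2 are close, 0-2 is not
coords = np.array([[0.0, 0.0, 0.0], [5.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
cmap = contact_map(coords)
```

Because the map is symmetric and invariant to rotations and translations, generative models can learn on it directly, deferring coordinate recovery to a separate step (ADMM or a learned module, as above).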
Similar methods aimed at generating structures are RamaNet [83], a long short-term memory network (LSTM) trained in an autoregressive manner to output a sequence of φ and ψ angles to design alpha-helical structures, and DECO-VAE [84], based on VAEs. As opposed to the fixed-backbone methods (Fig. 1, Panel 1), these generative structure methods allow exploring novel, unseen topologies, which could host novel functions. Nevertheless, it is often desirable to control the design process, i.e., to condition the sampling towards aspects of function, structure, or sequence. Two methods (SCUBA and GENESIS) allow doing so by inputting a series of secondary structure rules or sketches (such as 'helix-loop-helix'), which guide the network.
SCUBA [85] used statistical representations of backbones by tasking FNNs to learn from radial kernels encoding different representations of 3D structure. Designs from SCUBA were experimentally evaluated through X-ray crystallography, leading to the discovery of three novel topologies. GENESIS [86] implemented a VAE that takes secondary structure sketches and outputs contact maps with finer definitions of secondary structural elements. ProteinSGM [87] uses stochastic differential equations (SDEs) to generate matrices capturing distances and torsional angles, which are then passed to Rosetta to produce 3D folded structures. FoldingDiff [88] merges these two steps by using a set of six internal angles, directly producing good-quality structures without needing other methods like Rosetta for refinement. ProtDiff and SMCDiff [89] adopt a similar approach, producing 3D coordinates directly as output.

Another emerging protein design branch focuses on sequence generation (Fig. 1, Panel 3), mostly inspired by the impressive advances in natural language processing (NLP) over the last few years [51,90]. Language models have been extensively applied to protein sequences (e.g., ESM [37,41] or UniRep [91], focused on protein representation learning). Although many of these models could generate protein sequences by, for example, sampling single-site substitutions using a Monte Carlo approach [92], we do not include them here since they were not evaluated for this objective. Possibly the first advance in the area of models trained specifically for sequence generation came from ProteinGAN [93], a GAN trained on the family of malate dehydrogenases (MDH) capable of producing novel functional MDH sequences with as little as 66 % identity to natural protein sequences. Since then, a myriad of autoregressive protein language models (pLMs) with generative capabilities, often leveraging the successful Transformer architecture [94], have followed.
Since the applications of autoregressive Transformers to create protein language models have been extensively reviewed [51], we only briefly mention these. ProGen [16] was the first reported decoder-only model specifically trained for protein sequence design and included over 1,100 UniProt [95] control tags (keywords). These tags could be used to control the generation process, e.g. by selecting for acyltransferase activity. Transformer models can also be "fine-tuned" to achieve a desired goal, practically an alternative to using tags for controllable generation. ProGen was employed for the generation of lysozymes by fine-tuning on five diverse lysozyme families, resulting in about 50,000 sequences. Generated sequences showed enzymatic activity in the range of their natural counterparts, and one sequence was purified and its structure resolved via X-ray crystallography [96]. ProtTrans [23], an extensive probe into the ability of six Transformer-based architectures to encode protein sequence knowledge, included the training of autoregressive models (ProtXL, ProtXLNet, ProtElectra-Generator-BFD, and ProtT5) which have, in principle, sequence generation ability, although they were not originally tested for this task. DARK3 [19] is a decoder-only model with 100 M parameters trained on synthetic sequences. Following the principles of DARK3, ProtGPT2 leveraged a GPT2-like model [97] and trained it on the UniRef50 dataset [95], leading to a model able to generate proteins in unexplored regions of the natural protein space while presenting natural-like properties [17]. RITA [98] included a study on the scalability of generative Transformer models with several model-specific (e.g. perplexity) and application-specific (e.g. sequence properties) benchmarks, similar to ProGen2 [18], which was accompanied by the release of all pre-trained weights and architectures, ranging from 151 M to 6.4 B parameter models.
Similar to these models, Tranception [99] exploited the autoregressive objective, but with a novel attention mechanism aimed at extracting subsequence k-mers, which proved very effective for protein modeling. The work included ProteinGym, a benchmark to assess the performance of fitness predictors. Another model with generative ability is EVE [100], a VAE used to predict the pathogenicity of protein variants. A particularly interesting application of language models came with ReLSO [101], which used a Transformer autoencoder paired with function prediction, inferring protein functionality from sequence embeddings. This model can also be used to generate new sequences by optimizing the latent space with gradient ascent.
The last category of design methods encompasses those capable of concomitantly designing sequence and structure (Fig. 1, Panel 4). Possibly the first method in this class, Hallucination [102], allowed "hallucinating" 100-amino-acid-long de novo protein sequences leveraging trRosetta. The term hallucination was inspired by DeepDream [103], a CNN capable of generating mesmerizing, psychedelic images by combining input patterns iteratively (e.g. generating a house from eyes and faces). Protein hallucination works similarly: random sequences (which only have arbitrary local structural patterns) are passed to a structure prediction method, such as trRosetta, which predicts a distance map. The difference between this map and a background distribution trained on high-resolution natural structures is iteratively minimized by mutating the sequence one mutation at a time and recomputing its distance map, ultimately reaching a minimal (or optimal) distance between the background distribution and the distance map. Iterating over this process for 40,000 Monte Carlo steps led to sharply defined distance maps. 129 designs were expressed in E. coli, of which 27 were monomeric and well-folded, with three validated through crystallization [102]. As in other applications, it is often paramount to gain control over specific properties, such as building a scaffold around a particular structural motif, like a binding pocket. Constrained hallucination [104] modified the hallucination process in two ways: first, by using a composite loss function combining the losses from Norn et al. [105] and Anishchenko et al. [102], which allowed a model to concomitantly find a sequence for the structural motif while hallucinating the scaffold around it; second, by replacing the Monte Carlo sampling procedure with gradient-based sequence optimization, leading to an 18-fold decrease in sampling time (from 90 to 5 min).
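The hallucination loop described above can be sketched with stand-ins for its expensive parts. Below, `toy_predict` and `toy_loss` are deliberately trivial placeholders (they merely reward hydrophobic content so the loop runs), not trRosetta or a real distance-map divergence; the acceptance rule is greedy rather than the simulated annealing used in practice.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def hallucinate(predict, loss, length=30, steps=500, seed=0):
    """Toy Monte Carlo hallucination: mutate one residue at a time and
    keep only mutations that lower the loss of the predicted 'structure'."""
    rng = random.Random(seed)
    seq = [rng.choice(AMINO_ACIDS) for _ in range(length)]
    best = loss(predict("".join(seq)))
    for _ in range(steps):
        pos, new = rng.randrange(length), rng.choice(AMINO_ACIDS)
        old = seq[pos]
        seq[pos] = new
        trial = loss(predict("".join(seq)))
        if trial < best:          # accept improving mutations only
            best = trial
        else:
            seq[pos] = old        # reject and revert
    return "".join(seq), best

# Stand-in "predictor" and loss: NOT a real distance-map objective.
toy_predict = lambda s: s
toy_loss = lambda s: -sum(c in "AILMFVW" for c in s) / len(s)
seq, score = hallucinate(toy_predict, toy_loss)
```

Swapping `toy_predict` for a structure predictor and `toy_loss` for the divergence to a background distance-map distribution recovers the scheme of Anishchenko et al.; the gradient-based variant replaces the mutate-and-test loop entirely.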
In RFjoint [106], this approach was further enhanced by employing RoseTTAFold [107] instead of trRosetta, thus minimizing differences in 3D structure instead of distance maps [106], and in Roney et al. [108], AlphaFold2 was used as the core model for the hallucination. However, this constrained hallucination scheme was still too computationally expensive, and thus the authors fine-tuned a RoseTTAFold variant with a three-term loss which allowed RFjoint to inpaint missing sequences and structures in a few seconds [106]. The method was extensively evaluated experimentally [106]. Lastly, Protein Diffusion [109] leveraged diffusion models [110][111][112], popularized by generative image approaches [113], to train a model capable of generating sequence and structure based on a set of secondary structure input constraints.

An offline pipeline for protein generation and selection through visual exploration of predicted features
Despite significant advances, the field appears to remain distant from tools allowing the seamless design of proteins that fulfill specific properties in an end-to-end fashion, and even further from producing marketable products primed to pass clinical trials or environmental regulations. A step towards this goal would be DL models that, prompted with a set of biological mechanisms and industry-relevant properties (e.g., desired thermostability, aggregate viscosity, sequence length, subcellular localization, catalytic capabilities, or binding partners), output a sequence or structure satisfying the selected criteria with high precision in a timely fashion. Whilst this may not yet be possible, we present an offline attempt (i.e., not an end-to-end solution) by combining a generative protein sequence model (Fig. 1, Panel 3) [17] with an oracle discriminator DL model that helps query generated sequences for desired properties [23].
In particular, we unconditionally (i.e., without priors on family, function or structure) generated a set of 100,000 protein sequences using ProtGPT2, and predicted secondary structure [23], Gene Ontology (GO) terms [45], per-residue ability to bind small molecules, nucleotides or metals [48], protein subcellular localization [47], transmembrane topology [42], residue conservation [43], residue disorder [44] and CATH family [46]. Remarkably, this generated a repertoire of 100,000 protein sequences with 12 predicted features of structure and function from a single script in approximately 3.5 h (Supplement 1). To analyze whether the generated sequences resemble natural ones, we prepared two more sets to compare against: 1) 100,000 protein sequences sampled from UniRef50 [95], which we call "U50", and 2) a nonsense sequence set created by shuffling residues within the sequences of the sampled UniRef50 set (e.g. "SEQVENCE" may become "QEEVNCSE"), which we call "random". We then predicted structure and function for both sets using the same methods and script applied to the generated sequences.
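The shuffled "random" baseline preserves each sequence's amino-acid composition while destroying its ordering, which is exactly what makes it a useful null model. A minimal sketch:

```python
import random

def shuffle_sequence(seq, seed=None):
    """Shuffle residues within a sequence, preserving amino-acid
    composition but destroying ordering (e.g. 'SEQVENCE' -> 'QEEVNCSE')."""
    rng = random.Random(seed)
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)

shuffled = shuffle_sequence("SEQVENCE", seed=42)
assert sorted(shuffled) == sorted("SEQVENCE")  # same composition, new order
```

Because composition is held fixed, any difference between "random" and the other two sets must come from residue order, i.e. from the sequence patterns the predictors (and the generator) have learned.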
Qualitatively comparing distributions of several predicted features for the three sets (Fig. 2) suggests that generated protein sequences resemble natural sequences more than random sequences. When comparing distributions of predicted features between generated and natural sequences, p-values were larger than when comparing the same distributions for either natural or generated against random sequences, suggesting that the newly generated sequences are closer to natural ones than to the random background (analyses included in our Notebooks; see Availability). However, pairwise t-tests on predicted features with multiple-testing correction failed to give a clear indication as to whether generated and natural predicted feature distributions are more similar to each other than to those for random. As generated sequences are accompanied by a wealth of interpretable predicted features, we provide a simple Jupyter Notebook (https://github.com/hefeda/PGP) that allows querying the generated protein set in quasi-natural language via the predicted features. To cite two examples, a user could shortlist protein sequences with more than 200 residues that have at least 30 % of residues involved in alpha-helical secondary structure and that localize to the outer cell membrane according to GO annotations, or select for sequences shorter than 100 residues with transmembrane strand content and binding to small molecules. The resulting sequences filtered by the desired query, displayed in the Jupyter Notebook, are linked to LambdaPP [114], a web server for visual exploration of predicted features that also predicts and displays protein 3D structure using ColabFold [115] and allows 3D structure comparisons through FoldSeek [116].
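The first example query above reduces to filtering a table of per-sequence predictions. The sketch below uses hypothetical field names and toy values (not the actual schema of the published notebook) purely to show the shape of such a query:

```python
# Hypothetical per-sequence predicted features (field names and values assumed).
predictions = [
    {"id": "gen_001", "length": 250, "helix_fraction": 0.45, "localization": "outer membrane"},
    {"id": "gen_002", "length": 80,  "helix_fraction": 0.10, "localization": "cytoplasm"},
    {"id": "gen_003", "length": 310, "helix_fraction": 0.32, "localization": "outer membrane"},
]

# "More than 200 residues, at least 30 % helix, outer-membrane localization"
hits = [p["id"] for p in predictions
        if p["length"] > 200
        and p["helix_fraction"] >= 0.30
        and p["localization"] == "outer membrane"]
# hits == ['gen_001', 'gen_003']
```

The second example query (shorter than 100 residues, transmembrane strand content, small-molecule binding) follows the same pattern with different predicates.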
Whilst not an end-to-end solution, carefully crafted combinations of lightning-fast DL generators and predictors, easy query mechanisms, and rich visual tools allow pushing the envelope in the discovery of novel candidates, as explored in the following use case.

Use case: selecting protein factories for molecular glues through deep learning
Proteins can serve as factories, and are increasingly engineered to manufacture other products, such as the triterpene squalene [117]. Often, protein factories are produced by genes in biosynthetic gene clusters (BGCs) found in many microorganisms. Protein language model-based deep learning was used to create a tool able to scan genomes for BGCs, assess their functional classification according to Pfam [118], and predict their output product [119]. These protein factories are also the source of many therapeutically used natural products, from which an estimated 50 % of all FDA-approved drugs are derived [120]. It was discovered that many of these natural products act through a newly established mechanism used by microbial, plant, fungal and animal cells to regulate protein interaction networks at the proteome scale [121], such as the plant hormones auxin and jasmonate [122]. These small molecules are now aptly named molecular glues, as they work by gluing a protein of interest to a regulating protein, often an enzyme called the effector protein. Multiple approved drugs with previously unknown mechanisms of action have now been identified to work through this mechanism [123,124]. Molecular glues hold tremendous promise, allowing protein classes traditionally considered undruggable to be drugged: they do not require binding to enzymatically active sites, can work in shallow, featureless pockets [125], and can act on intrinsically disordered proteins, which can become ordered at the stabilized protein interface [126,127]. Approved molecular glues, however, have all been discovered through serendipity, and it appears that nature has been much more successful at designing this class systematically.
To show the therapeutic promise of artificially designed proteins, we mined unconditionally ProtGPT2-generated protein sequences for their potential to act as factories producing novel molecular glues. These sequences could be used as a starting pool in a directed evolution approach. Generating sequences, in contrast to the typically used point-mutation and shuffling strategies, can achieve a better balance between exploring sequence diversity and maintaining reaction stability. To do so, we filtered the set of ProtGPT2-generated sequences discussed previously using embedding-based annotation transfer (EAT). First, we filtered for sequences with the Gene Ontology (GO) annotation "secondary metabolite biosynthetic process" (GO:0044550) or its child terms, using a cutoff of 0.6 on the Euclidean distance in embedding space, resulting in 1,345 sequences (we used a more relaxed distance cutoff than Littmann et al. [45] in order to derive a sizable set of candidate sequences). Second, we computed embedding Euclidean distances between these 1,345 sequences and the sequences in the biosynthetic gene cluster dataset MiBIG [128], keeping only generated sequences within a distance of 0.6 of at least one MiBIG sequence. We discovered that even with no further optimization, 234 generated, filtered sequences are similar (up to a Euclidean embedding distance of 0.6 to both relevant GO annotations and annotated MiBIG sequences) to naturally occurring BGC proteins (Fig. 3).
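The distance filter underlying both EAT steps is a nearest-neighbor query in embedding space; a minimal numpy sketch (the 2-D toy embeddings are illustrative, real pLM embeddings have hundreds to thousands of dimensions):

```python
import numpy as np

def eat_filter(query_embs, reference_embs, cutoff=0.6):
    """Indices of queries whose nearest reference embedding lies within
    `cutoff` Euclidean distance, mimicking embedding-based annotation
    transfer as a filtering step."""
    diffs = query_embs[:, None, :] - reference_embs[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)   # (n_query, n_ref) distance matrix
    nearest = dists.min(axis=1)              # distance to closest reference
    return np.where(nearest <= cutoff)[0]

# Toy 2-D "embeddings": query 0 is close to a reference, query 1 is not
queries = np.array([[0.1, 0.0], [5.0, 5.0]])
refs = np.array([[0.0, 0.0], [1.0, 1.0]])
kept = eat_filter(queries, refs)             # -> array([0])
```

Applying this first against GO-annotated embeddings and then against MiBIG embeddings reproduces the two-stage funnel described above; for large sets, the brute-force distance matrix would be replaced by an approximate nearest-neighbor index.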
We focus on one generated protein (sequence in Supplement 2) which is close to a short-chain dehydrogenase BGC constituent involved in producing fusicoccin A (one of the molecules depicted in Supplement 3, Fig. S1, panel A). Fusicoccin A is a phytotoxic molecular glue originally discovered to stabilize the interface between the scaffolding protein 14-3-3 and H+-ATPases from Arabidopsis thaliana (AHA2) and Nicotiana plumbaginifolia (PMA2) [129,130], but it has since been developed semisynthetically into an anti-cancer drug [131]. By searching for similar proteins in embedding space across different datasets (Fig. 3, panels B and C) and in predicted fold space, by querying the AlphaFold database [13] through FoldSeek [116] using the ColabFold [115]-predicted 3D structure of the designed sequence (Fig. 3, panel D), three natural proteins emerged. All three predicted structures belong to fungal short-chain dehydrogenases with very similar folds, but are involved in the production of completely different natural products, as expected from different BGCs. This demonstrates a potential in-silico workflow of generating sequences with the aim of evolving specific BGCs to produce desired molecules after further in-vitro confirmation, for instance by applying directed evolution [132].

Discussion
Deep learning ushers in a new wave of tools for protein design, engineering, prediction, and optimization. The efficient combination of in-silico and in-vitro approaches in multi-stage pipelines, such as in drug discovery, remains complex, costly, and often requires many types of expertise. However, practitioners now have access to a growing collection of in-silico DL tools which particularly address the first stages of these pipelines (Table 1). DL methods have started a revolution in protein design, a field that has undergone a paradigm shift over the last few years, moving away from traditional physico-chemical energy functions. As we highlight in the use case, generative sequence models can create large libraries of plausible sequences for which oracle models predict structural and functional properties in a matter of hours (Fig. 2). This process requires no technical DL expertise and yields data from which experts can select and refine one or a set of promising candidates as starting points for in-vitro experiments (Fig. 3).

[Fig. 2 caption: Distributions of predicted structural and functional features suggest generated proteins are closer to natural sequences than to random ones. Legend: U50: 100 k sequences sampled from UniRef50; ProtGPT2 (violet): 100 k DL-generated sequences; Random (green): sequences created by randomly shuffling the residues within the U50 set; Length: length of a stretch, e.g. 10 consecutive residues; Euclidean Distance: the Euclidean distance in embedding space to the closest sequence in an annotated set. Structure at residue resolution: consecutive residues forming stretches predicted as transmembrane strands (A), transmembrane helices (B), general strands (C), helices (D), or disorder (E) follow similar stretch-length distributions for U50 and ProtGPT2, but differ from Random. For instance, random sequences prefer short stretches (A-D, left of distribution) and are overall less likely to be predicted as part of a TM-strand or -helix (A-B). Function at sequence resolution: embedding-based annotation transfer (EAT) distributions of the distance between proteins in U50, ProtGPT2, and Random and the closest protein with existing GO annotations from SwissProt, for biological process (BPO/F), cellular component (CCO/G), or molecular function (MFO/H), or to sequences with CATH annotations (I). The distance distribution in the high-confidence interval (up to a Euclidean distance of 0.5 for F-H or 0.8 for I, indicated by the black dashed line; thresholds differ due to different embedding spaces and tasks: F-H use the ProtT5 embedding space to infer function [45], while I uses the ProtTucker [46] embedding space to infer structure) is similar for U50 and ProtGPT2, but differs for Random. Comparing the three sets suggests that annotation transfers at Euclidean distances greater than the respective thresholds (right of the vertical dashed line, F-I) invite uncertainty, whilst transfers between proteins below these thresholds may be considered reliable.]
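The EAT procedure used to annotate generated sequences (Fig. 2, panels F-I) can be sketched compactly. This is a simplified illustration, not the published implementation: the cutoffs of 0.5 (GO annotations, ProtT5 space) and 0.8 (CATH, ProtTucker space) are taken from the figure legend, and the tiny 2-D embeddings below are placeholders for real 1024-d vectors.

```python
import math

# Confidence thresholds from the Fig. 2 legend (assumption: one cutoff per
# embedding space).
GO_THRESHOLD = 0.5    # ProtT5 space, function (BPO/CCO/MFO)
CATH_THRESHOLD = 0.8  # ProtTucker space, structure

def transfer_annotation(query_emb, annotated, threshold):
    """Embedding-based annotation transfer (EAT): copy the annotation of the
    nearest annotated protein, but only within the confidence threshold.
    `annotated` is a list of (embedding, annotation) pairs."""
    dist, label = min(
        (math.dist(query_emb, emb), ann) for emb, ann in annotated
    )
    # Beyond the threshold the transfer is considered unreliable.
    return (label, dist) if dist <= threshold else (None, dist)

# Toy 2-D embeddings standing in for per-protein ProtT5 vectors.
lookup = [([0.0, 0.0], "GO:0016491"), ([1.0, 1.0], "GO:0005515")]
print(transfer_annotation([0.1, 0.0], lookup, GO_THRESHOLD))  # reliable hit
print(transfer_annotation([5.0, 5.0], lookup, GO_THRESHOLD))  # too distant
```

The same filtering by distance is what separates the high-confidence interval (left of the dashed line in Fig. 2, F-I) from transfers that invite uncertainty.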
End-to-end protein design for marketable biologics remains remarkably complex. Despite advances in end-to-end DL models for protein design, and with the exception of a few recent breakthroughs [133], bridging the gap from computational to marketable (i.e., using a single computational system to generate devices that can be put on pharmacy shelves) remains difficult. This is unfortunate, as proteins have the potential to tackle many emerging biomedical and environmental challenges [134].
Two interesting aspects of DL solutions are their end-to-end nature and their ability to combine different losses, i.e., to condition the output (e.g. sequence) on the input (e.g. structure) through a mathematical formulation that accounts for properties (e.g. thermostability) relevant to the application context (e.g. biopharma). DL models for protein design are optimized end-to-end on large biological collections, which are often limited by experimental shortcomings, for instance capturing a static notion of protein 3D structure as opposed to its dynamic nature (e.g., conformational changes during protein-protein interactions).
Thus, while DL models for protein design may encode the biological mechanisms required to propose viable biologics, they cannot yet be used blindly without experimental validation, and they do not yet explicitly account for marketable variables. Similarly, as protein function lacks a rigorous definition (e.g., it may denote binding or localization, or both) and scale [135], designing for protein function remains more challenging than designing for sequence or structure, and selecting for function in production is often better validated through targeted functional assays.
An opportunity to connect in-silico with in-vitro approaches, where physical experiments accounting for all variables inform DL models in a differentiable manner, may come from advancements in automated labs [136,137] and the new frontier of cloud labs [138]. However, a bottleneck in combining DL models with physical experimentation is throughput [135], as digital systems often vastly outpace their physical counterparts in efficiency and scale. Here, digital copies of physical labs (digital twins [139]), which have already proven useful in boosting manufacturing [140,141], provide an opportunity for the future. A DL solution aiming to output marketable devices should also account for non-biological variables, such as those influencing production costs or clinical response. The data necessary to inform such models is, however, still lacking, although improvements are underway, for example in clinical trial design [142].
[Fig. 3 caption fragment: ...), other biosynthetic enzymes (from UniProt), and generated sequences (panel A). On the right, the structural superposition of the chosen protein (in panels A-D, predicted by ColabFold [106]) with the closest biosynthetic protein from UniProt (accession: D7UTD0, obtained from the AlphaFold database [13]; panel B), the closest BGC protein from MiBIG (UniProt: D2IKP3, obtained from the AlphaFold database; MiBIG: BGC0000304|9; panel C), and the closest protein structure from the AlphaFold database found via FoldSeek [116] (id: AF-Q00674; panel D).]

Looking forward: protein design community on a rigorous journey to establish goals. In the first half of the 1990s, at a time when sporadic claims of having solved the protein folding challenge made headlines, the Critical Assessment of Structure Prediction (CASP) [32] set the standard against which in-silico methods for protein structure prediction needed to prove advancement. Combined with data standardization and a single data repository, this led to multiple revolutionary approaches, from early uses of single-frequency models (i.e., positional scoring matrices), to complex co-evolutionary representations through direct-coupling analysis [25], all the way to end-to-end DL solutions like AlphaFold 2 [10]. Arguably, the success of structure prediction stems from many factors, including technical advances, biological understanding and intuition, and increasingly complex and principled statistical methods. Fundamentally, however, structure prediction through CASP sets an example of how innovation can be fostered. As the protein design field moves to more complex DL approaches, benchmarks of the kind that promoted the success of structure prediction tools appear to be lacking. Arguably, this is due to the underlying complexity of defining, in principle, what protein design is supposed to be, especially as it moves from theoretical exercise to practical application. Is it sequence design? Is it structure design? Is it sequence and structure that culminate in function? In fact, function would most likely be the design goal from a practical standpoint. Attempts at measuring advancements in protein function prediction exist, e.g. CAFA [143] and CAGI [144]; however, they focus on scoring how well in-silico tools predict known function from sequence, rather than on their ability to infer proteins (sequences or structures) that perform a desired, sometimes non-naturally occurring function. Conversely, model developers score their tools with metrics like Natural Sequence Recovery (NSR), which validate a model's ability to link structure to sequence, but often not its ability to generate diverse sequences fitting a desired structure [56].
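As a concrete illustration, NSR simply counts the fraction of positions at which a designed sequence reproduces the native sequence of the target structure. The snippet below is a minimal sketch (assuming pre-aligned, equal-length sequences; the toy sequences are invented) and also shows why NSR says nothing about the diversity of designs: a verbatim copy of the native sequence scores perfectly.

```python
def sequence_recovery(designed: str, native: str) -> float:
    """NSR: fraction of positions where the designed sequence matches the
    native sequence (assumes equal-length, pre-aligned sequences)."""
    if len(designed) != len(native):
        raise ValueError("sequences must be aligned to equal length")
    return sum(d == n for d, n in zip(designed, native)) / len(native)

native = "MKTAYIAKQR"  # invented toy sequence
print(sequence_recovery("MKTAYIAKQR", native))  # 1.0: perfect, zero novelty
print(sequence_recovery("MKTAVIAKNR", native))  # 0.8: two substitutions
```

A benchmark rewarding only this metric would favor models that memorize natives over models that propose diverse sequences folding into the same structure.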
A recent push toward benchmarks scoring models' ability to engineer proteins [99,145-148] highlights three aspects that protein design tools should strive to address: 1) emulate laboratory conditions, i.e., extrapolate from very little available data; 2) set multiple function-generalization goals, i.e., measure different aspects of function, with the intent of finding an optimal solution rather than maximizing any one metric; and 3) focus on out-of-distribution design, i.e., design proteins that achieve functions not observed in nature.
Ultimately, once the challenges of measuring advances are overcome, deep learning is set to enable protein engineers to design sequence, structure, and function with controllable properties.
Availability

pLM-generated and UniRef50-sampled sequence sets and predictions are available at http://data.bioembeddings.com/public/design. Code-base and notebooks for analysis are available at https://github.com/hefeda/PGP. An online version of Table 1 can be found at https://github.com/hefeda/design_tools.

Declaration of Competing Interest
The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: CD was employed by VantAI and NVIDIA at different periods during the time of writing. MA, AG, and LN are employees of VantAI. NVIDIA and VantAI had no influence on the contents of this manuscript.