Beyond sequence: Structure-based machine learning

Recent breakthroughs in protein structure prediction mark the start of a new era in structural bioinformatics. Combined with various advances in experimental structure determination and the uninterrupted pace at which new structures are published, this promises an age in which protein structure information is as prevalent and ubiquitous as sequence. Machine learning in protein bioinformatics has been dominated by sequence-based methods, but this is now changing to make use of the deluge of rich structural information as input. Machine learning methods making use of structures are scattered across the literature and cover a number of different applications and scopes; while some try to address questions and tasks within a single protein family, others aim to capture characteristics across all available proteins. In this review, we look at the variety of structure-based machine learning approaches, how structures can be used as input, and typical applications of these approaches in protein biology. We also discuss current challenges and opportunities in this important and increasingly popular field.


Introduction
Protein bioinformatics is a thriving and fast-growing field dealing with algorithms and data structures to explore, analyse and compare (groups of) proteins in order to better understand their various biological, physicochemical and molecular properties and functions. With the increase in protein sequence data obtained from large-scale high-throughput sequencing technology, machine learning (ML) has become a key methodology in protein bioinformatics. In protein structure prediction, be it secondary structure, backbone angles, contacts, folds, or full-atom structure, ML has become indispensable and forms the basis of a number of popular tools and algorithms. ML has also successfully been applied to predict protein function, protein-protein interactions, drug-target binding, enzyme substrate specificity, thermostability, catalytic rates, binding affinity, variant and mutant effects and more. ML is data-driven and attempts to identify patterns in existing data to predict properties of new, unseen data. Given ML's requirement of large amounts of diverse data, the overwhelming majority of ML applications on proteins use sequences as input, some of which are powering different aspects of popular resources such as Ensembl [1], Pfam [2] and UniProt [3]. However, numerous protein families have divergent protein sequences yet share highly similar three-dimensional structures, topologies and folds, since structure tends to evolve slower than sequence [4]. Furthermore, protein tertiary structure typically provides a wealth of information not found in sequence: spatial topology, residue interactions, solvent accessibility, residue dynamics and electrostatics, and more.
Historically, structural biology has depended primarily on experimental structure determination methods including X-ray crystallography, nuclear magnetic resonance (NMR), small-angle scattering, and cryo-electron microscopy (cryo-EM). The Protein Data Bank (PDB) [5], established in 1971, stores these experimentally determined structures and its size has been steadily increasing over the years. At the time of writing the PDB consists of 195,325 structures and grows by an average of 13,723 structures a year (calculated over 2017-2021). However, these numbers pale in comparison to the growing deluge of protein sequence data, with the UniProt protein database containing 226,771,949 sequence entries at the time of writing, over 771,752 more than the previous release on its 8-week release cycle. This phenomenon is often referred to as the sequence-structure knowledge gap [6]. Fortunately, experimental approaches are not the only way to obtain structural information, and computational structure prediction techniques are fast closing this gap. A protein's structure can be modelled from its sequence either using the experimental structures of one or more homologous proteins (template-based, comparative or homology modelling), or using de novo prediction techniques (template-free or de novo modelling). Given that homology modelling performs well when using templates with > 30% sequence identity to the protein of interest, accurate structural models can be obtained for over 60% of the genes in the top 12 most accessed genomes on UniProt [7,8]. Template-free modelling, on the other hand, does not rely on global similarity to a known structure and hence can be applied to proteins with rarer folds. A recent breakthrough, the highly accurate deep-learning-based AlphaFold2 model from DeepMind [9], trained on experimental structures to predict the structure for an input sequence, has allowed structural modelling to achieve, in many cases, accuracy and resolution comparable to the best experimentally resolved structures. In collaboration with EBI, DeepMind has released the AlphaFold Protein Structure Database [10], currently containing over 200 million structural models. This increases high-quality structural coverage by an average of 25% compared to homology modelling across 11 proteomes [11], reaching over 76% for the human genome and reducing the fraction of the human "dark proteome" from 26% to 10% [12]. Thus, we can theoretically obtain high-resolution protein structural information for a large number of available protein sequences. In addition, computationally predicted models can help better resolve experimental structures [13][14][15].
With these advances, we are at the brink of a structural revolution with millions of newly modelled structures at our disposal. Thus ML applications in protein bioinformatics, already shown to be very powerful in shedding light on biological problems, now have a wealth of structural information to exploit as input instead of, or along with, the typically used protein sequences. These sequence- and structure-based ML methods (hereafter referred to as "structure-based") can greatly outperform purely sequence-based approaches, as demonstrated in studies where the same ML architecture is evaluated using only sequence versus both sequence and structure information [16][17][18][19], though sometimes data biases have prevented useful training of structure-based methods [20]. The past years have already seen movement in the direction of protein structure-based ML, and its role is sure to increase drastically in future research. In this review, we describe the space of machine learning on protein structures in terms of the kinds of tasks that structures can help solve and the kinds of algorithms applicable to these tasks. We outline the various structural features and representations currently obtainable. Finally, we look at open problems and challenges, as well as promising opportunities in this exciting field.

Machine learning in the protein field
Machine learning (ML) is defined as "the study of computer algorithms that improve automatically through experience and by the use of data" [21]. Typically, these algorithms find patterns in datasets and link such patterns to specific outcomes or groupings. Deep learning (DL) is a sub-field of ML which uses artificial neural networks with multiple stacked layers of network connections, enabling the learning of increasingly complex information from much larger amounts of data than the more "classical" ML approaches. In this work, we use ML to refer to both DL and classical ML.
Supervised ML attempts to predict a certain response by learning patterns from labelled data. In the case of classification, this response is the membership of the data point in a particular grouping or class. Regression, on the other hand, predicts a real-valued numeric outcome. Unsupervised ML attempts to find clusters or learn reduced representations from data without any labels. See [22] for an in-depth introduction to these topics.
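As a minimal, hedged illustration of these task types, the sketch below applies a classifier, a regressor, and a clustering algorithm to a synthetic stand-in for a protein feature matrix; all data and labels here are illustrative, not taken from a real study.

```python
# Minimal sketch of supervised vs. unsupervised ML on a synthetic
# protein feature matrix X (n_proteins x n_features); labels are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 32))          # stand-in for per-protein structural descriptors
y_class = rng.integers(0, 2, size=200)  # e.g. binds a ligand: yes/no (classification)
y_value = rng.normal(size=200)          # e.g. melting temperature (regression)

clf = RandomForestClassifier().fit(X, y_class)   # classification: discrete label
reg = RandomForestRegressor().fit(X, y_value)    # regression: real-valued outcome
clusters = KMeans(n_clusters=3, n_init=10).fit_predict(X)  # unsupervised grouping
```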
ML has been used widely across biology for decades, with reviews outlining its usage in the fields of omics [23], synthetic biology [24], biomedicine [25], and drug discovery [26]. In the context of proteins, ML approaches, both supervised and unsupervised, can broadly be divided into protein family based and protein universe based techniques. These two categories differ in the kinds of prediction problems they are applied to, the kinds of algorithms used, and the kinds of representations used as input.

Protein family based ML
Protein family based ML is used to predict properties of the members of individual protein families or sub-families, usually consisting of hundreds to thousands of experimentally characterised training proteins. Some of the questions in protein family supervised ML include specificity prediction of substrates, intermediates, products, and inhibitors; state prediction in the context of engineering thermostability, binding affinity and activity; and prediction of the effects of mutations. In many cases, such as the immensely diverse lipocalins [27] and the fast-evolving enzyme families involved in specialised metabolism [28], the sequence diversity within a family makes it impossible for sequence-based techniques to predict family properties. Even very similar sequences can have mutations in key structural regions resulting in completely different activities, which is easier to ascertain from structure than from sequence alone. In addition, insights from computational prediction methods which also use structure as input can better drive experimental studies, due to the generally higher accuracy of structure-based prediction, and better enable exploration of the protein family space with structural stability and activity taken into account. We give examples of supervised ML tasks for some well known protein families below.
The superfamily of G protein-coupled receptors (GPCRs) is the largest family of targets for approved drugs in modern drug discovery, and hence also a popular target for ML approaches to drive exploration and understanding. GPCRs play an essential role in physiological processes such as vision, olfaction, neuronal signal transmission, cell differentiation, pain, muscle contraction, and hormone secretion [29]. Recent ML studies on GPCRs have started incorporating structural information to improve prediction performance, and to derive biological insight into the residues and mechanisms involved. As commonly used ML models for structure, interaction and interface prediction are trained on soluble proteins, specialised GPCR-specific oligomerization and interface predictors were developed [30,31], able to handle their long transmembrane regions. Recent work even modified the existing AlphaFold2 algorithm to generate rarer GPCR conformations [32]. GPCRs often display high conformational flexibility and low thermostability, making their structural, biophysical, and biochemical characterisation in the laboratory challenging. Given that experimental identification of thermostabilizing mutations is very resource intensive and must be repeated for each individual receptor, computational prediction of GPCR mutant stability is a crucial task in this field [33]. Finally, GPCRs bind to a very diverse range of ligands, and ML is used to identify biologically active ligands and binding inhibitors, estimate affinity and other binding properties, and probe ligand-specific binding mechanisms [34].
Another important class of drug targets is the kinases [35], with over 500,000 publications, 20,000 patents, inhibition assays for the majority of the human kinome and 115,000 kinase inhibitors covering 20% of the kinome [36]. With over 7000 structures solved, covering 308 kinases across 8 groups and complexed with over 3000 unique ligands and inhibitors, structure-based ML approaches are widely used for addressing challenges within this superfamily. These include methods to predict inhibition [37] and binding affinity [38] in specific kinase families. Another common kinase challenge is predicting conformational change between the so-called active and inactive conformations [39,40]. For drug targets, predicting the effects of mutations of a single protein can also be considered a protein family ML task, as the inputs are still proteins sharing the same structural fold with key differences caused by changes in the sequence. PremPLI [41] uses features from modelled protein-ligand complexes to predict the effect of mutation on binding affinity to a number of inhibitors for a kinase cancer target.
In the field of natural products and specialised metabolism in plants, bacteria, and fungi, ML has slowly been gaining popularity over more traditional approaches involving similarity search or analysis of a few, closely related proteins. ML has been used for successful prediction of substrate [42,43] and product [44] specificity in various natural product enzyme families. In 2013, a structure-informed approach was used to engineer highly thermostable cytochrome P450s [19].
Though computationally predicted structures have been shown to be highly accurate at the backbone level, tasks such as the ones described above, which involve small-molecule binding, may need further family-specific processing and ML-based approaches to harness the structural information specifically related to ligand interaction. For example, [45] show that AlphaFold-predicted GPCR structures differ in crucial features such as domain assembly, ligand-binding pockets, and interface conformation, thus impeding their direct use in functional studies.
Unsupervised ML in the protein family space hosts a new subfield of structural bioinformatics, dubbed "comparative structuromics" by Mohammed AlQuraishi. This is concerned with tools, algorithms, and techniques to compare and contrast assorted datasets of protein structures to answer a variety of biological questions: the evolutionary relationships between structural orthologs, interaction networks and how they are affected by structural changes, folding and changes within different cellular contexts and organisms, and how structure and folding are coupled with different functional characteristics. Zebra3D [46] is an example of such a technique. It provides a systematic analysis of 3D protein structure alignments combined with the identification of subfamily-specific regions using unsupervised ML clustering algorithms; these regions represent patterns of local 3D structure that are similar within subfamilies but differ between them, and are thus likely to be associated with functional diversity and function-related conformational plasticity. The work of de Lima et al. [47] is another example of unsupervised protein family ML, concerned with the detection of subfamilies and simultaneous identification of differentiating residues. Clustering and dimensionality reduction techniques have been used to describe the conformational landscape of proteins and identify binding-induced conformational change [48,49].

Protein family ML tasks typically have to make do with a small number of data points. A wide range of algorithms are at our disposal for these tasks, including but not limited to k-nearest neighbours algorithms (k-NNs) [50], support vector machines (SVMs) [51], Gaussian processes [52], and ensemble methods such as Random Forests [53] and gradient boosting trees [54]. In addition, many approaches in this field aim to interpret prediction results to derive insights about underlying mechanisms and residues which may be important for function. Such predictions and insights obtained from protein family ML are often used to drive experimental research to explore and characterise novel, interesting or relevant proteins.
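As an illustrative sketch of such a family-level workflow, combining one of the algorithms listed above with a simple interpretation step, one might train a substrate-specificity classifier on an aligned per-position feature matrix and rank positions by importance; the data below is synthetic and only stands in for real family features and labels.

```python
# Hedged sketch of family-level substrate-specificity prediction and
# interpretation. The feature matrix is synthetic; in practice each row would
# hold per-position features from an aligned protein family.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_members, n_positions = 300, 120
X = rng.normal(size=(n_members, n_positions))   # stand-in for aligned structural features
y = rng.integers(0, 3, size=n_members)          # stand-in substrate classes

model = RandomForestClassifier(n_estimators=500, random_state=0)
print("Cross-validated accuracy:", cross_val_score(model, X, y, cv=5).mean())

# Feature importances highlight alignment positions that may determine
# specificity and can guide follow-up mutagenesis experiments.
model.fit(X, y)
top_positions = np.argsort(model.feature_importances_)[::-1][:10]
print("Candidate specificity-determining positions:", top_positions)
```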

Protein universe based ML
The larger-scale protein universe based ML typically uses tens of thousands of proteins from diverse superfamilies to learn global properties of proteins, such as secondary and tertiary structure and folding, interactions, disorder, broad function classes etc. DL is a common choice for such problems, as it is known to drastically outperform other techniques in the presence of large amounts of data. In fact, protein structure prediction is in itself a protein universe task in which the use of DL has in many cases eclipsed other ML or statistical methods. This is true for prediction of secondary structure, solvent accessibility [55], backbone torsion angles [56,57], residue-residue contacts or distance matrices from co-evolution [58][59][60][61][62], and for de novo all-atom structure modelling. Indeed, all the top-performing Critical Assessment of Structure Prediction (CASP13 [63], CASP14 [64]) methods for de novo modelling rely on deep convolutional neural networks for predicting residue contacts or distances, predicting backbone torsion angles and/or ranking the final models. For recent reviews of the underlying techniques used, including those in AlphaFold2 and related approaches, see [65,66].
With the availability of protein structures, a number of additional tasks can make use of structure-based ML instead of sequence. These are listed in Table 1, grouped by the kinds of inputs used. Recent examples as well as common datasets used to validate and benchmark novel algorithms created for each task are also listed.
In the 2020 CASP14 competition, the breakthrough results of AlphaFold2 prompted a press release declaring the protein structure problem for single protein chains solved [64]. This emphasis on "single protein chains" revealed the new frontier for structural bioinformatics: complex structures are yet to be successfully predicted at the same breakthrough levels. Thus the related yet distinct tasks of predicting whether two proteins interact, and predicting the interface of two interacting proteins, are common protein universe problems with a number of solutions, based on docking [87,104], templates [105], end-to-end learning [84] and, most recently, protein complex prediction approaches building upon AlphaFold2 [128][129][130]. The latter generation combines the AlphaFold2 DL architecture with a modified paired MSA generation approach which encapsulates co-evolutionary information across the subunits of the desired complex. This yielded success rates for complex prediction up to double those of previous template-based and docking methods, marking significant progress in the field. However, these success rates are still only around 50% and vary drastically across species, protein families, types of complexes, and stoichiometries considered [129,131]. Similarly, the popular de novo protein structure prediction algorithm RoseTTAFold has been extended to the prediction of nucleic acid and protein-nucleic acid complexes [132], though again only around half of the tested complexes could be successfully modelled.
Structure-based drug discovery also hosts some significant applications of protein universe ML [133], starting from the computational modelling of putative receptor targets. Subsequently, binding sites in the target structure and putative drug candidates are identified using cavity/pocket prediction techniques [76], prediction of "druggable" regions, and protein-ligand binding site prediction [134]. This is typically followed by molecular docking to evaluate protein-ligand interaction and affinity between the target and a variety of drug candidates. In the case of unknown target proteins, or to identify off-target binding candidates, reverse/inverse docking [135][136][137][138] is used to create embeddings of drugs and search across protein structure databases for good docking solutions. In these contexts, ML approaches are used to improve scoring functions for binding affinity and plausible docking poses [81,116,121,138,139]. Indeed, [140] show that computationally predicted structures perform on par with experimental structures at reverse docking tasks, although the docking and scoring methods themselves could use major improvements to further drug discovery and design.
Predicting the effects of variants and mutations, especially those involved in diseases, is another common task. Sen et al. [141] took advantage of the latest de novo structure prediction techniques to model human disease-associated proteins, many of which do not have existing structures or even close homologues. They then compared disease-associated mutations to ligand binding sites, protein-protein interfaces and conserved regions predicted from the models, in order to provide a rationale for most of the mutations. However, the current DL-based structure predictors are not yet able to successfully predict the structural effects of mutations, as their training procedure is designed to be robust to small changes in sequence. This has been practically demonstrated in studies aiming to predict stability effects of mutations using predicted structures [142,143], and it indicates an under-explored area of structure prediction.
Approaches building upon AlphaFold2 and its underlying architectures have been used successfully in design tasks [144][145][146][147], indicating that the AlphaFold2 breakthrough may also cause a leap in protein design prediction. The process of constructing idealised folds during protein design can reveal new information about the physical and structural constraints that dictate which conformations a protein can adopt [148,149]. Such insights could be of vital importance to solving fundamental biological questions behind the evolution of proteins, as well as for further improvement of protein engineering and design [150]. See [151] for a recent review of DL approaches in the protein design field.
Intrinsically disordered proteins (IDPs) lack a fixed or ordered three-dimensional structure. This widespread phenomenon, thought to occur in over 33% of eukaryotic proteins, has been linked with allosteric regulation, enzyme catalysis, and a variety of diseases [152]. While structure-based prediction of intrinsic disorder may seem contradictory, energy scores obtained from existing structures [100] as well as residue-level computational modelling scores [11,101] contain information correlating with disorder and are effective for prediction. Structure-based ML has also been used to sample the very diverse conformational ensembles of IDPs [153].
Unsupervised techniques in the protein universe support tasks such as structure query and retrieval, clustering for motif and hotspot discovery, and structure-based fold annotation. For the former task, an array of fast techniques has been developed that allows near-instant retrieval of structures matching an input structure [154][155][156][157][158]. Recent approaches for structure-based clustering allow pinpointing novel or rare folds [11,159], as well as residues and structural regions associated with function [160]. Another common task is the generation of fixed-dimensional unsupervised embeddings which capture global and local protein characteristics. These can be used in downstream ML algorithms, as discussed in the next section.

Computational representations of protein structures
Protein structures contain interconnected high-dimensional information about the amino acids involved, their positions and relative orientations, and the varying physicochemical and electrostatic effects they have on each other. Fig. 1 shows an overview of the most common steps taken in structure-based ML. Once a set of structures with or without associated labels has been collected (Fig. 1A), the next step typically consists of choosing a format to represent this information that can be understood by computers (Fig. 1C). One way to do this is by explicitly extracting a set of attributes or features from proteins to create a tabular feature matrix. Another approach is to generate reduced fixed-dimensional protein representations, referred to as embeddings. Both these approaches (Fig. 1B) are followed by the use of ML algorithms that take the feature matrix or embedding as input and return various results (Fig. 1D) and insights (Fig. 1E) for user interpretation.

[Table 1 (fragment): "Protein binding affinity" - Input: Protein-protein complex, Protein + Protein; examples [112][113][114]; datasets: Affinity benchmark [91], SKEMPI2 [115]. "Ligand screening and binding affinity" - Input: Protein-ligand complex, Protein + Ligand; examples [38,79,116-124]; datasets: PDBBind [125], Binding MOAD [126], DUD-E [127]. Datasets from a truncated preceding row: [107], STRING [108], HPRD [109], BioGRID [110], HPIDB [111]. Footnote: The Input column describes the typical form of input given to the algorithms used. Multiple input format possibilities are comma-separated. All inputs refer to the structural context, i.e. "Protein" refers to the 3D protein structure, "Residue" to aspects associated with each individual residue (its physicochemical, electrostatic and geometric properties etc.; similarly for "Mutation"), and "Ligand" to the 2D and/or 3D structure of a small molecule ligand.]
A number of studies have demonstrated that high-confidence predicted structural models (both homology-based and DL-based) have predictive power and can even perform as well as experimental structures on specific tasks [11,16,33,161]. However, this is unlikely to hold in general, as it depends strongly on both the types of proteins and the task at hand. For example, membrane proteins, intrinsically disordered proteins, and proteins with high conformational flexibility would still benefit from experimental structures solved under different conditions to increase the diversity of structures available and thus our knowledge of them. In addition, side-chain modelling accuracy, crucial for tasks involving side-chain interactions, tends to lag behind main-chain accuracy. Finally, in a significant number of cases, AlphaFold2 and related approaches do not produce high-confidence structures. It was recently shown that while residues predicted by AlphaFold2 with high confidence (pLDDT > 90) have a very low prediction error (median 0.6 Å), this quickly increases to over 3 Å for low-confidence residues (pLDDT < 70) [162]. For such cases, where only low-confidence structure information is available, we may still have to fall back on sequence-based approaches or utilise the embedding techniques described in the section on learning protein embeddings below.
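In practice, per-residue confidence can be read directly from AlphaFold model files, which store pLDDT in the B-factor column; a minimal sketch with Biopython is shown below (the input file name is illustrative).

```python
# Sketch: keep only high-confidence residues (pLDDT > 90) from an AlphaFold model.
# AlphaFold PDB files store per-residue pLDDT in the B-factor column; the file
# name used here is hypothetical.
from Bio.PDB import PDBParser

structure = PDBParser(QUIET=True).get_structure("model", "AF-P12345-F1-model_v4.pdb")
confident = [
    res for res in structure.get_residues()
    if "CA" in res and res["CA"].get_bfactor() > 90.0
]
print(f"{len(confident)} residues above the pLDDT > 90 confidence threshold")
```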

Generating structure feature matrices
Broadly, protein structures are compared at the residue level, where features are extracted from each individual residue in the structure, or at a structural environment level, where features are extracted from well-defined portions of the structure (or the entire structure) containing relevant and localised properties. The former approach is commonly used in structurally conserved protein family ML tasks involving the entire protein, and the latter is used for more divergent proteins or for more specific tasks involving the corresponding structural environments. Both approaches use a range of techniques to align or arrange the extracted features into the fixed dimensional feature matrix format.

Residue level
Many different features can be extracted from each residue in a protein structure using a plethora of computational tools, as listed in Table 2.
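As a small example of residue-level feature extraction, the sketch below uses Biopython's DSSP wrapper, one of many possible tools, to collect per-residue secondary structure and relative solvent accessibility; the input file is hypothetical and the external dssp executable must be available.

```python
# Sketch: per-residue features (secondary structure, relative solvent accessibility)
# via Biopython's DSSP wrapper; requires the external mkdssp/dssp executable.
from Bio.PDB import PDBParser
from Bio.PDB.DSSP import DSSP

pdb_file = "protein.pdb"                        # hypothetical input structure
model = PDBParser(QUIET=True).get_structure("p", pdb_file)[0]
dssp = DSSP(model, pdb_file)                    # keyed by (chain_id, residue_id)

features = []
for key in dssp.keys():
    aa, ss, rel_asa = dssp[key][1], dssp[key][2], dssp[key][3]
    features.append((aa, ss, rel_asa))          # rows of a residue-level feature matrix
```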
When the proteins under consideration are evolutionarily closely related, multiple protein alignment is commonly used to generate the input feature matrix. While sequence alignment has generally been much more popular than structure alignment, the existence of protein families which share the same structural fold despite having little sequence similarity necessitates the use of structure-based alignment methods. This has driven the development of fast multiple structure aligners capable of scaling to the numbers of proteins required to train ML algorithms [178][179][180].
An alternative to the tabular format is a (dis)similarity matrix, often used as input to kernel-based methods such as SVMs or in unsupervised dimensionality reduction. For instance, de Lima et al. [47] calculate protein-protein similarity by combining similarities calculated from, among other features, structural alignment, alignment-free structural comparisons, putative active sites, and instability indices.

Structural environment level
Fig. 2 depicts some structural environments commonly used in computational representations. For tasks such as hotspot prediction or interface residue prediction, each input data point could be a single residue. In such situations, including aggregate features with weighted neighbour averages over the spatially nearest neighbouring residues, as shown in Fig. 2A, often improves the discriminatory power of predictors [181]. Some environment representations were born out of the ease of adapting approaches from other fields to protein structures: for example, viewing the three-dimensional coordinates of atoms in a structure as a 3D image grid (Fig. 2B) allows voxelization followed by the use of 3D convolutional neural networks, as often applied in the field of computer vision. Whereas in the case of images the red, green and blue values are often encoded as different channels, for proteins these channels have been used to encode different atom types [77,95]. Another approach, which can also take into account atomic density and radii, is the use of geometric tessellations to define a set of polyhedra around atoms or residues in a structure [182][183][184][185] (Fig. 2C).
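A minimal sketch of this voxelization idea is given below; the grid size, voxel spacing, and element-to-channel mapping are arbitrary illustrative choices, and the input file name is hypothetical.

```python
# Sketch: voxelise atoms into a 3D grid with one channel per atom element
# (analogous to RGB channels in images). Grid size and atom typing are
# illustrative choices, not a prescribed standard.
import numpy as np
from Bio.PDB import PDBParser

CHANNELS = {"C": 0, "N": 1, "O": 2, "S": 3}
GRID, VOXEL = 24, 1.0                                    # 24x24x24 grid, 1 Å voxels

structure = PDBParser(QUIET=True).get_structure("p", "protein.pdb")
atoms = [a for a in structure.get_atoms() if a.element in CHANNELS]
coords = np.array([a.coord for a in atoms])
coords -= coords.mean(axis=0)                            # centre the grid on the structure

grid = np.zeros((len(CHANNELS), GRID, GRID, GRID), dtype=np.float32)
for atom, xyz in zip(atoms, coords):
    i, j, k = ((xyz / VOXEL) + GRID // 2).astype(int)
    if 0 <= i < GRID and 0 <= j < GRID and 0 <= k < GRID:
        grid[CHANNELS[atom.element], i, j, k] += 1.0     # occupancy count per channel
```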
Representations of the molecular surface (Fig. 2D) are useful for tasks related to protein interactions and protein-solvent interactions. For example, MaSIF [86] depicts the surface as a series of overlapping radial patches with associated geometric features such as shape index and distance-dependent curvature, as well as chemical features such as hydropathy index, continuum electrostatics and the location of free electrons and proton donors. A geometric deep neural network is applied to these input features to spatially localise features and optimise them towards particular tasks. Other approaches have used 3D Zernike or similar descriptors of surfaces which are invariant to rotation, thus allowing structures and surfaces of different proteins to be compared [186][187][188]. In fact, one of the main problems to solve when representing entire protein structures is this rotational and translational invariance. Fig. 2E depicts one way to address this, namely by using a 2D residue-residue distance or contact map [189,190]. Another approach gaining popularity is the representation of a protein structure as a graph (Fig. 2F) with rotation and translation invariant properties attached to the nodes and/or edges [17,[191][192][193][194]. These graphs form the ideal input for geometric deep learning approaches and have the capacity to encode most of the information contained in the protein structure [195,196].
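Both of these invariant representations are straightforward to derive from atomic coordinates; the sketch below computes a Cα distance map, a binary contact map, and a contact-graph edge list (the 8 Å cutoff is a common but arbitrary choice, and the input file is hypothetical).

```python
# Sketch: rotation/translation-invariant representations from Cα coordinates:
# a residue-residue distance map and a contact-graph edge list.
import numpy as np
from Bio.PDB import PDBParser

structure = PDBParser(QUIET=True).get_structure("p", "protein.pdb")
ca = np.array([res["CA"].coord for res in structure.get_residues() if "CA" in res])

dist_map = np.linalg.norm(ca[:, None, :] - ca[None, :, :], axis=-1)  # (N, N) distances
contact_map = dist_map < 8.0                                          # binary contacts
edges = np.argwhere(np.triu(contact_map, k=1))                        # graph edge list
```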
Proteins often interact with other molecules (other proteins, peptides, nucleic acids and small molecule ligands), so computational representations of these binding regions or interfaces are necessary for a number of tasks. Graph- [122,197,198] and voxel-based [79,116,199] approaches can be used on experimentally solved or computationally docked protein-ligand complexes, usually by zooming in on the ligand binding pocket. In addition, there are specialised approaches to take into account explicit protein-ligand interactions within the ligand binding pocket in a complex [124,200]; see [201] for more examples of protein-ligand feature representations. In cases where data about the complex is absent but unbound structures are present, some approaches concatenate features of the individual entities as their representation [117,119,120].

Learning protein embeddings
A complementary approach to generate the tabular input required for ML is by using end-to-end or pre-trained embedding algorithms. These typically make use of unsupervised DL methods trained on a large dataset of proteins to produce a series of values representing a given protein in a fixed high-dimensional space, often without the need for explicitly handcrafted features. Due to the training process, these values place similar proteins closer together in this space thus capturing overall protein variation and relationships between individual proteins. For example, recent global sequence embeddings have been shown to capture amino acid characteristics and other physiological properties of proteins as a whole [202][203][204][205]. These have recently been extended to include structural information as well [206,207]. Unlike protein family ML, alignment is generally not an option in such techniques since most proteins used for training are evolutionarily remote, thus most described embedding techniques depend on learning alignment-free patterns across diverse proteins or on generating on-the-fly alignments of sub-groups of data during the learning process.
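The typical usage pattern, sketched below with a deliberately simple stand-in embedding function (in practice a pretrained sequence- or structure-based model would supply the vectors), is to map each protein to a fixed-length vector and train a lightweight downstream model on those vectors.

```python
# Sketch of the embed-then-predict pattern. embed() is a toy placeholder;
# in practice a pretrained sequence- or structure-based model would produce
# the fixed-length vectors.
import numpy as np
from sklearn.linear_model import LogisticRegression

def embed(sequence: str, dim: int = 64) -> np.ndarray:
    """Toy fixed-length embedding: amino-acid composition padded to `dim`."""
    vec = np.zeros(dim)
    for i, aa in enumerate("ACDEFGHIKLMNPQRSTVWY"):
        vec[i] = sequence.count(aa) / max(len(sequence), 1)
    return vec

# Hypothetical labelled data: sequences and a binary property of interest
sequences = ["MKTAYIAKQR", "MGSSHHHHHH", "MKKLLPTAAA", "MSTNPKPQRK"]
labels = [1, 0, 1, 0]

X = np.stack([embed(s) for s in sequences])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(X))
```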
End-to-end learning is popular in this area, covering techniques which start from the raw protein structure with minimal processing and automatically extract features by optimising prediction accuracy on a given end task; the intermediate feature representations or embeddings learned are thus more applicable to the task at hand and can be retrained to adapt better to different tasks. ContactLib-ATT [208] applies this concept to predict the SCOP (Structural Classification Of Proteins) classification of an input structure, using attention-based learning [209] on vectors of hydrogen bond properties extracted from the structure. SASNet [84] is an example of such an approach applied to interface prediction. Local atomic environments of each surface residue are voxelized and a 3D convolutional neural network is applied to the resulting grids of each pair of residues to learn their interaction propensity. Interestingly, this method was trained only on residues within bound structures of interacting partners and yet also performs exceedingly well on unbound counterparts, indicating that complex features beyond simple shape complementarity can be learned in this end-to-end fashion. dMaSIF [210], the successor to MaSIF (mentioned above), performs end-to-end learning of molecular surface representations directly from 3D point cloud data, optimised to each prediction task. Removing the reliance on handcrafted features improved the running time of dMaSIF by many orders of magnitude compared to MaSIF while maintaining and even improving accuracy. Recent DL approaches use the concept of "equivariance" (i.e. rotation and translation of coordinates does not affect the learning process) in sequence, graph-based, and diffusion architectures for end-to-end predictive and generative learning [211][212][213]. GeoPPI [113] is an unsupervised approach that operates on the graph of a protein complex and uses a message passing neural network to reconstruct the structure of a perturbed complex, i.e. one in which a random residue is modified. This enables learning of intrinsic binding interactions, optimal for the prediction of protein-protein binding affinity. An advantage of such "self"-supervised approaches is that they are not specific to a single task while still encoding global protein context; GeoPPI embeddings, for example, could easily be used as input for any prediction task. This kind of repurposing of unsupervised or pretrained embeddings is quite popular in the sequence world [214,215], and the same will likely hold true for structure-based ML in the future. Pretrained embeddings can also be used in a transfer learning context, where they are further fine-tuned to a more specific case of a general protein problem, such as the prediction of antibody-antigen interfaces from an embedding trained across all protein-protein interfaces [17].
Another interesting and relevant approach is structure-guided sequence embeddings [203,216,217]: these make use of structural information only in the training stage, while the input to the embedding algorithm from the perspective of the end user is just the sequence. This provides a compromise between the use of structure data, which may be computationally expensive to produce, and more easily accessible sequence data, while still making use of implicit structural information. Some recent work [194,218] has even made use of the intermediate representations generated by AlphaFold2 during the structure prediction process instead of, or along with, the predicted structure itself; these representations contain information about homologous sequences and structures, especially useful for predicting the effects of mutations or ligand binding, most of which is lost upon generation of the final structure.

Challenges and future directions
Despite rapid progress in the direction of structure-based ML, there are challenges to address before it can become as ubiquitously used as sequence-based ML. Just as there exists a wide variety of tools for answering questions from a sequence perspective, there need to be tools in structural bioinformatics that are as easy to use, as intuitive to interpret, as optimised, and as feature-rich.

Structure-based approaches are computationally expensive
The universal and widespread use of protein sequence data, combined with its one-dimensional nature, has resulted in a diverse landscape of highly optimised sequence-based tools and algorithms. Many of these, including clustering algorithms, aligners, feature extractors etc., scale to hundreds of thousands of sequences with ease. This cannot be said for structure-based approaches yet, both due to their relative newness and to structural data being much more complex than sequence data.
Often this resource intensiveness starts from the very first step, i.e. generating structural models. Template-based or homology modelling approaches take minutes to hours to generate a single model, often exacerbated by the need to infer multiple models for better robustness and by expensive additions such as loop modelling for special cases. Recent template-free methods such as AlphaFold2 and RoseTTAFold run in minutes, though they scale poorly with the number of residues and require GPUs and large amounts of memory and disk space. Memory and space requirements for both are somewhat alleviated by servers such as SWISS-MODEL [219] for template-based modelling and the recently released ColabFold [220] for template-free modelling, which allow running these resource-intensive modelling steps on shared external servers. In addition, the growth of the AlphaFold protein structure database [9] will eventually reduce the need for remodelling from scratch for a large number of sequenced proteins. Mutants, designed proteins and novel proteins will still need computational modelling, however, so speeding up the modelling process remains a relevant problem in the field. Recent approaches that use protein language model embeddings as input instead of calculating time-intensive multiple sequence alignments (MSAs) provide a step in this direction [221]. With the growth of exascale computing resources, modelling structural dynamics via molecular simulations is increasingly accessible, though there is a long way to go before this becomes commonplace.
Once a dataset of structures is gathered or generated, the next steps often involve structural comparison and feature extraction. Alignment-free structural comparison techniques are relatively fast already, but structural aligners that scale to the sizes of datasets required for ML have only recently started to appear. These are still a far cry from the highly optimised sequence aligners, but many of these optimisation techniques can be transferred to structure-based approaches and represent a logical next step as ML on structures grows in popularity. Extraction of many of the features detailed in Table 2 is time consuming as well. While some improvements can be made with parallelisation and making better use of modern hardware, this is unlikely to scale to hundreds of thousands of proteins in a similar timescale as sequence feature extraction.

End-to-end learning on structures
End-to-end learning, where a DL model learns a mathematical function to map an input to a complex output [222], with minimal handcrafting of intermediate features and tasks, was seen to be highly successful for the extremely complex task of mapping an input sequence to a 3D structure [66]. This has been followed by a boom in end-to-end learning approaches on protein sequences for function prediction, as well as on protein structures for generating designed protein sequences. See [223] for a recent review.
End-to-end learning is becoming popular for a number of tasks, as large models trained once on huge datasets of structures can then be reused for smaller sets of proteins and adapted to similar tasks with much less resource consumption and, at the same time, a great increase in performance even for sparse amounts of data [16,212,213,224,225]. In addition, these approaches can learn to make use of relevant intermediate information from proteins that may not be required or prioritised for the structure prediction task but is crucial for other downstream tasks: for example, residue masking in the AlphaFold2 learning procedure increases its robustness and improves overall structure prediction but makes it impossible to predict the structural changes caused by mutations, while much of this information is still present in the intermediate representations and useful for mutant effect prediction [218].
However, these learners do need huge initial training sets of diverse data and careful architecture engineering to avoid overfitting, as well as large amounts of computational resources for training and inference. In addition, results from such approaches are difficult to interpret in terms of which protein properties are being used to make certain decisions, an interpretability that more handcrafted ML techniques offer and that is useful for hypothesising about the underlying biology.

Dynamic representations of structure
Since proteins are inherently dynamic in nature, their true "structure" is much more than the rigid three-dimensional coordinates which serve as the basis for many of the approaches detailed in the previous sections. Instead, a protein is an ensemble of possible conformations, with some areas displaying more flexibility than others. This is further influenced by the constant interaction of proteins with the surrounding solvent, small molecules, nucleic acids, peptides and of course other proteins, all of which drive conformational changes within the protein. Protein biological activity often involves adopting specific conformations, contributions from local fluctuations, and even large-scale structural transitions between different conformations. In fact, the old paradigm that "sequence encodes structure, and structure determines function" can now be rephrased as "sequence encodes structure, structure determines dynamics, and dynamics encodes function" [226].
Protein flexibility and conformational diversity can be modelled in multiple ways. One of the most common approaches is molecular dynamics (MD) simulation, which calculates the force exerted on each atom by all other atoms as a function of time using a molecular mechanics force field [227]. However, MD simulations, which are already computationally extremely expensive, do not address covalent bond formation or breakage, both crucial in a number of enzyme families. This sometimes leads to the need for the even more expensive and challenging set-up of quantum mechanics/molecular mechanics (QM/MM) simulations [228]. Coarse-grained modelling with Monte Carlo simulations (CG-MC) and elastic network models (ENM, a.k.a. normal mode analysis) both provide simplified protein representations that still allow some aspects of protein flexibility to be understood while greatly reducing computational time [226,229]. Conformational heterogeneity can also be captured experimentally, for example in structures resolved by cryo-EM, whose numbers are growing fast.
Together, these computational techniques can provide information about globular protein flexibility and mutations [230,231], large-scale structural transitions (e.g. from active to inactive conformations) [232][233][234][235], and conformations involved in the formation of protein complexes [236]. They have also been used to assess and refine 3D models [237][238][239], improve ligand positioning [240,241], and create receptor ensembles for ensemble docking [242,243]. The faster and cruder CG-MC and ENM approaches can be combined with atomistic-level MD, providing efficient strategies and starting points for multiscale simulations of proteins and complexes [244]. While ML is becoming more prevalent in the MD and CG-MC fields, to construct force field models, model energy surfaces, and perform conformational sampling [245][246][247], future efforts will likely also utilise the flexibility information obtained from these techniques as input for ML-based predictors of protein function, with a few early examples already doing this in unsupervised [248,249] and supervised settings [250,251]. There is some evidence that this can improve over static structure-based prediction [252].
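As one hedged example of turning simulation output into ML input, the sketch below computes per-residue root-mean-square fluctuations (RMSF) from an MD trajectory with MDAnalysis; the file names are hypothetical and MDAnalysis is only one of several suitable libraries.

```python
# Sketch: per-residue flexibility (RMSF over an MD trajectory) as an input
# feature for downstream ML; file names are hypothetical.
import MDAnalysis as mda
from MDAnalysis.analysis import align
from MDAnalysis.analysis.rms import RMSF

u = mda.Universe("protein.pdb", "trajectory.xtc")
ca = u.select_atoms("name CA")
align.AlignTraj(u, u, select="name CA", in_memory=True).run()  # remove global motion
rmsf_per_residue = RMSF(ca).run().results.rmsf                 # one value per Cα
```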

Probing underlying protein mechanisms
A major limitation of DL-based structure prediction techniques, where prediction acts merely as an alternative to an experimental technique, is that they do not immediately provide us with a deeper understanding of the processes behind the folding of proteins as this is not their aim [253]. In contrast, many approaches using structural data to predict protein properties, especially those in protein family ML, have tried to make more explicit use of the rich feature sources provided to extract mechanistic insights and interpret the residues, causes and processes involved behind specific predictions, as well as guide experimental design in the most relevant directions.
Interpretable ML is a crucial concept in bioinformatics, as we are often as interested in the how and why of a prediction as in the what. Thus an important next step in structure-based ML is to couple predictions with an understanding of protein biology in terms of folding, interaction, function, and the interplay between the three. From a protein universe perspective, interpretation becomes dependent on the model inspection techniques specific to DL approaches. While this is a nascent field, techniques such as integrated gradients, saliency and class activation maps exist for this purpose, though they are rarely used yet in structure-based ML tasks [254]. Large-scale unsupervised techniques exploring the protein structural space can also help pinpoint folds, pockets, and interfaces upon which evolutionary and function-specific analyses can be conducted, and for which ML representations and techniques that lend themselves well to linking predictions to causes can be used. Most importantly, a tight coupling of computational prediction with experimental setup is required, creating a feedback loop that improves prediction and experimentally characterises relevant functional space.

A unified approach to function
Biological function is only partly determined by an individual protein -its genomic and cellular contexts also play a big role. Each protein is determined by an underlying gene sequence, but the mapping from gene to protein is not so straightforward, complicated by the existence of alternatively spliced transcript variants [255], pre-protein sequences in need of further processing [256], and moonlighting pseudoenzymes [257]. In addition, post-translational modifications, the developmental stage of an organism's life, their subcellular localisation and environment in the cell, and even the extra-cellular conditions all have an effect on protein expression and function [258]. More often than not, proteins also work in concert with a wide variety of other entities, ranging from metal ions and cofactors, water and other solvent molecules, small molecule ligands, peptides, nucleic acids, and other proteins.
One area of study focused on integrating these different contexts of proteins and their complex interactions is network biology. This field is crucial for the accurate modelling of biological systems, and given the influx of data from high-throughput interaction assays and large-scale multi-omics studies, a great target for ML and DL methods. The future holds an increasing number of opportunities for this combination of network biology and ML [259] -in understanding and fighting diseases by inspecting protein and gene interaction networks, in locating off-target effects of drugs and concocting valuable drug combination therapies based on chemical networks and multi-omics data from drug treatments [260], in understanding microbial interactions through metabolic networks, in finding biosynthetic gene clusters through gene neighbourhoods, transcriptomics, and expression profiling, and in designing synthetic gene circuits combining interconnected genes, promoters, and ribosome binding sites. Apart from a few examples [261], structural data has rarely been used in such large scale integrative approaches due to its scarcity and complexity. With the former being solved, the future holds promise in finding and using algorithms and approaches to link protein structures with all of their interlinked data in a unified approach to model function [262].

Conclusion
Protein structure is a central component to understanding biological processes, and thus a great addition to ML approaches in the protein bioinformatics field. In this review we described the space of structure-based ML in terms of the tasks it can be applied to, and the kinds of input representations and algorithms used with a number of examples demonstrating the powerful predictions that can be obtained. Mainly due to the recent breakthroughs in computational structure prediction, the field of structure-based ML is expanding very rapidly, with a high number of actively cited preprints in this review attesting to this. At the moment, sequence-based features, aligners, representations, and ML approaches still far outnumber structure-based ones and they are generally much faster as well. However, the power of structural information to improve computational prediction of protein biology is alluring, and the growth of structural databases, algorithms for alignment and representation, and increasing accessibility of relevant DL approaches and architectures will foster a new generation of protein bioinformatics in which structure will play a starring role.

Conflicts of Interest
We have no conflicts of interest to disclose.