A Review of Protein Structure Prediction using Deep Learning

Proteins are macromolecules composed of 20 types of amino acids in a specific order. Understanding how proteins fold is vital because a protein's 3-dimensional structure determines its function. Predicting protein structure from amino acid sequences and evolutionary information is the basis for other studies, such as predicting the function, properties or behaviour of a protein and modifying or designing new proteins to perform desired functions. Advances in machine learning, particularly deep learning, are igniting a paradigm shift in scientific study. In this review, we summarize recent work applying deep learning techniques to problems in protein structure prediction. We discuss the various deep learning approaches used to predict protein structure, along with future achievements and challenges. This review is expected to provide perspective on problems in biochemistry that can take advantage of the deep learning approach. Among the challenges still unanswered by current computational approaches are predicting the location and precise orientation of protein side chains, predicting how proteins interact with DNA, RNA and other small molecules, and predicting the structure of protein complexes.


Introduction
Proteins are an essential part of living things, triggering cells to perform different functions. Unlike DNA, which never changes, the set of proteins known as the proteome changes over time, allowing organisms to grow and develop. Only a few proteins work alone; most interact and form relationships with other proteins, and these interactions also change over time. Evolutionary relationships between proteins arise because organisms must maintain certain functions as they evolve. The analogy is like a social network between humans. With about 10 billion protein molecules, a cell has a complex network of proteins. This protein network determines the health of a cell, which in turn affects the health of the organism in which the cell is located [1].
Proteins are linear chains of amino acids linked by covalent bonds. Amino acids are encoded using a 25-character alphabet: 20 characters for the standard amino acids, 2 for the non-standard amino acids selenocysteine and pyrrolysine, 2 for ambiguous amino acids, and 1 for unknown amino acids [2], [3]. Apart from being encoded as a sequence, proteins also have a 3D molecular structure. The levels of protein structure are primary (the chain of amino acids), secondary (local features), and tertiary (global features). Proteins are usually composed of several large domains, whose sequences are evolutionarily conserved and which have well-defined folds and functions.
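As an illustration, the 25-character encoding can be sketched as a simple sequence validator; the letter assignments below follow the common IUPAC-style one-letter codes and are shown for illustration only:

```python
# A sketch of the 25-letter amino acid alphabet (IUPAC-style one-letter codes)
STANDARD = set("ACDEFGHIKLMNPQRSTVWY")   # 20 standard amino acids
NONSTANDARD = set("UO")                  # selenocysteine (U), pyrrolysine (O)
AMBIGUOUS = set("BZ")                    # B = Asn/Asp, Z = Gln/Glu
UNKNOWN = set("X")                       # unknown residue

ALPHABET = STANDARD | NONSTANDARD | AMBIGUOUS | UNKNOWN  # 25 characters total

def is_valid_sequence(seq: str) -> bool:
    """Check that a protein sequence uses only the 25-letter alphabet."""
    return all(residue in ALPHABET for residue in seq.upper())

print(len(ALPHABET))                     # 25
print(is_valid_sequence("MKTAYIAKQR"))   # True
```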
Knowledge of protein structure is fundamental to other research, such as understanding certain diseases, developing new catalysts, or developing drugs that are more effective and have fewer side effects. It provides an understanding of the function and workings of proteins, allowing researchers to influence, control, or modify them.

* Corresponding author: meredita.susanty@universitaspertamina.ac.id
One way to find the structure of a protein is the experimental approach: sequencing to determine the primary structure using mass spectrometry [1], or technologies such as X-ray diffraction crystallography and Nuclear Magnetic Resonance (NMR) spectroscopy [4] to determine the tertiary structure. However, finding the tertiary structure of proteins experimentally requires a large amount of money and time. Computational approaches are an alternative way to overcome this limitation.
Researchers established a biennial event known as the Critical Assessment of Protein Structure Prediction (CASP) to track the development of protein structure prediction. CASP is a competition between research groups trying to predict how proteins fold. The protein structures used in CASP have been measured experimentally but not yet published, so the competing participants do not know the 3D structure of the protein [5]. The participants' predictions are compared with the actual protein structure obtained from the experiments, and the similarity is measured using the Global Distance Test (GDT) metric, expressed as a percentage. The deep learning approach to protein structure prediction has made rapid progress over the last two CASP rounds [6]. In previous years, GDT values in CASP did not reach 40%, but in the last two rounds (CASP13 and CASP14), one group (DeepMind) achieved an extraordinary GDT of 60% in 2018 using a model known as AlphaFold. At CASP14, the group updated its model to AlphaFold2 and achieved a GDT of around 90%. However, there are areas where AlphaFold2 does not perform well, namely predicting protein complexes (oligomers), in which several protein chains interact. AlphaFold2 only predicts individual proteins, whereas many individual proteins combine into complexes to function. AlphaFold2 also cannot yet predict how proteins interact with DNA, RNA and small molecules, or determine the exact location of side chains.
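For intuition, the GDT_TS variant of the metric averages, over distance cutoffs of 1, 2, 4 and 8 Å, the percentage of residues whose predicted position lies within the cutoff of the experimental position. A minimal sketch, assuming the two structures are already optimally superposed (real GDT searches over superpositions):

```python
import numpy as np

def gdt_ts(pred: np.ndarray, ref: np.ndarray) -> float:
    """GDT_TS between predicted and reference C-alpha coordinates (N x 3),
    assuming the two structures are already optimally superposed."""
    dists = np.linalg.norm(pred - ref, axis=1)
    thresholds = (1.0, 2.0, 4.0, 8.0)
    fractions = [np.mean(dists <= t) for t in thresholds]
    return 100.0 * float(np.mean(fractions))

# Toy example: four residues displaced by 0.5, 1.5, 3.0 and 9.0 angstroms
ref = np.zeros((4, 3))
pred = ref + np.array([[0.5, 0, 0], [1.5, 0, 0], [3.0, 0, 0], [9.0, 0, 0]])
print(round(gdt_ts(pred, ref), 1))  # 56.2
```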

Protein Structure Prediction
In general, computational approaches fall into two broad categories: (1) based on physical principles and (2) based on evolutionary principles, as shown in Fig. 1. Table 1 shows the development of protein structure prediction using various machine learning and deep learning methods in both categories. The physics-based approach simulates the folding of the amino acid chain, either with molecular dynamics based on the potential energy of a force field over a given time, or with fragment assembly using an energy function to form an energetically stable 3D structure. However, molecular dynamics is only effective for small proteins, while fragment assembly achieves good accuracy only when protein-similarity information is available [7].
The approach based on evolutionary principles rests on the assumption that all living things came from a common ancestor and then evolved by adapting to the environment. During this adaptation, protein structure changes so that the protein can function optimally; however, the essential amino acids do not change. Only the amino acids supporting them change to suit the biophysical environment in which the protein must function, a phenomenon known as homology. This approach requires homologous sequence information, which makes it difficult to determine the structure of a completely new protein. It is also difficult to use it to investigate the impact of mutations on function.

Physics-based Approach
Physics-based approaches to protein folding typically involve designing energy functions that guide protein dynamics through the conformational landscape from the unfolded to the folded state. Various approaches to designing energy functions have been explored over the last few decades [8], [9], including first-principles atomistic force fields [10]-[16], later simplified through coarse-grained protein modelling [17], [18]. In this context, artificial neural networks can help design energy functions that account for multibody terms that are not easy to model analytically.
An unsupervised approach to predicting contacts between residues trains the model on protein sequences without any information about protein structure. The primary approach is to learn the evolutionary constraints among a set of similar protein sequences by fitting a Markov Random Field (Potts model) to the multiple sequence alignment (MSA) underlying a protein sequence, a technique known as Direct Coupling Analysis (DCA). Several studies have proposed deep neural networks to replace shallow Markov Random Fields (MRFs). Riesselman et al. trained an autoregressive model on MSAs but ignored the alignment, showing that protein function can be identified from unaligned sequences [19]. In contrast to Rao et al. [20], who used multiple MSAs, Riesselman et al. [19] used only a set of related sequences and did not use an end-to-end model to extract protein contacts.
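The DCA read-out can be sketched as follows: score each residue pair by the Frobenius norm of its Potts coupling block, then apply the average-product correction (APC) standard in the field. The coupling tensor below is synthetic, for illustration only:

```python
import numpy as np

def contact_scores(J: np.ndarray) -> np.ndarray:
    """Rank residue pairs from Potts-model couplings J (L x L x q x q):
    Frobenius norm of each coupling block, then the average-product
    correction (APC) used in DCA."""
    L = J.shape[0]
    F = np.linalg.norm(J.reshape(L, L, -1), axis=2)  # Frobenius norm per pair
    np.fill_diagonal(F, 0.0)
    row_mean = F.mean(axis=1, keepdims=True)
    apc = row_mean * row_mean.T / F.mean()           # average-product correction
    return F - apc

# Synthetic couplings for a length-6 protein over a 21-letter alphabet,
# with one strong coupling planted between residues 1 and 4
rng = np.random.default_rng(0)
L, q = 6, 21
J = rng.normal(size=(L, L, q, q)) * 0.01
J[1, 4] += 1.0
scores = contact_scores(J + np.transpose(J, (1, 0, 3, 2)))  # symmetrize
i, j = np.unravel_index(np.argmax(scores), scores.shape)
print(tuple(sorted((int(i), int(j)))))               # the planted pair: (1, 4)
```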
Another model uses a Long Short-Term Memory architecture, with amino acid sequences, PSSMs and torsion angles as input, to produce 3D structures [7]. The model consists of three stages, computational, geometric and evaluation, drawing on AlQuraishi's work [21] for the geometric stage.

Evolution-based Approach
Prediction of protein structure with supervised deep neural networks has produced breakthroughs [22]-[24]. Early supervised research made use of co-evolutionary features [22], [25]-[29]. Later, the Multiple Sequence Alignment (MSA) itself was used as input for supervised structure prediction: [30], [31] studied models that receive the MSA directly, processing it with 2D convolutions [30] or Gated Recurrent Units (GRUs) [31]. A recent state-of-the-art study, AlphaFold2 [23], uses attention to process the MSA in a supervised end-to-end model of protein structure.
Since the advent of large-scale language models for natural language processing [40], [41], this approach has been applied in other domains, including protein structure prediction [35]-[37], [39], [45], [46]. The sequence of amino acids that makes up a protein is treated like the sequence of words that makes up a sentence, where a word can relate both to its neighbours and to words farther away in the sentence.
Earlier research predicted protein contacts with language models in a supervised setting. Bepler and Berger combined unsupervised sequence pretraining with structural supervision to produce sequence embeddings, and were the first to fine-tune a pre-trained model with a Long Short-Term Memory (LSTM) architecture on protein sequences to predict contacts between residues [47]. [36], [39] showed that LSTM language models can capture biological properties.
The first study to investigate protein structure with a Transformer language model showed that information about contacts between residues can be recovered from the learned representation by a supervised linear projection [35]. Another study conducted an in-depth analysis of the self-attention mechanism in Transformers, identified its relationship to relevant biological features, and found that different layers in the model contribute to learning different features [48]. In particular, this study found a correlation between self-attention maps and contact patterns in proteins. Alley et al. and Heinzinger et al. compared various protein language models using a deep residual network [35], [49]. Rao et al. [37] compared the Potts model (trained on individual MSAs) and Transformers (trained on a large sequence database). The results show that just as the Potts model represents contacts directly through its pairwise components (weights), Transformers also represent contacts directly via their pairwise components (self-attention). This study also shows the relationship between model performance, MSA depth and language model perplexity. Bhattacharya et al. [45] likewise showed that a single layer of self-attention can perform essentially the same computation as the Potts model. Building on these two studies, the Transformer architecture has been used to improve performance and perform sampling while preserving contacts [20].
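The correlation between self-attention maps and contact patterns is often exploited by a simple read-out: symmetrize each head's attention map, APC-correct it, and combine heads. The sketch below uses an unweighted average over heads and a synthetic attention tensor; supervised probes in the literature learn per-head weights instead:

```python
import numpy as np

def apc(M: np.ndarray) -> np.ndarray:
    """Average-product correction, as used for contact maps."""
    row = M.mean(axis=1, keepdims=True)
    return M - row * row.T / M.mean()

def attention_to_contacts(A: np.ndarray) -> np.ndarray:
    """Turn per-head attention maps A (heads x L x L) into one contact map:
    symmetrize and APC-correct each head, then average over heads."""
    sym = A + np.transpose(A, (0, 2, 1))       # symmetrize each head
    corrected = np.stack([apc(h) for h in sym])
    return corrected.mean(axis=0)

# Synthetic attention for 4 heads over an 8-residue sequence, with all
# heads attending strongly from position 2 to position 6
rng = np.random.default_rng(1)
H, L = 4, 8
A = rng.random((H, L, L)) * 0.05
A[:, 2, 6] += 1.0
C = attention_to_contacts(A)
i, j = np.unravel_index(np.argmax(C), C.shape)
print(tuple(sorted((int(i), int(j)))))         # (2, 6)
```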
Several other studies have explored alternatives to masked language modelling, such as conditional generation [38], contrastive loss functions [50], and sets of sequences for supervision [51], [52]. Sturmfels et al. extended unsupervised language modelling to predict position-specific scoring matrix (PSSM) profiles [51]. Sercu et al. used amortized optimization to predict profiles and pairwise couplings simultaneously [52].
Recently, deep learning has been applied to model MSAs. Heinzinger et al. showed that the factors learned by a Variational Autoencoder (VAE) correlate with protein structure [39]. Smith et al. used Potts-model features obtained by pseudolikelihood maximization to predict pairwise distances with a deep residual network, optimizing the final structure using Rosetta [14], [53].

Challenges
Deep learning approaches enable fast and accurate protein structure prediction, producing an individual protein structure in a couple of days, compared to an experimental approach that usually takes months or years. However, several things that can be determined experimentally have not yet been achieved computationally.

Shallow MSA
Deep learning models are trained on publicly available data consisting of hundreds of thousands of experimentally determined protein structures. The evolution-based approach uses an MSA as input. Structure prediction from a very shallow MSA, let alone a single amino acid sequence, remains a fundamental challenge.

Model Interpretation
A neural network is a flexible and powerful regression model. However, because of their highly complex structure, neural networks are frequently referred to as "black boxes": the resulting parameters and functions are too intricate for practitioners to comprehend. Although Vig et al. give reliable interpretations of the Transformer architecture, particularly BERT, current deep learning models offer only a limited understanding of the complex patterns they learn.
Reversing the process and working backwards to understand what information the model relied on would be valuable, as it might reveal how the folding mechanism works within a cell; understanding that mechanism is one of the aspects of cracking the protein folding problem. If it could be understood, it might also be possible to design a new protein structure and use the reverse mechanism to find the corresponding original protein sequence.

Side Chain Location
Generally, the approach is to predict the protein backbone alone and then attach the side chains to the backbone structure. Side-chain conformations are predicted by conformational search or energy minimization [54]. Another approach uses Rosetta [53], optimizing the structure with a backbone-dependent rotamer library [55]. In both cases, side-chain placement depends on the predicted backbone structure.
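Rotamer-library placement can be caricatured as follows: for a given backbone geometry, pick the side-chain rotamer minimizing an energy-plus-library score. Everything below is a stand-in for illustration; real libraries tabulate backbone-dependent rotamer probabilities and use physical energy functions:

```python
import math

# Hypothetical chi1 rotamers (degrees) with made-up prior probabilities
ROTAMERS = {
    "gauche+": (60.0, 0.5),
    "trans":   (180.0, 0.35),
    "gauche-": (-60.0, 0.15),
}

def clash_penalty(chi1: float, phi: float, psi: float) -> float:
    """Stand-in steric term: penalize chi1 aligned with the backbone angles."""
    a = math.cos(math.radians(chi1 - phi)) ** 2
    b = math.cos(math.radians(chi1 - psi)) ** 2
    return 0.5 * (a + b)

def pick_rotamer(phi: float, psi: float) -> str:
    """Minimize (clash - log prior), a crude energy-plus-library score."""
    def score(item):
        name, (chi1, prob) = item
        return clash_penalty(chi1, phi, psi) - math.log(prob)
    return min(ROTAMERS.items(), key=score)[0]

print(pick_rotamer(-60.0, 140.0))  # gauche+
```

The key point the sketch preserves is the dependence on the backbone: change (phi, psi) and the chosen rotamer can change with it.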
Research that predicts only the side-chain structure generally uses a physics-based approach, especially energy functions. However, Liu et al. made predictions with a deep neural network architecture free of physics-based assumptions [56]. Several studies have attempted to combine backbone and side-chain prediction [28], [57]. Yang et al. include a representation of the inter-residue orientation of the beta carbons to predict structural features that also help locate side-chain atoms, but this does not provide a complete representation of the side-chain structure [28]. SidechainNet [57] is a new dataset that extends the ProteinNet dataset [58] with atomic angle and coordinate information describing all heavy atoms of each protein structure.

Protein Quaternary Structure
The current state of the art in protein structure prediction can accurately predict the tertiary structure of individual proteins. However, because many proteins act in a cell as a complex (a form of quaternary structure), knowing how one protein is sandwiched amongst other proteins, and having a full picture of the complete complex, provides far more information than that protein alone. While experimental approaches can determine protein complex structures, computational approaches have not yet achieved accurate predictions.

Protein Interaction
Proteins do not work alone. A protein may incorporate DNA in its structure, and how it interacts with DNA, RNA or other small molecules can currently only be determined experimentally. Existing deep learning models predict protein structure from a single amino acid sequence only.

Conclusion
We have discussed the current state-of-the-art deep learning techniques applied to protein structure prediction. A protein's structure gives us insight into its function, which is useful for understanding what cells are doing and what is going on within each of them. A protein structure provides a variety of benefits. First, it gives an understanding of interactions, for example with different drugs, which is very useful for drug discovery. Second, many diseases are caused by genetic mutations, which can alter the amino acids present in a protein; a predictive system helps researchers see how the protein changes when an amino acid is changed and thereby understand the disease mechanism. Third, in diseases such as Alzheimer's and diabetes, proteins are often seen to aggregate, and understanding the aggregation process can be greatly aided by having the structures of the proteins involved. However, recent achievements in protein structure prediction only predict the backbone structure of individual proteins. AlphaFold2 is one of the most significant advancements to date, yet, as with all scientific research, many questions remain. There are still many challenges to explore: how to predict structure from a shallow MSA, how numerous proteins form complexes, how proteins interact with DNA, RNA or small molecules, how to precisely locate all amino acid side chains, and how to reverse engineer the process.