Accurate prediction of functional effect of single amino acid variants with deep learning

The assessment of the functional effects of amino acid variants is a critical problem in proteomics for clinical medicine and protein engineering. Although naturally occurring variants offer insights into deleterious variants, high-throughput deep mutational scanning experiments enable comprehensive investigation of amino acid variants for a given protein. However, these mutational experiments are too expensive to dissect millions of variants across thousands of proteins. Thus, computational approaches have been proposed, but they rely heavily on hand-crafted evolutionary conservation features, limiting their accuracy. Recent advances in transformers provide a promising way to precisely estimate the functional effects of protein variants on high-throughput experimental data. Here, we introduce a novel deep learning model, Rep2Mut-V2, which leverages learned representations from transformer models. Rep2Mut-V2 significantly enhances prediction accuracy for 27 types of measurements of the functional effects of protein variants. In an evaluation of 38 protein datasets with 118,933 single amino acid variants, Rep2Mut-V2 achieved an average Spearman's correlation coefficient of 0.7. This surpasses the performance of six state-of-the-art methods, including the recently released ESM, DeepSequence and EVE. Even with limited training data, Rep2Mut-V2 outperforms ESM and DeepSequence, showing its potential to extend high-throughput experimental analysis to more protein variants and reduce experimental cost. In conclusion, Rep2Mut-V2 provides accurate predictions of the functional effects of single amino acid variants in protein coding sequences, and can significantly aid the interpretation of variants in human disease studies.


Introduction
Proteins play fundamental roles in carrying out diverse cellular functions. Their biological activities can be affected by numerous variants. Although most of these variants have negligible effects on protein function, a small fraction of native amino acid variants of human proteins are closely associated with human diseases [1]. Additionally, synthetically introduced variants are crucial in protein engineering to design proteins with specific characteristics. In both applications, accurately estimating the functional effects of millions of protein variants is a fundamental and challenging problem.
Thus, computational approaches have been proposed to overcome this limitation. These approaches can be classified into four groups. The first group, which includes PolyPhen-2 [29], SIFT [30] and SNAP2 [31], usually relies on the evolutionary conservation of homologous sequences [32], while the second group, which includes CADD [33], integrates diverse annotations to infer mutational effects. The third group, which includes EVmutation [34] and DeepSequence [35], considers epistatic couplings between amino acids [36,37]. DeepSequence also uses a variational autoencoder to detect latent features in sequences to predict a variant's functional effect. The fourth group, which includes transformer models such as Evolutionary Scale Modeling (ESM) [38,39], has recently been used to automatically capture higher-order hidden information behind the sequences. The transformer models were trained on millions of available protein sequences and provide a novel way to predict the mutational effects of protein variants without the need for hand-crafted features. However, few studies have investigated how learned features from transformers perform on high-throughput experimental datasets with various mutational measurements of the functional effects of protein variants.
Here, we propose a deep learning method, named Rep2Mut-V2, that uses protein sequences as the sole input to accurately predict 27 types of measurements of the mutational effects of protein variants. Rep2Mut-V2 is an improvement of our previous model Rep2Mut [40], which was designed to predict the transcriptional activity of the HIV Tat protein (GigaAssay [3]). In an assessment of 38 protein datasets, Rep2Mut-V2 demonstrated superior performance when compared to six existing methods. Rep2Mut-V2 exhibits great potential to assist the investigation of mutational effects for more proteins, aiding the interpretation of protein variants and human disease studies. Our tool is publicly available at https://github.com/qgenlab/Rep2Mut.

Datasets
A total of 38 protein datasets, comprising 118,933 single amino acid variants (Table 1), were used to evaluate our method and the state-of-the-art methods. Each of these datasets investigates one protein, and the number of variants per dataset ranges from 313 (YAP1) to 12,236 (BF520), with a median of 1,725, as shown in Table 1. Each dataset is also associated with a specific functional measurement, such as transcriptional activity, fitness, CRIPT, MIC score, etc. [35]. The 38 datasets encompass a total of 27 distinct functional measurements. Most of the datasets were generated by deep mutational scanning and collected by Riesselman et al. [35], while the HIV Tat data was generated by the GigaAssay [3].

Deep learning framework to predict functional effects of protein variants
Rep2Mut-V2 is a deep learning-based method to estimate various functional effects of protein variants. Rep2Mut-V2 uses a pair of protein sequences as input, i.e., a wildtype (WT) sequence and a mutated sequence with a substitution of an amino acid at the position of interest (Fig. 1). Rep2Mut-V2 feeds the WT and mutated sequences into ESM to learn the representation of the mutated position [38] (denoted as "ESM-f" to distinguish it from ESM prediction). ESM-f is composed of multiple transformer layers and is trained on millions of protein sequences with the masked language modeling objective. It endeavors to learn multiple levels of protein knowledge, such as biochemical properties and evolutionary information. ESM-f has several different releases; ESM-1v, used in Rep2Mut-V2, comprises 33 transformer layers trained on the UniRef90 dataset [41]. The 33rd layer of ESM-1v generates a 1,280-element vector which is used to represent either the WT or the variant information at the position of interest in the protein sequence. Each representation vector is then used as the input of a fully connected neural network layer that outputs a vector of 128 elements (Layers 1 and 2 in Fig. 1). After that, the two 128-dimension vectors are merged using an entry-wise product and then fed into Layer 3 to estimate the functional effect of a variant. The entry-wise product (or Hadamard product) takes two matrices of the same dimensions as inputs and generates another matrix of the same dimensions. For example, given two matrices A and B with m × n dimensions, the entry-wise product is (A ∘ B)_{i,j} = A_{i,j} × B_{i,j}, where 0 < i ≤ m and 0 < j ≤ n. Additionally, a PReLU activation function [42] and a dropout rate of 0.2 are applied to the fully connected layers to avoid overfitting.

Table 1
The 38 datasets for testing the model. Among them, fitness is used as the measurement for six datasets: BF520, BG505, P84126, POLG_HCVJF, TIM_SULSO, and TIM_THEMA. "#variants": the number of variants; "Seq length": the length of the protein sequence.
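To make the merge step concrete, the following minimal numpy sketch mimics the forward pass of the Rep2Mut-V2 head described above, using random stand-in weights. The names W1, W2, w3 and predict_effect are hypothetical, not from the released code, and dropout is omitted since it is inactive at inference time.

```python
import numpy as np

rng = np.random.default_rng(0)

def prelu(x, a=0.25):
    # PReLU activation: x if x > 0, else a * x (a is learnable in the real model)
    return np.where(x > 0, x, a * x)

# Hypothetical weights standing in for the trained Layers 1-3.
W1 = rng.normal(scale=0.01, size=(1280, 128))   # Layer 1: WT branch, 1280 -> 128
W2 = rng.normal(scale=0.01, size=(1280, 128))   # Layer 2: mutant branch, 1280 -> 128
w3 = rng.normal(scale=0.01, size=(128,))        # Layer 3: output head, 128 -> 1

def predict_effect(wt_repr, mut_repr):
    """Sketch of the Rep2Mut-V2 head on two 1,280-d ESM-f vectors."""
    h_wt = prelu(wt_repr @ W1)
    h_mut = prelu(mut_repr @ W2)
    merged = h_wt * h_mut               # entry-wise (Hadamard) product
    return float(merged @ w3)           # scalar functional-effect estimate

# Toy inputs standing in for ESM-1v layer-33 representation vectors.
wt = rng.normal(size=1280)
mut = rng.normal(size=1280)
score = predict_effect(wt, mut)
print(score)
```

The Hadamard merge forces the two branches to interact multiplicatively, so the head can pick up coordinate-wise differences between the WT and mutant representations.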

Training and testing Rep2Mut-V2
Three different strategies are used to evaluate Rep2Mut-V2. The first strategy is ten-fold cross-validation. With this strategy, Rep2Mut-V2 is assessed in two steps: a leave-one-dataset-out pretraining step and a cross-validation fine-tuning step. Initially, 36 of the 37 non-GigaAssay datasets are used to pretrain the deep learning framework to capture shared information across proteins. During this step, Layers 1 and 2 are shared across the datasets, and each dataset has its own specific Layer 3. The framework was pretrained for 10 epochs with a batch size of 256 and a learning rate of 1e-5. After that, the remaining dataset is randomly split into 10 groups, each containing ~10% of the variants. Each time, 90% of the variants of the dataset are used for fine-tuning, and the other 10% for testing. The fine-tuning process is conducted with a batch size of 8 and a learning rate of 1e-4 for Layer 3 and 5e-6 for Layers 1 and 2. Both the pretraining and fine-tuning processes use the Adam optimizer [43] and the MSE loss function in back-propagation. The MSE is defined in Eq. (1):

MSE = (1/n) Σ_{i=1}^{n} (Y_i − Ŷ_i)²   (1)

where n is the number of variants, Y_i are the observed activities, and Ŷ_i are the predicted activities.
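The loss of Eq. (1) can be sketched in plain Python (mse is an illustrative helper, not the actual training code):

```python
def mse(y_obs, y_pred):
    # Eq. (1): mean of squared differences over the n variants.
    n = len(y_obs)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_obs, y_pred)) / n

# Toy example: three variants with observed vs predicted activities.
print(mse([1.0, 0.5, 0.0], [0.8, 0.5, 0.4]))  # → ≈ 0.0667
```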
To compare Rep2Mut-V2 with other methods and to avoid the influence of random split of datasets, the fine-tuning process was repeated 50 times, and the final evaluation was based on the averaged performance.
The second strategy is few-shot learning. A few-shot learning model is trained on a small fraction of variants but can be used to accurately predict the functional effects of a wide array of variants. Our few-shot learning process is evaluated on the six datasets with the fitness measurement (Table 1), as the other measurements are available for fewer datasets. During few-shot learning, we use 5 of the 6 fitness datasets for pretraining, as described above. Then, 30% of the variants of the remaining dataset are used for fine-tuning, and the remaining 70% are used for testing. The testing is repeated five times on each of the 6 fitness datasets.
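Assuming the variants of a dataset are stored as a simple list, the 30%/70% fine-tune/test split can be sketched as follows (few_shot_split is a hypothetical helper, not part of the released tool):

```python
import random

def few_shot_split(variants, train_frac=0.3, seed=0):
    # Shuffle the variants, hold out train_frac of them for fine-tuning,
    # and keep the rest for testing (the few-shot protocol above).
    rng = random.Random(seed)
    shuffled = variants[:]
    rng.shuffle(shuffled)
    k = int(len(shuffled) * train_frac)
    return shuffled[:k], shuffled[k:]

train, test = few_shot_split(list(range(100)))
print(len(train), len(test))  # 30 70
```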
The third strategy is zero-shot learning, where a new dataset that is not used in the training process is used for testing. It allows the use of our model without further training. To evaluate zero-shot performance, we train the model using 5 of the 6 fitness datasets and test it on the 6th dataset. This process is repeated six times, once for each fitness dataset.
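The leave-one-dataset-out rotation over the six fitness datasets can be sketched as below (leave_one_out is a hypothetical helper; the dataset names are those listed in Table 1):

```python
def leave_one_out(datasets):
    # Zero-shot protocol: train on all but one dataset, test on the held-out one.
    for i, held_out in enumerate(datasets):
        train_sets = datasets[:i] + datasets[i + 1:]
        yield train_sets, held_out

names = ["BF520", "BG505", "P84126", "POLG_HCVJF", "TIM_SULSO", "TIM_THEMA"]
splits = list(leave_one_out(names))
print(len(splits))  # 6
```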

Estimate mutational effects of protein variants with state-of-the-art methods
We evaluated the performance of Rep2Mut-V2 against the six published methods described below.

ESM:
The first published method is ESM [38,39], a pretrained model to estimate a protein's activity and function. The ESM approach estimates the functional effects of protein variants through the following process. Given a protein sequence, ESM produces a representation vector for each position. An additional layer is then added to calculate a probability vector over all amino acid types for each position in the sequence [39]. After that, given a position of interest, the amino acid in the wildtype protein serves as a reference state and is compared to the mutated amino acid type. The variant effect is calculated as the logarithmic ratio of the probabilities of the mutated and WT amino acids [39], as shown below:

score = Σ_{t∈T} [log p(x_t = x_t^mt | x_\T) − log p(x_t = x_t^wt | x_\T)]

where T is the set of mutated positions for a variant, x_\T is the input sequence with the positions in T masked, x_t^mt and x_t^wt represent the mutant and wildtype amino acids at position t, and p(x_t = x_t^mt | x_\T) and p(x_t = x_t^wt | x_\T) are the probabilities assigned to them. ESM has two released versions (ESM-v1 and ESM-v2), each with several pretrained models. For ESM-v1, we employed two models in this evaluation: esm1v_t33_650M_UR90S_1 [39] (denoted as ESM-M2) when the protein sequence length is at most 1024 amino acids, and esm1_t34_670M_UR50S [38] (denoted as ESM-M1) for all datasets, owing to its ability to handle protein sequences longer than 1024 amino acids. ESM-2 [44] (denoted as ESM-v2) uses more parameters: 36 layers with up to 15 billion parameters, versus the 33 and 34 layers used in ESM-M2 and ESM-M1, respectively. The ESM-v2 model used is esm2_t36_3B_UR50D.

Fig. 1. The architecture of Rep2Mut-V2 to predict mutational effects from protein sequences. The model consists of 328,067 trainable parameters. "ESM" is used to generate representation vectors only; therefore, we denote it as "ESM-f" to distinguish it from ESM prediction.
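As an illustration, the log-ratio score above can be computed as follows. The probabilities here are toy values, not real ESM output, and esm_variant_score is a hypothetical helper:

```python
import math

def esm_variant_score(prob_vectors, mutations):
    """Log-ratio score: sum over mutated positions t of
    log p(mutant amino acid) - log p(wildtype amino acid)."""
    score = 0.0
    for pos, wt_aa, mt_aa in mutations:
        p = prob_vectors[pos]           # probability vector at this position
        score += math.log(p[mt_aa]) - math.log(p[wt_aa])
    return score

# Toy probability vector for position 5 over three amino acid types.
probs = {5: {"A": 0.6, "V": 0.3, "G": 0.1}}
print(esm_variant_score(probs, [(5, "A", "V")]))  # log(0.3) - log(0.6) ≈ -0.693
```

Negative scores indicate the model assigns the mutant a lower probability than the wildtype, i.e., a likely deleterious substitution.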
DeepSequence: DeepSequence [35] is a deep latent-variable model. It uses the concept of variational autoencoders (VAE) [45] to extract latent factors from a protein (or RNA) sequence and can capture higher-order correlations in biological sequence families. Although DeepSequence is a generative model, predicting a variant's effect for a protein sequence requires additional training. Given a sequence, we used a multiple sequence alignment (MSA) tool to generate MSA sequences. As recommended by DeepSequence, we used EVcouplings from the website v2.evcouplings.org with a bit score threshold of 0.5 bits/residue during MSA. These MSA sequences were then used to retrain DeepSequence for predicting mutational effects.
SIFT: Sorting Intolerant From Tolerant (SIFT) [30] is a classical tool to predict a variant's effect on protein function. It uses the substitution tolerance of a protein position to estimate the variant's effect. Given a sequence, SIFT collects a set of related sequences and aligns them against the target protein. It then calculates the degree of conservation of amino acids and uses this to estimate a score that specifies whether a variant is tolerated or deleterious.
CPT: Cross-protein transfer (CPT) [46] uses various features to predict a variant's effect. These features include scores from EVE and ESM-1v, MSAs, structural features from AlphaFold2, as well as amino acid descriptors such as charge, polarity, hydrophobicity, size, local flexibility, and so on. Based on these features, CPT uses a linear regression algorithm to train a model on five human proteins (CALM1, MTHR, SUMO1, UBC9, and TPK1). The model is mainly evaluated on human proteins for clinical variant interpretation. CPT's variant predictions for human proteins were downloaded and used for performance evaluation.
VariPred: VariPred [47] is another ESM-based approach to predict the pathogenicity of amino acid variants. Its manuscript was released on bioRxiv after our initial submission. It uses ESM to generate vector representations for predicting a variant's pathogenicity. Its prediction outcome is binary: 1 denotes pathogenic and 0 denotes not pathogenic. We also extracted VariPred's predictions for the human proteins in our 38 datasets and evaluated its performance.
EVE: The evolutionary model of variant effect (EVE) [48], like DeepSequence, utilizes evolutionary information to predict the clinical significance of human variants. It uses a multiple sequence alignment (MSA) as input to train a Bayesian variational autoencoder (VAE), and estimates an evolutionary index to distinguish variant sequences from wild-type sequences.

Evaluation measurements
We use Spearman's rank correlation coefficient (SRCC) to measure the performance of each tested method across the 38 datasets. For the implementation, we used the Python package scipy to calculate the SRCC between the predicted and experimental estimates. Specifically, let X and Y be the experimental and predicted estimates for a list of variants. SRCC is calculated using Eq. (3):

SRCC = cov(R(X), R(Y)) / (σ_R(X) · σ_R(Y))   (3)

where R(·) is the ranking of the items of its argument, cov(R(X), R(Y)) is the covariance of the rank variables, and σ_R(X) and σ_R(Y) are the standard deviations of the rank variables.
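For reference, Eq. (3) can be reproduced without scipy by taking the Pearson correlation of the two rank vectors; in practice, scipy.stats.spearmanr gives the same value. The rank and srcc functions below are illustrative helpers:

```python
def rank(xs):
    # Average ranks (ties share the mean rank), matching scipy's default.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1                       # extend over the tie group
        avg = (i + j) / 2 + 1            # 1-based average rank of the group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def srcc(x, y):
    """Eq. (3): Pearson correlation of the rank vectors R(X) and R(Y)."""
    rx, ry = rank(x), rank(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry)) / n
    sx = (sum((a - mx) ** 2 for a in rx) / n) ** 0.5
    sy = (sum((b - my) ** 2 for b in ry) / n) ** 0.5
    return cov / (sx * sy)

print(srcc([1, 2, 3, 4], [1, 3, 2, 4]))  # 0.8 for this example
```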

Evaluation of Rep2Mut-V2 under zero-shot and few-shot strategies
Rep2Mut-V2 was first evaluated under three strategies: zero-shot transfer learning, few-shot learning, and leave-one-position-out cross-validation. The evaluation was conducted on the six fitness datasets (detailed in Table 1), because the other measurements were not available for more datasets and different measurements may not be comparable.
The zero-shot learning model was tested on one dataset after being trained on the other fitness datasets. Through this leave-one-dataset-out cross-validation, the results in Table 2 reveal that Rep2Mut-V2 achieved better performance on three of the six datasets when compared with ESM, and on four of the six datasets when compared with DeepSequence.
Few-shot learning fine-tunes a model on a small fraction of the variants of a dataset and then tests the model on the other variants of that dataset, after pretraining on other datasets. In our evaluation, we pretrained our few-shot learning model on five of the six datasets and fine-tuned the model using 10% of the variants of the remaining testing dataset. The model was assessed on the other 90% of the variants of the testing dataset. The results, presented in Table 2, show that Rep2Mut-V2 outperforms DeepSequence and ESM on all six datasets.
We further assessed Rep2Mut-V2 using leave-one-position-out cross-validation, because mutated positions might be correlated with fitness measurements. In this cross-validation, training and testing data were split based on mutated positions rather than randomly. As presented in Table 2, Rep2Mut-V2 still generates better results than the state-of-the-art models on most datasets. Rep2Mut-V2 outperformed DeepSequence, ESM-1 and ESM-2 by average SRCC improvements of 0.269, 0.265 and 0.112, respectively.

Evaluation of Rep2Mut-V2 with ten-fold cross-validation
The performance of Rep2Mut-V2 on the 38 datasets under ten-fold cross-validation is presented in Table 3 and Fig. 2, together with the predictions made by SIFT, ESM and DeepSequence on the 38 datasets. The results generated by CPT and VariPred on human proteins are also provided in Table 3, considering that the developers of CPT and VariPred mainly assessed variant-effect predictions on human proteins.
When compared to SIFT and EVE, Rep2Mut-V2 exhibited a consistent trend of performance improvement, outperforming both SIFT and EVE.

Performance of Rep2Mut-V2 on smaller training data
The above assessment of Rep2Mut-V2 used 90% of each dataset for training and 10% for testing. However, it is generally expensive and time-consuming to generate more variant data through wet-lab experiments. Therefore, we tested Rep2Mut-V2's performance with fewer variants for training. In detail, we used 30% of the variants from a dataset for fine-tuning Rep2Mut-V2 and the remaining 70% of the variants for testing.
We compared this Rep2Mut-V2 model to DeepSequence and ESM.

Discussion
Rep2Mut-V2 was evaluated on 118,933 single amino acid variants from 38 protein datasets with 27 types of measurements of functional effects. The evaluation was conducted under various cross-validation strategies, and the performance of Rep2Mut-V2 was compared against six existing methods. The results consistently highlighted the superiority of Rep2Mut-V2 in accurately predicting the mutational effects of protein variants. Even using limited variants for training, Rep2Mut-V2 maintained superior performance over existing methods. Notably, Rep2Mut-V2 relies solely on protein sequences and does not require protein 3D structures for precise prediction. Given the availability of millions of protein sequences compared to the limited number of proteins with experimental 3D structures, Rep2Mut-V2 proves to be a highly valuable tool, especially for proteins with experimentally determined functional effects for only a small fraction of variants.
It is important to note that Rep2Mut-V2 uses representation vectors generated by an ESM framework as input, yet its prediction performance is higher than that of both ESM and DeepSequence. This success is partially attributed to the fact that ESM prediction relies solely on representation vectors of WT sequences, whereas our method uses representation vectors of both WT and mutant sequences as inputs. This design allows the model to learn the difference between WT and mutant vectors for accurate prediction. On the other hand, DeepSequence depends on evolutionary data generated from multiple sequence alignments to infer mutational effects. Its performance is thus limited by the availability of similar sequences to refine the model.

Our method has some limitations. First, it was designed to predict the functional effects of single amino acid variants and is not currently equipped to handle higher-order variants. We are presently extending our framework to predict the functional effects of double/triple variants, although there are limited datasets with higher-order variants. Second, we tested transfer learning with Rep2Mut-V2. However, transfer learning generally treats the contribution of each dataset equally, while the reliability of measuring the functional effects of variants differs across datasets. This uniform treatment may mislead transfer learning. To overcome this, a potential solution is to weigh the contribution of each dataset to the shared layers (in Fig. 1) based on the experimental reliability of the measurement of functional effects. Unfortunately, with only 38 experimental protein datasets, it is hard to arrive at a robust conclusion. As the functional effects of more protein variants are experimentally determined through high-throughput methods, transfer learning could effectively learn shared information across proteins and substantially enhance the prediction of variants' effects. Consequently, these two limitations could be overcome with the availability of more datasets.

Conclusion
In this study, we proposed and tested Rep2Mut-V2 across 38 protein datasets with various effect measurements. Our approach was compared with six existing methods, and the evaluation demonstrates that it achieves much better performance on most of the datasets.
By relying solely on protein sequences, our approach achieved accurate prediction of functional effects even with a limited number of variants for training. These observations strongly suggest that Rep2Mut-V2 has the potential to study mutational effects across a broader spectrum of proteins, thereby benefiting human disease studies.

As illustrated in Fig. 4, this Rep2Mut-V2 model still outperformed DeepSequence on 26 datasets and ESM on 29 datasets. Compared to Rep2Mut-V2 trained with 90% of the variants, the performance of this model decreased by 0.077 points on average. This robust performance clearly demonstrates Rep2Mut-V2's ability to generate accurate predictions for variant analysis, especially in cases with limited variants determined by wet-lab experiments. This offers a better solution for analyzing the functional effects of numerous variants, reducing the intensive financial and human resources required by wet-lab experiments.