Enhancing predictions of protein stability changes induced by single mutations using MSA-based language models

Abstract

Motivation: Protein language models offer a new perspective for addressing challenges in structural biology while relying solely on sequence information. Recent studies have investigated their effectiveness in forecasting shifts in thermodynamic stability caused by single amino acid mutations, a task known for its complexity due to the sparse availability of data, constrained by experimental limitations. To tackle this problem, we introduce two key novelties: leveraging a protein language model that incorporates Multiple Sequence Alignments to capture evolutionary information, and using a recently released mega-scale dataset with rigorous data preprocessing to mitigate overfitting.

Results: We ensure comprehensive comparisons by fine-tuning various pretrained models, taking advantage of analyses such as ablation studies and baseline evaluations. Our methodology introduces a stringent policy to reduce the widespread issue of data leakage, rigorously removing sequences from the training set when they exhibit significant similarity with the test set. The MSA Transformer emerges as the most accurate among the models under investigation, given its capability to leverage co-evolution signals encoded in aligned homologous sequences. Moreover, the optimized MSA Transformer outperforms existing methods and exhibits enhanced generalization power, leading to a notable improvement in predicting changes in protein stability resulting from point mutations.

Availability and implementation: Code and data at https://github.com/RitAreaSciencePark/PLM4Muts.


Tuning Hyperparameters
The Multi-Layer Perceptron block consists of one hidden layer of size 768, followed by a ReLU activation and a dropout layer (with the probability p of zeroing an element set to p = 0.2) that feeds into the final linear layer.

Figure 2: MSA depth (number of sequences) in the S155329 training set (2a) and in the S669 test set (2b).

We adopt a PyTorch implementation with the AdamW optimizer (torch.optim.AdamW with default parameters, apart from the learning rate) and the OneCycleLR learning-rate scheduler (torch.optim.lr_scheduler.OneCycleLR). The learning rate is selected for each model and training dataset by means of a grid search with lr ∈ {0.5 × 10⁻⁵, 10⁻⁵, 0.5 × 10⁻⁴, 10⁻⁴, 0.5 × 10⁻³, 10⁻³, 0.5 × 10⁻², 10⁻²}. To reduce the memory footprint, PyTorch Automatic Mixed Precision is enabled by means of torch.autocast (specifically with dtype=torch.float16) in combination with the gradient scaling function torch.cuda.amp.GradScaler(). The gradient norm is clipped using torch.nn.utils.clip_grad_norm_ with max_norm = 10 to prevent possible exploding-gradient problems.

We set a maximum of 20 training epochs in combination with an early stopping criterion: the model parameters associated with the epoch achieving the lowest MAE on the validation set are saved, and training stops if the validation MAE does not decrease for 5 consecutive epochs. For training with both the small and designed datasets, we validate on the ssym set, which overlaps with neither the training nor the test sequences (Figure 3a). When training with the large dataset, a customized validation set (S1030) is constructed by excluding from the small dataset the sequences that are similar to those in the large dataset (Figure 3b), following the criterion stated in the main text.

Using the PyTorch DistributedDataParallel class (torch.nn.parallel.DistributedDataParallel), the fine-tuning is performed on 16 nodes of the LEONARDO Booster partition at the CINECA National Supercomputing Center, each consisting of a single-socket 32-core Intel Xeon Platinum 8358 (2.60 GHz), 4 × NVIDIA A100 GPUs (64 GB HBM2e, NVLink 3.0), and 512 GB of DDR4 (3200 MHz) RAM. The internal network is an NVIDIA Mellanox InfiniBand HDR DragonFly+ at 200 Gbps. We use a batch size B = 1 per GPU, so the overall effective batch size is B_eff = 64. Finally, we load the saved model parameters for inference on the test sets on a single GPU.
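The regression head and optimization setup described above can be sketched as follows. This is a minimal single-GPU-free illustration, not the actual training script: the input embedding dimension (768), the learning rate, and steps_per_epoch are assumptions for demonstration (the real learning rate comes from the grid search, and steps_per_epoch depends on the dataset size); the AMP, gradient-scaling, and DistributedDataParallel machinery is omitted.

```python
import torch
import torch.nn as nn

class DDGHead(nn.Module):
    """MLP head as in the text: one hidden layer of size 768,
    ReLU, dropout with p = 0.2, then a final linear layer
    producing a single ddG value. The input dimension (768)
    is an assumption of this sketch."""
    def __init__(self, embed_dim=768, hidden_dim=768, p_drop=0.2):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, x):
        return self.mlp(x)

head = DDGHead()
# AdamW with default parameters apart from the learning rate
# (1e-4 here is one of the grid-search candidates, chosen arbitrarily).
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4)
# OneCycleLR needs the total schedule length; 20 epochs is the maximum
# used in the text, steps_per_epoch=100 is a placeholder.
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-4, epochs=20, steps_per_epoch=100
)

# One illustrative optimization step with batch size B = 1 per GPU.
x = torch.randn(1, 768)
y = head(x)
loss = nn.functional.mse_loss(y, torch.zeros_like(y))
loss.backward()
# Gradient-norm clipping with max_norm = 10, as in the text.
torch.nn.utils.clip_grad_norm_(head.parameters(), max_norm=10)
optimizer.step()
scheduler.step()
```

In the full setup, this step would run under torch.autocast with a GradScaler and be wrapped in DistributedDataParallel across the 64 GPUs, with early stopping driven by the validation MAE.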

Figure 3: Pairwise BLASTp alignments (% identity, e-value and alignment overlap). 3a: S669 test set against the ssym set. 3b: S155329 training set against the S1413 dataset, before (left) and after (right) filtering by similarity to obtain the S1030 dataset (used for validation in the S155329 training).
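The similarity filtering illustrated in Figure 3b can be sketched as a pass over blastp tabular hits (e.g. parsed from `-outfmt 6` output). The thresholds below are purely illustrative placeholders: the actual criterion is the one stated in the main text, and the hit tuples here are invented for demonstration.

```python
def filter_similar(hits, max_identity=25.0, max_evalue=1e-3):
    """Return the set of query (training) sequence IDs to drop:
    those with at least one hit whose percent identity reaches the
    threshold at a significant e-value.

    hits: iterable of (query_id, subject_id, pct_identity, evalue)
    tuples, as parsed from blastp tabular output.
    The thresholds are hypothetical, not the paper's criterion."""
    drop = set()
    for qid, _sid, identity, evalue in hits:
        if identity >= max_identity and evalue <= max_evalue:
            drop.add(qid)
    return drop

# Toy example: trainA is clearly similar to a test sequence, trainB is not.
hits = [
    ("trainA", "test1", 80.0, 1e-50),
    ("trainB", "test2", 18.0, 5e-2),
]
dropped = filter_similar(hits)  # {"trainA"}
```

Applying such a pass to the S1413 training-versus-test hits, and keeping only the surviving sequences, yields a leakage-reduced validation set in the spirit of S1030.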

Figure 4: Number of proteins vs. number of mutations (a-d) and ∆∆G distribution (e-h) for all datasets.

Table 1: MSAesm ddG model trained with the symmetrical, reverse and direct ∆∆G distributed S1413.

Table 2: MSAesm ddG fine-tuned and baseline, trained with different datasets.