mmCSM-NA: accurately predicting effects of single and multiple mutations on protein–nucleic acid binding affinity

Abstract While protein–nucleic acid interactions are pivotal for many crucial biological processes, limited experimental data has made the development of computational approaches to characterise these interactions a challenge. Consequently, most approaches to understand the effects of missense mutations on protein-nucleic acid affinity have focused on single-point mutations and have presented a limited performance on independent data sets. To overcome this, we have curated the largest dataset of experimentally measured effects of mutations on nucleic acid binding affinity to date, encompassing 856 single-point mutations and 141 multiple-point mutations across 155 experimentally solved complexes. This was used in combination with an optimized version of our graph-based signatures to develop mmCSM-NA (http://biosig.unimelb.edu.au/mmcsm_na), the first scalable method capable of quantitatively and accurately predicting the effects of multiple-point mutations on nucleic acid binding affinities. mmCSM-NA obtained a Pearson's correlation of up to 0.67 (RMSE of 1.06 Kcal/mol) on single-point mutations under cross-validation, and up to 0.65 on independent non-redundant datasets of multiple-point mutations (RMSE of 1.12 kcal/mol), outperforming similar tools. mmCSM-NA is freely available as an easy-to-use web-server and API. We believe it will be an invaluable tool to shed light on the role of mutations affecting protein–nucleic acid interactions in diseases.


INTRODUCTION
The interactions between proteins and nucleic acids play essential roles in crucial biological processes, from gene regulation (1) to replication and transcription (2,3), translation (4), DNA repair (5)(6)(7)(8) and DNA packaging (9,10). Missense mutations within nucleic acid binding proteins can lead to a range of diseases, including cancers (11,12), viral infections (13)(14)(15)(16)(17)(18) and neurodegenerative disorders (19). With the increasing speed and availability of high-throughput sequencing, there is a pressing need to be able to rapidly evaluate the molecular consequences of novel variants, however, traditional experimental approaches are time-consuming and laborious. Despite increasing interest in the ability to predict the effects of mutations on these protein-nucleic acid (protein-NA) interactions, it remains a significant challenge. In part, this is due to the limited availability of experimental data compared to other protein interactions. Consequently, most developed methods have focused on only the effects of single-point missense mutations and been found to be poorly generalisable (20)(21)(22).
We have previously shown that by representing protein structures and their interacting partners as a graph-based signature, it can be used to train predictive models capable of unravelling the link between genotype and phenotype by accurately predicting the effects of mutations on protein folding, stability (23)(24)(25), dynamics (26,27) and interactions (28)(29)(30)(31)(32). This approach enabled, for the first time, quantitative and scalable assessment of the effects of singlepoint missense mutations on protein-NA binding affinities (23,33).
Here, we have curated the largest data set with highquality experimentally characterised effects of mutations on protein-NA affinity to date and developed an optimised version of our graph-based signatures to train and test mmCSM-NA (Figure 1), the first method capable of accurately and scalably predicting the effects of multiple-point mutations on protein-NA binding affinity.

Data sets
Experimentally measured effects of mutations on protein-NA binding affinity were carefully curated, combined and manually corrected for errors from ProNIT 2.0 (34), dbAMERPNI (35) and PrPDH (21). ProNIT provides the effects of mutations as experimental thermodynamic values given in the form of the difference in Gibbs free energy of binding, G (in kcal/mol), or kinetic equilibrium constant, K D (in molar). dbAMERPNI comprises of >570 alanine scanning data points with experimental thermodynamic values. PrPDH is a data set consisting of 214 mutations of protein-DNA interactions with experimental G values. In total, our combined data set has 856 single-point mutations across 151 unique protein-NA complexes (484 neutral, 349 decreasing and 23 increasing affinities) and 141 multiple-point mutations on 25 protein-NA complexes (77 decreasing, 62 neutral and 2 increasing affinities). The target outcome for mmCSM-NA is the changes in Gibbs binding free energy (or G). G was defined as the difference between wild-type and mutant ( G = G wt -G mut ), where negative values of G correspond to mutations decreasing protein-NA affinity and positive values to those increasing affinity.
Only single-point missense mutations were considered within our training data set, which was further divided into four distinct subclasses based on the interacting partners: double-stranded DNA (dsDNA; 461 single-point mutations across 81 complexes), single-stranded DNA (ss-DNA; 60 single-point mutations across 13 complexes), double-stranded RNA (dsRNA; 188 single-point mutations across 30 complexes), and single-stranded RNA (ssRNA; 147 single-point mutations across 27 complexes). Structures where separate single-stranded nucleic acids formed a pair were considered as double-stranded.
The Gibbs free energy can be presented as a thermodynamic state function, i.e. the Gibbs free energy only depends on the beginning and end states. In other words, the change in Gibbs free energy of a mutation from the wild-type to mutant ( G wt→mut ) is equivalent in intensity but with opposite sign of the reverse mutation, i.e. from the mutant to wild-type ( G mut→wt = -G wt→mut ). The use of hypothetical reverse mutations has been proposed to help balance naturally imbalanced thermodynamic data sets (36). One limitation of this approach, however, is that mutations leading to large changes in G are more likely to lead to larger changes in protein structure, which are unlikely to be accurately modelled. To avoid this, we generated mutant structures and have only considered reverse mutations where their G wt→mut ≥ −2 kcal/mol. In total, we consid-   Table S1).
i) Graph-based signatures. The mCSM graph-based signatures have been applied to represent both the environment of the wild-type residue of the protein and the nucleic acid by calculating the physicochemical and geometry of their interactions (23). The signature defines atoms as nodes and their interactions as edges. Nodes are labelled based on the seven physicochemical properties (aromatic, hydrophobic, negative, positive, hydrogen bond donor, hydrogen bond acceptor, and neutral) of the amino acid residues/nucleic acids (i.e. pharmacophores). Nucleic acid nodes were labeled based on nucleic acid pharmacophores considering the base ring (purines or pyrimidines), and nucleic acid components (phosphate, sugar and base) and protein nodes as performed previously (33). By using the cutoff scanning matrix algorithm, residue environments are represented as cumulative distributions of distances. ii) Physicochemical properties of protein amino acid residues. The statistical quasi-chemical potential with the composition-corrected pair scale and two substitu- Figure 1. mmCSM-NA workflow and application on real-world data. The method relies on graph-based structural signatures that model distance patterns on the wild-type residue environment. In addition, the effect of mutations was also evaluated using protein dynamics, and interaction network. Complementary information including from mutated residues and predicted protein stability change upon mutation are also used to train and test the predictive models.
tion matrices from AAindex database (38) were used to calculate the effect of mutations. iii) Normal mode analysis (NMA). The dynamics aspects of the mutation effects were characterised by the vibrational entropy change of the protein-NA compelx using DynaMut (26). iv) Protein-nucleic acid interaction network. The noncovalent interactions between protein and nucleic acid in both wild-type and mutated complexes were calculated using Arpeggio (39). Thirteen interaction types are considered: clash, covalent, Van der Waals, clash in Van der Waals, hydrogen bond, weak hydrogen bond, proximal, halogen bond, aromatic, ionic, carbonyl, hydrophobic and metal interactions. v) Nucleic acid type. Four types of nucleic acids, ssRNA, dsRNA, ssDNA and dsDNA were considered in our prediction model. The type was split into RNA/DNA and double/single-stranded as binary variables.
Multiple mutation features were calculated as either the average or accumulation of the respective single point mutations.

Machine learning approach
A range of supervised machine learning algorithms for regression available on Scikit-Learn version 0.20.3 (40), were evaluated under a group-based 5-, 10-and 20-fold crossvalidation procedure and evaluated against the blind test set. To reduce issues with redundancy, the forward mutations and their respective reverse counterparts were kept in the same set.
Under this evaluation framework, the following learning algorithms were analyzed: Gradient Boosting, Extreme Gradient Boosting, Random Forest, Extremely Randomized Trees, AdaBoost, K-Nearest Neighbour, Support Vector Regressor, Gaussian Processes and Neural Networks (with a standard Multi-Layer Perceptron). Models were optimized using a traditional bottom-up greedy feature selection approach. The best performing model, based on Pearson's correlation coefficient (PCC) and root mean square error (RMSE) on the training set, was extremely randomized trees. Other performance metrics were also considered and reported including Spearman's and Kendall's correlations; accuracy, F1 and Matthew's correlation coefficient (MCC) for classification-by-regression analyses. 13 features were selected as representatives for the predicted model.

Web-server
The server front end was built using the Bootstrap framework version 3.3.7, whereas the back-end was built using Python via the Flask framework (Version 0.10.1), on a Linux server running Apache.

Data set analysis
The distributions of the changes in Gibbs free energy, G, in the 856 single-point mutations and 141 multiple-point mutations having experimental values show that majority of the mutations are destabilizing (Supplementary Figure  S1). To avoid introducing potential bias into the final model, hypothetical reverse mutations of the single-point missense mutations were included in the training set. We next analyzed the distribution of different amino acid types on wild-type and on mutated residues. The majority of the 856 experimentally-determined single-point mutations involve positively charged residues, namely Arg and Lys (17%, and 14%, respectively) (Supplementary Figure S2A), followed by negatively charged, and aromatic residues, namely Tyr, Asp, Phe and Glu (8%, 7%, 6% and 6%, respectively). Most of the residues involved in single-point mutations are mutated to Ala (74%), as a reflection of alanine scanning efforts, while other amino acid types account for no more than 2%. Similarly, the majority of 141 multiple-point mutations consist of 16% of Arg and 19% of Lys (Supplementary Figure S2B), followed by Ser (11%), His (7%), Asp (6%) and Glu (6%). Most of the residues in multiple-point mutations are also mutated to Ala (41%), followed by Asn (7%), Lys (6%), Glu (6%), Arg (5%) and Gly (5%), while other amino acid types account for <5%.

Performance on single-point mutations
The performance of mmCSM-NA was evaluated via multiple forms of cross-validation. Using 5-, 10-and 20-fold cross-validation, for single-point mutations mmCSM-NA achieved Pearson's, Spearman's, and Kendall's correlations of up to 0.67, 0.65 and 0.47 respectively, with small deviations across repetitions (<0.01; Figure 2A) and RMSE of 1.06 kcal/mol. Performance increases significantly on 90% of the data, removing 10% of worst predicted data points, achieving Pearson's, Spearman's, and Kendall's correlations of up to 0.78, 0.74, and 0.54 (RMSE = 0.84 kcal/mol). Looking closer at the outliers revealed most of them were associated with extreme values ( G either lower than −3 kcal/mol or higher than 3 kcal/mol). Reassuringly, however, their direction of change was still predicted correctly.
In addition to the overall model performance, we next considered the performance of our final model on each nucleic acid type. Pearson's correlation values of mmCSM-NA on the protein-NA complexes are 0.75, 0.77, 0.76, and 0.57, for ssRNA, dsRNA, ssDNA and dsDNA, respectively (Figure 2B-E). After 10% outlier removal, the corresponding Pearson's correlations increased to 0.84, 0.78, 0.79 and 0.75.
We further assessed the robustness of the mmCSM-NA model on a low-redundancy setting, via leave-one-complexout cross-validation. mmCSM-NA was able to achieve Pearson's, Spearman's, and Kendall's correlations of up to 0.50, 0.57 and 0.40, respectively (RMSE = 1.24 kcal/mol), still consistent with its original performance.
As many experimental screening efforts have focussed on alanine scanning, the performance of the mmCSM-NA model on different types of mutations was evaluated. While the Pearson's correlation for mutations to alanine was 0.55 (RMSE = 1.09 kcal/mol), interestingly, mutations to other residues had a Pearson's correlation of 0.61 (RMSE = 1.26 kcal/mol) (Supplementary Table S2), indicating that our approach has a comparable performance on non-alanine scanning mutations.
We also analysed the dependency of the performance of the mmCSM-NA model on the distance between NA and protein atoms. We used a cutoff of 8Å to distinguish proximal from distal mutations (Supplementary Figure S3). mmCSM-NA was able to achieve comparable performance between mutations within and beyond 8Å from the interface, achieving Pearson's correlations up to 0.58 (RMSE = 0.90 kcal/mol) and 0.69 (RMSE = 1.13 kcal/mol), respectively (Supplementary Table S3).
To further assess the performance of the developed method, we evaluated its ability to correctly identify the direction of affinity change (i.e. either increase or decrease affinity). For this purpose, we removed all the single-point mutations that were considered as neutral (−0.5 ≤ G ≤ 0.5). The mmCSM-NA model was able to accurately classify the direction of affinity change due to mutations, with accuracy, F1 and MCC of up to 0.86, 0.88 and 0.70, respectively, achieving an AUC of 0.91 ( Supplementary Figure S4).

Benchmarking with other available methods
Most efforts to build computational models exploring the effects of mutations on protein-NA binding affinity have focused on predicting residues significantly contributing to binding affinity (i.e. hot-spot residues). We, therefore, sought to compare the ability of mmCSM-NA to identify these hot-spots with other available tools. For this task, hotspots were considered as residues where mutation to Alanine results in complete disruption of the interaction.
Three methods were used in a comparison with mmCSM-NA, namely, iPNHOT (20), PrabHot (22) and PrPDH (21), which predict the hot-spots of protein-NA, protein-RNA and protein-DNA complexes, respectively. Unlike mmCSM-NA, these methods are limited to identifying hotspot residues in close proximity to the binding site and are not capable of predicting quantitative effects.
Different approaches have used different cutoff values to define hot-spots/non-hot-spots. For direct comparison purposes, we therefore used the same threshold defined in each of the respective methods. For example, in the case of iPN-HOT, we used the cutoff of −2 kcal/mol, while in the cases of PrPDH and PrabHot, we used the cutoff of −1 kcal/mol.
We assessed the ability of the methods to identify hotspots by using all mutations to Alanine that had available experimental values in our curated data set as a blind test. mmCSM-NA outperformed all three methods, achieving MCCs of 0.37, 0.39 and 0.28 when using different cutoffs to define hot-spots in comparison with iPNHOT (MCC =  Table 1.

Performance on multiple-point missense mutations
The predicted model was evaluated using a blind data set of multiple-point mutations. Although the training set did not contain any multiple-point mutations, mmCSM-NA was able to accurately predict the change in Gibbs binding free energy of multiple-point mutations with Pearson's, Spearman's and Kendall's correlations of 0.65, 0.56 and 0.41, respectively (RMSE = 1.12 kcal/mol) ( Supplementary Figure S5). This performance shows the ability of the method to predict multiple-point mutations using the information from only single-point mutations. To further assess the performance of the developed method, we evaluated its ability to correctly identify the direction of affinity change (i.e. either increase or decrease affinity). For this purpose, we removed all the multiple-point mutations that were considered as neutral (−0.5 ≤ G ≤ 0.5). The mmCSM-NA model was able to accurately classify the direction of affinity change due to mutations, with accuracy, F1, and MCC of up to 0.96, 0.98 and 0.55, respectively and AUC of 0.90.
The combination of multiple point mutations can be either additive, where the overall effect equals the sum of contributions of individual mutations, or synergistic, where there are compensatory effects between the mutations. We therefore assessed the performance of mmCSM-NA on additive and synergistic multiple-point mutations. Across our dataset of multiple point mutations, we had 104 mutations across 17 complexes that had experimental information available on all the mutations in isolation, in addition to the multiple-point mutation construct. Multiplepoint mutation constructs were defined as additive mutations when the difference between the overall effect and the contributions of the individual mutations effects was either lower than 0.2 kcal/mol (leading to 39 additive mutations) or varied by <10% (leading to 12 additive mutations). Mutations were considered synergistic otherwise, leading to 65 and 92 synergistic mutations, respectively. Our method performed consistently on both synergistic and additive mutations, achieving Pearson's correlations of 0.60 and 0.63 (for additive and synergistic, respectively, for the ±10% criteria) and 0.61 and 0.59 (for additive and synergistic, respectively, for the ±0.2 kcal/mol criteria) (Supplementary Figure S6).

Assessing feature importance
The final mmCSM-NA model is composed of a diverse set of 13 features. We assessed their relative contribu- tion to the model (Supplementary Figure S7) using SHAP (41). This highlighted that the 3D environment of the mutation (graph-based signatures in addition to changes in vibration entropy and protein dynamics) and its interactions with the nucleic acid (Arpeggio interactions and distance to nucleic acid) were the most important contributors to overall performance. In addition, amino acid similarity matrices from AAIndex and the nature of the nucleic acid (DNA or RNA) were also important considerations.

Web-server
We have made mmCSM-NA available to the research community as an easy-to-use web-server. To perform a prediction, users need to provide either a PDB file or a PDB code of the protein-NA complex and specify the type of nucleic acid (ssRNA, dsRNA, ssDNA and dsDNA) (Supplementary Figure S8). Users can select either 'Prediction Mode', where users supply a list of mutations, or 'Design Mode', where the server performs saturation mutagenesis across all residues within 8Å of the protein-nucleic acid interface. With Prediction Mode, users can either type or upload a list of mutations. The point mutation should contain the chain identifier of the wild-type residue in the protein, its single letter code, its corresponding residue number, and the single letter code of the mutant residue. The chain identifier and the single letter code should be separated by a space. For multiple mutations, these can be listed sequentially using ';' as a separator, while distinct constructs should be separated by different lines. The 'Prediction Mode' is also available via RESTful Application Programming Interfaces (APIs) and examples of its usage are described in the web-server. mmCSM-NA predicts the numerical values of the change in Gibbs binding free energy ( G in kcal/mol) (Supplementary Figure S9) in tabular format, which is made available to download as a comma-separated file. A negative value corresponds to the destabilizing effect, while a positive value presents the stabilizing effect.
The users can visualize their uploaded PDB file with its wild-type residue environment from the server using GLmol molecular viewer (Supplementary Figure S9), and a Pymol session file showing all the intra-and inter-molecular interactions made by the wild-type residue, calculated by Arpeggio, is available for download and viewing in Pymol for preparation of publication quality figures and to allow further analysis.

DISCUSSION
In this work, we present mmCSM-NA as a new method that can, for the first time, rapidly and accurately predict the effects of single-and multiple-point mutations on the proteinnucleic acid binding affinity. Our approach, which was developed using graph-based signatures, was robust across multiple forms of validation; and despite only being trained using single-point mutations was able to accurately predict the effects of multiple-point mutations on protein-nucleic acid binding affinity.
We benchmarked our methods with other available methods, particularly iPNHOT, PrPDH and PrabHot. The benchmark showed that mmCSM-NA outperformed those methods, with the advantage of predicting quantitatively the change in binding free energy even for remote regions of the binding site.
We believe mmCSM-NA will be of great value for the study and design of mutations affecting protein-nucleic acid interactions.

DATA AVAILABILITY
A user-friendly web-server implementing the mmCSM-NA's predictive model and all curated data is freely available at: http://biosig.unimelb.edu.au/mmcsm na/.

SUPPLEMENTARY DATA
Supplementary Data are available at NARGAB Online.