Predicting protein stability and solubility changes upon mutations: data perspective

Understanding mutational effects on protein stability and solubility is of particular importance for creating industrially relevant biocatalysts, resolving mechanisms of many human diseases, and producing efficient biopharmaceuticals, to name a few. For in silico predictions, the complexity of the underlying processes and increasing computational capabilities favor the use of machine learning. However, this approach requires sufficient training data of reasonable quality for making precise predictions. This minireview aims to summarize and scrutinize available mutational datasets commonly used for training predictors. We analyze their structure and discuss the possible directions of improvement in terms of data size, quality, and availability. We also present perspectives on the development of mutational data for accelerating the design of efficient predictors, introducing two new manually curated databases FireProtDB and SoluProtMutDB for protein stability and solubility, respectively.


Introduction
Efficient design of stable and soluble protein variants is one of the principal goals of biocatalyst engineering. Understanding the mechanisms governing protein stability and solubility changes upon mutations is of paramount importance in several domains, including biotechnology, medicine, and biopharmaceutics. [1,2] Biocatalyst production suffers losses from time and resources wasted on poorly soluble and unstable protein mutants, [3] and in industrial applications, improved stability against harsh environments often becomes critical. [4] Improving properties of valuable but difficult-to-work-with proteins that have borderline stability or are poorly soluble outside living cells presents another major challenge in biotechnology. [5,6] Moreover, human neurodegenerative disorders, metabolic diseases, and cancer are often linked to mutations leading to protein misfolding, aggregation, [7-9] or decreased solubility, [10,11] and unstable or insoluble proteins may lead to precipitates triggering an unwanted immune response in patients. [12] Given the astoundingly vast protein sequence space to explore in the pursuit of improved stability and solubility, computational tools are increasingly used to narrow down the search to ideally only a few promising mutations to be tested experimentally. [1] Many recent successes in designing improved variants relied on incorporating in silico prediction in the pipeline, [13,14] e. g. the recent application of a computer-assisted protein engineering strategy to modify fibroblast growth factor [15] to yield unprecedented stability and uncompromised biological function (Figure 1). Several new reviews provide excellent overviews of modern approaches and success stories in applying computational methods for engineering stable and soluble biocatalysts. [16-20] Computational tools often rely on a set of rationally chosen rules applied at different steps of the protein engineering workflow.
Many factors may affect the outcome of introducing mutations: from the local physico-chemical properties of substitutions [21] to global changes in the protein backbone. [22] Composite combinations of those factors have also been reported to lead to better design capabilities, e. g. protein spectra derived by digital signal processing were shown to be useful in the design of stereoselectivity. [23,24] This complexity promotes the use of machine learning (ML) techniques, i. e. general-purpose algorithms for automatic rule generation based on patterns in available data. [25] Such algorithms have already substantially advanced our capabilities in image analysis, speech recognition, natural language processing, and other intrinsically complex tasks. [26,27] Therefore, their application to such sophisticated problems as predicting mutational effects on protein stability and solubility was only a matter of time.
Many promising ML-based predictors have been published for either task. [1,28-30] However, they all seem to reach a similar limit of prediction accuracy, e. g. a root mean square error of around 1 kcal/mol for stability predictions. [1] Moreover, independent experimental validation in subsequent studies often reveals modest performance. [28,31-33] Several explanations can be provided, one of which is the limited size and quality of the data available for training: data quality and abundance are critical for ML algorithms as they ultimately aim to identify and generalize patterns in the training data.
In this minireview, we focus on the databases and data sets habitually used to train such predictors. We briefly discuss their structure and associated challenges, from misleading notations and erroneous entries to the problem of aggregating results from different experimental setups. We conclude with perspectives on improving those sets in the hope that this will further accelerate the use of modern data analysis approaches to uncover the driving forces behind the effects in question.
We chose to consider both stability and solubility mutational data due to the similar structure of data as well as intertwined effects of the underlying mechanisms. On the one hand, unstable proteins tend to aggregate and are prone to faster degradation by proteolysis, [34] producing a negative signal in solubility assays. On the other hand, protein stabilization achieved by means of protein engineering was often reported to come at the cost of reducing protein solubility. [17,28] For instance, stabilization strategies frequently suggest surface mutations that increase hydrophobicity, and while such mutations do often increase stability, [35] they tend to have a detrimental effect on solubility. The flexibilities of side chains and whole protein regions have been reported to guide the engineering of both stability [36] and solubility. [37] Moreover, the structure of mutational datasets for these two tasks is quite similar. The joint focus on both problems is, therefore, expected to bring benefit to the communities working on either task.

Data for training protein stability predictors
Recent developments in X-ray crystallography, NMR, and cryo-electron microscopy allow solving protein structures at Angstrom and even sub-Angstrom resolution, [38] revealing the structural basis of protein binding, catalysis, and stability at the level of individual amino acids. However, such experiments are expensive, low-throughput, require sophisticated instrumentation, and are often limited by the protein size. Therefore, most data on protein stability changes upon mutation come from less demanding techniques, namely differential scanning calorimetry, light scattering, circular dichroism, fluorescence spectroscopy, etc. [39] In those experiments, protein in solution is denatured by physical (temperature, pressure), chemical (pH, osmolytes), or biological (proteases) perturbation, and the output signal is recorded and analyzed. For temperature denaturation, this analysis typically yields the melting temperature Tm, loosely defined as the apparent midpoint of the transition in the signal; the difference in Gibbs free energy between the unfolded and folded states, ΔG, typically derived from data fitting; or the activity-related temperature T50, at which the residual activity is reduced by 50 % after incubation.
The pioneering effort in collecting mutational stability data from literature resulted in ProTherm, [40-42] a comprehensive database comprising numerical data from protein denaturation experiments, structural information, and descriptions of experimental methods and conditions. The overwhelming majority of protein stability change predictors were trained on the data from this database. [1] In Table 1, we summarize the most commonly used derivatives of ProTherm as well as recent additions. Unfortunately, the database was last updated in 2013 and has not been actively maintained since then. This resulted in many outdated, imprecise, or erroneous entries, which necessitated substantial manual data cleaning. Among the major issues identified by the teams working on stability predictions [43-46] were nonmatching protein sequences and PDB entries, wrong signs and units of reported values, data incompatibility due to a wide range of experimental conditions, lack of representation for some substitutions, and inadequate disclosure in the source papers. Many reported ΔTm and ΔΔG values were determined under the assumption of a simple one-step reversible denaturation, whereas many proteins undergo multi-step denaturation that is not evident without proper data analysis. [47,48] The occasional presence of a heat capacity difference of unfolding, ΔCp, introduces a temperature dependence to ΔG, [49] rendering the latter values accurate only in a narrow temperature range around the transition. However, the values reported were sometimes extrapolated to room temperature or to the Tm of the wild type.
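The temperature dependence mentioned above is commonly described by the Gibbs-Helmholtz relation for a two-state transition. The following minimal sketch, which assumes a temperature-independent heat capacity difference, illustrates why extrapolating ΔG far from the melting temperature is sensitive to that parameter. The function name and all numerical values are hypothetical and are not taken from any of the datasets discussed here.

```python
import math


def delta_g_unfolding(temp_k, tm_k, dh_m, dcp):
    """Unfolding free energy dG(T) in kcal/mol from the Gibbs-Helmholtz relation.

    dG(T) = dH_m * (1 - T / T_m) + dCp * ((T - T_m) - T * ln(T / T_m))

    temp_k, tm_k: temperatures in kelvin
    dh_m: unfolding enthalpy at T_m in kcal/mol
    dcp: heat capacity difference of unfolding in kcal/(mol K),
         assumed here to be independent of temperature
    """
    enthalpy_term = dh_m * (1.0 - temp_k / tm_k)
    heat_capacity_term = dcp * ((temp_k - tm_k) - temp_k * math.log(temp_k / tm_k))
    return enthalpy_term + heat_capacity_term


# Hypothetical parameters, chosen only for illustration.
TM_K = 330.0   # melting temperature in kelvin
DH_M = 100.0   # unfolding enthalpy at T_m in kcal/mol

for dcp in (0.0, 1.5):
    dg_room = delta_g_unfolding(298.15, TM_K, DH_M, dcp)
    print(f"dCp = {dcp:.1f} kcal/(mol K): dG at 25 C = {dg_room:.2f} kcal/mol")
```

By construction, the estimate vanishes at the melting temperature, and the two printed values differ only through the heat capacity term, which grows with the distance from the transition.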
Several tendencies can be identified based on Table 1. All the datasets are restricted to single-point mutants, and in most of them only those with an available PDB structure are preserved. Multiple values are averaged, and extreme conditions are sometimes excluded, as are extreme values, due to higher expected measurement errors and more significant changes to the structure of wild types upon introducing mutations. Only a few teams performed a manual cleanup of the data and revealed massive inconsistencies in reported values, parameters, and structures. Moreover, data preprocessing in general varies significantly. This supports our hypothesis that the limited data size and quality might be the reason for the modest performance of ML-based predictors in independent tests.
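The typical preprocessing steps listed above (keeping single-point mutants, restricting the pH range, averaging repeated measurements) can be sketched as follows. This is only an illustration under stated assumptions: the records, the protein identifiers, the comma convention for multiple-point mutants, and the pH window of 6-8 are hypothetical choices, and each published dataset applies its own rules.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw records: (protein, mutation, pH, ddG in kcal/mol).
RAW_RECORDS = [
    ("protA", "A32G", 7.0, -1.2),
    ("protA", "A32G", 7.2, -1.0),
    ("protA", "A32G", 2.5, -2.9),       # extreme pH, excluded
    ("protB", "V45A", 6.5, -0.8),
    ("protB", "V45A,I76V", 7.0, -1.5),  # multiple-point mutant, excluded
]


def preprocess(records, ph_range=(6.0, 8.0)):
    """Keep single-point mutants within the pH range and average duplicates."""
    grouped = defaultdict(list)
    for protein, mutation, ph, ddg in records:
        if "," in mutation:                       # assumed multi-point notation
            continue
        if not ph_range[0] <= ph <= ph_range[1]:  # outside the accepted pH window
            continue
        grouped[(protein, mutation)].append(ddg)
    # One averaged value per (protein, mutation) pair.
    return {key: mean(values) for key, values in grouped.items()}


CLEANED = preprocess(RAW_RECORDS)
for (protein, mutation), ddg in sorted(CLEANED.items()):
    print(protein, mutation, round(ddg, 2))
```

Because every filtering threshold is a parameter, the same records can yield different training sets, which is one reason why predictors trained on nominally the same source are hard to compare.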
Stanislav Mazurenko received his PhD in applied mathematics and cybernetics from Lomonosov Moscow State University in 2013. He then joined the protein engineering group Loschmidt Laboratories at Masaryk University as a postdoc to work on data analysis and modelling of protein thermal denaturation. In 2018, he completed a one-year stay at the University of Liverpool, working in nonlinear optimization. He now leads a team in Loschmidt Laboratories, focusing on machine learning methods for protein engineering.

Regarding the data structure, the wild-type proteins are uniformly distributed among the four major SCOP structural classes, as observed in S1948. [50] The authors of S1564 [43] identified that the largest numbers of variants are for lysozyme (16 %), followed by barnase (8 %) and gene V protein (7 %); the most common are substitutions to alanine (26 %) and substitutions from valine (11 %); the least frequent are substitutions from tryptophan (only 18 out of 1564), and some substitutions are not represented at all. In each dataset, most mutations are destabilizing, and this imbalance may negatively affect the performance of the resulting predictor. Indeed, many predictors were reported to demonstrate a similar bias: mutations are usually correctly predicted as destabilizing, but those predicted to be stabilizing turn out, on average, to be neutral during experimental validation. [28]
Apart from ProTherm data, some predictors were tested on 42 mutations of the DNA-binding domain of the tumor suppressor protein p53, [60] and the performance of several predictors was recently evaluated on two newly collected datasets: 96 single-point mutants of guanylate kinase [61] and 51 mutants of β-glucosidase. [33] Several teams performed an additional independent literature search, revealing the promising prospects of seeing improved protein stability predictors in the near future. Many data sets from Table 1 can also be found in VariBench, a platform for sharing published variation data for benchmarking. [58,59] Augmenting the datasets with reverse mutations with opposite signs of ΔΔG or ΔTm has also gained attention recently to promote the so-called anti-symmetry of predictors: reverse mutations should produce the same predictions but with opposite signs, which turns out not to be the case for many predictors. [31,32] To provide the community with additional high-quality data, we have manually processed the data from ProTherm as well as new data from literature and are depositing them in our database FireProt DB, where they can be accessed via a user-friendly graphical interface (Figure 2). We expect to release the database in the next few months, and its landing page can be found at loschmidt.chemi.muni.cz/fireprotdb.
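The anti-symmetry property mentioned in this section can be made concrete with a short sketch. It shows one way to augment a dataset with reverse mutations and to measure how far a predictor deviates from anti-symmetry. The entries, the naming of the mutant background, and the toy predictor are hypothetical and serve illustration only; in practice, the reverse mutation also requires the structure or sequence of the mutant.

```python
def reverse_mutation(mutation):
    """Swap wild-type and mutant residues: 'A32G' -> 'G32A'."""
    return mutation[-1] + mutation[1:-1] + mutation[0]


def augment_with_reverse(entries):
    """Append the reverse of each (protein, mutation, ddG) entry with -ddG."""
    augmented = list(entries)
    for protein, mutation, ddg in entries:
        # The reverse mutation is defined on the mutant background (assumed label).
        mutant_background = f"{protein}_{mutation}"
        augmented.append((mutant_background, reverse_mutation(mutation), -ddg))
    return augmented


def antisymmetry_bias(predict, entries):
    """Mean of predict(direct) + predict(reverse); zero if anti-symmetric."""
    total = 0.0
    for protein, mutation, _ in entries:
        direct = predict(protein, mutation)
        reverse = predict(f"{protein}_{mutation}", reverse_mutation(mutation))
        total += direct + reverse
    return total / len(entries)


# Hypothetical direct mutations with ddG values in kcal/mol.
DIRECT = [("protA", "A32G", -1.1), ("protB", "L10K", 0.4)]
AUGMENTED = augment_with_reverse(DIRECT)


def always_destabilizing(protein, mutation):
    """Toy predictor that ignores its input, mimicking a destabilizing bias."""
    return -1.0


print(len(AUGMENTED), antisymmetry_bias(always_destabilizing, DIRECT))
```

A predictor biased towards destabilizing predictions, like the toy one here, yields a clearly non-zero bias, whereas an anti-symmetric predictor would return zero.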

Data for training protein solubility predictors
Mutational datasets for protein solubility are much more scant and heterogeneous. Protein solubility is typically defined as the concentration of folded protein in a saturated solution when in equilibrium with the solid phase. This quantity is usually estimated in vitro by increasing protein concentration, e. g. by adding lyophilized protein to the solvent or by protein ultrafiltration with subsequent estimation of protein fractions in the supernatant and the pellet, sometimes with the aid of various precipitants such as salts, organic solvents, or long-chain polymers. [62] At the same time, solubility can be defined more generally as in vivo expression, which is usually estimated as expression yield or its proxy, e. g. fluorescence intensity in split-GFP systems [63] or luminescence in split-NanoLuc assays. [64] Protein expressibility depends on many factors, as many components of a cell are involved in its synthesis and folding pathways, and any perturbation of those pathways affects the solubility.
58/55
Only single-point mutations accompanied by experimental pH, temperature, and structures at atomic resolution were considered. In S1925, 12 mutants for two proteins whose structures had missing residues, one trivial mutation, and 10 mutations with fewer than six nearest neighbors were removed.

131/67
Only single-point mutations in globular proteins with available X-ray or NMR structure were considered. Mutations in pseudo wild types and hemeproteins, those destabilizing the structure by more than 5 kcal/mol, and those involving proline were removed due to significant expected structural modifications. Multiple ΔΔG values were weighted-averaged, preferring pH close to 7, a temperature close to 25°C, and no additives. The subset S350 was generated randomly to provide a benchmark.

60
Only single-point mutations in proteins with available PDB structures were considered. The authors used Profix to fix structural defects (missing atoms, residues, or gaps), and TINKER for energy minimization, removing the proteins that failed to be processed by either tool. Only the data measured for pH 6-8 were considered, assuming ionizable residues will have default charged states then. Multiple ΔΔG values were averaged.

49/42
Only single-point mutations in proteins with available PDB structures were considered. Only the data measured for pH 5-9 were kept. Multiple ΔΔG values with the variation < 0.1 kcal/mol were averaged. The tDB subset consists of cases with X-ray structures with no ligands.

99
A manually corrected subset of single-point mutants. The cases with ΔΔG values between −0.5 and 0.5 kcal/mol are considered neutral. No chain IDs are given, and 131 entries for 20 proteins lack PDB IDs.

ChemCatChem
Minireviews doi.org/10.1002/cctc.202000933

Early attempts to collect solubility data systematically at scales suitable for general ML were directed towards full sequences. In 2009, a collaborative effort of the Targeted Proteins Research Project resulted in the eSoL database, which comprises solubility data for around 4000 Escherichia coli proteins measured using the PURE cell-free expression system. [65] The longer-running Protein Structure Initiative resulted in the TargetTrack database with more than 300 000 proteins expressed and annotated. [66] Although aimed at large-scale structure determination, it provides a proxy for quantifying protein solubility based on expressibility. The major limitation of the two databases in our context is the absence of mutational data. While some studies demonstrated potential in predicting mutational effects on solubility after training on wild-type sequences only, [67] the major effort in the area was focused on assembling mutational data from existing literature (Table 2), similar to the datasets used for training protein stability predictors, and training an ML-based predictor on those datasets despite the modest data size.
Regarding their structure, these datasets show only slight imbalance, except for the CamSol dataset with just three mutations decreasing solubility. They were compiled from multiple independent publications, and the different scales for classifying solubility changes reveal that considerable effort is required to make the values compatible. In the largest data set, PonSol, the number of mutants per protein ranges from 52 for Interleukin-1β to below 3 for a dozen proteins. The most common are substitutions to alanine (16 %) and substitutions from leucine (11 %) and lysine (10 %); approximately half of the possible substitution pairs are not represented at all. A significant overlap in the data among different sets can be observed, which hinders a proper comparison of ML predictors trained on different sets. This indicates that the community will benefit significantly from a curated database resolving the overlaps as well as absorbing data published more recently. To address this limitation, we are currently working on the manually curated database SoluProtMut DB (loschmidt.chemi.muni.cz/soluprotmutdb) that will comprise both the data systematically collected from published sources and the experimental data collected in our laboratory.
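The overlap between datasets and the coverage of substitution pairs discussed above can be quantified with a few lines of code. The sketch below is illustrative only: the two toy datasets and their identifiers are hypothetical, and normalizing the overlap by the size of the smaller set is just one of several reasonable conventions.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Number of ordered (from, to) pairs of distinct amino acids: 20 * 19 = 380.
POSSIBLE_PAIRS = len(AMINO_ACIDS) * (len(AMINO_ACIDS) - 1)


def dataset_overlap(set_a, set_b):
    """Return shared entries and their fraction relative to the smaller set."""
    shared = set_a & set_b
    return shared, len(shared) / min(len(set_a), len(set_b))


def substitution_coverage(dataset):
    """Count (from, to) substitution pairs and the fraction of pairs observed."""
    pairs = Counter((mutation[0], mutation[-1]) for _, mutation in dataset)
    return pairs, len(pairs) / POSSIBLE_PAIRS


# Hypothetical datasets of (protein, mutation) entries.
SET_A = {("protA", "A32G"), ("protA", "L10K"), ("protB", "V45A")}
SET_B = {("protA", "A32G"), ("protB", "V45A"), ("protC", "K7E"), ("protC", "L3A")}

SHARED, FRACTION = dataset_overlap(SET_A, SET_B)
PAIRS, COVERAGE = substitution_coverage(SET_B)
print(sorted(SHARED), round(FRACTION, 2), len(PAIRS), round(COVERAGE, 3))
```

Entries shared between a training set and a test set should be removed before a predictor trained on one set is evaluated on the other; otherwise the reported performance is inflated.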

Perspectives
The analysis of the literature presented in this study demonstrates how challenging the task of collecting mutational data is even for such habitually measured protein properties as stability and solubility. Apart from data scarcity, which is arguably most urgent in the latter case, the data quality requires much attention. This problem comes in different flavors: from inaccuracies, insufficient disclosure, and lack of standard protocols of data analysis in the original publications, to the errors and difficulties of aggregating information from different sources, biases and imbalances in the resulting datasets. Therefore, the community of researchers developing ML predictors of protein stability and solubility changes will greatly benefit from up-to-date, manually curated, user-friendly, and ML-friendly databases.
Manual curation is indispensable, as demonstrated by the teams that had to discard or change the majority of data from ProTherm (Table 1) due to erroneous values, incorrect or missing structures and sequences, non-existent substitutions, and ambiguous experimental conditions. Many other derivative datasets were not cleaned thoroughly, compromising the quality of the resulting predictors. However, this does not come as a surprise, since as an ML developer, one might have neither enough resources nor proper expertise to check the sources and validate the quality of experiments for each data point. Moreover, with the lack of an updatable database to report inconsistencies and compare dataset overlaps, one has to repeat the cleaning steps almost from scratch every time before training a predictor on more recent data. This repetition leads to a waste of time and delays the maturation of the field into the next stage of ML development, e. g. in-depth analysis and interpretation of successful predictors to uncover biophysical mechanisms behind better predictions.
OptSolMut [68] 137 in total: 59 increased, 78 decreased

19
Binary classification for single- and multiple-point mutants from 15 published studies, with PDB IDs provided. Among 105 single-point mutants, 61 decreased and 44 increased solubility compared to wild types. In total, 121 mutants were soluble both before and after the mutation, but the extent of solubility changed. Also 26 mutants have stability changes reported.

19
Binary classification for single- and multiple-point mutants from 4 published studies, with sequences of wild types and mutants provided. Among 40 single-point mutants, 1 decreased and 38 increased solubility compared to wild types.

xls table 2014
Aggrescan3D [70]

User-friendliness in terms of graphical summary and statistics will allow faster monitoring of the structure of the data to reveal biases in real time. These refer to overrepresented proteins or protein families, amino acids chosen for mutation or those substituting, locations of the mutations, e. g. with respect to the sequence, secondary structure elements, protein surface, tunnels, active sites, etc. The identification of such biases is of critical importance in ML, a data-driven strategy unable to correct data biases automatically without additional tweaking. The prediction power of an ML-based model has yet to be explored for the poorly represented substitutions or proteins with low homology or different unfolding patterns than those in the training data.
ML also places demands on the structure of such a database. The precise identification of mutations, corresponding sequences, and PDB IDs is one ingredient. Another one is adhering to the tidy data principles, [73] i. e. data representation in a clear table format where each column corresponds to a variable, such as substitutions, protein identifiers, experimental conditions, etc., and each row corresponds to an experimental observation. While these principles seem easy to implement, representing multiple-point mutations or new experimental setups will challenge the database developers.
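As an illustration of the tidy data principles and of the difficulty posed by multiple-point mutants, the following sketch stores one experimental observation per row and derives a second table with one substitution per row. All column names, identifiers, and values are hypothetical, and joining substitutions with a semicolon is an assumed convention rather than an established standard.

```python
import csv
import io

# Hypothetical tidy table: one row per experimental observation.
TIDY_CSV = """variant_id,protein_id,substitutions,ph,temperature_c,ddg_kcal_mol
v1,protA,A32G,7.0,25,-1.1
v2,protA,A32G;V45A,7.0,25,-1.9
v3,protB,L10K,6.5,25,0.4
"""


def read_observations(text):
    """Parse the CSV text into a list of dictionaries, one per observation."""
    return list(csv.DictReader(io.StringIO(text)))


def substitution_table(observations):
    """Expand each variant into one row per substitution, keyed by variant_id."""
    rows = []
    for obs in observations:
        for sub in obs["substitutions"].split(";"):
            rows.append({
                "variant_id": obs["variant_id"],
                "wild_type": sub[0],
                "position": int(sub[1:-1]),
                "mutant": sub[-1],
            })
    return rows


OBSERVATIONS = read_observations(TIDY_CSV)
SUBSTITUTIONS = substitution_table(OBSERVATIONS)
for row in SUBSTITUTIONS:
    print(row)
```

Keeping the measurements and the substitutions in two tables linked by a variant identifier leaves every column a single variable even when a variant carries several substitutions.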
An interesting recent initiative is ProtaBank, [74] a database that aims to collect protein engineering data in one place, including some of the datasets mentioned earlier. The creators opted to target a wide range of assays, an excellent idea given the increasing interest in data generation and the lack of any central repository of this kind. They also offered several search tools to analyze comprehensively the published results concerning a particular sequence query, including related sequences given by a BLAST search. On the other hand, the wide focus and variability of the supported data types come at the cost of increasing the effort required for fetching all the available data, e. g. protein stability or solubility changes, and processing them into an ML-friendly format. With this in mind, we are currently working on two manually curated ML-oriented mutational databases to be officially released in 2020: FireProt DB for protein stability (loschmidt.chemi.muni.cz/fireprotdb) and SoluProtMut DB (loschmidt.chemi.muni.cz/soluprotmutdb) for protein solubility changes. The preliminary versions include ca. 14 000 single-point mutants in around 270 proteins and over 10 000 data points from 100 proteins, respectively.
Interestingly, most of the teams, including ours, have resorted to manual search for data in literature so far. Thus, automated data mining remains largely unexplored in this respect. [71,75] The mutational datasets discussed earlier present significant challenges here since the information about mutations and their effects is usually scattered across the publication, and additional effort is required, e. g., to identify automatically whether a positive value of ΔΔG found in a text means increased or decreased stability.
Regarding the perspectives in generating new data, several recent experimental techniques raise hopes of significantly enhancing the available data on mutational changes. In particular, deep mutational scanning [76] couples next-generation sequencing [77,78] with high-throughput assays, e. g. based on fluorescence-activated cell sorting. [79,80] This approach links genotype to phenotype by synthesizing a large library of mutant sequences, selecting for expressed phenotypes, and sequencing the library before and after the selection to quantify the fitness of each mutant. The screening protocols are being actively developed to represent fitness from various angles, and some of them already approximate protein stability and solubility. [81-83] Two major advantages of this approach are the data size and distribution. Data sets generated by deep mutational scanning can easily run into thousands or tens of thousands of mutants, which is terrific news for data-hungry ML. The library generated often covers the space of possible mutations quite uniformly, which compares favorably with more standard low-throughput approaches, in which the selection of variants is usually skewed towards anticipated best performers and negative results are sometimes discarded. Therefore, we expect many new exciting data sets in the near future, which is likely to open up new opportunities for using more powerful ML architectures such as artificial neural networks. Several recent reviews identified the trend in biocatalyst design towards using nonlinear ML models compared to predominantly simple linear predictors in the past. [25,84] Such a transition will lead to more accurate and generalizable tools once a sufficient amount of data is available to steer the flexibility granted by the nonlinear models.
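The quantification step of deep mutational scanning described above can be illustrated with a simple enrichment calculation. The sketch below computes a log2 enrichment of each variant relative to the wild type from read counts before and after selection. This is one common scoring scheme, not necessarily the one used in the cited studies; the read counts, the variant labels, and the pseudocount of 0.5 are hypothetical.

```python
import math


def enrichment_scores(counts_before, counts_after, wild_type="WT", pseudocount=0.5):
    """Log2 enrichment of each variant relative to the wild type.

    score = log2((after_v / before_v) / (after_wt / before_wt)),
    with a pseudocount added to every count to avoid division by zero.
    """
    wt_ratio = (counts_after[wild_type] + pseudocount) / (
        counts_before[wild_type] + pseudocount
    )
    scores = {}
    for variant, before in counts_before.items():
        if variant == wild_type:
            continue
        after = counts_after.get(variant, 0)  # variants lost in selection count as 0
        ratio = (after + pseudocount) / (before + pseudocount)
        scores[variant] = math.log2(ratio / wt_ratio)
    return scores


# Hypothetical read counts before and after selection.
BEFORE = {"WT": 1000, "A32G": 100, "V45A": 100, "L10K": 100}
AFTER = {"WT": 2000, "A32G": 400, "V45A": 50}

SCORES = enrichment_scores(BEFORE, AFTER)
for variant, score in sorted(SCORES.items()):
    print(variant, round(score, 2))
```

Positive scores indicate variants enriched relative to the wild type during selection, and negative scores indicate depleted ones. How well such scores approximate stability or solubility depends on the screening protocol.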
It is also highly desirable that newly collected data be published according to the FAIR principles, [85,86] which were created to encourage authors to take data sharing, discoverability, and reuse into account from the outset of preparing their results. These principles stipulate that data should be identified, described, and indexed clearly and unequivocally, should use standard technical and semantic data formats, variables, and ontologies, and should provide clearly defined access procedures, ideally by automated means. Regarding the application of those principles to publications with mutational data, the following guidelines will help promote the collection of high-quality data sets for training and validating predictors:
* Include and examine protein sequence identifiers, PDB IDs, annotations of mutations, etc. in publications. Any inaccuracies in reporting propagate into databases, require significantly more effort to identify at later stages, and often lead to discarding the data. This is an undesirable outcome for all the parties involved since the data are not reused and their scientific impact is curtailed.
* Report and upload as much data as possible, even for those mutations that did not lead to the desired outcome. Detailed numerical data are often provided only for several mutants out of all those tested, and the rest are reported in an aggregated format only, such as on a graph or in a table of descriptive statistics, precluding their usage in ML training.
* Publish the data as supplements to the original publication, where they are less likely to be lost. Personal data storages get closed, group pages move to new locations, and departments get restructured, which is why many datasets are now unavailable as their links stopped working. [1]
* Add a csv or xls data table, preferably already in the "tidy" format, even when some of the values are already reported in the main text, to improve the access to your data, promote your work among the bioinformatics community, and increase its impact.
Finally, we would urge companies to release their data sets, which might be difficult for their ongoing projects due to undesirable disclosure but should be feasible for past results. The power of ML-based predictors comes from exploiting all the available data, and while individual gains are not always apparent, the whole protein engineering community will benefit from time, effort, and resources saved using predictors that are more accurate. Data scarcity is now the major bottleneck for developing more precise predictors, and if we want to accelerate the research of human neurodegenerative disorders, metabolic diseases, and cancer, produce more efficient drugs, and widen the industrial application of biocatalysts, sharing your data is a small piece of the puzzle that might lead to bigger improvements.