Machine learning reveals hidden stability code in protein native fluorescence

Graphical abstract


Introduction
The measurement of conformational stability is crucial in protein folding study, as well as for the engineering and formulation development of protein-based therapeutics [1][2][3]. Proteins are marginally stable with their tertiary structures formed by many weak non-covalent interactions, such that a small change in buffer composition or temperature can lead to unfolding and aggregation. The thermodynamic stability of proteins is often determined by monitoring the fraction of protein unfolding as a function of step-wise increases in temperature or chemical denaturant, giving rise to a transition phase that defines a thermal-or chemicaldenaturation mid-point (T m or C m ) [4]. These measures are often used as rapid screens for improving the conformational stability of proteins through mutagenesis or formulation of the buffer conditions, aiming at a protein variant or formulation with improved kinetic stability to unfolding or aggregation at a given storage temperature. In some cases, a buffer increasing the T m also makes the protein more kinetically stable [5]. Recently, mutations of an antibody fragment led to improved aggregation kinetics in cases that decreased the native ensemble flexibility, yet without altering the T m [6].
The known influence of local unfolding, conformational flexibility and protein-protein interactions within the native structure ensemble, on the conformational stability of proteins, as measured by their T m and their propensity to aggregate, led us to examine whether spectra of proteins under native conditions (at temperatures much lower than the T m ) contain sufficient information to predict their thermal unfolding transitions at higher temperatures. Indeed, it has been observed previously that the intrinsic fluorescence in the baseline of denaturation curves at lower temperatures or lower denaturant concentrations, for thermal and chemical denaturation experiments respectively, is sensitive to changes in the buffer [7]. For example, with increasing guanidinium concentration, the slope of the baseline became steeper which was postulated that the local structure of the tryptophan residues was rearranged prior to the major unfolding of the protein.
Machine learning (ML) refers to a series of algorithms capable of identifying underlying patterns, features and relationships between various variables from complex datasets such that robust prediction models can be built [8][9][10]. In biological studies, ML has shown great potential to analyze the genomics [11,12] and proteomics [13][14][15] data. Recently, ML has been applied also to predict the infrared spectra of proteins [16]  aggregation kinetics [17][18][19][20][21]. However, none of these studies has attempted to use the spectral data of the native protein generated under different buffer conditions as the input variables from which to predict its conformational stability.
Here, we perform an experimental and ML analysis of the temperature-dependent intrinsic fluorescence spectra of a therapeutic antibody fragment for antigen-binding (Fab) over industrially-relevant experimental conditions including six protein concentrations at 1-100 mg/mL, three pH values from 4.5 to 7.0, and three ionic strengths (IS) spanning 30-200 mM (Fig. 1). We obtained the full thermal denaturation curves for each of the 54 experimental conditions, whereby the fluorescence emission intensity was measured over a wide range of wavelengths (300-400 nm), at 0.5 nm intervals. Using a subset of the fluorescence spectra (330-350 nm), which has the maximal changes upon unfolding, a neural network algorithm was applied to model the curves in the transition and denatured regions from only the spectral data in the native baseline regions at lower temperatures. The entire modelled curves were then each fitted to a two-state unfolding model for comparison to equivalent fits obtained directly with the experimental data. In doing so, we interrogated whether the baseline spectral data, which correspond to the native state of the protein structure, contain sufficient information to enable a robust prediction of the denaturation profile and associated conformational stability of the protein. This was based on the hypothesis that the native structure ensemble and hence spectral data represent fingerprints of underlying characteristics such as local unfolding, conformational flexibility and protein-protein interactions. The ML-modelled denaturation curves and the derived T m showed a good consistency with the experimental results. This suggests that it is possible to reveal hidden information within the native-state fluorescence spectra at lower temperatures to predict the transition and denaturation profiles for unfolding.

Materials
The E. coli strain W3110 containing pTTOD plasmid for Fab expression was obtained from UCB (Slough, UK). A C226S variant of Fab was used as described previously, to eliminate any dimerization through disulphide bond formation [22]. All reagents including buffers and inorganic salts are analytical grade and purchased from Sigma-Aldrich (Poole, UK).

Protein preparation
Fab was produced in a pilot-scale 30 L fermenter (BIOSTAT Cplus, Sartorius, Goettingen, Germany) and purified using AKTAbased liquid chromatography as described elsewhere [22]. The purified protein was then exchanged to respective buffers: sodium acetate was used to prepare pH 4.5 and pH 5.5 buffers, and sodium phosphate was used to prepare pH 7.0 buffer, to make the initial ionic strength of the buffers were 30 mM. The protein was then concentrated to make 1, 5, 10, 20, 50, 100 mg/mL solutions with the final ionic strength adjusted to 100 mM or 200 mM using 1 M sodium chloride.

Thermal denaturation measurement
The thermal stability of Fab was measured using a UNit system (UNCHAINED LABS, Pleasanton, US). 9 lL of the protein was loaded into the sample well in a 16-well cartridge. The cartridge was loaded into the instrument and equilibrated to 20°C prior to being step-heated from 20 to 90°C at 30 s per 2°C interval. The intrinsic fluorescence spectrum at each temperature was recorded for 3 independent samples from 250 to 725 nm. The static light scattering (SLS) of the sample was concurrently collected by the instrument for 266 and 473 nm, which corresponds to the formation of small and large aggregates, respectively.

Fitting of the denaturation curve
The fluorescence intensity at 340 nm for each experiment or model-predicted curve was extracted and plotted against temperature, then fitted to a two-state unfolding model to obtain the midpoint of unfolding transitions (T m ), as described previously [6]. In most cases, the fluorescence data below 330 K (57°C) and over 363 K (90°C) were removed to improve the fits to the transition region, by using the most linear portions of the baselines: where K is the equilibrium constant for the transition between the native and denatured state; T is the experimental temperature; I T , I N and I D are the spectroscopic signals of the protein at each given temperature, at the native and at the fully denatured state, respectively. a and b are the baseline slopes of the native and denatured region of the curve.
where T m is the temperature at which the protein is half denatured; DH vn is the van't Hoff enthalpy and R is the gas constant. All temperature terms in this equation are absolute temperatures in Kelvin.
To obtain T m individually from each 1 nm between 330 and 350 nm, the denaturation curves of each wavelength were globally fitted to the two-state model by sharing the DH vh but varying T m values. The obtained T m values were plotted against native slope baseline or initial fluorescence at 20°C ( Figure S6 & S7).

Machine learning of the thermal denaturation data
Artificial neural network (ANN) algorithms are a type of machine learning (ML), inspired by human neural networks, which result in data-driven models that can interpret effectively patterns in multivariate data from non-linear systems [23]. In this study, a common ANN algorithm, Feedforward Neural Network (FFNN) was applied to construct models with one hidden layer of 20 neurons using Matlab (R2017a).
For each epoch, the training set was used to train the neural network model by fitting the weights of connections between neurons while the current model was evaluated by the test set and adjusted according to the test result. The validation dataset provided an unbiased evaluation of the model fit on the training dataset. When the whole training procedure was completed, the model with the best performance from the validation set was selected as the final optimal neural network model.
The maximum number of epochs to train was set to 1000. The performance of the trained network was assessed by the mean squared error (MSE) function and the performance goal as expected MSE of the model was set as 20,000 (based on 1.5% error of the average fluorescence data). To prevent the trained network model from over-training, the training procedure stops if the validation performance degrades for 10 consecutive epochs and the optimal trained network with the best validation performance is selected. The training function used in this work to construct FFNN was the Levenberg-Marquardt algorithm, which is designed to solve non-linear least squares problems [24]. The Levenberg-Marquardt algorithm uses the Jacobian matrix in the following Newton-like model: where J is the Jacobian matrix that contains first derivatives of the network errors with respect to the weights and biases, and e is a vector of network errors. If the scalar m is zero, this is just Newton's method using the approximate Hessian matrix. If m is large, this becomes gradient descent with a small step size. Thus, m is decreased after each successful step and is increased only when a tentative step would increase the performance function. The activation functions for the hidden layer and output layer are the hyperbolic tangent sigmoid transfer function and linear transfer function, respectively. A total of 2268 thermal denaturation measurement data, including protein concentration, pH, IS, wavelength and native state fluorescence intensity at each 0.5 nm of 330-350 nm, were used as inputs to feed the training model. The output of the model was the high temperature (52-90°C) section of the denaturation curve corresponding to the transition and denatured state of the protein.
Cross-validation of the ML models was achieved with one of the simplest and commonly-used techniques known as the holdout method [25]. This method was suitable for this work since it permitted the same training, test and validation sets to be used for building different models and therefore made it easier to compare the predicted T m values across models derived from different input ranges. To avoid the potential risks of overfitting and selection bias with the holdout method, the splitting strategy was designed to split the total thermal denaturation measurement data into training, validation and test sets in the ratio of 66%, 17% and 17% and to guarantee an even spatial distribution throughout the experimental conditions. More specifically, the six protein concentrations were split into two categories, low concentration (1, 5 and 10 mg/ml) and high concentration (20, 50 and 100 mg/ml) (Fig. 1b). Given the two concentration categories, three pH values and three ionic strength values, a total of 18 combinations resulted. Then, from each combination, the method picked two concentration values from a set of three values as the training set, while the other concentration was allocated to either the validation set or test set. In this case, there were 3 runs of cross-validation to make sure all experimental conditions were used for both training and test/validation and that each condition was used for test/validation exactly once. The performances (measured in MSE) for the three runs of cross-validation were 22043, 23778 and 25309 and the splitting strategy with the best performance (as shown in Table 1) was selected for model building in this work.
In order to demonstrate that there was no overfitting bias with the holdout method, the FFNN modelling procedure was repeated using the k-fold cross validation method and their performance compared [25]. The thermal denaturation measurement data was randomly partitioned into k = 6 equal size subsets. Of the six subsets, two subsets were retained as the validation set and test set respectively, and the remaining four subsets used as training data. The cross-validation process was then repeated six times (the folds), with each of the six subsamples used exactly once as the validation data and test data. This work chose k = 6 to make sure the ratio of training, validation and test sets were 66%, 17% and 17% so as to match the ratio in the holdout method. The performance (measured as MSE) of the k-fold cross-validation had an average MSE of 24,333 compared to 22,043 for the holdout method. The similar performance demonstrated that the splitting strategy applied in the holdout method avoided overfitting or bias issues. The holdout method had the additional benefit of saving time to retain a similar performance result.

Thermal denaturation measurement of Fab
The denaturation of Fab results in the exposure of the tryptophan residues that lead to a change in fluorescence intensity as a function of temperature. The fluorescence at 340 nm was selected and fitted to a two-state model for each protein concentration and buffer condition (Fig. 3 a1-6). The temperature at which the protein was half-denatured (T m ) was obtained from the fit and shown in Table S1. Previously, 1 mg/ml Fab was observed to aggregate rapidly at above the T m , whereby the thermal unfolding transition was a convolution of the conformational unfolding equilibrium, and the aggregation kinetics [22]. Thus, the precise T m value obtained is affected by the experimental settings, particularly the ramp rate of the thermal denaturation, which is accordingly kept constant across all experiments.
The change in T m with increasing protein concentration shows stability variation across the experimental conditions, showing a convergence towards a low variability at 100 mg/mL ( Figure S1). This could be partially owing to the quality of data for the fit as the fitting error obtained for T m at 1-10 mg/mL is generally larger than those of 20-100 mg/mL data, which have better unfolding transitions to determine the T m . However, if a comparison is made within the 20-100 mg/mL group, a smaller variation in T m over dif-ferent buffers is still observed for 100 mg/mL suggesting that the T m of Fab becomes relatively insensitive to changes in pH or ionic strength at 100 mg/mL.
The T m has comparatively larger errors for Fab at 1-10 mg/mL than at higher concentrations. The light scattering of Fab shows the aggregates formed from the denatured state at 1 mg/mL rapidly precipitate out of the solution resulting in a drop of scattering intensity, whereas the size of aggregate formed at higher concentrations is still within the measurable range ( Figure S5 & S6). This is consistent with previous results showing that the aggregation of Fab is kinetically (unpublished results) and also thermodynamically stabilized at the higher protein concentrations [26]. Therefore, at lower concentrations of Fab, the greater convolution of aggregation kinetics with the unfolding transition leads to the higher fitting error for T m .  Test  5  Training  50  Training  10  Training  100  Training  1  7  100  Training  20  7  100  Training  5  Test  50  Validation  10  Training  100  Training  1  7  200  Training  20  7  200  Training  5  Training  50  Training  10 Validation 100 Test

Modeling of the denaturation curves from the native state fluorescence
A common artificial neural network algorithm, Feedforward Neural Network (FFNN), was applied and carried out with a splitting of the 2268 experimental datasets into 1512 (66%) for training, 378 (17%) for validation, 378 (17%) for testing ( Table 1). The validation and test datasets were selected to obtain an even spatial distribution throughout the concentration, pH and IS conditions, so as to avoid overfitting or bias in the results (as described in Materials and Methods).The results of the k-fold cross-validation is shown in Figure S2. For each dataset, the fluorescence intensity between wavelength 330-350 nm at 0.5 nm intervals of the emission spectra were selected from the temperatures corresponding to the native state baseline of the protein. In the first model, each wavelength had 16 fluorescence intensity data points, corresponding to the 2°C intervals spanning 20-50°C (293-323 K). The fluorescence intensity data were combined with the protein concentration, pH, IS and wavelength information to form the 20 inputs defined as input range 1 (Table S2).
The outputs of the model were the fluorescence intensities between 330 and 350 nm at 0.5 nm intervals for 52-90°C spanning Fig. 3. Thermal denaturation curves of Fab obtained by fluorescence measurements for (a1-6) 1 mg/mL, 5 mg/mL, 10 mg/mL, 20 mg/mL, 50 mg/mL and 100 mg/mL over 9 buffer conditions. The ML-derived fluorescence (training from 20 to 50°C shown as example) over the same temperature range are shown in b1-6. The change in fluorescence intensity at 340 nm was plotted as filled circles (pH4. 5  the transition region and denatured-state baseline ( Figure S3). The neural network training performance was analyzed in terms of the mean squared error (MSE) of each epoch for the training, test and validation datasets, as shown in Fig. 2b. The MSE decreased with the increase in the number of epochs for all training, validation and test datasets. The optimal model was found at epoch194 with the best validation performance in terms of the lowest MSE value.
The predicted output fluorescence data (52-90°C) were plotted together with the experimental native-state baseline data used as the input (20-50°C), to generate the complete denaturation curves (20-90°C). Their native and denatured baselines were truncated in the same way as for the experimental data, to include only the linear regions of the baselines prior to fitting to the same two-state model (Fig. 3, Table S1).
In three subsequent models, to characterize the impact of input data volume on the quality of the output, we reduced the input flu-orescence intensity range from 20 to 50°C to 20-40°C, and eventually to 20-30°C. The predicted curves were also plotted as above, and then fitted to the two-state model ( Figure S4). Moreover, to investigate whether the feature information in the native state baseline was evenly distributed in temperature or biased to a certain temperature range, we carried out the same modelling procedure but limited the input fluorescence intensity for smaller scanning windows of 20-30°C (as above), 30-40°C or 40-50°C ( Figure S3). Finally, to investigate the reliance of the models upon experimental condition information (concentration, pH and IS), we removed these parameters from the input, and built up a baselineonly model ( Figure S3f) using the fluorescence in the native baseline (20-50°C) only.
The quality of model-predicted curves from different input temperature ranges of the fluorescence was compared with the experimental curves (Fig. 3). The RMSE% shows the model-to- experiment RMSE deviation in fluorescence intensity throughout 52-90°C relative to the fluorescence intensity of the experimental curve at the midpoint (T m ). Most RMSE% falls below 10% over the concentration, pH and IS, with a number of exceptional datasets below 3% found at 5 and 10 mg/mL, regardless of the range of input data (Fig. 4). RMSE% was the greatest at 1 mg/mL, pH 4.5, IS 100 mM / 200 mM. In general, using the input ranges of 20-50°C and 40-50°C gave the lowest RMSE% (in some cases, such as at 50 mg/mL and 100 mg/mL, the RMSE for 20-50°C was better than for 40-50°C), indicating that the modelled curves from these input ranges most closely resembled the experimental curves. For each concentration, using only the baseline fluorescence data as the input gave marginally greater deviations from experimental curves on average.

Conformational stability obtained from experimental and modelpredicted data
The unfolding transition midpoints (T m ) obtained from the experimental and model-predicted curves show a good consistency over the experimental conditions ( Fig. 5 and Table 2). Therefore, the T m derived from model-predicted curves are observed to present similar trends of change over the concentration and buffer conditions, as those of obtained directly from the experimental data fitting ( Figure S1). However, minor differences were observed from the predictive results when modelling from different input ranges. A better consistency with experimental T m was obtained when larger input data volumes were used, as expected. The RMSE values between T m values obtained from 20 to 50°C, 20-40°C, and 20-30°C input data were 1.0, 1.0, and 1.2°C, respectively, for the entire dataset; and 1.1, 1.2, and 1.2°C, for the test dataset alone (Table 2). Interestingly, the quality of model-predicted results improved when using an input range from a higher temperature window, such that the RMSE values for predictions based on 20-30°C, 30-40°C, and 40-50°C input data, were 1.2, 1.1, and 0.8°C, for both the entire dataset, and the test dataset alone.
The linear regression of the test datasets shows larger deviations, compared to those of the validation and the entire datasets. However, the test predictions still fall close to the parity line against experimental values, and so the increased RMSE likely reflects only the relatively small range of test T m values between 78 and 82°C.

Discussion
The thermal unfolding transition of proteins from which the conformational stability is determined, is usually measured by the intrinsic fluorescence (differential scanning fluorimetry), molecular chirality (circular dichroism), or specific heat capacity (differential scanning calorimetry) [27][28][29][30][31][32][33]. Although the native structure of a protein is largely homogeneous, it is not a rigid body but undergoes continual dynamic motion, including frequent unfolding and conformational switching of local regions of structure [34,35]. Under the native state, the intrinsic fluorescence of a protein reflects the average local environment around tryptophan, tyrosine or phenylalanine residues for the whole native ensemble of structures. This fluorescence is sensitive to protein conformational changes induced by changes in temperature, protein concentration and the buffer solution conditions [36]. The local unfolding events, conformational states, and protein-protein interactions within the native ensemble directly influence the stability of the protein to global unfolding, and hence probing fluores- Fig. 4. The deviation of the ML model-derived denaturation curves to the experiment data for different data input range to the model. RMSE % is the ratio between the RMSE of fluorescence intensity for 52-90°C of the experiment and model-predicted values to the fluorescence intensity at the T m . cence under native conditions has the potential to reveal the thermodynamic propensity to globally unfold.
The conformational stability of an antibody fragment, Fab, was studied for 54 combinations of concentration, pH and ionic strength. The fluorescence intensity of Fab increased as unfolding exposed tryptophan residues to the solvent and eventually decreased as Fab was fully denatured. The resulting denaturation profiles throughout the thermal transition were compared to their model-predicted twin datasets predicted from only the spectra obtained below the transition temperature. The model-predicted curves show similar profiles to their experimental twins and the RMSE% indicated that the discrepancy of the fluorescence signals across the temperature range was less than 10%. This provided a high-quality prediction with the model-predicted curves, for further analysis using the same thermal-unfolding model fitted to the experimental data. The T m values obtained from ML models showed a robust agreement with the equivalent experimentally determined values. Altogether, the fluorescence measurement and ML depicted similar behaviours of Fab over changes in protein concentration, acidic to neutral pH, and from low to high ionic strength.
Consistent with previous work, the Fab became apparently selfstabilized at high concentrations [26]. Moreover, T m at 100 mg/mL became very similar across the pH and IS range, suggesting some ''protective" species formed during the unfolding/aggregation process makes the protein stability resistant to changes in buffer environment. The balance of aggregation and denaturation behaviour clearly changes from the lower to the higher concentrations. This complex shift in the protein denaturation pathway was still successfully modelled by ML, indicating the robustness of the approach, and the ability of ML to unpick different outcomes based on the input spectra.
We attempted to consider which specific features in the nativestate fluorescence spectra may have been adopted by the ML algorithm, to predict the unfolding transition. As described above, the prediction improved when using an input range of temperatures closer to the transition region. This was most likely due to a closer correlation between the high-temperature native structure ensem-ble encoded in the spectra and the states that lead to protein unfolding, as expected due to Hammond effects on the freeenergy [37]. However, we could not identify any simple linear relationships between features of the input data and the experimental T m values. We fitted the denaturation curves at each wavelength between 330 and 350 nm to investigate the wavelengthdependent change of features ( Figure S7 & S8, Table S3). For example, the native baseline slope, and the initial fluorescence intensity at 20°C, each varied significantly within each buffer condition, due to changes in protein concentration, but with no clear relationship to experimentally-determined T m values in most cases ( Figure S7 & S8). In addition, the buffer conditions affected the T m values, but these did not correlate with the native baseline slope or initial intensity. Furthermore, the experimental T m values obtained were also dependent on the wavelength used for denaturation curves within each buffer condition, particularly at the higher protein concentrations, but these were not consistently linked to either the initial fluorescence intensity or the native baseline slopes. Instead, the ML model must have relied upon more complex relationships across these various features. The neural network algorithm could grasp these features together with other unknown ones from the native state baseline to make an overall linkage to the spectroscopic signal in the transition and denatured states.
This study provides a predictive approach that could be used to accelerate the development of stable therapeutic protein formulations. Therapeutic proteins, such as antibody and antibody fragments (full IgG, Fab, scFv, sdAb etc), have become a leading class of pharmaceutical product in recent decades [38][39][40][41]. The conformational stability of a therapeutic protein is important for the safety, efficacy and shelf-life of the product. Any enhancement through optimization of the pharmaceutical formulation can bring a reduction of immunogenicity and an increase in storage stability [42][43][44]. Therefore, therapeutic protein candidates under development are usually screened across a wide variety of buffer combinations to identify the most ideal condition to formulate. This approach is highly resource intensive, and uses up protein materials that are often in very short supply at the early stages of development.
The success of the ML predictions show us that the change in thermal stability of a protein in response to different solution conditions, can be entirely predicted from spectra obtained only for the native ensemble. This finding is consistent with the growing understanding that global unfolding events and aggregation mechanisms are critically dependent on local unfolding, conformational states, or protein-protein interactions that occur already within the native ensemble. Therefore, future protein engineering and formulation endeavours can focus on controlling these properties within the native ensemble.
We aim to build this method to accelerate the drug development process by 1) experimentally validating less buffer conditions and 2) only low temperature is needed for screening. Furthermore, the running time of FFNN neural network in this study can process the total 2268 thermal denaturation curves in about 15 min, by obtaining fluorescence spectra in 10 mins over only a 10°C range of the pre-transition region, which greatly accelerated the processing time compared to conventional data analysis methods which usually measure in 70 mins over a 70°C range. An additional advantage is that the protein does not need to be denatured, often irreversibly causing aggregation, and would therefore be re-useable for other experiments. This work shows the potential and capability of the incorporation of ML into the future digital platform for the developability characterization of biopharmaceutical products. In future investigations, we plan to test the generality of this method on a wider range of buffer conditions and on other proteins of different sizes, structure, surface charge and stability. We aim to achieve a better robustness via training the ML algorithm from a larger volume of datasets spanning from different proteins and protein families.

Conclusion
In this work, we applied the Feedforward Neural Network model to study the intrinsic fluorescence of the native state of Fab, a multi-domain therapeutic protein, to derive the fluorescence spectra in the transition and denatured state regions of the thermal denaturation profile. The T m derived from the experiments and from the predictive model were highly correlated, showing that there is sufficient information in the temperature-dependent native state spectra of proteins, to derive their conformational stability. Based on this discovery, a non-denaturing measurement can be developed to make fast prediction of the stability of a therapeutic protein under different formulation conditions.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement
Funding from the UK Engineering & Physical Sciences Research Council (EPSRC) for the Future Targeted Healthcare Manufacturing Hub hosted at University College London with UK university partners is gratefully acknowledged (Grant Reference: EP/P006485/1). Financial and in-kind support from the consortium of industrial users and sector organizations is also acknowledged.

Appendix A. Supplementary data
Supplementary data to this article can be found online at https://doi.org/10.1016/j.csbj.2021.04.047. Table 2 Goodness of the correlation between the experimental and ML-derived T m values. The slope and R 2 are obtained from the linear fit of the data in Fig. 5. The correlation with experimental data was also compared within each ML group between the validation, test and the entire datasets.