Introduction

Peptides are a growing drug market with over 100 approved drugs, with insulin being the most prominent one1,2,3. Peptide drugs exhibit several advantages over small molecules2. Since they often exhibit low toxicity and may not accumulate in tissue, they can be safe while having high efficacy2. They are also diverse, potent, easy to synthesise2 and have higher specificity, due to their larger size compared to small molecules4. However, peptide drug candidates can suffer from several problems. They tend to have low oral bioavailability and short half-lives1,2,5 caused by high clearance rates and low metabolic stability due to the presence of peptidases1,2,5. Moreover, peptides can have poor membrane permeability, tend to aggregate, can contain immunogenic sequences2,6, and their conformational flexibility may generate problems during drug development as they can adopt more than one structure5.

Taking inspiration from nature, the properties of endogenous peptides and proteins can be modified through post-translational modifications (PTMs)7. Typical PTMs include phosphorylation for signal transduction and energy metabolism8,9, and acetylation and glycosylation for regulation10. Other common modifications are amidation, carboxylation, hydroxylation, disulfide bond formation, sulfation and proteolytic cleavage11,12. PTM dysregulation is often associated with disease, including sleeping sickness13, amyloid-associated diseases14 and HIV15. A particular focus in recent years has been put on the impact of PTMs on protein aggregation, and on associated neurodegenerative diseases6,16. Different PTMs have been shown to have varying effects on the aggregation propensity of peptides and proteins6. N-terminal truncation, incorporation of pyroglutamate, phosphorylation and nitration increase oligomerisation of the amyloid-β peptide, while citrullination and backbone modifications also increase oligomerisation but simultaneously decrease aggregation6. In therapeutic applications, examples include increased biological activity and improved metabolic stability through N-methylation17,18, increased binding affinity4,19, and extended half-life and improved tissue penetration through lipidation and acylation6. Methylation can also increase binding selectivity19.

By adopting strategies that extend the scope of PTMs, the use of modified amino acids (mAAs) has become prominent in biotechnology and drug development3, through a variety of methods to engineer mAAs into proteins20,21,22,23,24,25,26,27,28,29. A selection of the most common mAAs is shown in Table 1, with those used in this work being highlighted in bold. General approaches to improve peptide-based drugs often start with alanine or glutamic acid scanning to identify interaction and cleavage sites5, and continue with the replacement of natural amino acids with modified amino acids (mAAs) to tailor a variety of other properties1,5. These mAAs can contain new functional groups, and alter the backbone or the terminal structure of a peptide5,30. The effects of mAAs are diverse and can counter specific problems inherent in biologics, including by altering immunogenicity31. One of the major issues in peptide drug development is the recognition by proteases and peptidases, which can be attenuated by changing the backbone through incorporation of amide bond mimics, D-isomers, β-amino acids, alteration of the termini or tetra-substituted amino acids1,4,17,19,31,32,33,34,35,36. These mimics also tend to increase bioavailability, another issue which often plagues peptide drugs17 as well as restrict conformation and therefore reduce flexibility1,37,38. Similar effects can also be caused by N-alkylations1,17, incorporation of aminoisobutyric acid39, other constraining amino acids31,40,41 or by cyclisation1,19,36,38. The latter and addition of sterically bulky groups can also reduce T-cell recognition4,19. Bioavailability and stability can also be improved by glycosylation, which enhances protein-protein interactions and makes use of glucose transporters on the cell surface which improves cell permeability31. 
Permeability can also be improved by increasing hydrophobicity, which can be achieved by methylation, lipidation31, and by adding fluorinated residues19 or modifications to terminal residues42.

Table 1 Selection of the most common modified amino acids (mAAs)

Many applications based on mAAs have been made in materials science, especially with nanotubes and nanofibres43,44,45,46. mAAs can also be used for photoactive, photo- or fluorescent-caged and photo-crosslinking modifications47,48,49,50,51,52,53,54,55,56, fluorescent probes47,48,57,58,59,60, spectroscopic probes47,48,61 and as metal ion chelators47,48. Moreover, they can be used to create redox-active enzymes62, reduce the complexity of NMR spectra63 and can have antimicrobial activity64.

Commercial vendors currently offer hundreds of synthesis-ready mAAs that can be incorporated into peptides, and it has been shown recently that this chemical space can be greatly expanded65. At the same time, experimental methods to characterise peptides are often material-intensive and time-consuming. State-of-the-art solubility measurements, such as PEG solubility assays, require substantial amounts of material and have a throughput typically unsuitable for the screening of thousands of candidates66,67,68,69. Therefore, developing computational methods to predict the intrinsic solubility and aggregation propensity of peptides and proteins with mAAs would be highly beneficial. Laborious solubility measurements could be avoided or greatly reduced by incorporating fast and inexpensive in silico screenings in development pipelines. Although there are several accurate protein and peptide solubility predictors available, as well as predictors for individual amino acids, to our knowledge no sequence-based method can readily handle non-natural amino acids70,71,72,73,74.

To bridge this gap, here we exploited the CamSol framework for the prediction of intrinsic solubility75,76,77 to develop the CamSol-PTM method, which can handle peptides containing mAAs that are of similar size to canonical amino acids. CamSol-PTM is capable of assessing the effect of any kind of small-size noncanonical amino acid on the intrinsic solubility of peptides in aqueous solution at room temperature by combining a range of different physicochemical property predictors. The absolute solubility of a peptide is the combination of its intrinsic solubility and external factors that impact its solubility, such as solvents, ionic strength and pH. By focusing on predicting intrinsic solubility in aqueous solution at room temperature, we aim to create a general base model that can be extended to take external factors into account77. We experimentally validate this approach on variants of three peptides incorporating different mAAs at most positions. The wild-type peptides, which we include in the validation, are glucagon-like peptide-1 (GLP-1), peptide tyrosine tyrosine (PYY), and 18 A.

GLP-1 is a peptide used to treat several disorders, most notably obesity and type-2 diabetes78,79,80. It reduces appetite and glucagon secretion, slows down gastric emptying80, and has a low risk of inducing hypoglycemia, a common side effect of diabetes drugs78. GLP-1 is a 36 amino acid long peptide that when cleaved at the N-terminus produces its active form: GLP-1(7-36) amide78. The drawback of GLP-1 in its native form is that, like most peptides, it has a short half-life and fast clearance rate80. The GLP-1 derivatives liraglutide and semaglutide were developed to overcome this issue80,81. The half-life of these drugs is significantly extended compared to that of native GLP-1 through the introduction of long fatty acid chains, which act primarily by enabling albumin binding82,83,84,85,86,87.

PYY acts similarly to GLP-1 and is sometimes administered in combination with it to treat obesity, as it is co-released by the body when nutrients are detected81. In addition to appetite regulation, it affects energy and glucose homeostasis81,88,89. PYY is a gut hormone with a length of 36 amino acids, although its major form is truncated at the N-terminus to give PYY(3-36)88. Other truncated variants such as 1-34 and 3-34 are also present but appear to be inactive81. The C-terminus of PYY binds four different receptors of the neuropeptide Y receptor family81,89. Like GLP-1, PYY has a short half-life of approximately 10 minutes81.

18 A is a derivative of apolipoprotein A (ApoA-1) which is the major component of high-density lipoproteins (HDLs)2. Apolipoproteins are complexes that contain lipids and proteins, which transport lipids and other hydrophobic molecules through the body90. HDLs can remove cholesterol by decreasing low-density lipoproteins (LDLs) and therefore act against lipid imbalance which is a major cause for cardiovascular diseases2. ApoA-1 is a 243 amino acid-long protein that consists of 10 amphipathic α-helices which interact with lipids2. 18 A is an 18 amino acid long peptide91 that mimics these α-helices2. Since the original 18 A design, many improvements were made to increase its affinity to lipids and homology to ApoA-1 such as acetylating the N-terminus and amidating the C-terminus2,90.

For each of these peptides, we computationally screened over 10,000 variants containing combinations of 5 different mAAs. For validation, we then synthesised an initial set of 30 of those peptides and measured their solubility. A second set of 7 peptides containing 4 new mAAs was used to confirm the generalisability of our approach. Our results show that CamSol-PTM can reliably predict the intrinsic solubility of peptides containing mAAs, showing high correlation between predicted and experimentally measured relative solubility.

Results

Computational predictions

In this work we exploited the CamSol framework for the accurate prediction of the intrinsic solubility of proteins75,76,77 to introduce a method able to predict the effect of mAAs on the solubility of peptides. The original CamSol method predicts the intrinsic solubility of proteins by combining tabulated values of hydrophobicity, charge, and α-helical and β-sheet propensities of the 20 standard amino acids. To extend these tables to a range of different mAAs, information on the physicochemical properties of these mAAs is required (Fig. 1). Because our goal is to estimate the intrinsic solubility of mAA-containing peptides without the need to carry out extensive experimental studies, we built a pipeline in which the physicochemical properties of the mAAs are predicted computationally.

Fig. 1: Workflow for optimising the solubility of peptides containing modified amino acids (mAAs) using CamSol-PTM.

A linear combination of ALOGPS96,97 and XLOGP3100 is employed to determine the hydrophobicity values. The pIChemiSt suite92 is used to predict the pKa values of mAAs. Structural propensities are calculated using a separate predictor that estimates the likelihood of finding an mAA in an α-helix or a β-sheet. The predictor employs a combination of the number of hydrogen donors and acceptors, the number of rotatable bonds, molecular weight and the topological polar surface area. All this information is fed into the CamSol-PTM algorithm to predict the effect of mAAs on the solubility of a peptide.

pKa values

We calculated pKa values of modified side-chains using the recently developed pIChemiSt suite which calculates ionisation constants using pKaMatcher92. pKaMatcher matches SMARTS patterns of the mAAs with a list of SMARTS patterns with known pKas92.
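Since charge is one of the properties CamSol combines, it can help to see how a side-chain pKa maps onto a charge at a given pH. The following is a minimal sketch based on the Henderson–Hasselbalch relation; the function name and the pKa values are illustrative assumptions (textbook-style approximations), and how CamSol-PTM itself converts pKa values into charges is not specified in this section.

```python
def fractional_charge(pka, ph, is_base):
    """Mean charge of a single ionisable group at a given pH.

    Henderson-Hasselbalch: a basic group carries +1 when protonated,
    an acidic group carries -1 when deprotonated.
    """
    if is_base:
        return 1.0 / (1.0 + 10 ** (ph - pka))
    return -1.0 / (1.0 + 10 ** (pka - ph))


# Approximate, assumed side-chain pKa values used only for illustration.
lysine_like = fractional_charge(pka=10.5, ph=7.0, is_base=True)
aspartate_like = fractional_charge(pka=3.9, ph=7.0, is_base=False)
print(f"amine side chain at pH 7:       {lysine_like:+.3f}")
print(f"carboxylate side chain at pH 7: {aspartate_like:+.3f}")
```

Under this picture, a modification that removes the ionisable group altogether, such as acetylation of a lysine side chain, simply contributes no charge term for that residue.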

Hydrophobicity

CamSol uses hydrophilicity values closely related to the inverse of experimental logP values75. Here, to develop a predictor of the hydrophobicity of the mAAs, we used a combination of different hydrophobicity calculators to reduce possible biases. After considering the results of several benchmarks, we selected three hydrophobicity predictors: ALOGPS, XLOGP3 and KOWWIN93,94,95. All these methods are machine learning-based and are trained on different descriptors. ALOGPS96,97 is based on 75 electrotopological-state (E-state) indices and is trained on the Physprop database (Syracuse Research Corporation. Physical/Chemical Property Database (PHYSPROP); SRC Environmental Science Center: Syracuse, NY. (1994))93,98. XLOGP3 is an atomic-based model99 that uses 87 atomic groups and two correction factors93. KOWWIN is fragment-based, using 150 different fragments and 250 corrections93,100.

Next, we fitted the hydrophobicity values for the 20 natural amino acids as calculated with these predictors to the tabulated CamSol hydrophilicity values. This fit accomplishes two goals. First, the original tabulated values of the 20 natural amino acids do not have to be changed. Second, aligning mAA hydrophilicity values to the original value range bypasses the need to re-fit the parameters used to combine the different biophysical properties in the CamSol framework75. We thus calculated the correlation of each of these individual predictors with the original hydrophilicity values of CamSol for the 20 standard amino acids (Supplementary Fig. 1a–c). Using a linear regression analysis, we obtained a fit function to the target values, which showed a higher correlation than with the individual predictors with a Pearson’s coefficient of correlation of 0.9 (Supplementary Fig. 1d). Although the combination of the three predictors was accurate, KOWWIN was not suited for the automation of the whole process. Since KOWWIN is only available as part of the EPA suite which only runs on Windows and is not open source, it would be very laborious to include this in the process101. However, we found that the accuracy of CamSol-PTM is not significantly affected when using only the other two predictors (Pearson’s coefficient of correlation = 0.88) (Supplementary Fig. 1e).
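The fitting step described above can be sketched as an ordinary least-squares fit of two logP predictors (plus an intercept) to a target hydrophilicity scale. This is a minimal sketch under stated assumptions: the function names are hypothetical, the numbers are made-up placeholders rather than ALOGPS, XLOGP3 or CamSol values, and the exact functional form used in CamSol-PTM may differ.

```python
from math import sqrt


def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_linear(features, target):
    """Least squares with intercept; returns [intercept, w1, w2, ...]."""
    rows = [[1.0] + list(f) for f in features]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, target)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(coeffs, feature_row):
    """Evaluate the fitted linear combination for one residue."""
    return coeffs[0] + sum(c * x for c, x in zip(coeffs[1:], feature_row))


def pearson(x, y):
    """Pearson's coefficient of correlation between two lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


# Made-up (predictor 1, predictor 2) logP pairs for five residues and a
# made-up target hydrophilicity scale; placeholders, not study data.
logp = [(-3.0, -3.2), (-1.5, -1.1), (0.2, 0.5), (1.1, 0.8), (1.8, 2.3)]
target = [2.1, 0.9, -0.3, -0.7, -1.6]

coeffs = fit_linear(logp, target)
fitted = [predict(coeffs, row) for row in logp]
print("coefficients:", [round(c, 3) for c in coeffs])
print("Pearson R:", round(pearson(fitted, target), 3))
```

Once the coefficients are fixed on the 20 standard amino acids, the same `predict` call can be applied to the predicted logP values of an mAA, which places it on the existing hydrophilicity scale without altering the tabulated values.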

Secondary structure propensity

We set out to develop a predictor of secondary structure propensity for mAAs based on physico-chemical properties. The values for the 20 standard amino acids are calculated using statistics from the PDB75. However, many types of mAAs are either too rare or altogether absent in the PDB, meaning that a new approach was needed. We considered the following characteristics: molecular weight (MW), number of hydrogen donors (HD), number of hydrogen acceptors (HA), number of rotatable bonds (RB) and topological polar surface area (TPSA). The information on these properties for all standard amino acids and the mAAs used in this work was initially gathered from https://pubchem.ncbi.nlm.nih.gov/. The final version of CamSol-PTM calculates these values using the Python module RDKit. To determine which combination of properties would yield the best predictor, we explored a series of linear equations for different combinations of these five properties, such as, for example,

$$p_i^{\alpha} = \alpha_{\mathrm{MW}} \cdot \mathrm{MW}_i + \alpha_{\mathrm{TPSA}} \cdot \mathrm{TPSA}_i + \alpha_{\mathrm{RB}} \cdot \mathrm{RB}_i,$$
(1)

where \(p_i^{\alpha}\) is the calculated α-helical propensity of amino acid i and \(\alpha_X\) are the linear coefficients to be fitted. For each combination of the properties, we fitted a function to the tabulated secondary structure propensity values of the standard amino acids. We excluded glycine and proline, since these two amino acids have unusual secondary structure propensities and would skew the fit. Moreover, we also used the resulting secondary structure propensity values of each of these combinations within the CamSol-PTM framework to predict the solubilities of all peptides. To choose the most promising secondary structure propensity predictor, we looked at the Pearson’s coefficients of correlation between the predicted secondary structure propensity values and their tabulated counterparts, as well as at the correlation between the experimental and predicted solubility data for the 30 peptide variants. The choice of properties that offered the best combination of high correlation for the secondary structure propensities and high correlation between the predicted and experimental solubilities, while using as few parameters as possible, was HD and TPSA for α-helical propensities (R = 0.59) and MW, RB and TPSA for β-sheet propensities (R = 0.69, Supplementary Fig. 2).
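The search over property combinations can be illustrated with a short sketch that fits Eq. (1)-style models without an intercept for every combination of up to three properties, excluding glycine and proline, and ranks them by Pearson's R against the tabulated values. All names here are hypothetical, the descriptor rows and propensities are made-up illustrative numbers rather than the study's inputs, and the second selection criterion used in the paper (correlation with experimental solubility) is not included.

```python
from itertools import combinations
from math import sqrt

PROPERTIES = ("MW", "HD", "HA", "RB", "TPSA")

# Illustrative descriptor rows (MW, HD, HA, RB, TPSA) and made-up "tabulated"
# helix propensities. None of these numbers are taken from the study.
DESCRIPTORS = {
    "A": (89.1, 2, 3, 1, 63.3), "V": (117.1, 2, 3, 2, 63.3),
    "L": (131.2, 2, 3, 3, 63.3), "S": (105.1, 3, 4, 2, 83.6),
    "D": (133.1, 3, 5, 3, 100.6), "K": (146.2, 3, 4, 5, 89.3),
    "F": (165.2, 2, 3, 3, 63.3), "E": (147.1, 3, 5, 4, 100.6),
    "G": (75.1, 2, 3, 1, 63.3), "P": (115.1, 2, 3, 1, 49.3),
}
TABULATED = {"A": 1.4, "V": 0.9, "L": 1.3, "S": 0.8, "D": 0.9,
             "K": 1.2, "F": 1.1, "E": 1.5, "G": 0.5, "P": 0.4}


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_no_intercept(rows, target):
    """Least-squares coefficients for target ~ sum_k coeff_k * rows[:, k]."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, target)) for i in range(k)]
    return solve(xtx, xty)


def pearson(x, y):
    """Pearson's R; returns 0.0 if either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sqrt(vx * vy)


def rank_models(descriptors, tabulated, max_terms=3, exclude=("G", "P")):
    """Return (R, property names) for every combination, best first."""
    residues = [r for r in descriptors if r not in exclude]
    target = [tabulated[r] for r in residues]
    results = []
    for n_terms in range(1, max_terms + 1):
        for combo in combinations(range(len(PROPERTIES)), n_terms):
            rows = [[descriptors[r][i] for i in combo] for r in residues]
            coeffs = fit_no_intercept(rows, target)
            fitted = [sum(c * x for c, x in zip(coeffs, row)) for row in rows]
            names = tuple(PROPERTIES[i] for i in combo)
            results.append((pearson(fitted, target), names))
    return sorted(results, reverse=True)


ranking = rank_models(DESCRIPTORS, TABULATED)
for r, names in ranking[:3]:
    print(f"R = {r:.2f} using {' + '.join(names)}")
```

Ranking by R alone tends to favour models with more terms, which is one reason to weigh the number of parameters as an additional criterion, as described above.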

Sequence parser

As a 1-letter alphabet is not available for all possible mAAs, we parsed the input sequence as follows. mAAs are added to the standard protein sequence as a three-letter code in square brackets (e.g. Ala-norleucine-Gly would be denoted as ‘A[NLE]G’). A careful literature search regarding nomenclature for denoting mAAs showed that there is currently no widely used and simultaneously easy-to-read format for coding mAAs. Therefore, we kept the implementation flexible so that any kind of nomenclature can be used.
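The bracket notation can be tokenised with a few lines of code. The following is a minimal sketch rather than the CamSol-PTM parser itself; the function name and the choice to accept any non-bracket text inside the brackets (to keep the nomenclature flexible) are assumptions.

```python
import re

# One token is either a bracketed code such as [NLE] or a single
# upper-case letter for a standard amino acid.
TOKEN = re.compile(r"\[([^\[\]]+)\]|([A-Z])")


def parse_sequence(sequence):
    """Split a sequence such as 'A[NLE]G' into a list of residue codes."""
    tokens = []
    position = 0
    for match in TOKEN.finditer(sequence):
        if match.start() != position:
            raise ValueError(f"unexpected character at index {position}")
        tokens.append(match.group(1) or match.group(2))
        position = match.end()
    if position != len(sequence):
        raise ValueError(f"unexpected character at index {position}")
    return tokens


print(parse_sequence("A[NLE]G"))
```

Each returned code can then be looked up in the tables of standard and modified residues; unbalanced brackets or unsupported characters raise a `ValueError` instead of being silently skipped.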

Choice of modifications

To decide on the set of mAAs for initial testing, we considered a range of different functionalities. Acetylation of a native lysine residue (NAC) is a common PTM with great impact on the properties of a peptide, as it removes a positive charge. Aminoisobutyric acid (AIB) is often used to make peptides more resistant against peptidases as it is not easily recognised79. Norleucine (NLE) is closely related to the natural amino acids leucine, valine and isoleucine, but with its longer non-branched aliphatic chain offers a slightly different functional group; it is also typically used as a non-oxidation labile methionine substitution. Cyclohexylalanine (CHA) offers a unique functionality due to its highly hydrophobic non-aromatic six-membered ring. Citrulline (CIT) offers alternative functionality that resembles arginine. Moreover, we also implemented modifications to the N- and C-termini of peptide scaffolds: N-acetylated aspartic acid, C-amidated phenylalanine and C-amidated tyrosine, as these were already included in the base peptides. With this mix of new functionalities and some closely related mAAs we aimed to cover a broad chemical space.

Peptide design

Due to the limit on the number of variants that could be synthesised and purified in this study, we wanted to ensure that our designs covered the largest possible chemical space while exploring a broad range of solubility values. For each peptide we designed five variants each containing one mAA. We chose alanine residues as the starting point for single modifications to have a common baseline for all mAAs. Additionally, we screened all possible combinations of double modifications for each peptide. The first step, however, was to define regions for each peptide that allowed for modification without interfering with the binding capabilities and specific folds.

GLP-1 consists of two α-helices separated by a linker. We chose the first alanine in the linker region (residue 24) as the starting point for single-site modifications. For the double-site modifications, we further excluded the following residues due to their essential role in binding: 7His, 8Ala, 9Glu, 11Thr, 12Phe, 13Thr, 14Ser, 16Val, 17Ser, 18Ser, 19Tyr, 20Leu, 21Glu, 26Lys, 28Phe, 29Ile, 31Tyr, 32Leu, 33Val, 34Lys.

PYY consists of a proline-rich α-helix at the N-terminus which forms H-bonds with the α-helix that comprises the rest of the molecule. Hence, we chose an alanine in the proline-rich region to perform the single-site modifications. For the double-site modifications, we excluded all prolines and hydrogen-bonding residues, i.e. R, H, K, D, E, N, Q.

18 A has an amphipathic nature that is convenient to maintain. Therefore, for the single-site modifications, we chose alanine at position 10, located on the edge between the two sides. For the double-site modifications, we ensured that the hydrophilic residues (D, E, K) were only replaced with hydrophilic modifications (CIT, AIB) and hydrophobic residues (W, F, A, V) were only replaced with hydrophobic mAAs (CHA, NAC, NLE).

Given these constraints, we screened over 50,000 mAA variants using CamSol-PTM. From all these possible variants for double modifications, we chose at least one variant where one of the modifications is rather small, e.g., L to NLE, F to CHA, A to AIB or R to CIT. For the remaining three doubly modified variants per peptide, we chose one variant each predicted as either very soluble, very insoluble or average in solubility. The sequences of the designed peptides are given in Table 2.
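The class-preserving rule described for the double-site modifications of 18 A (hydrophilic residues replaced only by CIT or AIB, hydrophobic residues only by CHA, NAC or NLE) can be sketched as a small enumeration. The function name and the six-residue toy sequence are assumptions for illustration; the sequence is not one of the peptides in Table 2, and the additional per-peptide exclusions described above are not modelled.

```python
from itertools import combinations, product

HYDROPHILIC_MAAS = ("CIT", "AIB")
HYDROPHOBIC_MAAS = ("CHA", "NAC", "NLE")

# Which mAAs may replace which standard residue (class is preserved).
ALLOWED = {
    **{aa: HYDROPHILIC_MAAS for aa in "DEK"},
    **{aa: HYDROPHOBIC_MAAS for aa in "WFAV"},
}


def double_variants(sequence):
    """Yield every doubly modified variant in bracket notation."""
    sites = [(i, ALLOWED[aa]) for i, aa in enumerate(sequence) if aa in ALLOWED]
    for (i, options_i), (j, options_j) in combinations(sites, 2):
        for maa_i, maa_j in product(options_i, options_j):
            residues = list(sequence)
            residues[i] = f"[{maa_i}]"
            residues[j] = f"[{maa_j}]"
            yield "".join(residues)


toy_sequence = "DWKAFE"  # hypothetical example, not a study peptide
variants = list(double_variants(toy_sequence))
print(len(variants), "double-site variants, e.g.", variants[0])
```

The number of variants grows quickly with sequence length, since every pair of modifiable positions contributes the product of its allowed substitutions; this is why a fast in silico score is needed to triage the designs before synthesis.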

Table 2 List of peptides designed to verify the CamSol-PTM predictions

Generation of experimental data

Relative solubility was measured using a recently developed PEG precipitation assay66. For all PYY variants the standard assay worked well, and no changes had to be implemented (Fig. 2a). Variants 27 and 28 were completely soluble whereas variant 30 was already insoluble in the absence of PEG, and variant 29 proved to be difficult to produce and purify. Therefore, these four are not reported in Fig. 2. 18 A and its variants proved more complicated, as most variants were completely soluble up to 30% PEG. We therefore switched from PEG to ammonium sulphate (AMS) precipitation (Fig. 2b), as it has been shown that relative solubility measurements with PEG and AMS are correlated102. Moreover, to ensure that the results stemming from the AMS assay are consistent and reliable, we performed the 18 A experiments twice independently on different days. The results confirmed that they are indeed replicable, and we were therefore confident to use them for the validation of our approach (Supplementary Fig. 3). Two variants, namely variants 17 and 18, proved to be completely insoluble and variant 12 was not produced in sufficient amounts. Therefore, these are not reported in the figures. The last set of variants, stemming from GLP-1, had the inverse problem, as most variants proved to be very insoluble. Even at final concentrations of 0.33 mg/mL (instead of 1 mg/mL) most variants remained insoluble. We therefore used ultracentrifugation to determine the relative solubilities of the GLP-1 variants (Table 3). To confirm the reliability of this method we replicated the results on a different day with the same stock solutions (Supplementary Fig. 4).

Fig. 2: Experimental solubility data for peptides generated using the PEG solubility assay.

Solubility curves determined using a recently developed PEG solubility assay66 for all successfully synthesised variants (all designs except variants 12 and 29) that are neither completely soluble (variants 27 and 28) nor insoluble (variants 17, 18 and 30) for: PYY (a), 18 A (b) and the second batch of PYY variants (c). For 18 A AMS was used instead of PEG. PEG1/2/AMS1/2 values are shown as a vertical line with the shaded region depicting the 95% confidence interval. PEG percentages are mass/volume66. Error bars represent the standard error of the experimental measurements across technical replicates (n = 4 for PYY and PYY – Second Batch, n = 2 for 18 A) where the centre represents the mean. Source data are provided as a Source Data file.

Table 3 Experimental solubility data for the GLP-1 variants generated using ultracentrifugation

Correlation between predicted and experimental solubility values

By comparing the computational predictions with the experimental data, we found high correlations between the two data sets. The Pearson’s coefficients of correlation are 0.78 for the PYY variants, 0.81 for the 18 A variants and 0.58 for the GLP-1 variants (Fig. 3). To ascertain that these findings were not merely a coincidence, we designed a second set of PYY variants containing four new mAAs and measured their solubilities (Fig. 2c). The results are depicted in Fig. 3a in ochre. Variant 32 is not depicted as it was not possible to measure its solubility with the PEG assay. The overall Pearson’s coefficient of correlation for the combined set of PYY variants is 0.6.
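Because the three peptide families were measured with different assays (PEG, AMS and ultracentrifugation), the measured values are not on a common scale and the correlations are reported per family. A minimal sketch of such a per-family calculation is given below; the function names and all numbers are made up for illustration and are not data from this study.

```python
from collections import defaultdict
from math import sqrt


def pearson(x, y):
    """Pearson's coefficient of correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_by_family(records):
    """records: iterable of (family, predicted_score, measured_value)."""
    grouped = defaultdict(lambda: ([], []))
    for family, predicted, measured in records:
        grouped[family][0].append(predicted)
        grouped[family][1].append(measured)
    return {family: pearson(pred, meas) for family, (pred, meas) in grouped.items()}


# Made-up (family, predicted score, measured value) triples.
records = [
    ("PYY", 0.8, 14.0), ("PYY", 0.2, 9.5), ("PYY", -0.5, 6.0), ("PYY", 1.1, 13.0),
    ("18 A", 1.5, 2.9), ("18 A", 0.9, 2.1), ("18 A", 0.1, 1.8), ("18 A", 1.2, 2.2),
]
for family, r in correlation_by_family(records).items():
    print(f"{family}: R = {r:.2f}")
```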

Fig. 3: Correlation between experimental and predicted solubility values of the designed peptides containing mAAs.

The Pearson’s coefficients of correlation are 0.6 for PYY (0.78 for the initial set) (a), 0.81 for 18 A (b) and 0.58 for GLP-1 (c). mAAs that were used are shown in (d). The two designs (12 and 29) that could not be produced in sufficient amounts were removed from the analysis. Error bars in a and b represent the 95% confidence intervals of the PEG1/2 values stemming from the sigmoidal function fitted through the experimental measurements shown in Fig. 2 (technical replicates n = 4 for a and n = 2 for b) where the centre represents the mean. Error bars in c represent the standard error of the experimental measurement shown in Table 3 across technical replicates (n = 2) where the centre represents the mean. Source data are provided as a Source Data file.

Encouraged by the results of the experimental validation, we set out to generalise the computational approach to broaden its applicability to more mAA types. We set up a web server at https://www-cohsoftware.ch.cam.ac.uk/index.php/camsolptm for academic users to freely use our method. We automated the process of adding new mAAs by replacing the hydrophobicity predictor with the Crippen tool from RDKit. If a user would like to predict the solubility of a peptide containing a noncanonical amino acid that has not been implemented yet, only the SMILES code is required. From this information, the web server automatically calculates the necessary properties for this mAA so that the user can include it in the prediction.

To demonstrate the speed of the automation, we incorporated the whole set of non-canonical amino acids that Amarasinghe et al. recently produced through extensive in silico screenings65. CamSol-PTM can calculate about 15 new residues per second on a single CPU core. We then designed 40,000 single mutational variants of a 60 residue-long Nrf2 peptide fragment centred around the mutational sites Leu76, Asp77, Glu78 and Leu84, which were previously identified65. We predicted the intrinsic solubility for each of these variants, which took 8 min on a single CPU core (around 80/s), and plotted the distribution of the solubilities (Fig. 4). By analysing the tail ends of the distribution, we found that, in agreement with chemical intuition, mAAs that contain many hydrogen-bonding groups, such as those containing nitrogen and oxygen atoms, are among the most solubility-promoting residues (Supplementary Fig. 5). The mAAs that most negatively affected the solubility largely contain several aromatic rings and often halogens such as chlorine or bromine (Supplementary Fig. 6).

Fig. 4: Solubility distribution of 40,000 variants of the Nrf2 peptide fragment.

Single mutants were designed containing one of the recently reported 10,000 mAAs65 at one of four positions (Leu76, Asp77, Glu78, Leu84). Solubility of the wild-type peptide is highlighted with a turquoise line. Analysis of the tail ends of the distribution revealed that mAAs that contain many hydrogen-bonding promoting atoms such as nitrogen and oxygen are predominantly found in the highly soluble region, whereas mAAs with halogens such as chlorine and bromine and aromatic rings are mostly found in the insoluble region. The vertical line depicts the CamSol score for the wild type Nrf2 peptide fragment. Source data are provided as a Source Data file.

Discussion

Peptide intrinsic solubility is one of the most crucial parameters that determine the likelihood of a peptide being successfully developed into a commercial drug product. The application of automated, predictive technologies with high throughput and low compound requirements is very useful for efficient early profiling and optimisation of physico-chemical properties such as solubility during early discovery programmes, allowing for more comprehensive screenings and faster development times.

Non-canonical amino acids are often used to introduce unique functionalities to drugs, such as peptidase resistance1,4,17,19,31,32,33,34,35,36 or increased binding affinity4,19. However, experimental methods to evaluate the developability of peptides containing mAAs are typically costly, and current computational approaches lack the capability of capturing the effects of mAAs on the solubility of peptides. To address this problem, we have presented CamSol-PTM, a software that predicts the intrinsic solubility in aqueous solution at room temperature of peptides and proteins containing non-canonical amino acids based on the physicochemical properties of their amino acid sequences75,76,77.

To test the CamSol-PTM predictions, 30 variants of 3 peptides containing 5 different mAAs were chosen from a preliminary screen of over 50,000 designs. The peptides were produced and purified, and their solubilities were experimentally measured. The comparison between measurements and predictions showed that CamSol-PTM can predict the intrinsic solubility of peptides and proteins containing mAAs with high accuracy (Pearson’s coefficients of correlation 0.72 on average).

We confirmed the generalisability of our approach by designing a second set of PYY variants with four new mAAs, measuring their solubility and comparing it to our predictions. The overall Pearson’s coefficient of correlation for the whole set of PYY variants, although slightly lower at 0.6, supports the applicability of our method to mAAs outside the initial set.

Although the wild types of the peptides tested in this study tend to form α-helices, we do not expect our method to be significantly biased towards this type of secondary structure. First, most parameters, including those used to calculate the solubility score for individual amino acids and those used to determine the overall solubility of a protein, are identical to those of the original CamSol method, which was trained on a wide range of secondary structures. Second, the mAAs tested were not merely α-helix-promoting residues and are therefore not biased towards α-helical structures.

It has been recently shown that by creating new unnatural amino acids in silico, it is possible to create effective new compounds, thus demonstrating the potential of incorporating more diverse mAAs into the drug development process65. By automating the process of adding new mAAs to CamSol-PTM, the method is now capable of predicting the effects of small mAAs on the solubility of proteins and peptides. We have demonstrated the speed and versatility of the method by adding all 10,000 mAAs reported recently by Amarasinghe et al. to our method and predicting the solubility of 40,000 mutational variants of a Nrf2 peptide fragment65.

We acknowledge that although our method increases the chemical space that can be covered by solubility predictions by several orders of magnitude compared to the 20 natural amino acids, it is currently restricted to modifications that are of similar size to canonical amino acids. Further developments will be required to assess the effects of larger modifications such as lipids or glycans on the intrinsic solubility of peptides.

We envisage that the CamSol-PTM method will substantially aid in the understanding of the effects of non-canonical amino acids on the intrinsic solubility of proteins and peptides. As with previous versions, it can also be used to identify aggregation hot spots by analysing the solubility profiles. Moreover, we expect it to be a valuable tool for drug development as it enables the fast and accurate solubility prediction of peptides containing modified amino acids.

Methods

Materials

N-α-D-Fmoc protected amino acids were sourced from Bachem AG (Switzerland). Synthesis reagents and solvents were all obtained from NovaBioChem, Merck (UK) and used without further purification. Peptide sequences were prepared using automated microwave-assisted solid phase peptide synthesis using the CEM Liberty Blue synthesiser and Fmoc chemistry with standard side chain protecting groups.

Peptide synthesis

All peptides were synthesised as C-terminal carboxamides on Rink Amide MBHA resin (loading 0.23 mmol/g, 100–200 mesh) on a 0.1 mmol scale using DIC/HOBt activation. All amino acids were double coupled for 4 min at 75 °C, with the instrument set to deliver the N-α-Fmoc-amino acid solutions (0.2 M solution in DMF), HOBt (1.0 M solution in DMF) and DIC (1.0 M solution in DMF). Deprotection cycles were performed using 20% piperidine solution (in DMF, + 0.1 mol HOBt) for 1 min at 90 °C following each cycle. Crude peptides were cleaved from the resin using a cleavage cocktail containing TFA (95%), triisopropylsilane (2.5%) and water (2.5%) for 4 hours at room temperature. The resin was removed by filtration and the cleavage solution removed in vacuo. The peptides were precipitated by addition of diethyl ether, isolated by centrifugation at 3500 rpm and dried under a flow of dry nitrogen.

Peptide purification and analysis

Prior to purification, crude peptides were reconstituted in 5% acetonitrile in water (v/v) or dissolved in TFA and diluted with ACN/Water/TFA 50/50/0.1 mixture and filtered (0.4 μm, PTFE). The purifications were performed by preparative HPLC (Waters Fraction Lynx system connected to a PDA detector and Waters SQD mass spectrometer) using a Waters Atlantis T3 OBD column, Waters XSelect CSH Fluoro Phenyl OBD column or a Waters XBridge C18 OBD column with a focused acetonitrile gradient at room temperature. The mobile phases used were either at acidic or neutral conditions. For specific conditions see Supplementary Data 1. Fraction collection was triggered on either a UV threshold or a target mass intensity threshold; the UV trace was monitored at 230 nm. The collected fractions were pooled and analysed on a C8 or a C18 column by Waters UPLC system (or Agilent 1200 series gradient HPLC system) using a linear acetonitrile gradient at acidic conditions (Supplementary Data 1). UV purity was estimated to be between 82 and 99% at 210 nm or 230 nm on a Waters H-Class UPLC system with a PDA, Waters SQD mass spectrometer (or Waters 3100 system). Target masses were verified against theoretical values on the mass spectrometer operating in ES+ mode.

Solubility assay

Aliquots of 1 mg were prepared from the purified and lyophilised stocks. The solubility of the PYY and 18 A variants was measured using the PEG solubility assay that was developed in this group66. Briefly, a precipitant is titrated in increasing concentration to a fixed concentration of protein to induce precipitation of the protein. The samples are incubated for 48 h at 4 °C after mixing. The samples are then centrifuged and the remaining protein concentration is measured in the supernatant using a plate reader. PYY and 18 A variants were dissolved in 10 mM citrate 10 mM phosphate buffer at pH 7 for a final concentration of 3 mg/mL. The assay was run with 50% 6000 PEG for PYY and with 3.8 M AMS for 18 A. To improve throughput, a multichannel robot was employed to measure several peptides at once, with the workflow being kept the same as described previously66. The solubility of the GLP-1 variants was measured with ultracentrifugation as follows: the peptides were dissolved in 10 mM citrate 10 mM phosphate buffer at pH 7 for a final concentration of 2 mg/mL. 120 µL of each sample were centrifuged in an OptimaTLX Ultracentrifuge for 30 min at 500,000 g at 4 °C. The supernatant was removed, and the peptide concentration was measured using a NanoDrop.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.