Classification model of amino acid sequences prone to aggregation of therapeutic proteins

Total body clearance of biological drugs is for the most part dependent on the receptor mechanisms (receptor mediated clearance) and the concentration of antibodies aimed at administered drug – anti-drug-antibodies (ADA). One of the significant factors that induces the increase of ADA level after drug administration could be the aggregates present in the finished product or formed in the organism. Numerous attempts have been made to identify the sequence fragments that could be responsible for forming the aggregates – aggregate prone regions (APR). The aim of this study was to find physiochemical parameters specific to APR that would differentiate APR from other sequences present in therapeutic proteins. Two groups of amino acid sequences were used in the study. The first one was represented by the sequences separated from the therapeutic proteins (n = 84) able to form APR. A control set (CS) consisted of peptides that were chosen based on 22 tregitope sequences. Classification model and four classes (A, B, C, D) of sequences were finally presented. For model validation Cooper statistics was presented. The study proposes a classification model of APR. This consists in a distinction of APR from sequences that do not form aggregates based on the differences in the value of physicochemical parameters. Significant share of electrostatic parameters in relation to classification model was indicated.


Background
Therapeutic proteins are one of the fastest developing types of drug. A pharmacokinetic profile as well as the effect of biological drugs are the result of many complex interactions with the immune system (Mould and Green 2010;Dostalek et al. 2012). One of the key pharmacokinetic parameters of biological drugs is clearance (EMA 2012). Total body clearance of biological drugs is for the most part dependent on the receptor mechanisms (receptor mediated clearance) and the concentration of antibodies aimed at the administered druganti-drugantibodies (ADA) (Datta-Mannan et al. 2007; Wang and Chow 2010). ADA are produced in the organism as a response to most of the biological drugs including humanized molecules and completely human monoclonal antibodies. The issue of immune response to biological drugs is treated by the health authorities (FDA 2014). Apart from binding (binding antibodies), ADA can also neutralize drugs present in the organism (neutralizing antibodies) (Hsu et al. 2014). One of the significant factors that induces the increase of ADA level after drug administration could be the aggregates present in the finished product or formed in the organism (Chennamsetty et al. 2009;Wang et al. 2009). Protein aggregations, depending on their structure, can exhibit different immunogenicity. Even a slight quantity of formed aggregations after an administration of biological drug may induce a significant increase of ADA level (Rosenberg 2006). Insufficient knowledge about the possibility of aggregation process induction at the stage of drug design may endanger safety and efficacy of biological drug application in the clinical phase. However, the impact of aggregation on safety during biological drug research applications may often be difficult to predict. The reason for this is the specificity of the immune response. In extreme cases the dynamics of this response are extraordinarily high and the equivalent is not seen in small molecules.
Numerous attempts have been made to identify the sequence fragments of different proteins that could be responsible for forming the aggregates (Pawar et al. 2005;Tartaglia et al. 2005). In the case of therapeutic proteins, it is known that these fragments are short, hydrophobic sequences (aggregation prone regions -APR) that in favorable conditions initiate the aggregation process (Wang et al. 2009). A large number of APR in protein sequences could be connected with higher ability to form aggregates in vivo. This, in turn, can have a significant impact on the concentration of free form drug in blood and number of side effects (Rosenberg 2006). The aggregates in finished drugs are usually identified with the use of physicochemical methods such as: size exclusion chromatography, analytical ultracentrifugation, electrophoresis, light scattering etc. However, these methods have some limitations (Tatkiewicz et al. 2015). They are not always sensitive enough or they do not determine the share of aggregates of various structures in a single sample. Most of the commonly used methods allow the determination of hydrodynamic size or the molecular weight of an aggregate. These measurements ascertain or exclude the presence of aggregates only after their formation. They do not identify a danger connected with the ability of determined protein to form the aggregates. Hence, the significant tool to assess the risk relating to the ability of forming the aggregates is in silico analysis (Agrawal et al. 2011). Currently, different kinds of software based on phenomenological methods, statistical models, Monte Carlo simulations, scoring matrices, decision trees, Bayesian models etc. (Wang et al. 2009;Tsolis et al. 2013) are used to find APR in the sequences of therapeutic proteins.
The aim of this study was to find physiochemical parameters specific to APR that would differentiate APR from other sequences present in therapeutic proteins.

Sequences selection
Two groups of amino acid sequences were used in the study. The first one was represented by the sequences separated from the therapeutic proteins (n = 84) able to form aggregation bridges -APR (Wang et al. 2009) (Table 1; sequences 1-84). A control set (CS) consisted of peptides that were chosen based on 22 tregitope sequences (Epivax Inc. 2007).
The shortest sequences in the APR group consisted of 5 amino acids. 46.4 % of APR sequences consisted of only 5 amino acids. A CS was created also based on the peptides with the length of 5 amino acids. From each tregitope (n = 22) two sequences were chosen (Table 1; sequences 85-106). The first one was made of the first five amino acids of each tregitope (1-5) and second one was made of the next consecutive five amino acids of each tregitope (6-10). This way, CS sequences (n = 44) of the length of 5 amino acids each were obtained. One of the sequences from the CS was removed from the analysis (VVSVL). This sequence was the same as one of the APR sequences. Another one (VSWYQ) was also removed from the CS group as a result of double selection from the group of tregitopes during conducted procedures. This way, the final number of CS sequences was 42 (Table 1, sequences 107-148).
Tregitopes were used to build CS as they are short amino acid sequences present in the structure of many therapeutic proteins. After protein internalization, these sequences are responsible for the modulation of an immune response by influencing the regulatory T cells. The effect of tregitope presentation by MHC-II is a tolerogenic action (De Groot et al. 2013). The presence of tregitopes in therapeutic protein sequences (except vaccines) is a desired element considering the suppression of immune response in relation to the administered protein.

Physicochemical parameters calculations
In the first phase of physicochemical characterization of analyzed sequences physicochemical parameters of single amino acids were calculated. 16 parameters were taken from PubMed® database (XLogP3, rotatable bond count, heavy atom count, formal charge, complexity, isotope atom count, defined atom stereocenter count, undefined atom stereocenter count, defined bond stereocenter count, undefined bond stereocenter count, covalentlybonded unit count). Analysis of 51 physicochemical parameters of single amino acids was completed using Qik-Prop 3.1 from Schrödinger package (v 31207) software (Grabowski et al. 2012). QikProp was run in the normal mode. Three-dimensional structures of compounds were prepared in LigPrep 2.2 using settings recommended in the QikProp's user manual (Schrödinger 2015). In the initial phase of study, 62 parameters and features of physicochemical structure were used. They were calculated separately for each amino acid that was a part of the examined sequences.
In the second phase, physicochemical parameters for whole sequences were calculated. In this phase arithmetic expression value (AEx) was created with the use of eight clue physicochemical parameters (Table 1). In cases of such parameters as: number of non-conjugated amine groups (AM), number of carboxylic acid groups (AC), number of non-trivial (not CX3), non-hindered (not alkene, amide, small ring) rotatable bonds (ROT), number of ring atoms not able to form conjugated where SP i-iiistructure parameter calculated for particular amino acid, namount of particular amino acid in the sequence, i, ii, iiiparticular amino acids was calculated. In case of predicted apparent Caco-2 cell permeability (QPCaco) and solubility (QPlogS), SP was an arithmetic mean of particular physicochemical parameters calculated according to the formula: where Nnumber of non-replayed amino acids in sequence. This way calculated physicochemical parameters (Table 1) were used to search for the correlation that could differentiate APR from CS.

Creation of arithmetic expressions
The search for differences in each SP between the APR and CS groups (Table 1) did not yield any significant findings. Therefore, an attempt to create an arithmetic expression value (AEx) consisting of several different SP was made. To this end, the method published earlier was used (Grabowski et al. 2012). Many compilations of SP were tested (not published data), as the result of which the arithmetic statement was distincted: Ln(AM − IP + AC × ROT) − (QPCaco − NON). A value QPlogS and this arithmetic statement were used to classify the sequence groups and to exhibit a significance of differences between classes (APR and CS). QPlogS is a physicochemical parameter of a complex character. This focuses the information of solubility in water. However, this information combines many properties linked to a molecule solubility and its electrostatical character. Hence, there was an attempt to use that parameter in presented model.

Statistical analysis and model validation
Statistical analysis was performed with the use of Graph-Pad Prism 6.0 software. All relationships were confirmed by Mann-Whitney test (Z c statistics) and differences with p <0.05 were regarded as statistically significant. Arithmetic mean (M), standard deviation (SD), lower and higher 95 % confidence intervals for M (CI low, CI high), and standard error (SE) was calculated (SD= ffiffiffiffi N p , where -N is total number of sequences (APR and CS)). Sample size of training set (APR) was positively verified by Toplis ratio (ratio of the number of chemicals in the training set to the number of descriptors in the AEx is >5:1) (OECD 2007;ECHA 2008).
Classification model and four classes (A, B, C, D) of sequences were finally presented. Currently, the Cooper statistics is the most widely used method of classification model validation Zambrano et al. 2015). That is why, for model validation Cooper statistics based on Bayesian approach (sensitivity -S n , specificity -S p , accuracy -A c , error rate -E r , positive predictivity -P p , negative predictivity -N p , false positive (over-classification) rate -FP oc , false negative (under-classification) rate -FN uc , proportion of active chemicals in a population -P as ) was presented. Cooper statistics was calculated using equations:

Results
The mean values of SP calculated for particular groups of sequences were presented in Table 2 (Table 2). At the initial stage of study, the differences in SP between APR and CS were searched. As the result, they could differentiate significantly between these two groups. The analysis of single parameters did not yield its expected results. In case of comparative analysis of SP calculated for the tregitope sequences and APR, significant differences were identified (p <0.05), for instance in relation to HBA↔IP. After a selection of shorter sequences (CS) from the same tregitope sequences, though, it turned out that the differences in relation to HBA↔IP were not significant (Fig. 1). The significant differences (p <0.05) in relation to values SP of tregitopes and APR were also stated for correlations: FISA↔AC × DN 0.5/SA , FISA↔Vol, Vol↔HBA, QPlogS↔FISA, where FISAhydrophilic component of the solvent accessible surface area, AC × DN 0.5/SAindex of cohesive interaction in solids, Voltotal solvent accessible volume in cubic angstroms (Å 2 ) using a probe with a 1.4 Å radius.
However, all the same correlations were not significant for the CS selected from the tregitopes. At the next stage of study, SP was used to create arithmetic statement (AEx) that allowed differentiation of APR (n = 84) from CS (n = 42) with a sensitivity of 79.76 %. After statement of a correlation AEx↔QPlogS, the sequences APR and CS were differentiated on 4 different classes (A, B, C, D). The range of classes are characterized with the values of parameters QPlogS and AEx. A definition of class includes the values: QPlogs > 0 and AEx < 0 (class A), QPlogs ≥ 0 and AEx > 0 (class B), QPlogs < 0 and AEx ≥ 0 (class C), QPlogs ≤ 0 and AEx ≤ 0 (class D), (Fig. 2). A range specific for APR illustrates class D on Figure 2. As a result of using AEx, only 20.24 % of APR were incorrectly recognized as sequences not connected with the aggregation process (class B and C on Fig. 2). And only one of 42 CS sequences was recognized as a sequence potentially dangerous and classified to class D. As a result, 97.67 % of CS sequences were classified as not possessing any features connected with forming the aggregates (Table 3)class A, B, C on the Figure 2.
Out of 127 sequences (a sum of APR and CS) only one was present in A class. 67 APR sequences were classified to class D. Other sequences were in classes: B and C. The proposed classification model did not allow total separation of APR from CS. During analyses, it turned out that four out of 42 CS sequences had regions that were repeated in APR. These regions contained hydrophobic amino acids such as F, I, L, M and N. These regions were: LMI, LYL, TDF and QYN. Only one APR (CQQYN) was classified to class B instead of class D. In relation to CS, none of mentioned regions (LMI, LYL, TDF, QYN) impacted on incorrect classification of CS. Every CS sequence possessing the mentioned regions in its structure was assigned to class C (EEQYN) or B (LMIYE; NTLYL; GTDFT).

Discussion
This study attempted to use the physicochemical parameters of single amino acids to detect APR sequences in therapeutic proteins. The introduced method involved analysis using software used previously mostly for calculations of physicochemical parameters of small molecules. This method uses the analysis of physicochemical parameters of single amino acids and bases on a prediction of final parameter (SP). This parameter, in turn, is the basis for creating a sequence or region characterization based on AEx, constructed of many SP (AEx = SP ⟺ SP ⟺ SP…., where ⟺ means mathematic operation). AEx with QPlogS was used to construct a model, where 4 sequence classes were defined. Class D includes APR sequences, and classes A, B, Csequences that do not have the same influence on aggregation bridges forming.
In the course of the study, it was stated that using the long amino acid sequences to verify the presented model implemented false positives. In long tregitope sequences AEx had a value significantly different from AEx calculated for APR. However, this may result from the existence of feature camouflage of the shorter CS (CS derived from tregitopes, n = 5).
The study proposes a classification model of APR consisting in a separation of APR based on the differences  NON). APR aggregation prone regions (○; n = 84), CS control set extracted from tregitopes (•; n = 42), AC number of carboxylic acid groups, AM number of non-conjugated amine groups, IP ionization potential, ROT number of non-trivial (not CX3), non-hindered (not alkene, amide, small ring) rotatable bonds, QPCaco predicted apparent Caco-2 cell permeability, NON number of ring atoms not able to form conjugated aromatic systems in the value of QPlogS and AEx in relation to sequences that do not form aggregates. A value of water solubility or hydrophobicity of APR with reference to APR has been discussed in many studies (Wang et al. 2009;Tsolis et al. 2013;Zbilut et al. 2003;Wu et al. 2014). The significance of this feature in relation to APR was also confirmed in this study. Moreover, it was stated that the charge characterization of particular amino acids present in analyzed sequences has a significant correlation with APR. It is indicated by the presence in AEx of such parameters as: ionisation potential, number of amine groups or number of carboxylic acid groups. At least three parameters used to construct AEx relate to the charge characterization of analyzed sequences. The presence of IP in AEx does not seem to be accidental. The IP value is determined, among other things, in relation to the oxidative potential of amino acids. It is known that IP value is connected with the proton-donating or proton-accepting character of the amino group and carbonyl groups of amino acids (Hirakawa 2014). IP is a parameter indicative of the molecular ability to transfer positive ion. Therefore IP is connected with the oxidative reactions of amino acids (Rooman and Wintjens 2014). On the other hand, oxidation of some amino acids (histidine, methionine, cysteine, tryptophan, tyrosine) may have influence on the increase of aggregation forming dynamics (Li et al. 1995).
Although some sequences were not classified correctly, validation parameters confirm the predictive quality of the model. Based on the calculations, it can be deduced that the finding of APR in the protein structure with the use of parameters used so far for small molecules is possible. The study confirmed previous observations concerning the influence of short, hydrophobic protein sequences on the initiation of the protein aggregation process. Additionally, the significant share of electrostatistic parameters including IP in relation to classification parameters was indicated.

Conclusions
The study proposes a classification model of APR consisting in a distinction of APR based on the differences in the structure in relation to sequences that do not form aggregates. Key parameters for validation of the presented model include: number of non-conjugated amine groups, number of carboxylic acid groups, number of non-trivial (not CX3), non-hindered (not alkene, amide, small ring) rotatable bonds, hydrogen bond acceptors, predicted apparent Caco-2 cell permeability, ionization potential, number of ring atoms not able to form conjugated aromatic systems and solubility.
This presented model allows selection of APR's in the protein sequence in non-clinical drug development process.

Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable. Groups groups of parameters depicted in Fig. 2, M arithmetic mean, SD standard deviation, CI low lower 95 % confidence interval for M, CI high higher 95 % confidence interval for M, n number of sequences observed in specific class (A, B, C, D), SE standard error, na. not applicable, QPlogs >0 and AEx <0 (class A), QPlogs ≥0 and AEx >0 (class B), QPlogs <0 and AEx ≥0 (class C), QPlogs ≤0 and AEx ≤0 (class D)