Correlation Between Protein Primary Structure and Soluble Expression Level of HSA dAb in Escherichia coli

(AA) sequence have a significant influence on protein solubility. Here, we mainly focused on AA composition and explored those that most affected the soluble expression level of human serum albumin (HSA) domain antibody (dAb). The soluble expression and sequence of 65 dAb variants were analysed using clustering and linear modelling. Certain AAs significantly affected the soluble expression level of dAb, with the specific AA combinations being (S, R, N, D, Q), (G, R, C, N, S) and (R, S, G); these combinations respectively affected the dAb expression level in the broth supernatant, the level in the pellet lysate and total soluble dAb. Among the 20 AAs, R displayed a negative influence on the soluble expression level, whereas G and S showed positive effects. A linear model was built to predict the soluble expression level from the sequence; this model had a prediction accuracy of 80 %. In summary, increasing the content of polar AAs, especially G and S, and decreasing the content of R, was helpful to improve the soluble expression level of HSA dAb.


INTRODUCTION
Given the outstanding advantages of Escherichia coli, including fast growth, inexpensive culturing, high-density cultivation, and simple genetic manipulation, it has been suggested that E. coli should be the first host tried for expression of any protein (1). However, most proteins from eukaryotes have low solubility when expressed in E. coli. For instance, over 80 % of non-membrane proteins were unsuitable for structural studies and over 90 % of potential pharmaceutical proteins were terminated at an early stage of clinical development because of their low solubility when expressed in E. coli (2). Several strategies have been used to increase protein production and solubility, for example altering expression system elements (3,4) and optimizing culture conditions (5). These efforts are time-consuming, costly and usually difficult (6) because of a lack of understanding of the correlation between the effect of the expression system components and the characteristics of the expressed protein.
Interestingly, it has been found that primary structure features have a great impact on protein overexpression in E. coli (7,8). Several prediction models have been established (6,9), such as the Harrison prediction model (10), multiple linear regression (MLR) model (11), solubility index-based model (12), support vector machine-based models (13,14), PROSO model (15), SOLpro model (16), ccSOL model (17) and PROSO II model (18). These bioinformatics models can significantly reduce the trial-and-error procedures involved in optimizing expression systems to increase the soluble expression level of heterologous proteins. However, the application of these prediction models has been limited, partly because of the significant differences among the proteins chosen for building them and partly because inconsistent culture conditions were used for protein expression (6,8,9).
Domain antibodies (dAbs), which consist of only the variable region of the heavy (VH) or light (VL) chain (19), have simple tertiary structures (Fig. 1; 20,21), which makes it easier to attribute differences in dAb expression level to primary structure features. There are three hypervariable regions in dAbs, namely complementarity-determining regions (CDRs) I, II and III, where sequence variability is concentrated to determine the antigen-binding activity of an antibody (22). Small variations of amino acids (AAs) within a short region lead to clear variation in soluble expression level; together with the ease of expression in E. coli (23) and the simple tertiary structure, this makes dAbs an ideal model molecule for investigating the connections between primary structure features and the corresponding soluble protein expression levels.
In this study, a single expression system was used to express multiple human serum albumin (HSA) dAb variants with identical culture and detection conditions, to ensure that no other factors such as culture conditions affect the dAb expression. Clustering and stepwise regression were used to explore the correlation between AA sequences and soluble expression levels of HSA dAbs, aiming at building a linear regression model to predict the soluble expression level of HSA dAb based simply on its AA sequence. Such a model may act as a general guide for site-directed mutagenesis of HSA dAbs or other similar dAbs/Abs to improve the soluble expression levels, which benefits further studies such as interaction mechanism and structure research.

MATERIALS AND METHODS
Random mutation of AAs in the CDRs of the original HSA dAb
Five amino acids (AAs) were chosen in each complementarity-determining region (CDR), giving 15 AAs in total across the three CDRs, and were mutated randomly into other AAs. In this way we generated a mutation library consisting of about 10⁷ samples. These samples varied little in pI and molecular mass and had the same length, which made it possible to focus on AA composition as the variable. Then, 65 mutated HSA dAbs, excluding terminator mutants (AUA, CCU, CCC, AGA and AGG) and sequential repeat mutants, were chosen randomly as experimental subjects and 10 were chosen as verification subjects. These mutated sequences are listed in Table 1.

Production of recombinant dAb expressing E. coli strains
The dAb fragments were cloned into vector pBY (an efficient expression vector constructed by a coworker in our lab) and introduced into E. coli strain BL21(DE3). The transformed cells were plated onto Luria-Bertani (LB) agar plates (Solarbio® Life Sciences, Beijing, PR China) and incubated at 37 °C overnight. After that, single colonies were selected and inoculated into 25 mL of LB medium containing 15 μg/mL of tetracycline (Shanghai Shenggong Co. Ltd., Shanghai, PR China) in 250-mL flasks and incubated at 37 °C for 7 h with shaking at 230 rpm. Stock solutions were prepared by mixing 500 μL of culture with 500 μL of 20 % glycerol (Shanghai Hushi Laboratorial Equipment Co. Ltd., Shanghai, PR China) solution in 1.5-mL tubes, and the cells were stored at −80 °C.

Cultivation of E. coli strains
Cultivation was divided into three phases: seed culture, growth and induction. Forty-eight square multititer plates (48-MTP; Thermo Fisher Scientific, Shanghai, PR China) were used to culture the 66 strains (65 mutated strains and a control strain) to achieve parallel fermentation. In the seed culture phase, 2 mL of LB medium containing 15 μg/mL of tetracycline were added into each well of the 48-MTP. After inoculation with 20 μL of stock cell solution, 48-MTPs were incubated in a shaker at 230 rpm and 30 °C for 16 h. In the growth phase, the seed solutions were transferred to fresh 48-MTPs containing 2 mL of Terrific Broth/Super Broth (TB/SB; Solarbio® Life Sciences) medium with 15 μg/mL of tetracycline and cultured under the same conditions as described above. The inoculum volume was calculated by the following equation, thus fixing the initial A595 nm at 0.05:
V(inoculum)=0.05·V(medium)/A /1/
where V is the volume, 0.05 is the initial absorbance (A) at 595 nm and A is the absorbance of seed culture solution.
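The inoculum calculation can be illustrated with a short Python sketch. It assumes a simple dilution relation (seed absorbance × inoculum volume = 0.05 × culture volume) and ignores the small volume contributed by the inoculum itself; the function name and the default values are illustrative and are not taken from the original protocol.

```python
def inoculum_volume_ml(seed_absorbance, culture_volume_ml=2.0, target_absorbance=0.05):
    """Seed culture volume (mL) needed to start a well at the target A595.

    Assumption: simple dilution, A_seed * V_inoculum = A_target * V_culture,
    neglecting the volume added by the inoculum.
    """
    if seed_absorbance <= 0:
        raise ValueError("seed absorbance must be positive")
    return target_absorbance * culture_volume_ml / seed_absorbance


# Example: a seed culture with A595 = 2.0 and a 2 mL well
volume = inoculum_volume_ml(2.0)
print(f"{volume * 1000:.0f} uL of seed culture per well")
```

Under this assumption, a denser seed culture simply needs a proportionally smaller inoculum, so every well starts the growth phase at the same absorbance.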
Seven hours after the second inoculation, isopropyl-β-D-thiogalactoside (IPTG; Solarbio® Life Sciences) was added to each well to a final concentration of 0.1 mM and the culture temperature was simultaneously lowered to 23 °C. The induction phase lasted for 16 h. The culture broth was then centrifuged at 6000×g (centrifuge model Sorvall ST 16R; Thermo Fisher Scientific, Shanghai, PR China) and the broth supernatants were collected. The cell pellets were resuspended in phosphate-buffered saline (PBS; Shanghai Hushi Laboratorial Equipment Co. Ltd) and lysed using Precellys 24 (Bertin Technologies, Paris, France), and the lysate supernatants were then collected.
The whole cultivation process was repeated six times. Only batches in which dAb production by the control strain deviated little were chosen for further analysis, which ensured that the parallel operations were comparable.

Detection and quantification of soluble dAb protein and total protein
Two fractions of soluble dAb were measured by direct ELISA: soluble dAb in the broth supernatant and in the pellet lysate supernatant. Flat-bottomed 96-well plates (Thermo Fisher Scientific) were first coated with 50 μL of supernatant. After blocking with 5 % non-fat milk in PBS with Tween 20 (PBST; Shanghai Hushi Laboratorial Equipment Co. Ltd), the dAbs were detected using HRP-labelled protein A (Boster Biological Technology Co. Ltd., Beijing, PR China) with the substrate tetramethylbenzidine (Zhengzhou Biocell Biotechnology Co. Ltd., Zhengzhou, PR China). The reactions were stopped by the addition of 100 μL of 2 M sulfuric acid, and the absorbance was measured at 450 nm/620 nm using an EZ Read 800 (Biochrom, Cambridge, UK). The amount of dAb was calculated from a standard curve prepared with a reference sample. Total protein mass fraction was determined using a modified Bradford protein assay kit (Sangon Biotech Co. Ltd., Shanghai, PR China). To avoid differences caused by different degrees of cell lysis, standardized amounts of dAbs in μg per g of total protein were calculated as follows and used in the data analysis (Table 2):
w(dAb)=m(dAb)/m(total protein) /2/
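As a small illustration of the normalization step, the sketch below divides a measured dAb mass by the total protein mass of the same sample, giving μg of dAb per g of total protein. The function name and the example numbers are hypothetical; they are not measurements from this study.

```python
def standardized_dab(dab_ug, total_protein_g):
    """Return dAb content in ug per g of total protein.

    dab_ug: dAb mass from the ELISA standard curve, in micrograms.
    total_protein_g: total protein mass from the Bradford assay, in grams.
    """
    if total_protein_g <= 0:
        raise ValueError("total protein mass must be positive")
    return dab_ug / total_protein_g


# Hypothetical sample: 1.5 ug of dAb in 0.5 g of total protein
print(standardized_dab(1.5, 0.5))  # 3.0
```

Expressing each sample relative to its own total protein makes values comparable between wells that were lysed to different degrees.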

Data analysis
The software package R (24) was used to analyse the contributions of factors such as AA composition, dAb charge and polarity to the dAb soluble expression level. Factors with p<0.05 were considered significant. Categories of AAs based on Vector NTI® (25) are listed in Table 3. Two levels of analysis were run: dividing the expression levels into high and low groups by Clustal Omega (26), and identifying the factors that affected the expression level by t-test. A linear regression model was constructed, then factors that had a significant influence were removed in turn to identify the most significant ones based on Akaike information criterion (AIC) values (27). We used SWISS-MODEL (20) to obtain the 3D structure and PyMOL (21) to display the CDRs in three different colours.
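The AIC-guided selection can be sketched as follows. The analysis in this study was run in R; the Python/NumPy code below is only an illustrative backward elimination under stated assumptions: an ordinary least-squares fit with intercept, AIC computed as n·ln(RSS/n)+2k (the Gaussian form up to an additive constant), and one predictor dropped per step while this lowers the AIC. The function names and the synthetic data are hypothetical.

```python
import numpy as np


def aic_ols(X, y):
    """AIC (up to a constant) of an OLS fit with intercept: n*ln(RSS/n) + 2k."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ coef) ** 2))
    return n * np.log(rss / n) + 2 * design.shape[1]


def backward_stepwise(X, y, names):
    """Drop one predictor at a time for as long as doing so lowers the AIC."""
    keep = list(range(X.shape[1]))
    best = aic_ols(X[:, keep], y)
    while keep:
        # AIC of every model obtained by removing a single remaining predictor
        trials = [(aic_ols(X[:, [j for j in keep if j != i]], y), i) for i in keep]
        candidate_aic, drop = min(trials)
        if candidate_aic >= best:
            break  # no single removal improves the model
        best = candidate_aic
        keep.remove(drop)
    return [names[i] for i in keep], best


# Synthetic illustration: the response depends on the first two columns only
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=60)
selected, final_aic = backward_stepwise(X, y, ["S", "R", "K"])
print(selected, round(float(final_aic), 1))
```

In practice each column of X would hold the content of one AA (or one feature such as polarity) per dAb variant, and y the measured soluble expression level.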

RESULTS
AA composition significantly affects the soluble expression of dAbs
It is widely accepted that AA sequence is significantly correlated with protein production, which was also shown in this study through analysis of the consistency of cluster results based on AA sequences and the corresponding soluble expression levels of dAbs [...] their effect on the dAb soluble expression level by a stepwise regression analysis, and the results are summarized in Table 5.
Stepwise regression was used to analyse the effect of AAs on the dAb soluble expression level in broth supernatant, in pellet lysate supernatant and in total soluble dAb. The combination of AAs S, R, N, D, Q, Y, F and G had a significant influence on dAb soluble yield in broth supernatant (p=0.002). Specifically, S, N, D and Q had positive effects, with p-values of 0.0006, 0.02, 0.03 and 0.05, respectively, which means that the soluble yield of dAb in broth supernatant increased with increasing content of these AAs. However, R had a negative effect (p=0.001), so a dAb with a higher content of R would be more difficult to express in soluble form in broth supernatant.
Moreover, the combined composition of G, R, C, N, S, Y, K and A had a significant effect on dAb soluble yield in the pellet lysate (p=0.002). Again, R showed a significantly negative effect on the soluble expression (p=0.02), while G, C, N and S showed significantly positive effects, with p-values of 0.01, 0.02, 0.03 and 0.03, respectively.
For the total amount of soluble dAb, the combined composition of R, S, G, N, Y, C, Q and F showed a significant influence (p=0.0007). The most significant AAs were R (negative), S (positive) and G (positive), for which the p-values were 0.0008, 0.006 and 0.03, respectively (Table 2).
Additionally, stepwise regression analysis of the features of the dAbs, including charge, polarity, hydrophobicity, acidity and alkalinity, [...] where y indicates the soluble expression score in %, R is arginine, F is phenylalanine, G is glycine, S is serine, C is cysteine, Q is glutamine, Y is tyrosine and N is asparagine. The higher the score, the higher the soluble expression level of dAb. Clustering results divided the sequences of the 65 experimental subjects and the control dAb into high- and low-expression groups; the score distribution is shown in Fig. 2. Twenty out of 25 dAbs belonging to the low-expression group had a score <2.5, while 31 out of 41 high-expression dAbs had a score >2.5. We conclude that dAbs with a score <2.5 are likely to be expressed at a low level in soluble form and the soluble yield would possibly be <(2.4±0.9) μg/g. On the other hand, dAbs with a score >2.5 are likely to be expressed at a high level in soluble form, with the potential soluble yield higher than (4.0±0.5) μg/g.
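The score threshold can be turned into a simple classification rule, sketched below in Python. The score itself must come from the regression model, which is not reproduced here, so the functions take scores as input. Treating a score of exactly 2.5 as low expression is an assumption, since the text only covers scores below and above 2.5; the function names are hypothetical.

```python
def predict_group(score, threshold=2.5):
    """Assign 'high' or 'low' soluble expression from a model score.

    Assumption: a score exactly equal to the threshold is treated as 'low'.
    """
    return "high" if score > threshold else "low"


def agreement(scores, groups, threshold=2.5):
    """Fraction of dAbs whose predicted group matches the clustered group."""
    hits = sum(predict_group(s, threshold) == g for s, g in zip(scores, groups))
    return hits / len(scores)


# Agreement implied by the counts reported above:
# 20 of 25 low-expression and 31 of 41 high-expression dAbs fall on the expected side.
overall = (20 + 31) / (25 + 41)
print(f"{overall:.1%}")  # 77.3%
```

On the 66 sequences used to build the model, the threshold therefore agrees with the clustering for 51 dAbs.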

Verification
Using the same cultivation and detection methods as in the experiments above, expression data were obtained for 10 verification subjects and a control. Comparing the predicted expression levels from the model with the actual soluble yield of these dAbs, the accuracy of the prediction model was 80 % (Table 6).

DISCUSSION
Since 1990, many studies have explored the correlation between protein sequence and expression level; however, no consensus has been reached. For example, one project studied 81 different human proteins and concluded that increasing the average charge, decreasing the number of turn-forming AAs, or decreasing the content of cysteine could reduce the amount of inclusion bodies (10), while another studied G-protein-coupled receptors and found that increasing the positive charge encouraged the formation of inclusion bodies (11). By analysing 27 267 proteins selected from TargetDB, Goh et al. (28) discovered that high hydrophobicity was a disadvantage for expressing proteins in soluble form, whereas Luan et al. (29) expressed 10 167 ORFs of Caenorhabditis elegans using a robotic pipeline and found that hydrophobicity was not linearly correlated with the soluble expression level of protein, although proteins with lower hydrophobicity displayed higher levels of soluble expression. These works show that studies using different subjects can come to different or even opposite conclusions. Here, to avoid the influence of protein properties (including molecular mass, length and complex structures), of the expression system used and of operation bias, we took three measures. First, we used dAb as the experimental subject, because this protein has a low molecular mass and concentrated regions of variation, is easy to express in E. coli and has a simple tertiary structure. Second, mutating 15 AAs in the CDRs guaranteed enough variation among dAbs with little variation in pI, molecular mass and length, which helped us to focus on the variable of AA composition. Third, we used consistent cultivation conditions and detection methodology to collect data, and repeated the process three times with a constant control strain, which guaranteed the consistency of operation.
We found that polarity had a significantly positive influence on dAb soluble yield. In other words, the total content of N, S, C, G, T, Q and Y correlated positively with dAb soluble yield. This may be because in this small protein polar AAs are highly likely to be exposed to solvent after folding, which enhances the solubility of the protein through protein-solvent interaction and thus indirectly increases the soluble expression level of the protein (30).
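The sequence features discussed here are simple to compute. The Python sketch below derives the AA composition of a sequence and the fraction of polar residues, using the polar set named above (N, S, C, G, T, Q and Y). The function names are hypothetical and the example sequence is a toy string, not one of the dAb variants from this study.

```python
from collections import Counter

# Polar AAs as listed in the text: N, S, C, G, T, Q and Y
POLAR = frozenset("NSCGTQY")


def aa_composition(seq):
    """Return the fraction of each AA present in a one-letter sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    counts = Counter(seq)
    return {aa: counts[aa] / len(seq) for aa in sorted(counts)}


def polar_fraction(seq):
    """Return the fraction of residues that belong to the polar set."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return sum(aa in POLAR for aa in seq) / len(seq)


toy = "GSRA"  # toy sequence for illustration only
print(aa_composition(toy))
print(polar_fraction(toy))  # 0.5
```

Comparing these values between a parent sequence and a candidate mutant gives a quick check of whether a proposed substitution raises the G and S content or lowers the R content.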
We discovered that arginine content had a significantly negative correlation with dAb soluble yield, consistent with a report that positively charged AAs could hinder the process of translation, thus bringing down the expression level (7).
Stepwise regression analysis showed that the glycine content was positively correlated with dAb soluble yield, which may be attributable to the small molecular mass and polarity of G. The significantly positive influence of S supports the conclusion that polar AAs benefit dAb soluble expression. We suggest that increasing the total content of G and S, or decreasing the content of R, is helpful to improve the soluble expression level of dAb. Findings from this study may act as a general guide for site-directed mutagenesis of HSA dAbs or other similar dAbs/Abs to improve the soluble expression levels, which benefits further studies such as interaction mechanism and structure research. Furthermore, considering the attractive advantages of E. coli as a protein expression host, our preliminary observations pave the way towards establishing more efficient E. coli expression strategies for desired proteins.

CONCLUSION
Certain amino acids (AAs) significantly affected the soluble expression level of domain antibody (dAb) in the broth supernatant and in the pellet lysate, and total soluble dAb, with the specific AA combinations being (S, R, N, D, Q), (G, R, C, N, S) and (R, S, G). R displayed a negative influence, whereas G and S showed positive effects. Increasing the content of polar AAs, especially G and S, and decreasing the content of R was helpful to improve the soluble expression level of human serum albumin (HSA) dAb. This linear model had a prediction accuracy of 80 %.