
METHODS article

Front. Chem. Eng., 07 August 2023
Sec. Computational Methods in Chemical Engineering
Volume 5 - 2023 | https://doi.org/10.3389/fceng.2023.1227620

Solubilization of inclusion bodies: insights from explainable machine learning approaches

Cornelia Walther1*, Michael C. Martinetz1, Anja Friedrich1, Anne-Luise Tscheließnig1,2, Martin Voigtmann1, Alexander Jung3, Cécile Brocard1, Erich Bluhmki4,5, Jens Smiatek3,6,7*
  • 1Boehringer Ingelheim RCV GmbH & Co KG, Biopharma Austria, Process Science Downstream Development, Vienna, Austria
  • 2Pharmaceutical Sciences, Baxalta Innovations GmbH, Takeda Group, Vienna, Austria
  • 3Boehringer Ingelheim Pharma GmbH & Co. KG, Global Innovation and Alliance Management, Biberach, Germany
  • 4Boehringer Ingelheim Pharma GmbH & Co. KG, Analytical Development Biologicals, Biberach, Germany
  • 5Biberach University of Applied Sciences, Biberach, Germany
  • 6Boehringer Ingelheim Pharma GmbH & Co. KG, Development NCE, Biberach, Germany
  • 7Institute for Computational Physics, University of Stuttgart, Stuttgart, Germany

We present explainable machine learning approaches for gaining deeper insights into the solubilization processes of inclusion bodies. The machine learning model with the highest prediction accuracy for the protein yield is further evaluated with regard to Shapley additive explanation (SHAP) values in terms of feature importance studies. Our results highlight an inverse fractional relationship between the protein yield and total protein concentration. Further correlations can also be observed for the dominant influences of the urea concentration and the underlying pH values. All findings are used to develop an analytical expression that is in reasonable agreement with experimental data. The resulting master curve highlights the benefits of explainable machine learning approaches for the detailed understanding of certain biopharmaceutical manufacturing steps.

1 Introduction

High-level expression of recombinant proteins differs substantially between mammalian and microbial cells. In addition to missing post-translational modifications like glycosylation, cells from Escherichia coli (E. coli) also accumulate proteins at high concentration in aggregated form (Singh and Panda, 2005; Singhvi et al., 2020). Such inclusion bodies are dense particles of amorphous or para-crystalline protein arrangements (Freydell et al., 2007) that accumulate either in the cytoplasm or the periplasm. The size of inclusion bodies varies between 0.5 μm and 1.3 μm with an average density of 1.3 g/mL (Freydell et al., 2007). Although further processing steps are required, impurities such as host cell proteins or DNA/RNA fragments are significantly reduced in inclusion bodies. In consequence, nearly 70%–90% of the mass is represented by the recombinant protein, such that inclusion bodies can be considered an interesting high-level expression system with certain advantages (Valax and Georgiou, 1993; Singh and Panda, 2005; Ramón et al., 2014). However, proteins from inclusion bodies often lack biological activity due to their denatured state, so further bioprocessing steps such as solubilization, refolding, and purification are required for efficient recovery.

Recent articles have already studied the microporous structure of inclusion bodies (Bowden et al., 1991; Walther et al., 2013) and proposed a potential solubilization mechanism. Solubilization steps are usually performed in stirred reactors which facilitate the individual solvation of the proteins (Walther et al., 2014). Chemical denaturants are commonly used to achieve mild solubilization conditions. In more detail, standard chemical denaturants like urea or guanidinium hydrochloride usually dissolve the protein from inclusion bodies in combination with reducing agents like mercaptoethanol or dithiothreitol (DTT) (Clark, 1998; Singh and Panda, 2005). The reducing agents are mainly used for the reduction of disulfide bonds in terms of protein refolding aspects (Clark, 1998). Notably, the presence of strongly denaturing agents at high concentrations results in a further loss of the native structure (Smiatek, 2017; Oprzeska-Zingrebe and Smiatek, 2018). The subsequent refolding step usually aims to improve the low yields of bioactive proteins, such that optimal conditions for solubilization and refolding have been at the center of research in recent years (Freydell et al., 2007; Walther et al., 2022). Despite rational design of experiments (DoE) approaches in combination with machine learning techniques, the underlying mechanisms and the importance of individual features for the solubilization process are still only poorly understood (Walther et al., 2022).

In recent years, the use of machine learning has shifted slightly from pure prediction to understanding. Hence, more effort has been spent on the understanding of feature importances with regard to “explainable machine learning” approaches (Holzinger et al., 2018; Gunning et al., 2019; Kailkhura et al., 2019; Linardatos et al., 2020; Roscher et al., 2020; Belle and Papantonis, 2021; Burkart and Huber, 2021; Pilania, 2021; Oviedo et al., 2022). In agreement with global sensitivity analysis for parametric models (Sudret, 2008), recent methods like Shapley additive explanations (SHAP) (Lundberg and Lee, 2017), local interpretable model-agnostic explanations (LIME) (Ribeiro et al., 2016; Lundberg and Lee, 2017) or the “explain like I am 5” (ELI5) approach (Agarwal and Das, 2020) provide model-agnostic evaluations of feature importances with regard to the model outcomes. Although the purpose and general goal of these methods are similar, slight differences can be observed in their concepts. Global sensitivity analysis is mostly used for parametric models, whereby the weights of individual parameters are calculated by evaluating Monte Carlo simulations (Sudret, 2008). An extension of sensitivity analysis is the application of polynomial chaos expansions (Sudret, 2008). However, these approaches are computationally very intensive, so they are mostly applied to parametric models with fewer than ten influencing variables. In contrast, SHAP analysis is rooted in game theory: it connects optimal credit allocation with local explanations using the classic Shapley values from game theory and their related extensions (Lundberg and Lee, 2017). In more detail, Shapley values were introduced to quantify the contribution of individual players in a cooperative game. The Shapley values are determined by a weighted average calculation over all possible player orders. For each order of players, the marginal contribution of each player is calculated and multiplied by a weight that depends on the probability of this order. The Shapley values are then the sum of the weighted marginal contributions over all possible orders. In terms of explainable machine learning, the Shapley values are used to quantify the importance of individual features for the model predictions. In contrast to the SHAP approach, LIME attempts to explain model predictions at an instance level of the data set. An approximation in terms of a simplified model is developed for each selected instance (Ribeiro et al., 2016; Lundberg and Lee, 2017). After that, the model approximations are weighted based on their similarity to the original instance. All features that are relevant for the prediction of the model are taken into account. Thus, the weighted model approximations are used to generate an explanation for the prediction of the original instance. LIME thus enables a local interpretation of the predictions and helps to understand the decisions of a model on an instance level. In addition, ELI5 uses various techniques to determine the importance of each feature for the model predictions (Agarwal and Das, 2020). This can be done, for example, by calculating feature weights or by analyzing the feature contributions. Based on the identified feature importances, a comprehensible explanation for the model predictions is then evaluated.
By avoiding Monte Carlo methods as used in sensitivity analysis, SHAP analysis, LIME and ELI5 can evaluate significantly more input features and are more flexible in terms of their usage for non-parametric machine learning models.
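As a minimal illustration of such model-agnostic evaluations, the sketch below fits a regressor to synthetic data and computes permutation-based importance scores, which are conceptually similar to the ELI5-style feature weights mentioned above. All data and feature names are illustrative placeholders, not taken from this study.

```python
# Model-agnostic feature importance on a toy regression problem.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))                          # three toy features
y = 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = RandomForestRegressor(random_state=0).fit(X, y)

# Shuffle one feature at a time and record the drop in model score.
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
for name, imp in zip(["x0", "x1", "x2"], result.importances_mean):
    print(f"{name}: {imp:.3f}")                         # x0 dominates, x2 ~ 0
```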

In general, all explainable machine learning approaches can be applied for data-driven and non-parametric approaches which are systematically evaluated in order to understand the feature-target value correlations. Although it has to be noted that biopharmaceutical modelling is still dominated by standard parametric models (Smiatek et al., 2020), recent machine learning approaches already revealed the benefits of non-parametric evaluations for certain process steps (Yang et al., 2020; Smiatek et al., 2021a; Montano Herrera et al., 2022; Walther et al., 2022). Hence, it can be expected that the model-agnostic interpretation of correlations between feature and target values may provide some further insights into the molecular mechanisms of the solubilization process.

In this article, we present an explainable machine learning approach to study the solubilization of inclusion bodies in terms of molecular mechanisms. A series of experiments with systematic parameter variations for the total protein concentration, certain co-solute concentrations and pH values were conducted in order to evaluate their impact on the final yield values. The corresponding data are used for the training of different machine learning models. We show that the best model with the highest predictive accuracy can be further used for feature importance analysis in terms of SHAP values. The corresponding results provide meaningful insights for the development of an analytic theory which is in reasonable agreement with the experimental outcomes.

2 Experimental and computational details

The experimental data set included 188 values for the protein yield after solubilization with systematically varied feature values. A detailed description of the data set and the experimental protocols was already presented in Walther et al. (2022). In contrast to the previous publication, we explicitly focus on one unit operation. Hence, the consideration and optimization of coupled unit operations as studied in Walther et al. (2022) is not the purpose of this work. Instead, we aim to provide a reliable description and further understanding of the molecular mechanisms underlying the solubilization process by means of explainable machine learning approaches. More details on the experimental procedures can be found in the Supplementary Material.

The protein yield y is defined as

$$y = \frac{c_s}{c_t} \tag{1}$$

where ct denotes the total protein concentration and cs the concentration of solubilized proteins. As varying parameters in a DoE approach (Politis et al., 2017), we chose the pH value, the urea concentration cU, the total protein concentration ct, the DTT concentration cD and the guanidinium hydrochloride concentration cG. The corresponding parameters were varied independently over ct = (2–6) g/L, pH = 6–12, cD = (0.00–0.01) mol/L, cG = (0–1) mol/L and cU = (4.0–8.5) mol/L. The resulting yield values covered a range of y = 0.143–0.996. The pH values were transformed according to the relation (Landsgesell et al., 2017)

$$q(\mathrm{pH} - \mathrm{pI}) = 1 - \frac{1}{1 + 10^{\,\mathrm{pI} - \mathrm{pH}}} \tag{2}$$

which denotes the ratio of protonated titrable groups to the total number of titrable groups. The considered protein was an antibody fragment with an isoelectric point of pI = 8.4.
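A short sketch of how this preprocessing could look in code; the helper function, column labels and the random grid over the reported parameter ranges are our own illustrative choices, not the authors' implementation.

```python
import numpy as np
import pandas as pd

def q_protonated(pH, pI=8.4):
    """Fraction of protonated titrable groups according to Eq. 2."""
    return 1.0 - 1.0 / (1.0 + 10.0 ** (pI - pH))

# Illustrative random design over the reported parameter ranges (the actual
# study used a structured DoE with 188 runs).
rng = np.random.default_rng(1)
n = 188
df = pd.DataFrame({
    "ln_ct": np.log(rng.uniform(2.0, 6.0, n)),      # total protein conc.
    "c_D": rng.uniform(0.0, 0.01, n),               # DTT (mol/L)
    "c_G": rng.uniform(0.0, 1.0, n),                # GuHCl (mol/L)
    "c_U": rng.uniform(4.0, 8.5, n),                # urea (mol/L)
    "q": q_protonated(rng.uniform(6.0, 12.0, n)),   # transformed pH, Eq. 2
})
# X = df[["ln_ct", "c_D", "c_G", "c_U", "q"]] and a column of measured ln(y)
# values would complete the training data.
```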

For the choice of the best model, we evaluated different regression approaches; more detailed information can be found in the Supplementary Material. We then performed hyperparameter optimization for the histogram gradient boosting (HGB) model (Blaser and Fryzlewicz, 2016), which showed the highest prediction accuracy. The optimized hyperparameter settings were a learning rate of 0.1, a maximum number of iterations of 90 (i.e., the number of individual decision trees) and a minimum number of samples per leaf of 20, in accordance with the nomenclature of scikit-learn 1.0.1 (Pedregosa et al., 2011). The corresponding results for the hyperparameter optimization procedure are presented in the Supplementary Material.
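A minimal sketch of these settings mapped onto scikit-learn's estimator; we assume that the reported leaf setting corresponds to scikit-learn's min_samples_leaf parameter, and that X and y_ln denote the prepared feature matrix and ln(yield) targets from the preprocessing step.

```python
from sklearn.ensemble import HistGradientBoostingRegressor

hgb = HistGradientBoostingRegressor(
    learning_rate=0.1,    # reported learning rate
    max_iter=90,          # number of boosting iterations (decision trees)
    min_samples_leaf=20,  # assumed mapping of the reported leaf setting
)
# hgb.fit(X, y_ln) would train the optimized model.
```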

The source code was written in Python 3.9.1 (Van Rossum and Drake, 2009) in combination with the modules NumPy 1.19.5 (Harris et al., 2020), scikit-learn 1.0.1 (Pedregosa et al., 2011), XGBoost 1.6.0 (Brownlee, 2016), Pandas 1.2.1 (Wes McKinney, 2010) and SHAP 0.40.0 (Lundberg and Lee, 2017). If not noted otherwise, all methods were used with default values.

3 Theoretical background: machine learning and feature importance analysis

3.1 Machine learning and regression algorithms

The considered machine learning approaches can be divided into individual classes. An important model class includes the decision tree-based models such as Decision Trees (DT), Extra Trees (ET), Random Forests (RF), Gradient Boosting (GB), AdaBoost (ADA), Histogram-Based Gradient Boosting (HGB), Bagging (BAG) and Extreme Gradient Boosting (XGB). In general, decision tree-based models can be seen as non-parametric supervised learning methods which are often used for classification and regression. The value of a target variable is approximated by introducing simple decision rules, based on arithmetic mean values for regression approaches, as inferred from the data that represent the independent variables. The hierarchy of decision criteria forms different branches in terms of a tree-like structure. The various methods differ in the assumptions of their underlying models (Wakjira et al., 2022; Feng et al., 2021). In contrast to a single weak learner like DT, ensemble methods like ET, RF, GB, XGB, ADA, BAG and HGB consider an ensemble of different weak learning models. The main purpose of ensemble models is the combination of multiple decision trees to improve the overall performance. In more detail, ET and RF are both composed of a large number of decision trees, where the final decision is obtained by taking into account the prediction of every tree. In contrast to ET, RF uses bootstrap replicas and optimal split points for decision criteria, whereas ET considers the whole original data sample and randomly drawn split points. Overfitting is decreased by randomized feature selection for split selection, which reduces the correlation between the individual trees in the ensemble. Among the tree-based ensemble methods, a further distinction can be made between boosting and bagging approaches. Boosting approaches such as GB, XGB, HGB and ADA generate a weak prediction model at each step, which is added sequentially to the full ensemble model. Such a weighted approach reduces variance and bias, which improves the model performance. In contrast, bagging methods such as BAG generate weak single DT models in parallel: bagging, which stands for bootstrap aggregating, trains multiple weak learners in parallel and independently of each other. The final individual models are added to the ensemble by a deterministic averaging process which depends on the weights of accurate or inaccurate predictions. The individual models also differ in their definition and consideration of loss or objective functions for predictions in the training phase.
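A hedged sketch of how such a model-class comparison can be set up, assuming the feature matrix X and targets y_ln from the preprocessing sketch above; every estimator is used with default settings, as stated in the computational details.

```python
# Cross-validated comparison of the tree-based model classes named above.
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import (
    AdaBoostRegressor, BaggingRegressor, ExtraTreesRegressor,
    GradientBoostingRegressor, HistGradientBoostingRegressor,
    RandomForestRegressor,
)
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

models = {
    "DT": DecisionTreeRegressor(random_state=0),
    "ET": ExtraTreesRegressor(random_state=0),
    "RF": RandomForestRegressor(random_state=0),
    "GB": GradientBoostingRegressor(random_state=0),
    "ADA": AdaBoostRegressor(random_state=0),
    "BAG": BaggingRegressor(random_state=0),
    "HGB": HistGradientBoostingRegressor(random_state=0),
    "XGB": XGBRegressor(random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y_ln, cv=5,
                             scoring="neg_root_mean_squared_error")
    print(f"{name}: RMSE = {-scores.mean():.3f}")
```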

Statistical estimates for the predictive accuracy of the models are usually computed by the root-mean-squared error (RMSE) or the normalized root-mean-squared error (nRMSE) of the predictions. The corresponding predicted values Ŷn are compared with the actual experimental values Yn, where n denotes the running index, in terms of the associated RMSE value RMSE(Ŷ, Y) as defined by

$$\mathrm{RMSE}(\hat{Y}, Y) = \sqrt{\frac{\sum_{n=1}^{P} \left(\hat{Y}_n - Y_n\right)^2}{P}} \tag{3}$$

with the number of samples P = 188 in our data set. For estimating the model accuracy in comparison with the standard deviation of the target values, one can compute the normalized RMSE values nRMSE(Ŷ,Y) in accordance with

$$\mathrm{nRMSE}(\hat{Y}, Y) = \frac{\mathrm{RMSE}(\hat{Y}, Y)}{\sigma(Y, \bar{Y})} \tag{4}$$

with the experimental standard deviation σ(Y, Ȳ) = [(1/P) Σ_{n=1}^{P} (Yn − Ȳ)²]^{1/2} and the mean experimental target value Ȳ = (1/P) Σ_{n=1}^{P} Yn.
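Eqs. 3 and 4 translate directly into a few lines of code; y_pred and y_true are assumed arrays of predicted and experimental ln(yield) values.

```python
import numpy as np

def rmse(y_pred, y_true):
    """Root-mean-squared error, Eq. 3."""
    diff = np.asarray(y_pred) - np.asarray(y_true)
    return float(np.sqrt(np.mean(diff ** 2)))

def nrmse(y_pred, y_true):
    """RMSE normalized by the population standard deviation, Eq. 4."""
    return rmse(y_pred, y_true) / float(np.std(y_true))
```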

3.2 Feature importance analysis

SHAP value analysis as developed by Lundberg and Lee (Lundberg and Lee, 2017) is closely related to the Shapley values introduced in game theory (Shapley, 1953). In short, Shapley values provide estimates for distributing the gains or pay-outs fairly among the players (Molnar et al., 2020). Thus, SHAP analysis aims to rationalize the prediction of a specific value by considering the feature contributions. The individual feature values in the data set can be interpreted as players in a coalition game (Lundberg and Lee, 2017; Štrumbelj and Kononenko, 2014). In more detail, the algorithm works as follows (Lundberg and Lee, 2017). First, a subset S is randomly selected from all features F. The selected model is then trained on all feature subsets S ⊆ F. To estimate the effect of feature i, one model fS∪{i} is trained with and another model fS is trained without this feature. The predictions of the two models are compared on the current input, fS∪{i}(xS∪{i}) − fS(xS), where xS denotes the values of the input features in S. Since the effect of withholding a feature depends on the other features in the model, these differences are calculated for all possible subsets S ⊆ F \ {i}. The Shapley values are then calculated and assigned to the individual features. In more detail, this can be considered as a weighted average of all possible differences with reference to

$$\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,\left(|F| - |S| - 1\right)!}{|F|!} \left[f_{S \cup \{i\}}\left(x_{S \cup \{i\}}\right) - f_S\left(x_S\right)\right] \tag{5}$$

which shows that this approach assigns each feature an importance value that represents the impact of including that feature on the model prediction.
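For illustration, Eq. 5 can be transcribed literally into a brute-force routine that retrains a model for every feature subset. This is only feasible for a handful of features (the SHAP library uses far more efficient estimators), and the linear stand-in model is our own choice.

```python
from itertools import combinations
from math import factorial

import numpy as np
from sklearn.linear_model import LinearRegression

def _subset_pred(X, y, S, x):
    """Prediction of a model restricted to the feature subset S."""
    if not S:                               # empty coalition: mean target
        return float(np.mean(y))
    m = LinearRegression().fit(X[:, S], y)  # stand-in model f_S
    return float(m.predict(x[S].reshape(1, -1))[0])

def exact_shapley(X, y, x, i):
    """Brute-force Shapley value of feature i for instance x, Eq. 5."""
    F = list(range(X.shape[1]))
    others = [j for j in F if j != i]
    phi = 0.0
    for r in range(len(others) + 1):
        for S in map(list, combinations(others, r)):
            w = (factorial(len(S)) * factorial(len(F) - len(S) - 1)
                 / factorial(len(F)))
            phi += w * (_subset_pred(X, y, S + [i], x)
                        - _subset_pred(X, y, S, x))
    return phi
```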

In addition, the Gini feature importance analysis focuses on the mean decrease in impurity of decision tree-based models. For this, one calculates the total decrease in node impurity for a decision tree-based model, weighted by the probability of reaching that node. The probability of reaching a node can be approximated by the proportion of samples reaching it, averaged over all trees of the ensemble (Breiman et al., 1984). The Gini index then estimates the probability of a randomly chosen instance being misclassified. The higher the value of this coefficient, the higher the confidence that the particular feature splits the data into distinct groups.
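In scikit-learn, these mean-decrease-in-impurity values are exposed directly by the fitted ensemble; a short sketch under the same naming assumptions as in the previous snippets:

```python
from sklearn.ensemble import ExtraTreesRegressor

et = ExtraTreesRegressor(n_estimators=200, random_state=0).fit(X, y_ln)
# Impurity-based (Gini-type) importances, averaged over all trees.
for name, imp in sorted(zip(X.columns, et.feature_importances_),
                        key=lambda t: -t[1]):
    print(f"{name}: {imp:.3f}")
```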

4 Results

In the first subsection, we study the correlations between the target and feature values for the experimental data and present the outcomes of the machine learning approaches. A detailed analysis of the feature importances in terms of explainable machine learning approaches is presented in the second subsection. The corresponding insights allow us to rationalize a molecular mechanism in combination with an analytic expression in good agreement with the experimental results.

4.1 Correlation analysis and machine learning

A heatmap of all experimental correlations in terms of the Pearson correlation coefficient r between the target value ln y and all feature values for 188 data points is presented in Figure 1. Apart from the diagonal elements, notable correlations with |r| > 0.5 can only be identified for the logarithm of the total protein concentration ln ct. All other values satisfy |r| < 0.5, in accordance with negligible correlations. Thus, it can be concluded that cross-correlations between the feature values are of minor importance. The corresponding correlation coefficients for the individual feature correlations with ln y are r = −0.69 (ln ct), r = 0.02 (cD), r = 0.09 (cG), r = 0.18 (q(pH − pI)) and r = 0.28 (cU). All values are also visualized in the Supplementary Material. In consequence, non-vanishing positive correlation coefficients for ln y can only be identified for the actual urea concentration and the fraction of protonated titrable groups. In contrast, the total protein concentration ct shows a strong negative correlation, and the correlations for guanidinium hydrochloride and DTT are negligible. Thus, it can be concluded that the total protein concentration dominates the final yield values. The corresponding p-values from a Spearman rank-order correlation analysis are listed in the Supplementary Material. As can be seen, the p-values for the correlations between ln y and the features q(pH − pI), cU and ln ct are below 0.05. The corresponding values for cG and cD are quite high, which can be explained by their low concentrations, so that not many settings could be chosen independently. However, one can conclude that cG and cD have a minor effect on the values of ln y. In addition, higher p-values between the individual feature correlations are noticeable. This can be understood in terms of the preparation of the dataset from a design of experiments study, which focuses solely on the individual effects of the features on the target values.
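The correlation screen can be reproduced along the following lines, assuming a DataFrame df that holds the features together with a ln_y column of measured log-yields (the names are our own, carried over from the preprocessing sketch).

```python
from scipy.stats import spearmanr

features = ["ln_ct", "c_D", "c_G", "q", "c_U"]
# Pearson coefficients against ln(y), cf. the values quoted above.
pearson = df[features + ["ln_y"]].corr(method="pearson")["ln_y"]
print(pearson.round(2))

# Spearman rank-order correlations with p-values.
for f in features:
    rho, p = spearmanr(df[f], df["ln_y"])
    print(f"{f}: rho = {rho:.2f}, p = {p:.3g}")
```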

FIGURE 1. Pearson correlation coefficients r between feature values and ln y from the experimental data. The colors of the individual entries highlight the corresponding value of the Pearson correlation coefficient from r = 1 (blue) via r = 0 (white) to r = −1 (red).

As a next step, the corresponding supervised non-optimized machine learning and standard regression methods are assessed in order to predict the corresponding ln y values with regard to a k-fold cross-validation scheme including successive permutations of the training set (Gareth et al., 2013; Wong, 2015). As can be seen in the Supplementary Material, the highest predictive accuracy for a qualitative assessment of the non-optimized models is achieved for gradient boosting and decision tree-based methods. In more detail, histogram gradient boosting (HGB), extra trees (ET), gradient boosting (GB) and random forests (RF) show low nRMSE values between 0.19 and 0.21. The corresponding coefficients of determination between predicted and actual values vary between R2 = 0.72 and 0.75. Notably, ensemble boosting methods are ideally suited for non-linear regression problems like solubilization processes. Potential reasons for the high accuracy of decision tree-based models were recently published (Grinsztajn et al., 2022). In more detail, decision tree-based models do not overly smooth the solution in terms of predicted target values. Moreover, it was shown that uninformative features do not affect the performance metrics of decision tree-based models as much as those of other machine learning approaches. In terms of such findings, it becomes clear why decision tree-based models often outperform kernel-based approaches in terms of predictive accuracy. It has to be noted that ensemble-based methods are also often robust against overfitting issues (Dietterich, 2000). Hence, time-consuming hyperparameter tuning and validation checks are not of utmost importance.

In contrast to the good predictions of boosting models, standard linear regression methods like least-angle regression (LARS) or Lasso regression (LAS) show a rather poor performance (Supplementary Material). Interestingly, standard artificial neural networks (ANNs) also show a low predictive accuracy. It can be argued that further optimization of hyperparameters may improve the results. All other approaches show a reasonable or even good performance with nRMSE values between 0.22 and 0.26. As already discussed, the best performance can be observed for advanced decision tree-based ensemble models, which reveals the underlying influence of non-linear contributions.

The corresponding predicted and experimental values for ln y from the HGB method are presented in Figure 2. We used a k-fold cross-validation approach, where each training set consists of N − 1 data samples and one test data point from the total data set of N samples (Gareth et al., 2013; Wong, 2015). In more detail, each model Mj is trained with the feature data X = [x1, x2, …, xj−1, xj+1, …, xN] and the associated target data Y = [y1, y2, …, yj−1, yj+1, …, yN] and predicts the corresponding test target sample Yj from Xj. As can be seen, Xj and Yj are not part of the corresponding training data for model Mj. This procedure is repeated for all N models M1, M2, …, MN and the corresponding predictions. Notably, a good agreement between the predicted and experimental values over the whole range −1.6 < ln y < 0 becomes obvious. All values are located within the experimental standard deviation σ(ln y) as denoted by the dashed blue lines. The corresponding good agreement can be rationalized by the large amount of training data in this range. Notable deviations in terms of one outlier with nRMSE > 1 can only be observed for ln y ≤ −1.9 due to missing reference training data. Moreover, the good predictive accuracy is also highlighted by the reasonable value R2 = 0.75 for the coefficient of determination. As shown in the Supplementary Material, similar conclusions can also be drawn for the prediction of a test data set of 38 samples using an 80/20 split between training and test data. The RMSE values are of comparable quality. In consequence, it can be concluded that solubilization mechanisms and final yield values can be predicted with acceptable accuracy using machine learning approaches.
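The described scheme with N models, each trained on N − 1 samples, corresponds to leave-one-out cross-validation; a compact scikit-learn equivalent under the naming assumptions used above reads:

```python
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# One prediction per sample, each made by a model that never saw that sample.
y_hat = cross_val_predict(
    HistGradientBoostingRegressor(learning_rate=0.1, max_iter=90,
                                  min_samples_leaf=20),
    X, y_ln, cv=LeaveOneOut(),
)
print(f"R2 = {r2_score(y_ln, y_hat):.2f}")
```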

FIGURE 2. Predicted (HGB) and experimental values (Exp) for the logarithm of the yield ln y from the optimized HGB method. The straight line has a slope of one and the corresponding dashed blue lines reveal the experimental standard deviation σ(ln y). The lighter blue shaded regions demarcate a standard deviation of 2σ(ln y).

4.2 Feature importance analysis and explainable machine learning

Due to the reasonable predictive accuracy of certain machine learning models, it can be concluded that the evaluation of feature importances might provide further insights into the underlying mechanisms of solubilization. In terms of such considerations, we evaluated the HGB and the ET models on all training data with regard to the SHAP values. The results for the HGB model are shown in the Supplementary Material. As expected, the accuracies of the HGB and the ET model for the training data, with R2 = 0.91 (HGB), R2 = 0.99 (ET), nRMSE = 0.30 (HGB) and nRMSE = 0.10 (ET), are higher than for the predictions from the k-fold cross-validation approach, which supports the validity of the following feature analysis.

As a first step, the corresponding Gini feature importance values as calculated from the ET method are presented in Figure 3. As can be seen, the results confirm the dominant influence of the total protein concentration ct in accordance with the correlation coefficients shown in Figure 1, followed by the urea concentration cU and the fraction of protonated titrable groups q(pH-pI). With regard to rather low values, the DTT and guanidinium hydrochloride concentrations have negligible influences. It should be noted that the protein’s native structure is characterized by only a very small number of disulfide bonds. This property explains the vanishing influence as well as the relatively low concentration of DTT as a reducing agent in this context. Furthermore, it is well-known that guanidinium hydrochloride is a very potent destabilizing agent. However, in combination with urea it shows a rather complex aggregation behavior around the protein (Oprzeska-Zingrebe and Smiatek, 2021, 2022; Miranda-Quintana and Smiatek, 2021). Accordingly, higher concentrations of guanidinium hydrochloride could probably exert a stronger influence in terms of feature importance. However, it should be noted that this effect is represented here by the somewhat milder denaturation conditions in the presence of urea. In consequence, it can be concluded that the low feature importance of the guanidinium hydrochloride concentration can be rationalized by its corresponding low concentration. These results are also confirmed by the values from the ELI5 analysis which are shown in the Supplementary Material.

FIGURE 3. Gini feature importance values for predictions of the extra trees (ET) model for the protein yield ln y.

Similar conclusions can also be drawn from example decision pathways for an extra trees model with a maximum tree depth of 4, as shown in the Supplementary Material. It becomes obvious that the first decision criterion is a threshold of ln ct = 1.39 for the total protein concentration. This value separates the high- and low-yield branches, whereas further criteria based on individual values of q(pH − pI) and cU are of minor importance and thus lead to different subclassifications.
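Such decision pathways can be inspected directly; the following sketch prints the rules of one tree from a depth-limited extra-trees ensemble (our own illustration, reusing the assumed names from the earlier snippets).

```python
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.tree import export_text

et4 = ExtraTreesRegressor(max_depth=4, random_state=0).fit(X, y_ln)
# Text dump of the first tree: the root split acts on the dominant feature.
print(export_text(et4.estimators_[0], feature_names=list(X.columns)))
```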

The corresponding results for the SHAP value analysis with regard to the final yield values ln y from a HGB model are shown in Figure 4. The color-coded beeswarm plot with the associated contributions of SHAP values is depicted in the top panel. The beeswarm plot aims to provide an information-dense summary of the most important features in terms of model predictions for each data instance. Each data instance is represented by a single dot on each feature row. The horizontal position of the dot is determined by the SHAP value of the corresponding feature. In addition, the color illustrates the original value of the feature. Thus, the beeswarm plot highlights the importance or contribution of the features for the whole dataset. As already discussed, it becomes obvious that the largest influence is represented by the total protein concentration, followed by the urea concentration and the amount of protonated titrable groups. Moreover, it can be seen that the total protein concentration has a negative influence on the final yield value. Thus, low values of ct lead to positive SHAP values and vice versa. Such findings are unique to the total protein concentration, while all other factors show a positive correlation. In terms of a molecular understanding, it has to be noted that solubilization itself is a rather complex process due to contributions from interfacial phenomena, the composition of the solution as well as further intermolecular mechanisms. Moreover, the individual features and their contributions to the model predictions are ordered in terms of their importance from top to bottom. This ordering is calculated from the mean absolute SHAP value for each feature. In general, such an approach is strongly determined by the broad average impact of the feature, while rare maximum or minimum values do not contribute significantly. The corresponding mean absolute SHAP values are presented in the bottom panel of Figure 4 for reasons of consistency. It can clearly be seen that the total protein concentration dominates the feature importance, followed by the urea concentration and q(pH − pI). The contributions from cG and cD are of minor importance, in agreement with the p-values from the previous correlation analysis.
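The beeswarm and mean-absolute-SHAP summaries of Figure 4 can be generated along the following lines, assuming the fitted HGB model from the earlier sketch and a shap version (0.40 was used in this work) that supports scikit-learn's histogram gradient boosting:

```python
import shap

explainer = shap.Explainer(hgb, X)   # dispatches to a tree explainer here
sv = explainer(X)                    # Explanation object, one row per sample
shap.plots.beeswarm(sv)              # per-instance values (top panel)
shap.plots.bar(sv)                   # mean |SHAP| per feature (bottom panel)
shap.plots.scatter(sv[:, "ln_ct"])   # dependency plot, cf. Figure 5
```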

FIGURE 4. Color-coded beeswarm plot of individual SHAP values for the optimized HGB model (top panel) and mean absolute SHAP values as calculated for the target value ln y (bottom panel).

For a more detailed analysis, we present the corresponding SHAP value dependency plot for the total protein concentration in Figure 5. In general, SHAP dependency plots as shown in Figures 5, 6 highlight the effect of a single feature on the model predictions. Each dot corresponds to a single prediction from the dataset, and the position on the horizontal axis denotes the corresponding actual value of the feature. The vertical axis shows the SHAP value for that feature and its impact on the prediction. In general, the slope along the points in a dependency plot makes it possible to identify positive, negative or vanishing correlations with the corresponding target parameter. It can clearly be seen in Figure 5 that the aforementioned negative correlation (Figure 4) can be interpreted as an inverse relation between the total protein concentration and the final yield value ln y. Hence, for increasing total protein concentrations, one observes a linear decrease of the SHAP values. The corresponding mechanism can be explained as follows. In general, the yield is defined by

$$y = \frac{N_f}{N_t} \tag{6}$$

which corresponds to the ratio between the number of free (solubilized) protein chains Nf and the total number of chains Nt. With regard to the fact that inclusion bodies show a rather poor solubility, one can assume that individual inclusion bodies aggregate in order to form larger moieties. Hence, the dissolution of free chains mainly occurs at the solvent accessible surface area of these compounds in agreement with previous assumptions (Walther et al., 2013).

FIGURE 5. SHAP value dependency plot between the total protein concentration ln ct and the final yield value ln y as calculated from the optimized HGB method.

FIGURE 6. SHAP value dependency plot between the urea concentration cU and the final yield value ln y from the optimized HGB method.

The number of free protein chains can be written as Nf = cp RI^2 d, with the local concentration of proteins in the aggregated inclusion body cp, the corresponding spherical radius RI of the aggregate and the penetration or dissolution depth d. Moreover, it is assumed that the inner region of the compound remains unaffected by dissolution, such that Nt ∼ cp RI^3 with RI ≫ d. In consequence, we can rewrite Eq. 6 according to

$$y \sim \frac{R_I^2 d}{R_I^3} \sim \frac{d}{R_I} \tag{7}$$

which reveals that aggregated inclusion bodies with larger radii result in lower yield values. Furthermore, it is assumed that d is constant after a fixed time interval. With regard to the fact that the radius scales as RI = Nt^(1/3)/cp^(1/3), one obtains after insertion into Eq. 7 the following relation

$$y \sim \frac{d}{R_I} \sim d\, \frac{c_p^{1/3}}{N_t^{1/3}} \tag{8}$$

which clearly shows that the final yield value depends inversely on the total protein concentration Nt ∼ ct, as represented by ln y ∼ −(1/3) ln ct in agreement with Figure 5. Here, we assume that d is constant after a fixed time interval and that the aggregates adopt the highest packing fraction, leading to fixed values of cp.
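A short numerical sanity check of this scaling argument (our own illustration in arbitrary units) recovers the predicted exponent of −1/3 in the regime d ≪ RI:

```python
import numpy as np

c_p, d = 1.0, 0.05                         # packing density, shell depth (a.u.)
N_t = np.logspace(2, 5, 50)                # total number of chains ~ c_t
R_I = (N_t / c_p) ** (1.0 / 3.0)           # aggregate radius
y = 1.0 - (1.0 - d / R_I) ** 3             # dissolved shell fraction of a sphere

slope = np.polyfit(np.log(N_t), np.log(y), 1)[0]
print(f"d(ln y)/d(ln N_t) = {slope:.2f}")  # approx. -1/3 for d << R_I
```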

Moreover, it is well known that the presence of urea induces a structural destabilization of proteins. Recent articles rationalized this phenomenon with preferential binding and exclusion mechanisms (Smiatek, 2017; Oprzeska-Zingrebe and Smiatek, 2018; Miranda-Quintana and Smiatek, 2021). As can be seen in Figure 6, one observes increasing SHAP values and thus a positive trend of ln y for increasing urea concentrations. Moreover, it can be seen that for urea concentrations larger than 8 mol/L, a saturation behavior becomes evident. As an explanation, we refer to co-solute-induced destabilization effects as discussed in Smiatek (2017) and Oprzeska-Zingrebe and Smiatek (2018). In more detail, it is assumed that the ratio of destabilized to stable proteins Kcs can be written as

$$K_{cs} = K_0 \exp\left(a_{33}\, \Delta\nu_{23}\right) \tag{9}$$

with the derivative of the thermodynamic activity a33, the difference in the preferential binding coefficients Δν23 and the ratio of destabilized to stable proteins K0 in the absence of any co-solutes (Smiatek, 2017; Oprzeska-Zingrebe and Smiatek, 2018; Smiatek et al., 2018). For certain proteins, it was discussed that the partial molar volumes and the solvent-accessible surface area upon unfolding do not change significantly, such that Δν23 = cU ΔG23 (Smiatek et al., 2018; Krishnamoorthy et al., 2018a), and the previous relation can be approximated by

$$K_{cs} \approx K_0 \exp\left(a_{33}\, c_U\, \Delta G_{23}\right) \tag{10}$$

with the difference in the Kirkwood–Buff integrals ΔG23 (Oprzeska-Zingrebe and Smiatek, 2018). In addition to stable and destabilized proteins, one can apply the same relation to the fraction of dissolved and bound proteins (Smiatek et al., 2018). Under the assumption that Nf ≪ Nt, it thus follows that Kcs ∼ y and K0 ∼ y0. For high co-solute concentrations, it was further discussed that a33 → 0 due to stability conditions (Krishnamoorthy et al., 2018b). Hence, the exponential factor in Eq. 10 can be linearized according to

$$y \approx y_0 \left(1 + a_{33}\, c_U\, \Delta G_{23}\right) \tag{11}$$

which highlights the linear relation y ∼ cU between the urea concentration and the final yield value, in good agreement with Figure 6.

In our previous discussion, it was also highlighted that the fraction of protonated titrable groups has a positive influence on the final yield values. In accordance with Eq. 2, it becomes clear that q(pH − pI) decreases with increasing pH values. Moreover, it is known that the isoelectric point pI = 8.4 corresponds to a net-uncharged protein. The clear distinction between pH values below and above the pI value in terms of the SHAP dependency plot can be seen in Figure 7. In more detail, q(pH − pI) < 0.5, as represented by pH > 8.4, results in negative SHAP values and thus lower yields, and vice versa. The reason for this observation might be related to electrostatic repulsion between the charged groups of the protein (Smiatek et al., 2020). Thus, we assume that for high or low pH values, depending on the net amount of basic or acidic groups in the protein, the electrostatic repulsion fosters the dissolution of the protein in order to minimize non-favorable interactions. As also illustrated in Figure 7, significant contributions for a net-uncharged protein at pH = 8.4, leading to q(pH − pI) = 0.5, are absent.

FIGURE 7. Dependency plot of SHAP values for the fraction of titrable groups q(pH − pI) and the final yield value ln y for the optimized HGB model.

The combination of the previous considerations in terms of Eqs 2, 8, 11 results in

$$y \approx y_0 \left(1 + a_{33}\, c_U\, \Delta G_{23}\right) \frac{q(\mathrm{pH} - \mathrm{pI})}{c_t^{1/3}} \tag{12}$$

which condenses all previous results into one analytic expression. In order to assess the validity of Eq. 12, we plotted the experimental data points onto a master curve as shown in Figure 8. In order to minimize fluctuating electrostatic repulsions, we chose nearly constant and high pH values with pH ≥ 10 as well as high urea concentrations with cU ≥ 7.55 mol/L. The corresponding 27 data points are then scaled in terms of a normal distribution and compared to Eq. 12. As can be seen in Figure 8, the theoretically predicted values nicely follow the proposed scaling relation. It has to be noted that the corresponding influences for different proteins and modalities upon solubilization may vary, such that the obtained results are not generally transferable without proper assessment. Nevertheless, we have provided strong evidence that explainable machine learning approaches yield deeper insights into the molecular mechanisms and correlations of solubilization processes.
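A sketch of how such a master-curve comparison can be assembled; the lumped coefficient k standing in for a33ΔG23, the selection step and all variable names are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def yield_model(c_t, c_u, q, y0=1.0, k=0.1):
    """Analytic yield estimate according to Eq. 12 (k lumps a33 * dG23)."""
    return y0 * (1.0 + k * c_u) * q / c_t ** (1.0 / 3.0)

def standardize(v):
    """Scale to zero mean and unit variance, as for the master curve."""
    v = np.asarray(v, dtype=float)
    return (v - v.mean()) / v.std()

# Hypothetical selection of the high-pH, high-urea subset described above,
# followed by a comparison of standardized experimental and predicted yields:
# sel = (df["c_U"] >= 7.55) & (df["q"] <= q_protonated(10.0))   # pH >= 10
# y_th = yield_model(np.exp(df.loc[sel, "ln_ct"]), df.loc[sel, "c_U"],
#                    df.loc[sel, "q"])
# compare standardize(np.exp(df.loc[sel, "ln_y"])) with standardize(y_th)
```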

FIGURE 8. Theoretical estimates from Eq. 12 and actual values: The corresponding data points were selected for high urea concentrations and high pH values. All values are scaled in terms of normal distributions.

5 Discussion of results

In the previous sections, we developed a machine learning model to predict yield values based on a set of input features. We were able to show that an optimized histogram gradient boosting (HGB) model enables the most accurate predictions. The underlying data for training and testing of the model were obtained from a design of experiments study. Overall, the model predictions show sufficient accuracy. The general trends are reproduced despite some minor inaccuracies for certain outliers. The previous statistical analysis of the experimental data already showed that the correlation coefficients between most features and the target value do not reveal particularly significant correlations. Accordingly, we could already assume in advance that the models would only allow meaningful predictions to a certain extent. To compensate for this drawback, we applied methods of explainable machine learning. Although such approaches cannot increase the accuracy, the results provide fundamental insights into the feature importances for the model predictions. We observed that, in particular, the total protein concentration, the urea concentration and the degree of charge of the protein as set by the adjusted pH value are of decisive importance for the yield value predictions. The influence of the other co-solutes is negligible due to their low concentrations. As part of the explainable machine learning approach, we were therefore able to examine data-driven models on a scientific basis. Accordingly, we were able to formulate scientific hypotheses for the underlying mechanisms (Smiatek et al., 2021b), which provided a rationale for the observed feature importance values. These scientific hypotheses were then merged into Eq. 12, which allows for a formal mathematical description of the influence of various parameters on the yield values.

In general, it cannot be assumed that a single machine learning model is suitable for predicting the yields of different proteins in different solubilization procedures. The charges and the co-solute interactions of different proteins can differ so strongly that the observed trends may even reverse. Accordingly, the general transferability of machine learning models is usually not given, so that the resulting gain in knowledge is often small; experimental data must be recorded anew for each protein and its inclusion bodies, and a reduction of laboratory effort is therefore usually not achieved. However, basic principles can be recognized by means of the analytical equation and the corresponding scientific hypotheses. These principles may differ slightly for individual proteins, but can now be estimated in advance using the analytical description. Hence, important influencing factors can be postulated, especially when planning the experiments. We were also able to show that purely data-driven models can be subjected to scientific hypothesis formation, which makes the results more robust and more straightforward to understand.

6 Summary and conclusion

We studied the potential influences of certain process parameters on the solubilization of inclusion bodies in terms of explainable machine learning approaches. The corresponding final yield values after solubilization are crucially affected by the total protein concentration, the urea concentration and the amount of protonated titrable groups as set by the actual pH value. The models with the highest predictive accuracies are boosting ensemble-based approaches with nRMSE values around 0.19. The corresponding SHAP values show that the total protein concentration, the actual urea concentration as well as the fraction of protonated titrable groups dominate the final yield value. All other contributions like the DTT and guanidinium hydrochloride concentrations are of minor importance. A more detailed analysis of the SHAP dependencies also highlights an inverse relation between the total protein concentration and the yield values, in contrast to the urea concentration and the amount of protonated titrable groups.

Based on these explainable machine learning observations, we proposed an analytic expression to rationalize the findings. The inverse relation for the total protein concentration can be understood with regard to surface solvation effects, which scale inversely with the total protein concentration. The growing SHAP values for the urea concentration can be understood by the preferential binding and exclusion mechanisms for co-solutes. The direct interaction of urea molecules with the inclusion body thus favors dissolution mechanisms and hence larger yield values. Finally, larger fractions of protonated titrable groups result in stronger electrostatic repulsion effects between the proteins, which facilitate the dissolution of the inclusion body. The corresponding assumptions can be summarized in an analytic expression which shows reasonable agreement with the experimental data. Although the corresponding dependencies crucially rely on the nature of the protein and the inclusion bodies, our approach demonstrates a meaningful pathway towards a deeper understanding and optimization of solubilization conditions. It can be assumed that the underlying mechanisms vary through the individual contributions of the influencing factors for different proteins. Typical examples would be the influence of the pH value for different protein pI values as well as the importance of reducing agents such as DTT for different numbers of disulfide bonds. Nevertheless, it can be expected that the presented machine learning models in combination with feature analysis can make these slightly varying relationships interpretable with similar accuracy as in this study. Accordingly, our work highlights an exemplary and generic approach to understand in detail the phenomena of solubilization for individual proteins and solutions. The use of explainable machine learning thus allows us to develop models with high predictive accuracy and to gain deeper insights into the underlying correlations of the mechanisms. It has to be mentioned that the use of explainable machine learning does not increase the prediction accuracy of the model. However, the results of non-parametric models can be assessed and evaluated with regard to individual feature correlations. This procedure corresponds to the scientific method, so that the results of purely data-driven models can be translated into scientific hypotheses and made correspondingly falsifiable (Smiatek et al., 2021b). In this context, the use of explainable machine learning has allowed us to derive an analytical equation (Eq. 12). As we have shown, this analytical equation can also be derived separately from fundamental principles, with the SHAP analysis contributing profitably to the identification of these mechanisms. The advantage of such an equation lies in its falsifiability and its potential for transfer to other projects. It can thus be assumed that new projects can be started with prior knowledge, so that material can be saved and the workload reduced. In consequence, explainable machine learning provides a deeper process understanding and knowledge which is beneficial for different unit operations in biopharmaceutical manufacturing.

Data availability statement

The data analyzed in this study are subject to licenses/restrictions. Requests to access these datasets should be directed to cornelia.walther@boehringer-ingelheim.com or jens.smiatek@boehringer-ingelheim.com.

Author contributions

JS, CW, and MM conducted the study and analyzed the results. All authors contributed to the article and approved the submitted version.

Funding

This research was funded by Boehringer Ingelheim Pharma GmbH & Co. KG and Boehringer Ingelheim RCV GmbH & Co KG.

Acknowledgments

The authors thank Christina Yassouridis, Hermann Schuchnigg, I-Ting Ho, Milena Matysik, Liliana Montano Herrera, Ralph Guderlei, Michael Laussegger, and Bernhard Schrantz for valuable discussions.

Conflict of interest

Authors CW, MM, AF, A-LT, MV, and CB were employed by Boehringer Ingelheim RCV GmbH & Co. KG. Author A-LT was employed by Baxalta Innovations GmbH, Takeda Group. Authors AJ, EB, and JS were employed by Boehringer Ingelheim Pharma GmbH & Co. KG.

The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fceng.2023.1227620/full#supplementary-material

References

Agarwal, N., and Das, S. (2020). "Interpretable machine learning tools: a survey," in Proceedings of the 2020 IEEE Symposium Series on Computational Intelligence (SSCI), Canberra, ACT, Australia, December 2020, 1528–1534.

Belle, V., and Papantonis, I. (2021). Principles and practice of explainable machine learning. Front. Big Data 4, 688969. doi:10.3389/fdata.2021.688969

Blaser, R., and Fryzlewicz, P. (2016). Random rotation ensembles. J. Mach. Learn. Res. 17, 126–151. doi:10.5555/2946645.2946649

Bowden, G. A., Paredes, A. M., and Georgiou, G. (1991). Structure and morphology of protein inclusion bodies in Escherichia coli. Biotechnology 9, 725–730. doi:10.1038/nbt0891-725

Breiman, L., Friedman, J., Stone, C. J., and Olshen, R. A. (1984). Classification and regression trees. Boca Raton, FL: CRC Press.

Brownlee, J. (2016). XGBoost with Python: gradient boosted trees with XGBoost and scikit-learn. Vermont, VIC, Australia: Machine Learning Mastery.

Burkart, N., and Huber, M. F. (2021). A survey on the explainability of supervised machine learning. J. Artif. Intell. Res. 70, 245–317. doi:10.1613/jair.1.12228

Clark, E. D. B. (1998). Refolding of recombinant proteins. Curr. Opin. Biotechnol. 9, 157–163. doi:10.1016/s0958-1669(98)80109-2

Dietterich, T. G. (2000). "Ensemble methods in machine learning," in International workshop on multiple classifier systems (Berlin: Springer), 1–15.

Feng, D.-C., Wang, W.-J., Mangalathu, S., Hu, G., and Wu, T. (2021). Implementing ensemble learning methods to predict the shear strength of RC deep beams with/without web reinforcements. Eng. Struct. 235, 111979. doi:10.1016/j.engstruct.2021.111979

Freydell, E. J., Ottens, M., Eppink, M., van Dedem, G., and van der Wielen, L. (2007). Efficient solubilization of inclusion bodies. Biotechnol. J. 2, 678–684. doi:10.1002/biot.200700046

Gareth, J., Daniela, W., Trevor, H., and Robert, T. (2013). An introduction to statistical learning: with applications in R. Berlin: Springer.

Grinsztajn, L., Oyallon, E., and Varoquaux, G. (2022). Why do tree-based models still outperform deep learning on tabular data? https://arxiv.org/abs/2207.08815.

Gunning, D., Stefik, M., Choi, J., Miller, T., Stumpf, S., and Yang, G.-Z. (2019). XAI—explainable artificial intelligence. Sci. Robot. 4, 7120. doi:10.1126/scirobotics.aay7120

Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., et al. (2020). Array programming with NumPy. Nature 585, 357–362. doi:10.1038/s41586-020-2649-2

Holzinger, A., Kieseberg, P., Weippl, E., and Tjoa, A. M. (2018). "Current advances, trends and challenges of machine learning and knowledge extraction: from machine learning to explainable AI," in International cross-domain conference for machine learning and knowledge extraction (Berlin: Springer), 1–8.

Kailkhura, B., Gallagher, B., Kim, S., Hiszpanski, A., and Han, T. (2019). Reliable and explainable machine-learning methods for accelerated material discovery. NPJ Comput. Mat. 5, 108–109. doi:10.1038/s41524-019-0248-2

Krishnamoorthy, A. N., Holm, C., and Smiatek, J. (2018a). Influence of cosolutes on chemical equilibrium: a Kirkwood–Buff theory for ion pair association–dissociation processes in ternary electrolyte solutions. J. Phys. Chem. C 122, 10293–10302. doi:10.1021/acs.jpcc.7b12255

Krishnamoorthy, A. N., Oldiges, K., Winter, M., Heuer, A., Cekic-Laskovic, I., Holm, C., et al. (2018b). Electrolyte solvents for high voltage lithium ion batteries: ion correlation and specific anion effects in adiponitrile. Phys. Chem. Chem. Phys. 20, 25701–25715. doi:10.1039/c8cp04102d

Landsgesell, J., Holm, C., and Smiatek, J. (2017). Wang–Landau reaction ensemble method: simulation of weak polyelectrolytes and general acid–base reactions. J. Chem. Theory Comput. 13, 852–862. doi:10.1021/acs.jctc.6b00791

Linardatos, P., Papastefanopoulos, V., and Kotsiantis, S. (2020). Explainable AI: a review of machine learning interpretability methods. Entropy 23, 18. doi:10.3390/e23010018

Lundberg, S. M., and Lee, S.-I. (2017). A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 30. doi:10.48550/arXiv.1705.07874

Miranda-Quintana, R. A., and Smiatek, J. (2021). Electronic properties of protein destabilizers and stabilizers: implications for preferential binding and exclusion mechanisms. J. Phys. Chem. B 125, 11857–11868. doi:10.1021/acs.jpcb.1c06295

Molnar, C., Casalicchio, G., and Bischl, B. (2020). "Interpretable machine learning—a brief history, state-of-the-art and challenges," in Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Ghent, Belgium, September 2020, 417–431.

Montano Herrera, L., Eilert, T., Ho, I.-T., Matysik, M., Laussegger, M., Guderlei, R., et al. (2022). Holistic process models: a Bayesian predictive ensemble method for single and coupled unit operation models. Processes 10, 662. doi:10.3390/pr10040662

Oprzeska-Zingrebe, E. A., and Smiatek, J. (2018). Aqueous ionic liquids in comparison with standard co-solutes. Biophys. Rev. 10, 809–824. doi:10.1007/s12551-018-0414-7

Oprzeska-Zingrebe, E. A., and Smiatek, J. (2021). Interactions of a DNA G-quadruplex with TMAO and urea: a molecular dynamics study on co-solute compensation mechanisms. Phys. Chem. Chem. Phys. 23, 1254–1264. doi:10.1039/d0cp05356b

Oprzeska-Zingrebe, E. A., and Smiatek, J. (2022). Basket-type G-quadruplex with two tetrads in the presence of TMAO and urea: a molecular dynamics study. J. Mol. Struct. 1274, 134375. doi:10.1016/j.molstruc.2022.134375

Oviedo, F., Ferres, J. L., Buonassisi, T., and Butler, K. T. (2022). Interpretable and explainable machine learning for materials science and chemistry. Acc. Mat. Res. 3, 597–607. doi:10.1021/accountsmr.1c00244

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., et al. (2011). Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830. doi:10.5555/1953048.2078195

Pilania, G. (2021). Machine learning in materials science: from explainable predictions to autonomous design. Comput. Mat. Sci. 193, 110360. doi:10.1016/j.commatsci.2021.110360

Politis, S. N., Colombo, P., Colombo, G., and Rekkas, M. D. (2017). Design of experiments (DoE) in pharmaceutical development. Drug Dev. Ind. Pharm. 43, 889–901. doi:10.1080/03639045.2017.1291672

Ramón, A., Señorale, M., and Marín, M. (2014). Inclusion bodies: not that bad. Front. Microbiol. 5, 56. doi:10.3389/fmicb.2014.00056

Ribeiro, M. T., Singh, S., and Guestrin, C. (2016). Model-agnostic interpretability of machine learning. https://arxiv.org/abs/1606.05386.

Roscher, R., Bohn, B., Duarte, M. F., and Garcke, J. (2020). Explainable machine learning for scientific insights and discoveries. IEEE Access 8, 42200–42216. doi:10.1109/access.2020.2976199

Shapley, L. (1953). "Quota solutions of n-person games," in Contributions to the theory of games (Santa Monica, CA: The RAND Corporation).

Singh, S. M., and Panda, A. K. (2005). Solubilization and refolding of bacterial inclusion body proteins. J. Biosci. Bioeng. 99, 303–310. doi:10.1263/jbb.99.303

Singhvi, P., Saneja, A., Srichandan, S., and Panda, A. K. (2020). Bacterial inclusion bodies: a treasure trove of bioactive proteins. Trends Biotechnol. 38, 474–486. doi:10.1016/j.tibtech.2019.12.011

Smiatek, J. (2017). Aqueous ionic liquids and their effects on protein structures: an overview on recent theoretical and experimental results. J. Phys. Condens. Matter 29, 233001. doi:10.1088/1361-648x/aa6c9d

Smiatek, J., Clemens, C., Herrera, L. M., Arnold, S., Knapp, B., Presser, B., et al. (2021a). Generic and specific recurrent neural network models: applications for large and small scale biopharmaceutical upstream processes. Biotechnol. Rep. 31, e00640. doi:10.1016/j.btre.2021.e00640

Smiatek, J., Heuer, A., and Winter, M. (2018). Properties of ion complexes and their impact on charge transport in organic solvent-based electrolyte solutions for lithium batteries: insights from a theoretical perspective. Batteries 4, 62. doi:10.3390/batteries4040062

Smiatek, J., Jung, A., and Bluhmki, E. (2020). Towards a digital bioprocess replica: computational approaches in biopharmaceutical development and manufacturing. Trends Biotechnol. 38, 1141–1153. doi:10.1016/j.tibtech.2020.05.008

Smiatek, J., Jung, A., and Bluhmki, E. (2021b). Validation is not verification: precise terminology and scientific methods in bioprocess modeling. Trends Biotechnol. 39, 1117–1119. doi:10.1016/j.tibtech.2021.04.003

Štrumbelj, E., and Kononenko, I. (2014). Explaining prediction models and individual predictions with feature contributions. Knowl. Inf. Syst. 41, 647–665. doi:10.1007/s10115-013-0679-x

Sudret, B. (2008). Global sensitivity analysis using polynomial chaos expansions. Reliab. Eng. Syst. Saf. 93, 964–979. doi:10.1016/j.ress.2007.04.002

Valax, P., and Georgiou, G. (1993). Molecular characterization of β-lactamase inclusion bodies produced in Escherichia coli. 1. Composition. Biotechnol. Prog. 9, 539–547. doi:10.1021/bp00023a014

Van Rossum, G., and Drake, F. L. (2009). Python 3 reference manual. Scotts Valley, CA: CreateSpace.

Wakjira, T. G., Al-Hamrani, A., Ebead, U., and Alnahhal, W. (2022). Shear capacity prediction of FRP-RC beams using single and ensemble explainable machine learning models. Comp. Struct. 287, 115381. doi:10.1016/j.compstruct.2022.115381

Walther, C., Mayer, S., Sekot, G., Antos, D., Hahn, R., Jungbauer, A., et al. (2013). Mechanism and model for solubilization of inclusion bodies. Chem. Eng. Sci. 101, 631–641. doi:10.1016/j.ces.2013.07.026

Walther, C., Mayer, S., Trefilov, A., Sekot, G., Hahn, R., Jungbauer, A., et al. (2014). Prediction of inclusion body solubilization from shaken to stirred reactors. Biotechnol. Bioeng. 111, 84–94. doi:10.1002/bit.24998

Walther, C., Voigtmann, M., Bruna, E., Abusnina, A., Tscheließnig, A.-L., Allmer, M., et al. (2022). Smart process development: application of machine-learning and integrated process modeling for inclusion body purification processes. Biotechnol. Prog. 38, e3249. doi:10.1002/btpr.3249

McKinney, W. (2010). "Data structures for statistical computing in Python," in Proceedings of the 9th Python in Science Conference, Austin, TX, June 2010. doi:10.25080/Majora-92bf1922-00a

Wong, T.-T. (2015). Performance evaluation of classification algorithms by k-fold and leave-one-out cross validation. Pattern Recogn. 48, 2839–2846. doi:10.1016/j.patcog.2015.03.009

Yang, J., Knape, M. J., Burkert, O., Mazzini, V., Jung, A., Craig, V. S., et al. (2020). Artificial neural networks for the prediction of solvation energies based on experimental and computational data. Phys. Chem. Chem. Phys. 22, 24359–24364. doi:10.1039/d0cp03701j

Keywords: inclusion bodies, refolding, solubilization, SHAP (Shapley additive explanation), explainable machine learning

Citation: Walther C, Martinetz MC, Friedrich A, Tscheließnig A-L, Voigtmann M, Jung A, Brocard C, Bluhmki E and Smiatek J (2023) Solubilization of inclusion bodies: insights from explainable machine learning approaches. Front. Chem. Eng. 5:1227620. doi: 10.3389/fceng.2023.1227620

Received: 23 May 2023; Accepted: 20 July 2023;
Published: 07 August 2023.

Edited by:

Jin Wang, Auburn University, United States

Reviewed by:

Joel Paulson, The Ohio State University, United States
Gefei Chen, Karolinska Institutet (KI), Sweden

Copyright © 2023 Walther, Martinetz, Friedrich, Tscheließnig, Voigtmann, Jung, Brocard, Bluhmki and Smiatek. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Cornelia Walther, cornelia.walther@boehringer-ingelheim.com; Jens Smiatek, jens.smiatek@boehringer-ingelheim.com
