Improving N-Glycosylation and Biopharmaceutical Production Predictions Using AutoML-Built Residual Hybrid Models

N-glycosylation has many essential biological roles, and is important for biotherapeutics as it can affect drug efficacy, duration of effect, and toxicity. Its importance has motivated the development of mechanistic models for quantitatively predicting the distribution of N-glycans during therapeutic protein production. Here we present a residual hybrid modeling approach that integrates mechanistic modeling with machine learning to produce significantly more accurate predictions for the production of monoclonal antibodies in batch, fed-batch, and perfusion cell culture. For the largest dataset, the residual hybrid models have an average 736-fold reduction in testing prediction error. Furthermore, the residual hybrid models have lower prediction errors than the mechanistic models for all of the predicted variables in the datasets. We provide the automatic machine learning software used in this work, allowing other researchers to reproduce this work and use our software for other tasks and datasets.


Introduction
Glycosylation is a protein co-translational and post-translational modification that involves adding a glycan or glycans to proteins. N-linked glycosylation, a subtype of glycosylation, occurs when a glycan is added to the nitrogen of an asparagine or arginine. N-glycosylation contributes to many essential functional and structural roles [1,2,3]. Highlighting the importance of N-glycosylation, improper glycosylation or deglycosylation is associated with multiple diseases, including cancers [4], infections [5], and congenital disorders [6].
There is great interest in N-glycosylation from the biomedical and pharmaceutical industry, physicians, and patients because of its high therapeutic and diagnostic relevance. In many types of carcinoma, increases in fucosylation, branching, and sialylation occur [7]. Disialoganglioside is expressed by almost all neuroblastomas, and Phase I-III studies have shown that anti-disialoganglioside monoclonal antibodies can be successful against these tumors [8,9]. Poly-α2,8-sialylation, for example, increases the half-lives of antibodies but does not lead to tolerance problems [10]. Conversely, the presence of glycans foreign to humans can be detrimental to a therapeutic. N-glycolylneuraminic acid is immunogenic to humans [11] but is present in some CHO-cell-derived glycoproteins [12]. Despite these critical functions of N-glycosylation in biotherapeutic contexts and multiple developments in this field, such as the genetic engineering of CHO cells to increase glycoprotein sialylation [13], some challenges persist. Proteins can be glycosylated in multiple locations, so any investigations need to elucidate not only the glycan compositions but also where each glycan is located [7]. This structural and regional diversity makes it challenging to determine the specific functions of N-glycans [3].
In a diagnostic context, analyzing patient glycosylation samples is also challenging due to the complex equipment needed, which has limited the progress of personalized medicine [7].
To assist in better understanding and predicting N-glycosylation, many computational models have been created, which may be subdivided into mechanistic and data-driven models. Mechanistic models use physical knowledge, typically in the form of differential equations, to make predictions. They require little-to-no process data and always output answers that are constrained by physics; however, they also demand significant understanding of the system and can be slow [14], a problem alleviated by model simplifications [15,16]. Data-driven models directly leverage experimental data to make predictions. They require zero domain-based knowledge (although they can benefit from it), are typically fast once trained, and the more complex data-driven models are universal function approximators [17,18,14]; however, they require large amounts of high-quality data due to their high variance, can produce nonphysical outputs, and are typically not interpretable [14,19]. Most works focused on N-glycosylation, particularly those focused on predicting the distribution of N-glycans, use mechanistic models due to a lack of high-quality data and data-driven modeling knowledge. The literature on mechanistic models for the prediction of N-glycan distributions is extensive; some examples can be found in the reviews of Refs.
An alternative to mechanistic and data-driven models is the hybrid model. Hybrid models seek to combine these two types of models to create something with the advantages of both and the disadvantages of neither [14]. Although of high interest, the construction of hybrid models is not straightforward. First, they require knowledge of both mechanistic and data-driven modeling to be successfully implemented.
Second, hybrid learning encompasses many architectures and model integration methods, and there have not been systematic studies on how to determine the best hybrid learning method for a given problem or system. Multiple examples of these architectures can be found in Refs. [14,24,25]. One of those architectures is the Residual Hybrid Model, in which a data-driven model learns the residuals (prediction errors) of a mechanistic model [26]. Despite the simplicity of this hybrid architecture, it has found success in many scientific problems [26,27,24,14]. An illustration of this architecture is available in Fig. 1; other illustrations are also available in Fig. 1-A1 of Ref. [24] and Fig. 3 of Ref. [14]. In this work, we construct residual hybrid models by combining mechanistic models from the literature with Lasso-Clip-EN (LCEN) [28] and artificial neural network (ANN) models trained by us. The data-driven models are trained using an efficient automatic machine learning (AutoML) software developed by us that completely eliminates the need for the end-user to have any knowledge in data-driven modeling. This software is free and open-source. These hybrid models are first used to predict the distribution of N-glycans attached to antibodies produced by Chinese hamster ovary (CHO) cells under different culture conditions. Then, the models are used to predict metrics that are relevant to biopharmaceutical production in CHO cells, including the titer, the galactosylation index of the products, and the levels of different chemicals in the culture medium over time. These CHO cultures were done in perfusion, fed-batch, and batch bioreactors depending on the dataset. Our hybrid models reduce the average prediction error by 149-fold on independent test sets when compared to the mechanistic models, and always lead to a reduction in the average test-set prediction error on the datasets investigated in this work. For the largest dataset, residual hybrid models have a test-set error that is 736-fold smaller than that of the mechanistic model.
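As a minimal illustration of the residual hybrid architecture (a sketch with a hypothetical algebraic "mechanistic" model and synthetic data, not the differential-equation models or datasets used in this work), a data-driven regressor can be fit to the mechanistic model's residuals and its output added back at prediction time:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

def mechanistic_model(x):
    # Stand-in for a first-principles model (hypothetical; the real
    # mechanistic models in this work solve differential equations).
    return 2.0 * x[:, 0]

# Synthetic "experimental" data containing structure the mechanistic model misses
X = rng.uniform(0.5, 2.0, size=(50, 2))
y = 2.0 * X[:, 0] + 0.8 * np.log(X[:, 1])

# Step 1: compute the mechanistic model's residuals on the training data
residuals = y - mechanistic_model(X)

# Step 2: train a data-driven model on those residuals
residual_learner = Ridge(alpha=1e-3).fit(X, residuals)

# Step 3: hybrid prediction = mechanistic prediction + predicted residual
def hybrid_predict(x):
    return mechanistic_model(x) + residual_learner.predict(x)

mech_err = np.mean(np.abs(y - mechanistic_model(X)))
hyb_err = np.mean(np.abs(y - hybrid_predict(X)))
```

Because the residual learner only needs to capture what the mechanistic model misses, the hybrid's error is lower than the mechanistic model's even with a simple linear residual learner.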

Materials and Methods
This section describes the datasets and the methods used to train the residual hybrid models. Instructions on running our AutoML software or recreating this article's results are provided in the Supplementary Information and in our GitHub repo, available at https://github.com/PedroSeber/SmartProcessAnalytics.

Datasets
Four previously published works provide the data used to train the hybrid models in this work. The first dataset, from Ref. [29], comprises the levels of N-glycans on monoclonal antibodies produced in a CHO perfusion culture as a function of viable cell density and the concentrations of galactose and manganese ions. The second, from Ref. [30], comprises the levels of N-glycans on monoclonal antibodies produced in a CHO fed-batch culture, but as a function of pH, galactose concentration, and manganese ion concentration under different conditions and feeding strategies. This dataset also includes a time-based component, and galactose or manganese supplementation at specific time points was performed in some samples. The third, from Ref. [31], comprises the titer and galactosylation index of monoclonal antibodies produced in a CHO fed-batch culture as a function of feed galactose and uridine. Finally, the fourth, from Ref. [32], comprises the levels of different chemicals (such as ammonium, lactate, or aspartic acid) in the culture media over time. The culture was done in a batch bioreactor as detailed in Refs. [32] and [33].
Data are split between cross-validation (CV) and test sets. For the Ref. [29] dataset, the same split used in that work is used here, with 3-fold CV. For the Ref. [30] dataset, the same split used in that work is used here, with 4-fold CV. For the Ref. [31] dataset, points FS2, FS3, and FS7 are used as the test set: the first two because they are outside of the design space set by Ref. [31], and the last because the model of Ref. [31] had its highest prediction error on that sample; 4-fold CV is used. For the Ref. [32] dataset, all points in the death phase (t > 135 h) are used as the test set, and 5-fold time-series CV is used. These choices of test sets avoid test-set leakage by ensuring the test set is sufficiently different from the training/cross-validation set. For robustness, the CV procedure is repeated 10 times for each combination of hyperparameters.
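The repeated k-fold and time-series CV schemes described above can be sketched with scikit-learn's splitters (a simplified stand-in; our AutoML software's internal implementation may differ):

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold, TimeSeriesSplit

X = np.arange(40).reshape(20, 2)  # placeholder feature matrix

# Repeated CV: 4 folds, repeated 10 times with different shuffles,
# as used for the Ref. [30] and Ref. [31] datasets
rkf = RepeatedKFold(n_splits=4, n_repeats=10, random_state=0)
n_fits = sum(1 for _ in rkf.split(X))  # 4 folds x 10 repeats = 40 fits

# Time-series CV: 5 expanding-window splits, as used for the Ref. [32]
# dataset; each validation fold comes strictly after its training data
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, val_idx in tscv.split(X):
    assert train_idx.max() < val_idx.min()  # no future leakage
```

The ordering assertion in the loop is the key property that makes time-series CV appropriate for the death-phase test split: validation (and test) data always lie in the future relative to the training data.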

Data-driven and residual hybrid model creation
Ordinary least-squares (OLS), elastic net (EN), LCEN, and ANN models are directly trained on the data using our AutoML software to serve as a baseline (in addition to the respective mechanistic models).
OLS and EN models are constructed with scikit-learn [34], LCEN models are constructed as per Ref.
[28], and ANN models (specifically, multilayer perceptrons [MLPs] and recurrent neural networks [RNNs]) are constructed with PyTorch [35] within our AutoML software. Furthermore, LCEN and ANN models are trained on the residuals of the mechanistic models to create residual hybrid models. A list of the hyperparameters used for each model architecture is available in Section S2. The best combination of hyperparameters for each model and task is determined by grid search, and the combination with the lowest average cross-validation loss (averaged over 10 repeats) is selected. Errors on an independent test dataset are then reported.
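The hyperparameter selection step can be sketched with scikit-learn's grid search over a repeated CV splitter (an illustrative stand-in for our AutoML software, using a toy dataset and a much smaller grid than those listed in Section S2):

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, RepeatedKFold

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))
y = X @ np.array([1.0, -2.0, 0.0, 0.5]) + 0.1 * rng.normal(size=30)

# Small illustrative grid; the actual grids are given in Section S2
param_grid = {"alpha": [1e-3, 1e-2, 1e-1], "l1_ratio": [0.1, 0.5, 0.9]}

search = GridSearchCV(
    ElasticNet(max_iter=10_000),
    param_grid,
    cv=RepeatedKFold(n_splits=4, n_repeats=10, random_state=0),
    scoring="neg_mean_squared_error",  # lowest average CV loss wins
)
search.fit(X, y)
best_model = search.best_estimator_  # refit on all CV data with the best combination
```

Each hyperparameter combination is scored by its CV loss averaged over all 40 splits (4 folds x 10 repeats), and only the winning combination is refit and evaluated on the held-out test set.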

Residual hybrid models improve N-glycan distribution predictions
This section includes the results of the models trained with the datasets of Refs. [29] and [30].
Ref. [29] featured two forms of models: a mechanistic model that used differential equations (named "Mechanistic" in this work) and a response surface methodology (RSM) model with intercept, linear, and 2nd-degree interaction terms. These models provide good fits even on an independent test set (Figs. 6 and 7 of Ref. [29]), with the mechanistic model providing a better fit. Nevertheless, these models still had high errors for some non-minor glycan forms; for example, the mechanistic model had a 47.9% and a 9.31% relative error when predicting the amount of antibodies with high-mannose (HM) and FA2G2 glycosylation, respectively (Table 1). As per Section 2.2, we trained data-driven models to expand the set of models serving as a baseline, along with an MLP-based residual hybrid model. The OLS and EN models are very similar to the RSM model, but they lack the interactions present in the RSM model. Despite that, they provided similar (slightly worse) performance, indicating that the contribution of the interaction terms is limited yet not insignificant. LCEN had higher prediction accuracy than RSM for all four glycans, whereas the MLP model achieved the same on 3/4 glycans and had a 3.9% worse performance on the remaining glycan (Table 1). These results indicate that nonlinear terms can be important if they are not binary interaction terms, which is corroborated by how LCEN frequently selected features of the form √x, log x, and 1/(x_j x_k). Nevertheless, with the exception of the MLP model used to predict levels of HM, all of these data-driven models were inferior to the mechanistic model. The chief reason for these higher percent relative errors (PREs) is likely the scarcity of data, as only 8 points are available for model training. Despite this shortage of data and the high accuracy of the mechanistic model for most glycans, a residual hybrid model composed of the mechanistic model followed by an MLP always achieved a lower PRE than the mechanistic model (Table 1). On average, the residual hybrid model led to a 2.45-fold reduction in the relative errors of the mechanistic model. These results highlight how residual hybrid models are useful in predicting N-glycan distributions even when few data points have been collected and a strong mechanistic model is already in use. They also highlight the effectiveness of our AutoML method for training an MLP to succeed the mechanistic model in this hybrid architecture.
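The idea behind nonlinear feature expansion of the kind LCEN performs can be sketched as follows (a hypothetical, simplified feature set and toy data; see Ref. [28] for the actual LCEN algorithm, which also includes clipping and elastic-net steps): candidate transforms such as √x, log x, and inverse pairwise products are generated, and a sparsity-inducing model then selects among them.

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import Lasso

def expand_features(X):
    """Augment X with sqrt, log, and inverse pairwise-product terms,
    mirroring the kinds of features LCEN selected in this work."""
    cols = [X, np.sqrt(X), np.log(X)]
    for j, k in combinations(range(X.shape[1]), 2):
        cols.append((1.0 / (X[:, j] * X[:, k]))[:, None])
    return np.hstack(cols)

rng = np.random.default_rng(2)
X = rng.uniform(0.5, 3.0, size=(40, 3))  # strictly positive features
y = 2.0 * np.sqrt(X[:, 0]) + 1.0 / (X[:, 1] * X[:, 2])

X_aug = expand_features(X)  # 3 linear + 3 sqrt + 3 log + 3 inverse-product = 12
model = Lasso(alpha=1e-3, max_iter=50_000).fit(X_aug, y)
n_selected = int(np.sum(np.abs(model.coef_) > 1e-6))  # sparse selection
```

The L1 penalty drives most candidate coefficients toward zero, so the fitted model retains only the transforms that actually explain the data.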
To further validate the ability of residual hybrid models to achieve higher accuracy than mechanistic models, a second work with this type of data was used. Ref. [30] included only a mechanistic model, again using differential equations and named "Mechanistic" in this work. Ref. [30]'s mechanistic model had worse fits than that of Ref. [29], as the former had low errors only for FA2G0, average errors for FA2G1, and high errors for the other two glycans (Table 2). Once again, we trained data-driven models to expand the set of models serving as a baseline, along with an MLP-based residual hybrid model. Despite the slight increase in the amount of training data, the OLS and EN models performed poorly: their predictions were worse than predicting with the average of the training set. The only exception was the EN model trained to predict FA2G2 levels, which displayed a good performance that even surpassed the mechanistic model. LCEN had mixed results: it surpassed the training mean in 2/4 tasks and the mechanistic model in 2/4 tasks. Once again, these results highlight the importance of nonlinearities when predicting the distribution of N-glycans. The final data-driven model, an MLP, had the best performance of all models on this task: all of its predictions surpassed the train mean and the mechanistic model, reaching test-set prediction errors 1.5-fold smaller on average than those of the mechanistic model. The Mechanistic + MLP residual hybrid model also consistently made predictions with lower errors than the pure mechanistic model, reducing its errors by 1.2-fold on average. Surprisingly, the residual hybrid model was not as good as a pure MLP model on this dataset. We hypothesize this difference occurs because the mechanistic model for this dataset is not as accurate as the one in Ref. [29], because there are additional data available for training, and potentially because the different features and culture settings are more challenging for models that include mechanistic parts (including the pure mechanistic model).
These results again confirm the ability of our AutoML method to train strong-performing MLPs, as both models that included MLPs surpassed the other architectures.

Residual hybrid models also improve other important predictions for antibody production
Although predicting the distribution of N-glycans is an important task, there are also other values and metrics of interest in the production of antibodies. The dataset of Ref. [31] comprises antibody titer and galactosylation index measurements for CHO-cell-produced antibodies under different feed conditions (FS1-7). Ref. [31] also trained a mechanistic model based on differential equations, named "Mechanistic" in this work. Their model had medium-low errors for half of the titer predictions and most of the galactosylation index predictions (Table 3, top half). However, these errors are train-set errors, as Ref. [31] used all of their data points to train their model, so comparisons with independent test points are not available. We separated points FS2, 3, and 7 to form a test set: the first two because they are the out-of-specification points [31], and FS7 because it was the point for which the mechanistic model of Ref. [31] had the highest prediction errors. As per Section 2.2, we trained data-driven models to expand the set of models serving as a baseline, along with an MLP-based residual hybrid model. For all architectures, one set of models was trained for the titer prediction task and another for the galactosylation index task. Due to the low number of training samples (4) and the higher number of features (8) than samples, the OLS, EN, and LCEN models overfit when predicting antibody titers (despite the use of regularization), with zero error on all training-set predictions but higher errors than the mechanistic model on the test-set predictions. The MLP model performed considerably better and even surpassed the Mechanistic model on all test-set predictions (Table 3, top half). Finally, the residual hybrid model (again, the mechanistic model followed by an MLP) achieved the best overall performance, surpassing both the "Mechanistic" model of Ref. [31] and the purely data-driven MLPs. The residual hybrid model reduces the average test-set error of the mechanistic model by 1.8-fold on the antibody titer prediction task (Table 3, top half).
On the galactosylation index prediction task, the overfitting issue was less present, affecting only the OLS model. Nevertheless, this task was more challenging, as all data-driven models had a lower performance than the mechanistic model (Table 3, bottom half). As before, the MLP architecture achieved the best results, but its test-set errors were still higher than those of the mechanistic model.
Only the residual hybrid model returned more accurate predictions than the mechanistic model, and it did so for all data points (Table 3, bottom half). On average, the residual hybrid model also reduces the test-set error of the mechanistic model by 1.8-fold on the galactosylation index prediction task. These two results further corroborate the potential of residual hybrid models for tasks beyond predicting the distribution of N-glycans, and highlight the ability of our AutoML software to train powerful MLPs even in a data-scarce setting.
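For reference, the percent relative errors (PREs) and fold reductions reported throughout can be computed as follows (with purely illustrative numbers, not values from the tables):

```python
import numpy as np

def percent_relative_error(y_true, y_pred):
    """Mean absolute relative error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.mean(np.abs(y_pred - y_true) / np.abs(y_true))

# Illustrative values only
y_true = [10.0, 20.0]
y_mech = [12.0, 16.0]   # hypothetical mechanistic predictions
y_hyb = [10.5, 19.0]    # hypothetical hybrid predictions

pre_mech = percent_relative_error(y_true, y_mech)  # 20% PRE
pre_hyb = percent_relative_error(y_true, y_hyb)    # 5% PRE
fold_reduction = pre_mech / pre_hyb                # 4-fold reduction
```

A "1.8-fold reduction" thus means the hybrid's PRE is the mechanistic model's PRE divided by 1.8.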
The final task investigated involves predicting the concentration of nutrients and metabolites in the medium used to cultivate CHO cells. The dataset of Ref. [32] comprises the levels of 19 chemicals and the amount of biomass in the medium over a culture period of 215 hours. The chemicals include multiple amino acids, sugars, and ammonium. In addition to the large amounts of data gathered by Ref.
[32], this dataset is also distinct because each measurement of interest was collected over multiple time points, allowing time-series modeling to be done. Ref. [32] trained a flux-based kinetic mechanistic model (named "Mechanistic" in this work) consisting of 103 chemical reaction and transport equations (Table 1 of that work). This mechanistic model had varying performance depending on the prediction task. For example, it performed very well when predicting levels of glucose, lactate, valine, or isoleucine; however, it performed poorly when predicting levels of ammonium, alanine, and biomass. We trained data-driven and residual hybrid models based on the LCEN and RNN architectures to predict levels of lactate, ammonium, biomass, glutamate, aspartate, and asparagine. These dynamic models were trained to output next-hour levels based on the levels 1-5 hours prior, with the cutoff depending on the architecture and task.
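The inputs to these dynamic models can be sketched as lagged windows: each training sample pairs the levels at the previous hours with the next hour's level (a generic sketch with toy numbers; the actual lag cutoff varied between 1 and 5 hours by architecture and task).

```python
import numpy as np

def make_lagged_dataset(series, n_lags):
    """Build (X, y) pairs where each row of X holds the previous n_lags
    values of the series and y is the next value."""
    series = np.asarray(series, dtype=float)
    X = np.column_stack([series[i:len(series) - n_lags + i]
                         for i in range(n_lags)])
    y = series[n_lags:]
    return X, y

# Toy hourly concentration trace (illustrative, not data from Ref. [32])
trace = [1.0, 1.2, 1.5, 1.9, 2.4, 3.0, 3.7]
X, y = make_lagged_dataset(trace, n_lags=3)
# X[0] = [1.0, 1.2, 1.5] -> y[0] = 1.9
```

Any regressor (LCEN, an RNN, or a residual learner on top of the mechanistic model's one-step predictions) can then be trained on these (X, y) pairs.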
All data-driven and residual hybrid models had more accurate test-set predictions than the mechanistic model (Table 4). In addition, the residual hybrid models trained with a given architecture surpassed the pure data-driven model with the same architecture in all but one case. Overall, the best residual hybrid architectures reduced the test-set prediction error by 736-fold on average (Table 4) and were able to follow the experimental measurements with significant accuracy for both the training and testing periods (Fig. 2). These tests further validate the capabilities of residual hybrid models in yet another context, even though the mechanistic model they were based on had limited performance for some metabolites.

Discussion
This study constructs residual hybrid models from literature data on the distribution of N-glycans, properties relevant to antibody production, and concentrations of metabolites in CHO cell culture. These datasets not only were built for different tasks (all relevant to the production of biopharmaceuticals) but also consist of different features, including culture conditions and even a purely autoregressive dataset.
The hybrid models combine data-driven models trained in this work with mechanistic models from the literature. As a comparative baseline, purely data-driven models are also trained and tested on the same datasets. The residual hybrid models significantly and consistently had higher prediction accuracy than the mechanistic models, and outperformed the data-driven models in most tasks. Among the four datasets used in this work, residual hybrid models reduce the test-set prediction error of the corresponding mechanistic models by 149-fold on average and always lead to reductions in test-set error for all the predicted variables in the datasets.

Figure 2: Experimental values and predictions for the mechanistic model (from Ref. [32]) and the best residual hybrid model ("Mechanistic + LCEN" or "Mechanistic + RNN", this work) for selected metabolites from the Ref. [32] batch culture dataset. The vertical black dotted line separates the training period (t ≤ 135 h) from the test period.
The first two datasets (from Refs. [29] and [30]) involve predicting the distribution of (major) N-glycans based on the viable cell density, levels of metabolites (such as Mn and Gal) in the culture, and pH (Section 3.1). Two models were trained by Ref. [29] on their dataset: a mechanistic model and an RSM model.
The mechanistic model obtained much better results on that dataset. As a baseline, other data-driven models were trained by us. Only an MLP model was able to surpass the mechanistic model, and only for one of the four N-glycans (Table 1). On the other hand, a residual hybrid model consisting of the mechanistic model of Ref. [29] followed by an MLP trained by us was able to reduce the relative errors by 2.45-fold when compared to the mechanistic model, and reductions in prediction errors occurred for all N-glycans (Table 1). Ref. [30] trained a mechanistic model on another dataset for the same task, which we also use to further validate the ability of residual hybrid models to lower prediction errors.

Figure 1: The residual hybrid model architecture. ε_Mech refers to the residuals (errors) of the predictions made by the mechanistic model. ŷ refers to the predictions made by a given model architecture.

Table 1: Test-set percent relative errors (PREs) for different model architectures predicting the levels of each major N-glycan on the dataset of Ref. [29]. Models "Mechanistic" and "RSM" are from Ref. [29]; their PREs are obtained from the published data within. Models "OLS", "EN", "LCEN", and "MLP" are data-driven models from this work used as baselines. Model "Mechanistic + MLP" is a residual hybrid model from this work. "Train Mean" is the mean of the training data. The lowest PREs are highlighted in bold.
Table 2: Test-set percent relative errors (PREs) for different model architectures predicting the levels of each major N-glycan on the dataset of Ref. [30]. Model "Mechanistic" is from Ref. [30]; its PREs are obtained from the published data within. Models "OLS", "EN", "LCEN", and "MLP" are data-driven models from this work used as baselines. Model "Mechanistic + MLP" is a residual hybrid model from this work. "Train Mean" is the mean of the training data. The lowest PREs are highlighted in bold.

Table 3: Test-set percent relative errors (PREs) for different model architectures predicting titers (top table) or galactosylation indices (bottom table) for each feed strategy on the dataset of Ref. [31]. Model "Mechanistic" is from Ref. [31]; its PREs are obtained from the published data within. Note that all data points, including FS2, 3, and 7, were used to train the "Mechanistic" model in Ref. [31], so the PREs are train-set values for this model only. Models "OLS", "EN", "LCEN", and "MLP" are data-driven models from this work used as baselines. Model "Mechanistic + MLP" is a residual hybrid model from this work. "Train Mean" is the mean of the training data. The lowest mean test PREs are highlighted in bold.

Table 4: Average test-set percent relative errors (PREs) for different model architectures predicting metabolite and biomass levels on the dataset of Ref. [32]. Model "Mechanistic" is from Ref. [32]; its PREs are obtained from Fig. 6 of that work. Models "LCEN" and "RNN" are data-driven models from this work used as baselines. Models "Mechanistic + LCEN" and "Mechanistic + RNN" are residual hybrid models from this work. "Train Mean" is the mean of the training data. The lowest PREs are highlighted in bold.