Comparing mechanism-based and machine learning models for predicting the effects of glucose accessibility on tumor cell proliferation

Glucose plays a central role in tumor metabolism and development and is a target for novel therapeutics. To characterize the response of cancer cells to blockade of glucose uptake, we collected time-resolved microscopy data to track the growth of MDA-MB-231 breast cancer cells. We then developed a mechanism-based, mathematical model to predict how a glucose transporter (GLUT1) inhibitor (Cytochalasin B) influences the growth of the MDA-MB-231 cells by limiting access to glucose. The model includes a parameter describing dose dependent inhibition to quantify both the total glucose level in the system and the glucose level accessible to the tumor cells. Four common machine learning models were also used to predict tumor cell growth. Both the mechanism-based and machine learning models were trained and validated, and the prediction error was evaluated by the coefficient of determination (R2). The random forest model provided the highest accuracy predicting cell dynamics (R2 = 0.92), followed by the decision tree (R2 = 0.89), k-nearest-neighbor regression (R2 = 0.84), mechanism-based (R2 = 0.77), and linear regression model (R2 = 0.69). Thus, the mechanism-based model has a predictive capability comparable to machine learning models with the added benefit of elucidating biological mechanisms.

The dysregulation of cellular energetics is one of the hallmarks of cancer 1 . The altered metabolic activities (e.g., consumption and utilization of glucose by tumor cells), varying between aerobic glycolysis and oxidative phosphorylation, demonstrates tumor cell adaptation to the microenvironment to fulfill its need for energy and materials for biosynthesis 1 . Therefore, glucose plays a central role in driving tumor metabolism and development. Tumor cell proliferation in response to the availability of glucose has become an intense area of investigation to determine the underlying mechanisms of tumor cell metabolism and development, as well as identify potential therapeutic targets for altering the uptake and consumption of glucose. Targets of interest include (for example) glucose transporters 2 , lactate transporters 3 , and the enzymes hexokinase and pyruvate kinase of the glycolytic pathway 4,5 . In the present study, we are particularly interested in the glucose transporters as they directly influence intracellular glucose concentration which is the starting material for both glycolysis and oxidative phosphorylation.
Mechanism-based mathematical models are playing an increasingly important role in cancer biology and oncology, as they can describe observations from experiments and clinical trials, provide predictions of tumor development and response to therapy, guide experimental design, identify new hypotheses, and optimize treatment delivery [6][7][8][9][10][11][12][13][14][15] . There have been a number of attempts to construct models to simulate and understand the development of tumor subpopulations, the dynamics of extra-and intra-cellular nutrient concentrations, nutrient

Materials and methods
Cell culture and treatment with glucose uptake inhibitor Cytochalasin B. We obtained the triple negative breast cancer cell line, MDA-MB-231 33 , from the American Type Culture Collection (ATCC, Manassas, VA) and maintained them in culture according to manufacturer recommendations. We seeded the MDA-MB-231 cells in ninety-six well-plates covering multiple initial confluences (30-90%) in Dulbecco's modified eagle medium (DMEM without glucose, sodium pyruvate, HEPES, L-glutamine and phenol red, Thermo Fisher Scientific, Waltham, MA) the day prior to image acquisition began. The cells were incubated in the IncuCyte S3 live cell imaging system (Essen BioScience, Ann Arbor, MI) at 37 °C for 4 days, supplied with 5% CO 2 and air. Before image acquisition began, we changed the cell culture medium to DMEM with designated glucose (0.5 mM, 1 mM, 2 mM, 5 mM, and 10 mM, Sigma-Aldrich, St. Louis, MO) and Cytochalasin B (0 μM, 2 μM, or 10 μM, Sigma-Aldrich, St. Louis, MO) concentrations. (Cytochalasin B competitively inhibits the transport of glucose with high efficiency as a glucose uptake inhibitor 34 .) After the medium change, we added Cytotox Red (Essen BioScience, Ann Arbor, MI) to the medium to monitor cell death. (Cytotox Red enters the cell plasma through the disintegrated membrane of dead cells and then binds to DNA to emit fluorescent signals.) Analysis of the time-lapse fluorescent images allowed the quantification of cell death over time. We prepared four replicates for each initial condition.
With the IncuCyte S3 live cell imaging system, we collected phase-contrast and red fluorescent (excitation wavelength: 585 nm and emission wavelength: 635 nm) images of whole-wells every 3 h for 4 days. While details Image processing for cell segmentation. We used Matlab (The Mathworks, Inc., Natick, MA) to perform cell segmentation and generate the data for our modeling, including the time-resolved confluence curves from each well for both live and dead cells. Briefly, we masked the region of interest (ROI) and converted RGB (red, green, blue) images to grayscale for images from both channels. For phase-contrast images we binarized the image based on pixel-wise signal intensity. For fluorescent images we applied a sliding block for edge detection and a Gaussian filter for area smoothing then normalized and binarized the images. Complete details are provided in our previous work 26 . Mechanism-based model. We begin with the most parsimonious model selected from our previous study, which used a function of glucose level (details on model selection and derivation are provided in ref. 26 ) to model the temporal change of tumor cell number and extend it to include the effect of treatment with Cytochalasin B. Our complete model is described by a series of coupled ODEs to calculate the number of live, N(t), and dead cells, D(t), at any given time point, t, as shown below: All model functions, variables, and parameters are described in Table 1. Additionally, the reader is encouraged to consult Figs. 1 and S1 for a brief and detailed, respectively, overview of our approach.
As Cytochalasin B is a glucose uptake inhibitor, we introduced G acs (t) to describe the glucose level accessible to cells (i.e., the effective glucose concentration within the extracellular environment, due to the impact of the glucose uptake inhibitor) at time t, in comparison to G total (t), which is the actual glucose level in the system (i.e., the extracellular glucose concentration) at time t. The three terms on the right-hand side of Eq. (1) describe logistic tumor cell growth, tumor cell death induced by glucose depletion, and tumor cell death induced by the bystander effect 26,35,36 , respectively. Death can be induced in the remaining live cells with factors 35,36 released by dead cells into the environment. We considered this phenomenon as a manifestation of the bystander effect and included it in the model 26 as cell death was consistently underestimated when only considering glucose depletion. Equation (2) describes the accumulation of dead cells from both sources of death (i.e., glucose depletion and the bystander effect). Equation (3) describes the change of total glucose concentration due to the consumption Machine learning models. The prediction of tumor cell growth can be formulated as a supervised learning task; that is, as a regression problem given a set of input features. In this study, we considered the number of both live and dead tumor cells at time zero (i.e., N(t = 0) and D(t = 0)), initial glucose concentration (i.e., G total (t = 0)), Cytochalasin B dose, and the time of measurement as input features. The number of tumor cells at any given time point for both live and dead cells (i.e., N(t) and D(t)) are the prediction targets. We employed four commonly used machine learning algorithms: multivariate linear regression, k-nearestneighbor regression, decision tree regression, and random forest regression. Multivariate linear regression seeks to estimate the result of a variable of interest, using a number of explanatory variables 37 . It is a parametric approach as it assumes a linear functional form mapping from the input variables to the output variable(s). While parametric approaches are generally easy to fit with a small set of coefficients, they rely on a strong assumption about the form of the linear function, which may not always be correct. Therefore, we also explored nonparametric approaches that do not assume an explicit functional form, thereby allowing greater flexibility. The first non-parametric approach we introduced was the k-nearest-neighbor method 38,39 . This algorithm calculates a weighted average (equal to the inverse of the distance) of the k-nearest-neighbor as an estimate for numerical variables. While the non-parametric approaches are more flexible, they can often be difficult to understand and interpret biologically. Therefore, we included a decision tree regression approach 40 , which takes observations about a sample (represented as the branches) to estimates of the target variable (represented as the leaves). Decisions trees are more straightforward to interpret as they follow simple decision rules constructed depending on the data features. Finally, we investigated random forest regression 41,42 which is an ensemble learning method that trains multiple decision-tree models to achieve better performance in prediction compared to results from a single model [43][44][45] . This algorithm employs bootstrap aggregating (or bagging 46 ) to generate multiple random samples with replacement to train multiple decision trees and then uses the average prediction from each tree to yield a final prediction. Random forest regression can correct for overfitting [47][48][49] resulting from a decision trees analysis 50 , but at the expense of a more difficult to interpret single decision tree.
All machine learning models were implemented in Jupyter Notebooks with the Python library Scikit-learn. To implement the models, the number of cells were first normalized to a value between 0 and 1 (i.e., confluence) by scaling to the carrying capacity (Table 1). Splitting the timeseries data into a training and validation set requires special consideration to avoid data leakage 51 (i.e., when future information is included in the training set, but similar data is not available when the model is used for prediction, resulting in unrealistically high performance in training and potentially poor generalization on unseen data). Our data was split into a training set containing 75% of the samples and a separate validation set containing the remaining 25% (more details can www.nature.com/scientificreports/ be found in section "Training and validation"). Models were trained only on the training set to estimate model parameters. The trained model was then evaluated on the validation set. As our interest lies in comparing with the mechanism-based model, it is crucial to be consistent on the features we use, properties that we can measure or control in experiments. Therefore, we only used the five features described at the beginning of this section (i.e., N(t = 0), D(t = 0)), G total (t = 0), Cytochalasin B dose, and the time of measurement) and did not apply further feature selection or dimension reduction. Default hyperparameters as provided by the Scikit-learn library were used for these algorithms and no parameter tuning was performed, allowing us to evaluate their basic performance on our dataset and establish a baseline comparison with the mechanism-based model.

Model calibrations.
Throughout the following section, the reader is encouraged to refer to Figs. 1 and S1. Figure 1 provides an overview of the methodology used in the mechanism-based model, while Fig. S1 provides a more detailed schema of the whole calibration and prediction process. We used two datasets for our model calibration. Dataset A was obtained from our previous work 26 and consists of 120 pairs of confluence time courses for live and dead cells from each sample. In Dataset A, the cells were seeded at low, intermediate, and high initial confluences with 10 initial glucose concentrations (0, 0.1, 0.2, 0.5, 0.8, 1, 2, 5, 8, and 10 mM) and were not treated with Cytochalasin B. Dataset B was obtained from new experiments and consist of 180 pairs of confluence time courses for live and dead cells. In Dataset B, the cells were seeded at low, intermediate, and high initial confluences with five initial glucose concentrations (0.5, 1, 2, 5, and 10 mM). Of the 180 wells imaged, 60 wells were not treated, 60 wells were supplied with 2 μM Cytochalasin B, and 60 wells were supplied with 10 μM Cytochalasin B. In both datasets, there were four replicates for each initial condition. Note that for the mechanism-based model, the cell confluences (of both live and dead cells) are multiplied by the carrying capacity θ (i.e., the maximum number of cells that can physically fit in the area, Table 1) and converted into the number of live and dead cells (N(t) and D(t), respectively).
Calibration of the mechanism-based model with Dataset A. The model described in section "Mechanism-based model" (i.e., Eqs. (1)-(6)) was first calibrated to experimental data (i.e., time-resolved measurements of tumor cell number for live and dead cells) from Dataset A, using the initial glucose level and confluence as the initial conditions ( Fig. S1, blue arrows, step 1). It was assumed (based on the results from previous work 26 ) that the proliferation rate, k p , the consumption rate of glucose, v, and the glucose depletion induced death rate, k d , were characteristics of the cell line, while the bystander effect induced death rate, k bys , was dependent on initial glucose level. Therefore, we obtained estimates for three global parameters (i.e., a set of parameters shared and applicable across the entire dataset; k p , k d , and v) and a local parameter (i.e., a parameter specific for a given sample; k bys ) for each of the 120 time courses in Dataset A. The confidence intervals for all three parameters were also computed.
Establishing relationships between mechanism-based model parameters and initial glucose level. The estimates of k bys obtained from section "Calibration of the mechanism-based model with Dataset A" were fit to Eq. (7): where k bys,0 is the maximum k bys rate when there is sufficient glucose, α represents the dependence on the initial glucose level, G 0 represents the initial glucose level, and β represents a baseline offset (Fig. S1, orange arrows, step 2). This follows the approach developed in our previous work 26 , assuming k bys presents an exponential decay depending on the initial glucose concentration. With the estimates of these three parameters, Eq. (7) can be used to estimate k bys for a new initial condition (i.e., Dataset B) not contained in the training dataset (i.e., Dataset A). With new initial conditions from Dataset B, Eq. (6) was used to represent the initial accessible glucose level G acs,0 , and it is substituted for G 0 in Eq. (7). k bys is therefore represented as a function of G in for each sample in Dataset B (Fig. S1, green arrows, step 3).
Calibration of the mechanism-based model with Dataset B to reduce the parameter space for G in . The model described in section "Mechanism-based model" was calibrated to experimental data (180 pairs of time-resolved measurements of tumor cell number for live and dead cells) from Dataset B, using the initial glucose level and confluence as the initial conditions (Fig. S1, red arrows, step 4). In this calibration, the three global parameters k p , k d , and v, were replaced with the estimates from section "Calibration of the mechanism-based model with Dataset A", while the local parameter k bys was obtained from step 3 as described in section "Establishing relationships between mechanism-based model parameters and initial glucose level". In our modeling, we have the ability to model G in as a global parameter (i.e., one value for all samples included in the experiments) or a local parameter (i.e., one value for each sample), or somewhere in between those two extremes. Through our simulation experiments, we found having two values, one for each Cytochalasin B concentration, yielded the best model performance. With this approach we can avoid having too much flexibility in the model and risk overfitting, but also make sure the model is capable of capturing the trends apparent in the data. Therefore, in step 3 we only estimate the inhibition constant G in as a global parameter, assuming k bys still follows the patterns observed in our previous study 26 ; i.e., k bys decreases with increasing initial glucose level following an exponential decay. We obtained an estimate of G in for cells treated with 2 μM and 10 μM of Cytochalasin B, respectively, assuming the inhibition is dose dependent. In this step k bys was considered as a known parameter (fixed for each sample, assigned with values from Eq. (7)), instead of a free parameter that needs calibration, to reduce the model flexibility and therefore the parameter space of G in (i.e., to narrow down the potential values of G in ). Confidence intervals of G in were also obtained. www.nature.com/scientificreports/ Calibration of the mechanism-based model with Dataset B to estimate k bys and G in . The model described in section "Mechanism-based model" was calibrated to experimental data (180 pairs of time-resolved curves of tumor cell number for live and dead cells) from Dataset B, using the initial glucose level and confluence as the initial conditions (Fig. S1, purple arrows, step 5). The three global parameters k p , k d , and v, were replaced with their estimates from section "Calibration of the mechanism-based model with Dataset A". However, as opposed to the approach described in section "Calibration of the mechanism-based model with Dataset B to reduce the parameter space for Gin", k bys was treated as a free local parameter and estimated together with the global parameter G in , with the confidence intervals obtained from section "Calibration of the mechanism-based model with Dataset B to reduce the parameter space for Gin" serving as the bounds for estimating G in .
Validation of the relation between local k bys and accessible glucose concentrations. The estimates of G in obtained from the last calibration performed according to section "Calibration of the mechanism-based model with Dataset B to estimate kbys and Gin", along with the initial total glucose level G total (t = 0) and number of live cells N(t = 0) were used to calculate the initial accessible glucose level G acs (t = 0) using Eq. (6). The estimates of k bys as a local free parameter from the same calibration were compared to both G total (t = 0) and G acs (t = 0) to determine if they follow an exponential decay as we discovered in our previous work 26 . If the pattern is again observed, then the estimates of k bys were fit to Eq. (7) to update the estimates of k bys,0 , α, and β ( Fig. S1, brown arrows, step 6).
Evaluation of the performance of the mechanism-based model. We introduced five criteria, namely the coefficient of determination (R 2 ), mean percent error over the time course, percent error at the end of experiment, mean error over the time course, error at the end of experiment (more details on these errors are described in Table S1) to evaluate model performance and compare across different approaches.
Prediction with the mechanism-based model. Given new initial conditions of glucose level, confluence, and dose of Cytochalasin B, we can predict tumor cells with the calibrated model (Fig. S1, cyan arrows, step 7). The initial accessible glucose level G acs (t = 0) can be calculated using Eq. (6) and then used in Eq. (7) to calculate k bys . With the initial confluence and initial accessible glucose level, we can solve Eqs. (1)-(6) to run the model forward and obtain time-resolved curves of live and dead tumor cells.

Validation of the accessible glucose level.
We sought to validate the assumption that the glucose level accessible to the tumor cells was lower than the actual glucose level in the system, and then provide an estimate of the accessible glucose level for the cells treated with Cytochalasin B. To perform this test, we compared the confluence time courses of the tumor cells with different initial glucose levels but with no exposure to Cytochalasin B. The concordance correlation coefficient (CCC) was employed as a measurement of the agreement between the two sets of time-resolved curves. We hypothesized that cells growing in environments with similar glucose levels would present similar growth curves. Therefore, we compared time-resolved cell number curves of tumor cells with treatment to time-resolved cell number curves of tumor cells without treatment and with similar initial confluences, but with different initial glucose levels. The growth curve without treatment that yielded the highest CCC when compared to the growth curve with treatment provided the best estimate of the accessible glucose level for the treated cells.

Training and validation.
Training and test sets. The data set was randomly split into training (75%; n = 225) and validation sets (25%; n = 75). The training set was used to develop the predictive models and the testing set was used as an independent data set for model validation. The splitting for Datasets A and B was performed separately. The training set A train (75%; n = 90) and the validation set A valid (25%; n = 30) were divided from Dataset A, while the training set B train (75%; n = 135) and the validation set B valid (25%; n = 45) were divided from Dataset B.
Training and validation for the mechanism-based model. The model described in section "Mechanism-based model" (Eqs. (1)- (6)) was calibrated to A train to obtain estimates for the global parameters k p , k d , and v, as well as estimates for the local parameter k bys via the approach described in section "Calibration of the mechanism-based model with Dataset A". The estimates of k bys were fit to Eq. (7) with the associated initial glucose concentration from A train via the approach described in section "Establishing relationships between mechanism-based model parameters and initial glucose level" to yield estimates of k bys,0 , α, and β. Equations (1)-(6) were first calibrated to B train with the approach described in section "Calibration of the mechanism-based model with Dataset B to reduce the parameter space for Gin" to obtain a preliminary estimate of G in with its confidence interval. Equations (1)-(6) were calibrated to B train again with the approach described in section "Calibration of the mechanism-based model with Dataset B to estimate kbys and Gin" to obtain estimates of both G in , as a global parameter, and k bys , as a local parameter, with the confidence interval obtained from first calibration serving as the bounds for estimating G in . The estimates of k bys were fit to Eq. (7) to estimate k bys,0 , α, and β with the approach described in section "Validation of the relation between local kbys and accessible glucose concentrations". The initial conditions of each sample from A valid and B valid were used in the forward model to predict the tumor cell growth over time with the approach described in section "Prediction with the mechanism-based model" and compared with the experimental measurements in dataset A valid and B valid . The training and validation were repeated 50 times following the steps described above; a new random train-test split was performed each time to generate the data sets needed. Each time, we calculated the R 2 and errors listed in section "Evaluation of the www.nature.com/scientificreports/ performance of the mechanism-based model" for model performance evaluation. We report the average model performance over the 50 rounds of training and validation.
Training and validation for machine learning models. The training and validation process were repeated 50 times. In each round, A train and B train (generated as described in section "Training and test sets") were combined as the training set for the machine learning models. Similarly, A valid and B valid (generated as described in section "Training and validation for the mechanism-based model") were combined as the validation set for the machine learning models. In each round, each of the four machine learning models were trained on the combined training set. Predictions were obtained using the initial conditions in the combined validation set, after the machine learning models were trained. The predictions from the initial conditions given in the validation set were compared with the observations in the validation set. In each round, we calculated the coefficient of determination (R 2 ) to for model performance evaluation. We report the average model performance (with 95% confidence interval) over 50 rounds of training and validation.
Learning curves for the machine learning models. For each round of training and validation, the impact from the size of the training set was studied. The four machine learning models were trained on training sets with size increasing from 5 to 100% (with a step size of 5%) of the complete training set (randomly selected) and the model performance on both training and validation sets were evaluated. This process was repeated for 50 rounds.
Statistical analysis. Tumor cell growth time courses were obtained from four experiments for each set of initial conditions, and each point in each time course consisted of a mean ± 95% confidence interval (a onesample Kolmogorov-Smirnov test confirmed normality). Two-way ANOVA was used to compare the average number of live cells for each experiment at the end of day 4 between the groups with different initial conditions. A Student's t-test was used to test for the statistical difference of estimates for G in between the two Cytochalasin B concentrations. One-way ANOVA was used to compare the performance (R 2 ) of all five models. The explicit error metrics used to evaluate the mechanism-based model performance can be found in Table S1. The CCC was used to measure the agreement between two time-resolved curves. Bonferroni correction was applied to control the family-wise error rate ("FWER") when comparing the performance between any two models, where each hypothesis is tested at a significance level of the selected alpha divided by the number of hypothesis tests to guarantee that the probability of having one or more false positives is less than alpha.

Results
Observations of tumor cell growth from time-resolved microscopy. We present representative time-resolved confluence curves of live and dead cells ( Fig. 2A,E,I), along with a set of time-resolved images showing tumor cell growth and death on day 0 (Fig. 2B,F,J), day 2 ( Fig. 2C,G,K), and day 4 (Fig. 2D,H,L). The cells were untreated ( Fig. 2A-D), treated with 2 μM Cytochalasin B (Fig. 2E-H), or treated with 10 μM Cytochalasin B (Fig. 2I-L). Notice that the tumor cells in the first row (untreated) expanded with minimum cell death (the confluence of dead cells increased by 4.1%) over 4 days, while cells treated with Cytochalasin B (the second and third rows) expanded more slowly with more cell death (the confluence of dead cell increased by 9.9% and 15.4% for 2 μM and 10 μM of Cytochalasin B, respectively). Maximum cell death was observed for the highest dose. Observe how, in the third row, the confluence of dead cells at the end of experiment was 21.2% with the highest dose compared to 6.0% with no treatment and 13.4% with a lower dose.
Tumor cell growth with different initial conditions and treatment. Figure 3 displays time courses of the MDA-MB-231 cells with different initial confluences (i.e., seeding density) and glucose levels (0.5, 1, 2, 5, and 10 mM). Samples with low initial confluence ( Fig. 3 first row), intermediate initial confluence (Fig. 3 second row), and high initial confluence ( Fig. 3 third row), are shown in separate rows. In each panel, the time courses of cells not treated with Cytochalasin B (blue), treated with 2 μM of Cytochalasin B (orange), and treated with 10 μM of Cytochalasin B (green) are plotted. For each initial condition, the change (mean ± 95% confidence interval) on the number of live cells (as percentage) from day 0 to day 4 are shown in Table 2. Cells that did not receive treatment displayed a negative percentage change in the number of live cells for high initial confluence (80.3 ± 1.7%) or low initial glucose concentration (0.5 mM and 1 mM) at the end of day 4. Conversely, positive percentage changes in the number of live cells were observed for cells with low initial confluence (45.5 ± 1.2%) and higher initial glucose concentration (2 mM or higher), or cells with intermediate initial confluence (67.1 ± 0.2%) and higher initial glucose concentration (5 mM or higher). Cells treated with Cytochalasin B, regardless of the initial glucose concentration or initial confluence, also showed a negative percentage change in the number of live cells. For cells with the same initial confluence and initial glucose concentration, those treated with a higher dose (10 µM) displayed a higher percentage decrease in the number of live cells, in comparison to cells treated with a lower dose (2 µM). For cells with the same initial confluence, the average number of live cells at the end of experiment was significantly different between the groups with different initial glucose level or treatment conditions (p < 1e − 4, two-way ANOVA). Local k bys trends versus accessible glucose concentrations. As described in section "Calibration of the mechanism-based model with Dataset B to estimate kbys and Gin", the estimates of k bys as a free local parameter were obtained by fitting the experimental data (time-resolved curves of tumor cell number for both live and dead cells) to the model (i.e., Eqs. (1)- (6)). With the estimates of the inhibition constant G in reported in the previous section, the accessible initial glucose concentrations were calculated using Eq. (6) given the initial confluence and initial total glucose level. The death rate due to the bystander effect, k bys , was found to decrease as the total glucose level increased with a Pearson partial inverse correlation of − 0.41 (p < 1e−4). When we consider the accessible glucose level, k bys was found to decrease as the accessible initial glucose level increased (Fig. 4B), with a partial correlation coefficient of − 0.73 (p < 1e − 4). We also confirmed the relation between k bys and the accessible initial glucose level discovered in previous work 26 (Fig. 4A).

Model calibrations. Estimates of inhibition constant for Cytochalasin
Fitting quality of the extended model. As described in section "Model calibrations", Eqs. (1) -(6) were fit to the experimental data. The errors listed in section "Evaluation of the performance of the mechanism-based model" are reported in Table 3. The model was able to provide an accurate description of the time-resolved curves of live and dead tumor cell number across various initial conditions covering a wide range of glucose levels and seeding densities. Both the mean percent error over the entire experiment and the percent error at the end of experiment, were < 1% for both live and dead cells. Both the mean error over the entire experiment and error at the end of experiment, were < 0.3% for both live and dead cells. Example results of model fittings to the measured data are shown in Fig. 5. www.nature.com/scientificreports/ Accessible glucose level. With the model calibrated, we were able to calculate the total glucose level in the system as well as the glucose level accessible to the tumor cells. The accessible glucose levels were reduced compared to the total glucose levels, depending on the administered dose of Cytochalasin B. We compared the difference between the total glucose level and the accessible glucose level over time (Fig. 6). For cells with the same initial glucose level, the differences were always higher in cells treated with the 10 µM of Cytochalasin B compared to cells treated with 2 µM of Cytochalasin B. For cells treated with the same dose, the differences were always higher for cells with higher initial glucose level. The differences decreased over time as the cells consume glucose. For cells treated with the higher dose, the differences decreased at a lower rate. For example, the difference for cells  www.nature.com/scientificreports/ with 10 mM initial glucose and treated with 10 µM Cytochalasin B decreased from 9.6 to 6.2 mM, while cells with 10 mM initial glucose and treated with 2 µM Cytochalasin B decreased from 9.5 to 5.2 mM. Without a direct measurement of the accessible glucose level, we compared the confluence time courses of cells supplied with a high glucose level but treated with Cytochalasin B to cells supplied with different glucose levels but not treated with Cytochalasin B (Fig. 7). The reported CCCs quantify the similarity between two confluence time courses. The confluence time courses of cells initially supplied with 10 mM glucose and treated with 2 μM Cytochalasin B exhibit the highest CCC with cells initially suppled with 0.5 mM glucose (CCC = 0.88) and cells supplied with 1 mM glucose (CCC = 0.89). Conversely, cells supplied with 2 mM and 10 mM glucose had much lower CCC values (0.37 and 0.05, respectively). This suggests the accessible glucose level for cells supplied with 10 mM glucose and treated with 2 μM Cytochalasin B was close to 1 mM. The confluence time course of cells initially supplied with 10 mM glucose and treated with 10 μM Cytochalasin B presents the highest CCC with cells suppled with 0.5 mM glucose (CCC = 0.96). Conversely, cells initially supplied with 1 mM, 2 mM, and 10 mM glucose present lower CCC values (0.71, 0.28, and 0.04, respectively) when compared to cells initially supplied with 0.5 mM glucose. This suggested the accessible glucose level for cells supplied with 10 mM glucose and treated with 10 μM Cytochalasin B was close to 0.5 mM. Using this approach, we confirmed that the accessible glucose level for cells treated with the 10 μM of Cytochalasin B was lower than cells treated with 2 μM of Cytochalasin B.
Evaluation of the performance of the mechanism-based model through training and validation. Table 4 summarizes the model performance in the training stage. The average mean percent error over the entire experiment span, and the average percent error at the end of experiment for live cells were both < 3%. Although the average mean percent error over the entire experiment span, and the percent error at the end of experiment for dead cells were over 18%, the average mean errors over the entire experiment span and average error at the end of experiment for both live and dead cells were < 3.5%. This suggests the higher percent error  7)) using Dataset A. The light blue (no treatment), orange (treated with 2 μM Cytochalasin B), and green (treated with 10 μM Cytochalasin B) dots with error bars represent the average of estimates for k bys (with 95% confidence intervals), calibrated with the extended model using Dataset B. The same pattern of k bys decreasing with increasing initial (accessible) glucose level was identified. The fitted curve in panels A and B, respectively, is used to assign k bys as a function of initial confluence and (accessible) glucose concentration. Table 3. Summary of fitting quality with the extended model (Eqs. (1)- (7)).

Live Dead
Mean percent error 0. 25  www.nature.com/scientificreports/ of dead cells can be explained by the small number of dead cells in comparison to the number of live cells; for example, an error of 1% in the confluence of dead cells would result in a 20% percent error for a sample with confluence of dead cells at 5%. Table 5 reports the model performance in the validation stage. The average mean percent error over the entire experiment, and the average percent error at the end of experiment for live cells were both < 2.5%. Although the average mean percent error over the entire experiment, and the percent error at the end of experiment for dead cells were over 19%, the average mean error over the entire experiment and average error at the end of experiment for both live and dead cells were < 4.5% for both live and dead cells. Again, this indicates that the higher percent error of dead cells can be explained by the small number of dead cells in comparison to the number of live cells.
Comparison of model performance between mechanism-based and machine learning models. The coefficient of determination (R 2 ) was used to compare the performance across the mechanism-based  Table 6. The model performance for predicting live cells alone, dead cells alone, or both live and dead cells were compared. The R 2 value of the mechanism-based model calibrated on the training data set was 0.9627 ± 0.0000, while the values for the decision tree model, random forest model, k-nearest-neighbor regression model, and the linear regression model were 1.0000 ± 0.0000, 0.9991 ± 0.0000, 0.9283 ± 0.0014, and 0.7027 ± 0.0019, respectively. The random forest model performed the best at predicting tumor cell growth on the validation set with an R 2 of 0.9260 ± 0.0028, followed Figure 6. Comparison of total and accessible glucose levels from model calibration. Panel A presents the change of glucose level over time for cells without treatment. In this case, the total glucose level is the same as the accessible glucose level. Panels B and C present the differences between the total and accessible glucose levels for cells treated with 2 μM and 10 μM Cytochalasin B, respectively. In each panel, the same treatment options were applied, and cells supplied with different initial glucose levels (0.5, 1, 2, 5, and 10 mM) are shown in different colors. The differences decreased over time as the cells consume glucose. For cells treated with a higher dose, the differences decreased at a lower rate. www.nature.com/scientificreports/ by the decision tree, k-nearest-neighbor regression, mechanism-based, and linear regression models with R 2 values of 0.8892 ± 0.0046, 0.8408 ± 0.0075, 0.7696 ± 0.0130, and 0.6903 ± 0.0064, respectively. The differences of R 2 values from any two models are significant (p < 1e − 5, one-way ANOVA). Bonferroni correction was applied to control the family-wise error rate ("FWER"); since all comparisons showed p < 1e − 5, we could guarantee the probability of having one or more false positives is less 1e − 4. With this conservative Bonferroni correction, we conclude with confidence that the difference between any two models is significant. Figure 8 presents representative prediction results from the mechanism-based model and four machine learning models ( Fig. 8 first three rows), as well as the scatter plot of prediction against measurement for each of the five models ( Fig. 8 last row). The random forest model performed the best for both live and dead cells, regardless of whether the confluence was low, intermediate, or high. The mechanism-based model presented an intermediate performance compared to the other models, with higher error when the confluence was very low or very high. Learning curves (Fig. 9) were used to evaluate the performance of machine learning and check for potential overfitting. When we increased the size of training set from 5 to 100% of the complete training set, the score on the validation set increased from 0.4863 ± 0.0849 to 0.6903 ± 0.0051 (+ 41.9%) for the Linear Regression model, from 0.2552 ± 0.0259 to 0.8408 ± 0.0061 (+ 229.5%) for the K-Nearest Neighbor regression model, from 0.4149 ± 0.0398 to 0.8892 ± 0.0037 (+ 114.3%) for the Decision Tree model, and from 0.5798 ± 0.0266 to 0.9260 ± 0.0023 (+ 59.7%) for the Random Forest model.

Discussion
This study sought to extend the mathematical model we had previously developed 26 that predicts tumor cell growth depending on the availability of glucose so we can apply it to investigate the effects of interventions by study design to perturb glucose accessibility. Specifically, we applied Cytochalasin B, an inhibitor for glucose uptake, as the treatment to investigate. We targeted the uptake of glucose in an extension of our model to avoid further complicating the system with the mechanism describing glucose consumption post-treatment. To achieve this goal, we proposed to model the total glucose level in the system and the accessible glucose level to the tumor cells. We assumed the consumption of glucose and all the metabolism-related tumor cell growth only depends on the intracellular glucose availability and not the total glucose level in the environment. Therefore, the accessible glucose level at time t, G acs (t), replaced the glucose level at time t, G(t), in the original system of coupled ODEs. As a result, we needed an additional equation to relate the accessible and total glucose levels which we achieved by introducing an inhibition constant depending on the dose of the inhibitor. In the resulting (extended) model (i.e., Eqs. (1)- (7)), the proliferation rate, the glucose depletion induced death rate, the consumption rate of glucose, and the newly introduced inhibition constant were considered as global parameters, while the death   Table 6. Comparison of performance between the mechanism-based and machine learning models. www.nature.com/scientificreports/ rate induced by the bystander effect (quantifying the effects of dead cells accumulated in the environment) was considered as a local parameter that would vary as a function of initial conditions. Time-resolved curves of tumor cell number for both live and dead tumor cells with different initial conditions, with or without treatment, were generated from time-resolved microscopy images for model calibration and validation. While we would expect the effect of Cytochalasin B to be stronger at lower extracellular glucose levels, this is not always clearly reflected in the confluence curves of treated tumor cells. Recall that the cell confluence is a net effect of both cell proliferation and death-with two sources (glucose depletion and bystander effect) contributing to cell death. When the extracellular glucose level is lower, there is (in general) more cell death (regardless of Cytochalasin B level) compared to high glucose levels and, therefore, more contribution from the third term on the right-hand side of Eq. (1). Thus, the effect of Cytochalasin B, although higher, is diluted by the contribution from the bystander effect when we compare the net change in the cell confluence. We investigated the relationship between the bystander effect induced death rate (k bys ) and the initial glucose concentration. A negative correlation (partial correlation coefficient = − 0.41) between k bys and the total initial glucose level was found. However, when we investigated the correlation between k bys and the accessible glucose level, a stronger negative correlation (partial correlation coefficient = − 0.73) was found. This pattern matched our previous finding 26 (i.e., a partial correlation coefficient of − 0.72 between k bys and the initial glucose level), with the initial glucose level replaced by the initial accessible glucose level.  www.nature.com/scientificreports/ Numerous mathematical approaches have been developed to model tumor metabolism and growth, but few investigated a nutrient perturbation resulting from a mechanism-directed therapy; e.g., the application of an inhibitor of nutrient transport. In the present study, we not only extended our original model to account for the effect of mechanism-directed therapy, but also validated the model with experimental data obtained from time-resolved microscopy. To accomplish this, we introduced only one extra parameter to quantify the effect of the treatment, without the need for modeling complicated signaling pathways which would lead to a large set of free parameters. The approach for model calibration described in section "Calibration of the mechanism-based model with Dataset B to reduce the parameter space for Gin" reduced the potential range of the parameter values for G in , so we could obtain a more reliable estimate for the constant inhibition from the calibration described in section "Calibration of the mechanism-based model with Dataset B to estimate kbys and Gin". Importantly, by allowing the bystander effect induced death rate to be a local parameter the model had the flexibility to optimally describe the data. The resulting model was validated in a triple-negative breast cancer cell line (MDA-MB-231). With the model calibrated from the training data, we could predict tumor cell growth with our model with confidence, using the initial nutrient condition, seeding density, and dose of Cytochalasin B in the validation data. This experimental-mathematical approach presents a framework that enables one to test and quantify the effects of drugs designed to perturb intracellular glucose level in any number of cancer cell lines. In particular, we note that the original model was also tested in a HER2 + breast cancer cell line (BT-474) 26 .
We compared the performance of the mechanism-based model with four common machine learning models. The random forest model, featuring ensemble learning, presented the highest accuracy with the validation set. The decision tree model performed the best with the training set but not with the validation set, suggesting it may be overfitting. The linear regression model, with the least flexibility and lowest complexity, performed the worst in both the training and validation sets. The mechanism-based model presented an intermediate accuracy, with R 2 lower than the k-nearest neighbor model and higher than the linear regression model. We further evaluated the performance of the machine learning models with learning curves. Though both decision tree and random www.nature.com/scientificreports/ forest models present seemingly unrealistic high performance on the training set, they do learn better from more data, and the gap between the training performance and validation performance is reduced and the validation performance is improved. With the complete training set, they become well generalized to make predictions on unseen data. In addition, the gap between the model performance on the training and validation sets of the Decision Tree (0.11) and Random Forest model (0.07) is between the KNN (0.09) and mechanism-base models (0. 19). As neither the learning curves nor the training/validation performance comparison indicates these two models (i.e., decision tree and random forest models) are overfitting, their high scores on the training set do not seem to be a cause for concern. Although the mechanism-based model did not beat the best machine learning models as quantified by the R 2 , it brings the important benefit of easy interpretation as it explicitly accounts for underlying biological mechanisms, entities not typically found in machine learning approaches 26 . Having a mechanism-based model provides guidance for further experiments and analyses that are not directly possible by interpreting the results from a machine learning approach. These studies include, for example, identifying optimal treatment protocols and experimental designs. It is possible that improved predictive accuracy could be achieved using more advanced techniques like feature selection, dimension reduction, or hyperparameter tuning to further enhance the model performance.
There are several opportunities to extend the results presented in this effort. As in our previous study 26 , we assume that the effect on glucose availability, as a result of both cell consumption and inhibition on glucose uptake, was entirely reflected by the time-resolved tumor cell number, through either growth or death. This could be an oversimplification, considering the complexity of the phenomenon. Therefore, we compared confluence time courses of tumor cells supplied with the highest glucose concentration and treated with Cytochalasin B, against tumor cells supplied with different glucose concentrations and no Cytochalasin B, to estimate the accessible glucose level in the treated cells. A high CCC suggested the glucose level in the treated cells was close to the glucose level in the cells supplied with a different glucose level but no treatment. While this comparison provides an indirect measurement of the glucose available in the treated cells, future studies should seek a direct measurement of the nutrient dynamics. In particular, we did not explicitly model the effect of Cytochalasin B as an inhibitor of actin polymerization 52 , but only as an inhibitor of glucose uptake. We simplified our model to be a function of accessible glucose level, driven by metabolism, where the effect of actin polymerization was implicitly captures as "glucose depletion". This could be an oversimplification resulting in an overestimation of the effect of glucose metabolism and would require further studies to separate the effect of Cytochalasin B's dual role. Additionally, we only tested two doses of Cytochalasin B and would require additional measurements to confirm a relationship between the inhibition constant and the dose. Knowing such a relationship would enable the model to perform predictions at doses not yet tested. Another area for improvement would be to employ time-resolved, non-invasive imaging to track the nutrient level in the cells; we have attempted to develop such a technique based on FRET reporters transfected into the MDA-MB-231 cell line 53 , but it only provides a relative measure of glucose and lactate concentrations. The absolute measurement of the dynamics of lactate concentration would also allow us to include more details in tumor cell metabolism in the model.

Conclusion
We have developed a mechanism-based model that predicts how glucose accessibility influences tumor cell growth by accounting for the effect of metabolism-directed therapy. The model was calibrated and validated in a triple-negative breast cancer cell line. We were able to quantify the dose-dependent inhibition constant and confirm that the dependence of the bystander effect death rate on the initial glucose level was highly correlated to the accessible glucose level. This mechanism-based model presents predictive capability comparable to complicated machine learning models, but is much easier to interpret and provides the opportunity to directly guide further experiments and analysis in a way not possible with a machine learning approach.

Data availability
The datasets used and/or analyzed during the current study available from the corresponding author on reasonable request.