Mathematical modeling of proteome constraints within metabolism

Genome-scale metabolic models (GEMs) are widely used to predict phenotypes with the aid of constraint-based modeling. In order to improve the predictive power of these models, there have been many efforts on imposing biological constraints, among which proteome constraints are of particular interest. Here we describe the concept of proteome constraints and review proteome-constrained GEMs, as well as their advantages and applications. In addition, we discuss a key issue in the field, i.e., low coverage of enzyme-specific turnover rates, and subsequently provide a few solutions to solve it. We end with a discussion on the trade-off between model complexity


Introduction
Genome-scale metabolic models (GEMs) are organismspecific knowledge bases that account for genes, proteins, and biochemical reactions, which enable systematic analysis of metabolism [1]. GEMs have been used to predict phenotypes with the aid of optimization algorithms when mathematically converted to constraintbased models [2], where two types of constraints are used, i.e., mass conservation and flux bounds. In order to improve the predictive power, it is common to impose additional constraints, e.g., exchange fluxes. Additionally, an increasing number of efforts have focused on adding biologically meaningful constraints, e.g., proteome constraints.
The proteome constraint relies on the fact that cells have to optimally allocate finite proteome resources to various biological processes due to limited space, e.g., membrane area and cell volume. This is related to the concept of cellular resource allocation, which means that an increase in the requirement of proteome resources of an enzyme or a pathway would be a trade-off for other functions. Experimental evidence indicates that resource re-allocation could be an effective strategy for various organisms in response to perturbations [3], e.g., nutrient shift [4] and growth shift [5,6], which demonstrates the biological significance of proteome constraints. Accordingly, this suggests that proteome constraints could be a valuable addition to GEMs to improve model predictions.
Here, we describe how proteome constraints can be integrated into GEMs and review proteome-constrained GEMs (pcGEMs) that were constructed in recent years. Subsequently, we discuss one of the current challenges on implementing proteome constraints, i.e., unavailability of turnover rates, and provide potential strategies to deal with the issue.

Frameworks to integrate proteome constraints with GEMs
Over the past years, different frameworks have been proposed to integrate proteome constraints into GEMs with two directions, i.e., coarse-grained and fine-grained frameworks [7]. The coarse-grained frameworks, e.g., MOMENT [8] and GECKO [9], impose phenomenological constraints, e.g., enzyme concentrations and activities, on metabolic enzymes (Figure 1a), which do not change the gene number of the original GEM. These models are sometimes referred to as enzymeconstrained GEMs (ecGEMs). In contrast, the finegrained frameworks, e.g., ETFL [10], ME-model [11,12] and RBA [13], introduce additional biological processes, especially protein expression processes (Figure 1a), with very detailed descriptions, which can expand the scope of the original GEM and results in a self-replicating system [14]. Fine-grained pcGEMs are therefore able to predict extra biological layers beyond metabolism when more biological data, e.g., ribosome efficiencies, are available [11,12].
Despite differences of the frameworks, a commonly imposed constraint is on metabolic flux (Figure 1a), formulated as enzyme kinetics v k cat $E, in which v is metabolic rate, E is the concentration of the enzyme that catalyzes the reaction, and k cat is the turnover rate of the enzyme. This means that each flux is constrained by the catalytic capability of the enzyme and the availability of the enzyme. In addition to this, fine-grained pcGEMs can constrain protein synthesis rates based on catalytic capabilities of machinery, e.g., ribosomes, and their abundances (Figure 1a). Figure 1b introduces the common equations across various frameworks when proteome constraints are used. It should be noted that proteome constraints will be active only after imposing upper limits on individual or total enzyme abundances (Equation (4) in Figure 1b). The total limit can also be imposed on enzymes that compete within shared compartments, e.g., membrane enzymes and mitochondrial enzymes. The upper bound for the total enzymes can be estimated by summing up the abundances of all involved enzymes based on proteomics data or databases such as the PaxDb [15]. Due to the inclusion of protein expression processes, fine-grained pcGEMs have additional equations related to the maximal catalytic rate of machinery in the processes (Equation (5) in Figure 1b), e g., ribosomes and chaperons, and balance of the components in the processes (Equation (6) in

Genome-scale models integrated with proteome constraints
Over the past decade, a number of pcGEMs have been developed using various frameworks (Table 1). All the models were derived from the corresponding base GEM of the organism, including bacteria, fungi, and human cell lines, suggesting that proteome constraints are relatively easy to integrate when a base GEM is available.
With the integration of proteome constraints in either a coarse-or fine-grained formalism, model predictions have been improved. In particular, variabilities of simulated metabolic fluxes of pcGEMs are much lower than those of the corresponding base GEMs [9, 11,18,19,22,23], which allows more accurate prediction of fluxes [12,18,23]. Additionally, pcGEMs are capable of predicting phenotypes that base GEMs cannot unless some specific rates are constrained a priori, e.g., maximal growth at unlimited conditions [9,12], overflow metabolism in Escherichia coli [12] and Clostridium ljungdahlii [33], the Crabtree effect in Saccharomyces cerevisiae [9], and metabolic shift of arginine catabolism in Lactococcus lactis [27].
pcGEMs can also serve as frameworks for integrated analysis of transcriptomics and proteomics data [22,38]. Overview of proteome constraints. (a) Illustration of the integration of proteome constraints into GEMs based on coarse-grained and fine-grained approaches. The former only imposes phenomenological constraints on enzymatic reactions while the latter adds protein expression reactions. The rate (v met,i ) of a metabolic reaction (blue arrow) is constrained by the turnover rate (k cat,i ) and abundance (E met,i ) of the enzyme. The rate of a protein expression reaction (red arrow) is also constrained by such a type of equation. For example, the total rates of all protein synthesis reactions (Sv syn,i ) are constrained by the catalytic rate (k rib ) and the abundance (E rib ) of ribosome, as well as the production rate of all precursors, e.g., amino acids and energy. (b) Formulation of constraints. Equation (1) is the flux balance constraint, in which S is stoichiometric matrix and v is a vector of fluxes. Equation (2) imposes lower and upper bounds on each flux. Equation (3) imposes phenomenological constraints on enzymatic reactions. Equation (4) imposes upper limits on the total abundances of enzymes in certain compartments. Equation (5) imposes constraints on the reaction rate of protein expression processes by catalytic rate and abundance of the machinery. Equation (6) represents the mass balance of components in protein expression processes, e.g., enzymes and ribosomes. At a steady-state, the synthesis rate of a component (v syn,i ) is equal to degradation rate (v deg,i ) plus dilution rate (v dil,i ), in which degradation rate is equal to degradation constant (k deg,i ) multiplies abundance of the component (E i ) while dilution rate depends on growth rate (m) and also the abundance. Table 1 Published genome-scale models integrated with proteome constraints.

Model
Organism Type Turnover rate of metabolic enzyme ec_iYO844 Bacillus subtilis Coarse-grained (GECKO) Only 17 reactions were assigned with turnover rates, which were manually collect BRENDA [16]and SABIO-RK [17] databases, and literature. ecYeast8 a Saccharomyces cerevisiae Coarse-grained (GECKO) Turnover rates were automatically retrieved from the BRENDA database using the G or manually collected from the literature. ec_iML1515 Escherichia coli Coarse-grained (GECKO) Turnover rates were automatically retrieved from the BRENDA database using the G or manually collected from the SABIO-RK database and the literature.

ec-iBag597
Bacillus coagulans Coarse-grained (GECKO) Turnover rates were automatically retrieved from the BRENDA database using the toolbox.

EcSco-GEM
Streptomyces coelicolor Coarse-grained (GECKO) Turnover rates were automatically retrieved from the BRENDA database using the G or manually collected from the literature. Cell line-specific ecGEMs

Homo sapiens
Coarse-grained (GECKO) Turnover rates were automatically retrieved from the BRENDA database using the toolbox.

E. coli MOMENT model Escherichia coli
Coarse-grained (MOMENT) Turnover rates were retrieved from the BRENDA and SABIO-RK databases.

E. coli MOMENT model Escherichia coli
Coarse-grained (MOMENT) Turnover rates were retrieved from the BRENDA and SABIO-RK databases. Appa rates [24] were used as additional input. S. elongatus model Synechococcus elongatus Fine-grained Turnover rates were retrieved from the BRENDA database or manually collected fro pcLactis Lactococcus lactis Fine-grained Turnover rates were automatically retrieved from the BRENDA database using the G or manually collected from the literature. E. coli model Escherichia coli Fine-grained Turnover rates were retrieved from the BRENDA and SABIO-RK databases.

E. coli ETFL model
Escherichia coli Fine-grained (ETFL) Turnover rates were obtained from the study [29] and SABIO-RK database. yETFL Saccharomyces cerevisiae Fine-grained (ETFL) Turnover rates were automatically retrieved from the BRENDA database using the toolbox. T. maritima ME Thermotoga maritima Fine-grained (ME-model) Turnover rates were globally assumed to 15 reactions per second per protein com iJL1678b-ME b Escherichia coli Fine-grained (ME-model) Effective catalytic rates obtained from the previous study [31] were adopted, in wh workflow that integrates proteomics data with the ME-model was used to infer t catalytic rates. iJL965-ME Clostridium ljungdahlii Fine-grained (ME-model) Turnover rates were globally assumed to 25 s −1 .
Additionally, pcGEMs have provided mechanistic insights into cellular resource allocation under environmental changes [39e41]. Furthermore, fine-grained pcGEMs can predict more biological parameters, e.g., growth-condition dependent biomass composition [11,42] and translation machinery content [11,12], which cannot be predicted by coarse-grained pcGEMs. The coarse-grained pcGEMs are, however, more suitable for the prediction of metabolic engineering targets [18,20] due to their simple formalism, which makes simulation cheaper than for fine-grained pcGEMs [43].
Current challenge: turnover rates are essential parameters Despite improved predictions and applications, issues remain with constructing and utilizing pcGEMs [37,44], of which defining the turnover rates is key as they are essential parameters in any type of pcGEMs but are assigned in distinct manners (Table 1). The simplest one is to assume a global turnover rate for all enzymes, which is the case in the first version of the E. coli MEmodel [12], the Thermotoga maritima ME-model [11], and even the very recent C. ljungdahlii ME-model [33]. Although these models succeed in capturing phenotypes such as overflow metabolism, the predictions deviated from experimental measurements [12]. This could be circumvented, as done in other pcGEMs (Table 1), by adopting organism-specific and enzyme-specific turnover rates, which can be retrieved in databases such as BRENDA and SABIO-RK, and literature. Given that the retrieved turnover rates were mostly measured in vitro, it may be questionable whether in vitro turnover rates can be used in pcGEMs to simulate in vivo fluxes, but for E. coli there is a good correlation between in vitro and in vivo enzyme kinetics [29]. However, measured turnover rates normally only cover a small fraction of metabolic reactions even in well-studied organisms [45]. Here we discuss a few solutions that could be potentially helpful to address this challenge.

Reduced models circumvent uncertain turnover rates
When it is impossible to assign turnover rates to all enzymes in a genome-scale, model reduction can be an effective way to improve the coverage of high-quality turnover rates. A recent study proposed such a concept to reduce GEMs to only include reactions of Correlation analysis on log10 transformed in vivo turnover rates. (a) Dataset is from. 31 various growth conditions of E. coli [29]. (b) Dataset is from 21 various E. coli strains [47]. The codes are available here (https://github.com/Yu-sysbio/correlate_kapp), where the information of numbered conditions in panel (a) is also available. energy metabolism so that most of the reactions have manually curated turnover rates, which can capture overflow metabolism of E. coli and the Crabtree effect of S. cerevisiae [46]. This indicates that the energy metabolism is sufficient to interpret key phenotypes. The concept of reduced models has also been applied to investigate metabolic shifts in L. lactis, and it was found that the shift in arginine catabolism could abide by the resource allocation principles [27]. Therefore, we argue that genome-scale models might have more uncertainties and be unnecessary for specific studies where reduced models, that focus on particular pathways or organelles, e.g., mitochondrion, may be better suited.
2. Proteomics data enable estimation of in vivo turnover rates on a large scale It is possible to infer in vivo turnover rates, also called apparent catalytic rates (k app s), from genome-wide absolute proteomics and fluxomics data. These can be calculated simply by dividing the flux through each enzyme by the abundance of the enzyme [45]. The in vivo turnover rates have been estimated for E. coli under diverse growth conditions [29] and for various E. coli strains [47]. The inclusion of genome-wide data can improve the coverage of turnover rates of pcGEMs and even result in condition-or strain-specific pcGEMs. By analyzing the two datasets, we can see strong correlations of the in vivo turnover rates across conditions and strains ( Figure 2). This means that the in vivo turnover rates estimated from one condition or strain could be applicable to others in case data is unavailable. However, it should be noted that this has not been observed in other organisms especially eukaryotes that have more complex regulation. Another approach that also uses proteomics data to infer in vivo turnover rates implements an iterative workflow within a ME-model of E. coli by minimizing the difference between simulated and measured proteomics data [31]. Altogether, these approaches show the possibility to estimate genome-wide in vivo turnover rates, which can be directly adopted in pcGEMs.
As in vivo turnover rates become available another question arises: whether in vitro or in vivo turnover rates outperform in pcGEM simulations? It was shown that using maximal in vivo turnover rates predicted proteomics data much better [47], and we, therefore, encourage adopting in vivo turnover rates in pcGEMs. However, in vivo data are currently only available for E. coli and Bacillus subtilis [13] under specific conditions due to the limited availability of absolute proteomics data. In most cases with the lack of in vivo enzyme kinetics, in vitro data have to be used in pcGEMs, which are in principle not applicable to cases with changed in vivo turnover rates. This is therefore a general weakness of current pcGEMs.

Machine learning for predicting turnover rates
Although data-driven estimation of in vivo turnover rates improves the coverage, there are still a lot of enzymes that cannot be estimated due to coverage issues of proteomics experiments. For example, only hundreds of turnover rates have been estimated compared with thousands of reactions in the model [47]. To improve further the coverage, machine learning approaches [48,49] have been utilized [47], which enable predictions of genome-scale turnover rates based on enzyme biochemistry, structure, and network context [50]. With the integration of predicted turnover rates, pcGEMs predict unseen proteomics data with higher precision [47], which demonstrates that machine learning can effectively solve the coverage issue of turnover rates in pcGEMs.

Conclusion
As more kinetic parameters become available, we believe that an increasing number of pcGEMs will be constructed in the future. A key direction will thus focus on the trade-off between model complexity and predictive capability. From our overview of published pcGEMs (Table 1), we find that coarse-grained pcGEMs, especially GECKO models have been adopted by various research groups, probably because they are easier to build. In contrast, fine-grained pcGEMs do not spread widely in the community, even though toolboxes for automated construction have been provided, e.g., COBRAme [32] and RBApy [36]. Also, recent efforts focus on the simplification of pcGEMs, including the sMOMENT framework [25] and simplified proteome-constrained model of E. coli [28]. Altogether, simple models seem to be gaining the most attention. However, fine-grained models are valuable due to the established connections between various biological processes, which can serve as mathematical frameworks for cross-evaluating various types of data. Recently, a whole-cell model of E. coli was developed [51], which includes even more biological processes than finegrained pcGEM. The model is able to link heterogeneous data and therefore confirms that most of the data are cross-consistent.

Conflict of interest statement
Nothing declared.

10
. Salvy P, Hatzimanikatis V: The ETFL formulation allows multiomics integration in thermodynamics-compliant metabolism and expression models. Nat Commun 2020, 11:30. This study proposes a formulation called ETFL to integrate expression, thermodynamics, and growth-dependent variables into metabolism.