BioMOL: a computer-assisted biological modeling tool for complex chemical mixtures and biological processes at the molecular level.

A chemical engineering approach for the rigorous construction, solution, and optimization of detailed kinetic models for biological processes is described. This modeling capability addresses the required technical components of detailed kinetic modeling, namely, the modeling of reactant structure and composition, the building of the reaction network, the organization of model parameters, the solution of the kinetic model, and the optimization of the model. Even though this modeling approach has enjoyed successful application in the petroleum industry, its application to biomedical research has just begun. We propose to expand the horizons on classic pharmacokinetics and physiologically based pharmacokinetics (PBPK), where human or animal bodies were often described by a few compartments, by integrating PBPK with reaction network modeling described in this article. If one draws a parallel between an oil refinery, where the application of this modeling approach has been very successful, and a human body, the individual processing units in the oil refinery may be considered equivalent to the vital organs of the human body. Even though the cell or organ may be much more complicated, the complex biochemical reaction networks in each organ may be similarly modeled and linked in much the same way as the modeling of the entire oil refinery through linkage of the individual processing units. The integrated chemical engineering software package described in this article, BioMOL, denotes the biological application of molecular-oriented lumping. BioMOL can build a detailed model in 1-1,000 CPU sec using standard desktop hardware. The models solve and optimize using standard and widely available hardware and software and can be presented in the context of a user-friendly interface. We believe this is an engineering tool with great promise in its application to complex biological reaction networks.

The difficulty in developing truly fundamental kinetic models of chemical and biological processes in cells can be traced to the extreme complexity of the underlying chemistry of living organisms. Here we must deal with a multitude of species and reaction paths and the complexity of biological catalysts, the enzymes. Recent advances in cellular and molecular biology and genetics unveiled complex interactions of signaling pathways, and gene and protein expression profiling associated with many biological processes. This knowledge added even greater complexity to already daunting biological systems, yet provides an opportunity to integrate engineering tools and analyses into biomedical research. As As another testimony of the importance of computer technology, Venter, as quoted by Butler (2), indicated . . . If we hope to understand biology, instead of looking at one little protein at a time, which is not how biology works, we will need to understand the integration of thousands of proteins in a dynamically changing environment. A computer will be the biologist's number one tool. . . .
To that end, we have been engaged in the development of a modeling methodology, which we have named BioMOL, that encompasses and formalizes a systematic detailed kinetic modeling approach and a system of chemical engineering software tools to delineate and reduce the essential elements of complexity in the modeling of complex reaction systems. In this article, we first provide the details of the historical development of the predecessor of BioMOL and its application in petroleum industry; we then explain the fundamentals of the different modules of BioMOL to introduce the inner workings of this software. On the surface, this article might appear to be simply an application of software development. However, as suggested earlier, if one draws a parallel between an oil refinery and a human body, the individual processing units in the oil refinery may be considered equivalent to the vital organs of the human body. In that sense, the complex biochemical reaction networks in the organs may be similarly modeled and linked in much the same way as the modeling of the entire oil refinery through the modeling and linking of the individual processing units. We believe it is possible to create a virtual cell or a virtual human through the linkage of larger and larger reaction networks in our body through such a modeling approach. Present efforts are directed at description of enzymatic metabolism of families of substrates for cytochrome P450 (CYP) enzymes, particularly CYP1A1 and CYP2E1.

Background and Historical Development
The engineering foundation on which this approach was born provides some relevant context. Most traditional and even current engineering process models have implemented lumped kinetic schemes, where the underlying molecular information and the molecules themselves are grouped by global properties such as boiling point or solubility. This is largely because of the limitations in the associated analytical chemistry and information technology (IT) requirements, i.e., computer hardware and software capability. Pseudomolecules (lumps) of the order of 10 are generally used to represent the reaction systems.
The true molecular information is obscured in this approach because of the multicomponent nature of each lump. This unavoidably leads to the absence of specific chemical properties, save the literal definition of each lump, because of the absence of chemical structure. These lumped and nearly chemistry-free kinetic models are specific in nature and cannot be extended to new situations requiring true predictions.
In recent years, both increasing technical and environmental concerns have focused attention on the molecular composition of hydrocarbon conversion processes. For example, recent environmental legislation has placed restrictions on the maximum allowable benzene content in gasoline and sulfur content in diesel fuel. Thus, the new modeling paradigm is to track every molecule throughout the process stream. We are exploiting this more explicit approach in the development of BioMOL.
A molecular description of reaction processes is compelling. Molecules are the common foundation for feedstock composition, property calculation, process chemistry, reaction kinetics, and thermodynamics. Molecule-based models can incorporate multilevel information from surface science experiments and quantum chemical calculations to process issues. They can thus serve as a common fundamental form for both process modeling and scientific research and development.
However, the molecular modeling approach carries with it a considerable scientific and IT burden for the understanding and organization of molecular information, respectively. Two technological advancements have helped modeling at the molecular level become achievable. First, recent developments in analytical chemistry now permit the direct or at least indirect measurement of the molecular structures in complex mixtures. Second, the advancement in IT, especially the explosion of computational power, supports the necessary bookkeeping to track the fate of all the molecules during both reaction and separation processes. Collectively, both the economic/environmental forces on rigorous models and the enabling analytical and computational advances motivate the development of molecule-based detailed kinetic models of complex mixtures.
The above discussion of the conceptual development from lumping to delumping to finally consider molecular interactions of chemical species in computer modeling is not unique in chemical or petroleum engineering. An analogy in the biomedical field is the classic pharmacokinetic models and the more modern physiologically based pharmacokinetic (PBPK) models. With classic pharmacokinetic models, our body is often represented by two or three compartments without any anatomical or physiological significance. PBPK models are based on mammalian physiology, and our body is represented by organs of interest plus lumped tissues or organs. Both of these pharmacokinetic modeling approaches have served the biomedical community well; however, neither gets down to the fundamental biochemical reaction network level. The recent explosive growth of genomics, proteomics, and related bioinformatics in biomedical field again parallel the availability of analytical and IT technologies to chemical engineering. A logical question is whether biologically based modeling can be advanced to the biochemical reaction network level by using proven chemical engineering modeling technology.
The discipline of chemical engineering provides a rigorous framework for the construction, solution, and optimization of detailed kinetic models for delivery to process chemists and engineers. The goal of this article is to describe the required technical components for detailed kinetic modeling of toxicology-related biochemical reaction pathways, namely, the modeling of reactant structure and composition, the building of the reaction network, the organization of model parameters, the solution of the kinetic model, and the optimization of the model. This integrated general chemical engineering software package is the Kinetic Modeler's Toolbox (KMT; University of Delaware, Newark, DE USA), and we have developed a similar system, BioMOL, for application to biological processes at the molecular level. An example of application to benzo[a]pyrene (B[a]P) reaction network modeling is given in Liao et al. (3) in this monograph.

The Integrated Kinetic Modeler's Toolbox
As shown in Figure 1, KMT comprises five modules that together automate the entire kinetic modeling process: the molecule generator (MolGen), the reaction network generator (NetGen), the model equation generator (EqnGen), the model solution generator (SolGen), and the parameter optimization framework (ParOpt). Conceptually, a modeling project might begin with the molecular structure building software MolGen, which uses Monte Carlo simulation techniques to assemble a molecular representation of complex mixtures from analytical information, e.g., elemental hydrogen to carbon (H/C) ratio, boiling point distribution (simulated distillation fractions, SimDis), nuclear magnetic resonance (NMR) spectroscopy. Graph theory techniques are then used to generate  NetGen. Reaction family concepts and quantitative structure reactivity correlations (QSRCs) are used to organize and estimate rate constants. The computer-generated reaction network, with associated rate expressions, is then converted to a set of mathematical equations by EqnGen, which can be solved for different reactor systems by SolGen. Solving these equations in once-through mode provides a prediction, whereas solving in an iterative mode allows optimization to data to obtain best-fit values of rate parameters (ParOpt) or operation in a goal-seeking mode to obtain the required conditions to achieve a desired outcome for a petroleum refinery.
The above overview of the stepwise modeling process of the KMT software that will be the basis of BioMOL is, of course, for a potential feedstock for an oil refinery. However, this exact sequence may be followed, for example, for the generation of air pollutants from a variety of sources (internal combustion engines, oil or coal burning in power plants, etc.), and the possible chemical reaction networks can be studied. From a different angle, the biological application of reaction network modeling may be pursued with a bottom-up approach such as that we have used with B[a]P and some of its metabolites in the overall B[a]P reaction network. We are mining the literature or generating kinetic data on some of the metabolic processes experimentally for this modeling effort. Some preliminary information may be seen in Figures 2 and 3 in this article and in Liao et al. (3) in this monograph. Expanding on the bottom-up approach, once B[a]P reaction network modeling is complete, we will link it with other reaction networks of other carcinogenic or noncarcinogenic polycyclic aromatic hydrocarbons. The goal of such a program is to model reaction pathways in our body of air pollutants from internal combustion engines, using BioMOL linked with PBPK modeling.
In the following sections, each module of KMT software is explained in detail. It is important to realize that although the description appears to be engineering or computer software oriented, all the operations within each module can also be easily applied to biochemical reactions or processes.

The Molecule Generator
MolGen is a combination of stochastic modeling techniques for molecular structure and compositions of complex mixtures. From a modeling point of view, there are two kinds of inputs to MolGen-adjustable and fixed. Analytical chemistry information, such as the H/C ratio, average molecular weight, boiling point distribution, compound class distribution, and NMR, are fixed characterizations for a given mixture. Conceptually, this is converted into the adjustable information, i.e., those parameters of the probability distribution functions (PDFs) used to stochastically sample the molecular attributes of the molecules in the mixture, such as the number of aromatic rings, the number of naphthenic rings, the length of a side chain, or the number of side chains.
MolGen can be run in once-through mode with fixed PDFs, but it is normally used within an optimization framework (ParOpt), as depicted in Figure 1, to adjust candidate PDF parameters in the search for the optimal PDF parameters to match the analytical observations for the mixture. The mixture characterizations can be readily calculated or estimated from the explicit molecular structures and compositions via the quantitative structure property relationships. The outputs from MolGen include both the representative molecular structures and their optimized compositions for the mixture. The reactant molecular structures are then converted to the corresponding bond-electron (BE) matrices in the grammar of NetGen and serve as its input. The composition or concentrations of these reactant molecules will serve as the initial conditions for the quantitative model later in the SolGen stage.

The Reaction Network Generator
The first version of the NetGen, developed by Broadbelt et al. (4,5), was targeted to small model compounds and simple reaction networks, such as ethane pyrolysis (6). The current version of the NetGen module in BioMOL is a significantly enhanced new generation of software with many more features and functionalities. There are two major categories of enhanced capabilities. In terms of molecules, it has been enhanced from single-ring compounds only to handle much more complex molecular structures, such as multiring polynuclear aromatic, hydroaromatic, sulfur, and nitrogen-containing compounds. It has also been enhanced to handle more complex multifunctional heterogeneous chemistries, such as biological pathways, metal chemistry, and acid chemistry. As this modeling approach advances, it is necessary to consider the three-dimensional configurations of enzyme-active sites in relation to molecular size, diameter, and orientation of the substrates.
The fundamental mechanism for the generation of the model on the computer is based on graph theory. Molecules and intermediates (ions and radicals) are represented as BE matrices. Reactions are carried out via simple matrix addition operations between the reactant matrix and the reaction matrix for each reaction family to generate the product matrix. This process is illustrated in Figures 2 and 3.
There are two categories of inputs to NetGen. One category is the reaction molecules in the format of BE matrices. The other category of inputs to NetGen is the reaction chemistry, including both the reaction families in the chemistry and reaction rules to control the model size based on the user's experience and understanding. NetGen takes in both inputs and applies the reaction matrix for each reaction family to each reactant molecule, guided by the reaction rules for each reaction. This process builds the complete reaction network. The generated reaction network is written into a file in the required format of EqnGen and serves as its input.

The Model Equation Generator
EqnGen has its origins in the early work of Broadbelt et al. (6), who developed ODEGen to convert the list of reactions produced by NetGen into the solvable code for the corresponding ordinary differential equations (ODE). EqnGen is a mathematical converter that parses the reaction network and generates the corresponding mass balance equations for each species in the system. The system of generated mass balance equations forms the kernel of the kinetic model in the context of a reactor model. The code for the model equations can be generated either in C or FOR-TRAN. EqnGen extended the ODEGen capability to allow for the formulation of the types of mathematical equations that correspond to various chemical rate laws and reactor types. reactors, or fixed bed reactors, EqnGen produces either ordinary differential equations or differential algebraic equations (DAEs), depending on whether the kinetic steady-state approximation is used. For continuous stirred tank reactors (CSTR), EqnGen writes the corresponding algebraic equation systems. EqnGen can also write rate law types such as Lagmuir-Hinshelwood-Hougen-Watson or Michaelis-Menten formalisms.

The Model Solution Generator
SolGen is the coupling of the model equations generated from EqnGen with an equation solver system, such as LSODE (Livermore solver for ordinary differential equations) and its variants (7,8) for solving ODEs, or DASSL (differential algebraic system solver) (9,10) for solving DAEs. A driver file organizes all the input/output (I/O) files and reactor configurations and calls the model equations, which can be compiled and run to produce the final solution. The process driver file, the model equations file, and a collection of files for supporting functions combine with various I/O files to form the model deliverable, which can be compiled and solved to produce the model results.

Parameter Optimization
As discussed earlier, the Monte Carlo simulation in the MolGen module is normally run within an optimization framework by adjusting the PDF parameters to allow the stochastically generated molecular structures to match both the structures and compositions of the original complex mixture. Similarly, the developed kinetic models can also be solved within an optimization framework to determine the kinetic rate parameters by matching the model results with the experimental observations in the reactor. Both of these ParOpt problems are designated with dashed rectangles in Figure 1.

Optimization Algorithms
Several optimization algorithms have been tested and used for the composition and parameter estimation problems noted above. The three algorithms used most frequently in this work are the simulated annealing (SA) method (11)(12)(13), the GREG routine (14)(15)(16), and the multilevel single-linkage (MLSL) method (17,18). These routines were used in an essentially off-the-shelf manner, and their strategies and details will not be discussed here, as further details are available (19).
MolGen uses SA to find the optimal O(10) (i.e., order of 10) PDF parameters by minimizing the objective function that matches the representative molecular structures and compositions with the feedstock characterizations. SA is very accurate; a typical MolGen optimization normally took hours to days on an IBM 560 Workstation (66 MHz CPU; IBM, Armonk, NY, USA) in the past in our laboratory. This actually met the purpose of MolGen, where an accurate molecular representation was needed and time was not an issue, as this optimization was normally a one-time, offline effort in the model development process. With presentday computational power, the CPU time is even less of an issue. The identification of the molecular structures is needed only once; after a library of analyzed feedstocks is built, the optimization of the composition (only) of a slightly changed feed would be much faster.
GREG was used with the SolGen module to optimize the O(10) kinetic rate parameters by matching the model results with experimental observations. Because of its local optimization nature, GREG is quite fast. The model itself needs to be solved repetitively in every iteration of the optimization process, and the speed of the optimization is thus crucial. Once again, in our laboratory in the past, a typical round of SolGen optimization normally took from 0.5 hr for a pathwayslevel model to many hours for a mechanistic model on a 400 MHz Pentium II computer (IBM).
MLSL was used in KMT, both in MolGen and SolGen, as developed by Stark (20,21). In the development of the detailed kinetic models, the MLSL global optimizer was used in parallel with other optimization routines to generate the initial guess values and the parameter ranges for the GREG local parameter optimizer, to speed up the overall optimization process.

The Objective Function
The objective function used to optimize the molecular representation with the actual feedstock characterizations was the chi-square statistics (χ 2 ). This is a ratio of deviation to measurement precision. More specifically, the numerator is the square of the difference between the model prediction and the experimentally determined properties. The denominator is a weighting factor equal to the standard deviation of the experimentally determined value.
To optimize the rate parameters for a developed kinetic model, the objective function is normally defined as the square of the difference between predicted and experimental yields weighted by the experimental standard deviation, as shown below: [1] where i is the experiment number and j is the species or lump number, and ω j is the weighting factor (generally the experimental measurement deviation). The assignment of ω j is very important for the success of the optimization. Generally, the ω j should be the same magnitude as y or less to make sure F/MN ≤ 1. The smaller the ω j , the more important in the objective function that associated term would be.

Property Estimation of Mixtures
The solution of a kinetic model provides, as the most fundamental output, the concentrations of the model species as a function of the process variable (e.g., time or space velocity). Generally, the end user will be interested in a higher-level output, such as a set of properties of the model composition (e.g., boiling point distribution, air pollutant species, composition) These mixture properties can be calculated based on the collective properties of the product molecules and appropriate mixing rules.
The usual strategy is to develop or use any quantitative structure-property relationships (QSPRs) to correlate the molecular structures and compositions with the mixture properties. In any case, the fine-tuned kinetic model, with the supporting structure-property correlations, can be used to predict the product properties directly.

Summary and Conclusions
The software package that the KMT comprises provides the framework to automate the modeling of molecular structures and kinetics of complex reaction systems. KMT includes a total of five modules: MolGen for molecular structure and composition modeling; NetGen for automated reaction network generation; EqnGen for automated code/equation generation; SolGen for model solution; and ParOpt, a parameter optimization framework. This automated molecule-based paradigm delineates and reduces the essential elements of complexity in complex reaction systems to a manageable level. The molecular structures are reduced to O(10) PDF parameters (5-10 PDFs × 2-3 parameters in each PDF); the reaction network building is automated through O(10) reaction matrices (one for each reaction family); and molecular reactivities are correlated with O(10) kinetic parameters (5-10 LFERs [linear free energy relationships] × 2-3 parameters each). The brute force description of these complex systems could have been of the order of ten billions [O(10 5 × 10 5 )] because of the O(10 5 ) species and O(10 5 ) reactions in these complex reaction systems.
The integrated KMT in Figure 1 gives a clear picture of the goal of this kinetic modeling approach: product properties are estimated from only the input of feedstock characterization. The set of the PDF parameters is akin to the fingerprint of the complex feedstock and can be optimized to assemble its molecular representation. A solid understanding of the reaction chemistry provides the reaction families and reaction rules, and the associated understanding of the reaction kinetics provides the rate law. The set of QSRC/LFER parameters fundamentally correlates the molecular reactivities with the molecular structures.
The KMT software has automated the process of building these detailed kinetic models. By exploiting Monte Carlo and graph theory techniques, reaction models containing thousands of species can be built in 1,000 CPU sec or less. This model-building speed has changed the serial model-building-modeluse paradigm to a new parallel approach, where a model builder can produce an updated optimal model in seconds. The thus-constructed models can then react the molecularly explicit feedstock using QSRCs for kinetic parameters to predict the product properties.
In an AAAS Plenary Lecture on 13 February 1998, Dr. Harold Varmus, then director of the National Institutes of Health, emphasized, among others, two specific themes: ". . . Discoveries in biology and medicine depend on progress in many fields of science. . . ." and ". . . Methods that dramatically expand biological data also demand new modes of analysis and new ways to ask scientific questions. . . ." He said: ". . . In short, biology is not only for biologists . . . ." (22). Here we present an engineering approach that we believe can be used effectively for complex biological processes. We plan to incorporate the concept of reaction network modeling by incorporating the essential biological pathways involved into the present-day PBPK modeling. For instance, if we are studying air pollution, we will be adding scores of biochemical reactions and pathways (reaction network) into the lung and liver compartments of PBPK models for the air pollutants being studied. We will also identify mechanistic bases for toxicologic actions or interactions by specific genomic and proteomic changes in relation to the biochemical reactions or pathways. Thus, pharmacodynamic (PD) interactions will be incorporated into the PBPK modeling to transform the model to a PBPK/PD model. In that sense, we are interested in developing second-generation PBPK/PD models. In our case, we may consider that the transformation of classic compartmental pharmacokinetics into PBPK modeling represents the first stage of delumping, that is, going from 2 or 3 compartments to 5-10 or 15 compartments. The second-generation PBPK/PD models will mean that, in the critical compartment, further delumping, via reaction network modeling using BioMOL, will carry PBPK/PD modeling down to the molecular mechanism level. If we fully utilize the power of computational technology, such second-generation PBPK/PD modeling will have the potential to handle very complex biological systems, thereby handling exposure to complex chemical, biological, and physical agents. Thus, we believe the approach discussed in this article and the development of BioMOL form a basis that could be used for the eventual development of a virtual human.