PhysiCOOL: A generalized framework for model Calibration and Optimization Of modeLing projects

In silico models of biological systems are usually very complex and rely on a large number of parameters describing physical and biological properties that require validation. As such, parameter space exploration is an essential component of computational model development to fully characterize and validate simulation results. Experimental data may also be used to constrain parameter space (or enable model calibration) to enhance the biological relevance of model parameters. One widely used computational platform in the mathematical biology community is PhysiCell, which provides a standardized approach to agent-based models of biological phenomena at different time and spatial scales. Nonetheless, one limitation of PhysiCell is the lack of a generalized approach for parameter space exploration and calibration that can be run without high-performance computing access. Here, we present PhysiCOOL, an open-source Python library tailored to create standardized calibration and optimization routines for PhysiCell models.

Despite the recent advances in the development of additional PhysiCell plugins, the new modules are mostly centred around model extensions. Nevertheless, model exploration can be as important as model development to validate results and evaluate whether the model predictions about the underlying biological mechanisms are plausible [7]. Furthermore, experimental data could be used to provide biological and/or physical constraints on model parameters to validate whether the model captures the range of expected biological behaviours [8]. Also, optimization routines could be employed to understand which model parameters maximize the similarity between the model results and a target dataset.
Subsequently, model developers may consider these optimal solutions to identify which biological mechanisms captured by the computational model may explain the experimental data.
We highlight that previous works have developed parameter exploration routines with PhysiCell, namely DAPT (RRID:SCR_021032) and PhysiCell-EMEWS [9,10], but these were specifically designed for high-performance computing (HPC) and distributed systems.
Hence, currently, general PhysiCell users without access to such resources, or whose needs do not require them, must develop their own scripts to process simulation results and perform model exploration studies. As well as introducing a barrier to scientific progress depending on the researchers' programming knowledge and computing resources, HPC workflows generally lack the standardization that may enable widespread use in the mathematical biology community [11]. In addition, DAPT and PhysiCell-EMEWS focus on parameter exploration and not optimization, and they require some level of expertise in both Python (RRID:SCR_008394) and PhysiCell.
Taking into account that there is still a need in the PhysiCell community for a standardized tool that implements calibration and optimization routines, we present PhysiCOOL, a generalized framework for model calibration and optimization of modelling projects written in PhysiCell. PhysiCOOL aims to be model agnostic. In other words, models are treated as a black box that can be executed through Python, making this approach suitable for several kinds of biological problems. Moreover, our library includes a built-in multilevel optimization routine for parameter estimation that is constrained by target output (experimental or otherwise). A visual representation of the new functionalities added by PhysiCOOL to the PhysiCell ecosystem is shown in Figure 1. We also provide two practical examples of how PhysiCOOL can be used, showcasing PhysiCOOL's optimization routine at two distinct complexity levels. Furthermore, we show how PhysiCOOL black-box models can be used to couple PhysiCell with other publicly-available Python libraries for model optimization.

IMPLEMENTATION
PhysiCOOL is a Python library that requires Python version 3.8 or higher. This package was created to work specifically with PhysiCell models, and it fully supports PhysiCell v1.10.4 or lower (the most recent version at the time of publication). Furthermore, PhysiCOOL has been tested extensively and includes unit tests to ensure that its modules are working as expected and that it can be used on different platforms.

Configuration file parser
As with many computational modelling frameworks, PhysiCell models are initialized with values stored in a text-based configuration file, namely an Extensible Markup Language To this end, PhysiCOOL introduces new functionalities, such as a configuration file parser that updates configuration files in an error-free and user-friendly manner. PhysiCOOL also enables users to turn models into black-box models, making the optimization pipeline model-agnostic. In addition, it implements a multilevel parameter sweep routine to optimize models using some target data. Lastly, PhysiCOOL facilitates the integration of thirdparty libraries, which makes PhysiCell more accessible.
(XML) file [3]. Thus, in parameter sweeps and sensitivity analysis studies, it is necessary to open these files and modify the parameter values to be studied every time a new simulation is run. This process can be done manually, either by editing the XML file directly or using GUI tools such as xml2jupyter [12]. However, it becomes unfeasible to repeat this action several times in large-scale studies. Henceforth, it is crucial to automate this process to run optimization and calibration workflows. Although it is possible to create Python scripts that will edit these files automatically with a standard module such as ElementTree [13], doing so requires users to identify the values to be updated with long strings that reflect the structure of the XML file, as shown in the code snippet below. Here, we aimed to develop a Python class that enables users to read the data from these configuration files more efficiently, making this process less prone to errors. We implemented a ConfigurationFileParser class that reads the data from the configuration file into custom Python objects that follow the expected structure and data requirements defined in the XML file. A more detailed description of this implementation is presented in Table 1. Variable types and numerical constraints are validated when new instances of these data classes are created and when their values are updated. To achieve this, we implemented our classes using Pydantic [14], which improves data validation in Python.
The task described in the code snippet presented previously can be implemented in a more user-friendly way with PhysiCOOL, as shown below:

Black-box models
In complex and large computational models, it may be challenging or even impossible to estimate the model outputs analytically. Consequently, it is common to conduct calibration

Multilevel parameter sweeps
Parameter optimization studies require the definition of a search space, which defines the range of the parameter values that will be studied. There are multiple approaches to defining this space and how to explore it. For example, random search algorithms can be employed to randomly sample points within a defined bounded parameter space.
Alternatively, a grid search, while a more computationally expensive option, systematically samples every point within a defined parameter grid space providing a more comprehensive overview of the model's response than that offered by a random search.
Grid-based approaches have advantages for stochastic frameworks such as PhysiCell, as gradient-based approaches may struggle to accurately calculate the gradient and change the parameter set to minimize the error between the model and the target data.
PhysiCOOL implements a multilevel parameter sweep class (MultiLevelSweep) that is aimed at identifying the parameters that best fit a target dataset through a grid search. In  these two values, the number of points per direction and the percentage per direction.
These values should be configured by the user and optimized for a given problem.
Furthermore, the number of levels and grid spacing parameters are related to the precision and sensitivity of each model parameter. That is, for less sensitivity or less precise models, a single-level coarse grid search may suffice. However, for parameters that require a high level of precision and significantly affect the model outcomes, multiple levels may be beneficial.
The results for each simulation are compared to the target data, and the error between both datasets is computed and stored. At the end of the level, the parameters that provide the minimum error value are selected as the centre of the parameter exploration grid for the next level, and the parameter bounds are updated accordingly.

EXAMPLES Simple model of logistic growth
The first example was implemented to calibrate two parameters of a simple model of logistic growth based on target data that defines a generated growth curve. Therefore, it serves as an introduction to this PhysiCOOL feature, as users are able to fully understand the behaviour of this simple model. It must be remarked that this model was not implemented in PhysiCell. We modelled the number of agents in a population, N, over a period of time t through a logistic function given by Equation (1): where K represents the carrying capacity, i.e., the maximum population size, N 0 represents the number of initial agents and r is the proliferation rate. In this study, we fixed the initial number of agents and evaluated how the carrying capacity and the proliferation rate regulated the growth curve of a population. An example of two growth curves obtained for different model parameters is shown in Figure 2(a).
We generated some target data using this model (K = 1000, r = 0.1) and, subsequently, we  Table 2.

PhysiCell chemotaxis model
The second example can be classified as a more complex problem since it was developed to calibrate a chemotaxis model written in PhysiCell. In this modelling framework, the cells' chemotactic response, i.e., the ability to migrate along a substance gradient, is dictated by a bias value defined between 0 and 1 [3]. When cells have a migration bias of 0, they move in a random walk. Conversely, if the value is set to 1, cells follow the substance gradient in a deterministic manner. Therefore, we developed a model to estimate the cells' speed and migration bias in response to an oxygen gradient based on their travelled distances. We implemented a 2D simulation with an oxygen source on one of the domain walls, as defined by the model's boundary conditions, and a group of cells placed on the opposite wall, as shown in Figure 3(a). We expected that the cells' final position would be modulated by the cells' sensitivity to the oxygen chemotactic gradient. On the one hand, if a cell population had low sensitivity and, thus, moved randomly, they would likely remain close to their initial position as they would move around without following any specific direction. On the other hand, cells that followed oxygen would move towards the opposite wall, as seen in Figure 3(b).
We generated some target data by running a simulation with a migration bias of 0.9 and a speed value of 2.0 μm/min and storing the final y coordinates of the cells. Subsequently, we ran our multilevel sweep pipeline to evaluate whether we could estimate the parameter values that originated this data with a set of initial points different from the target parameter values. The results of this study are shown in Figure 3(c).

Connecting to third-party libraries
PhysiCOOL makes it possible for users to turn their PhysiCell models into black-box models that receive some input parameters and return an output metric. Hence, it is straightforward to couple them with third-party Python libraries that accept this kind of model. For example, psweep [16] is a Python library developed to run parameter studies and save the input parameter values and the returned output metrics into a database. Users must define a set of parameters and, for each of the defined values, psweep will (i) run a given user-defined function that takes these parameters as input and (ii) save the input and output values returned by this function into the database. Therefore, a PhysiCOOL black-box model could seamlessly be integrated into step (i).
In addition, more sophisticated libraries could be considered to perform advanced optimization studies, such as Approximate Bayesian Computation (ABC) and Bayesian Optimization for Likelihood-Free Inference (BOLFI), to sample parameter spaces in a more efficient manner [17][18][19]. Henceforth, although PhysiCOOL offers built-in optimization routines, it can be used in a modular way to take advantage of other libraries that may be more appropriate to a certain study or type of research, without the need to implement these optimization algorithms from scratch.

FUTURE DIRECTIONS
At its current state of development, we believe that PhysiCOOL already improves PhysiCell's accessibility as it provides an intuitive interface to run studies in Python, which is more popular among biology researchers than C++, in which PhysiCell was originally written.
Additionally, this standardized approach provides a straightforward workflow for integrating target data (defined from simulations or biological observations) to constrain the parameter space for agent-based models. In the future, new features can be added to PhysiCOOL, such as the ability to generate non-linear parameter spaces, stopping criteria based on iteration or tolerance for the multilevel sweep and employing alternative optimization algorithms. Although future iterations of this library may include different optimization approaches, its modular design assures that advanced users are still able to build pipelines that suit their needs.

DATA AVAILABILITY
Snapshots of the code and data are available in the GigaDB repository [20].