A tutorial review of mathematical techniques for quantifying tumor heterogeneity

Intra-tumor and inter-patient heterogeneity are two challenges in developing mathematical models for precision medicine diagnostics. Here we review several techniques that can be used to aid the mathematical modeller in inferring and quantifying both sources of heterogeneity from patient data. These techniques include virtual populations, nonlinear mixed effects modeling, non-parametric estimation, Bayesian techniques, and machine learning. We create simulated virtual populations in this study and then apply the four remaining methods to these datasets to highlight the strengths and weaknesses of each technique. We provide all code used in this review at https://github.com/jtnardin/Tumor-Heterogeneity/ so that this study may serve as a tutorial for the mathematical modelling community. This review article was a product of a Tumor Heterogeneity Working Group as part of the 2018–2019 Program on Statistical, Mathematical, and Computational Methods for Precision Medicine which took place at the Statistical and Applied Mathematical Sciences Institute.


Introduction
Heterogeneity in cancer is exhibited at both the intra-tumor and inter-patient levels [1]. Intra-tumor heterogeneity in the form of genotypic and phenotypic variability between tumor cells is present in most solid tumors [2] and has also been identified in hematopoietic cancers such as leukemia [3,4]. Intratumor heterogeneity is driven by the introduction of genetic [5] and epigenetic alterations characterized by genomic instability and evolutionary selection of advantageous cell phenotypes from the cancer cell population [5,6]. Recent studies have revealed that genetic heterogeneity does not necessarily correlate with phenotypic heterogeneity, since not all genetic alterations result in a phenotypic alteration, and only a few genetic alterations provide a fitness advantage [7].
Intratumor heterogeneity is also affected by the tumor microenvironment whereby gradients of positive and negative growth factors, such as oxygen and lactic acid, have direct effects on cell proliferation and survival. Heterogeneous tumor microenvironments also contribute to intercellular variability as tumor cells adaptively switch between different phenotypes, e.g., changing metabolic [8] or migratory states, in response to the availability of nutrients. Furthermore, distinct areas of a tumor may experience different immunologic responses which cause further heterogeneity within the tumor, and these immunological responses can be affected by local nutrient concentrations. For example, most solid tumors experience regions of hypoxic stress [9] which in turn shapes cell phenotype and can induce resistance to an immune response. The presence of this intratumor heterogeneity leads to variability in tumor response to drug treatment [10,11] and makes predictions of tumor growth and spread particularly challenging [12]. In the context of tumor cell invasion in glioblastoma multiforme (GBM), phenotypic heterogeneity is often modeled by using several equations to represent phenotypically distinct subpopulations of the tumor. For example, this approach has been used to model the 'go or grow' mechanism in which one subpopulation is more mobile and the other is more proliferative [13,14], or to model phenotypic switching between hypoxic and normoxic subpopulations [15,16].
The presence of intratumor heterogeneity is increasingly recognized as one of the primary drivers of heterogeneity in patient outcomes. The uniqueness and ongoing evolution of each patient's tumor microenvironment, genetic profile, and cell phenotype composition contributes to the difficulty in predicting the effect of cancer therapies. For example, tumor cells can change their microenvironment to induce immunosuppression [17][18][19] or to protect themselves from drug therapy [20]. A heterogeneous tumor microenvironment may also contribute to variability in drug penetration, creating inter-cellular variability from which therapy-resistant cell populations or stem cell phenotypes can be promoted [21,22]. A recent meeting that included cancer biologists, clinicians, and industry representatives highlighted the central problem with cancer therapy clinical trials under the current paradigm [7]: "...trials for which a single therapy agent is tested in a cohort with a matched biomarker do not provide information about the impact of heterogeneity or the longitudinal evolution of clonal or subclonal cells." Mathematical modeling has successfully described the heterogeneity in the growth and progression of tumors from individual patient data [23]. For example, in [24], a mathematical model was used with MRI data from glioma patients to estimate patient-specific rates of cell proliferation and diffusion, revealing that patients could be stratified into those with diffuse tumors (tumors with a small rate of proliferation as compared to the rate of diffusion) versus nodular tumors (tumors with a high rate of proliferation as compared to the rate of diffusion). Patients with diffuse tumors (identified through the use of a mathematical model) showed no significant survival benefit with gross total resection (GTR) versus biopsy/subtotal resection (BX/STR). Patients with nodular tumors, on the other hand, do exhibit a survival benefit with GTR. This is just one example of the insight that mathematical models provide for precision medicine, but mathematical models can also be used to predict a subject's response to drug treatment [25][26][27], infer differences in environmental factors between metastatic and benign tumors [28], or describe how metastatic cells spread [29]. We note that a subject here may refer to a tissue culture, mouse, or human, all of which are vital parts of the drug development process. There is a growing trend in the mathematical modeling community to focus on personalized mathematical models that investigate not only the mean patient response, but also how individual patients' responses will vary due to the above-mentioned sources of heterogeneity [25,30]. Although large omics datasets exist due to high-throughput platforms, a current challenge is to integrate these data into a clinical setting [31,32]. There has been some progress in connecting clinical and statistical investigators [33], but treatment outcomes will be improved if clinicians can continue to be informed by refined models that account for tumor heterogeneity and predict treatment outcomes.
The goal of this review article is to summarize mathematical, statistical, and machine learning methods that can be utilized for estimating intra- and inter-tumor heterogeneity from longitudinal data. This review article was produced by members of the Tumor Heterogeneity Working Group as part of the 2018–2019 Program on Statistical, Mathematical, and Computational Methods for Precision Medicine which took place at the Statistical and Applied Mathematical Sciences Institute (SAMSI). The aim of the SAMSI program was to facilitate interdisciplinary exchange between mathematical, statistical, computational, and health sciences researchers in developing data-driven methodology for precision medicine. We focus on methods for estimating phenotypic heterogeneity from patient data with the use of a mathematical model, rather than deducing genetic heterogeneity by clustering genomics data. Methods of modeling genetic, phenotypic, cell signaling, and spatial heterogeneity in systems biology have recently been reviewed in [34]. Here, we discuss the applicability of the methods for quantifying phenotypic heterogeneity to cancer prognostication and treatment. In section 2, we describe a canonical spatiotemporal Reaction-Diffusion-Equation (RDE) model of tumor growth, the Fisher-Kolmogorov-Petrovsky-Piskunov (Fisher-KPP) model. In section 3, we review the concepts and techniques for generating virtual populations and their application to the study of tumor heterogeneity. We also generate a virtual population based on the Fisher-KPP model. This virtual population will be used in the subsequent sections for the illustration of the different methods for quantifying tumor heterogeneity. In sections 4 and 5, we discuss Nonlinear Mixed Effects Modeling (NLME) and non-parametric estimation for quantifying tumor heterogeneity. We illustrate an application of these two methods to a homogeneous virtual population. In section 6, we summarize Bayesian estimation methods. In section 7, we highlight the use of recent machine learning techniques for the study of tumor heterogeneity. In particular, we illustrate the application of generative adversarial networks (GANs) and variational auto-encoders (VAEs) for generating virtual populations. While we do not utilize clinical data in this paper, the generated data are intended to closely match the attributes of clinical data. By generating our own data, we are able to provide an evaluation of the reviewed methods by comparing estimated parameter and data distributions to their ground truth values. Most of the sections in this paper are accompanied by open-source code that can be accessed in a Github repository at https://github.com/jtnardin/Tumor-Heterogeneity/. This code provides data-driven tools for cancer modeling researchers that can be used to develop methods for connecting tumor heterogeneity to variability in treatment outcomes.

The Fisher-KPP model for tumor growth
Reaction-diffusion equation models were introduced into the cancer modeling literature in the mid-1990s by several groups [35][36][37] and have been employed with preclinical and clinical data [38][39][40][41][42]. To generate synthetic datasets for demonstrating the performance of various methods for quantifying tumor heterogeneity in a tutorial-like setting, we choose to focus on the Fisher-KPP reaction-diffusion equation model of tumor growth in the absence of treatment. While this model does not directly account for tumor necrosis, vascularity, immune response, or treatment response, it is frequently used to model tumor growth from longitudinal imaging data, so we focus on it for the rest of this review [23,24]. If one instead has access to time-varying scalar data (such as tumor volume over time), then an ordinary differential equation, such as the logistic equation, would be appropriate [30,43,44]. We will state a general version of the PDE model first; in later sections we discuss simplified parameterizations that are useful for demonstrating key statistical concepts. Several of our assumptions are motivated by clarity rather than fidelity to any particular clinical or preclinical context.
The Fisher-KPP model for tumor growth in the absence of treatment takes the form of a nonlinear reaction-diffusion type Partial Differential Equation (PDE) for a density of tumor cells n = n(x, t) (units: number per unit volume):

$$\frac{\partial n}{\partial t} = \nabla \cdot \left( D \nabla n \right) + \rho\, n \left( 1 - \frac{n}{\kappa} \right) \ \text{in } V, \qquad \mathcal{B}n = 0 \ \text{on } \partial V, \qquad n(x, 0) = n_0(x). \tag{2.1}$$

The spatial domain V ⊂ R^d corresponds to the biological domain the tumor is growing in, which we take to be the planar set V = [0, 1]² for convenience. The units of (x, t) are centimeters and days, respectively, and the units of all other quantities follow by dimensional analysis. The boundary condition Bn can be taken to be Dirichlet, no-flux, or free; we assume homogeneous Dirichlet, that is n(x, t) = 0 on the domain boundary, for convenience. We also assume that the initial condition n_0 is fixed and known, and that all tumors grow from an initial population of N_0 = 5 cells localized around a central point (x_0, y_0) = (0.5, 0.5) according to an isotropic Gaussian distribution with standard deviation σ = 0.01 cm (i.e., 100 µm):

$$n_0(x, y) = \frac{N_0}{2\pi\sigma^2} \exp\left( -\frac{(x - x_0)^2 + (y - y_0)^2}{2\sigma^2} \right). \tag{2.2}$$

In its full generality, the model permits the parameters (D, ρ, κ) to be spatiotemporally inhomogeneous fields, i.e., D = D(x, t), ρ = ρ(x, t), κ = κ(x, t). Such generality allows the user to model intra-patient and intratumoral heterogeneity, but owing to statistical or computational complexity, it is frequently convenient to restrict one or all of these fields to be constant or piecewise constant in time and/or space. Such simplifications are assumed in sections 4-6 below, while sections 3 and 7 maintain the spatially varying parameters. Extending the methods discussed in sections 4-6 to the fully spatial case is a subject of future work. In any case, we assume that D, ρ, κ are described by a p-dimensional parameter vector β ∈ R^p. For example, in the fully homogeneous case the three coefficients in (2.1) are assumed constant, i.e., D(x, t) ≡ D_0, ρ(x, t) ≡ ρ_0 and κ(x, t) ≡ κ_0, and hence we can take β = (D_0, ρ_0, κ_0) ∈ R^p with p = 3. For the spatially inhomogeneous case, (D, ρ, κ) are functions of a spatial variable x, so we must select a parameterization that is able to generate such spatial variability using a finite-dimensional β. This is achieved by way of a field synthesis map, which is constructed as follows. First, an appropriate set of functions X is selected for the PDE parameter fields, say (D(x), ρ(x), κ(x)) ∈ X = X_D × X_ρ × X_κ, where X is a set of coefficient functions selected to guarantee classical well-posedness of (2.1). Then, a linear or nonlinear synthesis function Φ is constructed that maps parameter vectors β ∈ R^p to their field counterparts in X, that is Φ : R^p → X. For instance, selecting a set of basis functions φ_1(x), ..., φ_{p_D}(x), we can synthesize a spatially inhomogeneous diffusion parameter by the linear mapping

$$D(x; \beta) = \sum_{m=1}^{p_D} \beta_m \phi_m(x), \tag{2.3}$$

where p_D denotes the number of basis functions. Note that one must be careful to select φ_m and β such that the resulting diffusion coefficient is non-negative. Selecting similar basis sets for the coefficients ρ and κ results in a complete parameter-to-coefficient linear synthesis map Φ. More generally, a nonlinear synthesis map such as the model (3.3) below can be employed. In machine learning, a field synthesis map is analogous to a generative model, such as those described in section 7, that produces samples from a distribution by feeding a latent noise vector through a neural network.
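To make the forward model concrete, the following minimal Python sketch integrates the homogeneous version of (2.1) with a finite difference method of lines (the accompanying repository uses MATLAB, so this translation is illustrative only); the grid resolution and final time are arbitrary choices made here for demonstration.

```python
# Minimal method-of-lines solver for the homogeneous Fisher-KPP model (2.1) on
# V = [0,1]^2 with homogeneous Dirichlet boundary conditions. Grid size, final
# time, and coefficient values are illustrative assumptions.
import numpy as np
from scipy.integrate import solve_ivp

Nx = 51                                   # grid points per spatial dimension
h = 1.0 / (Nx - 1)                        # grid spacing (cm)
x = np.linspace(0.0, 1.0, Nx)
X, Y = np.meshgrid(x, x, indexing="ij")

D, rho, kappa = 1e-6, 0.03, 10.0          # homogeneous beta = (D, rho, kappa)

# Gaussian initial condition (2.2): N0 = 5 cells at (0.5, 0.5), sigma = 0.01 cm
N0, sigma = 5.0, 0.01
n0 = N0 / (2 * np.pi * sigma**2) * np.exp(-((X - 0.5)**2 + (Y - 0.5)**2) / (2 * sigma**2))
n0[0, :] = n0[-1, :] = n0[:, 0] = n0[:, -1] = 0.0    # enforce n = 0 on the boundary

def rhs(t, n_flat):
    """Spatially discretized right-hand side of (2.1) (5-point Laplacian)."""
    n = n_flat.reshape(Nx, Nx)
    dn = np.zeros_like(n)
    dn[1:-1, 1:-1] = D * (n[2:, 1:-1] + n[:-2, 1:-1] + n[1:-1, 2:] + n[1:-1, :-2]
                          - 4.0 * n[1:-1, 1:-1]) / h**2
    dn += rho * n * (1.0 - n / kappa)                  # logistic growth term
    dn[0, :] = dn[-1, :] = dn[:, 0] = dn[:, -1] = 0.0  # keep Dirichlet boundary fixed
    return dn.ravel()

sol = solve_ivp(rhs, (0.0, 400.0), n0.ravel(), t_eval=np.linspace(0.0, 400.0, 5))
n_final = sol.y[:, -1].reshape(Nx, Nx)    # approximate density n(x, t = 400 days)
```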
To model both population heterogeneity and patient-specific uncertainty we treat the parameter vector β as random, with distribution P β .
For the homogeneous parameter case, β is a three-dimensional random vector. In the more general spatially inhomogeneous case, β is a p-dimensional random vector, where p ≥ 3 is the total number of parameters employed to describe the spatial variability. The synthesis map Φ then allows us to treat the PDE coefficients (D, ρ, κ) as spatial random fields, since they are a function of both a random component (β) and a spatial index x ∈ V. More explicitly, denoting a realization of the random vector β as β_ω, we can express (D, ρ, κ) as a vector-valued random field as follows:

$$\big( D(x; \beta_\omega),\ \rho(x; \beta_\omega),\ \kappa(x; \beta_\omega) \big), \quad x \in V. \tag{2.4}$$

The induced probability distribution of the coefficient fields is, by construction of the synthesis map, guaranteed to be supported in a set X of functions such that (2.1) is well-posed for all (or almost all) β. For example, we can require that for almost all β, the synthesized carrying capacity κ(β) satisfies κ(x; β) ≥ κ_0 > 0 for all x ∈ V. Note that the ubiquitous Gaussian random field model does not satisfy such a condition; we discuss below a particular non-Gaussian random field model that does provide such a guarantee, and refer readers to [45][46][47] for a more complete background on the theory of random and stochastic differential equations. Note that we have been careful to ensure that (2.1) can be treated path-wise, i.e., without resorting to the theory of stochastic calculus. See [48] for a discussion of the more general case where equations similar to (2.1) are treated in a fully statistical manner using functional calculus. We remark that accurately inferring general random fields D, ρ, κ as random functions of the spatial coordinate x from clinical data requires a large dataset, curated from a large patient population. To our knowledge, such a dataset is not currently available. However, we can simulate these random fields and ensure they are consistent with the available data by using statistical methods to create virtual patients and virtual populations.
In section 3 below, we will introduce the concept of a virtual patient. In the context of the model (2.1), each realization of β ∼ P_β corresponds to a virtual patient. Note that the full statistical description of P_β may require additional population parameters to be fully specified (e.g., means and (co)variances); we denote these parameters by θ throughout this study, and denote the probability distribution with parameter θ as P_{β|θ}, or alternately with a Probability Density Function (PDF) p(β|θ) (abusing notation by using the variable β to denote both the random variable and the argument of its PDF). As the probability distribution P_β is a precise mathematical description of a heterogeneous population, as modeled via the patient-specific parameter β and the PDE (2.1), quantifying inter-patient heterogeneity in the modeling context of this paper is analogous to performing statistical inference on P_β, for instance estimating the mean β̄ and covariance matrix K_β.
Given a realized parameter β, the primary goal is usually to predict the future state or some vector Quantity-of-Interest (QoI) y ∈ R^q. Such a prediction takes the form of a mathematical model

$$y^{(\text{model})} = \mathcal{M}(\beta). \tag{2.5}$$

The dimension q of y is typically small, as this vector tends to contain only a limited number of human-interpretable, and preferably observable, quantities of interest. While in the formulation (2.5) we have M : R^p → R^q, it is worth noting that it may be necessary to first predict a high- or infinite-dimensional quantity to compute the final QoI y. For example, assuming that (2.1) is well-posed for all β, we can first construct a deterministic parameter-to-solution map for the PDE, written as

$$\mathcal{S} : \beta \mapsto n(\cdot\,, \cdot\,; \beta). \tag{2.6}$$

From the solution (2.6), we can further predict (say) the total tumor size (in number of cells) at some fixed time T via

$$y = N(T) = \int_V n(x, T; \beta)\, dx. \tag{2.7}$$

We employ the commonly used modeling assumption that the true value of y, denoted y^(true), is related to the model prediction via

$$y^{(\text{true})} = y^{(\text{model})} + \epsilon^{(\text{model})}, \tag{2.8}$$

where ε^(model) is a model error, which may consist of both systematic error (including numerical discretization errors) and calibration noise, and which depends on β. A measured data point y^(meas) will be related to y^(model) via

$$y^{(\text{meas})} = y^{(\text{true})} + \epsilon^{(\text{meas})} = \mathcal{M}(\beta) + \epsilon^{(\text{model})} + \epsilon^{(\text{meas})}, \tag{2.9}$$

where ε^(meas) is measurement noise, which typically has mean zero, indicating an unbiased measurement method. We discuss measurement error further in section 6. Note that as written, M(β) is potentially over-parameterized, particularly if the dimension of the QoI is small and a large number of parameters have been employed to account for spatial variability. This leaves open the opportunity of performing model order reduction, which seeks to eliminate some parameters by selecting β′ ∈ R^{p′} with p′ < p. We briefly discuss sensitivity-based model reduction strategies in section 6.
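Continuing the solver sketch above, the QoI map (2.7) and the error model (2.8)-(2.9) can be illustrated in a few lines; the 1% and 5% noise levels here are arbitrary assumptions, not values used elsewhere in this paper.

```python
# Sketch of the QoI map (2.7) and error model (2.8)-(2.9), reusing n_final and
# h from the solver sketch above; the noise magnitudes are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
N_T = n_final.sum() * h**2                     # Riemann sum approximating (2.7)
y_model = N_T                                  # y_model = M(beta), Eq (2.5)
y_true = y_model + rng.normal(0.0, 0.01 * N_T) # add a model error, Eq (2.8)
y_meas = y_true + rng.normal(0.0, 0.05 * N_T)  # add measurement noise, Eq (2.9)
```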

Virtual populations
A virtual individual is a digital representation of a biological counterpart (e.g., a cell, mouse, or patient), which is comprised of a set of parameters describing the biological characteristics of the individual. Some parameters can be directly measured from the real individual, for example, the results of a blood panel, gene sequencing data, imaging data, or other time series measurements. Other parameters may not be directly measurable, but can be estimated by fitting a mathematical model to available clinical or experimental data. For example, the logistic ordinary differential equation model can be used to estimate a patient's tumoral carrying capacity from tumor volume data over time [30,43]. A virtual population, defined as an ensemble of many virtual individuals, provides a natural in silico platform to investigate tumor heterogeneity.
Mathematically, a virtual individual is characterized by a vector parameter β, which can in principle be high-or infinite-dimensional to model functional data such as time series, scalar and vector field quantities. Spatiotemporal parameters can effectively quantify many forms of intra-patient and intratumoral heterogeneity, and their (direct or indirect) measurement is usually possible in the laboratory or clinic via biosensors and imaging techniques [49]. In this paper, we will assume for simplicity that β ∈ R p is finite-dimensional, so all spatiotemporal quantities are assumed to be approximated by an appropriate linear or nonlinear synthesis map. For example, a spatially varying diffusion coefficient D(x) could be approximated by its values on a discrete grid, or alternately by its projection onto a finite-dimensional subspace.
Inter-patient (i.e., population) heterogeneity is then described by treating β as a random vector with probability distribution P_β. A realization of β is denoted as β_k, corresponding to 'virtual individual k', so that a collection of virtual individuals β_1, ..., β_K corresponds to a virtual population. In the setting of precision medicine, it is frequently useful to incorporate patient-specific uncertainty by considering samples from the conditional distribution P_{β|g_k}, where g_k is some patient-specific data such as quantitative imaging and/or blood test results. We will provide a more detailed review of patient-specific modeling in section 6. Note that unless otherwise necessary, we will use the same notation for both a population and a patient-specific random vector, namely the random vector β has realizations β_k and distribution P_β, but it is important to recognize that this vector is random for different reasons in the population and patient-specific cases. In a Bayesian mindset, a sample from a population is analogous to a sample from a prior distribution, while a patient-specific virtual patient is a sample from a posterior distribution. Note also that the statistical description of P_β can either be parametric, i.e., P_β = P_β(θ), with θ being a finite-dimensional population parameter, or non-parametric, where no such parametric assumption is made. Sections 3, 4, 6, and 7 focus on parametric descriptions, while section 5 discusses a patient-specific non-parametric estimation technique.

The virtual population framework
When the probability distribution P β is known either explicitly or implicitly, a virtual population can be generated directly by drawing samples β 1 , . . . , β K from P β , as will be demonstrated in section 3.3. Depending on the complexity of P β , this may be computationally straightforward, e.g., for low-dimensional uniform or Gaussian random β, or more demanding, for example when β parameterizes functional parameters such as diffusion coefficients or a drug concentration c(x, t) which may require solving an auxiliary random PDE in order to sample, or when β is conditional on patient-specific data, i.e., sampling a posterior distribution P β|g k .
In practice, however, the probability distribution P β is often partially or completely unknown. In such situations, one can estimate the underlying population probability distribution, P β , from empirically sampled data that may be directly or indirectly related to β [50,51]. One possible approach is to define pre-specified parameter ranges that are known to generate biologically feasible output from the mathematical model. For example, supposing that β = (D 0 , ρ 0 , κ 0 ) is a spatially homogeneous parameter in (2.1), it may be possible to determine a set of intervals for each component of β that will lead to biologically plausible output from the QoI model, for instance non-negative finite tumor sizes. Generating virtual patients by initially sampling uniformly from this predefined plausible parameter range allows us to sufficiently explore the broad range of responses that are possible. However, these preliminary model predictions might not reflect the distribution of some observed population-level data, which means the generated virtual populations do not have the correct population characteristics.
Several methods have been developed to address this issue [50][51][52][53]. For example, in the method presented by Allen et al. [50], a "plausible patient population" is generated by first matching predefined input (parameter) ranges and output (quantity-of-interest) ranges. Following our notation, this method supposes initially that P_β is a uniform distribution on some set L^(input) ⊂ R^p. Then, defining a plausible range of outputs y^(model) ∈ L^(output) ⊂ R^q, we say that a sampled parameter β_k is plausible if M(β_k) ∈ L^(output). The method proposed in [50] suggests generating a virtual population by first sampling plausible patients by applying a simulated annealing algorithm to an optimization objective designed to enforce the plausibility constraint. Then, a given plausible patient β_k is included in the final virtual population with some probability, calculated based on an observed QoI distribution. The inclusion probability is based on: i) the probability of inclusion for the whole plausible population, and ii) the ratio of the output probability density of the observed population to an estimated probability density for the plausible population. In this sampling step, the probability of inclusion for the whole plausible population is optimized until the generated virtual population (which is a subset of the plausible patients) matches the distribution of the observed population data by some measure of statistical similarity, such as the Kolmogorov-Smirnov test statistic.
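The sketch below illustrates the two-step spirit of this approach (uniform sampling of plausible patients, then density-matching rejection) on a toy scalar model; the function M, the input/output ranges, and the target density are hypothetical stand-ins, and the simulated annealing and optimization machinery of [50] is omitted.

```python
# Simplified accept/reject generation of a virtual population in the spirit of
# [50]: (1) uniformly sample plausible patients, (2) thin them so the model QoI
# matches an observed target distribution. All models and ranges are stand-ins.
import numpy as np
from scipy.stats import gaussian_kde, norm

rng = np.random.default_rng(1)

def M(beta):
    """Stand-in scalar QoI model y = M(beta) for a 2-parameter patient."""
    return beta[:, 0] * np.exp(beta[:, 1])

# Step 1: plausible patients = uniform samples whose output lands in L_output
beta = rng.uniform([0.5, 0.0], [2.0, 1.5], size=(20000, 2))   # L_input ranges
y = M(beta)
mask = (y > 1.0) & (y < 6.0)                                  # L_output range
beta_p, y_p = beta[mask], y[mask]

# Step 2: thin plausible patients to match an observed QoI density
target = norm(loc=3.0, scale=0.6)        # stand-in for the observed population
kde = gaussian_kde(y_p)                  # estimated density of plausible outputs
ratio = target.pdf(y_p) / kde(y_p)
keep = rng.uniform(size=y_p.size) < ratio / ratio.max()
virtual_population = beta_p[keep]        # final virtual population
```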
It is worth noting that the ordinary accept-reject sampling approach presented in [50] suffers from poor sampling efficiency since it searches a large space of parameters and generates a significant number of plausible but unlikely parameter realizations that are ultimately rejected. Rieger et al. [51] improved upon this methodology by introducing more efficient sampling methods which include nested simulated annealing, a genetic algorithm, and Metropolis-Hastings-based importance sampling, and compared these different methods in their work. Another approach called the MAPEL (Mechanistic Axes Population Ensemble Linkage) algorithm, presented in [53], also follows the two-step process of generating a cohort of plausible patients followed by matching statistics to an observed population. Instead of rejecting plausible patients, however, the MAPEL method computes a prevalence weight for each plausible patient, with the prevalence weight computed to match the model output statistics to response bins measured in a clinical trial, according to a composite goodness-of-fit objective that acts at the level of histogram bins. While in theory each virtual patient could have a distinct prevalence weight, the MAPEL method elects to divide the parameter axes into bins, assigning a uniform prevalence weight to each bin (effectively modeling the parameter distribution P β as a multinomial distribution). A Nelder-Mead optimization method is then employed to update the prevalence weights.
Other methods for generating virtual patient populations for use in in silico clinical trials may also be applicable for the study of tumor heterogeneity. We refer readers to Chapter 5 of [54] for a more complete overview of recent successes in computer-aided clinical trials, and to [55] for a more theoretical discussion of virtual clinical trials.

Previous literature of virtual populations for tumor heterogeneity
Virtual populations have been employed in various contexts that are relevant to the study of tumor heterogeneity in both the drug development and clinical contexts. Specifically, virtual cell populations, mouse populations and human patient populations can be employed to understand the sources of heterogeneity at different stages of drug development.
The main motivation for generating virtual cell populations is to characterize population-level responses to a perturbation such as an anti-cancer treatment. Due to variations in the concentrations of intracellular signaling species at the time of the perturbation, different cells can have different responses to the same perturbation. Albeck et al. [56] studied variability in cell apoptosis by generating a virtual cell population from a mathematical model of intracellular signaling networks. They validated this model with experimental data, and suggested that this virtual population technique can be applied to the development of pro-apoptotic cancer therapies. Note that an initial distribution of cell states is needed in order to simulate the cell population's response. For example, in [57] a gamma distribution was assumed for the concentrations of intracellular signaling species [58].
Once a potential anti-cancer treatment has been discovered, virtual mouse populations may aid the analysis and therapeutic design for preclinical studies of a potential drug treatment. In [25], a technique called Virtual Expansion of Populations for Analyzing Robustness of Therapies (VEPART) was introduced and used to study the robustness of a tumor population in response to different treatment protocols. Given time course data on tumor volume and a mathematical model that has been initially fit to aggregate data, a non-parametric bootstrap technique is used to amplify the sample population to a large cohort of statistically plausible patients. This method is illustrated on mice with melanoma that receive oncolytic virus (OV) or dendritic cell (DC) vaccines. The virtual population was then used to design the optimal ordering of three doses of the OV and three doses of the DC vaccine (e.g., OV-OV-OV-DC-DC-DC designates three consecutive OV doses followed by three administrations of the DC vaccine over 6 days). Each treatment was ranked by how often it was found to be the optimal treatment for the entire virtual population. The current standard of care was ranked as the best treatment in some cases, but only led to tumor eradication in 43% or less of the virtual population. By changing the doses of OV and DC, however, the authors showed that a different treatment schedule can lead to a robust treatment that eradicates tumors in 84% of the virtual population.
Moving up to a clinically relevant scale, virtual patient populations have been employed in a wide array of medical contexts, ranging from predicting outcomes for blunt force trauma patients [59] to studying insulin sensitivity in type 2 diabetes [52], and have been suggested as a method for evaluating patient-specific chemotherapy regimens [55]. For drug development, it is important to predict not only the mean response to an intervention but also the overall distribution of the population responses. In particular, virtual populations have been employed to help identify the key mechanistic parameters that give rise to variation in the response [53]. Virtual patient populations have also been used for evaluating medical radiation safety and performance of imaging systems [60,61] and to evaluate various radiotherapy protocols [62]. In the context of medical device assessment, a virtual patient is usually called a computational human phantom, and several phantom models employ randomization, morphing and posing to generate a virtual patient population that mimics a real population. While computational phantoms have historically emphasized anatomical structures, others are working to also incorporate functional (i.e., physiological) mechanisms, and thus may eventually provide a compelling platform for treatment evaluation in cancer therapy.
See [61] for a review of the current state-of-the-art in computational human phantoms.
In addition to the development and optimization of therapeutic agents, virtual populations have been used to guide the design of cancer biomarkers. Depending on the cancer, more than a decade may be needed to finish the Randomized Controlled Trials (RCTs) to validate the effectiveness of diagnostic markers. In contrast, emerging biomarkers may diffuse into clinical practice long before survival data becomes available. In the interim, Comparative Effectiveness (CE) studies typically collect data on how testing for a biomarker impacts an intermediate endpoint for treatment recommendations. For example, [63] shows an example of using a virtual population to help evaluate the effectiveness of biomarkers to target cancer treatment. It presents a statistical simulation tool that facilitates the modeling of the mortality impact of interventions. Given CE studies of typical intermediate endpoints, i.e., the common pre-mortality outcomes used to demonstrate the effectiveness of interventions, bootstrapping was used to generate a virtual population to mimic the population in the CE study. The treatment and mortality of the virtual population was then modeled to understand the long-term potential of emerging diagnostic biomarkers. In [64], a virtual population is generated to simulate the population response to a clinically tested therapy. This provides a mechanistic explanation into the heterogeneous response to the treatment and also helps identify several potential prognostic biomarkers.
Virtual population techniques have also been used to understand the impact of healthcare system administration on real patients. In [65], for example, an agent-based model representing colon and colorectal cancer patients was developed which includes a biological cancer evolution model and patients that interact with the healthcare system, including physicians, nurses and equipment. Simulations show promising interpolation results with respect to chemotherapy dosage and radiotherapy dosage. However, the model's ability to interpolate different administration protocols is still limited, and therefore calibration is required for each protocol.

Case study: Virtual population data generation from the Fisher-KPP model
In this section, we present the details of generating simulated tumor growth data with the Fisher-KPP model. We will generate two such virtual populations arising from either: i) a spatially-homogeneous model with inter-patient heterogeneity or ii) a spatially-heterogeneous model with intra-patient heterogeneity. These two datasets provide an ideal platform for demonstrating the performance of various methods for quantifying tumor heterogeneity presented in the subsequent sections. In Table 1, we detail which methods will be used to interpret these datasets through case studies found throughout this article. Each virtual population is produced by simulating a randomized Fisher-KPP model, with differences between the virtual populations arising from differences between the random field synthesis method employed for the model parameters ( D, ρ, κ), as well as different underlying probability distributions P β .
We will select a statistical description for β and a corresponding random field synthesis method to demonstrate two tumor types: spatially homogeneous and spatially heterogeneous tumors. For homogeneous tumors, (D, ρ, κ) are the same at all spatial points, i.e., D(x) ≡ D, ρ(x) ≡ ρ, and κ(x) ≡ κ, and hence β = (D, ρ, κ) ∈ R³.
We will only randomize the growth parameter ρ, fixing D = 10⁻⁶ and κ = 10.001. Note that the nominal value κ = 10.001 was chosen arbitrarily, since it does not influence any subsequent analysis performed on the homogeneous dataset; a more biologically realistic value would be straightforward to incorporate. We sample ρ from the probability distribution

$$\rho_k = \rho_{0,k} + I_k \rho_{M,k}, \qquad \rho_{0,k} \sim \mathcal{N}(\mu_0, \sigma_0^2), \quad \rho_{M,k} \sim \mathcal{N}(\mu_M, \sigma_M^2), \tag{3.1}$$

where I_k = 0 for patients with benign tumors and I_k = 1 for patients with malignant tumors. We obtain a set of 25 tumor density datasets y_k = (y_{i,j,k}), with y_{i,j,k} ≈ n_k(x_i, t_j) denoting the approximated cell density at spatiotemporal point (x_i, t_j) and k = 1, 2, ..., 25 indexing the virtual patient. We produced 15 benign tumors and 10 malignant tumors. The parameter ρ thus samples from a bimodal distribution comprised of benign (low ρ values) and malignant (high ρ values) tumors. The parameter µ_0 corresponds to the mean benign tumor rate of growth, and µ_M corresponds to the mean increase in the rate of growth for malignant tumors. The population parameters σ_0², σ_M² are the variances associated with these values. We set µ_0 = 1 × 10⁻³, σ_0 = 2 × 10⁻⁴, µ_M = 0.03, and σ_M = 1 × 10⁻² for this dataset. Once we have sampled ρ_k from Eq (3.1), we generate β_k = (D, ρ_k, κ) and simulate the Fisher-KPP solution as y_k = M(β_k) using a numerical approximation to Eq (2.6). The dataset generated in this homogeneous setting will be analyzed in sections 4.3 and 5.2.

Table 1. Description of which datasets will be investigated through case studies in later sections of this article. "Homogeneous" refers to the dataset arising from the spatially homogeneous Fisher-KPP model with inter-patient heterogeneity; "Heterogeneous" refers to the dataset arising from the spatially heterogeneous Fisher-KPP model with intra-patient heterogeneity.

Method                                Section(s)   Dataset
Nonlinear mixed effects modeling      4.3          Homogeneous
Non-parametric estimation             5.2          Homogeneous
Bayesian estimation                   6.1          Heterogeneous
Machine learning (GANs and VAEs)      7.2, 7.3     Heterogeneous
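A minimal sketch of the sampling step in Eq (3.1) follows (the repository code is in MATLAB; this Python version is illustrative only):

```python
# Sampling growth rates for the homogeneous virtual population via Eq (3.1):
# 15 benign (I_k = 0) and 10 malignant (I_k = 1) virtual patients.
import numpy as np

rng = np.random.default_rng(2)
mu0, sigma0 = 1e-3, 2e-4    # benign growth rate: mean and standard deviation
muM, sigmaM = 0.03, 1e-2    # malignant increase: mean and standard deviation

I = np.array([0] * 15 + [1] * 10)
rho = rng.normal(mu0, sigma0, size=25) + I * rng.normal(muM, sigmaM, size=25)
# Each beta_k = (D, rho_k, kappa), with D = 1e-6 and kappa = 10.001, would then
# be passed to a numerical approximation of (2.6), e.g., the solver sketch in
# section 2, to produce y_k = M(beta_k).
```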
For spatially heterogeneous tumors, (D, ρ, κ) are assumed to be realizations of spatial random fields as described below. The inhomogeneous data will be employed for patient-specific Bayesian inference in section 6.1 and for methods of machine learning in sections 7.2 and 7.3. We assume that D = D(x), ρ = ρ(x) and κ = κ(x), and model each as independent realizations of a 'lumpy-type' random field model [66,67]. For instance, D is generated by

$$D(x) = \sum_{l=1}^{L(\omega)} \ell(x - c_l; \gamma), \tag{3.3}$$

and the formulas for generating κ and ρ are analogous. In (3.3), the 'lump' function ℓ(x; γ) is always bounded, integrable, and strictly positive throughout the support set V. The number of 'lumps' L(ω) is Poisson distributed with mean L̄, and the random 'lump centers' c_l are independent and identically distributed (i.i.d.) with distribution c_l ∼ P_c. We will assume for simplicity that the lumps have fixed shape γ = γ_0 and that the centers are uniformly distributed in V. We also assume the lump function is an isotropic Gaussian:

$$\ell(x; \gamma_0) = A \exp\left( -\frac{|x|^2}{2\sigma^2} \right).$$

In this case, the random field model is fully specified by the population parameter (L̄, A, σ²). This parameter can be estimated for a given population by applying statistical estimation techniques; see e.g., [68] and section 6 below. Generalizations of these assumptions lead to more intricate field statistics, but at the expense of more parameters to specify, and hence significantly more clinical data from which to infer reasonable distributions (see [69], for example). Note that we have also assumed that the three fields are statistically independent; this assumption is likely not supported by clinical data, but could be generalized, for instance by introducing a nontrivial statistical coupling between the lump centers in (3.3). Inferring the actual structure of this coupling, for instance the mutual covariances between the field parameters, would require extensive data and is a topic of future work. See also [70] in this issue for further discussion. The model (3.3) is flexible and rapid to simulate, and realizations can be guaranteed to satisfy conditions required for well-posedness of (2.1), such as strict positivity. The random field β is fully specified by the 9-dimensional population parameter θ (one triple (L̄, A, σ²) for each of the three fields); the choice of θ we employed for the example virtual population is shown in Table 2, and example realizations are shown in Figure 1. Once a sample of the parameter vector β has been drawn, the PDE model (2.1) is simulated forward in time for each β_k using a finite difference method-of-lines approach [71], employing Matlab's ode45 integrator for the time stepping. The result is a vector y_k = (y_{i,j,k}), with y_{i,j,k} ≈ n_k(x_i, t_j), for 'patient' k at grid position x_i and time t_j (see [70] for further discussion of this numerical technique).
Table 2. Population parameters θ employed to generate the test tumor population. These parameters were hand-tuned to obtain reasonable model output.

Field                       L̄      A       σ²
Carrying capacity κ(x)      20     5e7     5e-2

Using the population parameters θ provided in Table 2 for the lumpy model (3.3), we generate a set of K = 128 sampled realizations of β and y_k^(model) = M(β_k). In the top row of Figure 1, four realizations of the tumor cell density at a fixed time are shown, demonstrating a wide variety of tumor morphologies during the initial growth phase. In the bottom row of Figure 1, we compare these virtual tumor cell densities to a collection of four Apparent Diffusion Coefficient (ADC) maps obtained from a publicly available clinical imaging dataset [72]. A 2D slice of the 3D ADC map for each tumor was extracted using a segmentation mask provided in the dataset. While ADC maps have been related to tumor cellularity [41], we provide these images for qualitative comparison only, and no parameter estimation step was attempted to match the virtual tumors to the real ADC maps. Such parameter estimation is discussed in [70]. In Figure 2, a single realization is shown for four time points, together with the corresponding random field realizations and the total cell number N(t). In Figure 3, population heterogeneity in the quantity of interest N(t) is demonstrated by plotting each of the 128 samples y_k = N_k(t) together with the sample average N̄(t) and the resulting histograms of N(t) at several time points. In Figure 4, we display the sample average tumor cell density at four times, which demonstrates a distinctly uniform 'spheroid' growth.
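To make the lumpy model concrete, the following sketch draws one realization of a field of the form (3.3) with isotropic Gaussian lumps, using the κ(x) parameters from Table 2 and the grid (X, Y) from the solver sketch in section 2; in practice a small positive offset can be added to guarantee κ(x) ≥ κ_0 > 0.

```python
# Minimal sampler for the 'lumpy' random field model (3.3) with isotropic
# Gaussian lumps. Lbar, A, and sig2 follow the kappa(x) row of Table 2.
import numpy as np

rng = np.random.default_rng(3)

def lumpy_field(X, Y, Lbar, A, sig2, rng):
    """One realization: Poisson number of Gaussian lumps with uniform centers."""
    L = rng.poisson(Lbar)                    # random number of lumps L(omega)
    field = np.zeros_like(X)
    for cx, cy in rng.uniform(0.0, 1.0, size=(L, 2)):   # i.i.d. uniform centers
        field += A * np.exp(-((X - cx)**2 + (Y - cy)**2) / (2 * sig2))
    return field                             # add a small floor to enforce positivity if needed

kappa_x = lumpy_field(X, Y, Lbar=20, A=5e7, sig2=5e-2, rng=rng)
```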
For the illustration of methods for quantifying heterogeneity in subsequent sections, we take the resulting simulated tumor cell density population y_1, ..., y_K, and any quantity-of-interest functionals computed from these solutions, to be the ground truth for subsequent experiments. Thus, we treat a given realization of the coefficient field generated using (3.3) as the true latent parameter, and the corresponding QoI functionals applied to the model solution (2.6) as y^(true). Subsequent models may make further simplifying assumptions, such as spatial homogeneity, which will introduce modeling errors ε^(model) relative to our simulated ground truth. While we have elected to use virtual tumors as the ground truth in this review, this is only for clarity and simplicity in describing subsequent statistical methodologies. Applying these methodologies to real clinical or preclinical data is feasible, but requires more careful data processing and is the topic of future work.

Nonlinear mixed effects modeling
Nonlinear mixed effects (NLME) modeling, also known as hierarchical nonlinear modeling, is a statistical framework for simultaneously characterizing both population-level behavior and individual variations. The response behavior can be described using a parameterized mathematical model, from which we can deduce both the average population parameter values as well as how individual parameter values vary from these mean values. The ultimate aim is to characterize the typical behavior of the population, and the extent to which the behavior varies across the individuals, as well as whether or not the variation is associated with individual attributes [73].
As an example, assume we aim to understand the intra-subject processes underlying tumor spread and growth from a large dataset of measured patient tumor volumes over space and time. An example of such a dataset is the virtual population data described in section 3.3. Applying the NLME framework to the fully inhomogeneous random differential equation would be challenging due to the large number of parameters required to account for spatial variability. Thus, in this section we will consider the homogeneous model for which (D(x), ρ(x), κ(x)) ≡ (D, ρ, κ) ∈ R³. By fitting this deterministic PDE model to patient data, we can understand the average rates of spread, growth, and maximal tumor volume for the population, as well as the level of uncertainty in these parameter estimates. Such an understanding may serve as a first step in developing treatment and dosing strategies as well as guidelines for clinical studies [24]. When certain attributes of patients in the population are known (e.g., age, benign/malignant diagnosis, weight), the NLME framework allows one to further understand how these attributes influence the average behavior or uncertainty in the underlying processes.
We first describe the generic NLME framework in section 4.1 and then review previous applications of this framework to investigate tumor heterogeneity in section 4.2. Then, in section 4.3, we illustrate how to apply this framework to infer the underlying patient proliferation rate distribution from the homogeneous virtual population data described in section 3.3.

Figure 1.
In the top row, we show a virtual population of tumor cell densities n(x, t), at t = 335.2 days. These were generated using Eq (2.1) with lumpy random field coefficients using parameters given in Table 2. We give a common color scale to illustrate heterogeneity within the population. In the bottom row, we show MRI-based Apparent Diffusion Coefficient (ADC) maps for a collection of four human brain tumors gathered from the Brain-Tumor-Progression dataset on The Cancer Imaging Archive [72,74]. We provide these images for qualitative comparison only; the virtual tumors and real ADC maps are not intended to match.

The general NLME framework
We will briefly summarize the main procedure of NLME modeling and refer the reader to [73] for an exhaustive review. The general NLME framework consists of two stages: an individual-level model and a population-level model. Let y_{i,j,k} represent the quantity of interest at time t_j and spatial location x_i from the k-th patient. The individual-level model is then given by

$$y_{i,j,k} = f(x_i, t_j; \beta_k) + \epsilon_{i,j,k}, \tag{4.1}$$

where f represents a function governing within-individual behavior, β_k represents the p × 1 vector of model parameters specific to patient k (which may depend on the patient's covariates u_k, such as age, weight, or diagnosis), and ε_{i,j,k} denotes realizations from both ε^(meas) and ε^(model) from Eq (2.9), representing the combination of many potential errors, including random intra-individual uncertainties, natural inter-individual variations, and measurement error. We assume for simplicity that E(ε_{i,j,k} | β_k) = 0. For the problem of tumor spread and growth, y_{i,j,k} represents the tumor density at spatiotemporal point (x_i, t_j) for patient k, and f is the solution to the homogeneous Fisher-KPP PDE, given by Eq (2.1) with β = (D, ρ, κ) ∈ R³. The vector β_k contains the individual patient parameters.

The population-level model describes how variations in the model parameters between individuals are due to both individual attributes and random variations. The general formulation for the population model is given by

$$\beta_k = d\big(u_k, \beta_k^{(\text{fixed})}, \beta_k^{(\text{rand})}\big), \tag{4.2}$$

where the function d has a p-dimensional output, u_k denotes the covariate vector for patient k*, β_k^(fixed) represents fixed effects (i.e., nonrandom population parameters), and β_k^(rand) represents random effects (i.e., random parameters). A simple and common example for the function d is given by

$$\beta_k = \beta_k^{(\text{fixed})} + \beta_k^{(\text{rand})}, \tag{4.3}$$

in which the parameters representing each patient are assumed centered about the fixed effect β_k^(fixed) that is common for the whole population with the same covariates u_k, with random variation β_k^(rand). There are cases where parameter values depend on patients' covariate values. For example, for the problem of tumor spread and growth described earlier, the growth rate ρ for malignant patients should be higher than that for benign patients. In this case, we can write the growth rate as

$$\rho_k = \rho_0 + I_k \rho_m, \tag{4.4}$$

where ρ_0 denotes the growth rate for benign tumors, ρ_m denotes the increase in the growth rate for malignant tumors, and I_k is a discrete covariate equal to 0 for benign patients and 1 for malignant patients. Consequently, the parameter vector β_k has the form

$$\beta_k = (D,\ \rho_0 + I_k \rho_m,\ \kappa). \tag{4.5}$$

There are cases where the parameter values for one patient type have less variation than those for another patient type. This leads to the general linear population model given by

$$\beta_k = A_k \beta_k^{(\text{fixed})} + B_k \beta_k^{(\text{rand})}, \tag{4.6}$$

where A_k is a design matrix determining how patients' covariates influence model parameters and B_k is a design matrix describing the variation associated with each of these estimates for different groups. The function f may not capture all within-individual processes, or may exhibit local fluctuations. As such, f may be viewed as an average over all possible realizations for patient k, and β_k parameterizes these different possible realizations. For example, we observe in Figure 4 that averaged RDE data appear similar to a homogeneous simulation of the Fisher-KPP model.

* Note that in the typical NLME framework, x_i, t_j, and u_k would be concatenated together into one variable, e.g., X_{i,j,k} = (x_{i,k}, t_{j,k}, u_k), but we have chosen to write them as separate variables in this study to be consistent with typical mathematical modeling literature.
If one only has access to time-varying values, then we would have X_{j,k} = (t_{j,k}, u_k).
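To make the two-stage hierarchy concrete, the following sketch simulates a dataset from the individual-level model (4.1) and a population-level model of the form (4.4) with normally distributed random effects; the logistic curve is a scalar stand-in for the Fisher-KPP solution f, and the noise level is an arbitrary assumption.

```python
# Simulating data from the NLME hierarchy: a population model for rho_k with
# random effects (cf. Eqs (4.3)-(4.4)), then noisy individual observations
# via Eq (4.1). The logistic f is a stand-in for the Fisher-KPP solution.
import numpy as np

rng = np.random.default_rng(4)
t = np.linspace(0.0, 400.0, 20)               # observation times (days)

def f(t, rho, kappa=10.0, n_init=0.1):
    """Logistic growth curve standing in for the within-individual model f."""
    return kappa / (1.0 + (kappa / n_init - 1.0) * np.exp(-rho * t))

data = []
for k in range(25):
    I_k = int(k >= 15)                        # covariate: 0 benign, 1 malignant
    rho_k = rng.normal(1e-3, 2e-4) + I_k * rng.normal(0.03, 1e-2)  # population model
    y_k = f(t, rho_k) + rng.normal(0.0, 0.05, size=t.size)         # Eq (4.1)
    data.append((I_k, y_k))
```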

Previous literature of NLME for tumor heterogeneity
In tumor biology, it is fundamental to understand both the population-wide behavior driving growth and progression and the inherent differences between individuals due to heterogeneity in immune responses and spatial microenvironment variables. This has prompted the common use of nonlinear mixed effects modeling in studies of tumor growth. Previous studies of tumor heterogeneity that use NLME modeling include [75][76][77][78][79][80]. Here, we review in detail a few studies that use NLME modeling to predict tumor responses to drug treatment [81,82] and the differences in metastatic spreading between patients [28,29]. For a broader review on the use of NLME for anticancer drug treatment, the reader is referred to [83].
Ferenci et al. [81] use empirical models to describe the effect of an angiogenesis inhibitor, bevacizumab, on final tumor size in mice. Tumor size was measured over time with digital calipers and small-animal MR images during the initial growth phase of a xenograft tumor after mice were implanted with colon adenocarcinoma. The mice were divided into two groups: one receiving a large dose on the first and seventeenth days of the experiment, and the other receiving a smaller daily dose. The considered models include the exponential, logistic, and Gompertz equations. Accordingly, β_k contains the parameters for growth rate, carrying capacity, and time scaling for the different models. If I_k^(group) denotes the group (two large doses or daily small doses) and I_k^(meas) denotes the type of measurement (digital caliper or MRI), then the population model is given by

$$\beta_k = \beta^{(\text{mean})} + I_k^{(\text{group})} \beta^{(\text{group})} + I_k^{(\text{meas})} \beta^{(\text{meas})} + \beta_k^{(\text{rand})}, \tag{4.7}$$

where β^(group) denotes the changes to the mean population response β^(mean) for the different groups and β^(meas) denotes the changes for the different measurement sources. Ultimately, the authors determined that the exponential model could describe the data robustly, whereas the sigmoidal models had large amounts of uncertainty in parameter estimates. The exponential model exhibited decreased growth rates for the daily dosing regime as compared to the mean population growth parameter, suggesting that smaller daily doses are more effective in preventing tumor growth than a small number of larger doses.
In [82], a compartmental ODE model was used to study the reduction in size of low-grade glioma (LGG) in response to three types of therapy: procarbazine, CCNU, and vincristine (PCV) chemotherapy; temozolomide (TMZ) chemotherapy; and radiotherapy. The dependent variables in the model consist of proliferative cells, quiescent cells, damaged quiescent cells, and the concentration of the treatment. The model is fit to data measuring mean tumor diameter (MTD) from MRI scans for 21 PCV patients, 24 TMZ patients, and 25 radiotherapy patients. Each parameter in the compartmental model assumes log-normal random effects. The model for TMZ patients was validated on 96 other TMZ patients, and can accurately predict the 5th, 50th, and 95th percentiles of the observed MTD time course using 200 virtual patients from the model. The model was further able to fit individual MTD trajectories based on treatment type and pretreatment MTD. The accuracy of this model in predicting both population-level distributions and individual trajectories suggests it is a viable tool for future use in predicting treatment efficacy for individual patients.
In [29], a transport model was used to describe metastatic spreading during orthotopic breast tumor xenograft experiments in immunodeficient mice. Tumor size was measured over time with 3D bioluminescence imaging. An ODE model is used to describe tumor growth at the primary tumor site, in combination with a size-structured PDE transport model describing how metastatic cells spread to other areas of the body. The empirical 10-90% intervals of the metastatic mass data for all subjects varied over two orders of magnitude at all time points, demonstrating the large inter-individual variation in the data. The authors used log-normally distributed parameter values to match the model to these empirical distribution values, and concluded that the total metastatic burden was a sufficient metric from which to infer the rates of tumor growth and spread. Their use of NLME in this study demonstrated that individual differences in the primary tumor size lead to increased rates of spreading.
Morrell et al. [28] investigate longitudinal levels of prostate-specific antigen (PSA), a biomarker for prostate cancer. From a longitudinal study [84], PSA levels before, during, and after prostate cancer diagnosis were available for 18 men: 11 who had local or regional tumors and 7 who had advanced or metastatic tumors. The authors used an NLME framework to understand whether PSA levels behave differently for the two patient classifications. To do so, a piecewise linear-exponential model was fit to PSA levels over time, with random effects incorporated into the transition time between the linear and exponential phases as well as the growth rates of the exponential phase. This analysis demonstrated that a significant difference between metastatic and benign diagnoses is the lag between this transition time and the time of diagnosis, suggesting that early detection of the tumor is key to preventing harmful spreading.

Case study: Applying NLME to infer inter-patient heterogeneity
In this section, we illustrate the use of NLME on the spatially homogeneous virtual dataset generated from Eq (2.1) in section 3.3 with D(x) ≡ D ∈ R, ρ(x) ≡ ρ ∈ R, κ(x) ≡ κ ∈ R. To recap, there are 25 tumor profiles {y_k}_{k=1,...,25}, where y_k = n(x, t; β_k) from Eq (2.6) denotes each tumor's spatiotemporal density over time. Note that for simplicity in this tutorial we are assuming that y^(meas) = y^(true). We assume some of the patients have been diagnosed as malignant and the others as benign: each patient is given a covariate variable I_k, which is 0 if they are benign and 1 if they are malignant. For simplicity in this section, we assume D and κ are known: we only need to infer ρ. We let this growth rate take the population model

$$\rho_k = \rho_{0,k} + I_k \rho_{M,k}, \qquad \rho_{0,k} \sim \mathcal{N}(\mu_0, \sigma_0^2), \quad \rho_{M,k} \sim \mathcal{N}(\mu_M, \sigma_M^2). \tag{4.8}$$

The parameter ρ_{0,k} thus corresponds to the benign tumor rate of growth, ρ_{M,k} corresponds to the increase in the malignant rate of growth, and the σ_0², σ_M² parameters are the variances associated with these values. Note that several of the NLME studies discussed in section 4.2 assumed log-normal distributions for some parameters. Such parameters are log-transformed during computation so that the parameters being inferred are normally distributed. We thus have the population-level model

$$\rho_k \sim \mathcal{N}\big(\mu_0 + I_k \mu_M,\ \sigma_0^2 + I_k \sigma_M^2\big), \tag{4.9}$$

and we aim to infer θ = (µ_0, µ_M, σ_0, σ_M) from {y_k}_{k=1,...,25}. Recall that this synthetic dataset was generated by setting µ_0 = 1 × 10⁻³, σ_0 = 2 × 10⁻⁴, µ_M = 0.03, and σ_M = 1 × 10⁻².

Figure 5. Using the NLME framework to infer the distribution underlying ρ for the homogeneous dataset. The solid green curves represent the true underlying distribution for ρ, and the red dashed lines represent our inferred distribution for ρ using NLME. The purple histograms correspond to the number of realizations of ρ for the K = 25 patients. The left panel corresponds to benign patients and the right panel corresponds to malignant patients.
We performed estimation of this distribution for ρ using MATLAB's nlmefit command. This function estimates the underlying distribution by maximizing an approximation to the marginal likelihood assuming that the random effects are normally distributed and observation errors are independent of the random effects and i.i.d. normally distributed [85]. We computed this with an absolute tolerance of 1.0.
Our NLME implementation accurately inferred each of the four distribution parameters as µ̂_0 = 9.440 × 10⁻⁴, σ̂_0 = 2.239 × 10⁻⁴, µ̂_M = 3.44 × 10⁻², and σ̂_M = 1.01 × 10⁻². The distribution (Eq (4.9)) generated by these estimated values is depicted against the true underlying data distribution in Figure 5. The inferred distribution closely matches the true underlying distribution; using a two-sample Kolmogorov-Smirnov test between the true and inferred distributions (using 15 benign and 10 malignant samples from each distribution), we fail to reject the null hypothesis (that the two samples are drawn from the same distribution) at the 5% significance level.
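A sketch of this check using scipy's two-sample Kolmogorov-Smirnov test, drawing fresh samples from the true and inferred benign-group distributions (the malignant group is handled analogously):

```python
# Comparing the true and inferred benign growth-rate distributions with a
# two-sample KS test; the null hypothesis is that both samples come from the
# same distribution.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(5)
true_benign = rng.normal(1e-3, 2e-4, size=15)         # N(mu0, sigma0^2)
est_benign = rng.normal(9.440e-4, 2.239e-4, size=15)  # N(mu0_hat, sigma0_hat^2)

stat, pval = ks_2samp(true_benign, est_benign)
# Fail to reject the null (same distribution) when pval > 0.05.
```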

Non-parametric estimation
As mentioned in section 3, the parameters in random differential equations are random variables, i.e., they are associated with a probability distribution. In this section, we describe techniques for estimating these probability distributions using spatio-temporal aggregate data. The focus is on non-parametric frameworks that make no assumptions about the form of the probability distribution. This framework has previously been used to estimate the distribution of growth rates in shrimp populations [86,87] and to identify inter-individual heterogeneity in the relationship between transdermal alcohol sensor measurements and blood alcohol concentration [88]. In the context of cancer modeling, this framework has been used to characterize the heterogeneity within a tumor [89]. In Rutter et al. [89], a reaction-diffusion equation was proposed in which the growth rate, ρ, and the diffusion rate, D, are random variables. Synthetic data were generated with a variety of underlying distributions for the parameters (such as a bi-Gaussian distribution, representing a situation in which a portion of the tumor cells grow faster and the remainder grow slower, as well as point distributions, representing the traditional reaction-diffusion equation with constant parameters). The authors then used the Prohorov metric framework (PMF) to estimate the probability distributions of the random variables. Probability distributions of the parameters were successfully recovered even with the introduction of 5% noise [89]. Below, we illustrate the application of the PMF to the inverse problem associated with the Fisher-KPP model, using the same synthetic data as in section 4.

The aggregate model output under the PMF is the expected value of the state over the parameter distribution,

$$ y(x, t; P_\rho) = \mathbb{E}\left[ n(x, t; \boldsymbol{\rho}) \right] = \int_{\Omega_\rho} n(x, t; \rho)\, dP_\rho(\rho). \qquad (5.2) $$
One of the many advantages of this approach is that it is flexible enough to model the regular reaction-diffusion equation (where ρ has an underlying point distribution). We describe the method here for completeness, but refer the reader to [90,91] (and the references therein) for proofs regarding the consistency and convergence of the inverse problem. The inverse problem is formulated to find an estimated probability distribution P̂_ρ that minimizes the least squares cost

$$ \hat{P}_\rho = \arg\min_{P^M_\rho} \sum_{i=1}^{N_x} \sum_{j=1}^{N_t} \left( y_{ij} - \int_{\Omega_\rho} n(x_i, t_j; \rho)\, dP^M_\rho(\rho) \right)^2, $$

where y_ij represents the data at spatial location i and time j, with i = 1, ..., N_x and j = 1, ..., N_t; n represents the numerical approximation to the solution of the partial differential equation; and M is the number of elements (or nodes) used in the finite-dimensional approximation (denoted P^M_ρ) of the underlying distribution P̂_ρ. When the underlying distributions are discrete, we use discrete approximations in which delta functions are equispaced over the desired parameter space. For example, if ρ lies in a finite interval, a parameter mesh of M equispaced nodes between the end points can be used: ρ^M = {ρ_k, k = 1, ..., M}. The inverse problem then simplifies to

$$ \min_{w_1, \dots, w_M} \sum_{i=1}^{N_x} \sum_{j=1}^{N_t} \left( y_{ij} - \sum_{k=1}^{M} w_k\, n(x_i, t_j; \rho_k) \right)^2, $$

where w_k are the weights of the respective nodes. Thus, we aim to determine the weights w_k that minimize the distance between the computed solution and the data. Since the w_k describe a probability distribution, we require that w_k ≥ 0 and Σ_{k=1}^M w_k = 1.
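As a concrete illustration, the following Python sketch solves the discrete weight problem above as a constrained least-squares problem, assuming the model solutions n(x_i, t_j; ρ_k) have been precomputed on the parameter mesh. The array names and the use of SciPy's SLSQP solver are our own illustrative choices, not those of the original PMF implementations.

```python
import numpy as np
from scipy.optimize import minimize

def fit_weights(y, n_sol):
    """Estimate simplex weights w_k for the discrete PMF approximation.

    y     : data array of shape (N_x, N_t)
    n_sol : precomputed solutions, shape (M, N_x, N_t), where
            n_sol[k] = n(x_i, t_j; rho_k) on the parameter mesh.
    """
    M = n_sol.shape[0]

    def misfit(w):
        # Model output: sum_k w_k * n(x_i, t_j; rho_k).
        model = np.tensordot(w, n_sol, axes=1)
        return np.sum((y - model) ** 2)

    cons = [{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}]  # sum w_k = 1
    bounds = [(0.0, 1.0)] * M                                  # w_k >= 0
    w0 = np.full(M, 1.0 / M)                                   # uniform start
    res = minimize(misfit, w0, method="SLSQP",
                   bounds=bounds, constraints=cons)
    return res.x, res.fun
```

Because the model output is linear in the weights, the problem is convex, and specialized quadratic-programming or non-negative least-squares solvers can be substituted for efficiency with large M.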
When the underlying distributions are continuous, we can use spline approximations composed of hat functions to parameterize a continuous distribution. Similar to the discrete case, we generate a mesh of M nodes for the random variable: ρ^M = {ρ_l, l = 1, ..., M}. However, in this case the nodes ρ_l are used to construct piecewise-linear hat functions,

$$ s_l(\rho) = \begin{cases} \dfrac{\rho - \rho_{l-1}}{\rho_l - \rho_{l-1}}, & \rho_{l-1} \le \rho \le \rho_l, \\[4pt] \dfrac{\rho_{l+1} - \rho}{\rho_{l+1} - \rho_l}, & \rho_l \le \rho \le \rho_{l+1}, \\[4pt] 0, & \text{otherwise}, \end{cases} $$

where l = 1, ..., M. The inverse problem then becomes finding the probability density such that

$$ \min_{a_1, \dots, a_M} \sum_{i=1}^{N_x} \sum_{j=1}^{N_t} \left( y_{ij} - \sum_{l=1}^{M} a_l \int_{\Omega_\rho} n(x_i, t_j; \rho)\, s_l(\rho)\, d\rho \right)^2, $$

where the a_l can be considered weights of the hat functions. Since p(ρ) = Σ_l a_l s_l(ρ) represents the probability density, it is required that Σ_{l=1}^M a_l ∫_{Ω_ρ} s_l(ρ) dρ = 1 and a_l ≥ 0. The goal is to find values of a_l that minimize the distance between the computed solution and the data. Generally, we do not know whether the underlying distribution is discrete or continuous. In order to decide which method to use, spline approximations or discrete approximations, and how many nodes to use, we need an unbiased measure. We use the Akaike Information Criterion (AIC) [92], which penalizes the use of a higher number of nodes.
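A minimal sketch of the corresponding model-selection loop, assuming a common least-squares form of the AIC and a hypothetical wrapper fit_weights_for_mesh around the weight-fitting routine above, might look as follows.

```python
import numpy as np

def aic(rss, n_data, n_nodes):
    # Least-squares AIC: N log(RSS / N) + 2M, where N is the number of
    # data points and M the number of estimated nodes (weights).
    return n_data * np.log(rss / n_data) + 2 * n_nodes

best = None
for M in (10, 20, 40, 80):                  # candidate mesh sizes (illustrative)
    # fit_weights_for_mesh is an assumed wrapper: build the mesh of M
    # nodes, solve the PDE at each node, and call fit_weights.
    w, rss = fit_weights_for_mesh(M)
    score = aic(rss, n_data=y.size, n_nodes=M)
    if best is None or score < best[0]:
        best = (score, M, w)
```

The same loop can be run with the spline (hat-function) parameterization, and the lowest AIC across both families selects the final approximation.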

Case study: Applying the PMF to infer inter-patient heterogeneity
In previous work involving subpopulation identification for intra-tumoral heterogeneity, the data v_ji was the aggregate tumor cell population for a single individual at spatial location j and time i. In our case, we aim to understand inter-tumor heterogeneity: instead of identifying subpopulations of cells that grow at different rates, we want to identify subpopulations of individuals whose tumors grow at different rates. To accomplish this, we aggregate over the entire population. For example, if individuals 1 through K have tumor cell populations v^k_ji, the aggregate population is Σ_{k=1}^K v^k_ji. We implemented the PMF for the same 25 synthetic tumors described in section 4, of which 15 are benign and 10 are malignant. We used a discrete approximation to estimate the underlying distribution. In particular, we found that 380 nodes, equispaced between 0 and 0.06, resulted in the lowest AIC score. The resulting estimate exhibits a clear separation between growth rates for benign tumors (< 0.005) and growth rates for malignant tumors (0.02-0.06). In particular, 60% of the tumors are identified as benign, which reflects the true number of benign patients (15 of the 25 synthetic tumors are benign).

Figure 6. The true (purple) and predicted (red) growth rates using a discrete approximation for the 15 benign individuals (left panel) and the 10 malignant patients (right panel). The lines connecting the discrete approximations are for ease of reading the trends only.

Figure 6 displays the 'true' underlying distribution (purple) against the best discrete approximation (red). The PMF method is able to correctly determine that the underlying distribution is bi-modal, with one peak at the low growth rates and a second peak at the higher growth rates. The total probability mass of the predicted distribution lying between the benign and malignant ranges is less than 0.09%. Examining the recovered distribution in the benign case (left panel), we see that it slightly over-estimates the peak of the benign growth rates and is slightly wider than the true underlying distribution. In the malignant case, the discrete approximation captures the true underlying distribution well. We hypothesize that because the sample is sparse (only 10 malignant patients), we do not recover the 'true' underlying normal distribution. Like many numerical schemes, the PMF framework is sensitive to the discrete mesh used in the approximation. In particular, the parameter mesh must be fine enough to fit the benign growth rates, which may result in overfitting of the malignant growth rates.

Bayesian methods
As we have seen, tumor growth models and their associated QoI can vary widely in complexity. For example, the 2-dimensional RDE model (2.1) with homogeneous coefficients (D, ρ, κ), together with Eq (2.7), leads to a QoI (2.5) with input and output dimensions p = 3 and q = 1, respectively (for each fixed time t = T). Alternatively, incorporating spatial heterogeneity into each of the RDE parameter fields using a lumpy-type random field model (3.3), described by a fixed lump function and a fixed number of lumps L, requires 2L scalar parameters (the (x, y) location of each lump) for each of the parameter fields, resulting in a total of p = 6L scalar parameters required to specify the fields, simulate the model output (2.6), and evaluate the subsequent QoI models (2.5). If, instead of a scalar QoI, we require a prediction of the cell density on a spatial grid or at multiple time points, the output dimension q can be very large indeed. In addition, boundary conditions can influence the output in a nontrivial way, and ad hoc choices can lead to inaccurate model predictions.
In an ideal situation, one would have access to patient-specific estimates of the coefficient fields D(x), ρ(x), κ(x) and the initial condition n_0(x), from which the forward model (2.6) could be run to assist in prognostic and therapeutic decision making. The growth factor ρ(x) can be related directly to the glioma cell proliferation rate, and hence estimated in vivo using a molecular proliferation marker such as the Positron Emission Tomography (PET) tracer 18F-FLT [93]. Diffusion Tensor Imaging (DTI) may offer a pathway to estimating the cellular diffusion rate D(x) [94], but this technique requires further clinical validation. To our knowledge, there is no convenient clinical technique to measure either the initial condition n_0(x) or the carrying capacity κ(x), and neither 18F-FLT nor DTI is currently considered standard clinical practice. Thus, the typical strategy is instead to obtain patient-specific estimates of tumor cell density y^(meas)_k = n^(meas)(x_ik, t_jk) using models that relate cellular density to more commonly available imaging data such as Diffusion Weighted MRI (DWMRI) [40,41,95] or FET-PET [96]. These estimates can then be combined with a parameter estimation step to make either population-level inferences about the distribution of β, or patient-specific prognostic and therapeutic decisions. Another possibility is to modify the original growth model in such a way that the model parameters correspond more closely to directly measurable physiological processes instead of phenomenological parameters, a technique emphasized in [48].
Alternatively, one might attempt a more classical approach to statistical model calibration by supposing that a set of calibration data with known ground-truth values (free from, or with minimal, measurement error) of both the model inputs β and the true outputs y^(true) is available. Such data can be used to infer both the distribution of β and the model discrepancy ε^(model). However, there is typically a paucity of such data with which to calibrate tumor growth models; in particular, it is usually challenging to control, or form independent estimates of, the latent parameter β in a context useful for clinically relevant patient-specific inference. For instance, it may be possible to construct in vitro or small-animal experiments that control β (for instance, by modifying nutrient availability and using a synthetic ECM such as Matrigel), but the resulting inferred distributions may not apply to the clinical context. As discussed above, a wide range of clinical imaging data is typically available (e.g., [74] is an open-access database), but imaging data are always noisy, incomplete, and at best only indirectly related to the model parameters of interest, so classical calibration is nearly impossible with such data. With the issues outlined here and in the paragraph above, how do we proceed? A variety of Bayesian techniques are available for performing model calibration at the population level and patient-specific parameter estimation, as well as for providing uncertainty quantification on parameter estimates and model outputs, even when patient data are noisy and incomplete.
Broadly speaking, a Bayesian technique assumes that all parameters are random variables, including population parameters that may ostensibly be non-random. For example, in the RDE model with homogeneous coefficients β = (D, ρ, κ), we account for population heterogeneity by assuming that β is random with probability density p(β|θ), where θ is a population parameter such as the population mean β̄ = (D̄, ρ̄, κ̄) and covariance matrix K_β. While such population parameters are well-defined as large-sample averages, and hence non-random, a Bayesian technique assigns a probability distribution to them anyway. Probabilistic conditioning on the available data, whether population-level calibration data or patient-specific data, then specifies the distribution to be used for subsequent inference. Such a distribution is called a posterior. With the notation and setup outlined in section 3, we assume that β ∼ p(β|θ), where θ is a population parameter and p(·) is a Probability Density Function (PDF). Bayesian techniques are capable of performing inference on θ (to assess inter-patient heterogeneity), as well as on a patient-specific realization β_k (to assess intra-patient heterogeneity and quantify uncertainties). We discuss each case separately, starting with the patient-specific case, since population inference can only proceed with a collection of samples from individual patients.

The patient-specific Bayesian inference framework
In the context of patient-specific inference, we wish to use available data to make inferences about an individual patient's parameter, which we denote by β k here. As throughout this review, we assume that β k is a realization of the random vector β, which we assume to have PDF p(β|θ). While the true population distribution p(β|θ) may be partly or completely unknown, one can assume a particular form for patient k, for instance by selecting a convenient nominal value of θ. The assumed distribution for patient k is called the prior and is denoted p (prior) (β k ). Ideally, the choice of prior is informed by calibration data and not simply convenience; non-informative priors such as uniform and maximum entropy distributions are an alternate choice if such data are unavailable [45,47].
The next component of a patient-specific Bayesian procedure is an explicit measurement model relating y^(meas)_k to y^(true)_k. Such models typically take the form

$$ y^{(meas)}_k = H\left( y^{(true)}_k \right) + \epsilon^{(meas)}_k, \qquad (6.1) $$

where y^(meas)_k ∈ R^m is the measured data for patient k, and H : R^q → R^m is the measurement operator. For imaging data, m is typically quite large (of order 10^6-10^8, especially for volumetric images). For a well-calibrated measurement model, it is usually reasonable to assume that ε^(meas)_k has zero mean (so that E[y^(meas)_k | y^(true)_k] = H(y^(true)_k)) and covariance matrix K^(meas) [67]. The probability density of y^(meas)_k | y^(true)_k is denoted p^(meas)(y^(meas)|y^(true)); as a function of y^(true), this density is called the measurement likelihood and is written L^(meas)(y^(true)|y^(meas)) = p^(meas)(y^(meas)|y^(true)). Solving the statistical inverse problem (6.1) is considered standard practice, with both frequentist and Bayesian techniques available [67,97,98]. In both cases, one has a choice to produce either a point estimate ŷ^(true)_k or a full Bayesian posterior p(y^(true)_k | y^(meas)_k). In the latter case, the prior and likelihood are combined to write

$$ p\left( y^{(true)}_k \,\middle|\, y^{(meas)}_k \right) \propto L^{(meas)}\left( y^{(true)}_k \,\middle|\, y^{(meas)}_k \right) p^{(prior)}\left( y^{(true)}_k \right). \qquad (6.2) $$

However, as discussed throughout this review, we may be more interested in the latent model parameter β than in the state y^(true)_k; as highlighted in [23,24,99] and elsewhere, the parameters D and ρ in the Fisher-KPP model correlate with prognostic outcome more so than the cell density n(x, t).
Combining (6.1) with the model (2.6), we have

$$ y^{(meas)}_k = H\left( M(\beta_k) + \epsilon^{(model)}_k \right) + \epsilon^{(meas)}_k. \qquad (6.3) $$

If H is linear (a reasonable assumption for many imaging systems, for instance), this reduces to

$$ y^{(meas)}_k = H M(\beta_k) + H \epsilon^{(model)}_k + \epsilon^{(meas)}_k. \qquad (6.4) $$

Equation (6.4) highlights a fundamental issue, namely that it is difficult to deconvolve the influence of model discrepancy and measurement error: the measured data may differ from the model-predicted output both due to model error and due to measurement noise, and in the absence of noise-free data it can be difficult to calibrate ε^(model) [100]. This leads many to assume (possibly erroneously) that either the model discrepancy vanishes,

$$ y^{(meas)}_k = H M(\beta_k) + \epsilon^{(meas)}_k, \qquad (6.5) $$

or that it can be absorbed into a single additive noise term,

$$ y^{(meas)}_k = H M(\beta_k) + \epsilon_k. \qquad (6.6) $$

A Bayesian solution to (6.6) is similar to (6.2), except that we are inferring β_k instead of y^(true)_k:

$$ p\left( \beta_k \,\middle|\, y^{(meas)}_k \right) \propto L\left( y^{(meas)}_k \,\middle|\, \beta_k \right) p^{(prior)}(\beta_k). \qquad (6.7) $$

In general, drawing samples from the posterior (6.7) requires a Markov Chain Monte Carlo method such as the Metropolis-Hastings algorithm or one of its more sophisticated variants [47,97]. If one can assume that the noise ε_k in (6.6) is independent of β_k, then the likelihood in (6.7) can be written in terms of the measurement density, L(y^(meas)_k | β_k) = p^(meas)(y^(meas)_k | H M(β_k)). For example, if the measurement noise is Gaussian, we would have

$$ L\left( y^{(meas)}_k \,\middle|\, \beta_k \right) \propto \exp\left( -\tfrac{1}{2} \left( y^{(meas)}_k - H M(\beta_k) \right)^T \left( K^{(meas)} \right)^{-1} \left( y^{(meas)}_k - H M(\beta_k) \right) \right). $$
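For illustration, a minimal random-walk Metropolis-Hastings sampler for the posterior (6.7) under this Gaussian likelihood can be sketched as follows; the forward map (β ↦ H M(β)), the log-prior, and the noise variance are placeholders to be supplied by the user.

```python
import numpy as np

def metropolis_hastings(log_post, beta0, prop_sd, n_samples, rng=None):
    """Random-walk Metropolis-Hastings with a Gaussian proposal."""
    rng = rng or np.random.default_rng()
    beta = np.asarray(beta0, dtype=float)
    lp = log_post(beta)
    chain = []
    for _ in range(n_samples):
        cand = beta + prop_sd * rng.standard_normal(beta.shape)
        lp_cand = log_post(cand)
        # Accept with probability min(1, exp(lp_cand - lp)).
        if np.log(rng.uniform()) < lp_cand - lp:
            beta, lp = cand, lp_cand
        chain.append(beta.copy())
    return np.array(chain)

def make_log_post(y_meas, forward, noise_var, log_prior):
    """Unnormalized log-posterior: Gaussian likelihood plus log-prior."""
    def log_post(beta):
        r = y_meas - forward(beta)          # residual y_meas - H M(beta)
        return -0.5 * np.sum(r * r) / noise_var + log_prior(beta)
    return log_post
```

In practice, the proposal standard deviation prop_sd is tuned so that the acceptance rate falls in a reasonable range (often quoted as roughly 20-50%), and multiple chains are run to assess convergence.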

Previous literature of patient-specific Bayesian inference for tumor heterogeneity
Several groups have pursued the usage of Bayesian strategies for patient-specific model calibration in the context of mathematical oncology. For a tutorial review on this topic in the context of an ODE tumor growth model, see [101]. We provide a brief review of several approaches here.
In [102], a reaction-diffusion system that couples proliferating and migrating cells to an extracellular matrix is combined with a measurement model similar to (6.5) to formulate patient-specific tumor growth prediction as a data assimilation problem. In data assimilation, measured data y^(meas)_{i,k}, collected for patient k at a series of time points t = t_i, are used to sequentially update the estimate of the state y^(true)_{i,k}. The authors employ a technique called the Local Ensemble Transform Kalman Filter (LETKF), which uses an ensemble (i.e., sample) of state estimates to emulate the posterior distribution p(y^(true)|y^(meas)). They do not attempt to perform simultaneous parameter estimation (i.e., solving the statistical inverse problem (6.5)), instead selecting nominal values for β with which to perform the forward prediction y^(model) = M(β). Their approach is thus more akin to a data assimilation solution of (6.1) in the case when y^(meas) is time-resolved.
In [103], the authors discuss an application of the Bayesian calibration and validation framework presented in [104] to a class of diffuse interface/continuum mixture models of tumor growth, and discuss methods of evaluating model fit and selecting the 'best fit' model by comparing quantity-of-interest PDFs from calibration and validation scenarios. The QoI they aim to predict is the same as our Eq (2.7), namely the total tumor burden at some future time t = t_3, which we write as y^(QoI). Using calibration data from t = t_1 and validation data from t = t_2, they form two posterior PDFs (in our notation) p^(calib) = p^(post)(y^(QoI)|y^(meas)(t_1)) and p^(valid) = p^(post)(y^(QoI)|y^(meas)(t_2)). These posterior distributions are then compared using a metric between probability distributions, such as total variation, to evaluate goodness of fit, the idea being that the posterior formed with the calibration data should not differ significantly from the posterior formed with the validation data. They compare three models, concluding that a model incorporating only simple proliferation is rejected in that it cannot reproduce the QoI data generated by a more sophisticated model that incorporates proliferation, apoptosis, and oxygen dependence; however, the model with proliferation, apoptosis, and oxygen dependence cannot be rejected as a surrogate for a model which incorporates time-dependent parameters. While these conclusions are fully in silico and hence do not have direct clinical significance, they demonstrate Bayesian calibration methodologies in the context of tumor growth modeling. A later article by the same group provides an extensive review of the application of Bayesian methodologies to multiscale tumor growth modeling [105].
In [106], the authors assume a Fisher-KPP tumor growth model with homogeneous diffusion and growth parameters D(x) ≡ D and ρ(x) ≡ ρ, and κ ≡ 1, then use a combination of T1Gd (T1-weighted Gadolinium contrast enhanced) and T2-FLAIR (T2-weighted FLuid-Attenuated-Inversion-Recovery) MRI images to form a Bayesian posterior for the parameters (D, ρ). They choose a likelihood model based on the Hausdorff distance between segmented MRI images and the model-predicted tumor, use a variety of priors including a log-uniform prior, and employ a variant of Hamiltonian Monte Carlo (HMC) to perform the posterior sampling. They do not account for uncertainty in the image segmentation itself, nor do they appear to account for spatial inhomogeneity in either model parameter.
The recent article [107] assesses the uncertainty and robustness of a treatment response metric called Days Gained (DG), a QoI derived from a Fisher-KPP model by comparing a patient's actual post-treatment tumor size to a virtual untreated control simulated using the RDE model. The authors evaluated the robustness of and uncertainty in DG by accounting for uncertainty in both the segmentation and the time of tumor initiation, and also evaluated the effect of prior selection, comparing a Gaussian prior calibrated to a real population of glioblastoma patients to an ad hoc uniform prior. Their results indicate that DG is robust both to the uncertainties assessed and to the choice of prior.

The population-level Bayesian inference framework
While the Bayesian techniques discussed in section 6.1 apply to the case of patient-specific uncertainty analysis, it is also useful to apply Bayesian strategies to the population inference case. As we have discussed throughout this review, a heterogeneous population is described by a parameter distribution P_β, which may depend on a population parameter θ. In section 6.1, Bayesian analysis is applied to β; a population-level Bayesian analysis applies instead to θ.

Sensitivity analysis and model reduction framework
When there are many parameters to calibrate, a first step is often to reduce the dimensionality of the problem by performing a sensitivity analysis. Heuristically, sensitivity analysis tries to identify the input parameters that most influence the output values. For example, in the RDE model (2.1), of the original p parameters, it may be that only a few play a significant role in the output QoI. Sensitivity analysis techniques are usually categorized as local or global, with the former studying the effect of parameter variation around a fixed, nominal value, and the latter evaluating the effect of parameter variation across a wide range of possible values. The simplest form of local sensitivity analysis, at least conceptually, is to compute a finite difference approximation of the partial derivatives,

$$ \frac{\partial M}{\partial \beta_i}(\beta) \approx \frac{M(\beta + h e_i) - M(\beta)}{h}, $$

where e_i is the unit vector in the i-th parameter direction and h is a small step size. Other techniques for local sensitivity analysis include forming and solving the forward and adjoint sensitivity equations (the 'FSAP' and 'ASAP' methods) or using automatic differentiation [45,47]. A common approach to global sensitivity analysis is Sobol analysis [45,47,108]. If the effects of the parameters on the output are all statistically independent, Sobol indices measure the relative influence of each of the parameters β_1, β_2, ..., β_p. In essence, the method decomposes the total variance of the output into a sum of contributions due to each individual parameter β_i, plus the contributions due to every pair of parameters (β_i, β_j), plus every triple of parameters, and so on. The first-order Sobol indices yield results similar to the classical Analysis of Variance (ANOVA) technique in statistics. Often, however, the parameters are not independent, in which case the Sobol approach can give misleading results.
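As an illustrative sketch, first-order and total-order Sobol indices for the homogeneous RDE parameters could be computed with the SALib Python package as follows; the parameter bounds and the scalar QoI function qoi are hypothetical stand-ins.

```python
import numpy as np
from SALib.sample import saltelli
from SALib.analyze import sobol

# Sobol sensitivity analysis for the homogeneous RDE parameters
# beta = (D, rho, kappa), with a user-supplied scalar QoI such as the
# total tumor burden at t = T. Bounds below are illustrative only.
problem = {
    "num_vars": 3,
    "names": ["D", "rho", "kappa"],
    "bounds": [[1e-5, 1e-3], [1e-4, 6e-2], [0.5, 1.5]],
}

param_values = saltelli.sample(problem, 1024)        # Saltelli design
Y = np.array([qoi(beta) for beta in param_values])   # run the forward model

Si = sobol.analyze(problem, Y)
print(Si["S1"])   # first-order indices (cf. classical ANOVA)
print(Si["ST"])   # total-order indices, including interactions
```

Note that the number of model runs grows linearly with the sample size and the number of parameters, so an inexpensive surrogate of the forward model is often used in place of the full PDE solver.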
A different methodology is the active subspace method [109]. In essence, active subspaces use the gradient of the output y^(model) with respect to the inputs β, leading to the matrix

$$ C = \mathbb{E}\left[ \left( \nabla M(\beta) \right) \left( \nabla M(\beta) \right)^T \right], \qquad (6.12) $$

which is estimated in practice by a sample average Ĉ over draws of β. The k largest eigenvalues of Ĉ, and the associated eigenvectors, define the most influential directions in parameter space. Often one chooses k large enough to capture some threshold of the total variability, say 90%, the premise being that k ≪ p. Active subspaces generalize the more classical technique of Principal Component Analysis (PCA), which finds linear subspaces in which the majority of the variability resides. By contrast, active subspaces can produce a nonlinear manifold which contains most of the variability.
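A minimal Monte Carlo sketch of the active-subspace construction in Eq (6.12), using finite-difference gradients of a scalar-output model, is given below; the function names are our own illustrative choices.

```python
import numpy as np

def grad_fd(model, beta, h=1e-6):
    """Forward finite-difference gradient of a scalar model at beta."""
    g = np.zeros_like(beta)
    f0 = model(beta)
    for i in range(beta.size):
        step = np.zeros_like(beta)
        step[i] = h
        g[i] = (model(beta + step) - f0) / h
    return g

def active_subspace(model, samples, k):
    """Estimate the k-dimensional active subspace from parameter samples."""
    grads = np.array([grad_fd(model, b) for b in samples])
    C_hat = grads.T @ grads / len(samples)    # (1/N) sum of grad grad^T
    eigvals, eigvecs = np.linalg.eigh(C_hat)
    order = np.argsort(eigvals)[::-1]         # sort largest eigenvalues first
    return eigvals[order[:k]], eigvecs[:, order[:k]]
```

The returned eigenvectors span the directions along which the model output varies most; projecting β onto this subspace gives the reduced parameterization used for subsequent calibration.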
A simpler but related approach is to use Morris indices [47,110]. Assuming the input parameters have been scaled to the unit hypercube β ∈ [0, 1]^p, one selects a random starting point and computes the output value for this input. One then moves a small step (typically 5-10% of the range) in the direction of one of the input dimensions and re-computes the output. One proceeds dimension-by-dimension until all dimensions have been sampled, keeping track of the outputs obtained, as sketched below. One can (approximately) integrate along this path to obtain a value of the total variability of the outputs.
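The following sketch implements one such one-at-a-time trajectory for inputs scaled to the unit hypercube; averaging the absolute elementary effects over many random trajectories gives a screening measure of each input's importance.

```python
import numpy as np

def morris_trajectory(model, p, delta=0.1, rng=None):
    """One one-at-a-time Morris trajectory over [0, 1]^p."""
    rng = rng or np.random.default_rng()
    beta = rng.uniform(0, 1 - delta, size=p)   # random start (room to step up)
    effects = np.zeros(p)
    f_prev = model(beta)
    for i in rng.permutation(p):               # visit dimensions in random order
        beta[i] += delta                       # perturb one dimension at a time
        f_new = model(beta)
        effects[i] = (f_new - f_prev) / delta  # elementary effect for dimension i
        f_prev = f_new
    return effects

# Averaging |effects| over many trajectories ranks input importance.
```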
Having identified the important input parameters, perhaps re-parameterizing the problem in the process, one can then explore the reduced parameter space in greater detail. A common way to proceed is as follows. First, in the less important parameter dimensions, set the parameter values to some nominal accepted values; second, one can apply a design of experiment approach that samples widely in the remaining parameter dimensions, and lastly, one can employ MCMC methods to calibrate the distribution of the important inputs that best represent any available data. Since MCMC-based calibration can be computationally demanding, it is often helpful to construct a statistical emulator, that is, a surrogate model that very quickly estimates the simulation output at untested input values and, importantly, provides an estimate of the error in this estimation procedure. Gaussian Stochastic Processes provide a robust method for emulator construction [111,112].
As part of the calibration process, it is important to check whether the MCMC has overfit the parameters. That is, sometimes the posterior distribution of the parameter inputs is concentrated, but at an incorrect value. Such a finding suggests an overly high degree of confidence in the parameter estimate, even though the parameter distribution does not cover the "correct" value. The most common cause of such over-fitting is model discrepancy, that is, the model does not include all the features that nature is using. This discrepancy is the source of ε^(model) in Eq (2.8). Although the calibration error and model error cannot be distinguished independently, one can estimate the size of the error across likely input values to estimate the extent of the model error [113]. Typically, the model error ε^(model) is modeled as a Gaussian Process.

Case study: Patient-specific Bayesian inference
In Figure 7, we demonstrate a basic application of (6.7) in the case of a linear measurement operator H, making the homogeneous parameter assumption that β = (D, ρ, κ) are spatially constant.
We simulate imaging data by assuming the measurement operator H from (6.1) is a linear blurring operator; that is, the imaging data consist of noisy linear functionals of the form y^(meas)_m = (h * n)(x_m, t_m) + ε^(meas)_m, with 1 ≤ m ≤ M. In this model, h is a Gaussian point spread (blurring) function, * denotes convolution, (x_m, t_m) are the locations and times of the imaging sensors, and ε^(meas) is a mean-zero Gaussian noise vector. We assume that M = 2 × 47^2, i.e., two 47 × 47 pixel images are taken at two separate time points, namely t = 40.6 and t = 81.1 days; see Figure 8. This image data model is a reasonable approximation to many light-based microscope measurements, as well as some reconstructed tomographic images. More detailed imaging system models are discussed elsewhere [67] and are not the purpose of this review. For the experiment shown in Figure 7, the prior p^(prior)(β) is assumed to be uniform on the set (D, ρ) ∈ [0, 10^{-4}] × [0, 1], while κ is fixed at some nominal value. We employ a standard Metropolis-Hastings MCMC method to draw samples from p^(post)(D, ρ), taking the maximum likelihood estimate of (D, ρ) as the initial point and a Gaussian proposal density with standard deviations (σ_D, σ_ρ) = (10^{-11}, 10^{-7}), leading to an acceptance rate of around 50%. Note that in Figure 7, a marked concentration effect takes place whereby the accepted values of (D, ρ) concentrate around a 1D submanifold of the parameter space, indicating that the two available images are not sufficient to fully identify the parameters (D, ρ). If additional imaging measurements were available, for instance molecular imaging of cell proliferation as discussed in the introduction to this section, this identifiability issue might be mitigated; see [70] for more information.
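A simple sketch of this simulated measurement model, assuming a Gaussian point spread function implemented with SciPy and illustrative parameter values, is given below.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def simulate_image(n_grid, psf_sigma=2.0, noise_sd=0.01, out_px=47, rng=None):
    """Simulate y_meas = (h * n) + eps_meas on a pixel grid.

    n_grid    : cell density n(x, t) sampled on a fine square grid
    psf_sigma : width of the Gaussian point spread function (illustrative)
    noise_sd  : standard deviation of the additive Gaussian noise
    out_px    : output image size (47 x 47 in our experiment)
    """
    rng = rng or np.random.default_rng()
    blurred = gaussian_filter(n_grid, sigma=psf_sigma)   # h * n (convolution)
    # Downsample by striding to mimic finite sensor resolution.
    stride = max(1, n_grid.shape[0] // out_px)
    image = blurred[::stride, ::stride][:out_px, :out_px]
    return image + noise_sd * rng.standard_normal(image.shape)
```

Because blurring, striding, and noise addition are all linear (or affine) operations, this construction is consistent with the linear operator H assumed in the posterior sampling above.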

Machine learning
Machine learning has recently emerged as a powerful tool to automate tedious tasks such as image classification and segmentation [114][115][116]. Many of the concepts used in machine learning practice can be applied to personalized medicine; for a recent review on machine learning applications in healthcare, see [117]. One advantage of machine learning methods is the ability to find patterns in data without requiring human input such as manual feature engineering. However, this advantage comes with drawbacks, such as a lack of interpretability and generalizability.
Recent applications of machine learning to cancer research include data processing such as image processing, analysis of genomic data, and quantification of electronic health records. For example, image processing tasks encompass lesion detection [118,119], cell segmentation [116,120], and tumor segmentation [121,122], which feature heterogeneity in image modality (phase contrast, MR, CAT, etc.), machine type, and across patients. For a recent review on deep learning methods for personalizing medicine via imaging, see [123] and the references therein. Recent work includes survival prediction given imaging data and demographic information [124][125][126] in a very heterogeneous patient pool. Genomic expression data have been used to predict tumor type [127] and, in conjunction with histological imaging, to predict survival time in gliomas [128]. For a tutorial and review on deep learning models for genomic data, see [129]. Machine learning has also been applied to translating electronic health records (EHRs) [130] for use in predicting sepsis [131] and diabetes [132]. Within the context of cancer, machine learning can be used to process EHRs to detect colorectal cancer [133] and to predict pancreatic cancer using EHR data in conjunction with PubMed keywords [134].

Figure 8. Simulated image data used in the patient-specific Bayesian demonstration. We assume that direct imaging of n(x, t) is available in the form of y^(meas) = H n + ε^(meas), with H a linear shift-invariant imaging operator and ε^(meas) Gaussian noise. Each image is 47 × 47 pixels.

Previous literature of machine learning methods for tumor heterogeneity
Multi-task learning (MTL) [135], a machine learning methodology in which more than one task is learned simultaneously (e.g., predicting both the type and the color for a single image of a flower), provides an opportunity to account for patient heterogeneity in prediction. By training tasks in parallel, this technique enables shared feature learning through joint modeling of multiple tasks and has also been shown to help with model regularization [135]. For a recent overview of multi-task learning approaches, see [136] and the references therein. Recent work proposed an MTL architecture in which predicting outcomes for separate individuals is treated as multiple tasks [137]. The objective of this work was to predict the health, stress, and happiness of an individual from longitudinal time-series data of a set of 343 features extracted from sensors, smartphone logs, behavioral surveys, and weather information. The authors discovered that higher prediction accuracy was achieved when a multi-task neural network model was used, in essence treating the prediction for each individual as a separate task. The key feature of this multi-task approach is that the task separation occurs within a single neural network architecture, in which the first few layers are shared among all individuals and the last layers branch out to predict each individual's outcome separately. By defining a separate prediction task for each individual or "user" in this way, population-level information is leveraged and population features are "shared" among individuals. Notably, it was shown that the multi-task model with users-as-tasks was significantly more accurate than the typical single-task modeling approach in which a separate neural network model is trained for each individual. This multi-task learning approach could be applied to create individualized predictions of cancer growth in settings where too few data points exist to train an accurate machine learning model for each patient, but data exist for a large population of patients.
More recent work has proposed combining mathematical modeling and machine learning to enhance accuracy and interpretability. The 2019 Mathematical Oncology Roadmap [138] urged the community to begin examining the ways machine learning and mathematical modeling can be integrated. For example, physics-informed neural networks (PINNs) [139] constrain the neural networks used for learning equations [140,141] to obey the laws of physics. Other examples have shown that a hybrid approach to prediction, combining characteristics of both mechanistic modeling and data-driven modeling, can offer improvements in prediction accuracy and parameter estimation over the standalone methodologies [142,143]. Within the context of GBM, recent work showed that constraining machine learning models with the proliferation-invasion model (Eq (2.1)) resulted in more accurate predictions of tumor cell density than mathematical models or machine learning models alone [144].
In the remainder of this section, we focus on an example machine learning methodology for generating virtual data that are similar to the real/original data: generative networks (VAEs, GANs, etc.). Generative modeling in machine learning attempts to produce new samples from some unknown probability distribution P_data given a set of observations from that distribution. In other words, given input data y ∈ R^Ω, a model can be trained to generate data like it, ŷ ∈ R^Ω. Thus, generative modeling, similarly to virtual populations, can provide a platform to investigate heterogeneity. Note that while conceptually similar to the process of generating virtual population data in section 3, where patient-specific virtual populations are generated from the patient-specific distribution of parameters (derived from patient-specific data), generative models in machine learning are instead trained on the patient-specific distribution of the data itself (e.g., images of tumor density). Additionally, these methods frequently assist in other ways, such as increasing the amount of available training data and discovering a data set's salient features.
In most recent applications, generative models typically adopt one of two neural network approaches: variational auto-encoders (VAEs) or generative adversarial networks (GANs). In the following subsections we provide high-level overviews of each method as well as examples where they have been utilized in precision medicine.

Case study: Variational auto-encoders for virtual population generation
Variational auto-encoders [145] are an extension of auto-encoders, which specialize in compressing and decompressing data. Auto-encoders (AEs) are an unsupervised learning method that leverages neural networks to learn lower-dimensional representations of data without the need for feature engineering. Traditionally, AEs are composed of two distinct neural networks, known as an encoder and a decoder. The encoder neural network takes input data y ∈ R^Ω (e.g., images, where Ω is large) and, through a series of feature extraction and down-sampling layers, "compresses" each input into an ℓ-dimensional vector of latent variables z ∈ R^ℓ, where ℓ ≪ Ω. The decoder network takes z as an input and, through a series of feature extraction and up-sampling layers, "decompresses" the latent vector into an output ŷ ∈ R^Ω, which is a reconstruction of the original input data y. The low-dimensional bottleneck between the input data and their reconstructions forces the AE to learn a maximally informative compressed representation of the input data. AEs are trained end-to-end using the reconstruction error (typically mean squared error) between inputs y and reconstructions ŷ. Concretely, given an encoder N_E(y; θ_E) : R^Ω → R^ℓ parameterized by θ_E and a decoder N_D(z; θ_D) : R^ℓ → R^Ω parameterized by θ_D, the objective of auto-encoders is to minimize the cost function

$$ J_{AE}(y, \hat{y}) = \| y - \hat{y} \|_2^2, \qquad (7.1) $$

where ŷ = N_D(z) = N_D(N_E(y)). We note that the parameters θ described in this section refer to the weights and biases of neural network models and do not describe the population parameters from section 3. AEs in this form, however, are not generative, since for a fixed input the reconstructed output is always the same.

VAEs extend auto-encoders to the class of generative models. Similarly to AEs, VAEs are composed of encoder and decoder networks. However, rather than having the encoder output the latent vector z directly, the encoder network in a VAE instead outputs parameters which characterize a probability distribution (typically a normal distribution) from which realizations of z can be sampled. In this way, VAEs encode input data into latent distributions rather than latent samples. For example, if we wish to encode input data y into an ℓ-dimensional normal distribution from which we can easily draw samples, we can modify the encoder neural network to output two vectors µ ∈ R^ℓ and σ ∈ R^ℓ. Then, the decoder network takes as input a random sample z ∼ N(µ, Σ), where Σ is the diagonal covariance matrix with diagonal elements σ_1, ..., σ_ℓ, and outputs a reconstruction ŷ of the original input y. In this formulation, the ℓ-dimensional normal distribution, characterized by µ and σ, is composed of ℓ independent 1-D normal distributions.
Similar to auto-encoders, variational auto-encoders are trained end-to-end by minimizing the reconstruction error between inputs y and reconstructions ŷ. However, in addition to accurate reconstruction quality, the optimization also accounts for similarity in latent representations. Therefore, an additional constraint is added to the VAE objective function, specifically on the latent probability distribution characterized by the encoder outputs µ and σ. In practice, this constraint is the Kullback-Leibler (KL) divergence between the latent distribution N(µ, Σ) and the standard normal distribution N(0, I), where I is the ℓ-dimensional identity matrix. Thus, the objective function for VAEs takes the form

$$ J_{VAE}(y, \hat{y}) = \| y - \hat{y} \|_2^2 + \mathrm{KL}\left( N(\mu, \Sigma),\, N(0, I) \right), \qquad (7.2) $$

where the first term represents the reconstruction error and the second term represents the KL divergence between the latent distribution and the standard normal distribution. By training the encoder and decoder networks with this dual-objective function, the VAE learns to represent high-dimensional input data in a continuous low-dimensional latent space where similar inputs are characterized by similar latent distributions. Given the clear optimization problem with well-defined inputs and outputs, training VAEs is systematic and stable; however, the reconstruction quality can be unrealistic and "blurry" due to the nature of mean squared error as an objective function. The interested reader can find a more detailed explanation of VAEs in [145,146]. We next describe how VAEs can be used to model virtual population data. Other notable recent uses of VAEs in precision medicine include the best-performing segmentation algorithm for the Brain Tumor Segmentation Challenge (BRATS) [147] and Stacked Sparse Autoencoders (SSAEs) for nucleus detection in H&E images [148].

We applied VAEs to modeling the virtual population data of 128 tumor cell density time series generated in section 3.3. For the encoder and decoder we employ deep convolutional neural networks (see https://github.com/jtnardin/Tumor-Heterogeneity/ for code) using a latent space of size ℓ = 100. For training and validation data, we use the first 20 time points of each of the 128 virtual patients, for a total of 2560 observations. To pre-process the data, each 256 × 256 realization is linearly downsampled to 128 × 128 and normalized pixel-wise to [0, 1]. The data are then randomly partitioned into 80% training and 20% validation sets. The training data are augmented using random rotations of 90° during training, while no augmentations are applied to the validation set. To train the encoder and decoder networks, we minimize the VAE objective function in Eq (7.2) using the Adam optimizer with default parameters except for a learning rate of 1 × 10^{-4}, for a total of 5000 epochs with a batch size of 64. The KL divergence term is scaled by 1 × 10^{-5} to keep it from dominating the reconstruction error in the objective function. We report the results of the VAE that achieves the smallest error on the validation set.
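For concreteness, the core of the objective (7.2) with the usual reparameterization trick can be sketched in PyTorch as follows; this is a minimal illustration rather than the exact implementation in our repository.

```python
import torch

def vae_loss(x, x_hat, mu, log_var, kl_weight=1e-5):
    """VAE objective (7.2): reconstruction error plus weighted KL term.

    The encoder is assumed to output (mu, log_var) of the latent Gaussian;
    kl_weight = 1e-5 matches the scaling described in the text.
    """
    recon = torch.sum((x - x_hat) ** 2)        # ||y - y_hat||_2^2
    # Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ).
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl_weight * kl

def reparameterize(mu, log_var):
    """Draw z ~ N(mu, Sigma) differentiably via the reparameterization trick."""
    std = torch.exp(0.5 * log_var)
    return mu + std * torch.randn_like(std)
```

The reparameterization trick expresses the latent sample as a deterministic function of (µ, σ) and an external noise draw, which is what allows gradients to flow through the sampling step during training.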

Figure 9. A virtual patient's tumor cell density (left image, 'Original') is encoded into a latent distribution by the encoder network; three random samples from the latent distribution are then input to the decoder network to produce the three reconstructed images ('Reconstruction 1' through 'Reconstruction 3') to the right of the original.
The trained VAE can be used for generating new data and producing new realizations of already existing data, which could be used synergistically with virtual populations for enhanced investigation of heterogeneity. In practice, the generated data can be used in conjunction with observed data for better mathematical model calibration or to improve the performance of other machine learning models on tasks like classification or segmentation. In Figure 9 we demonstrate how the trained variational auto-encoder (VAE) can be used to produce new realizations of a given observation. An example observation (left) is selected and encoded into a latent distribution characterized by µ, σ ∈ R^100. The next three plots then show various decoder reconstructions from samples z ∼ N(µ, Σ). Figure 9 also illustrates some of the strengths and weaknesses of VAEs. In particular, VAEs allow scientists to encode observed data into a lower-dimensional latent representation from which new realizations similar to the observation itself can be sampled. However, the sampled reconstructions can be blurry, as shown most clearly in Reconstruction 3 (Figure 9), in which the boundaries around the tumor are more diffuse.

Case study: Generative adversarial networks for virtual population generation
Generative adversarial networks (GANs) [149] are another generative modeling approach with widespread use in the field of machine learning. Like VAEs, GANs are also composed of two neural networks, called a generator and discriminator, but rather than being trained end-to-end to minimize reconstruction error, the generator and discriminator train competitively against each other. Intuitively, the generator network attempts to fool the discriminator network by producing realistic looking data while the discriminator is trained to distinguish between real and generated data. This method is formalized in the following paragraph.
The objective of GANs is to produce new samples ŷ ∈ R^Ω by approximating the unknown probability distribution P_data that generates the observed data y ∈ R^Ω in the training set. Since the true probability distribution is unknown, we instead sample from a lower-dimensional probability distribution P_z that is chosen a priori (e.g., standard normal) and map it through a sufficiently complex function (i.e., the generator neural network) to approximate the true distribution P_data. Let z ∈ R^ℓ be a sample from P_z; then the generator N_G(z; θ_G) : R^ℓ → R^Ω, parameterized by θ_G, outputs a generated data sample ŷ. The discriminator N_D(y; θ_D) : R^Ω → (0, 1), parameterized by θ_D, where (0, 1) denotes the open interval between 0 and 1, takes real and generated data as input and tries to predict the binary classification (i.e., 1 for real, 0 for fake) of its inputs. In this way the discriminator output represents the probability that a given sample came from the true probability distribution P_data.
Since GANs are composed of neural networks, they can be trained with a common objective function,

$$ \min_{\theta_G} \max_{\theta_D} \; \mathbb{E}_{y \sim P_{data}}\left[ \log N_D(y) \right] + \mathbb{E}_{z \sim P_z}\left[ \log\left( 1 - N_D(N_G(z)) \right) \right], \qquad (7.3) $$

where N_G(z) = ŷ. The first component of Eq (7.3) trains the discriminator to maximize the probability of correctly classifying incoming real and generated data, while the second component trains the generator to fool the discriminator. Once the GAN is trained, the generator network can be used to generate new samples from the learned approximation of the data distribution P_data by simply drawing samples from P_z. Due to the adversarial objective function, training GANs can be unstable for several reasons, including mode collapse [150], where the generator outputs the same sample regardless of the input, and oscillating/vanishing gradients [151], where the adversarial network parameter updates during training unfairly benefit either the discriminator network or the generator network. Provided that a GAN is trained sufficiently well, the resulting generator model is significantly better at generating realistic-looking samples than VAE reconstructions. The interested reader can find a more detailed explanation of GANs in [149,152]. GANs have been used increasingly in precision medicine applications. For example, they have been used to increase the amount of available training data for the task of liver lesion classification, boosting classification accuracy from 78.6% sensitivity and 88.4% specificity to 85.7% sensitivity and 92.4% specificity [153]. GANs have also been leveraged to address imbalanced brain tumor MRI data and to serve as an anonymization tool for patient data [154]. Several modifications of the GAN architecture exist that extend their capability to more complex domains. In particular, a GAN variant known as CycleGAN enables unpaired image-to-image translation by combining two GANs under a single adversarial objective function [155]. This modification allows for the conversion of images from one domain to another, e.g., apples to oranges or horses to zebras, without the need for aligned image pairs. In precision medicine, a CycleGAN was used to translate X-ray images into synthetic CT scans for more accurate image segmentation [156].
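A minimal PyTorch sketch of one adversarial update for the objective (7.3) is given below; we use the common non-saturating form of the generator loss, and the generator, discriminator, and optimizer objects are assumed to be constructed elsewhere.

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, opt_G, opt_D, real, latent_dim=100):
    """One alternating update of the GAN objective (7.3).

    G, D must output shapes compatible with (batch, 1); D is assumed to
    end in a sigmoid so its output lies in (0, 1).
    """
    batch = real.size(0)
    ones = torch.ones(batch, 1)
    zeros = torch.zeros(batch, 1)

    # Discriminator step: maximize log D(y) + log(1 - D(G(z))).
    z = torch.randn(batch, latent_dim)
    fake = G(z).detach()                       # block gradients into G
    loss_D = F.binary_cross_entropy(D(real), ones) \
           + F.binary_cross_entropy(D(fake), zeros)
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()

    # Generator step: fool the discriminator (non-saturating loss).
    z = torch.randn(batch, latent_dim)
    loss_G = F.binary_cross_entropy(D(G(z)), ones)
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()
```

The non-saturating generator loss (maximizing log D(G(z)) rather than minimizing log(1 - D(G(z)))) is a standard remedy for the vanishing-gradient problem noted above.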
We now demonstrate an implementation of GANs applied to the virtual population data of 128 tumor cell density time series generated in section 3.3. As with the encoder and decoder for VAEs, for the generator and discriminator we employ deep convolutional neural networks (see https://github.com/jtnardin/Tumor-Heterogeneity for code) using a latent space of size ℓ = 100. Further, we use the same 2560 observations with identical pre-processing and rotation augmentations. To train the generator and discriminator networks, we optimize the GAN objective function in Eq (7.3) using the Adam optimizer with default parameters except for a learning rate of 2 × 10^{-4} and a first-moment weight of 0.5. These values were chosen based on [152]. The GAN is trained for a total of 50,000 iterations, where in each iteration we randomly sample and augment a batch of 64 observations. Developing metrics to determine when a GAN model has converged remains the subject of ongoing research; we therefore report the results from the generator that produces the best qualitative results.
GANs share many of the same applications as VAEs, in which generated data can be used for model calibration and improved classification and segmentation accuracy. GANs can also work synergistically with virtual populations as a platform to investigate tumor heterogeneity. In Figure 10 we demonstrate how the trained generative adversarial network (GAN) can be used to produce new realizations of the observed data. The four plots show various generator realizations from samples z ∼ N(0, I). Figure 10 also illustrates some of the strengths and weaknesses of GANs. In particular, the GAN generates realistic-looking samples, but we lack the ability to sample reconstructions of fixed observations in the training set, since the GAN implementation lacks an 'encoding' step. In other words, VAEs can produce a latent sample z from an input image, but with GANs one must draw z randomly each time, since there is no encoder network. Therefore, GANs produce realizations where VAEs produce reconstructions.

Discussion
Understanding how heterogeneity between and within tumors allows cancer to spread and survive is crucial for determining strategies to develop robust patient-specific decisions for precision medicine. Several recent biological studies [157,158] have highlighted the central role of heterogeneity in tumor survival and spread, underscoring the importance for mathematical modelers of combining models with measures of heterogeneity to help determine how heterogeneous growth and spread processes lead to tumor progression. In this review, we have detailed several methods, including virtual populations, NLME, non-parametric estimation, Bayesian statistics, and machine learning, as options for modelers to use in inferring and quantifying tumor heterogeneity with a mathematical model. Each of these methods has strengths and weaknesses that should be considered before implementation.
To demonstrate the strengths and weaknesses of each method, we showcased their practical implementation in this review. We used synthetically generated data for which the true underlying distribution is known and can therefore be compared with the inferred distribution. Although these methods were not applied to clinical data in this manuscript, we have included references and guides for the interested reader. Due to the severity of brain cancer, many open-source imaging datasets, such as The Cancer Imaging Archive [74], consist mainly of a single time point, which is not appropriate for all of the described methods. GANs are one method that can be used to analyze datasets with one time point. In order to infer parameters for underlying PDEs, multiple time points may be necessary. Multiple-time-point in vitro [13], preclinical [159,160], and clinical data for brain cancer could be used to quantify intertumoral heterogeneity using NLME, PMF, Bayesian, and machine learning methods. Such methods could also infer heterogeneity from time-dependent in vitro tumor volume data [161][162][163]. An interesting extension of this work would be to compare and contrast the ability of these methods to predict tumor growth in preclinical/clinical data when the underlying distribution is unknown.
Virtual populations quantify how heterogeneity arising from the distribution of model parameters, P_β, leads to differences in some population-wide quantity of interest. Virtual populations are frequently used to infer P_β, discover anti-cancer treatments, investigate therapeutic drug designs, and predict the sensitivity of patients to different treatment regimens. We generated two virtual populations in this work, one with spatially homogeneous model parameters and another with spatially heterogeneous model parameters, to exhibit the capabilities of the subsequent techniques for quantifying or inferring tumor heterogeneity. NLME is a statistical framework for inferring an underlying parameter distribution for a parameterized mathematical model from distributed data. We were able to use this framework to accurately identify the underlying proliferation rate distribution from the homogeneous data for benign and malignant patients with only a few virtual patient realizations (K = 25). The high computational cost of NLME did not allow us to investigate intertumoral heterogeneity through the spatially heterogeneous virtual population dataset. We suggest that extending this NLME framework for use with a random field model is an interesting area for future research.
Non-parametric estimation within the Prohorov metric framework (PMF) allows for accurate distribution inference from aggregate data without a priori knowledge of the underlying distribution. One drawback of non-parametric estimation is that it requires sufficient data to accurately resolve parameter distributions. Due to potential parameter identifiability issues in estimating parameters for the random field as well as the random parameters, we only addressed inter-patient heterogeneity with non-parametric estimation in this review and did not address intratumoral heterogeneity. The PMF has previously been used to estimate intratumoral heterogeneity [89]. Optimization with respect to both intratumoral and inter-patient variability is left for future work.
Bayesian methods allow for a thorough understanding of the uncertainty associated with any mechanistic parameters. This uncertainty can be investigated with a sensitivity analysis, to determine which parameters most heavily influence the model output, or by constructing patient-specific posterior distributions.
Machine learning is a rapidly developing field with promise for automating and personalizing tasks in medicine, including diagnosis and individualized therapeutic decision making. Generative modeling from machine learning allows for the reconstruction/realization of tumor images from a collection of real images. Such generated samples can be used to understand several underlying tumor properties, including the progression of size over time, shape, and the variability in these measures. While we did not combine this aspect of our study with a mathematical model, the synergy of these methods with mechanistic mathematical models is a promising avenue for future research to create interpretable machine learning tools and better predict tumor invasion.
We used the Fisher-KPP model throughout this review as a model for tumor volume over space and time, which can be used in combination with spatial imaging data. We note that the model used will change between studies depending on the problem being addressed and the available data. Many different mathematical models may be substituted into the methodologies discussed in this review to aid the modeler in quantifying the heterogeneity present in their data.