Parameter estimation and uncertainty quantification using information geometry

In this work, we: (i) review likelihood-based inference for parameter estimation and the construction of confidence regions; and (ii) explore the use of techniques from information geometry, including geodesic curves and Riemann scalar curvature, to supplement typical techniques for uncertainty quantification, such as Bayesian methods, profile likelihood, asymptotic analysis and bootstrapping. These techniques from information geometry provide data-independent insights into uncertainty and identifiability, and can be used to inform data collection decisions. All code used in this work to implement the inference and information geometry techniques is available on GitHub.


Introduction
Computational and mathematical models are versatile tools, providing valuable insight into complex processes in the life sciences. Models can further our understanding of mechanisms and processes, facilitate development and testing of hypotheses, guide experimentation and data collection, and aid the design of targeted interventions [15,72,89,102,105]. However, there are considerable challenges associated with calibrating these models to data. For example, models need to be sufficiently sophisticated to adequately reflect the behaviour of the underlying system, while ideally admitting identifiable parameters that are interpretable, and that can be estimated from available or obtainable data [32,58]. Further, available data can be limited and are often not collected for the express purpose of parameter estimation; data may be noisy, incomplete, or may not provide the level of detail or sample size required to obtain precise parameter estimates [31,43,87,95,101].
Due to the challenges associated with parameter estimation, we are often interested not only in point estimates, but also in the associated uncertainty [22,65,103]. Quantifying and interpreting this uncertainty establishes a level of confidence in parameter estimates, and by extension, in the insights derived from the model. Further, this uncertainty quantification can give insights into identifiability: whether the information in a data set can be used to infer unique or sufficiently precise parameter estimates for a given model [92]. Often we are concerned with both structural identifiability and practical identifiability [9,42,91,99,100]. Structural identifiability can be thought of as a property of the underlying model structure and parameterisation, and refers to whether it is theoretically possible to determine unique parameter values, given an infinite amount of perfect noise-free data [14,92,109]. Structural identifiability requires that distinct parameter combinations produce distinct model outputs. Structural identifiability occurs if and only if the Fisher information matrix, which we soon discuss, is of full rank [50]. In contrast, practical identifiability is less well defined, and depends on the quality and quantity of data available and existing knowledge of the parameters [14]. In the context of profile likelihood methods, practical non-identifiability can manifest as contours of the log-likelihood function that do not admit closed levels: the log-likelihood does not reach a predetermined statistical threshold within the physical parameter regime [59]. If a model is not structurally identifiable, it cannot be practically identifiable.
Practical non-identifiability may be addressed either through improving data quantity or data quality [14,91]. Data quantity can be improved by increasing the number of observations, such as by making additional observations at different time points. Data quality may be improved through reducing the noise present in the data, for example by obtaining a dataset with reduced measurement error or repeating measurements across identically prepared experiments [13,84]. It is also possible to resolve practical non-identifiability through incorporating existing knowledge about parameters, such as physical constraints or information established in previous studies; or, specifically in the Bayesian inferential framework, through informative priors [34]. Addressing structural non-identifiability is more challenging; for example, it may necessitate a change to the underlying model structure [37,84,99].
Uncertainty quantification takes many forms, with common examples including Bayesian methods, profile likelihood, asymptotic analysis and bootstrapping [31,33,70,101,104]. Bayesian methods are widely used for parameter estimation and uncertainty quantification, with Bayesian computation being employed throughout the mathematical biology and systems biology literature. Broadly, these methods involve repeated sampling of parameter values from a prior distribution and invoking Bayes' theorem to approximate the posterior distribution; the posterior distribution describes knowledge about the probability of parameter combinations after taking into account the observed data and any prior information [14,104]. Well-known approaches include rejection sampling, Markov chain Monte Carlo (MCMC) and sequential Monte Carlo (SMC) or particle filtering. In rejection sampling, parameters drawn from a prior distribution are used to simulate the model; simulated data is compared to the observed data based on some distance metric, and if this metric is within a prescribed tolerance, the parameters are accepted as a sample from the approximate posterior distribution, otherwise they are rejected [33,53]. Rejection sampling can be computationally expensive as the rejection rate can be significant with an uninformative prior [56,81]. In MCMC, the parameter space is sampled following a Markov chain, a memoryless stochastic process where the probability of the next state depends only on the current state [76], with a stationary distribution corresponding to the posterior distribution. Samples are accepted or rejected based on the relative likelihood between the current configuration and the proposed sample [6,60,95,104]. For SMC, rejection sampling can be used to produce an initial coarse approximation of the posterior distribution.
This coarse approximation informs further (sequential) sampling efforts in the region of parameter space corresponding to high likelihood, reducing the rejection rate when compared to rejection sampling alone [45,56,95]. MCMC and SMC approaches can offer significantly improved efficiency in comparison to rejection sampling [56,104], but both involve specifying hyperparameters and these choices are not always obvious.
In situations where the likelihood function is intractable or not easily evaluated, Approximate Bayesian Computation (ABC) provides a range of related likelihood-free methods for estimating posterior distributions [94].
Popular approaches include ABC rejection sampling [45,51,81,94,108], ABC MCMC [66,90,93], and ABC SMC [56,95]; we do not focus on ABC methods here, as the approaches we explore in this work are applied to problems with tractable likelihoods. We direct interested readers to the wealth of information in the references provided.
For Bayesian inference methods, uncertainty can be quantified based on features such as the coefficient of variation and probability intervals of the posterior distribution [101]. There are a variety of approaches for uncertainty quantification for frequentist inference methods. In profile likelihood, a parameter of interest is varied over a fixed set of values, while re-estimating the other parameters; providing insight into identifiability and uncertainty [15]. In asymptotic analysis, confidence regions can be constructed based on local information via a Taylor expansion of the Fisher information about the maximum likelihood estimate (MLE) [31,59]. In bootstrapping, data is repeatedly sampled and parameter estimates are computed from the samples; these estimates are used to construct confidence intervals [70].
Through the geometric approaches we review in this work, more akin to traditional approaches for sensitivity analysis [65,69,110], we explore the curvature of the parameter space through an information metric induced by the likelihood function. Whereas likelihood-based approximate confidence regions provide insight into specific level curves of the likelihood function, the levels of which depend on an asymptotic large-sample argument [76], this geometric approach provides insight into the shape and sensitivity of the parameter space. For example, we compute geodesic curves that describe the geometric relationship between distributions with different parameters [68]; and explore the scalar curvature throughout parameter spaces. We review ideas from information geometry in the context of inference and uncertainty quantification; not with a view to replacing established methods such as profile likelihood, asymptotic analysis, bootstrapping and Bayesian methods [31,70,101,104], but rather to supplement them where additional insight may prove useful.
Information geometry is a branch of mathematics connecting aspects of information theory, including probability theory and statistics, with concepts and techniques in differential geometry [1]. In this exposition we seek to outline only the key concepts required to understand the information geometric analysis in this work. However, we note that more thorough and rigorous treatments of the concepts introduced in this section, and of the mathematical foundations of information geometry, can be found in texts and surveys such as [1,17,73]. Central to the information geometry ideas explored in this work is the concept of a statistical manifold: an abstract geometric representation of a distribution space, or a Riemannian manifold consisting of points that correspond to probability distributions, with properties that we later discuss. For example, the set of normal distributions parameterised by mean, µ, and standard deviation, σ > 0,

    p(x | µ, σ) = (1/(σ√(2π))) exp(−(x − µ)²/(2σ²)),    (1)

can be thought of as a two-dimensional surface with coordinates (µ, σ) [17].
In this work we will use θ to refer to the parameters of interest that we seek to estimate; i.e. θ = (µ, σ) for the univariate normal distribution with unknown mean and standard deviation. In Section 3, we consider various combinations of components of θ; including model parameters, variability in observations characterised by a separate observation noise model, and initial conditions associated with a differential equation-based process model.
When referring to all possible parameters, rather than solely the unknown parameters to be estimated, we denote this Θ.
In applications where we consider multiple data sets, or different candidate models or candidate parameter values, we are interested in methods of comparing distributions. A well-known measure for comparing a probability distribution, P, to another, Q, is the Kullback-Leibler (KL) divergence from P to Q, denoted D_KL(P, Q) [26]. The KL divergence, or relative entropy, can be computed as [26]:

    D_KL(P, Q) = ∫ p(x) log(p(x)/q(x)) dx,    (2)

where p(x) and q(x) are the respective probability density functions of P and Q. Consider two sets of parameters, θ* and θ; let log(p(x)) = log(p(x|θ*)) = ℓ(θ*) and log(q(x)) = log(p(x|θ)) = ℓ(θ), where ℓ(·) denotes the log-likelihood, discussed in detail in Section 2. If p(x|θ*) is the true distribution and p(x|θ) is our estimate, then (2) is the expected log-likelihood ratio, and the relationship between MLE and KL divergence becomes evident: maximising the likelihood is equivalent to minimising the KL divergence [71].
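As a brief numerical illustration (the two distributions here are arbitrary choices), the KL divergence between two univariate normal distributions can be computed from (2) by numerical integration, and checked against the known closed form; evaluating both orientations also demonstrates the asymmetry of the divergence:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Two example distributions: P = N(0, 1) and Q = N(1, 2^2)
mu_p, s_p = 0.0, 1.0
mu_q, s_q = 1.0, 2.0

def kl_numeric(mu1, s1, mu2, s2):
    # D_KL(P, Q) = integral of p(x) log(p(x)/q(x)) dx, evaluated numerically
    integrand = lambda x: (norm.pdf(x, mu1, s1) *
                           (norm.logpdf(x, mu1, s1) - norm.logpdf(x, mu2, s2)))
    value, _ = quad(integrand, -30, 30)
    return value

# Known closed form for D_KL between univariate normals
kl_closed = np.log(s_q / s_p) + (s_p**2 + (mu_p - mu_q)**2) / (2 * s_q**2) - 0.5

kl_pq = kl_numeric(mu_p, s_p, mu_q, s_q)
kl_qp = kl_numeric(mu_q, s_q, mu_p, s_p)
```

The asymmetry is clear for this pair: D_KL(P, Q) ≈ 0.44, whereas D_KL(Q, P) ≈ 1.31.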
An issue with the KL divergence is asymmetry: it is not necessarily obvious in a given situation which orientation of the KL divergence will most appropriately inform decisions such as model selection [88]. Due to this asymmetry, and its failure to satisfy the triangle inequality, the KL divergence is not a metric, in the sense of a measure of distance in differential geometry, on a given manifold [17]. One means of addressing the asymmetry is to devise symmetrised forms of the KL divergence to inform model selection criteria [88]. Alternatively, we may approach the issue from a geometric perspective. It is natural to think of geometry in terms of objects or shapes in Euclidean, or flat, space. Euclidean space is characterised by orthonormal basis vectors; the standard basis in three dimensions being e_1 = (1, 0, 0)^T, e_2 = (0, 1, 0)^T, e_3 = (0, 0, 1)^T, where superscript T denotes the transpose. In the n-dimensional orthonormal basis, we can compute the squared infinitesimal distance between the points S and S + ds, where ds has components ds_i, as [2]:

    ds² = Σ_{i=1}^{n} (ds_i)².    (3)

Differential geometry extends ideas from Euclidean geometry to manifolds.
Manifolds are topological spaces that resemble flat space about each individual point in the space; they can be considered locally flat, but may have a different topology globally. The sphere is a classic example, whereby points on the surface are locally topologically equivalent to two-dimensional Euclidean space, but globally the sphere is curved and has a compact topology; it is bounded and closed [5]. In particular, we are interested in Riemannian manifolds: differentiable manifolds, sufficiently locally smooth that our typical notions of calculus remain valid, upon which we are able to measure geometric quantities such as distance, through the existence of a Riemannian metric on the tangent space of the manifold, which generalises the notion of an inner product from Euclidean geometry [54].
A Riemannian metric is a smooth covariant 2-tensor field; on a differentiable manifold M , the Riemannian metric is given by an inner product on the tangent space of the manifold, T p M , which depends smoothly on the base point p [46,54]. A tangent space can be thought of as a multidimensional generalisation of a tangent plane to a three-dimensional surface.
Each point p on a manifold is associated with a distinct tangent space; an n-dimensional manifold thus has infinitely many n-dimensional tangent spaces, one for each point [3]. Formally, we introduce the unique, torsion-free Levi-Civita connection, ∇; an affine connection on the Riemannian manifold that yields isometric parallel transport, such that inner products between tangent vectors, defined by the metric, are preserved [35]. The coefficients of this connection are the Christoffel symbols, which we discuss further in Section 2. Readers are directed to [3,8,35] for further detail regarding the Levi-Civita connection, and how it relates to other concepts discussed in this work. A manifold equipped with such a connection and a Riemannian metric is a Riemannian manifold.
Metric tensors can be thought of as functions that facilitate computation of quantities of interest, such as distances on a manifold. A metric matrix with elements g_ij, G = [g_ij], is positive definite and symmetric [54]. The metric matrix defines an inner product between u and v as ⟨u, v⟩_G = u^T G v, and the length of u as ||u||_G = √⟨u, u⟩_G [20]. On a Riemannian manifold we consider a generalisation of the square of the infinitesimal distance element (3), appropriate for non-orthonormal bases [2], given by

    ds² = Σ_i Σ_j g_ij ds_i ds_j.

Foundational to information geometry is the notion that the Fisher information matrix defines a Riemannian metric on a statistical manifold [83]. For a model, m(θ), observed through an observation process with Fisher information O(m), the Fisher information with respect to the parameters can be written as

    I(θ) = J(θ)^T O(m) J(θ),    (4)

where J(θ) is the Jacobian of the model outputs with respect to the parameters.
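As a minimal sketch of these definitions (the matrix G below is an arbitrary positive definite example, not a Fisher information matrix):

```python
import numpy as np

# An example positive definite, symmetric metric matrix G
G = np.array([[2.0, 0.5],
              [0.5, 1.0]])
u = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])

inner_uv = u @ G @ v               # <u, v>_G = u^T G v
length_u = np.sqrt(u @ G @ u)      # ||u||_G = sqrt(<u, u>_G)
```

Note that under this metric the "unit" coordinate vector u has length √2, not 1; the metric, not the coordinates, determines lengths on the manifold.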
Expression (4) highlights a link between sensitivity analysis, structural identifiability and practical identifiability [63]. For sensitivity analysis and structural identifiability, only the curvature of the model space is studied through J(θ). In practical identifiability analysis, the sensitivity of the model is linked to the data through an observation process, and the curvature of the parameter space is studied through, for example, I(θ).
In this review, we present and explore fundamental techniques in inference and information geometry, including confidence regions, geodesic curves, and scalar curvature, through application to standard distributions and canonical models in the life sciences, including population growth processes and the SIR epidemic model. All code used to implement the inference and information geometry techniques [11] is available on GitHub.
In Section 2 we describe the inference and information geometry methods implemented in this work, including maximum likelihood estimation, profile-likelihood based approaches, geodesic curves and scalar curvature calculations. Results of applying these techniques to univariate and multivariate normal distributions, linear, exponential and logistic growth models and the SIR model, are presented in Section 3. We discuss the utility of these techniques, and identify opportunities for extension and further consideration in Section 4.

Methods
Here we describe the parameter inference and information geometry methods used to produce results in this work, along with the numerical methods used to implement these techniques. The techniques we discuss in this section readily generalise to parameter spaces with an arbitrary number of dimensions, so we present them here for arbitrary dimensions.
However, for the sake of exploring the techniques through visualisation in Section 3, we restrict ourselves to two-dimensional manifolds. In context, this means we consider only two parameters to be inferred in any given example, treating other parameters as known and fixed; for example, as if they are drawn from prior knowledge or pre-estimated.
Although we consider deterministic mathematical models, data used to estimate parameters can exhibit significant variability. We follow a standard approach and assume that the mathematical models describe the expected behaviour, and that our observations are normally distributed about this expected behaviour [42]. This allows us to think about a statistical model, m(θ, t), in terms of its expected behaviour, µ, and the standard deviation of the observations, σ.
We restrict the examples in this work to cases where σ is constant; setting σ(θ, t) = σ. In this work we focus on the most commonly employed additive noise model [42,84,91,95,105]. Additive noise implies that the variance of the data is independent of the mean behaviour. In cases where variance scales with mean behaviour, multiplicative noise may be more appropriate. The information geometric methods presented here are applicable in cases where the Fisher information can be obtained; including models with multiplicative noise, and parameter or time dependent standard deviation.
However, obtaining the Fisher information is a separate challenge, and can be difficult when considering different process and noise models.

Parameter inference
In this work, parameter estimates are inferred from data following a standard maximum log-likelihood based approach. We make observations at L time-points, T = (t_1, t_2, ..., t_L). At each time-point we make N observations, X = (x_1(T), x_2(T), ..., x_N(T)). With this notation the log-likelihood function is

    ℓ(θ; X) = Σ_{i=1}^{N} Σ_{j=1}^{L} log f(x_i(t_j); µ(θ, t_j), σ²),    (5)

where f(x; µ, σ²) is the probability density function associated with our observation process. In this work we hold N constant across time-points, though non-constant N is easily incorporated into Equation (5) as N_j. The likelihood function can be thought of as the joint probability density of all the data for a given set of parameters. In examples where σ is unknown, we treat σ as an element of θ, but note that the expected model behaviour is independent of σ. The MLE is the point estimate, θ̂, that satisfies

    θ̂ = arg max_θ ℓ(θ; X),    (6)

where arg max(·) returns the argument, θ, that maximises ℓ(θ; X) in (6).
The associated maximum log-likelihood is ℓ(θ̂). MLEs of the parameters of interest are obtained by solving (6) numerically, as outlined later in Section 2. For an iid sample from a univariate normal distribution, N(µ, σ²), maximising the likelihood function of µ is equivalent to performing least-squares estimation [14], although having access to the likelihood function facilitates uncertainty quantification.
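A minimal sketch of solving (6) numerically, assuming synthetic data drawn from a univariate normal distribution and a standard optimiser (the data, starting point and optimiser choice are all illustrative):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Synthetic observations, assumed drawn from N(2, 0.5^2)
rng = np.random.default_rng(1)
data = rng.normal(2.0, 0.5, size=200)

def negative_loglik(theta):
    # Negative log-likelihood for theta = (mu, sigma); minimising this
    # is equivalent to maximising the log-likelihood in (6)
    mu, sigma = theta
    if sigma <= 0:
        return np.inf
    return -np.sum(norm.logpdf(data, loc=mu, scale=sigma))

res = minimize(negative_loglik, x0=[1.0, 1.0], method="Nelder-Mead")
mu_hat, sigma_hat = res.x
```

For this observation process the MLE coincides with the sample mean and the (biased, 1/N) sample standard deviation, which provides a direct check on the numerical optimisation.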
Presenting confidence regions alongside MLEs enhances our interpretation of the likelihood function, while still acknowledging that the estimates carry uncertainty [76]. We apply a probability-based log-likelihood approach when constructing confidence regions for model parameters. From Wilks' theorem [76], asymptotically as N → ∞, an approximate α-level confidence region is given by

    {θ : ℓ̂(θ) ≥ −∆_{ν,α}/2},    (7)

where ∆_{ν,α} is the αth-quantile of the χ²(ν) distribution, with ν degrees of freedom [15]. In this work the degrees of freedom correspond to the number of parameters of interest, i.e. ν = dim(θ). To enable comparison between different data sets and models in Section 3, we consider the normalised log-likelihood, ℓ̂(θ) = ℓ(θ) − ℓ(θ̂). This forms the basis for log-likelihood ratio based hypothesis tests [76]. The normalised log-likelihood is zero at the MLE: ℓ̂(θ̂) ≡ 0.
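The threshold ∆_{ν,α} can be evaluated directly from the χ² quantile function; the sketch below, for ν = 2 parameters of interest at the 95% level, assumes the normalised log-likelihood ℓ̂(θ) has already been computed:

```python
from scipy.stats import chi2

# alpha-level confidence region threshold for nu parameters of interest
nu, alpha = 2, 0.95
Delta = chi2.ppf(alpha, df=nu)   # Delta_{nu, alpha} is approximately 5.991

def inside_confidence_region(normalised_loglik):
    # theta lies inside the approximate 95% confidence region if the
    # normalised log-likelihood satisfies lhat(theta) >= -Delta/2
    return normalised_loglik >= -Delta / 2
```

With ν = 2, the region boundary corresponds to a normalised log-likelihood of approximately −2.996.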

Information geometry
As outlined in Section 1, the Fisher information describes the curvature of the log-likelihood function. It describes how much information a random variable, X, contains about a parameter, θ. For unbiased estimators, the inverse of the Fisher information provides a lower bound on the covariance matrix, via the Cramér-Rao inequality [107]. Formally, the Fisher information is the covariance of the score, where the score is defined as the partial derivative of the log-likelihood with respect to θ [55,76]. The Fisher information matrix can be written as [48,76]:

    I(θ) = E[ (∂ℓ(θ; X)/∂θ) (∂ℓ(θ; X)/∂θ)^T ] = −E[ ∂²ℓ(θ; X)/∂θ ∂θ^T ],    (8)

where the expectation is taken with respect to X. We can recover our expression for the Fisher information in Equation (4) from Equation (8), by considering how Equation (8) changes under reparameterisation, and via application of the chain rule for differentiation [55].
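The identity in (8) can be checked by simulation. For a single observation from N(µ, σ²), the score with respect to µ is (x − µ)/σ², and its variance should equal the (µ, µ) entry of the Fisher information, 1/σ². A quick Monte Carlo sketch (parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 0.5
x = rng.normal(mu, sigma, size=200_000)

# Score of a single observation with respect to mu
score_mu = (x - mu) / sigma**2

# The score has mean zero, and its sample variance estimates the Fisher
# information entry 1/sigma^2, which equals 4 for these parameter values
score_mean = score_mu.mean()
score_var = score_mu.var()
```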
With observations at L unique times, T = (t_1, t_2, ..., t_L), we can think of a model as a mapping between the parameters and the outputs that we can observe:

    (θ, σ) ↦ (µ(θ, t_1), µ(θ, t_2), ..., µ(θ, t_L), σ).    (9)

We consider some examples where σ is unknown and is estimated as part of the analysis; in these instances σ ∈ θ. However, we express σ explicitly in the mapping presented in (9) to emphasise that it behaves somewhat differently to a model parameter. The expected behaviour of the model does not depend on σ, and variability in the data maps directly to σ. In all the examples we consider, σ is constant. This could be extended to incorporate variability dependent on the expected behaviour, for example logistic growth with standard deviation that depends on the population density [16]. In the mapping, this could be expressed as σ(µ(θ, t)).
Following Equation (4), we can form the Fisher information as a combination of the Fisher information matrix of the observation process, O(m), and the Jacobian of the model with respect to the parameters, J(θ). From (9), with ν unknown parameters (dim(θ) = ν), we can view the model Jacobian as

    J(θ) = [ ∂µ(θ, t_1)/∂θ_1  ⋯  ∂µ(θ, t_1)/∂θ_ν ]
           [        ⋮                   ⋮         ]
           [ ∂µ(θ, t_L)/∂θ_1  ⋯  ∂µ(θ, t_L)/∂θ_ν ]
           [    ∂σ/∂θ_1       ⋯     ∂σ/∂θ_ν      ].    (10)

Noting that we are taking σ to be independent of the model parameters, all of the partial derivatives of σ in (10) are zero, except ∂σ/∂σ = 1 when σ is itself an element of θ. For a single observation drawn from a univariate normal distribution, the Fisher information of the observation process is

    O = diag(1/σ², 2/σ²).    (11)

This can be verified by applying Equation (8) to (1). For data at L time-points with N_1, N_2, ..., N_L observations at each time, with constant standard deviation, the Fisher information for the observation process is a 2L × 2L (block) diagonal matrix:

    O(m) = diag(N_1/σ², 2N_1/σ², ..., N_L/σ², 2N_L/σ²).    (12)

Similarly, for a model with M species, where we have observations of all M species at only one time-point, we recover Fisher information in the form of (12). For observations of M species at L time-points we form a 2LM × 2LM (block) diagonal matrix from (12). Assuming a constant standard deviation, for the computations in this work we could more simply express (12) as an (L + 1) × (L + 1) diagonal matrix, diag(N_1/σ², ..., N_L/σ², 2N_tot/σ²), where N_tot = Σ_j N_j is the total number of observations contributing to our information regarding the standard deviation, and the factor of two comes from (11). In this case, the model Jacobian as presented in (10) is modified such that only the final row includes the partial derivatives with respect to the standard deviation.
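A concrete sketch of assembling the Fisher information as J(θ)^T O(m) J(θ), assuming a linear model µ(θ, t) = θ_1 + θ_2 t with known constant σ, so that only the mean components of the observation process enter O (all numerical values below are illustrative):

```python
import numpy as np

# Observation design: N observations at each of L = 4 time points
t = np.array([0.0, 1.0, 2.0, 3.0])
N, sigma = 10, 0.5

# Model Jacobian: J[j, i] = d mu(theta, t_j) / d theta_i for the linear
# model mu(theta, t) = theta_1 + theta_2 * t (independent of theta here)
J = np.column_stack([np.ones_like(t), t])

# Fisher information of the observation process (mean components only,
# since sigma is treated as known): O = (N / sigma^2) * identity
O = (N / sigma**2) * np.eye(len(t))

# Fisher information with respect to the parameters
I = J.T @ O @ J
```

For this linear model I(θ) is constant across parameter space; for nonlinear models, J(θ), and hence I(θ), varies with θ, which is what induces the nontrivial geometry explored in this work.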
Before outlining specific techniques of information geometry, we present a conceptual example to develop some intuition for information geometric concepts. Consider the manifold corresponding to the family of univariate normal distributions parameterised by mean, µ, and standard deviation, σ > 0. Let P ∼ N(µ_1, σ) and Q ∼ N(µ_2, σ) be two normal distributions. Geometrically speaking, increasing σ reduces the distance between P and Q; this corresponds to a contraction of the space. Conversely, decreasing the variance dilates the space; as σ → 0, the elements of the Fisher information, diag(1/σ², 2/σ²), diverge, and the distance between P and Q tends to infinity.
Equipped with the Fisher information, we may begin to explain some foundational ideas from information geometry; including geodesic curves, geodesic distances between distributions for statistical models, and scalar curvature [1]. We denote the elements of the Fisher information as g_ij, so that the Fisher information plays the role of the metric matrix G. A geodesic is a curve forming the shortest path between two points in a Riemannian manifold [4].
The length of this shortest curve is referred to as the Fisher or Fisher-Rao distance [77]. We soon discuss a relationship between confidence regions and the length of geodesic curves. Informally, with greater information supporting an MLE, coinciding with an increase in its relative likelihood, confidence regions tighten. This also corresponds to a dilation of the parameter manifold, thereby increasing the geodesic distance between the MLE and other parameter combinations, reflecting their relatively reduced likelihood.
A curve z(s), parameterised by s, connecting the points z_1 = z(s_1) and z_2 = z(s_2) on a Riemannian manifold, has length [46]:

    L(z) = ∫_{s_1}^{s_2} √( (dz/ds)^T G(z) (dz/ds) ) ds.    (13)

A Riemann geodesic is a curve that minimises L(z) (13), such that the distance between two points on a Riemannian manifold is given by the curve that satisfies

    d(z_1, z_2) = min_z L(z).

For Gaussian likelihoods, there is an asymptotic relationship between the geodesic distance between the MLE, θ̂, and a point θ_α that corresponds to an α-level confidence region on the manifold [7]. The geodesic distance between θ̂ and θ_α, d(θ̂, θ_α), can be written in terms of the αth-quantile of the χ²(ν) distribution:

    d(θ̂, θ_α) = √(∆_{ν,α}).    (14)

Pairing Equations (7) and (14) yields an asymptotic relationship between confidence regions and geodesic length [7]:

    ℓ̂(θ_α) = −d(θ̂, θ_α)²/2.    (15)

In Section 3 we present likelihood-based confidence regions alongside geodesic curves of the corresponding length, as characterised by (14), and comment on the validity of Equation (15) in a range of scenarios.
Geodesic curves satisfy the following system of differential equations in n dimensions [78]:

    d²z^m/ds² + Γ^m_ij (dz^i/ds)(dz^j/ds) = 0,    (16)

where s is the parameterisation of the geodesic curve, in accordance with Equation (13), and Γ^m_ij are the Christoffel symbols of the second kind [17], defined as

    Γ^m_ij = (1/2) g^mk ( ∂g_ki/∂z^j + ∂g_kj/∂z^i − ∂g_ij/∂z^k ).    (17)

We can convert from Christoffel symbols of the second kind to Christoffel symbols of the first kind by lowering the contravariant (upper) index through multiplication by the metric: Γ_kij = g_km Γ^m_ij [41]. Here, repeated indices, in this case m, imply a summation is to be performed over the repeated index, following the Einstein summation convention [5]. Conversely, we can recover Christoffel symbols of the second kind from Christoffel symbols of the first kind via the inverse metric: Γ^m_ij = g^mk Γ_kij. Christoffel symbols of the second kind are the connection coefficients of the Levi-Civita connection; the Christoffel symbols are symmetric in the covariant (lower) indices [35]. On an n-dimensional manifold, the Christoffel symbol is of dimension n × n × n.
Geodesics can be used to construct theoretical confidence regions, to measure the geometric distance between probability distributions, and to perform hypothesis testing, for example to test equality of parameters [47,68,73].
Under certain conditions, analytical expressions can be obtained for the solutions of the geodesic equations, and the corresponding Fisher-Rao distances, for example in the case of the univariate (1) and multivariate (25) normal distributions [18,77]. However, we solve Equation (16) numerically, after converting the second order ODE to a first order system of ODEs using standard techniques.
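As a sketch of this numerical approach, the geodesic equations (16) for the univariate normal manifold, whose metric diag(1/σ², 2/σ²) yields Christoffel symbols analytically, can be integrated as a first order system (the initial point and velocity below are illustrative):

```python
import numpy as np
from scipy.integrate import solve_ivp

def geodesic_rhs(s, y):
    # State y = (mu, sigma, dmu/ds, dsigma/ds); the accelerations follow
    # from (16) with Gamma^mu_{mu sigma} = -1/sigma,
    # Gamma^sigma_{mu mu} = 1/(2 sigma), Gamma^sigma_{sigma sigma} = -1/sigma
    mu, sigma, dmu, dsigma = y
    ddmu = (2.0 / sigma) * dmu * dsigma
    ddsigma = dsigma**2 / sigma - dmu**2 / (2.0 * sigma)
    return [dmu, dsigma, ddmu, ddsigma]

# Start at (mu, sigma) = (0, 1) with unit-speed initial velocity (1, 0)
y0 = [0.0, 1.0, 1.0, 0.0]
sol = solve_ivp(geodesic_rhs, (0.0, 1.0), y0, rtol=1e-10, atol=1e-10)
mu_f, sigma_f, dmu_f, dsigma_f = sol.y[:, -1]

# The speed g_ij (dz^i/ds)(dz^j/ds) is conserved along a geodesic,
# providing a convenient check on the integration
speed_sq = dmu_f**2 / sigma_f**2 + 2.0 * dsigma_f**2 / sigma_f**2
```

Because the curve is parameterised by arc length, integrating over s ∈ [0, 1] traces a geodesic of unit Fisher-Rao length.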
We are also interested in exploring the scalar curvature, also known as the Ricci scalar, of our manifolds. To compute the scalar curvature, we must first construct the Riemann tensor, and subsequently the Ricci tensor. As we only require these tensors for computation of the scalar curvature, and do not attempt to interpret these tensors directly in this work, we provide only a limited outline of their interpretation. The Riemann curvature tensor is constructed from the Christoffel symbols and their first partial derivatives.
Here, it is convenient to think about these partial derivatives as being with respect to the parameters of interest. Due to the possibility of raising or lowering indices of Christoffel symbols and tensors via the metric, there are several equivalent expressions for computing the Riemann curvature tensor [41]. The elements of the Riemann tensor of the first kind can be written as

    R_hijk = ∂Γ_hij/∂z^k − ∂Γ_hik/∂z^j + Γ^m_ik Γ_mhj − Γ^m_ij Γ_mhk.    (18)

The Riemann tensor of the first kind is a (0,4) tensor (with no contravariant indices and four covariant indices), and can be converted to the (1,3) Riemann tensor of the second kind via the inverse of the metric:

    R^m_ijk = g^mh R_hijk.    (19)

On an n-dimensional manifold, the Riemann tensor is of dimension n × n × n × n; due to various symmetries, however, there are far fewer independent elements [62]. The Riemann tensor provides information about the intrinsic curvature of the manifold. A geometric interpretation is that a vector from a point on the manifold, parallel transported around a parallelogram, will be identical to its original value when it returns to its starting point if the manifold is flat. In this case, the Riemann tensor vanishes. If the manifold is not flat, the Riemann tensor can be used to quantify how the vector differs following this parallel transport [61].
From the Riemann tensor of the second kind, we can compute the Ricci tensor of the first kind. The Ricci tensor, R_ij, is obtained by contracting the contravariant index with the third covariant index of the Riemann tensor of the second kind; that is,

    R_ij = R^m_ijm.    (20)

On an n-dimensional manifold, the Ricci tensor is of dimension n × n, and is symmetric [61]. The Ricci tensor can quantify the changes to a volume element as it moves through the manifold, relative to Euclidean space [61].
The scalar curvature, Sc, can be obtained as a contraction of the Ricci tensor with the inverse metric:

    Sc = g^ij R_ij.    (21)

The scalar curvature is invariant; it does not change under a change of coordinates (re-parameterisation). For Gaussian likelihoods, the corresponding manifold is flat; characterised by zero scalar curvature everywhere. As such, the scalar curvature provides a measure of how the likelihood of the underlying statistical model deviates from being Gaussian, often referred to as non-Gaussianity in the physics and cosmology literature, irrespective of the parameterisation [35]. As we will explore in Section 3, it can also provide insights into parameter identifiability.
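The full pipeline, metric to Christoffel symbols to Riemann tensor to Ricci tensor to scalar curvature, can be carried out symbolically. The sketch below does this for the univariate normal manifold with the Fisher metric diag(1/σ², 2/σ²), a manifold of constant negative curvature (Sc = −1 under the index conventions used here):

```python
import sympy as sp

# Coordinates and Fisher information metric for the univariate normal manifold
mu, sigma = sp.symbols('mu sigma', positive=True)
z = [mu, sigma]
n = 2
g = sp.Matrix([[1 / sigma**2, 0], [0, 2 / sigma**2]])
ginv = g.inv()

# Christoffel symbols of the second kind, Gamma[m][i][j] = Gamma^m_{ij}
Gamma = [[[sp.simplify(sum(ginv[m, k] * (sp.diff(g[k, i], z[j]) +
                                         sp.diff(g[k, j], z[i]) -
                                         sp.diff(g[i, j], z[k])) / 2
                           for k in range(n)))
           for j in range(n)] for i in range(n)] for m in range(n)]

def riemann(m, i, j, k):
    # Riemann tensor of the second kind, R^m_{ijk}, in the convention where
    # contracting m with the third covariant index yields the Ricci tensor
    expr = sp.diff(Gamma[m][i][j], z[k]) - sp.diff(Gamma[m][i][k], z[j])
    expr += sum(Gamma[m][k][l] * Gamma[l][i][j] -
                Gamma[m][j][l] * Gamma[l][i][k] for l in range(n))
    return expr

# Ricci tensor R_ij = R^m_{ijm}, then scalar curvature Sc = g^{ij} R_ij
Ric = sp.Matrix(n, n, lambda i, j: sum(riemann(m, i, j, m) for m in range(n)))
Sc = sp.simplify(sum(ginv[i, j] * Ric[i, j] for i in range(n) for j in range(n)))
```

Replacing `g` with the Fisher information of another model reuses the same pipeline; for a Gaussian likelihood in the parameters the result would be identically zero.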

Hypothesis testing
Here we outline the approach for performing likelihood-ratio-based hypothesis tests, and hypothesis tests based on geodesic distance. As we consider synthetic data in this work, we know the true parameter values, θ_t. In practical applications this is not the case. As such, we may seek to test whether some previously held notion about the true parameters, θ_t = θ_0, is supported by the data, based on the computed MLE. This could be investigated via the following hypothesis test:

    H_0: θ_t = θ_0,  versus  H_1: θ_t ≠ θ_0.

From Equation (7) the test statistic for such a likelihood-ratio-based hypothesis test can be expressed as

    λ_LR = 2(ℓ(θ̂; X) − ℓ(θ_0; X)) = −2ℓ̂(θ_0),    (22)

where asymptotically as N → ∞, λ_LR ∼ χ²(ν), following Wilks' theorem [76]. From the asymptotic relationship given in Equation (15), it follows that under the same asymptotic conditions the test statistic for a hypothesis test based on geodesic distance is [47]:

    λ_G = d(θ̂, θ_0)².    (23)

The likelihood values required to compute Equation (22) can be obtained directly by evaluating Equation (5). To compute the geodesic distance between two specific points in parameter space, as required by Equation (23), it is necessary to solve a boundary value problem to obtain the geodesic curve between θ_0 and θ̂. Approximate p-values can be computed from these test statistics as p = 1 − F_ν(λ), where F_ν(·) is the cumulative distribution function of χ²(ν) [15]. We provide practical examples of each of these approaches to hypothesis testing in Section 3.
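A sketch of the likelihood-ratio test (22), assuming synthetic normal data with θ = (µ, σ) and a hypothesised θ_0 (the data, seed and hypothesised values are illustrative):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2, norm

# Synthetic data; here theta_0 happens to equal the true parameter values
rng = np.random.default_rng(3)
data = rng.normal(2.0, 0.5, size=100)
theta_0 = np.array([2.0, 0.5])

def loglik(theta):
    return np.sum(norm.logpdf(data, loc=theta[0], scale=theta[1]))

# MLE via numerical optimisation, as in (6)
res = minimize(lambda th: -loglik(th), x0=[1.0, 1.0], method="Nelder-Mead")
theta_hat = res.x

# Test statistic (22) and approximate p-value, p = 1 - F_nu(lambda)
lam_LR = 2.0 * (loglik(theta_hat) - loglik(theta_0))
p_value = chi2.sf(lam_LR, df=2)
```

A small p-value would indicate evidence against H_0; here, since θ_0 generated the data, we would typically expect a non-significant result.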

Results
In this section we present results combining likelihood-based parameter inference and uncertainty quantification with ideas from information geometry, including geodesic curves and scalar curvature. We apply these techniques to univariate and multivariate normal distributions; linear, exponential and logistic population growth models; and the SIR model. Through these canonical examples, we pedagogically explore differences in the inference and information geometry results that arise as we consider parameter estimation and uncertainty quantification for increasingly complex systems.
Synthetic data for the univariate and multivariate normal distributions are generated by sampling from the respective distributions given in Equation (24). For simplicity, in this work we consider synthetic data from uncorrelated observation processes with constant standard deviation in both time and parameter space. However, we note that the techniques presented in this work can be generalised to handle data with non-constant variance and for other distributions where the Fisher information is available [16].
where Σ = diag(σ²) is the covariance matrix. For the population growth and SIR models considered in this work, synthetic data is generated by drawing from a normal distribution with mean described by the model solution and a prescribed standard deviation; effectively substituting µ = µ(θ, t) into Equation (24) for observation processes with a single variable, and the vector-valued µ = µ(θ, t) for observation processes with several variables. When σ is one of the parameters to be estimated, σ ∈ θ, but µ does not depend on σ. Parameter values that we use to generate synthetic data correspond to parameter estimates inferred from field data in the literature [72,92].
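This observation process can be sketched as follows; the helper name and the stand-in linear model are hypothetical, and the noise is uncorrelated with constant standard deviation, as described above:

```python
import numpy as np

def synthetic_observations(model, theta, times, sigma, n_per_time, seed=0):
    # Normally distributed observations about the model solution mu(theta, t),
    # with constant standard deviation sigma and uncorrelated noise.
    rng = np.random.default_rng(seed)
    mu = np.repeat(model(theta, np.asarray(times, dtype=float)), n_per_time)
    return mu + rng.normal(0.0, sigma, size=mu.size)

# Hypothetical linear process model C(t) = C(0) + a*t as a stand-in.
linear = lambda theta, t: theta[1] + theta[0] * t
obs = synthetic_observations(linear, (0.9, 0.7), [1.0, 2.0, 3.0], 0.2, 10)
```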

Normal distributions
We first consider parameter inference and information geometry techniques applied to observations drawn directly from univariate and bivariate normal distributions, with no underlying process model. In Figure 1 we present results for the univariate normal distribution (1), estimating θ = (µ, σ).
The true mean and standard deviation used to generate data are (µ, σ) = (0.7, 0.5). Estimates are obtained via maximum likelihood estimation. MLEs of normal variance are known to provide biased underestimates [76], and the derivation of the Fisher information assumes an unbiased estimator [30].
This may partially explain the particular differences observed between the likelihood-based confidence region and the endpoints of the geodesics in Figure 1, wherein the geodesics appear not only to suggest a tighter confidence region, but also appear to be biased towards parameter space with smaller standard deviation. As the number of observations increases from N = 10 to N = 100, we observe not only that the MLE more precisely estimates the true parameter values, but also the endpoints of the geodesic curves more closely correspond to the likelihood-based confidence regions. This is consistent with both the theoretical asymptotic relationship between geodesic length and likelihood-based confidence regions given in Equation (15), and also the bias of the MLE for standard deviation decreasing, as N increases.
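The underestimation bias of the ML variance estimate, and its decay with N, can be checked directly by simulation; a minimal sketch assuming the same true values (µ, σ) = (0.7, 0.5):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_true = 0.5
bias = {}
for N in (10, 100):
    samples = rng.normal(0.7, sigma_true, size=(20000, N))
    # The ML variance estimator divides by N, so E[var_mle] = (N-1)/N * sigma^2.
    bias[N] = samples.var(axis=1).mean() - sigma_true**2
# bias[N] is approximately -sigma^2/N: negative, and shrinking as N grows.
```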
The manifold representing the family of normal distributions is parameterised by θ = (µ, σ).
Figure 1 caption: True parameter values, (µ, σ) = (0.7, 0.5), are marked with green discs, with the MLEs indicated using red discs. Magenta curves correspond to likelihood-based 95% confidence regions. Black lines are geodesic curves emanating from the MLEs, with a geodesic length corresponding to a theoretical 95% confidence region. Increasing the number of data points, N, tightens the confidence regions, improves the correspondence between geodesic curves and likelihood-based confidence regions, and reduces the scalar curvature.
The probability density function for the multivariate normal distribution with two independent variables, x, y ∈ R, and constant standard deviation σ is given in Equation (25). In Equation (25) there are 3 parameters that we could estimate from data; Θ = (µ1, µ2, σ). As we estimated the mean and standard deviation for the univariate normal distribution in Figure 1, we consider inference of both means for the multivariate normal; θ = (µ1, µ2). Results are presented in Figure 2. We also observe that the confidence regions are symmetric with respect to each mean parameter. When estimating only the mean parameters of the multivariate normal distribution, we see that the scalar curvature is zero everywhere. This is to be expected, as the Fisher information for normally distributed observation processes, Equation (11), depends only on the standard deviation and not the mean. As such, all of the partial derivatives used to construct the Christoffel symbols, (17), are zero; this vanishing of the Christoffel symbols translates to zero scalar curvature through Equations (18), (19) and (20). We also observe that, in contrast to the evident curvature of the geodesics for the univariate normal case presented in Figure 1, the geodesic curves in Figure 2 appear perfectly straight when plotted in Euclidean geometry. The Riemann tensor (18) is zero everywhere when inferring multivariate normal means, confirming that the manifold is flat.
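These observations follow from the standard Fisher information matrices for the two cases; a sketch of the per-observation forms (scaled by N), with illustrative function names:

```python
import numpy as np

def fisher_univariate_normal(sigma, N=1):
    # Fisher information for theta = (mu, sigma) of N i.i.d. normal samples.
    # The metric depends on sigma, giving a curved manifold.
    return N * np.array([[1.0 / sigma**2, 0.0],
                         [0.0, 2.0 / sigma**2]])

def fisher_bivariate_means(sigma, N=1):
    # Fisher information for theta = (mu1, mu2) with known, constant sigma.
    # The metric is constant in theta: its derivatives, and hence the
    # Christoffel symbols and scalar curvature, vanish everywhere.
    return N * np.eye(2) / sigma**2
```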
Results presented in this work predominantly feature 95% confidence regions. We note that although this choice is common [28], it is also arbitrary, and equivalent analysis could be performed at different confidence levels, as demonstrated in Figure 3.
Having considered the techniques as applied directly to distributions, we now incorporate ODE-based process models, such that our observations are normally distributed about the solution of a mathematical model.
Figure 2 caption: True parameter values, (µ1, µ2), are marked with green discs, with the MLEs indicated using red discs. Magenta curves correspond to likelihood-based 95% confidence regions. Black lines are geodesic curves emanating from the MLEs, with geodesic lengths corresponding to a theoretical 95% confidence region. Increasing the number of data points, N, tightens the confidence regions. In contrast to the univariate case where we infer standard deviation in Figure 1, when only inferring the mean parameters of the multivariate normal distribution, we see that even with few observations, N = 10, the geodesics and likelihood-based confidence regions match closely. As we are estimating means only, and there is no model-induced curvature, the scalar curvature is zero everywhere.
Figure 3 caption: (a) Univariate normal distribution, as considered in Figure 1a, and (b) multivariate normal distribution with inferred means, µ1 and µ2, as considered in Figure 2a. MLEs are indicated using red discs. Dashed curves correspond to likelihood-based 50% (green), 90% (blue), and 95% (orange) confidence regions. Solid lines are geodesic curves emanating from the MLEs, with geodesic lengths within a theoretical 50% (green), 90% (blue), and 95% (orange) confidence distance.

Population growth models
The canonical logistic growth model, alongside generalisations and related sigmoid models such as Gompertz and Richards' models, have been extensively applied to study population growth dynamics in the life sciences [74,92]. In Figure 4 we present data from the literature describing the area covered by hard corals in a region as they regrow following an adverse event.
This can be modelled as a logistic growth process [92]. Logistic growth of a population with density, C(t), is characterised by a growth rate, r > 0, an initial density, C(0), and a carrying capacity, K > 0. Recall from Equation (10) that we only require the partial derivatives corresponding to unknown parameters in any given example. When estimating θ = (a, C(0)) we find that, similar to the multivariate normal case where we estimate means, the scalar curvature is zero everywhere. We also observe that the end-points of the geodesics align with the likelihood-based confidence region. We stress that this arises through the relationship in Equation (15), and is not forced to occur via termination of the numerical solution of the ODE once it reaches the likelihood-based confidence region.
However, due to the relationship between a and C(0), we find that the confidence regions in this case are not symmetric about the MLE with respect to each parameter. Rather, we see that for a given normalised log-likelihood value a larger growth rate corresponds to a smaller initial condition, and vice versa. This aligns with our intuition when considering fitting a straight line through data, as presented in Figure 5a; lines with a greater slope (a) must start lower (C(0)) to fit the data.
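This trade-off between slope and intercept can be seen directly in repeated least-squares fits of a straight line; a sketch with hypothetical parameter values (a, C(0), σ) = (0.9, 0.7, 0.2):

```python
import numpy as np

# Repeatedly fit the hypothetical linear model C(t) = C(0) + a*t to noisy
# data and measure the correlation between the fitted slope and intercept.
rng = np.random.default_rng(2)
t = np.array([1.0, 2.0, 3.0])
a_true, c0_true, sigma = 0.9, 0.7, 0.2
A = np.vstack([t, np.ones_like(t)]).T  # design matrix: columns (t, 1)
fits = []
for _ in range(2000):
    y = c0_true + a_true * t + rng.normal(0.0, sigma, size=t.size)
    fits.append(np.linalg.lstsq(A, y, rcond=None)[0])  # (a_hat, c0_hat)
fits = np.array(fits)
corr = np.corrcoef(fits[:, 0], fits[:, 1])[0, 1]
# corr is strongly negative: steeper fitted lines start lower.
```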
When one of the parameters to be estimated is σ, we observe similar behaviour. By construction, as detailed in Figure 5, the linear and exponential models with identical parameters and initial conditions produce very similar behaviours over a sufficiently small time-scale. This is seen when comparing the inference results for the exponential model, presented in Figure 7(g-l), to the corresponding linear results in Figure 7(a-f). When inferring θ = (a, σ), deviations from the corresponding linear results are minimal. The likelihood-based confidence region and corresponding geodesic endpoints for θ = (a, C(0)) are marginally tighter and less elliptical. When inferring θ = (C(0), σ), we find that the confidence region for the exponential model is narrower with respect to C(0) than that of the linear model, though near-identical with respect to σ. As for the linear case, the scalar curvature is Sc = −1/N everywhere when σ is one of the unknown parameters, and zero everywhere otherwise.
Figure caption (cf. Figure 5): The true parameter values are marked with green discs, with the MLEs indicated using red discs. Magenta curves correspond to likelihood-based 95% confidence regions. Black lines are geodesic curves emanating from the MLEs, with lengths corresponding to a theoretical 95% confidence distance. True values of model parameters correspond to the logistic growth parameters; a = 0.9131, C(0) = 0.7237, with reduced standard deviation σ = 0.2301.

Logistic growth
Logistic growth describes growth at a rate dependent on the size of the population, with growth ceasing once the population reaches a carrying capacity.
For sufficiently small populations relative to the carrying capacity, logistic growth is approximately exponential [98]. As the population approaches the carrying capacity, the rate of growth slows. The logistic growth model is

dC/dt = r C (1 − C/K),

with solution

C(t) = K / (1 + (K/C(0) − 1) e^(−rt)).
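The solution of the logistic model can be implemented directly; a sketch (parameter ordering and function name are illustrative):

```python
import numpy as np

def logistic_solution(theta, t):
    # C(t) = K / (1 + (K/C(0) - 1) * exp(-r t)), the solution of the
    # logistic model dC/dt = r C (1 - C/K) with C(0) = c0.
    r, c0, K = theta
    return K / (1.0 + (K / c0 - 1.0) * np.exp(-r * np.asarray(t, dtype=float)))
```

At t = 0 this returns C(0), and as t → ∞ it approaches the carrying capacity K.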
The long-time limit of Equation (26) is the carrying capacity, K. The partial derivatives of the solution with respect to each parameter are given in Equation (27). Recall that θ includes only the unknown parameters to be estimated, so the components required from Equation (27) to form J(θ) depend on the specific example.
Example synthetic logistic data is presented in Figure 6, demonstrating the model fits for θ = (r, C(0)), θ = (r, K) and θ = (r, σ). With data at early, mid and late time, T = (t1, t2, t3) = (2.74, 6.84, 10.95) years, we observe an excellent model fit in all cases. The fit is best when θ = (r, σ), as only one model parameter is unknown. Comparing θ = (r, C(0)) and θ = (r, K), we observe a marginally better fit at late time when K is known, and at early time when C(0) is known, as expected.
We present inference results for the logistic model for θ = (r, C(0)) in Figure 8(a-f) and for θ = (r, K) in Figure 8(g-l). We do not present further results of inferring σ for the logistic model, as little insight is gained beyond what we glean from the linear and exponential growth results. For θ = (r, C(0)), the normalised log-likelihood reflects the same relationship between growth rate and initial condition as for the linear and exponential case. With early-mid time data and early-mid-late time data, we are able to infer θ = (r, C(0)). With only mid-late time data, we find that the parameters are not practically identifiable. This can be seen from Figure 8(c); the normalised log-likelihood remains above the threshold prescribed in Equation (7), and a closed likelihood-based 95% confidence region cannot be constructed. This is also reflected in Figure 8(f) alongside zero scalar curvature, such that the plot appears empty. Comparing Figure 8(a,b), and noting that they each rely on the same total number of observations, the importance of early and mid time data when inferring θ = (r, C(0)) is reinforced. The confidence region is tighter with only early-mid data than with the same amount of data spread across early, mid and late times.
Inferring θ = (r, K) reflects similar behaviour in Figure 8. When considering θ = (r, C(0)), we see that with early-mid time data and mid-late time data, the scalar curvature is zero everywhere. However, introducing a third time-point (early-mid-late data) results in a non-constant negative scalar curvature. We expect that this relates to the relationships between the parameters, and the difference between a mapping (where we have two pieces of information and two parameters to estimate) and a fit (where we have three pieces of information and two parameters to estimate). We do not observe similar behaviour for θ = (r, K) with data at three time-points; the scalar curvature still appears to be zero everywhere. One explanation for this is that data at t1, where C(t) ≪ K, may be effectively independent of K; providing no information about K [103]. This may effectively reduce the problem to a mapping. Given that the scalar curvature is a feature of the manifold rather than the data, it is of interest to investigate what would happen were the true parameters to lie within this region of non-constant scalar curvature.
To address this, we generate an alternate set of synthetic logistic growth data using parameter values from within the high curvature region; (r, C(0)) = (0.9, 0.2), with (K, σ) = (79.74, 2.301) as before. Inference results are presented in Figure 9. We still observe correspondence between the endpoints of the geodesics and the likelihood-based confidence region; however, the confidence region is now significantly narrower and reflects a more hyperbolic shaped relationship between r and C(0) in terms of the normalised log-likelihood.
Figure 8 caption: Inference results for the logistic model with inferred growth rate, r, and initial condition, C(0), (a-f); and with inferred growth rate, r, and carrying capacity, K, (g-l). True parameters are as noted in Figure 6, with known standard deviation, σ = 2.301. Heatmaps visualise the normalised log-likelihood (a-c,g-j) and the scalar curvature (d-f,k-l). The true parameter values are marked with green discs, with the MLEs indicated using red discs. Magenta curves correspond to likelihood-based 95% confidence regions. Black lines are geodesic curves emanating from the MLEs, with lengths corresponding to a theoretical 95% confidence distance. Columns of the figure correspond to observations from early-mid time (T = (t1, t2)), early-mid-late time (T = (t1, t2, t3)), and mid-late time (T = (t2, t3)), where (t1, t2, t3) = (2.74, 6.84, 10.95) years. Each plot reflects a total of 30 observations, distributed equally between the specified time-points. The red outline in (j) corresponds to the (zoomed in) region (g), also outlined in red. In (g,j) we plot 1000 geodesics to observe the geodesic near the true parameter values. We do not present Sc corresponding to (g,j); however, it is zero everywhere.
Figure 9 caption: True parameter values are marked with green discs, with MLEs indicated using red discs. Magenta curves correspond to likelihood-based 95% confidence regions. Black lines are 100 geodesic curves emanating from the MLEs, with lengths corresponding to a theoretical 95% confidence distance.

SIR epidemic model
The SIR model describes the dynamics of epidemic transmission through a population [72].
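A minimal sketch of numerically integrating the scaled SIR system (with s + i + r = 1, transmission rate β and recovery rate γ, using the parameter estimates β = 1.6633 and γ = 0.44036 from [72]):

```python
import numpy as np
from scipy.integrate import solve_ivp

def sir_rhs(t, y, beta, gamma):
    # Scaled SIR model: ds/dt = -beta*s*i, di/dt = beta*s*i - gamma*i,
    # dr/dt = gamma*i; the total s + i + r = 1 is conserved.
    s, i, r = y
    return [-beta * s * i, beta * s * i - gamma * i, gamma * i]

beta, gamma = 1.6633, 0.44036
y0 = np.array([762.0, 1.0, 0.0]) / 763.0  # boarding school initial state
sol = solve_ivp(sir_rhs, (0.0, 14.0), y0, args=(beta, gamma),
                t_eval=np.linspace(0.0, 14.0, 15), rtol=1e-8, atol=1e-10)
```

With these values the basic reproduction number is β/γ ≈ 3.8, so a substantial outbreak occurs over the 14 day window.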
Alongside β and γ we could also treat the initial conditions, S(0), I(0) and R(0), as unknown parameters to be estimated. The standard SIR model presented in Equation (28) is sufficient for our purposes in this work; however, numerous extensions to the SIR model are considered in the literature. These extensions incorporate factors such as age structure, birth and death, exposed but not yet infected individuals, seasonality, competition between infectious strains, waning immunity, vaccination and spatial structure [19,72,75,86]. Data pertaining to the proportion of a population infected during an influenza outbreak in a boarding school is presented in Figure 10. Observations in the original data record the number of infected individuals over a 14 day period [72], in a population of N = 763, with initial populations (s(0), i(0), r(0)) = (762, 1, 0). This data is used in [72] to estimate parameters for the SIR model, which, after scaling such that S + I + R = 1, are β = 1.6633 and γ = 0.44036. We treat these values as the true parameters when generating synthetic data; examples of which are presented in Figure 11. In the context of an SIR model, the presence of multiple observations at a single time-point could reflect, for example, reporting errors, uncertainty in test accuracy or expert judgement [39,111]. In the boarding school data considered in [72], observations pertain only to the number of infected individuals. Given that the SIR model features multiple populations, data could in theory contain observations of the other populations also. Example synthetic data with observations on all three populations is also presented in Figure 11.
Figure caption: True parameter values are marked with green discs, with MLEs indicated using red discs. Magenta curves correspond to likelihood-based 95% confidence regions. Black lines are geodesic curves emanating from the MLEs, with lengths corresponding to a theoretical 95% confidence distance. Initial populations are as described in Figure 11.
scalar curvature has the opposite effect. However, repeating the analysis with different synthetic data sets, generated from a different random seed, suggests that in some cases the geodesics will extend beyond the likelihood-based confidence regions, and in some cases they will fall short; however, the scalar curvature remains positive in all cases.
Figure caption: MLEs indicated using red discs. Magenta curves correspond to likelihood-based 95% confidence regions. Black lines are geodesic curves emanating from the MLEs, with lengths corresponding to a theoretical 95% confidence distance. Initial populations as described in Figure 11.

Hypothesis testing
In Figure 14 we present several example hypothesis tests, using both likelihood-ratio-based and geodesic-distance-based test statistics. We also perform hypothesis tests for the logistic model in the high curvature region of parameter space. As before, results are comparable for different numbers of observations at each time point; N = (10, 10, 10) and N = (50, 50, 50), as considered in Figure 9. Even in this high curvature region, we find that the endpoints of geodesics corresponding to a theoretical 95% confidence distance very closely match the likelihood-based 95% confidence regions. This is again reflected in the results of the hypothesis tests, where very similar results are obtained from the likelihood-ratio-based hypothesis tests and the geodesic-distance-based hypothesis tests, even for relatively extreme θ0. As the number of observations increases, we observe for each θ0 considered that, in accordance with the confidence regions tightening, the test statistics increase and accordingly p-values decrease.
As we are using synthetic data and know the true parameters, we can use hypothesis testing to pedagogically investigate Wilks' theorem [76] and the asymptotic relationship given in (15). We generate 1000 synthetic datasets and for each dataset perform a hypothesis test for the true parameters.
This is repeated for the univariate and multivariate normal distributions with N = 10 and N = 1000 observations. In Figure 15 we present densities for both the likelihood-ratio-based and geodesic-distance-based test statistics, alongside the probability density of χ²(2). For the multivariate normal distribution with θ = (µ1, µ2), the density profiles for λLR and λGD are near-identical; as expected following the results in Figure 14 and Table 1.
Figure 14 caption: In each case, we test several example hypotheses, θ0, marked by coloured discs. Geodesics between the MLEs (red discs) and each θ0 are shown in red. Magenta curves correspond to likelihood-based 95% confidence regions. Black lines are geodesic curves emanating from the MLEs, with lengths corresponding to a theoretical 95% confidence distance.
We also observe a good match between these profiles and χ 2 (2), even with just N = 10. For the univariate normal distribution with θ = (µ, σ), when N = 10 we observe differences between λ LR and λ GD . Both profiles are similar to χ 2 (2), though there appears to be a higher density in the tails of the distributions of the test statistics. As the number of observations increases to N = 1000, the difference between λ LR and λ GD reduces significantly, and both closely match χ 2 (2).
From Wilks' theorem [76] and (15), asymptotically 95% of the 95% confidence regions we construct should contain the true parameter values. We can determine what proportion of the likelihood-based and geodesic-distance-based 95% confidence regions that we construct contain the true parameter values using the information presented in Figure 15. This is done by comparing the test statistics to the critical value, ∆2,0.05, from (7).
Figure 15 caption: Purple curves correspond to the density of the χ²(2) distribution, while blue dotted lines represent the likelihood-ratio-based test statistics and orange dashed lines represent the geodesic-distance-based test statistics.
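This coverage check can be sketched for the bivariate normal mean case, where the likelihood-ratio statistic has an exact χ²(2) distribution (the sample sizes and true values here are illustrative):

```python
import numpy as np
from scipy import stats

# Monte Carlo check of Wilks' theorem: for the bivariate normal mean with
# known sigma, lambda_LR = N * ||xbar - mu||^2 / sigma^2 ~ chi^2(2) exactly,
# so ~95% of the 95% confidence regions should cover the true parameters.
rng = np.random.default_rng(3)
mu_true, sigma, N, reps = np.array([0.7, 0.5]), 0.5, 10, 5000
crit = stats.chi2.ppf(0.95, df=2)  # the critical value Delta_{2,0.05}
covered = 0
for _ in range(reps):
    x = rng.normal(mu_true, sigma, size=(N, 2))
    xbar = x.mean(axis=0)
    lam = N * np.sum((xbar - mu_true) ** 2) / sigma**2
    covered += lam <= crit
coverage = covered / reps  # close to 0.95
```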

Discussion
Parameter estimation is fraught with challenges relating to the availability and quality of experimental or field data [31,43,95,101]. This prompts a strong consideration of uncertainty quantification to support point-estimation of model parameters [22]. In this section, we discuss the results presented in Section 3. We highlight opportunities for application of information geometry techniques, including geodesic curves and scalar curvature, to supplement traditional maximum likelihood based parameter inference and uncertainty quantification. We conclude by outlining areas for further investigation.
Even for relatively small sample sizes, we observe good correspondence between the likelihood-based 95% confidence regions and the end-points of geodesic curves corresponding to a theoretical 95% confidence distance, in accordance with the asymptotic relationship described in Equation (15); particularly when estimating model parameters. When estimating standard deviation, as outlined in Section 3, geodesics appear to suggest a tighter confidence region, and appear to be biased towards regions of parameter space with smaller standard deviation. We observe this effect decreasing as the number of observations increases, in line with the known underestimation bias of maximum likelihood estimates of variance [76]. The misalignment of likelihood-based confidence regions and geodesic endpoints appears to occur more frequently in examples with non-zero scalar curvature, although we observe a good match in Figure 9 despite the non-constant scalar curvature.
Visualising the scalar curvature throughout a parameter space can indicate areas where there may be issues with identifiability. Areas with significant non-constant scalar curvature can suggest a complicated relationship between parameters in terms of the normalised log-likelihood, such as the hyperbolic confidence region observed in Figure 9. However, it is possible to produce examples, such as Figure 8(c,f), where there is practical non-identifiability despite zero scalar curvature everywhere. Although we do not show it here, for the logistic model with θ = (r, K) in the region of parameter space where C(0) ≈ K, computation of the scalar curvature breaks down as the Fisher information matrix becomes singular. Here, it may be obvious that we cannot identify the growth rate, r, from a process that is initialised at its steady state (C(0) = K). However, observing this behaviour in general may help to detect issues with identifiability, particularly for models without analytical solutions. Although computing the scalar curvature across a parameter space incurs a computational cost, this cost will generally pale in comparison to the costs associated with collecting experimental or field data, and may be easily justified if the information geometry techniques are used to guide data collection.
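The degeneracy at C(0) = K can be checked directly from the logistic solution: initialised at the steady state, the output is constant at K, so the sensitivity to r vanishes for all t and the Fisher information matrix for θ = (r, K) is singular. A sketch using finite differences (parameter values illustrative):

```python
import numpy as np

def logistic_solution(r, c0, K, t):
    # C(t) = K / (1 + (K/c0 - 1) * exp(-r t))
    return K / (1.0 + (K / c0 - 1.0) * np.exp(-r * t))

# Initialised at the steady state, C(0) = K, the solution is constant at K,
# so dC/dr = 0 for all t and the growth rate r cannot be identified.
t, K, h = 5.0, 79.74, 1e-6
dC_dr = (logistic_solution(0.9 + h, K, K, t)
         - logistic_solution(0.9 - h, K, K, t)) / (2.0 * h)
```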
If information geometric analysis identifies a region of parameter space with significant non-constant scalar curvature for a model, such as in Figure 9, and practitioners have a prior expectation that the true parameter values fall somewhere within this region, this may indicate that a greater quantity or quality of data is needed to improve identifiability for that particular model.
Alternatively, such analysis may guide practitioners in choosing favourable experimental conditions; for example in cell culture experiments, where it is possible to vary the initial cell seeding density [15]. Experimental design is a process wherein experiments are performed or simulated iteratively with perturbations, such that some measure of information is maximised.
Through this process the most informative experiments are identified, facilitating design of optimal experimental protocols [38,57,85]. Common to these approaches is the importance of quantifying and comparing information. While we do not consider optimal experimental design in this work, there is potential to incorporate information geometric techniques in the experimental design process as a means of comparing information between experimental perturbations. This is an area for further investigation.
Although we focus on how information geometry can supplement traditional maximum likelihood based inference and uncertainty quantification, primarily through visualisation, it should be noted that concepts from information geometry have also found application in the inference context from a computational efficiency standpoint. For example in Bayesian inference, by defining Monte Carlo sampling methods on a Riemann manifold, the geometric structure of the parameter space can be exploited [36]. Simulated paths across the manifold automatically adapt to local structure, facilitating efficient convergence, even in higher dimensions and in the presence of strong correlation [36,44]. Concepts from information geometry, including geodesic curves, are also implemented in methods for model reduction [96].
These applications of information geometry techniques to improve computational algorithms highlight further utility of geometric concepts for inference in higher dimensions, beyond that which we demonstrate through visualisation in this work.
Geodesics can be used to measure the distance between probability distributions. As demonstrated in Section 3, it is possible to perform hypothesis tests based on geodesic distance [47,68,73]. The approach for performing such a hypothesis test is to solve a boundary value problem to find the geodesic connecting two points in parameter space, and use the corresponding geodesic length as the test statistic.
In this work we only consider models that admit unimodal likelihoods.
In cases where the likelihood is multimodal; provided that we are able to obtain the Fisher information required to compute the Christoffel symbols, we are still able to compute the scalar curvature and perform hypothesis tests based on geodesic distance. With multimodal likelihoods, it would not be possible to construct confidence regions from geodesics emanating from the MLE. Although, we note that constructing confidence regions for multimodal likelihoods is also problematic with traditional likelihood-based inference methods.
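For the univariate normal family, the geodesic (Fisher-Rao) distance is available in closed form, avoiding the boundary value problem; a sketch of the per-observation distance, which follows from the scaled hyperbolic half-plane geometry of the (µ, σ) manifold:

```python
import numpy as np

def fisher_rao_distance_normal(mu1, sigma1, mu2, sigma2):
    # Closed-form Fisher-Rao distance between N(mu1, sigma1^2) and
    # N(mu2, sigma2^2), per observation; the (mu, sigma) manifold with the
    # Fisher metric is a scaled hyperbolic half-plane.
    arg = 1.0 + ((mu1 - mu2) ** 2 + 2.0 * (sigma1 - sigma2) ** 2) \
        / (4.0 * sigma1 * sigma2)
    return np.sqrt(2.0) * np.arccosh(arg)
```

For N i.i.d. observations the Fisher metric scales by N, so the distance scales by √N.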
There are several avenues for future research in this area. Here, we consider two-dimensional manifolds to facilitate convenient visualisation; however, the inference and information geometry techniques are general, and can be readily applied to higher dimensional manifolds [3,76], albeit with increased computational cost. Extending this analysis to three dimensions would enable consideration of situations where there is scalar curvature associated both with the variability of the observation process, σ, and also with interactions between model parameters; for example, it may be insightful to consider θ = (β, γ, σ) for the SIR model, where we associate a constant negative scalar curvature with σ and non-constant positive scalar curvature with interactions between β and γ. In three dimensions, likelihood-based confidence regions can be visualised as a series of two-dimensional slices oriented in three-dimensional space [15]; this technique could be applied to visualise slices of the scalar curvature in three dimensions. One approach for visualisation in higher dimensions is to produce an ensemble of these two- or three-dimensional confidence regions for various combinations of parameters of interest, with other parameters fixed at their MLEs. Alternatively, in higher dimensions it may be more appropriate to use non-visual techniques, such as hypothesis testing.
While we have considered ODE models, there is appetite in the literature for parameter estimation, uncertainty quantification and identifiability analysis for more complicated models, including partial differential equations, stochastic differential equations (SDEs) and delay differential equations [12,67,91]. This appetite extends to non-differential-equation-based models, including agent-based models [52] and network models [40]. A natural extension of this work is to present examples demonstrating how the information geometry techniques can be applied to these more complicated models. This will introduce new challenges, though it may be possible to leverage existing techniques; for example, a linear noise approximation may be used to obtain a representation of the Fisher information matrix for SDEs [50]. Further, we fix σ across observation times, model parameters and populations. However, the techniques presented in this work can be generalised to handle data with non-constant variance [16]; the expression for the Fisher information matrix given in Equation (4) can be extended to account for a parameter-dependent covariance matrix [64]. Investigation of examples paralleling those in Section 3, but with non-constant standard deviation, may prove insightful.
Here, the Fisher information defines a Riemann metric on the statistical manifold. For some inference problems it is not practical to obtain the Fisher information. Where the Fisher information is not available, the sample-based observed information, computed as the negative of the Hessian of the log-likelihood function, or via Monte Carlo methods, may be available [25,79].
The observed information has been demonstrated to equip a manifold with an observed geometric structure akin to the expected geometric structure