Deterministic and Stochastic Parameter Estimation for Polymer Reaction Kinetics I: Theory and Simple Examples

Two different approaches to parameter estimation (PE) in the context of polymerization are introduced, refined, combined, and applied. The first is classical PE, where one is interested in finding parameters that minimize the distance between the output of a chemical model and experimental data. The second is Bayesian PE, which allows for quantifying parameter uncertainty caused by experimental measurement error and model imperfection. Based on detailed descriptions of the motivation, theoretical background, and methodological aspects of both approaches, their relation is outlined. The main aim of this article is to show how the two approaches complement each other and can be used together to generate a strong information gain regarding the model and its parameters. Both approaches and their interplay in application to polymerization reaction systems are illustrated. This is the first part of a two-article series on parameter estimation for polymer reaction kinetics, with a focus on theory and methodology; in the second part, a more complex example will be considered.


Introduction
Parameter estimation (PE) for chemical kinetics denotes the process of fitting a mathematical model of the reaction process of interest to given observation data by tuning the parameters of the model. The model is mostly given in the form of reaction rate equations, that is, a system of ordinary differential equations (ODEs) that describes the temporal change of the concentrations of the reactants and products and their properties by modeling individual reactions involving parameters like reaction rate coefficients. Measurement data are given in the sense that some characteristic quantities of the process are measured for a sequence of time points. In PE, one wants to find the parameters for which the solution of the model is as close as possible to the given data. Closeness is typically measured by means of the residual function, which measures the distance between model-based prediction and measurement data.

DOI: 10.1002/mats.202100017
PE for polymer reaction kinetics is used in hundreds of articles. There also is a wide range of literature on PE for ODE systems, including its specific use in polymer reaction engineering, [1][2][3] or systems biology, [4] for example. This article complements these works by a combination of different approaches to the PE problem that normally are dealt with independently: We will discuss (1) the classical approach to PE via minimization of the residual function and (2) the approach utilizing Bayesian PE and uncertainty quantification. We will outline the pros and cons and the different contexts in which these approaches seem appropriate, and will demonstrate their use in application to different realistic scenarios.
It is not the objective of this article to review the literature on PE for polymer reaction processes. Its aim is to demonstrate that classical and Bayesian PE complement each other in ways that allow one to deal with the following typical real-world scenario: the model is still under construction, for most model parameters at most rough estimates are available, and the available measurement data are insufficient, in the sense that not all quantities of interest for the process can be measured, there are too few time points, and/or the measurement quality is low, that is, there is significant measurement error. In this scenario, classical PE often leads to severe problems: the misfit between model and data, the residual, stays significantly large even after many minimization steps and the resulting fit is unsatisfactory, or the parameters are strongly correlated. We will shed light on the exact reasons for this outcome and discuss how to improve the situation. In this scenario, classical PE suffers from the fact that insufficient and low-quality measurement data lead to uncertainty about the optimal parameters.

www.advancedsciencenews.com www.mts-journal.de

In general, parameter uncertainty can take several forms: (1) the uncertainty of all parameters is small and their distribution is approximately Gaussian around the optimal values, (2) the uncertainty is large for some parameters, meaning that they are hardly informative and the model may even be insensitive with respect to these parameters, and (3) the uncertainty is large because the residual function has more than one main well. Case (1) is the lucky case in which classical PE typically works well despite low-quality data. In case (2), special precautions have to be taken in classical PE in order to avoid the "ill-conditioned" problem of determining the precise value of uninformative parameters (see Section 2.2.1 below).
Case (3) poses a severe problem for classical PE that can only be dealt with via many restarts of the minimization procedure for diverse initial conditions. In realistic cases, however, one may have to deal with a mixture of all three cases and may want to get quantitative information on parameter uncertainty and the effects causing it.
Quantitative information on parameter uncertainty typically requires using Bayesian PE instead of classical PE. Bayesian PE, in the form discussed in this article, samples the distribution on parameter space induced by the limited data available, allows for including prior information, and permits not only quantifying uncertainty but also identifying uninformative parameters as well as multi-modal distributions. However, it is computationally much more demanding than classical PE. Therefore, uncertainty quantification via Bayesian PE can complement classical PE but will not replace it.
Bayesian methods for chemical reaction kinetics have also been discussed extensively in the literature, especially for model inference and model reduction for large chemical reaction networks [5-8] or polymer reaction kinetics, [9] in relation to specific chemical contexts such as catalysis [10] or combustion, [11] for large experimental data sets [12] or high-dimensional cases, [13] as well as for more theoretical aspects like approximation quality and sparsity. [14] In contrast, applications to polymer chemistry, especially applications to polymer reaction kinetics with a focus on Bayesian PE, have rarely been considered: for example, in ref. [15], a Bayesian framework including PE is presented for surfactant-polymer flooding with no focus on polymer reactions. Process design via Bayesian optimization and Bayesian design of experiments has attracted some attention recently (see refs. [16-18], for example), but with no focus on PE.
In this article, Bayesian PE for polymer reaction kinetics is discussed, including uncertainty quantification and its propagation through the model for the real-world scenario of too few and low-quality data. We outline the theoretical foundations and introduce a practical sampling algorithm including a novel and efficient step-size scaling procedure. Moreover, we will show how the different tools used in classical PE, like efficient evaluation or gradient computation of the residual, re-appear (and can be re-used) in Bayesian PE.
The aim of this article is to show that only an interplay between classical and Bayesian PE allows one to perform reliable PE in the real-world scenario of insufficient data, shedding light on parameter uncertainty and the effects causing it. The article describes an entire pipeline of tools, from classical PE, via its improvement for ill-conditioned cases, up to uncertainty quantification via Bayesian PE, including efficient sampling and visualization. While many individual ideas contained in this article are not entirely new and have in part been known for many years, their composition into a unified framework for PE including uncertainty quantification (PE+UQ) is new in the best sense of a feature article. To the best of our knowledge, no article exists in which the topic of PE+UQ in polymerization is covered in a similar way or in comparable breadth. Moreover, several of the components were adapted to the context and several others are novel. We believe that this article provides a helpful reference for students and experts involved with non-standard PE problems, both in polymer kinetics and in other problems where measurements and unknown parameters are strongly unbalanced.
In order to make real-world testing easily possible, the entire integrated pipeline is made available in the commercial software package PREDICI. [19] This article is the first part of a two-paper series. In this part, we will develop the theory and illustrate it on a few very basic examples. In the second part, a comprehensive co-polymerization model from polymer kinetics will be used to show how to really work with the suggested tools.
The article is organized as follows: First, in Section 2, we give an introduction to PE for ODEs, starting with the general setting, then considering classical PE including residual minimization, the potential source of ill-conditioning and improved numerical stability, followed by outlining the key ideas of Bayesian PE including tools for efficient sampling of the posterior distribution, its visualization, resulting uncertainty propagation, and concluded by a comparison between classical and Bayesian PE. Next, in Section 3, we illustrate the interplay between classical and Bayesian PE and demonstrate the performance of the proposed algorithms in application to two examples, an illustrative example of a chemical reaction scheme with four substances, and a more realistic radical polymerization setting. Finally, in Section 4, we draw conclusions and set the results into perspective.

Parameter Estimation
This section is devoted to the in-depth discussion of two different approaches to parameter estimation (PE) for mathematical models in terms of differential equations.

General Remarks on Parameter Estimation of Chemical Reaction Models
We will concentrate on the PE problem for chemical reaction models that we are going to consider later in this article. Such models can be written as systems of ordinary differential equations (ODEs)

ẋ(t) = f(x(t), θ), x(0) = x₀  (1)

where x₀ is a given initial value and θ ∈ ℝ^d denotes a vector of parameters, such as a set of reaction rate coefficients, that determine the outputs of the model. The general form of (1) also holds for general evolution equations, in particular, the countable systems of ordinary differential equations (CODEs) that are used to describe polymerization kinetics. [19] The state of the system at time t, for example, the vector of concentrations of chemical species in a reactor, is denoted by x(t). That is, x(t) in general is a multi-dimensional vector in ℝ^m, and the right-hand side of our ODE system, f, maps x(t) to its temporal derivative, again a vector in ℝ^m. Under very mild conditions on f, the ODE system (1) has a unique solution that is completely specified by the initial condition and the present parameter values. We simply write it in the following form:

x_i(t) = F_i(t, x₀, θ)  (2)

where x_i(t) denotes the ith component of the vector x(t) and F_i the respective component of F. By integrating Equation (1) in time, we find that

F(t, x₀, θ) = x₀ + ∫₀ᵗ f(x(s), θ) ds  (3)

In general, the solution map F is not available in explicit form but can only be computed numerically and comes with the (often considerable) computational effort of computing the trajectory of the ODE system (1) from time 0 to time t. This is especially true for polymerization systems that are solved with respect to full chain-length distributions. Note that in reality complex polymerization systems can result in complex mathematical models, for example, ODEs of high order or partial differential equations (e.g., population balance equations describing the particle size distribution in emulsion polymerization [20]).
In the notation of this article, the model formulation is encapsulated solely in the model function F so that the theory from hereon is applicable regardless of the form or complexity of the system. As we will see later, if the evaluation of the model function is expensive, this naturally makes the use of the numerical methods we introduce expensive.
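As a minimal numerical sketch of the solution map discussed above, the following code approximates F for a hypothetical first-order scheme A → B with rate coefficient k, using an explicit Euler integrator. The reaction scheme, all names, and the integrator choice are our own illustrative assumptions, not taken from the article:

```python
import numpy as np

def f(x, theta):
    # Right-hand side of a hypothetical first-order scheme A -> B
    # with rate coefficient theta[0] (our own toy example).
    k = theta[0]
    return np.array([-k * x[0], k * x[0]])

def F(t, x0, theta, n_steps=1000):
    # Numerical solution map F(t, x0, theta): integrate the ODE (1)
    # from time 0 to t with an explicit Euler scheme.
    x = np.array(x0, dtype=float)
    dt = t / n_steps
    for _ in range(n_steps):
        x = x + dt * f(x, theta)
    return x

x_t = F(1.0, [1.0, 0.0], [0.5])  # state at t = 1 for k = 0.5
```

In practice one would use an adaptive (possibly stiff) integrator; the point here is only that every evaluation of F costs one full trajectory computation.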

Measurements
Next, we assume that we made measurements of the state of the system at times t_j, j = 1, …, T. We denote these measurements by X = (X_1, …, X_T). In general, we have to assume that we may not be able to perform measurements for all components of the state vector x(t) but just for some of them. For the sake of simplicity, we will use notation in which x(t_j), the model's state at t_j, is directly compared to X_j, for example, by computing the Euclidean distance

Σ_{j=1}^T Σ_{i=1}^m (x_i(t_j) − X_ji)²  (4)

where X_ji denotes the ith component of the measurement vector at time t_j. If measurements are only available for some components of the state vector, then the sum must solely contain these components. For real applications, we should write g_i(x(t_j)) using the state vector x(t_j) and a function g_i that maps the state variables to a measurement of type i. Typical examples are the conversion of a chemical substance or the mean value of a polymer distribution. Both are computed in terms of some variables x_i of the state vector. However, for the theory developed herein, we will just use x_i(t_j) as the model description of a measurement to make the presentation simpler. Of course, the goal is to develop a model so that each output x_i(t_j) is as close as possible to X_ji, that is, to minimize the distance in Equation (4).

Residual Function
If we knew that the model (1) perfectly reproduced the measurements for a given parameter vector θ*, then we would have a perfect fit in the sense that

F_i(t_j, x₀, θ*) = X_ji for all i, j  (5)

For all other parameter vectors, we can measure the deviation between model and data by the weighted least squares residual function

R_X(θ) = (1/(mT)) Σ_{j=1}^T Σ_{i=1}^m (F_i(t_j, x₀, θ) − X_ji)² / D_ji  (6)

where the diagonal weight matrix D_j contains the diagonal entries D_ji = X_ji², i = 1, …, m. We choose the entries of D in this way to measure the goodness of fit of the model F to each data point X_ji by the same simple relative error model, since we do not assume any special knowledge on specific data points. This also makes it easier to translate the formulation to further theoretical sections in this article. Note that the weight matrix could be chosen differently to reflect even a manually chosen weighting of the errors of individual data points (see, e.g., refs. [21,22]). For example, in many real cases, one will extend the scaling D_ji by measurement inaccuracies or other thresholds s_ji by setting D_ji = max(X_ji², s_ji). In the perfect fit case, the residual function would have a minimum at θ* with R_X(θ*) = 0. Any evaluation of the residual function comes with the cost of computing the solution of the ODE system, in general by numerical means.
We cannot assume to have a perfect fit for various reasons; for example, our model might simply be imprecise. Nevertheless, parameter vectors for which the residual function R_X is smaller will be preferable to those with larger values of R_X. The parameter values for which R_X attains a (global) minimum belong to the best fit given the model and will be called the optimal parameters, given model and data. As we will see, it is crucial to be aware that there may be more than one minimum, so-called local minima, and that the vicinity of a global minimum may contain regions in which the values of R_X only slightly deviate from its minimal value, at least in comparison to the error of computing the residual function numerically. These are exactly the problems that make PE notoriously hard.
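A minimal sketch of evaluating the weighted least-squares residual with relative-error weights D_ji = X_ji², as described above. The closed-form toy model and all names are our own illustrative assumptions:

```python
import numpy as np

def model(t, x0, theta):
    # Hypothetical closed-form solution map of a first-order scheme
    # A -> B with rate coefficient theta[0] (our own toy example).
    k = theta[0]
    a = x0[0] * np.exp(-k * t)
    return np.array([a, x0[0] - a + x0[1]])

def residual(theta, t_obs, x0, X):
    # Weighted least-squares residual R_X(theta): squared relative
    # deviation between model output and data, averaged over all
    # m components and T time points (weights D_ji = X_ji^2).
    T, m = X.shape
    R = 0.0
    for j, t in enumerate(t_obs):
        pred = model(t, x0, theta)
        R += np.sum((pred - X[j]) ** 2 / X[j] ** 2)
    return R / (m * T)
```

At the data-generating parameter the residual vanishes; away from it, the residual is strictly positive, which is exactly the landscape the minimization methods below operate on.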

Measurement Errors
Until now, we have not considered any measurement error. In general, the measurements will have a relative error, that is, the measured value X_ji of component i at time t_j will be related to the real value X̄_ji via

X_ji = X̄_ji (1 + ε_ji)  (7)

where ε_ji is an unknown measurement error that is often modeled by a normally distributed random number with mean 0 and variance σ_ji², with some small σ_ji that has to be provided as part of the measurement process. Consequently, even for a perfect fit we can only hope for

F_i(t_j, x₀, θ*) = X̄_ji = X_ji / (1 + ε_ji)  (8)

such that the residual function will no longer vanish at θ*, but yield the value

R*_X = R_X(θ*) ≈ (1/(mT)) Σ_{j,i} ε_ji²

Therefore, all other parameter vectors θ with R_X(θ) = R*_X are as good as θ* with regard to fitting the model to the data. For the sake of simplicity, let us consider the uniform case σ_ji = σ. Then, the square root of R*_X is normally distributed with mean 0 and variance σ². Because of this, the probability distribution on the parameter space is given by

p(θ) ∝ exp(−R_X(θ) / (2σ²))  (9)

The exponential function comes in because of the assumed normal distribution of the measurement error. This probability distribution is called the likelihood in the sense that all parameters leading to residual values around 0 with standard deviation σ are likely candidates for the "true" parameters, while ones with higher residual values are exponentially unlikely. This concept survives if we do not assume a perfect fit. Then parameter values in a vicinity of a minimum of the residual function are likely, while ones with higher residual values are unlikely.
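The claim that a perfect fit still leaves a residual of size roughly σ² can be checked numerically. The following sketch (our own illustration, with arbitrarily chosen values) draws relative measurement errors and evaluates the relative-error residual of the exact values against the noisy data:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.05                              # assumed relative measurement error
true_vals = np.linspace(1.0, 2.0, 10000)  # hypothetical exact values
noisy = true_vals * (1.0 + sigma * rng.standard_normal(true_vals.size))

# With weights D_ji = X_ji^2, each residual term of the perfect model
# is ((true - measured) / measured)^2, which is approximately the
# squared noise, so the mean residual concentrates near sigma^2.
R_star = np.mean((true_vals - noisy) ** 2 / noisy ** 2)
```

The computed value lands close to σ² = 0.0025, illustrating that even the best possible parameters cannot push the residual below the noise floor.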
In the case that the σ_ji are not identical across all i, j, it is straightforward to modify the likelihood accordingly by dividing each term in Equation (6) by the corresponding σ_ji² instead of σ². In part 2 of this series, we will devote more space to the measurements and their accuracies in the context of polymerization kinetics.
Remark 1. Note that we take the perspective that the measurement X_ji is given and together with the measurement error induces a distribution on what the true value X̄_ji is. This leads to Equations (8) and (9). One could alternatively assume X_ji = F_i(t_j, x₀, θ*)(1 + ε_ji), so that the measurement results from a perturbation of the true model, and define the diagonal entries of the weight matrix in (6) by D_ji = F_i(t_j, x₀, θ)² to arrive at (9).

Classical versus Bayesian PE
These short considerations already explain the different approaches to the PE problem: classical PE is focused on computing minima of the residual function in a reliable and numerically efficient way. However, even local minimization of R_X may be troublesome because there may be extended regions in parameter space where the residual is numerically indistinguishably small, which makes numerical computations ill-conditioned. In contrast, Bayesian PE tries to explore the likely parameter regimes stochastically. Numerically, this is a much more demanding task that avoids the problems of classical PE but often may be computationally infeasible. If feasible, it generates information about the uncertainty of model-based predictions caused by uncertainty about the parameter values. The following two sections will outline these aspects in more detail.

Classical Parameter Estimation
While the generation of synthetic data with a model with parameters θ (or of true measurements with a chemical experiment in the laboratory) is usually called the forward problem, the inference of the parameters from the data is referred to as the inverse problem. In classical parameter estimation, solving the inverse problem means that one searches for the global minimum θ* of the residual function.
In many cases, such an inverse problem is ill-conditioned. This means that even slight variations of the data can cause a significantly different result for the estimated parameters. Since we assume the data to be subject to measurement errors, these perturbations naturally occur and can cause misleading results regarding the parameter estimation. These effects can be significant and, it must be emphasised, can very severely alter the result of the parameter estimation, even if the forward problem is well-conditioned (which means slight changes to the parameters only lead to small perturbations of the model result). This is illustrated in Figure 1. An example can be found in ref. [1], pages 227-230.
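The distinction between a well-conditioned forward problem and an ill-conditioned inverse problem can be made concrete with a toy example (our own, not the one from ref. [1]): a linear model observed at nearly identical times has a benign forward map, yet a tiny data perturbation shifts the least-squares estimate by a far larger amount:

```python
import numpy as np

# Hypothetical linear model y = theta_1 + theta_2 * t, observed at four
# nearly identical times: evaluating the model is benign, but recovering
# theta from the data is ill-conditioned.
t = np.array([1.000, 1.001, 1.002, 1.003])
A = np.column_stack([np.ones_like(t), t])

y = A @ np.array([1.0, 2.0])                    # exact synthetic data
y_pert = y + np.array([1e-4, 0.0, 0.0, -1e-4])  # tiny "measurement error"

theta = np.linalg.lstsq(A, y, rcond=None)[0]
theta_pert = np.linalg.lstsq(A, y_pert, rcond=None)[0]

data_shift = np.linalg.norm(y_pert - y)           # size of the perturbation
param_shift = np.linalg.norm(theta_pert - theta)  # resulting parameter shift
```

Here the parameter shift exceeds the data shift by a factor of several hundred, roughly the condition number of the design matrix, while the exact data reproduce the true parameters (1, 2) essentially to machine precision.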
There are various numerical methods to find a local minimum of a function, in our case of the residual function. Most of these methods focus on finding critical points where the derivative of the function is 0. These points typically are local minima. The global minimum is selected as the local minimum with the lowest residual function value. There exists an abundance of literature on numerical parameter estimation methods, to which we refer for a complete overview of the topic, such as refs. [2, 23-27]. We will concentrate on one particular method to both illustrate the basic methodology of such methods and introduce an effective way to overcome typical problems in parameter estimation. This method will also be used in the experiments later on in this article.

Gauss-Newton Method with Essential Directions
A prominent way to find the zeros of a function is Newton's method applied to the derivative of R_X(θ). Let us write the terms for all i, j into a single residual vector

r(θ) = (r_ji(θ))_{j,i} ∈ ℝ^N, r_ji(θ) = (F_i(t_j, x₀, θ) − X_ji) / √(D_ji), N = mT  (13)

and denote by J(θ) ∈ ℝ^{N×d} its Jacobian with entries ∂r_ji/∂θ_j, where for now θ_j denotes the jth entry of a parameter vector θ. This approach leads to the Gauss-Newton method. [1,28] It denotes an iterative scheme in which, from an initial parameter guess θ₀, subsequent iterates θ_k, k = 1, 2, … are generated. Please note that the index of the vector θ_k now denotes the iteration step such that its jth entry is denoted θ_k,j. Each step of the iteration consists of the following two sub-steps: first, one solves the linearized least squares problem

min_{Δθ} ‖r(θ_k) + J(θ_k) Δθ‖²  (14)

and then sets θ_{k+1} = θ_k + Δθ_k. A direct way to solve this minimization problem is by demanding that

J(θ_k)ᵀ J(θ_k) Δθ_k = −J(θ_k)ᵀ r(θ_k)  (15)

Using the Moore-Penrose pseudo-inverse J⁺(θ_k) (this is a substitute for the inverse of the matrix if the true inverse does not exist or the matrix is not square), [29,30] this gives

θ_{k+1} = θ_k − J⁺(θ_k) r(θ_k)  (16)

The pseudo-inverse J(θ_k)⁺ can be computed using singular value decomposition. To this end, one first represents J = J(θ_k) as

J = U S Vᵀ  (17)

where U ∈ ℝ^{N×d} and V ∈ ℝ^{d×d} are orthogonal matrices, while S ∈ ℝ^{d×d} is a diagonal matrix carrying the singular values s₁ ≥ s₂ ≥ ⋯ ≥ s_d on the diagonal. If s_d > 0, J⁺ is given by

J⁺ = V S⁻¹ Uᵀ  (18)

For handling the specific case s_d ≈ 0 or s_d = 0, see below. With this, we can write the Gauss-Newton step in Equation (16) as

θ_{k+1} = θ_k − Σ_{l=1}^d (1/s_l) (U_lᵀ r(θ_k)) V_l  (19)

where U_l and V_l denote the lth columns of U and V, respectively.

Damping: When computing the pseudo-inverse of J(θ), we essentially solve an inverse problem (a linear system of equations) which can be ill-conditioned. In this case, errors in the data may perturb the pseudo-inverse J(θ)⁺. In addition, correlations between parameters will result in unreasonably large step lengths. Consequently, the objective function at the end point of the step may show a potentially huge increase instead of a monotonic decrease.
To counter this, one has to apply a damping strategy by multiplying the Gauss-Newton step with a damping factor λ_k ≤ 1, so that

θ_{k+1} = θ_k − λ_k J⁺(θ_k) r(θ_k)  (20)

The damping factor is updated from step to step using a monotonicity test based on the objective function. There are many sophisticated strategies available; a summary is given in ref. [31].

Essential Directions: In many cases, one has to deal with hidden parameter correlations. These correlations are not easily detected, but rather hidden in the model structure. The parameters are also not fully correlated in a way that one could express one parameter by a function of some others, which would merely be a modeling defect. Instead, they are only locally dependent in the sense that a change of one parameter can nearly (with respect to the residual) be compensated by some other parameters. Because of this, the parameter estimation practically does not lead to a unique solution. In other words: only some of the directions in parameter space are really essential, whereas the others depend on the essential directions. The latter are called the flat directions.
Next, we outline a technique for detecting such parameter correlations and for dealing with them. This technique is rarely discussed in the context of classical PE, but proves to be essential for practical purposes; for more details, please see ref. [1]. A similar approach using rank decisions to identify sensitive parameters is presented in the textbook. [4] As we will see below, parameter correlations and their detection and removal are deeply related to the so-called condition number of the underlying Jacobian matrix J at the respective step of the iteration, which is given by

κ(J) = s₁ / s_d  (21)

that is, by the ratio between the biggest and the smallest singular value. If s_d = 0, then the condition is infinite. Often, the singular values of J show a clear gap in their magnitudes at an index k_ess, meaning that s₁, …, s_{k_ess} are clearly bigger than s_{k_ess+1}, …, s_d. With this, we can closely approximate J(θ) by

J(θ) ≈ Σ_{l=1}^{k_ess} s_l U_l V_lᵀ  (22)

This allows us to fix the condition of the minimization problem in Equation (14) to κ = s₁/s_{k_ess}. The Gauss-Newton step is then given by

θ_{k+1} = θ_k − Σ_{l=1}^{k_ess} (1/s_l) (U_lᵀ r(θ_k)) V_l  (23)

The explicit matrix notation (23) shows that we have transformed the parameter space of the problem into a new space where the essential directions are just the main axes. By dropping the non-essential coordinates there, performing the Gauss-Newton step, and then going back to the original space, we have automatically performed an adjustment of the flat directions based on the progress of the essential directions. It is important to note that the parameters of the flat directions are not insensitive or even fixed. There are examples where all pairs of two out of three parameters (with one parameter fixed in a reasonable range) are perfectly sensitive and essential, but the fit problem with all three parameters has only two essential directions. In all classical PEs shown in this publication, we have applied the detection and treatment of essential directions.
We call this method the reduced-direction approach. [32] Flat directions are usually identified by being assigned to singular values that lead to a condition number of about 100 or larger (if there is no other clear gap between the singular values). In a 2D parameter space, the condition can be visualized by the ratio of the half-axes of an ellipsoid (with the lengths of these half-axes corresponding to the singular values). It is obvious that in cases of large condition numbers of, say, κ(J) = 100 or larger, this ellipsoid almost degenerates to a straight line, exhibiting a clear parameter correlation. Finally, it is important to note that the condition of a problem and the number and type of essential directions strictly depend on the problem setup. Even a slight change of the scaling or a different weighting can alter these structures. All analyses of this kind are only performed locally, at single points in parameter space that are reached by the iteration. Therefore, it is very important to also create a global view of the problem.
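A minimal sketch of one damped Gauss-Newton step with the reduced-direction treatment described above. The function names and the exact handling of the condition threshold are our own; real implementations such as the one in PREDICI are considerably more elaborate:

```python
import numpy as np

def gauss_newton_step(theta, r, J, cond_max=100.0, damping=1.0):
    # One damped Gauss-Newton step based on a truncated SVD of the
    # Jacobian J: directions whose singular values would push the
    # condition number above cond_max are treated as flat and dropped.
    U, s, Vt = np.linalg.svd(J, full_matrices=False)
    keep = s >= s[0] / cond_max          # essential directions only
    step = Vt[keep].T @ ((U[:, keep].T @ r) / s[keep])
    return theta - damping * step
```

For a linear residual with a well-conditioned Jacobian, a single undamped step lands on the least-squares solution; for a rank-deficient Jacobian, the flat direction is simply left untouched instead of producing an unbounded step.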

Bayesian Parameter Estimation
Bayesian parameter estimation aims at avoiding the possibly ill-conditioned inverse problem by a stochastic reformulation of the problem such that only forward problems need to be solved in order to find good parameter values. It does not result in a single optimal parameter value but, in contrast, in information on the probability of certain parameter values. To this end, one includes some a priori known uncertainty about measurement errors into the parameter estimation. The Bayesian inverse problem is structured as

Model, Measurements X, Measurement precision σ → Probability for each θ  (24)

The underlying assumption is that the uncertainty in the given data and model imperfection translates to an uncertainty in the estimated parameters. From a quantification of data uncertainty, Bayesian PE deduces a probability distribution on the parameter space. If this distribution is peaked, that is, concentrates around a single maximum, then the parameter uncertainty is small and the parameter at the maximum is highly informative. If, in contrast, the distribution is spread out, perhaps with many local maxima, then the parameter uncertainty is large. The central relation of Bayesian PE is

p(θ|X) ∝ L(X|θ) p_Θ(θ)  (25)

Here, p(θ|X) denotes the so-called posterior density, interpreted as the probability distribution on parameter space resulting from the data values X observed. L denotes the likelihood that the observed data X come from the model with parameters θ. The density p_Θ, called the prior density, is used to encode all the prior knowledge that we may have on the parameter values. The relation states that the posterior density is proportional, that is, equal up to a constant, to the product of the likelihood and the prior density.

Derivation and Fundamentals of Bayesian Modelling
Bayesian PE starts from the fundamental Bayesian identity that directly results from the definition of conditional probabilities:

ℙ(X ∈ B | θ ∈ A) ℙ(θ ∈ A) = ℙ(θ ∈ A | X ∈ B) ℙ(X ∈ B)  (26)

where, as above, X = (X_1, …, X_T) denotes observation data, θ the parameter vector, and A and B subsets of the parameter space and the data space, respectively. Both sides of this equality are identical to the joint probability that X ∈ B and θ ∈ A.
Next, one assumes that the respective probability distributions on data and parameter space exhibit probability densities, meaning that

ℙ(X ∈ B | θ) = ∫_B L(X|θ) dX  (27)

ℙ(θ ∈ A) = ∫_A p_Θ(θ) dθ  (28)

where L now denotes the density related to the probability of the data, given the parameters. Since this is true for all A, B, we get the Bayesian identity (26) in terms of densities:

L(X|θ) p_Θ(θ) = p(θ|X) p_X(X)  (29)

Assuming p_X(X) > 0, (29) yields

p(θ|X) = L(X|θ) p_Θ(θ) / p_X(X)  (30)

We will see later that the posterior density p(θ|X) allows us to do uncertainty quantification on the parameter estimation and on the resulting model-based predictions. At this point, however, the key fact is that computation of the posterior density is typically well-conditioned even if the inverse problem is ill-conditioned. [33]

Specifics on The Likelihood: Under the assumption that the noise ε_ji in (8) is normally distributed with variance σ², we find, as above,

L(X|θ) ∝ exp(−R_X(θ) / (2σ²))  (31)

where ∝ means identity up to some normalization factor such that ∫ L(X|θ) dX = 1. This holds because

L(X|θ) ∝ Π_{j,i} exp(−(F_i(t_j, x₀, θ) − X_ji)² / (2σ² X_ji²)) = exp(−Σ_{j,i} (F_i(t_j, x₀, θ) − X_ji)² / (2σ² X_ji²))  (32)

The first relation holds because every data point X_ji is assumed to be normally distributed around F_i(t_j, x₀, θ) with variance σ². The likelihood of all data points is then the product of these terms. The factor mT from Equation (6) can be omitted here since it only changes the normalization constant. Note that the residual that is used in classical PE is directly transformed into a probability.

Specifics on p_X: In order to get an interpretation of the data density p_X, we first integrate Equation (29) with respect to θ on both sides, yielding

p_X(X) = ∫ L(X|θ) p_Θ(θ) dθ  (33)

Therefore, p_X merely plays the role of the normalizing factor of the posterior, understood as a density over parameter space. Because of this identity, one often simply writes the relation (29) in the form given by Equation (25).
Specifics on The Prior: The density p_Θ, called the prior density, encodes all the prior knowledge that we may have on the parameter values. Introducing the so-called potential function

S_X(θ) = R_X(θ) / (2σ²) − log p_Θ(θ)  (34)

(which takes the value ∞ where the prior density is zero), and using the expression for the likelihood, we can express the posterior by

p(θ|X) = (1/Z) exp(−S_X(θ))  (35)

with a normalization constant Z. Although with this we merely reformulate the posterior density by introducing additional notation, the term S_X will be of help later on. For the form of the prior, there are two typical cases:

Uniform prior: One may know that the parameters are positive values and perhaps even that their values must come from a certain interval. In this case, p_Θ is constant in a certain subset A of the parameter space and zero outside, that is,

p_Θ(θ) = c for θ ∈ A, and p_Θ(θ) = 0 otherwise  (36)

where c > 0 is a constant value resulting from normalization. It simply means that we know the parameters are in a certain region but have no additional information about them.

Gaussian prior: Often one assumes that the parameters come from a normal distribution around some mean value θ₀ with an appropriate covariance matrix Σ. In this case,

p_Θ(θ) ∝ exp(−(θ − θ₀)ᵀ Σ⁻¹ (θ − θ₀) / 2)  (37)

If the prior takes this Gaussian form, then

S_X(θ) = R_X(θ) / (2σ²) + (θ − θ₀)ᵀ Σ⁻¹ (θ − θ₀) / 2  (38)

Of particular interest is the maximal posterior parameter estimate

θ_MAP = argmax_θ p(θ|X) = argmin_θ S_X(θ)

denoting the most probable parameter value based on both the data and the prior estimate. In the case that the prior is constant over an interval and the optimal parameter estimate of classical PE (the minimizer of the residual) lies in this interval, it is equal to the maximal posterior parameter estimate. If the prior is not uniform and has its global maximum at a different value than the estimate from classical PE, then this is generally not the case. In fact, there then exists a conflict between our prior, data-independent knowledge, which suggests that the optimal parameters are close to one certain value, and the evidence that the data provide. The posterior distribution takes both into account.
Its specific shape then depends on the exact specifications of the prior, the likelihood and the data. For example, it could well be a bi-modal distribution with peaks at the maxima of likelihood and prior.

Computing the Posterior and Related Expectation Values
In the literature, two main algorithmic concepts for computing the posterior dominate. The first and simplest one is grid-based: one defines a grid of parameter values θ_k, k = 1, …, M, evaluates the likelihood and the prior at these points, and sets

p(θ_k|X) ≈ L(X|θ_k) p_Θ(θ_k) / Σ_{k'=1}^M L(X|θ_{k'}) p_Θ(θ_{k'})

This requires M evaluations of trajectories of the ODE system (one for each point as part of the evaluation of the residual R_X; remember that the likelihood L is directly computed from R_X as in (31)). Whenever the parameter space is high-dimensional, the number of points of a typical grid gets too large (it grows as n^d if we take n grid points per dimension for each of the d dimensions). For such cases, sparse grid or sparse approximation techniques offer a solution if the dimension is not too high. [14,34] Expectation values (like the mean or the variance) of the posterior distribution are then computed by

E[A] ≈ Σ_{k=1}^M A(θ_k) p(θ_k|X)

for some function A = A(θ) (with A(θ) = θ for the mean and A(θ) = θθ^T for the variance).
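As a concrete illustration, the grid-based computation fits in a few lines. This is a minimal sketch assuming a hypothetical one-parameter decay model with a uniform prior on the grid interval; it is not the setup used later in the article.

```python
import numpy as np

# Toy model: first-order decay F(t, theta) = exp(-theta t); synthetic data.
t = np.linspace(0.0, 5.0, 10)
theta_true = 0.7
rng = np.random.default_rng(0)
data = np.exp(-theta_true * t) * (1 + 0.05 * rng.standard_normal(t.size))

sigma = 0.05  # assumed measurement noise level

def residual(theta):
    # sum of squared deviations between model output and data (the R_X of the text)
    return np.sum((data - np.exp(-theta * t)) ** 2)

# Grid-based posterior: evaluate likelihood x (uniform) prior on M grid points.
grid = np.linspace(0.01, 2.0, 500)   # support of the uniform prior
dg = grid[1] - grid[0]
log_post = np.array([-residual(th) / (2 * sigma**2) for th in grid])
post = np.exp(log_post - log_post.max())   # subtract max to avoid underflow
post /= post.sum() * dg                    # normalize to a density on the grid

# Expectation values by quadrature: A(theta) = theta gives the mean,
# A(theta) = theta^2 the second moment (and hence the variance).
mean = np.sum(grid * post) * dg
var = np.sum(grid**2 * post) * dg - mean**2
print(mean, var)
```

Each grid point costs one model evaluation, which is exactly why this approach breaks down in higher dimensions.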
The second approach samples the posterior distribution. This means, one generates a set of parameters, the samples, that are distributed according to the desired posterior distribution.
To do this, there is a variety of techniques based on Monte Carlo methods [35] and corresponding multilevel approaches. [36] A particularly prominent variant is the so-called Langevin sampler, [37,38] or its preconditioned or underdamped versions, [39,40] which generates a sequence of parameter points (θ_1, …, θ_n) according to the following iterative scheme: In each step, one first computes the proposal θ̃_{k+1} for the next parameter point via

θ̃_{k+1} = θ_k − Δt ∇S_X(θ_k) + √(2Δt) r_k

where r_k is a random vector generated from the standard m-dimensional normal distribution with mean 0 and variance 1, Δt is some sufficiently small step size, and ∇S_X is the gradient of the function S_X from Equation (37). Next, one determines the acceptance probability α according to the Metropolis-Hastings algorithm [41]:

α = min{ 1, [ e^{−S_X(θ̃_{k+1})} q(θ_k | θ̃_{k+1}) ] / [ e^{−S_X(θ_k)} q(θ̃_{k+1} | θ_k) ] }

where q(·|·) denotes the Gaussian density of the proposal step. This choice of the acceptance probability yields that the detailed balance property holds for the sequence of parameters; that is, in expectation, there are as many jumps from a parameter θ to θ̃ as in the other direction. Under mild additional conditions, this has the consequence that the distribution of parameters in the sequence converges to (1/Z) e^{−S_X(θ)} as in Equation (35). Last, in each step, one sets θ_{k+1} = θ̃_{k+1} with probability α and θ_{k+1} = θ_k otherwise. The acceptance step guarantees that the sampler cannot enter regions of the parameter space where S_X = ∞, which may exist due to uniform priors, for example.
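A minimal sketch of such a Langevin sampler with Metropolis-Hastings correction (MALA) might look as follows; the Gaussian target and all numerical settings are illustrative assumptions, not the article's implementation.

```python
import numpy as np

def mala(grad_S, S, theta0, dt, n_steps, rng):
    """Metropolis-adjusted Langevin sampler for a target density ~ exp(-S(theta)).

    Proposal: theta~ = theta - dt * grad_S(theta) + sqrt(2 dt) * r,  r ~ N(0, I).
    The Metropolis-Hastings correction enforces detailed balance.
    """
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    d = theta.size
    samples, accepted = [], 0
    for _ in range(n_steps):
        prop = theta - dt * grad_S(theta) + np.sqrt(2 * dt) * rng.standard_normal(d)
        # log q(theta | prop) - log q(prop | theta) for the asymmetric proposal
        fwd = prop - (theta - dt * grad_S(theta))
        bwd = theta - (prop - dt * grad_S(prop))
        log_q_diff = (-(bwd @ bwd) + (fwd @ fwd)) / (4 * dt)
        log_alpha = S(theta) - S(prop) + log_q_diff
        if np.log(rng.uniform()) < log_alpha:
            theta = prop
            accepted += 1
        samples.append(theta.copy())
    return np.array(samples), accepted / n_steps

# Demo on a 1D Gaussian target S(theta) = theta^2 / 2, i.e., posterior N(0, 1).
rng = np.random.default_rng(1)
S = lambda th: 0.5 * float(th @ th)
grad_S = lambda th: th
chain, acc_rate = mala(grad_S, S, theta0=[3.0], dt=0.5, n_steps=5000, rng=rng)
burned = chain[500:]  # discard burn-in, as done in the examples below
print(burned.mean(), acc_rate)
```

For a real PE problem, S and its gradient would come from the residual of the ODE model plus the log-prior, which is where the cost of the method lies.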
The sequence generated via this kind of MALA is automatically distributed according to the posterior [37] such that expectation values are simply given by

E[A] ≈ (1/n) Σ_{k=1}^n A(θ_k)

by the Birkhoff Ergodic Theorem. [42] These sampling techniques automatically concentrate points in regions of the parameter space where the posterior density is significantly large. Expectation values converge with n^{−1/2} in the number of points generated, almost independently of the dimension. However, each step of the Langevin sampler requires the evaluation of the gradient of the residual function, which may be very expensive whenever the dimension of the data space is high. Note that for multi-modal posteriors, the expressiveness of the expectation value can be low, since it will simply lie somewhere between the different modes without sending an interpretable message. In this case, it becomes especially important to get a global view of the distribution instead of simply considering the expectation value.
The optimal acceptance rate for fast convergence typically lies between 50% and 70%, as is discussed in refs. [37,43]. It does not hurt if the acceptance ratio lies slightly outside of this interval. Truly worrying would be acceptance ratios close to 100%, because this would indicate that the step size is too small to efficiently sample the entire state space of interest, and below ≈20%, because in this case the algorithm is likely to remain in one state unreasonably long.
We provide additional explanations on the algorithm in Appendix A.1.

Prescaling of the Step Size
In the case that parameters are of different orders of magnitude, the choice of a reasonable step size Δt can pose a difficult task. A choice that is suitable for one parameter might be too big for another, so that proposals frequently come from regions where S_X is high and are accepted only with very low probability. It might also be too small, so that an infeasibly high number of steps is required for the distribution of samples to converge, because the steps taken are minuscule in the directions of parameters whose values are of higher orders of magnitude (see Figure 2).
We therefore introduce a novel step size prescaling technique, which chooses a different step size in each parameter direction by prescaling the step size for each parameter beforehand. We replace the step size Δt by a diagonal matrix P that contains the step sizes for each direction on its diagonal. The proposal step then has the form

θ̃_{k+1} = θ_k − P ∇S_X(θ_k) + √(2P) r_k

where r_k is a vector with random entries drawn from a standard normal distribution. In Appendix A.2, we provide an algorithmic way to suitably set the diagonal entries of P. Note that realistic examples very often require a proper prescaling.

Figure 2. Top: a step size that is suitable for parameter 1 but not for parameter 2, since the proposal lies outside the parameter space. Bottom: a step size that is suitable for parameter 2 requires many steps to move through the space of parameter 1.
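The prescaled proposal step can be sketched as follows. The rule used here for filling the diagonal of P (a fixed fraction of each parameter's expected magnitude, echoing the "one fiftieth" choice used later in the examples) is a hypothetical heuristic standing in for the algorithm of Appendix A.2.

```python
import numpy as np

rng = np.random.default_rng(2)

# Suppose parameter 1 lives near 1e-4 and parameter 2 near 1e2: a single scalar
# step size cannot serve both directions. A (hypothetical) heuristic is to make
# the step in each direction a fixed fraction of that parameter's magnitude.
scales = np.array([1e-4, 1e2])
P = np.diag((scales / 50.0) ** 2)     # per-direction step sizes on the diagonal

def proposal(theta, grad_S_theta):
    """Prescaled MALA proposal: theta~ = theta - P grad S + sqrt(2 P) r."""
    r = rng.standard_normal(theta.size)
    return theta - P @ grad_S_theta + np.sqrt(2.0) * np.sqrt(np.diag(P)) * r

theta = np.array([1e-4, 1e2])
prop = proposal(theta, np.zeros(2))   # zero gradient: pure diffusion step
print(prop)
```

With a scalar Δt, the noise term √(2Δt) r would be identical in both directions; here each direction moves on its own scale.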

Marginal Densities
Whenever θ has more than two dimensions, we can hardly visualize the full density any more. Instead, we will be interested in the posterior densities of individual parameters or of subsets of two of them. These densities will be called marginal densities: let θ be multidimensional and split into θ = (u, v), and let the set of all possible parameters be denoted by Θ = Θ_U × Θ_V (so that u lies in Θ_U and v in Θ_V). Then the marginal density of u is defined as

p_U(u|X) = ∫_{Θ_V} p((u, v)|X) dv

In other words, one integrates the posterior p with respect to all entries of θ except for the ones in u (see Figure 3).
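For a sampled posterior, marginalization requires no integration at all: one simply keeps the relevant coordinates of each sample. A small sketch with synthetic (hypothetical) 3D samples:

```python
import numpy as np

# Hypothetical 3D posterior samples: a correlated Gaussian for illustration.
rng = np.random.default_rng(3)
cov = np.array([[1.0, 0.8, 0.0],
                [0.8, 1.0, 0.0],
                [0.0, 0.0, 0.5]])
samples = rng.multivariate_normal(mean=[0.0, 0.0, 2.0], cov=cov, size=10000)

# Marginal over u = (theta_1, theta_2): drop the theta_3 coordinate.
u_samples = samples[:, :2]
# 1D marginal of theta_3.
v_samples = samples[:, 2]

print(u_samples.shape, v_samples.mean())
```

A histogram or KDE of `u_samples` then approximates p_U directly, which is exactly how the marginal plots in the examples below are produced from MALA output.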

Visualization of Densities with Kernel Density Estimation
Sometimes a global representation of the posterior is needed, for example, for the sake of visual inspection. Then point-wise values need to be interpolated in some appropriate form. To this end, we suggest kernel density estimation (KDE). [44,45] Although there exist various other approaches to visualizing the distribution, we want to illustrate KDE here in order to provide a complete work flow from the classical PE problem of minimizing the residual, over the augmentation into Bayesian PE and sampling from the posterior, to the visualization. KDE can be understood as a continuous form of a histogram. Having sampled parameters θ_1, …, θ_n from the posterior with, for example, MALA, we could create a histogram of the distribution by counting the number of samples that lie in each of a chosen set of boxes. However, weaknesses of histograms include a severe loss of precision when the boxes are chosen too big and the need for a high number of samples when the boxes are chosen small. In KDE, we approximate the posterior at every point θ by

p̂_KDE(θ) = (1 / (n det H)) Σ_{k=1}^n K(H^{−1}(θ − θ_k))

where K is a closeness measure (the kernel) and H is a diagonal matrix that contains so-called bandwidths h_1, …, h_d > 0 as diagonal entries. As a consequence, p̂_KDE(θ) will be high in regions with many samples, since the values K(H^{−1}(θ − θ_k)) that are added together will be high if many samples are close to θ. This is consistent with the fact that, by construction of the MALA algorithm, many samples should come from regions where the posterior is high. In total, we approximate p(θ|X) through a continuous weighting according to the closeness of the samples to θ. The similarity to histograms lies in the fact that we still keep track of how many samples are close to a point in the parameter space, but instead of simply counting how many samples lie inside a certain box, we include the exact distance between each sample and θ. More details on the choice of the function K and the bandwidths can be found in Appendix A.3.
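A minimal sketch of the estimator with a Gaussian kernel, evaluated on hypothetical 1D samples (the bandwidth is chosen by hand here, not by the rules of Appendix A.3):

```python
import numpy as np

def kde(theta, samples, H):
    """Gaussian-kernel density estimate at a point theta from posterior samples.

    K is the standard normal kernel and H a diagonal bandwidth matrix; the
    estimate averages K(H^-1 (theta - theta_k)) over all samples, normalized so
    that the result integrates to one.
    """
    h = np.diag(H)
    d = samples.shape[1]
    z = (theta - samples) / h                         # scaled distances, (n, d)
    kernel_vals = np.exp(-0.5 * np.sum(z**2, axis=1)) / ((2 * np.pi) ** (d / 2))
    return kernel_vals.sum() / (len(samples) * np.prod(h))

# Hypothetical samples, as if produced by MALA: 1D Gaussian around 1.0.
rng = np.random.default_rng(4)
samples = rng.normal(loc=1.0, scale=0.2, size=(5000, 1))
H = np.diag([0.05])                                   # hand-picked bandwidth
density_at_mode = kde(np.array([1.0]), samples, H)
print(density_at_mode)
```

Evaluating `kde` on a fine grid of points yields the smooth curves used for the density plots in the figures below.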

Comparison between Classical and Bayesian PE
So far we have discussed two fundamentally different approaches to PE. In classical PE, we seek a solution of the optimization problem of minimizing the residual R_X over the parameter space in order to find the parameter that, in combination with the model, best explains the data at hand. In Bayesian PE, we assign a probability distribution to parameter space, the posterior distribution, taking into account that the data and model might not be precise.
In this way, we can quantify how reliably we can determine the parameters and draw conclusions about the range of possible forward simulations of the model. Although both approaches can be tackled by various means, there are specific state-of-the-art methods for both, which we have presented in our variants of Gauss-Newton optimization and MALA. We have to observe, however, that in spite of their different aims, the two approaches are algorithmically quite similar. As conceptualized in Figure 4, both methods rely on the approximation of the gradient of the residual function that compares the data and the output of the model for a given parameter. While in classical PE we construct a sequence that, we hope, quickly converges to a minimum of the residual, in the Bayesian approach with MALA we generate a sequence (of a chosen length) which should visit parameters with a frequency proportional to the probability with which they denote the optimal parameter. From there on, the real structural differences become apparent: in the Bayesian approach, we can quantify and visualize the parameter uncertainty and, instead of using the optimal parameters for forward simulation of the model, compute the propagation of parameter uncertainty through the model, for example, by computing the 90% percentile of the forward trajectories (this last aspect will be illustrated in Section 3).

Examples
We now showcase the interplay of classical PE with the Bayesian approach to parameter estimation in combination with the prescaled MALA on two examples and explain which conclusions can be made from the results in practice. The first example is quite simple and meant to illustrate the basic steps while the second is more complex. All simulations have been performed in the commercial program package Predici, v11, 2021. [19]

Example 1: Simple Four-Substance First Order Reaction
Consider the reaction scheme of three parallel first-order reactions A → B, A → C, and A → D with rate coefficients k_ab, k_ac, and k_ad and with initial concentrations A_0 = 10 mol l^{−1}, B_0 = C_0 = D_0 = 0 mol l^{−1}. From here onwards, we omit the units of the parameters. The reaction temperature is not important here. We simulate the reaction for a time span of 1000 s and observe the concentrations of A, B, C, and D at 22 different time points. We then perturb each data point by a factor that is normally distributed with mean 1 and standard deviation σ = 0.1 to simulate measurement errors (Figure 5).
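Such synthetic data can be generated directly from the closed-form solution of the rate equations. The rate coefficients below are assumptions consistent with the estimates quoted later in this section:

```python
import numpy as np

# Parallel first-order reactions A -> B, A -> C, A -> D. The coefficients are
# assumed values (in 1/s) consistent with the estimates reported in the text.
k_ab, k_ac, k_ad = 1.0e-4, 1.0e-5, 5.0e-5
k_sum = k_ab + k_ac + k_ad
A0 = 10.0  # mol/l

t = np.linspace(0.0, 1000.0, 22)   # 22 observation times over 1000 s

# Closed-form solution of dA/dt = -k_sum A, dB/dt = k_ab A, dC/dt = k_ac A, ...
A = A0 * np.exp(-k_sum * t)
B = A0 * (k_ab / k_sum) * (1 - np.exp(-k_sum * t))
C = A0 * (k_ac / k_sum) * (1 - np.exp(-k_sum * t))
D = A0 * (k_ad / k_sum) * (1 - np.exp(-k_sum * t))

# Multiplicative measurement noise: each point perturbed by a factor ~ N(1, 0.1^2).
rng = np.random.default_rng(5)
sigma = 0.1
data = np.stack([A, B, C, D]) * (1 + sigma * rng.standard_normal((4, t.size)))
print(data.shape)
```

Note that A decays with the sum of the three coefficients, a fact that drives the identifiability issue discussed next.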

Estimating Parameters with Data about A
As a first parameter estimation step, we determine the best-fitting parameters k_ab, k_ac, and k_ad using only the data points corresponding to A. Using the Gauss-Newton method with essential directions with initial values (k_ab)_0 = (k_ac)_0 = (k_ad)_0 = 10^{−5}, we obtain that the residual is minimized for k_ab = 0.89 × 10^{−4} and k_ac = k_ad = 2.85 × 10^{−5}, yielding a residual of ≈ 0.08. The estimate of k_ab shows an error of 11%, while the estimates of k_ac and k_ad deviate strongly from their true values. The condition number of the Jacobian turns out to be ≈ 10^8, a value that can be interpreted as a sign of strong correlation of at least two parameters. Indeed, the analysis of degrees of freedom shows only two out of three independent parameter directions here.
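This identifiability problem can be reproduced with a generic least-squares solver as a stand-in for Predici's Gauss-Newton with essential directions: the data for A depend only on the sum k_ab + k_ac + k_ad, so the Jacobian is rank-deficient and badly conditioned. Coefficients and noise level are assumptions, as above.

```python
import numpy as np
from scipy.optimize import least_squares

# Fit k_ab, k_ac, k_ad using only data for substance A.
t = np.linspace(0.0, 1000.0, 22)
k_true = np.array([1.0e-4, 1.0e-5, 5.0e-5])
rng = np.random.default_rng(6)
A_data = 10.0 * np.exp(-k_true.sum() * t) * (1 + 0.1 * rng.standard_normal(t.size))

def res(k):
    # A depends only on k.sum(): the three coefficients are not separable here.
    return 10.0 * np.exp(-k.sum() * t) - A_data

fit = least_squares(res, x0=np.full(3, 1e-5), bounds=(1e-6, 1e-3))
cond = np.linalg.cond(fit.jac)   # huge: the columns of the Jacobian coincide
print(fit.x, cond)
```

Any split of the fitted total rate among the three coefficients gives the same residual, which is why the individual estimates are arbitrary while their sum is well determined.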
For this reason, we will now investigate the probability densities of the parameters. We deploy the prescaled MALA and make the correct assumption that the data are subject to normally distributed measurement errors with σ = 0.1. We bound the parameter space to the interval [10^{−6}, 10^{−3}] × [10^{−6}, 10^{−4}] × [10^{−6}, 10^{−4}] for k_ab, k_ac, and k_ad, resulting in a uniform distribution on this interval as the prior distribution p_Θ. We specify as residual function the function introduced in Equation (6). We set up the prescaled MALA so that a step should on average have the length of one fiftieth of the length of the parameter space in each direction. The sampling sequence created has length 2500. As initial values, we choose the estimated parameters given above but discard the first 500 steps of the algorithm to minimize the dependence on the initial values. The acceptance ratio lies at 75%, which is slightly higher than the supposed optimal range of 50-70% mentioned in Section 2.3 but still suitable.
The result is shown in Figure 6 (blue curves). As we can see, we get a broad range of probable values for all three parameters, especially k ac and k ad . It is important to observe here that in spite of the deviation of the estimated parameters from the true values, the true values are well inside the range of probable parameters. However, the data about A do not seem to be enough to determine good estimates for k ac and k ad . This lack of data together with the assumption of a relatively high measurement uncertainty yields a high uncertainty of the parameters.

Including Data about B, C and D
We now include the observed values of B, C, and D into the data set, each again consisting of 22 data points. The parameter estimation using Gauss-Newton with essential directions gives k_ab = 0.89 × 10^{−4}, k_ac = 0.95 × 10^{−5}, and k_ad = 4.57 × 10^{−5} with a condition number of ≈ 2; so, apparently, the inclusion of B, C, and D significantly improves the parameter estimation. The estimated values, however, still deviate from the true parameters, especially the estimate for k_ab. In order to check whether there still exists a large uncertainty around these values, we again sample parameters using the prescaled MALA from the posterior distribution that results from the inclusion of B, C, and D. As we can see in Figure 6 (red curves), the densities for k_ac and k_ad become much slimmer, yielding a more reliable estimate for k_ac. Moreover, the true values are still inside the range of probable values.

Effect of a Lower Measurement Uncertainty
Let us assume we now take the measurements with a more precise apparatus. To this end, we perturb the original data again but with σ = 0.03. The result of the Gauss-Newton scheme now is k_ab = 1.0 × 10^{−4}, k_ac = 0.99 × 10^{−5}, and k_ad = 4.96 × 10^{−5}, with a condition number of ≈ 1. This is much closer to the true values. Sampling from the ensuing posterior distribution (with σ = 0.03) with the prescaled MALA again generates densities which are much more concentrated than with σ = 0.1.
In summary, as we can see in Figure 6 (orange curves), higher trust in the observed data (resulting in a smaller value for σ) and the inclusion of the concentrations of B, C, and D in the parameter estimation procedure now enable us to determine all parameters with only a small amount of uncertainty. It is important to understand that the specified value for σ depends on the measurement precision of the experimentalist. The more precisely the measurements are taken, the more meaningful the result generally is. This is also illustrated by the fact that the parameter densities become slimmer, meaning that the range of likely values for a parameter is narrowed down with higher measurement precision. This means that obtaining a small residual in classical PE and a large range of probable parameters for a high value of σ need not be a contradiction: if the data uncertainty is large, then the residual can be made small, but at the same time the optimal parameters are not reliable.
Note that the overall measurement error may have different sources. In most cases, one determines σ as a repetition error, that is, by measuring the same sample n times, for example by spectroscopic or chromatographic methods. Then σ is usually small, but this neglects all errors caused by sample preparation, which usually requires several operations like dilution, filtration, neutralization, extraction, etc. Further errors arise from the evaluation of the raw data, that is, the transformation of the detector signal into the final measured quantity. This may involve setting the correct baseline, the choice of the calibration curve, etc. These errors are usually larger than the mere repetition error. By choosing different values for σ, we want to demonstrate the importance of a correct evaluation of σ.
If one has, ideally, estimated the measurement precision to be high, or is at least confident in the data measurements, then σ should be chosen small. As a consequence, the parameter densities will likely be very sharp and centered around the optimal value from classical PE. In conclusion, this example illustrates how the reliability of parameter estimation depends on the data at hand and on the amount of trust we can put into it.

Uncertainty Propagation
We can now substitute the sampled parameters into the model itself to see the range of probable outcomes of a simulation. To this end, we have simply created one trajectory for each sampled parameter vector and, at each point in time, discarded the upper and lower 5% of the values, leaving us with the 90% percentile. In Figure 7, we can see that all data points for D lie inside the 90% percentile of the simulations. This indicates that our model can explain the data quite well, taking into account the data uncertainty (of additional interest could be forward simulations beyond the time span for which the data were taken). This should be no surprise, since the model we used to estimate the dynamics is precisely the one we used to create the data. This does not always have to be the case, for example, if the data are generated by a complex chemical experiment for which we only have a much simpler model, as the next example shows.

Remark 2. In the above example, one might argue that trying to fit all parameters only using the measurements for substance A is not reasonable. However, in more complex examples, this is the typical situation: having a set of measurements and a given model with some unknown parameters where it is not clear whether it is possible to get a unique or narrowly distributed parameter fit.
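The percentile-band construction described above can be sketched as follows; the posterior samples and the fixed rate coefficients are hypothetical stand-ins for the MALA output:

```python
import numpy as np

# Propagate sampled parameters through the model and form a 90% band: simulate
# one trajectory per posterior sample, then take the 5% and 95% quantiles at
# each time point.
rng = np.random.default_rng(7)
t = np.linspace(0.0, 1000.0, 50)

# Stand-in for MALA output: 2000 sampled values of a single rate coefficient.
k_samples = rng.normal(loc=5.0e-5, scale=5.0e-6, size=2000)

# Trajectories D(t) = A0 (k / k_sum) (1 - exp(-k_sum t)) for each sample,
# with the other rates held fixed for simplicity.
k_ab, k_ac, A0 = 1.0e-4, 1.0e-5, 10.0
k_sum = k_ab + k_ac + k_samples[:, None]
D = A0 * (k_samples[:, None] / k_sum) * (1 - np.exp(-k_sum * t))

lower = np.quantile(D, 0.05, axis=0)   # discard the lower 5% at each time point
upper = np.quantile(D, 0.95, axis=0)   # ... and the upper 5%
print(lower[-1], upper[-1])
```

Plotting the band (`lower`, `upper`) against the data points then gives figures of the kind shown in Figure 7.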

Example 2: Radical Polymerization
Figure 7. Illustration of uncertainty propagation: 90% percentile of the concentration of D for forward simulations with sampled parameters. We can see that the model is capable of explaining the data within its uncertainty.

For the next example, we will address a situation which one often encounters in polymer research. Suppose there is a new monomer M in whose basic kinetics in radical polymerization we are interested. First, one has to decide on the kinetic model, that is, which elementary steps happen during this process. Radical polymerization is a chain process, which means the kinetic chain must be initiated, it must propagate, and it must finally terminate. Let us start with a rather simple model which one can easily pick up from textbooks of polymer science.

The Chemical Reaction Models
The initiator I decomposes to two radicals R with a rate coefficient k d and an efficiency f , which considers that not all radicals R may start a kinetic chain because they are destroyed by some (unknown) side reactions before they can add the first monomer unit.
R starts chains by adding the first monomer M giving a polymerization-active, "living" chain P 1 of length 1. The active chains grow by adding one monomer after the other to chains P s of length s (often named macroradical) until they terminate with each other yielding dead polymer D, the final product which you can isolate and sell. Termination can either be by combination of two active chains P s and P r yielding dead polymer D s+r of length s + r or by disproportionation, when the chain length of the active chains are preserved yielding two dead chains D s and D r . When the initiator radical R starts a growing chain, the group R is incorporated as end group (EG) in the polymer chain, and to account for the concentration of these incorporated groups, a massless reaction product, a counter EG, is introduced as auxiliary quantity.
Reaction scheme for the assumed polymerization mechanism for which rate coefficients will be estimated using "experimental" data generated with the "real" model given in (49).
One has to keep in mind that the lifetime of an active chain is on the order of seconds or less, so R must be continuously generated by decomposition of I to create new active macroradicals, which grow to chain lengths of some hundreds or thousands until they terminate.
It is obvious that the mode of termination has a strong influence on the chain length of the final polymer D, that is, of its molar mass. The same holds for the efficiency f , which determines how many chains will be effectively started, that is, on how many chains the polymerized monomer will be finally distributed. Thus, it is important to know the values of f , k t,c , and k t,d to be able to design a new polymer grade with a desired chain length or molar mass.
We assume that we know the exact value of k d , usually provided by the supplier of the initiator, and the value of the propagation rate coefficient k p from some independent measurement. [46] The task now is to estimate values of the unknown coefficients f , k t,c , and k t,d .
At this stage, one has to perform a couple of experiments and measurements. Here, we replace the experiment by simulation results, but with a somewhat more complex, "real" model including reactions which are responsible for the efficiency f of the initiating radicals R.
Reaction scheme for the "real" polymerization mechanism to generate "experimental" data together with rate coefficients and recipe from Table 1.
Typical initiators are so-called azo-initiators, which upon heating give two small radicals R and one nitrogen molecule N_2, which are captured in a so-called solvent cage, that is, they are surrounded by solvent molecules. The reaction of such small radicals with each other occurs near diffusion control, and so these two R may react with each other to give a non-reactive compound T before they diffuse out of this solvent cage to meet a monomer and start a growing chain. This is called the cage effect. Once they have left the solvent cage, these primary initiator radicals may also react with growing chains to give dead polymer D, a reaction which is called primary radical termination. Again, EG gives the concentration of end groups in polymer chains. Contrary to the assumed model, end groups from the initiator radical R are generated by chain initiation as well as by primary radical termination. These two reactions add an inherent "initiator efficiency" to the reaction system, and depending on the values of k_cage and k_prim, the amount of radicals available for chain initiation will be reduced. Note that this efficiency is not constant throughout the reaction, in contrast to the simple model (50). To generate "experimental" data, we use the "real" kinetic model (51) together with the rate coefficients given in Table 1, left. The kinetic model is translated into a system of ODEs assuming the mass balance for a batch reactor together with the mass action law of reaction rates, constant densities of 1 kg l^{−1} for all compounds, and the initial reactor load and molar masses M of monomer M, initiator I, and solvent S given in Table 1, right. In the following, we denote the complete reactor setup as the recipe.
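To illustrate what "translating the kinetic model into ODEs" means, here is a heavily simplified mass-action sketch of the simple scheme (50) that tracks only the zeroth moment of the living-chain distribution instead of every chain length; all coefficients and the initial load are hypothetical placeholders, not the values of Table 1.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Hypothetical coefficients (1/s and l/(mol s)): decomposition, efficiency,
# chain initiation, propagation, termination. Not the values from Table 1.
k_d, f, k_i, k_p, k_t = 1e-5, 0.7, 1e3, 1e3, 1e7

def rhs(t, y):
    I, R, M, lam0 = y                            # lam0 = sum_s [P_s]
    return [
        -k_d * I,                                # initiator decomposition
        2 * f * k_d * I - k_i * R * M,           # radical production / chain start
        -k_i * R * M - k_p * lam0 * M,           # monomer: initiation + growth
        k_i * R * M - 2 * k_t * lam0**2,         # living chains: start - termination
    ]

y0 = [1e-2, 0.0, 2.0, 0.0]                       # mol/l: [I], [R], [M], lam0
sol = solve_ivp(rhs, (0.0, 3600.0), y0, method="LSODA", rtol=1e-8, atol=1e-12)
print(sol.y[2, -1])   # monomer concentration after one hour
```

A full treatment (as done by Predici) additionally tracks higher moments or the complete chain-length distribution, which is needed for quantities like M_w.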
The data are shown in Figure 8 and the exact values in Table B1 in Appendix B1.

Estimation of Parameters
We will now estimate the parameters f, k t,c , and, later, k t,d , using model (50) based on the data generated by model (51).

Test 1: Fitting f and k t,c with Monomer Concentration [M]
At first, we estimate f and k_t,c only with the data of the monomer concentration [M]. We select as initial values f_0 = 1 and (k_t,c)_0 = 10^6; k_t,d is assumed to be 0, as it is in the real model (note that, since both models differ, this does not have to be a good value in the simple model). Using the Gauss-Newton with essential directions algorithm in Predici, we find as optimal result

f = 0.17, k_t,c = 3.6 × 10^6  (52)

with R_X = 0.130 and a condition number of ≈ 1000. The high condition number should be a worrying sign. It indicates that the parameters might be correlated, so that further parameter combinations with similarly small residuals exist; compare Table B1. Apparently, there exist very different parameters which give very similar residuals. If we now assume the data to be subject to measurement errors, it becomes immediately unclear which of these parameter sets is better suited in our model because, if the true data are slightly different from the observed data, then the parameters in (54) could easily yield a better residual than the ones in (52). It is thus worthwhile to view the PE here from the Bayesian perspective and to approximate the probability distribution. To this end, we assume to have made a measurement error of σ = 0.05 and use the prescaled MALA. Note that here one could also use the grid-based approach, since we only consider two different parameters. However, we also want to showcase the applicability of the more complex MALA approach, which is better suited for high-dimensional parameter spaces. Details on the prescaled MALA in this example are found below.
The resulting distribution is visualized using KDE in Figure 9. As we can see, we cannot reliably estimate the parameters, since the region of parameters with high probability is vast, especially for f. Considering the structure of the chemical reaction, this should not come as a surprise. While a high efficiency f denotes a high fraction of effectively initiating radicals and thus a fast conversion of the monomer, k_t,c is the rate coefficient of termination by combination, during which radicals are destroyed. As such, f and k_t,c work against each other, yielding multiple possible choices for them with similarly small residuals. Together with the assumption of possible measurement errors in the data, this yields a large range of similarly probable parameters.
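The effect that counteracting parameters produce a ridge of nearly equally good fits can be reproduced with a schematic toy model (not the polymerization kinetics), in which the observable depends essentially on a ratio of two parameters:

```python
import numpy as np

# Toy illustration of parameter correlation: if the observable depends (mainly)
# on a combination of two parameters, many (f, k) pairs give nearly the same
# residual, producing a ridge of high posterior probability.
t = np.linspace(0.0, 10.0, 20)
f_true, k_true = 0.7, 2.0

def model(f, k):
    # observable driven by the ratio f / k, as when initiation and termination
    # counteract each other
    return 1 - np.exp(-(f / k) * t)

data = model(f_true, k_true)

def residual(f, k):
    return np.sum((model(f, k) - data) ** 2)

# Along the ridge f / k = const the residual stays (near) zero, off it grows:
on_ridge = residual(2 * f_true, 2 * k_true)
off_ridge = residual(2 * f_true, k_true)
print(on_ridge, off_ridge)
```

Additional data that depend on a different combination of the parameters (here, the inclusion of M_w in Test 2) break the ridge and restore identifiability.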
Please see Appendix B.2 for technical details of the application of the PE methods in this section.
Remark 3. Note that, although f = 0.17 together with k_t,c = 3.6 × 10^6 also gives a small residual, parameters in its vicinity apparently do not, since only few samples were found in this region. This is, in fact, an artifact of the MALA algorithm: proposals which are close to this parameter are still rejected because the residual there is high. As a consequence, the small region around f = 0.17, k_t,c = 3.6 × 10^6 is slightly undervalued by the created sequence of samples.

Test 2: f and k t,c with [M]
and M_w : We include the four data points of the average molar mass M_w into the data and again perform parameter estimation with the essential directions approach in Predici. The result is f = 0.72 and k_t,c = 1.6 × 10^7. Now, with a very small condition number, we are able to determine f and k_t,c. Their values are also in accordance with the result of the Bayesian PE in the previous setting: 0.72 for f and 1.6 × 10^7 for k_t,c were inside the set of probable values. We again perform the Bayesian PE with the prescaled MALA for this setting and find a much smaller range of probable parameters, which also includes the optimal parameters stated above (Figure 10). The optimal parameters are close to the maxima of the determined distribution. Apparently, the inclusion of the average molar mass M_w into the data has a strong effect on the reliability of the parameter estimation: the residual could be made small both with and without M_w, but the parameter region in which this is possible is significantly narrowed down by the inclusion of M_w.
Test 3: f, k_t,c , and k_t,d with [M] and M_w : Apparently, while f stays at nearly the same value, there are multiple possibilities for combinations of k_t,c and k_t,d.
The implication of this needs to be stressed. Although these parameters are minima of the residual function, they need not be predictive: if we use them to simulate other properties of the reaction, or simply simulate forward in time, they might yield significantly different results than other optimal parameters. Hence, one has to ask which prediction should then be expected. Taking the probabilities of the different parameters into account, we will see a range of probable scenarios. This will be illustrated in Test 4.
The Bayesian PE with σ = 0.05 again sheds light on which parameters are how probable through the posterior distribution. As it turns out, we find a broad range of values with high probabilities (Figure 11). This implies that, with k_t,d unknown and the data at hand, we cannot find reliable estimates for the three parameters.
Test 4: f, k_t,c , and k_t,d with [M], M_w , End Groups, and the Molar Mass Distribution: We now include the end groups and the full molar mass distribution into the data in the hope that this yields a more precise estimate of the parameters. With initial values as in Test 3, we get as optimal parameters f = 0.76, k_t,c = 1.66 × 10^7, and k_t,d = 1.09 × 10^5, with R_X = 0.124. These parameters are similar to the ones in Test 3, but they come with a much lower condition number. We see, however, that k_t,d is quite close to its initial value. This should make us suspicious, since it could indicate that there exist many local minima with different values of k_t,d and simply the one closest to the initial value was found.
To reveal whether this is true or whether k_t,d = 1.09 × 10^5 is in fact the unique best choice, we again start the Bayesian PE at these optimal parameters. Compared to Test 3, the distributions of k_t,c and f become much slimmer and resemble normal distributions centered around the optimal parameters determined above (Figure 12). k_t,d, however, is distributed across a vast range of values from 10 to 10^7. This is because there exist multiple different combinations of values for k_t,d, k_t,c, and f which yield a similarly small residual. In other words, the parameters are correlated. This is also expressed by the fact that in the Gauss-Newton scheme, there were two essential directions and not three. The range of combinations, however, is much smaller for the latter two than for k_t,d.

Figure 12. With data about the molar mass distribution and end groups included, the parameters can be inferred much more reliably, since the distributions are much slimmer than in Test 3. k_t,d seems not to impact the outcome of the model much, since many very different values appear probable.
With all data points included, we were able to learn important properties of our model. We could find optimal values for k_t,c and f by minimizing the residual function. Taking into account the possibility of measurement errors that perturb the data, we could quantify how this uncertainty affects the parameter estimation. We could also see that for k_t,d many different values, between 10 and 10^7, are permitted. As the last step, we will now investigate the possible outcomes of the model with other recipes based on their probabilities. In doing so, we also show how to verify whether the simple model is generally a good approximation to the true model, as opposed to being one only for the recipe used in the generation of the data depicted in Figure 8.
We generate data points with the true model (51) with a different recipe, given in Table 2.
Afterward, we perform forward simulations of the simple model (50) with parameters sampled from the 3D probability distribution and visualize the 5% and 95% percentiles in each time step (i.e., in each time step we show the bounds below and above which only 5% of the forward simulations lie). We compare these forward simulations with the data and check whether the data lie inside the range of possible forward simulations. The aim is to find out whether the simple model can accurately approximate the true model for different recipes, taking into account that the parameter estimation is subject to uncertainty. If this is the case, it means that the parameters estimated on the basis of the data in Table B1 are applicable to other recipes, too. We find (Figure 13) that while the evolutions of M_w and the end groups are reasonably well captured, the concentration of the monomer takes a quite different path, indicating that the simple model is not able to predict its concentration well. One could argue that maybe the data measurement error was actually higher than initially assumed, so that we are led to the wrong parameters. In order to take this into account, we set σ = 0.1, again sample from the probability distribution corresponding to this value for σ (the probability distributions are not visualized here but naturally become broader), and generate predictions. As we can see, even assuming a high measurement error, that is, allowing the idea that many different parameters are candidates to be optimal, the simple model is not capable of accurately predicting the evolution of the monomer concentration. This gives us an important message that the classical PE could not have sent: there exists a modeling error in the simple model (50) with regard to the true model (51); that is, there are components of the true, unknown model that are badly or not at all captured by the simple model.
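The percentile bands used above can be sketched as follows. This is a minimal illustration with a hypothetical toy model (an exponential decay whose rate coefficient is sampled from a parameter distribution), not the polymerization model of the article; `percentile_band` and all names are illustrative.

```python
import numpy as np

def percentile_band(trajectories, lower=5.0, upper=95.0):
    """Given an (n_samples, n_timesteps) array of forward simulations,
    return per-time-step bounds such that only 5% of the simulations
    lie below the lower bound and 5% above the upper bound."""
    trajectories = np.asarray(trajectories)
    lo = np.percentile(trajectories, lower, axis=0)
    hi = np.percentile(trajectories, upper, axis=0)
    return lo, hi

# Toy usage: 1000 forward simulations of an exponential decay whose
# rate coefficient k is drawn from a (hypothetical) posterior sample.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 20)
k = rng.normal(1.0, 0.1, size=1000)      # sampled "rate coefficients"
trajs = np.exp(-np.outer(k, t))          # forward simulations
lo, hi = percentile_band(trajs)
inside = (trajs >= lo) & (trajs <= hi)   # ~90% of values inside per step
```

Measured data would then be checked against the band: points far outside `lo`/`hi` indicate a modeling error rather than mere parameter uncertainty.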
Hence, we must modify our simple model if we strive to forecast the monomer concentration. Of course, if we are only interested in M_w and the end groups, then the simple model seems to be sufficient. For these quantities, the Bayesian PE sheds light on where the outcomes of the true model lie for different recipes.
In order to improve the simple model, there does not exist a blueprint in the sense of a clear algorithmic procedure. It requires expertise with regard to the chemical model to find the missing parts of the simple model. This is also the reason why we do not take the discussion of this example further, since we want to showcase specific, generally recommendable steps one should take in order to gain maximal insight into the uncertainty of the estimated parameters and how it translates to forward simulation. For completeness, for this model, it is the step of termination with R with parameter k_prim that should be included in the simple model to give a much better prediction for the monomer.

Conclusion
In this article, we have illustrated, compared, and combined two different approaches to parameter estimation: (1) the classical approach, which focuses on minimizing the residual function that measures the distance between the outcome of a model and the observed data, and (2) the Bayesian approach, which quantifies the uncertainty to which the parameters are subject. Both play an important role in the estimation of parameters in chemical processes such as polymerization. In this article, we argued that only the interplay between both approaches allows reliable parameter estimation in typical real-world scenarios.
In this light, it is vital to understand that the two approaches do not contradict but rather complement each other. Classical PE gives the answer to the more intuitive question: it tells us which optimal parameters bring the model outcome closest to the observed data. Bayesian PE, on the other hand, assumes a probability distribution on the data which translates to a probability distribution of the parameters. It complements classical PE in the sense that it tells us the degree of reliability with which we can assume to have found the optimal parameters. If one finds that the distribution allows a large region of probable parameters, this indicates that the data at hand are either not expressive enough or their measurement error is too high. Additionally, the parameters determined in classical PE can be used to make forecasts for future time points of the model simulation or for entirely different recipes, as done in Test 4 of Example 2. However, one has to take into account that the uncertainty in the parameters naturally translates to these forecasts. On the basis of the probability distribution of the parameters from Bayesian PE, we can estimate the probabilities for the outcomes of these new simulations. Further, this allows us to infer whether the model we use actually offers an accurate description of the real experiment that generated the data.
The uncertainty of the PE arises whenever the data are not perfectly reliable. It must be stressed that this data measurement error must be quantified by the experimentalist first (in the form of a value for σ). Afterward, the data uncertainty translates to parameter uncertainty, which itself translates to a range of possible model outcomes for other recipes or future points in time.
If the data can be measured almost perfectly, the Bayesian PE will give overwhelming probability to the parameters that were determined as optimal in the classical PE, and the probable model outcomes will lie in a very small range around the outcome that corresponds to the optimal parameters. Conversely, if the data are subject to large measurement errors, then the result from classical PE does not deserve much trust, and consequently the parameter distributions from Bayesian PE will be broad.
In order to communicate these characteristics of the approaches as well as their interplay, we explained their motivation, theory, and applicability in detail. We further introduced practical methods for both: the Gauss-Newton method with essential directions for classical PE and the novel sampling technique prescaled MALA for Bayesian PE. Last, we showcased how to apply both approaches and how to interpret the results in two examples of different complexity.
In total, the whole workflow for parameter estimation with uncertainty quantification that we suggest in this article can be summarized as follows:

1. Generate data with an experiment.

2. Define a model, for example, of the form in Equation (1).

3. Estimate optimal parameters, that is, the global minimum of the residual function, defined, for example, as in Equation (6), with Gauss-Newton with essential directions (see Section 2.2), using multiple different initial conditions since many optimization methods are prone to get stuck in local minima.

4. Quantify the data measurement error σ.

5. Define the prior distribution in the Bayesian framework as the data-independent intuition about the parameters, and the likelihood function (see Section 2.3).

6. Sample from the ensuing posterior distribution, for example, using prescaled MALA (see Section 2.3 and Appendix A.2), with initial values chosen as the optimal values from classical PE.

7. Visualize the parameter distribution (for parameters with dimension bigger than 2, the marginal distributions), for example, using KDE (Equation (48)).

8. Use sampled parameters to investigate probable outcomes of the model for future points in time or different recipes.

9. If the experiment (respectively its model formulation) with which the data were generated is unknown and it is unclear whether the used model is a good description, run the real experiment again with a different recipe and compare the range of predicted model outcomes with the newly generated data. If the data lie far outside the range of probable model outcomes, this indicates that the used model is insufficient to describe the real experiment. In this case, problem-specific expert knowledge is required to find an improvement.
All these steps are implemented and were executed in Predici. While in this article we focused on the theoretical foundations of classical and Bayesian PE, in part 2 of this article series, we will apply both approaches to a more comprehensive example and illustrate all the details and specifics one has to take into account to infer the maximal amount of information about parameters, predictions, and model quality from the data.

A.1.1. The Metropolis-Hastings Algorithm

The Metropolis-Hastings (MH) algorithm generates a sequence of parameters θ_1, θ_2, … whose transitions are governed by a transition probability density T(θ_{k+1}, θ_k); that is, if θ_k = θ, then θ_{k+1} is distributed by T(·, θ). It is required that π be the invariant distribution of this sequence. As the name suggests, this means that the density of parameters does not change over time, that is, the probability density for θ_k is identical to the one of θ_{k+1}. Formally, the invariant distribution fulfills

π(θ′) = ∫ T(θ′, θ) π(θ) dθ

The sequence is generated by drawing proposals from a proposal distribution q. The choice for q is almost arbitrary. The only restrictions on q are that it be non-negative and that it holds q(θ′, θ) > 0 ⇔ q(θ, θ′) > 0. This proposal is then accepted with probability

α(θ′, θ) = min{1, [π(θ′) q(θ, θ′)] / [π(θ) q(θ′, θ)]}

If q is symmetric, for example, a Gaussian distribution, the q-terms cancel out and can be omitted. With this, we can make the following observation: let T(θ′, θ) be the density of making a transition from θ to θ′ in the sequence generated by MH. For T it holds

T(θ′, θ) = q(θ′, θ) α(θ′, θ)

since in order to reach θ′ from θ, θ′ has to be drawn as the proposal (density q(·, θ)) and additionally has to be accepted (with probability α(θ′, θ)). This yields

T(θ′, θ) π(θ) = q(θ′, θ) α(θ′, θ) π(θ) = min{π(θ) q(θ′, θ), π(θ′) q(θ, θ′)}

Analogously, we obtain

T(θ, θ′) π(θ′) = min{π(θ′) q(θ, θ′), π(θ) q(θ′, θ)}

which yields

T(θ′, θ) π(θ) = T(θ, θ′) π(θ′)

This is the detailed balance property which ensures

∫ T(θ′, θ) π(θ) dθ = ∫ T(θ, θ′) π(θ′) dθ = π(θ′)

As a direct consequence, MH creates a sequence of states for which π is the invariant distribution. Thus, we can draw states from this sequence whose distribution in fact converges to π.
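A minimal sketch of the MH algorithm described above, using a symmetric Gaussian random-walk proposal so that the q-terms cancel in the acceptance probability. The target (a standard normal) and all names are illustrative assumptions, not the posterior of the article:

```python
import numpy as np

def metropolis_hastings(log_pi, theta0, n_steps, step=0.5, rng=None):
    """Random-walk Metropolis-Hastings: propose theta' ~ N(theta, step^2 Id)
    and accept with probability min(1, pi(theta')/pi(theta))."""
    rng = rng or np.random.default_rng(0)
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    chain = [theta.copy()]
    for _ in range(n_steps):
        prop = theta + step * rng.standard_normal(theta.shape)
        # accept/reject in log space for numerical stability
        if np.log(rng.uniform()) < log_pi(prop) - log_pi(theta):
            theta = prop
        chain.append(theta.copy())
    return np.array(chain)

# Toy target: pi proportional to exp(-theta^2 / 2), i.e., a standard normal
chain = metropolis_hastings(lambda th: -0.5 * float(th @ th), [0.0], 20000)
burned = chain[2000:]   # discard burn-in, as in the article's tests
```

The burn-in discard mirrors the procedure in Appendix B, where the first portion of the sequence is dropped to reduce dependence on the initial values.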

A.1.2. The Langevin Sampler
We can address the Metropolis-adjusted Langevin algorithm from a different perspective. For a stochastic differential equation (SDE) of the form

dθ_t = −grad V(θ_t) dt + √2 dB_t

where B_t is the standard Brownian motion, it holds that the invariant distribution of realizations of this SDE has a density proportional to exp(−V). Since the solution of an SDE is not a discrete sequence of states but rather a time-continuous function in time, invariant distribution here means that the density of θ_t having a certain value is independent of t. Over time, the density of states converges to this invariant distribution. As a consequence, by setting V = −log(π), we find that the SDE has invariant distribution

exp(log(π)) = π (A11)

We should therefore be interested in creating realizations of the SDE

dθ_t = grad log π(θ_t) dt + √2 dB_t

In Section 2.3, we introduced π as equal to the function (1/Z) exp(−S_X). Thus, the SDE becomes

dθ_t = −grad S_X(θ_t) dt + √2 dB_t

A common method to create realizations of SDEs is the Euler-Maruyama method.[47] It approximates the SDE by taking discretized steps of the form

θ_{k+1} = θ_k − Δt grad S_X(θ_k) + √(2Δt) r_k

where r_k is a normally distributed random variable with mean 0 and variance 1. The normalization constant 1/Z drops out, since only the gradient of log π enters the scheme.
While in the original SDE, after some time, the θ_t should be distributed by exp(−S_X), this property might be violated for the states of this sequence due to approximation errors induced by the step size Δt. By choosing a small step size, one can ensure convergence of the distribution of the θ_k to a distribution that is closer to exp(−S_X).
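The Euler-Maruyama discretization of this Langevin SDE can be sketched as follows. The quadratic S_X used here is a toy stand-in (giving a standard normal target), not the residual functional of the article:

```python
import numpy as np

def ula(grad_S, theta0, n_steps, dt=0.05, rng=None):
    """Euler-Maruyama discretization of
    d theta_t = -grad S_X(theta_t) dt + sqrt(2) dB_t,
    i.e., the Unadjusted Langevin Algorithm (ULA)."""
    rng = rng or np.random.default_rng(1)
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    chain = np.empty((n_steps + 1,) + theta.shape)
    chain[0] = theta
    for k in range(n_steps):
        noise = rng.standard_normal(theta.shape)     # r_k ~ N(0, 1)
        theta = theta - dt * grad_S(theta) + np.sqrt(2.0 * dt) * noise
        chain[k + 1] = theta
    return chain

# Toy target: S_X(theta) = theta^2 / 2, so exp(-S_X) is a standard normal
chain = ula(lambda th: th, [3.0], 20000)
samples = chain[1000:]   # discard early states still biased by theta0
```

Note that the samples follow the target only up to a discretization bias of order Δt, which is exactly the error the Metropolis step of MALA removes.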
Until here, we have not used the acceptance probability α. In fact, creating a sequence of states from the approximated SDE is referred to as the Unadjusted Langevin Algorithm (ULA). In order to achieve that the detailed balance property is fulfilled and that the sequence thus has the desired distribution as its invariant distribution, in MALA one extends ULA by the Metropolis step of only accepting a candidate θ̃_{k+1} with probability

α(θ̃_{k+1}, θ_k) = min{1, [exp(−S_X(θ̃_{k+1})) q(θ_k, θ̃_{k+1})] / [exp(−S_X(θ_k)) q(θ̃_{k+1}, θ_k)]}

where

q(θ′, θ) ∝ exp(−‖θ′ − θ + Δt grad S_X(θ)‖² / (4Δt))

The proposal distribution q has this form because in order to draw θ̃_{k+1}, we move to θ_k − Δt grad S_X(θ_k) and then add a number that is normally distributed with mean 0 and variance 2Δt. Thus, the density of θ̃_{k+1} is a normal distribution around θ_k − Δt grad S_X(θ_k) with variance 2Δt, which is precisely reflected in q.
In summary, MALA combines two different algorithms that have the same goal: the Metropolis-Hastings algorithm and the approximation of a solution of an appropriate SDE. In both algorithms, a sequence is generated whose states are distributed by a chosen distribution. Technically, the use of MH alone would suffice for that. In MH, however, a bad choice of q can force the candidates to mainly come from low-probability regions of π, so that few candidates are accepted and it takes long for the states of the sequence to be distributed by π. Through the use of the gradient step in MALA, candidates are usually proposed in regions of interest, yielding faster convergence.
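Putting the two pieces together, the Metropolis correction on top of the Langevin proposal can be sketched as below; again, the quadratic S_X is a toy target and all names are illustrative:

```python
import numpy as np

def mala(S, grad_S, theta0, n_steps, dt=0.05, rng=None):
    """Metropolis-adjusted Langevin: ULA proposal plus an accept/reject
    step so that exp(-S_X) is exactly the invariant distribution."""
    rng = rng or np.random.default_rng(2)

    def log_q(to, frm):
        # log density (up to a constant) of proposing `to` from `frm`:
        # normal around frm - dt * grad_S(frm) with variance 2*dt
        diff = to - (frm - dt * grad_S(frm))
        return -float(diff @ diff) / (4.0 * dt)

    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    chain, accepted = [theta.copy()], 0
    for _ in range(n_steps):
        prop = (theta - dt * grad_S(theta)
                + np.sqrt(2.0 * dt) * rng.standard_normal(theta.shape))
        log_alpha = (S(theta) - S(prop)
                     + log_q(theta, prop) - log_q(prop, theta))
        if np.log(rng.uniform()) < log_alpha:
            theta, accepted = prop, accepted + 1
        chain.append(theta.copy())
    return np.array(chain), accepted / n_steps

# Toy target: S_X(theta) = theta^2 / 2 (standard normal posterior)
chain, acc = mala(lambda th: 0.5 * float(th @ th), lambda th: th,
                  [3.0], 20000)
samples = chain[1000:]
```

The returned acceptance ratio plays the same diagnostic role as the acceptance ratios reported for the tests in Appendix B.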

A.2. Prescaled MALA
We present here the exact derivation of the step sizes in the prescaled MALA. In the algorithm, a step has the form

θ_{k+1} = θ_k − P grad S_X(θ_k) + √2 P^{1/2} r_k, r_k ∼ N(0, Id) (A17)

Id is the d × d identity matrix and P a diagonal matrix with positive entries P_ii. The proposal distribution q is then given by

q(θ′, θ) ∝ exp(−‖P^{−1/2} (θ′ − θ + P grad S_X(θ))‖² / 4)

The choice of P is done in the following way:

1. Specify the desired average length of a step in each of the d parameter directions, denoted by ε_1, …, ε_d. The term "average" is used with respect to the posterior distribution from which the algorithm samples: over all samples of the resulting time series, the step taken toward the proposal should have average length ε_i in direction i. A step consists of a deterministic part −P grad S_X(θ_k) and a stochastic part √2 P^{1/2} r_k, which both contribute to the length of the step. The average, or expected, step length in direction i, dependent on P_ii, is given by

E(|−P_ii ∂_i S_X(θ) + √2 P_ii^{1/2} r|)

It is difficult to find an analytical expression to make this quantity equal to a chosen ε_i. Still, we can find an upper bound by

E(|−P_ii ∂_i S_X(θ) + √2 P_ii^{1/2} r|) ≤ P_ii E(|∂_i S_X(θ)|) + (2/√π) P_ii^{1/2}

because of the triangle inequality and the fact that E(|r|) = √(2/π). This holds because for a normally distributed random variable r ∼ N(0, 1) and c > 0, it holds that E(|cr|) = c √(2/π).

2. Draw n points θ_1, …, θ_n randomly from the parameter space and compute the gradients grad S_X(θ_1), …, grad S_X(θ_n). In this article, we always used n = 100.

3. In each direction i = 1, …, d, compute the average length of the gradient

Ḡ_i = (1/n) Σ_{j=1}^{n} |∂_i S_X(θ_j)|

4. We can then solve

P_ii Ḡ_i + (2/√π) P_ii^{1/2} = ε_i

which can be transformed to

Ḡ_i Q_i² + (2/√π) Q_i − ε_i = 0, Q_i := P_ii^{1/2}

Solving the quadratic equation for Q_i = √(P_ii), we get

Q_i = (−2/√π + √(4/π + 4 Ḡ_i ε_i)) / (2 Ḡ_i)

If desired, one can easily replace the average gradient length by a maximal or minimal gradient length in step 3. Note that the step size is not adaptively chosen but still determined in advance. This means that the property of the algorithm that we exploit, namely that it produces the desired distribution of samples, is not damaged.
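The four steps above can be sketched as follows; `grad_S`, the parameter bounds, and the target step lengths are hypothetical stand-ins for illustration:

```python
import numpy as np

def prescale(grad_S, bounds, eps, n=100, rng=None):
    """Compute the diagonal preconditioner P for prescaled MALA:
    each P_ii solves  Gbar_i * P_ii + (2/sqrt(pi)) * sqrt(P_ii) = eps_i,
    where Gbar_i is the average |partial_i S_X| over n random points.
    Assumes Gbar_i > 0 in every direction."""
    rng = rng or np.random.default_rng(3)
    lo, hi = np.asarray(bounds, dtype=float).T
    # step 2: draw n random points from the parameter domain
    pts = rng.uniform(lo, hi, size=(n, len(lo)))
    # step 3: average gradient length per direction
    gbar = np.abs([grad_S(p) for p in pts]).mean(axis=0)
    b = 2.0 / np.sqrt(np.pi)
    # step 4: positive root of the quadratic in Q_i = sqrt(P_ii)
    q = (-b + np.sqrt(b * b + 4.0 * gbar * np.asarray(eps))) / (2.0 * gbar)
    return q ** 2

# Toy target S_X(theta) = theta_1^2/2 + 10*theta_2^2: the second
# direction is much steeper, so it should receive a smaller P_ii.
P = prescale(lambda th: np.array([th[0], 20.0 * th[1]]),
             bounds=[(-1.0, 1.0), (-1.0, 1.0)], eps=[0.1, 0.1])
```

Since the root formula uses the positive branch, `P` is positive in every direction and shrinks where the average gradient is large, matching the intent of the derivation.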

A.3. Details on Kernel Density Estimation
The kernel function K should be nonnegative, monotonically increasing as ‖θ‖ approaches 0, and its integral over ℝ^d should be equal to 1. A typical choice is the Gaussian kernel

K(θ) = (2π)^{−d/2} exp(−‖θ‖²/2)

With the bandwidths h_i, one can regulate the influence of the samples on p̂_KDE(θ_k) depending on the distance between θ_k and the samples. In case that p is a Gaussian function, a recommended choice for h_i is given by[48]

h_i = v_i (4 / ((d + 2) n))^{1/(d+4)} (A28)

where v_i is the standard deviation of the ith entries of the samples. It indicates that the more samples one has at hand, the smaller the bandwidths should be chosen. With KDE, we can then evaluate p̂_KDE on a fine grid through Equation (48) and obtain a fine visualization of an approximation of p. Note that in general p̂_KDE(θ_k) is unequal to p(θ_k) but can be brought closer by decreasing the h_i. This in turn is already advised by Equation (A28) but requires a sufficient number of samples.
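The Gaussian-kernel estimate with per-dimension bandwidths can be sketched as below; this is a minimal implementation under the stated rule-of-thumb bandwidth, with illustrative names, not the article's Predici implementation:

```python
import numpy as np

def kde(samples, grid, h=None):
    """Gaussian-kernel density estimate of d-dimensional samples,
    evaluated at each grid point, with per-dimension bandwidths h_i."""
    samples = np.atleast_2d(samples)        # shape (n, d)
    n, d = samples.shape
    if h is None:
        # rule-of-thumb bandwidth: shrinks as the sample count n grows
        h = samples.std(axis=0) * (4.0 / ((d + 2) * n)) ** (1.0 / (d + 4))
    h = np.asarray(h, dtype=float)
    # scaled differences between every grid point and every sample
    diffs = (np.atleast_2d(grid)[:, None, :] - samples[None, :, :]) / h
    kernels = np.exp(-0.5 * (diffs ** 2).sum(axis=2))
    norm = n * np.prod(h) * (2.0 * np.pi) ** (d / 2)
    return kernels.sum(axis=1) / norm

# 1D usage: estimate a standard normal density from 2000 samples
rng = np.random.default_rng(4)
xs = rng.normal(0.0, 1.0, size=(2000, 1))
grid = np.linspace(-3.0, 3.0, 61)[:, None]
p_hat = kde(xs, grid)
```

Evaluating on a fine grid and plotting `p_hat` gives exactly the kind of marginal-density visualization used for the tests in the article.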

B.2.1. Test 1: Fitting f and k_t,c with the Monomer Concentration [M]
We used the prescaled MALA with 12 000 steps, out of which we discarded the first 2000 to give the sequence of points time to lose its dependence on the initial values. This number of steps is high for only two parameters compared to Test 2. The explanation is natural: such a high number becomes necessary if the region of parameters with high probabilities is large, because it takes more time to comb this entire region compared to a small region. The initial values were set to the optimal parameters determined above. We assumed a measurement error of σ = 0.05 and a uniform prior with bounds given by [0.1, 1] for f and [10^4, 10^9] for k_t,c. In the prescaled MALA, we set ε_1, ε_2 so that in each direction a step has the average length of one 200th of the length of its parameter domain. The acceptance ratio was 72%. In the Gauss-Newton scheme with essential directions, the number of essential directions was 1 for both sets of initial values.

B.2.2. Test 2

In the Gauss-Newton with essential directions algorithm, the number of essential directions was 2. We used prescaled MALA with 6000 steps, out of which we discarded the first 1000, assuming a measurement uncertainty of σ = 0.05. The initial values were the same as in Test 1. The acceptance ratio was 57%.

B.2.3. Test 3: f, k_t,c, and k_t,d with [M] and M_w
In the Gauss-Newton algorithm, the number of essential directions was 2. As prior distribution, we again used a uniform distribution over the intervals used so far for f and k_t,c, while for k_t,d we used the interval [1, 10^8]. As initial values, we chose f_0 = 0.5, (k_t,c)_0 = 10^7, (k_t,d)_0 = 1. We performed 6000 steps, of which we discarded the first 1000 to reduce dependence on the initial values. The acceptance ratio was 72%. Since the range for k_t,d was unclear a priori, we replaced it by its logarithm to the base 10 and transformed it back for the visualization.

Figure B1. Molar mass distribution (MMD) generated with the "real" model in Equation (51) and the rate coefficients and recipe of Table . The MMD is given as the outcome of gel permeation chromatography (GPC), the state-of-the-art method to determine the MMD. For various representations of the MMD, see for example ref. [49].

B.2.4. Test 4: f, k_t,c, and k_t,d with [M], M_w, End Groups, and the Molar Mass Distribution
In the Gauss-Newton algorithm, the number of essential directions was 2. As prior distribution, we again used a uniform distribution over the intervals used so far for f and k_t,c, while for k_t,d we used the interval [10, 10^8]. As initial values, we chose f_0 = 0.5, (k_t,c)_0 = 10^7, (k_t,d)_0 = 1. Again, we performed 6000 steps, of which we discarded the first 1000. The acceptance ratio was 71%.