Model-based multi-parameter mapping

Quantitative MR imaging is increasingly favoured for its richer information content and standardised measures. However, computing quantitative parameter maps, such as those encoding longitudinal relaxation rate (R1), apparent transverse relaxation rate (R2*) or magnetisation-transfer saturation (MTsat), involves inverting a highly non-linear function. Many methods for deriving parameter maps assume perfect measurements and do not consider how noise is propagated through the estimation procedure, resulting in needlessly noisy maps. Instead, we propose a probabilistic generative (forward) model of the entire dataset, which is formulated and inverted to jointly recover (log) parameter maps with a well-defined probabilistic interpretation (e.g., maximum likelihood or maximum a posteriori). The second order optimisation we propose for model fitting achieves rapid and stable convergence thanks to a novel approximate Hessian. We demonstrate the utility of our flexible framework in the context of recovering more accurate maps from data acquired using the popular multi-parameter mapping protocol. We also show how to incorporate a joint total variation prior to further decrease the noise in the maps, noting that the probabilistic formulation allows the uncertainty on the recovered parameter maps to be estimated. Our implementation uses a PyTorch backend and benefits from GPU acceleration. It is available at https://github.com/balbasty/nitorch.


Introduction
The magnetic resonance imaging (MRI) signal is governed by a number of tissue-specific parameters. While many common MR sequences only aim to maximise the contrast between tissues of interest, the field of quantitative MRI (qMRI) is concerned with the extraction of the underlying parameters (Tofts, 2003). This interest stems from the fundamental relationship between the magnetic parameters and the tissue microstructure: the longitudinal relaxation rate R1 = 1/T1 is sensitive to myelin content (Sigalovsky et al., 2006; Dick et al., 2012; Sereno et al., 2013; Lutti et al., 2014), while the apparent transverse relaxation rate R2* = 1/T2* is sensitive to iron content. To compute quantitative maps, the way in which the sequence parameters interact with the quantitative parameters must be known, so that the function that governs the weighted signal intensity can be inverted. This function emerges from the integration of ordinary differential equations (Bloch, 1946); it is therefore non-linear and can be difficult to invert in closed form. Consequently, early attempts at quantitative MRI targeted a single parameter at a time and used sequences with parameters carefully chosen so that other terms would vanish after a few algebraic manipulations (Gupta, 1977). However, as quantitative mapping moved from pure MR research to clinical applications, it became important to design sequences that allow a maximum number of parameters to be estimated from a minimum number of acquisitions.
The multi-parameter mapping (MPM) protocol stems from this need: it uses multi-echo spoiled gradient echo (SPGR) acquisitions with variable flip angles and allows the quantification of R1, R2*, MTsat and proton density (PD) at submillimetric resolution and in a clinically acceptable scan time. In this context, a wider range of sequence parameters can be used, at the price of rational approximations of the signal (Helms et al., 2008b). Images acquired with an SPGR sequence are parameterised by three parameters (PD, R1, R2*), plus a fourth one if MT weighting is used. The ESTATICS forward model reparameterises the SPGR signal equation using one intercept per contrast (i.e., weighted volumes extrapolated to the theoretical zero echo time) and a shared R2* decay (Mohammadi et al., 2017). While ESTATICS provides a model-based estimation of R2*, the intercepts remain weighted images and another model must be inverted to recover the quantitative parameter maps R1, PD and MTsat. While this inversion has an analytical solution when all contrasts have the same repetition time (Wang et al., 1987; Liberman et al., 2014; Mohammadi et al., 2017), rational approximations must be used as soon as different repetition times are involved (Helms et al., 2008a).

Non-linear fitting in MR relaxometry
Many quantitative mapping methods discard the probabilistic nature of the acquired signal, which randomly deviates from its theoretical value due to thermal and physiological noise. This noise propagates through every computational step and ends up in the maps with a non-trivial distribution (Polders et al., 2012; Cercignani and Alexander, 2006) and can even bias the estimators (Sijbers et al., 1999; Chang et al., 2008). An alternative is to estimate well-defined probabilistic quantities conditioned on the observed data, such as posterior distributions or maximum-likelihood estimators (Hurley et al., 2012; Tisdall and van der Kouwe, 2016), although these can still be biased (Tisdall, 2017). The use of qMRI estimators that directly optimise a probability distribution has remained elusive, in part due to the non-linear and non-convex nature of the corresponding log-likelihood. Still, several groups have proposed to fit quantitative maps using iterative methods, either from magnitude images (Hurley et al., 2012; Tisdall, 2017) or directly from k-space measurements (Chen et al., 1998; Sumpf et al., 2011, 2014; Zhao et al., 2014). In order to solve the non-linear system, Tisdall (2017) used Brent's method, which does not require derivatives; Chen et al. (1998) used a Levenberg-Marquardt algorithm; Sumpf et al. (2011, 2014) used a conjugate gradient algorithm; while Zhao et al. (2014) used a quasi-Newton Broyden-Fletcher-Goldfarb-Shanno (BFGS) optimiser, which is not ensured to converge in general (Mascarenhas, 2014) and is highly sensitive to initialisation. Scholand et al. (2020) pushed this idea further by modelling the entire Bloch equations and spatial encoding. In their work, the problem is solved using Gauss-Newton optimisation (which corresponds to Fisher's scoring of the Hessian).
However, for this much more complex inverse problem to be well-posed, spiral trajectories are needed so that each shot acquires as many spatial frequencies as possible. In multi-exponential T2 mapping, the most common procedure involves non-negative least-squares fits (Whittall and MacKay, 1989), which can be very sensitive to noise and to optimisation parameters (Wiggermann et al., 2020). The use of non-linear least-squares algorithms is also common and revolves around (constrained) Gauss-Newton or Levenberg-Marquardt methods (Milford et al., 2015; Gong et al., 2020).

Denoising
The push to higher resolutions and the use of parallel imaging reduce the signal-to-noise ratio (SNR) of the acquired images and, consequently, the precision of the computed parameter maps. Papp et al. (2016) found a scan-rescan root mean squared error of about 7.5% for R1 at 1 mm, in the absence of inter-scan movement. Smoothing can be used to improve SNR, but at the cost of lower spatial specificity. In this context, providing mapping methods that are robust to high levels of noise, both in terms of bias and variance, is of major importance.
Denoising methods aim to separate signal from noise. They take advantage of the fact that signal and noise have intrinsically different spatial profiles: thermal noise is spatially independent and often has a characteristic distribution, while the signal is highly structured. Denoising methods originate from partial differential equations, adaptive filtering, variational optimisation or Markov random fields, and many connections exist between them. Two main families emerge:

1. Inversion of a generative model:

Y_hat = argmax_Y E_l(X | F(Y)) + E_p(GY),

where X is the observed data, Y is the unknown noise-free data, F is an arbitrary forward transformation (e.g., spatial transformation, downsampling, smoothing) mapping from the reconstructed to the observed data and G is a linear transformation (e.g., spatial gradients, Fourier transform, wavelet transform) that extracts features of interest from the reconstruction. E_l and E_p are two log-probability distributions that correspond to the likelihood of the parameters and their prior.

2. Application of an adaptive nonlocal filter:

Y_hat_i = ( sum_{j in N_i} w(P_i, P_j) X_j ) / ( sum_{j in N_i} w(P_i, P_j) ),

where the reconstruction of a given voxel i is a weighted average over all observed voxels j in a given (possibly infinite) neighbourhood N_i, with weights w reflecting the similarity between patches P centred about these voxels. Convolutional neural networks naturally extend this family by stacking many such filters and making their weights learnable rather than fixed.
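As a minimal illustration of the second family, the sketch below applies a non-local means style filter to a 1D signal. The function name, patch radius, search window and Gaussian patch weighting are illustrative choices, not the exact filters of the cited works:

```python
import math

def nlm_1d(x, patch=1, search=5, h=0.5):
    """Adaptive non-local filter (sketch): each voxel i becomes a weighted
    average of voxels j in a search neighbourhood, with weights driven by
    the similarity of the patches centred at i and j."""
    n = len(x)
    out = []
    for i in range(n):
        num = den = 0.0
        for j in range(max(0, i - search), min(n, i + search + 1)):
            # squared distance between the patches centred at i and j
            # (edges handled by clamping indices)
            d2 = sum((x[min(max(i + k, 0), n - 1)]
                      - x[min(max(j + k, 0), n - 1)]) ** 2
                     for k in range(-patch, patch + 1))
            w = math.exp(-d2 / h ** 2)   # patch-similarity weight
            num += w * x[j]
            den += w
        out.append(num / den)
    return out
```

Because each output voxel is a convex combination of input voxels, a constant image is left untouched, while spatially independent noise is averaged out wherever patches are similar.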
For the first family of methods, it was found that the denoising effect is stronger when E_p is an absolute norm (or a sum of absolute norms), rather than a squared norm, because the solution is then implicitly sparse in the feature domain (Bach, 2011). This family of methods includes total variation (TV) regularisation (Rudin et al., 1992) and wavelet soft-thresholding (Donoho, 1995). The second family also leverages sparsity, in the form of redundancy in the spatial domain; that is, the dictionary of patches necessary to reconstruct the noise-free images is smaller than the actual number of patches in the image. Several such methods have been developed specifically for MRI, with the aim of finding an optimal, voxel-wise weighting based on the noise distribution (Coupe et al., 2008; Manjón et al., 2010; Coupé et al., 2012; Manjón et al., 2012).
Optimisation methods can naturally be interpreted as a maximum a posteriori (MAP) solution in a generative model, which eases their interpretation and extension. This feature is especially important in qMRI, where we possess a well-defined (non-linear) forward function and log-likelihood, and wish to regularise a small number of maps. Joint total variation (JTV) (Sapiro and Ringach, 1996) is an extension of TV to multi-contrast images, where the absolute norm is defined across contrasts, introducing an implicit correlation between them. TV and JTV have been used before in MR reconstruction, e.g., in compressed sensing (Huang et al., 2012), quantitative susceptibility mapping (Liu et al., 2011) and super-resolution. Recently, Varadarajan et al. (2020) applied JTV to multi-echo SPGR data for B0 distortion correction, with the joint prior defined across echoes. However, these forward models are linear, making the corresponding problems convex, while the MR signal equations are not. This non-convexity has limited the use of iterative methods in this context, as algorithms either have a slow convergence, in the case of first-order methods, or are not ensured to converge, in the case of second-order methods.

Non-linear MPM fitting
In Balbastre et al. (2020), we introduced a non-linear, spatially regularised version of ESTATICS that was fit using Gauss-Newton optimisation. However, since regularisation did not apply directly to the parameter maps but to the weighted intercepts, the denoising of the quantitative parameters was merely a by-product of the denoising of the intercepts. In this work, we instead propose a generic framework that directly finds MAP quantitative (log) parameters from all SPGR data at hand. To this end, we extend our optimisation framework to work directly with the SPGR signal equation, and jointly regularise the log-parameter maps. Working with log-parameters (log PD, log R1, log R2*, logit MTsat) is advantageous as their spatial gradients are independent of the choice of unit and representation (time, in s or ms, or rate, in s-1 or ms-1), although it introduces an additional degree of non-linearity. Another advantage is that posterior distributions are often better approximated by Laplace's method when a log basis is used (MacKay, 1998). We name this extended model SPGR+JTV when it uses regularisation and SPGR-ML when it does not. In Gauss-Newton, positive-definiteness of the Hessian is enforced by the use of Fisher's scoring, which discards terms that involve second derivatives of the SPGR signal function. In this new work, we found that a more stable algorithm can be derived by keeping the (absolute) diagonal of this second-order term. We illustrate the improved convergence on a toy problem, and experimentally show that it generalises to least-squares ESTATICS and SPGR fits. Additionally, our implementation uses a quadratic upper bound of the JTV functional, allowing the surrogate problem to be efficiently solved using second-order optimisation.
We validated our method on a unique dataset -- five repeats of the MPM protocol acquired, within a single session, on a healthy subject -- by leave-one-out cross-validation: maps were estimated on one repeat and used to predict the weighted images from the other repeats. Our methods (SPGR+JTV and SPGR-ML) were compared to algorithms that rely on different flavours of the ESTATICS model: LOGLIN uses a log-linear fit to recover the intercepts and decay, whereas NONLIN and NONLIN+JTV use a (regularised) non-linear fit. We also applied ESTATICS to echoes that had been previously denoised using the adaptive optimized nonlocal means (AONLM) method (Manjón et al., 2010), as a baseline for the denoising component of our method. Quantitative maps were then computed analytically from the intercepts obtained from all ESTATICS methods. SPGR+JTV consistently outperformed all baselines. Besides, we probed the effect of regularisation by reconstructing maps with a varying number of echoes, and explored the use of our generative model to estimate uncertainty about the inferred quantitative values.

Theory
Forward model: spoiled gradient echo signal equation

The steady-state signal intensity of a spoiled gradient echo acquired with flip angle α, repetition time TR and echo time TE is 3

S = A sin α (1 - e^{-TR R1}) / (1 - cos α e^{-TR R1}) e^{-TE R2*},   (1)

where A is proportional to the proton density (with a multiplicative factor related to the sensitivity of the receiver coil).
Multi-echo techniques acquire multiple echoes, allowing the R * 2 decay rate to be mapped, while variable flipangle (VFA) techniques acquire images with multiple flipangles to map the R 1 relaxation rate. Multi-parameter mapping protocols combine these two techniques in a single imaging session.
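The signal model above can be written as a short function. This is a sketch with assumed conventions (flip angle in radians, times in seconds, rates in s-1), not the NiTorch implementation:

```python
import math

def spgr(A, R1, R2s, alpha, TR, TE):
    """Steady-state spoiled gradient echo signal: the Ernst equation for
    the longitudinal steady state, multiplied by R2* decay at echo time."""
    e1 = math.exp(-TR * R1)
    return (A * math.sin(alpha) * (1 - e1) / (1 - math.cos(alpha) * e1)
            * math.exp(-TE * R2s))
```

For a given TR and R1, this signal is maximised at the Ernst angle, alpha_E = arccos(exp(-TR * R1)), and later echoes are attenuated by the R2* term.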
In addition, a similar sequence can be used to measure the magnetisation transfer saturation: in a two-pool model, protons are separated between the free pool (protons in free water that diffuse freely, which cause the MR signal) and the bound pool (non-aqueous protons, e.g., in proteins and lipids of the myelin sheath encasing axons). Following selective saturation of the bound pool via an off-resonance pulse, the process of magnetisation transfer leads to partial saturation of the observable free pool. By acquiring images with and without this off-resonance pulse, the proportion of saturated protons can be measured, giving a surrogate measure of the bound pool. The resulting signal (which we write at TE = 0 for brevity) can be modelled by adding an extra excitation pulse, with repetition time TR2 and flip angle α2 (Helms et al., 2008b).

3. A spoiled gradient echo acquisition is obtained by playing an excitation pulse that causes the bulk magnetisation to rotate by an angle α (the flip angle), followed by a series of dephasing and rephasing gradients that create an echo train.
Here, α2 is unknown (it depends on the degree of magnetisation transfer saturation) and, following Helms et al. (2008b), we name δ = 1 - cos α2 the MT saturation. Furthermore, we assume that TR2 = 0, so that the signal equation simplifies to:

S = A sin α (1 - δ)(1 - e^{-TR R1}) / (1 - (1 - δ) cos α e^{-TR R1}).   (3)

In practice, the MR environment is not perfect and multiple phenomena cause deviations from this theoretical model and introduce artefacts:

• The excitation field, B1+, is not perfectly homogeneous, causing the effective flip angle to be location-dependent and to deviate from its nominal value. Multiple techniques allow this field to be measured (Jiru and Klose, 2006; Yarnykh, 2007; Sacolick et al., 2010).
• The receive field, B1-, suffers from the same issue, especially when high-density phased-array coils are used. This is particularly problematic when between-scan motion occurs, as the receive field modulation then differs between contrasts (Papp et al., 2016).
• The main magnetic field, B0, is not perfectly homogeneous either, due to changes in magnetic susceptibility at air-tissue interfaces and to imperfect shimming. As odd and even echoes are acquired with readouts of opposite polarity, between-echo distortion is possible in regions where field inhomogeneity is high (Varadarajan et al., 2020).
• Even with the use of gradient and RF spoiling, the assumption that no transverse magnetisation persists across TRs is rarely met and can lead to additional dependencies, on both tissue properties and hardware, in the R 1 estimates (Preibisch and Deichmann, 2009;Corbin and Callaghan, 2021).

Noise model: Gaussian
The measured signal is corrupted by thermal noise, which is uniform and Gaussian in each complex coil image and becomes Rician or non-central Chi distributed in the sum-of-squares magnitude image. In the high-SNR regime, this noise is approximately Gaussian, and we make that approximation in this work. This means that the fitting procedure (which consists of inferring the unknown parameters from noisy measurements) is effectively a non-linear least squares problem.

Optimisation: non-linear least squares
A non-linear least squares problem is one of the form:

ℓ(y) = (1/2) Σ_i f_i(y)²,   (4)

where all f_i are non-linear functions. In general, there is no closed-form solution (when there is a solution at all), and the problem can be non-convex. Therefore, carefully crafted iterative methods must be used. A method of choice is Gauss-Newton optimisation. Gauss-Newton emerges from Newton's method, that is, from a second-order Taylor expansion of the objective function. Let us write the gradient of a function f at y as g_f(y) and its Hessian as H_f(y). The Taylor expansion of equation (4) about y0 gives:

ℓ(y) ≈ ℓ(y0) + g(y0)ᵀ(y - y0) + (1/2)(y - y0)ᵀ H(y0) (y - y0),   (5)

with gradient and Hessian

g(y0) = Σ_i f_i(y0) g_{f_i}(y0),   H(y0) = Σ_i g_{f_i}(y0) g_{f_i}(y0)ᵀ + f_i(y0) H_{f_i}(y0).   (6)

Under the assumption that H(y0) is positive-definite, Newton's method solves the surrogate quadratic problem by finding its stationary point:

y1 = y0 - H(y0)⁻¹ g(y0).   (7)

However, as previously stated, the problem is not necessarily convex and therefore the Hessian is not necessarily positive-definite. Gauss-Newton circumvents this issue using Fisher's scoring, i.e., by assuming (in the Hessian) that all residuals are zero:

P(y0) = Σ_i g_{f_i}(y0) g_{f_i}(y0)ᵀ.   (8)

We denote this alternative Hessian P, for preconditioner. The resulting quantity is (under mild conditions) positive-definite, and the surrogate quadratic problem now has a minimum. When it converges, the Gauss-Newton iteration converges to a stationary point of the objective function. However, convergence is not guaranteed (Mascarenhas, 2014). In the following subsections, we introduce a sufficient condition for monotonic convergence in Newton-like methods and show simple examples related to MR relaxometry where Gauss-Newton does not converge monotonically, or even diverges. We then introduce a new preconditioner that is shown (experimentally) to enforce the condition for monotonic convergence. This preconditioner will be used to optimise the non-linear SPGR least squares problem.
A sufficient condition for monotonic convergence

Let us focus on the simpler 1D case. Let ℓ : R → R be a twice-differentiable non-linear function over the real line with a single local (and therefore global) minimum y* and no local maximum. This function does not need to be convex but, by construction, its first derivative g : R → R is negative everywhere on the left of the optimum and positive everywhere on its right. Let h : R → R+* be a strictly positive function that we will use as a preconditioner (e.g., the absolute value of the curvature). Applying a Newton-like step at point y_n gives an update of the form y_{n+1} = y_n - g(y_n)/h(y_n). Since h is positive, the step is ensured to be in the direction of the optimum; a sufficient condition for this step to improve the objective function (ℓ(y_{n+1}) < ℓ(y_n)) is for the updated point to fall between y_n and the optimum y*. This condition can be formalised as:

|g(y_n)| / h(y_n) ≤ |y_n - y*|.   (10)
While this condition seems difficult to enforce or check, note that it is always true at the optimum, where g(y*) = 0. Let us write r(y) = g(y)/h(y) and add the condition that h is differentiable; then condition (10) is enforced if the absolute value of the derivative of r is everywhere at most 1. On the right (positive) side, r'(y) ≤ 1 on [y*, y_n] implies

|r(y_n)| = r(y_n) = ∫_{y*}^{y_n} r'(y) dy ≤ ∫_{y*}^{y_n} 1 dy = y_n - y* = |y_n - y*|.
The same thing can be shown on the left (negative) side.
In the multivariate case, we want the update to stay in the same "quadrant" (that is, all coordinates fall between their original value and the optimum), which we formalise as:

|P(y_n)⁻¹ g(y_n)| ≤ |y_n - y*|,

where | · | is the element-wise absolute value. However, we can apply any arbitrary rotation to the problem, and the condition should stay true. We can therefore relax it to the rotation-invariant condition:

‖P(y_n)⁻¹ g(y_n)‖ ≤ ‖y_n - y*‖,

where ‖ · ‖ is the Euclidean norm.
Toy problem 1: exponential least squares

Let us now discuss two toy problems that are closely related to MR relaxometry. The first one consists of finding the least squares solution when the non-linear function is an exponential and there is a single data point:

ℓ(y) = (1/2)(x - e^{-y})².

The corresponding energy landscape, shown as a solid black line in Fig. 1, is non-convex away from the optimum, where its curvature becomes negative. When x > 0, there is an obvious analytical solution (y* = -log x), but we will use an iterative method to reach the same conclusion. Using the same notation as before, we have:

g(y) = e^{-y}(x - e^{-y}),
h(y) = e^{-2y} + e^{-y}(e^{-y} - x),
p(y) = e^{-2y}.

If we compute the derivative of r(y) = g(y)/p(y) = x e^{y} - 1, we find r'(y) = x e^{y}, whose absolute value is not bounded by 1. We therefore introduce an alternative preconditioner that uses the absolute value of the second term of the Hessian rather than discarding it:

p̃(y) = e^{-2y} + e^{-y} |e^{-y} - x|.

Although we do not prove it, we find experimentally that this preconditioner enforces condition (10). The corresponding quadratic approximations are plotted in red (Gauss-Newton) and blue (proposed preconditioner) in Fig. 1, where experimental convergence with both preconditioners is also shown. Contrary to Gauss-Newton, the proposed version is monotonic when attacking from the non-convex side and converges almost as fast as Gauss-Newton when attacking from the convex side.

Figure 1: Qualitative analysis of two toy problems related to MR relaxometry. The first toy problem (first row) has an objective function of the form (x - exp(-y))², while the second toy problem (second row) has one of the form (x - exp(-exp(y)))². In each case, the first column shows the objective function in black along with quadratic approximations corresponding to Gauss-Newton (red) and the proposed second-order method (blue). The second column shows the magnitude of the gradient-Hessian ratio; the black line corresponds to a purely quadratic objective function, whose optimum is found in one step (|g(x)|/h(x) = |x - x*|), and monotonic convergence is ensured when the ratio is under this line. The third column shows the convergence of the Gauss-Newton (red) and proposed (blue) iterative schemes when attacking from either a convex or non-convex lobe, with the optimum shown in dashed black. In each case, convergence is shown both in terms of objective value per iteration and parameter value per iteration.
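The behaviour described above can be reproduced in a few lines. The sketch below fits y in x ≈ exp(-y) with both the Gauss-Newton and the proposed preconditioners; the data value and starting point are arbitrary illustrative choices:

```python
import math

def loss(y, x):      # 0.5 * (x - exp(-y))**2
    return 0.5 * (x - math.exp(-y)) ** 2

def grad(y, x):      # derivative of the loss with respect to y
    return math.exp(-y) * (x - math.exp(-y))

def p_gn(y, x):      # Gauss-Newton: keep only the squared-Jacobian term
    return math.exp(-2 * y)

def p_new(y, x):     # proposed: add the absolute second-order term
    return math.exp(-2 * y) + math.exp(-y) * abs(math.exp(-y) - x)

def iterate(y, x, precond, n=50):
    traj = [y]
    for _ in range(n):
        y = y - grad(y, x) / precond(y, x)
        traj.append(y)
    return traj
```

Starting at y = 3 (well inside the non-convex lobe) with x = 1, the first Gauss-Newton step jumps far past the optimum y* = -log x = 0 and increases the objective, whereas the proposed steps decrease it at every iteration.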
Toy problem 2: nested exponential least squares

Our second toy problem involves nested exponentials:

ℓ(y) = (1/2)(x - e^{-e^y})².

The corresponding energy landscape is shown in Fig. 1: its curvature now tends towards zero on both sides of the real line. Its gradient, Hessian and preconditioners are:

g(y0) = e^{y0 - e^{y0}} (x - e^{-e^{y0}}),
h(y0) = e^{2(y0 - e^{y0})} + (e^{2 y0} - e^{y0}) e^{-e^{y0}} (e^{-e^{y0}} - x),
p(y0) = e^{2(y0 - e^{y0})},
p̃(y0) = e^{2(y0 - e^{y0})} + |(e^{2 y0} - e^{y0}) e^{-e^{y0}} (e^{-e^{y0}} - x)|.   (23)

Again, r' is not bounded, and the Gauss-Newton preconditioner fails to enforce the convergence condition. Experimental convergence is shown in Fig. 1, where the proposed method is shown to converge monotonically while Gauss-Newton can diverge when starting in a non-convex section of the energy landscape.
Generalisation to the multivariate case

The toy problems investigated in the previous section are highly relevant to MR relaxometry: the first case is related to the exponential fit of the signal decay, where each data point is seen as the exponential of a linear combination of parameters, while the second case is an extension of the first, where positive parameters are encoded by their log. Given the apparent robustness of the proposed preconditioner on these toy problems, we propose two extensions to the general multivariate case, replacing the absolute value of the scalar second-order term with either: (1) the absolute value of the diagonal of the Hessian, diag(|diag(H_{f_i}(y0))|); or (2) the sum of the absolute values of the Hessian across columns, diag(|H_{f_i}(y0)| 1). Only the second one is ensured to majorise the true Hessian (Chun and Fessler, 2018), but we show in our experiments that the first one yields stable convergence in the MPM context. We therefore use the preconditioner:

P̃(y0) = Σ_i g_{f_i}(y0) g_{f_i}(y0)ᵀ + |f_i(y0)| diag(|diag(H_{f_i}(y0))|).   (24)

The impact of this loading is illustrated with a pair of two-dimensional toy problems that are closely related to the one-dimensional problems presented before. The first problem involves an objective function of the form:

ℓ(y, z) = (1/2)(x0 - e^{z - t0 y})² + (1/2)(x1 - e^{z - t1 y})².

In the second problem, the decay parameter is encoded by its log:

ℓ(y, z) = (1/2)(x0 - e^{z - t0 e^y})² + (1/2)(x1 - e^{z - t1 e^y})².
Step sizes and optimisation trajectories are depicted in Fig. 2. When the Gauss-Newton preconditioner is used, a vast part of the parameter space has very large step sizes that lead to overshooting. When the proposed preconditioners are used, the entire parameter space (or most of it) lies below the monotonic convergence threshold.
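The diagonally loaded preconditioner (option 1) can be sketched on the first two-dimensional toy problem, a two-point exponential decay fit with decay rate y and log-amplitude z. The data values, time points and starting point below are arbitrary illustrative choices:

```python
import math

def residuals(y, z, ts, xs):
    # f_i = x_i - exp(z - t_i * y)
    return [x - math.exp(z - t * y) for t, x in zip(ts, xs)]

def newton_like_step(y, z, ts, xs):
    """One step with P = sum_i J_i J_i^T + |f_i| * diag(|diag(H_fi)|)."""
    g = [0.0, 0.0]
    P = [[0.0, 0.0], [0.0, 0.0]]
    for t, x in zip(ts, xs):
        s = math.exp(z - t * y)
        f = x - s
        J = [t * s, -s]                  # df/dy, df/dz
        g[0] += f * J[0]; g[1] += f * J[1]
        for a in range(2):
            for b in range(2):
                P[a][b] += J[a] * J[b]   # Gauss-Newton term
        P[0][0] += abs(f) * t * t * s    # |f| * |d2f/dy2|
        P[1][1] += abs(f) * s            # |f| * |d2f/dz2|
    # solve the 2x2 system P * delta = g in closed form
    det = P[0][0] * P[1][1] - P[0][1] * P[1][0]
    dy = (P[1][1] * g[0] - P[0][1] * g[1]) / det
    dz = (P[0][0] * g[1] - P[1][0] * g[0]) / det
    return y - dy, z - dz
```

On noiseless data the sum of squared residuals decreases at every step and the iterates reach the true parameters, mirroring the behaviour seen in the 1D toy problems.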

Likelihood: spoiled gradient echo
The main contribution of this paper lies in the direct optimisation of A, R1, R2* and δ. First, each parameter is encoded by its log (or logit in the case of δ). Let us switch from a physics to a mathematical notation, where scalars and vectors are written in lowercase while capital letters are reserved for matrices and tensors. We use a tilde to differentiate the log-parameters from their exponentiated version, so that in each voxel:

a = exp(ã), r1 = exp(r̃1), r2 = exp(r̃2), δ = 1/(1 + exp(-δ̃)).

Figure 2: Qualitative analysis of two-dimensional toy problems related to MR relaxometry. The first toy problem (top half) has an objective function of the form ℓ(y, z) = (x0 - e^{z - t0 y})² + (x1 - e^{z - t1 y})², while the other toy problem (bottom half) has one of the form ℓ(y, z) = (x0 - e^{z - t0 e^y})² + (x1 - e^{z - t1 e^y})². In each case, the first column shows the objective function along each of the two dimensions (in black) along with quadratic approximations corresponding to Gauss-Newton (red) and the proposed second-order methods (blue and green). The second column shows the normalised step size for each method, colour-coded such that values below one are blue and values above one are red. The optimum is marked by a red star. Convergence paths from two random initial coordinates (solid and dashed curves) are overlaid. In the second toy problem, two different loading matrices are used: Proposed uses the diagonal of the absolute Hessian, diag(|H|), whereas Proposed+ uses its sum across columns, |H| 1.

Figure 3: Negative log-likelihood and step size function of the non-linear SPGR model. The first column shows the log-likelihood of a single observation with respect to each log-parameter, while keeping the others fixed. The second column shows the step size function (red) obtained with the proposed approximate Hessian, compared with the theoretical perfect step size function f(x) = |x - x*|. As long as the actual step size is below this theoretical function, monotonic convergence is ensured.

Let us assume that all images are defined on the same grid with N voxels. We write the set of log-parameters as Y = [ã, r̃1, r̃2, δ̃] ∈ R^{N×4} and the set of observed 3D images as X ∈ R^{N×I}. Observed images come with a set of fixed parameters Θ ∈ R^{4×I}, where each quadruplet θ_i = (α_i, t_{r,i}, t_{e,i}, σ_i²) ∈ R⁴ contains the flip angle, repetition time, echo time and noise variance of an image. The likelihood term of the objective function involves a sum across images followed by a sum across voxels:

ℓ(Y) = Σ_i Σ_n 1/(2 σ_i²) (x_{ni} - s(y_n, θ_i))²,   (25)

where s is the intensity of a spoiled gradient echo 5:

s(y, θ) = a sin α (1 - δ)(1 - e^{-t_r r1}) / (1 - (1 - δ) cos α e^{-t_r r1}) e^{-t_e r2}.   (26)

For contrasts that do not include an MT pulse, δ = 0 and gradients with respect to δ̃ are zero. We recognise a non-linear least squares problem, and propose to use the preconditioner from equation (24), which depends on the gradient and Hessian of s. Let us focus on a single voxel and drop the (i, n) subscripts for clarity. After differentiation, and writing e1 = e^{-t_r r1} and v = 1 - (1 - δ) cos α e1, the gradient is:

∂s/∂ã = s,
∂s/∂r̃1 = s t_r r1 e1 (1/(1 - e1) - (1 - δ) cos α / v),
∂s/∂r̃2 = -s t_e r2,
∂s/∂δ̃ = -s δ (1 - δ) (1/(1 - δ) + cos α e1 / v),

and the 4 × 4 Hessian of s follows by differentiating these expressions once more. The gradient g and preconditioner P̃ of the least squares objective function are then computed as in equations (6) and (24) and summed across echo times and contrasts. The log-likelihood of one observation with respect to each parameter (keeping the others fixed) is shown in Fig. 3, along with the step size function, which is consistently under the monotonic convergence bound. Although this figure only shows the step size function for one given set of parameters, the convergence property is consistently enforced when sampling random parameters and observed values.

5. When the receive and transmit fields are known from, e.g., field mapping, equation (26) is slightly changed: the flip angle takes a voxel-dependent value α_n = α_nom b⁺_n, where the nominal flip angle is modulated by the transmit field efficiency, and the whole signal is modulated voxel-wise by the net receive field b⁻_n.
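The derivatives of the signal with respect to the log/logit parameters can be checked against finite differences. The sketch below assumes the parameterisation a = exp(ã), r1 = exp(r̃1), r2 = exp(r̃2) and δ = sigmoid(δ̃); function names and the test point are illustrative:

```python
import math

def spgr_mt(la, lr1, lr2, ld, alpha, tr, te):
    """MT-weighted SPGR signal as a function of log/logit parameters."""
    a, r1, r2 = math.exp(la), math.exp(lr1), math.exp(lr2)
    d = 1.0 / (1.0 + math.exp(-ld))        # logit -> MT saturation
    e1 = math.exp(-tr * r1)
    u = (1 - d) * (1 - e1)
    v = 1 - (1 - d) * math.cos(alpha) * e1
    return a * math.sin(alpha) * u / v * math.exp(-te * r2)

def spgr_mt_grad(la, lr1, lr2, ld, alpha, tr, te):
    """Analytic gradient with respect to the four log/logit parameters."""
    r1, r2 = math.exp(lr1), math.exp(lr2)
    d = 1.0 / (1.0 + math.exp(-ld))
    e1 = math.exp(-tr * r1)
    v = 1 - (1 - d) * math.cos(alpha) * e1
    s = spgr_mt(la, lr1, lr2, ld, alpha, tr, te)
    g_la = s                                 # s is proportional to exp(la)
    g_lr2 = -te * r2 * s                     # chain rule through exp(lr2)
    dE = -tr * r1 * e1                       # d(e1)/d(lr1)
    g_lr1 = s * (-1 / (1 - e1) + (1 - d) * math.cos(alpha) / v) * dE
    g_ld = s * (-1 / (1 - d) - math.cos(alpha) * e1 / v) * d * (1 - d)
    return (g_la, g_lr1, g_lr2, g_ld)
```

Central finite differences agree with the analytic gradient to high precision, which is a cheap sanity check before plugging such derivatives into a second-order optimiser.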

Spatial projection
In practice, it is common for people to move their head between scans, making inter-scan registration a mandatory initial step. While reslicing all scans to the same space is a possibility, we prefer to keep the raw data untouched and to integrate the spatial projection from the reconstruction space to the different acquisition spaces in the forward model, slightly modifying the objective function. Let us write all log-parameter maps as y = vec(Y) ∈ R^{NK}, where N is the number of voxels in reconstruction space and K the number of maps, and let Ψ ∈ R^{MK×NK} be a matrix that encodes linear reslicing from reconstruction space to one acquisition space 6. The new objective function is ℓ_ψ(y) = ℓ(Ψy). Application of the chain rule yields:

g_ψ(y) = Ψᵀ g(Ψy),   H_ψ(y) = Ψᵀ H(Ψy) Ψ.

However, this new Hessian is much less sparse: it can no longer be seen as block-diagonal. It was shown in Ashburner et al. (2018) that this matrix can be majorised (in the Löwner order sense) by a block-diagonal matrix whose n-th K × K block is Σ_m ψ_{mn} H_m(Ψy), i.e., the voxel-wise Hessian blocks computed in acquisition space, pushed to reconstruction space. In other words, for each contrast, the log-parameter maps are resliced (pulled) to the acquisition space, where the gradient and Hessian of the least squares function are computed and sent back to reconstruction space using the adjoint of the reslicing operation (pushed).

6. Since the same transformation is applied to all maps, Ψ is really the Kronecker product of an M × N reslicing matrix and a K × K identity matrix.
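The pull and push operations can be sketched in 1D with linear interpolation: `pull` applies Ψ and `push` applies its adjoint Ψᵀ, which can be verified through the inner-product identity ⟨Ψy, x⟩ = ⟨y, Ψᵀx⟩. The function names and the 1D setting are illustrative:

```python
import math

def pull(y, coords):
    """Reslice (pull) a 1D map y at non-integer coordinates with linear
    interpolation: out[m] = sum_n psi[m, n] * y[n]."""
    out = []
    n = len(y)
    for c in coords:
        i = math.floor(c)
        w = c - i
        v = 0.0
        if 0 <= i < n:
            v += (1 - w) * y[i]
        if 0 <= i + 1 < n:
            v += w * y[i + 1]
        out.append(v)
    return out

def push(x, coords, n):
    """Adjoint of pull: scatter acquisition-space values back to
    reconstruction space using the same interpolation weights."""
    out = [0.0] * n
    for c, v in zip(coords, x):
        i = math.floor(c)
        w = c - i
        if 0 <= i < n:
            out[i] += (1 - w) * v
        if 0 <= i + 1 < n:
            out[i + 1] += w * v
    return out
```

Because `push` reuses exactly the weights of `pull`, the adjoint identity holds to machine precision, which is what allows gradients computed in acquisition space to be sent back to reconstruction space.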

Prior: joint total variation
Multiple types of spatial regularisation exist, with different properties. Here, we focus on joint total-variation, equivalent to an 1,2 norm over spatial gradients, which benefits from the same sparsity-inducing properties as other 1 norms, except that sparsification is shared across contrasts. This property is particularly interesting when the same object is seen through different contrasts, as is the case in the context of MPMs, since edges are shared across parameter maps. Again, let Y = ã,r 1 ,r 2 ,δ ∈ R N×4 be the set of all parameter maps, the regulariser has the form: where G n extracts all 6 forward and backward finitedifferences (in 3D) about the n-th voxel and λ k is a parameter-specific regularisation factor. This regulariser is convex but non-smooth, complicating its use in optimisation. However, several methods have been developed over the years to solve this problem; e.g., ISTA (Daubechies et al., 2004), FISTA (Beck and Teboulle, 2009), ADMM (Boyd et al., 2011) and variations thereof. In this paper, we use an iteratively reweighted least squares (IRLS) scheme (Daubechies et al., 2010;Bach, 2011), which relies on the bound: When the weight map w is fixed, this bound can be seen as a Tikhonov (i.e., 2 ) prior with nonstationary regularisation, which is a quadratic prior that factorises across parameters. Therefore, the between-parameters correlations induced by the JTV prior are entirely captured by the weights. Conversely, when the parameter maps are fixed, the weights can be updated in closed-form: Let us now vectorise all log-parameter maps: y = vec (Y). The quadratic term in equation (39) -when summed over all voxels -can be written as y T Ly with L = G T WG ⊗ diag (λ), where G ∈ R 6N×N extracts all 6 forward and backward finite-differences about all voxels and W = diag (w) ⊗ I 6 . Note further that G T G is Toeplitz and can be implemented as a convolution by a small kernel (Ashburner, 2007). 
Similarly, diag(GᵀWG) can be obtained by convolving the weight map w with a small "cross" kernel.
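A minimal 1D sketch of the IRLS bound and weight update (finite differences along one axis, made-up regularisation factors; the bound √z ≤ (wz + 1/w)/2 is the standard AM-GM inequality used by IRLS schemes):

```python
import numpy as np

def jtv(Y, lam):
    """JTV of K maps stacked in Y (K x N): sum over voxels of the
    lambda-weighted l2 norm of the stacked spatial gradients."""
    G = np.diff(Y, axis=1)                        # 1D forward differences
    return np.sum(np.sqrt((lam[:, None] * G**2).sum(axis=0)))

def irls_weights(Y, lam, eps=1e-8):
    """Closed-form weight update: w_n = 1 / sqrt(sum_k lam_k |grad y_k|_n^2)."""
    G = np.diff(Y, axis=1)
    z = (lam[:, None] * G**2).sum(axis=0)
    return 1.0 / (np.sqrt(z) + eps)

Y = np.random.randn(4, 16)                        # four log parameter maps
lam = np.array([1.0, 2.0, 0.5, 1.5])
w = irls_weights(Y, lam)

# Reweighted quadratic bound, summed over voxels: (w*z + 1/w)/2 >= sqrt(z).
G = np.diff(Y, axis=1)
z = (lam[:, None] * G**2).sum(axis=0)
bound = 0.5 * np.sum(w * z + 1.0 / w)

assert bound >= jtv(Y, lam) - 1e-6                # majorisation holds for any w > 0
assert np.isclose(bound, jtv(Y, lam), atol=1e-4)  # tight at the closed-form weights
```

With w fixed, the bound is quadratic in Y and factorises across the K maps, which is what allows the voxel-wise block structure to be preserved.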

Optimisation: regularised spoiled gradient echo
Optimisation follows an IRLS scheme, where each iteration consists of alternately solving for the parameter maps y and the JTV weights w. At each iteration, the objective function in y is of the form ψ(y) = Σc Σe (1/(2σc²)) ‖xce − sce(y)‖² + (1/2) yᵀLy, where c and e index contrasts and echoes, xce is the corresponding acquired volume and sce the predicted signal. This function is non-linear and solved using the proposed second-order method, from which we get a gradient g ∈ R^(4N) and an approximate Hessian H ∈ R^(4N×4N). This matrix is block diagonal, as it consists of N small 4 × 4 matrices. Performing the Newton update step involves solving the linear system (H + L)Δ = g + Ly, with update y ← y − Δ, which is done using the preconditioned conjugate gradient method. The preconditioner that we use is the block-diagonal matrix H + diag(L), obtained by keeping only the diagonal of the regularisation matrix, whose inverse can be computed voxel-wise.
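The Newton solve can be sketched with a generic preconditioned conjugate gradient routine. The toy system below stands in for H + L (symmetric positive definite) and a diagonal preconditioner stands in for the voxel-wise invertible block-diagonal one; this is a sketch, not NiTorch's implementation:

```python
import numpy as np

def pcg(A, b, Pinv, n_iter=32, tol=1e-3):
    """Preconditioned conjugate gradient for A x = b.
    A and Pinv are callables: matrix-vector product and preconditioner solve."""
    x = np.zeros_like(b)
    r = b - A(x)
    z = Pinv(r)
    p = z.copy()
    rz = r @ z
    for _ in range(n_iter):
        Ap = A(p)
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            break
        z = Pinv(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# Toy SPD system standing in for (H + L) dy = g + L y:
n = 50
Mrand = np.random.randn(n, n)
A_mat = Mrand @ Mrand.T + n * np.eye(n)
b = np.random.randn(n)

x = pcg(lambda v: A_mat @ v, b, lambda v: v / np.diag(A_mat),
        n_iter=200, tol=1e-10)
assert np.allclose(A_mat @ x, b, atol=1e-5)
```

In the actual solver the matrix-vector product A(p) combines the block-diagonal Hessian (voxel-wise 4 × 4 products) with the convolution that implements L.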

Implementation
In practice, 10 IRLS iterations were used, with 5 inner Newton iterations and 32 conjugate gradient iterations per Newton step. A tolerance threshold on the gain between two subsequent iterations was set for early stopping (IRLS: 10 −5 , Newton: 10 −5 , CG: 10 −3 ). The variance of the thermal noise was estimated by fitting a two-class Rician mixture to each acquired volume (Ashburner and Ridgway, 2013); the class with lowest mean was considered as the background. Variances were then averaged across echo times using the geometric mean. Initial (constant) parameter maps were estimated using the mean of the other (non-background) class in each volume.
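The variance-pooling step can be sketched as follows; the per-echo variances here are made-up numbers standing in for the Rician-mixture background fits:

```python
import numpy as np

# Hypothetical per-echo noise variance estimates (one per acquired volume).
per_echo_var = np.array([4.1, 3.8, 4.5, 4.0, 3.9])

# Variances are averaged across echo times using the geometric mean,
# i.e. the exponential of the mean log-variance.
pooled_var = np.exp(np.mean(np.log(per_echo_var)))

# The geometric mean always lies between the smallest and largest estimate.
assert per_echo_var.min() <= pooled_var <= per_echo_var.max()
```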
This algorithm, along with the baseline methods, is implemented in NiTorch, a Python library with a PyTorch backend dedicated to neuroimaging. It is publicly available at https://github.com/balbasty/nitorch. The processing of a 0.8 mm MPM dataset - from which A, R1, R2* and MTsat can be estimated - fits in 12 GB of memory and takes about 10 minutes on an NVIDIA Titan V GPU for SPGR+JTV (5 minutes for NONLIN+JTV, < 1 minute for LOGLIN).

Convergence analysis
This experiment was based on simulated values. One thousand single-voxel datasets were simulated, each with their own relaxation parameters, TRs, TEs and flip angles. Each voxel had three contrasts (i.e., three different TRs and flip angles) - including one with an MT pulse - and five echo times per contrast. All log parameters (log A, log R1, log R2*, logit MTsat, log TR, log TE) were uniformly sampled in [−5, 5], while flip angles were uniformly sampled in [0, π/4] and the noise variance was fixed at σ² = 1. The (unregularised) SPGR model was then fitted using 10,000 iterations of the proposed second-order optimisation, and the negative log-likelihood was computed after each iteration. Two baseline algorithms were used as well: Gauss-Newton with a backtracking line search (at each step, the descent direction was modulated by an Armijo factor) and Levenberg-Marquardt (LM). Following Press et al. (2007), updates were only accepted if they improved the objective function. On failure, the GN Armijo factor (respectively the LM damping parameter) was divided (resp. multiplied) by 10, while it was multiplied (resp. divided) by 10 on success. One Armijo or damping factor was defined per virtual voxel.
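The signal model being fitted in each simulated voxel can be sketched with the standard spoiled gradient echo (Ernst) equation; the MT saturation factor is omitted here for brevity, so this is a simplified form of the full model:

```python
import numpy as np

def spgr(A, R1, R2s, TR, TE, alpha):
    """Spoiled gradient echo signal (Ernst equation), MT factor omitted."""
    E1 = np.exp(-TR * R1)
    return (A * np.sin(alpha) * (1 - E1) / (1 - np.cos(alpha) * E1)
            * np.exp(-TE * R2s))

# Sanity check: the signal is maximised at the Ernst angle
# alpha* = arccos(exp(-TR * R1)).
A, R1, R2s, TR, TE = 1.0, 1.0, 20.0, 0.025, 0.0023   # SI-ish toy values
alpha_star = np.arccos(np.exp(-TR * R1))
alphas = np.linspace(0.01, np.pi / 2, 500)
signals = spgr(A, R1, R2s, TR, TE, alphas)
assert abs(alphas[np.argmax(signals)] - alpha_star) < 0.01
```

The non-linearity in (A, R1, R2*, α) is what makes the joint fit require the careful second-order scheme evaluated in this experiment.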
A similar experiment was conducted in order to compare the use of a log, rate or time representation during optimisation. Two alternative optimisation schemes were derived such that either (A, R1, R2*, MTsat) or (A, T1, T2*, MTsat) were optimised instead of (log A, log R1, log R2*, logit MTsat). These two representations required an even more stable Hessian, where the absolute row sums |H|1 are used to load the diagonal instead of diag(|H|).
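One reading of this stabilisation (our interpretation, sketched below) is that replacing the diagonal with the absolute row sums |H|1 makes the matrix diagonally dominant, and therefore positive semi-definite by Gershgorin's theorem, no matter how indefinite H was:

```python
import numpy as np

# A symmetric but possibly indefinite 4x4 Hessian block.
H = np.random.randn(4, 4)
H = 0.5 * (H + H.T)

# Load the diagonal with |H|1: the vector of absolute row sums.
load = np.abs(H) @ np.ones(4)
H_stable = H - np.diag(np.diag(H)) + np.diag(load)

# Each diagonal entry now equals sum_j |H_ij| >= sum_{j != i} |H_ij|,
# so every Gershgorin disc lies in the non-negative half-line.
assert np.linalg.eigvalsh(H_stable).min() >= -1e-10
```

This is more conservative than diag(|H|) loading (the steps are smaller), which is consistent with the slower convergence observed for the rate and time representations.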

Datasets
Two datasets were used in this study:
1. The first is a single-subject, single-session reproducibility dataset: a single participant was scanned five times in one session with a 0.8 mm MPM protocol. This protocol consists of three multi-echo gradient-echo contrasts (flip angle: 21/6/6 deg; number of echoes: 8/8/6 with 2.3 ms echo spacing; off-resonance pulse: no/no/yes; TR: 25 ms; resolution: 0.8 mm; FOV: 256 × 224 × 179 mm³; GRAPPA acceleration: 2 × 2 with 40 reference lines). Calibration data were acquired to additionally account for contrast-specific receive field (Papp et al., 2016) and participant-specific transmit field inhomogeneities (Lutti et al., 2010), using the protocols described in Callaghan et al. (2019). Signal-to-noise ratios were estimated by fitting a two-class Rician mixture model to the image histograms. SNRs in the first echo were 29 (PDw), 25 (T1w) and 28 (MTw); SNRs in the last echo were 16 (PDw), 15 (T1w) and 16 (MTw).
2. The second is the demonstration dataset from the hMRI toolbox, which consists of the same 0.8 mm MPM protocol acquired in a single subject, with motion between the MT-weighted scan and the PD- and T1-weighted scans.

Algorithms
We compared two variants of our method (ML and MAP) to methods based on the two-step process described in Tabelow et al. (2019) (ESTATICS to estimate R2* and TE = 0 intercepts, followed by an analytical fit of R1, A and MTsat from these intercepts). These baseline methods use either a log-linear or a non-linear fit of ESTATICS, and different denoising strategies (preprocessing with a non-local means filter, or MAP estimation with a JTV prior):
• LOGLIN performs a log-linear fit of R2*, from which weighted images extrapolated to TE = 0 are obtained. This is equivalent to ESTATICS, except that the spatial projection is integrated in the model. The Hessian matrix that is used is a (slight) majoriser of the true Hessian, so three Newton-like iterations are needed to reach the optimum (compared to a single one, had the true Hessian been used). Since all TRs are identical, A, R1 and MTsat are then computed analytically according to Mohammadi et al. (2017), without requiring rational approximations.
• LOGLIN+AONLM applies an anisotropic non-local mean filter (Manjón et al., 2010) to all individual echoes before processing them with LOGLIN.
• NONLIN performs a non-linear fit of R2* instead of a log-linear fit, but does not use any regularisation. This method was not used in Balbastre et al. (2020), because Gauss-Newton was not stable enough without regularisation. The new Hessian used in this work makes its optimisation possible.
• SPGR-ML performs a non-linear maximum-likelihood fit of the log-parameters log R1, log R2*, log A and logit MTsat.
• SPGR+JTV adds JTV regularisation to SPGR-ML, and effectively performs a non-linear maximum a posteriori fit.
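The log-linear ESTATICS step used by LOGLIN can be sketched as a per-voxel least-squares fit in log space: one shared decay rate R2* and one TE = 0 intercept per contrast (the echo times and parameter values below are made up for illustration):

```python
import numpy as np

TE = np.array([2.3, 4.6, 6.9, 9.2]) * 1e-3           # echo times (s)
r2s_true = 30.0                                       # shared decay rate (1/s)
inter_true = np.log([100.0, 80.0, 90.0])              # per-contrast log intercepts

# Simulated (noiseless) log signals: 3 contrasts x 4 echoes,
# log S_ce = intercept_c - R2* * TE_e.
logS = inter_true[:, None] - r2s_true * TE[None, :]

# Design matrix: one column per intercept, plus one for the negated TE.
n_c, n_e = logS.shape
X = np.zeros((n_c * n_e, n_c + 1))
for c in range(n_c):
    X[c * n_e:(c + 1) * n_e, c] = 1.0
    X[c * n_e:(c + 1) * n_e, -1] = -TE

beta, *_ = np.linalg.lstsq(X, logS.ravel(), rcond=None)

assert np.allclose(beta[:n_c], inter_true, atol=1e-8)  # recovered intercepts
assert np.isclose(beta[-1], r2s_true, atol=1e-6)       # recovered R2*
```

With noise, this ordinary least-squares fit in log space implicitly down-weights early (high-SNR) echoes relative to the non-linear fit, which is one motivation for NONLIN and the full SPGR models.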
For each dataset, as a preprocessing step, all contrasts across all repeats (and their B1+/− field maps) were co-registered.

Cross-validation
This experiment was based on the single-session dataset. First, one of the repeats was used to estimate optimal regularisation factors for LOGLIN+AONLM, NONLIN+JTV and SPGR+JTV by cross-validation across echoes. Five folds were constructed, with the first six echo times randomly permuted and split between a training set - used to estimate parameter maps - and a testing set - used to evaluate predictions. In each fold, the training echoes were used to estimate parameters with multiple regularisation values; these parameters were then used to simulate the left-out echoes, which were compared against the true (acquired) volumes. The mean squared error (MSE) between the true and predicted echoes was computed and z-normalised across folds, contrasts and echo times. Finally, the median z-normalised MSE was used as a metric to select the most adequate regularisation factor.
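The selection criterion can be sketched with synthetic MSE values (all numbers below are made up): z-normalising within each (fold, contrast, echo) cell removes the cells' intrinsic scale differences, after which the median across cells picks the regularisation value.

```python
import numpy as np

n_lam, n_cells = 5, 30                     # candidate lambdas x (fold, contrast, echo)

# Synthetic MSE table: each cell has its own intrinsic scale, and the
# candidate with index 2 is consistently the best by a factor of ~2.5.
cells = np.linspace(1.0, 3.0, n_cells)
mse = np.outer(np.array([1.0, 1.2, 0.4, 1.1, 0.9]), cells)

# z-normalise per cell (column), so scale differences cancel out.
z = (mse - mse.mean(axis=0)) / mse.std(axis=0)

# The median z-normalised MSE across cells selects the regularisation.
best = int(np.argmin(np.median(z, axis=1)))
assert best == 2
```

Because z-scores are invariant to each cell's scale, a candidate that is uniformly better wins regardless of how noisy individual contrasts or echoes are.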
The remaining four repeats were used to compare all (optimised) methods by cross-validation across repeats. Parameter maps were computed for each repeat and used to predict individual echoes from all other repeats.

Echo decimation
This experiment was based on the reproducibility dataset. First, reference parameter maps were obtained by computing the maximum-likelihood solution (SPGR-ML) of combined data from all four test runs. Parameter maps of one of the runs were then reconstructed by either SPGR+JTV or SPGR-ML using the first 2, 4 or 6 echoes. The root mean squared error (RMSE) -within a brain ROI -was computed between each map and its reference.

Uncertainty mapping
The inverse of the Hessian of the objective function at the optimum can be used, under the Laplace approximation, to estimate the posterior uncertainty of the log-parameters (MacKay, 2003). Since it is easier to invert, we rather use the inverse of the block-diagonal preconditioner and extract its diagonal: σ² ≈ diag(P(X)⁻¹), where X contains all observed volumes used to estimate the maps and P denotes the preconditioner. Note that under the Laplace approximation, the posterior of the log-parameter maps is normally distributed, making the parameters themselves log-normally distributed. If the parameter maps themselves are of interest, it might be preferable to return their expected value and variance, which in the case of R1 (writing r̂1 for the posterior mode of log R1 and σ² for its variance) gives: E[r1] = exp(r̂1 + σ²/2) and V[r1] = (exp(σ²) − 1) exp(σ²) exp(r̂1)².
In this experiment, we used three echoes from the hMRI demonstration dataset to reconstruct maps using SPGR+JTV and SPGR-ML, and computed the posterior uncertainty according to the above formula. This allowed us to map the posterior mean and standard deviation of log R1, R1 and T1. The ratio between the expected value of R1 under the Laplace approximation and under a point estimate is given by exp(σ²/2).
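The log-normal moment propagation can be checked numerically; the posterior mode and variance below are made-up values, and a Monte-Carlo draw reproduces the closed-form mean and variance:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sig2 = np.log(0.7), 0.01               # hypothetical mode/variance of log R1

# Closed-form log-normal moments of R1 = exp(log R1).
E_r1 = np.exp(mu + sig2 / 2)
V_r1 = (np.exp(sig2) - 1) * np.exp(sig2) * np.exp(mu) ** 2

# Monte-Carlo check: exponentiate Gaussian samples of log R1.
samples = np.exp(rng.normal(mu, np.sqrt(sig2), 1_000_000))
assert np.isclose(samples.mean(), E_r1, rtol=1e-2)
assert np.isclose(samples.var(), V_r1, rtol=5e-2)

# Ratio between the Laplace expectation and the point estimate exp(mu).
assert np.isclose(E_r1 / np.exp(mu), np.exp(sig2 / 2))
```

For small σ² the correction factor exp(σ²/2) is close to 1, so the point estimate and the posterior mean only diverge noticeably in high-uncertainty voxels.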

Convergence
The negative log-likelihood and gain after each fitting iteration in 50 randomly selected voxels are shown in Fig. 4. These plots show that all voxels converge monotonically with our approach (the gain is never negative) despite the very wide range of parameters (e.g., SNR, observed relaxation times) covered by the simulations. Conversely, Gauss-Newton and Levenberg-Marquardt fail to converge to the optimum in a large number of voxels, despite the fact that this simulation context is very favourable to them: they are allowed to define one regularisation parameter per voxel, whereas in a spatially regularised case, voxels would not be independent and would have to share the same parameter.
Note that under this setup (3 contrasts, 5 echoes per contrast), the vast majority of these maximum-likelihood fits yield optimal negative log-likelihoods that are lower than the expected negative log-likelihood of the true parameters. This shows that ML likely over-fits the data and is strong evidence that regularisation is needed.

Figure 4: Optimisation of 1,000 simulated voxels for 10,000 iterations using the proposed second-order method (left), Gauss-Newton (middle) and Levenberg-Marquardt (right). The top graph shows the negative log-likelihood per iteration for 50 random samples, while the middle graph shows the normalised gain per iteration for these same 50 samples. Gain is only shown for the proposed method. Note that if an iteration were non-monotonic, its gain would be negative, which never happens in our simulations. The bottom graph shows the distribution of the final log-likelihood of all 1,000 samples. The dashed line represents the expected log-likelihood of the true parameters. Log-likelihood and iterations are shown in log scale.

Fig. 5 shows results for the same experiment, this time comparing different optimisation bases (log, rate or time). While all methods converge to the same optimum, the rate of convergence is clearly higher when log-parameters are optimised.

Methods cross-validation
Quantitative maps obtained with all methods are displayed in Fig. 6, along with their MSE. As expected, directly fitting the parameters is beneficial compared with sequential parameter estimation. SPGR+JTV (median MSE = 1459) outperforms non-regularised approaches such as SPGR-ML (median MSE = 1706). Methods that make use of rational approximations all trail behind, with the regularised NONLIN+JTV (median MSE = 1716) again improving over the non-regularised NONLIN (median MSE = 2467). The log-linear version trails far behind (median MSE = 2873), with AONLM preprocessing only slightly improving it (median MSE = 2799).

Fig. 7 displays all four parameter maps (R1, R2*, PD, MTsat) reconstructed using a growing number of echoes (2, 4, 6) with a regularised algorithm (SPGR+JTV) and a non-regularised one (SPGR-ML). The RMSE between all maps and a reference map (the combined ML solution of all 4 runs) was computed; the regularised maps show a much lower RMSE than the non-regularised ones, even when fewer echoes are used. Noise amplification in the centre of the brain appears clearly in the SPGR-ML maps when only two echoes are used, whereas SPGR+JTV manages to preserve anatomical details in this region, despite some additional smoothing. Note that in our implementation, the noise variance and regularisation are two different parameters, even though they share a single degree of freedom in the optimisation: the noise variance is estimated on a volume-by-volume basis while the regularisation factor is kept fixed a priori. Consequently, the balance between the likelihood and the prior appears clearly in the maps displayed here, as more smoothing happens when less information (e.g., fewer echoes) is present in the data.

Uncertainty mapping
The capacity to estimate uncertainty is exemplified via R1: posterior expected R1 and T1 maps and their uncertainties are shown in Fig. 8. Uncertainty after an SPGR-ML fit is mostly driven by intensities: white matter has a higher uncertainty because its MR signal, and therefore its SNR, is generally lower than that of grey matter. Conversely, uncertainty after an SPGR+JTV fit is mostly driven by edges, since less local signal averaging is possible in their vicinity.

Figure 6: Top: R1, R2* and MTsat maps obtained with all methods. Bottom: z-normalised mean squared error between (unseen) predicted and acquired gradient-echo images. MSE was computed across all voxels in a single volume. Lower is better.

Figure 7: Decimation of the number of echoes. Maps were computed using the first 2, 4 or 6 echoes. Graphs on the left display the mean squared error (lower is better) relative to the reference maps (SPGR-ML with four repeats of 8 echoes) as a function of the number of echoes used. The corresponding parameter maps are displayed on the right.

Discussion & Conclusion
In this paper, we introduced an optimisation-based framework for multi-parameter mapping that directly fits the signal equations to all available SPGR data, and can work with (or without) a wide range of regularisers. The key innovation of our method is a second-order optimiser that makes use of a bespoke approximate Hessian that shows excellent experimental convergence. We showed on toy problems that our update steps have theoretical properties that ensure monotonic convergence even when the quadratic approximation is not a majoriser of the objective function. Although we did not offer analytical proof that this monotonic convergence extends to the multivariate case, our solution is at least as good as Gauss-Newton, which is known to reach the optimum whenever it does converge. Our optimiser therefore benefits from the same property and, experimentally, was never found to diverge.
We also introduced a novel encoding of the relaxation parameters, based on their log, that makes regularisation insensitive to the choice of unit. This encoding has the additional advantage that parameters live on the entire real line, making the optimisation problem unconstrained.
We evaluated the use of a joint total variation regulariser, an edge-preserving prior that introduces implicit correlations between channels, increasing its denoising power over parameter-wise total variation. We showed that this prior yields the best parameter inference, under the metric that inferred parameters best predict unseen data. Although this paper focuses on relatively uninformative spatial regularisers, our method opens the door to the use of informative learned priors. One possibility, inspired by Brudfors et al. (2019, 2020), is to include a multivariate Gaussian mixture model of the log-parameter maps, with learned tissue parameters (means, covariances, tissue probability maps). Alternatively, priors based on learned dictionaries of patches could be used (Dalca et al., 2017).

Although this was not the topic of this paper, it is important to investigate some properties of the maximum-likelihood estimator. While our second-order scheme may output an approximate posterior distribution under the Laplace approximation, it is centred about the mode of the posterior, which is not in general aligned with the posterior expected value. Different corrections could be devised depending on the map of interest (e.g., R1, T1 or log R1).
The Laplace approximation allowed posterior uncertainty to be estimated. These uncertainty maps are far from perfect: they only contain aleatory uncertainty (i.e., uncertainty that stems from the probabilistic nature of the variables in play, which is known and can be modelled), and are based on the assumption that the B1+/− fields and thermal noise variance are known, thereby discarding any uncertainty about their values. Furthermore, they rely on a Gaussian assumption that could be very wrong. Nonetheless, 'all models are wrong but some are useful', and - while flawed - these uncertainty maps are better than blind confidence. They could be used, e.g., to down-weight low-quality data points in downstream regression or classification tasks.

Figure 8: Uncertainty mapping and propagation. Log parameter maps (E[log r1] = r̂1) and their uncertainty (sd[log r1]) were estimated using SPGR+JTV, which includes an edge-preserving prior, and SPGR-ML, which has no (or an infinitely uninformative) prior. The expected value and standard deviation of r1 and t1 under a Laplace posterior (E[r1]) or a peaked posterior (exp(r̂1)) are shown.
Finally, transmit and receive field maps are assumed pre-computed and fixed in this work, and spoiling is assumed perfect. However, it has been shown that key sources of variance in T1 mapping come from imperfect spoiling and errors in the estimate of the B1+ field (Stikov et al., 2015). In theory, transmit field map estimation could be integrated in the optimisation framework, as long as the corresponding imaging data is included as well (Hurley et al., 2012). However, B0 and B1+ mapping protocols based on phase differences would be difficult to fit using second-order methods because of the circular support of phase values (Fessler and Noll, 2004). On the other hand, imperfect spoiling is challenging to integrate in the generative model because its correction, in its current form, is based on full numerical Bloch simulations and an a posteriori transformation of apparent T1 maps by a polynomial whose coefficients depend on B1+ (Preibisch and Deichmann, 2009). Of course, those post hoc corrections can still be applied to the parameters estimated here.
Some remaining approximations could also be removed: (1) our Gaussian noise model could be replaced by a more accurate non-central Chi or Rician model, which can be fitted using a majorisation-minimisation framework similar to the one used for optimising the JTV functional in this work (Varadarajan and Haldar, 2015); (2) the true TR of the off-resonance pulse could be used instead of being set to zero (Mohammadi et al., 2017), which may remove some residual bias in the R1 and MTsat maps. Furthermore, the generative model could be extended to more complex biophysical models, such as multi-compartment models (Deoni et al., 2008a), and to additional sequences, such as spin echo or inversion recovery. Diffusion data could also be integrated and modelled (Jian et al., 2007), as joint diffusion-relaxometry datasets are increasingly being used for accurate microstructure modelling (Hutter et al., 2018). Note that the diffusion signal can also be described using exponential-linear models, which are likely to benefit from our improved Hessian.
On the computational side, this method would benefit from any progress made on TV optimisation. Our implementation uses a relatively simple conjugate-gradient optimiser with Jacobi preconditioning, and it is certain that using preconditioners better tailored to TV would improve its convergence. Additionally, although we experimentally show the stability of our approximate Hessian, we did not provide a proof of convergence in this work. A formal proof that builds on our analysis would surely help move the field forward.
In conclusion, we have presented a flexible probabilistic generative modelling framework, which capitalises on a novel approximate Hessian to enable the robust joint estimation of multiple quantitative MRI parameters. The framework can be adapted to incorporate spatial projections, denoising schemes and additional priors. With this framework we have demonstrated enhanced performance relative to the consecutive estimation techniques commonly used in multi-parameter mapping, with additional benefits such as uncertainty mapping.