Enforcing necessary non-negativity constraints for common diffusion MRI models using sum of squares programming

In this work we investigate the use of sum of squares constraints for various diffusion-weighted MRI models, with a goal of enforcing strict, global non-negativity of the diffusion propagator. We formulate such constraints for the mean apparent propagator model and for spherical deconvolution, guaranteeing strict non-negativity of the corresponding diffusion propagators. For the cumulant expansion similar constraints cannot exist, and we instead derive a set of auxiliary constraints that are necessary but not sufficient to guarantee non-negativity. These constraints can all be verified and enforced at reasonable computational costs using semidefinite programming. By verifying our constraints on standard reconstructions of the different models, we show that currently used weak constraints are largely ineffective at ensuring non-negativity. We further show that if strict non-negativity is not enforced then estimated model parameters may suffer from significant errors, leading to serious inaccuracies in important derived quantities such as the main fiber orientations, mean kurtosis, etc. Finally, our experiments confirm that the observed constraint violations are mostly due to measurement noise, which is difficult to mitigate and suggests that properly constrained optimization should currently be considered the norm in many cases.


Introduction
Diffusion-weighted magnetic resonance imaging (MRI) captures local micro-structural information by observing diffusing (water) molecules probing their surroundings at a microscopic scale. In order to analyze this type of data one can either estimate parameters that describe the diffusion itself, which provides a somewhat abstract but accurate description of the observed stochastic motion, or one can use a model of the ambient structure that re-expresses the observed diffusion in terms of more intuitive structural parameters. Both cases generally rely on optimization to reconstruct the descriptive parameters from diffusion-weighted images, and in this work we introduce a specific set of basic constraints to improve such model reconstructions.
Constrained optimization guarantees that known or assumed bounds on parameters are respected, thereby improving the accuracy of a reconstruction. It also gives a sense of the data quality and the suitability of the model: in an ideal setting constrained and unconstrained results should be identical. In practice, large discrepancies between the two may point to problems such as artifacts or excessive noise in the data, or could indicate that the selected model is not appropriate for the considered data. Finally, constrained optimization can be important in further analysis steps, which may (implicitly) rely on physical or mathematical properties of the model/basis expansion that could be lost in the case of unsatisfied constraints. In diffusion tensor imaging (DTI) (Basser et al., 1994a) for example, a non-positive-definite diffusion tensor can result in unphysical 'negative' apparent diffusion or extreme fractional anisotropy values.
The use of constraints has been prevalent in diffusion MRI since the early days of DTI, where the diffusion tensor can be restricted to be positive-definite through various means (Pennec et al., 2006; Fillard et al., 2005; Lenglet et al., 2006; Koay, 2010; Koay et al., 2006; Wang et al., 2004). Constrained optimization has also been applied in e.g. diffusional kurtosis imaging (DKI) (Jensen et al., 2005; Veraart et al., 2011) and higher order tensor models such as the generalized DTI model proposed by Özarslan and Mareci (2003), cf. (Chen et al., 2013; Qi et al., 2010; Ghosh et al., 2014; Barmpoutis and Vemuri, 2010; Barmpoutis et al., 2007, 2009, 2012), while common implementations of spherical deconvolution (SD) methods (Tournier et al., 2007) and of the mean apparent propagator (MAP) method (Özarslan et al., 2013b) often rely on soft constraints or regularization (Fick et al., 2016). For these and related models, non-negativity constraints based on square root representations have also been studied extensively (Hanyga and Seredyńska, 2012; Goh et al., 2009; Cheng et al., 2009, 2014; Vemuri et al., 2019; Sun et al., 2013). Practical square root representations impose potentially significant restrictions on the set of permitted parameter values, but they do guarantee non-negativity globally without any compromise on speed. Strict constraints have also been proposed for functions expressed in terms of spherical harmonics by Schwab et al. (2012).
Although constrained reconstruction obviously has many advantages, and its development is being actively pursued by numerous research groups, its use and availability in popular software packages is not as ubiquitous as one might expect. One reason for this is that constrained optimization is (typically significantly) slower than its unconstrained counterpart, which together with the size of modern diffusion-weighted data sets leads to restrictive hardware requirements. Constrained optimization algorithms generally also require a nontrivial reformulation of the parameter estimation problem to match a specific form, complicating their implementation, and in some cases efficient algorithms may not be available.
In this paper we will derive a set of basic mathematical constraints for diffusion-weighted MRI (Section 2). These constraints are based on the non-negativity of the so-called ensemble average propagator or associated functions, but reformulated as the (relaxed) condition that these functions can be written as a sum of squares (SOS) of polynomials. For many commonly used models and basis expansions these constraints take the form of a positive-definiteness condition on a matrix that is linear in the model parameters, which can thus be enforced efficiently through the use of semidefinite programming. We derive explicit constraints for MAP MRI, spherical deconvolution, and for the cumulant expansion (CE) (Liu et al., 2004; Kiselev, 2010), and show that the application of these constraints should be considered essential in many situations despite the associated computational costs (Sections 3 and 4). Next, we use the presence of unsatisfied constraints as a data quality measure to review the performance of grid- and shell-based acquisition sequences, and to assess the importance of additional corrections in the preprocessing pipeline. To emphasize the generality of the strict constraints we propose here, and to contrast them with existing soft approaches to non-negative propagator constraints, we denote constrained models with a '+' suffix (i.e., MAP+, CSD+, CE+).
SOS polynomials have been considered previously in the context of diffusion-weighted MRI by several authors, specifically to enforce positive-definiteness of the (generalized) diffusion tensor (Özarslan and Mareci, 2003). The first explicit use of these constraints that we are aware of is the work by Barmpoutis et al. (2007, 2009) on the constrained reconstruction of fourth order diffusion tensors, a special case in which the SOS constraints are exactly equal to the desired non-negativity constraints.
In subsequent works the authors enforce non-negativity on diffusion tensors of orders greater than four, by approximating the SOS constraints with a set of linear constraints (Barmpoutis and Vemuri, 2010; Barmpoutis et al., 2012). A formulation for general higher order tensors based on semidefinite programming was later provided by Chen et al. (2013). We are unaware of any works that investigated SOS constraints for the more general models that we discuss in this work. Parts of the current manuscript have been presented in preliminary form as conference and workshop abstracts (Dela Haije and Feragen, 2018; Dela Haije et al., 2015).

Methods
In the following we first review some background on diffusion MRI in Section 2.1. We then describe a set of general constraints and their practical implementation in Section 2.2. In Section 2.3 we discuss their application to three relevant and commonly used models/basis expansions of the diffusion-weighted signal. Sections 2.4 and 2.5 give some details regarding implementation and cover the data used in the experiments presented in Section 3.

Diffusion MRI fundamentals
The normalized diffusion-weighted signal S is generated by the average molecular motion over a time Δ within a single voxel. This voxel-averaged diffusion can be described by a (bounded and Lebesgue integrable) displacement probability density function P, where P(r) gives the probability of a displacement r occurring within a voxel over a time Δ. The signal S and the density function P are related by a Fourier transform (Mitra and Halperin, 1995; Novikov and Kiselev, 2010; Callaghan, 1991; Callaghan et al., 1988, 1990; Özarslan et al., 2009)

S(q) = ∫_{ℝ³} P(r) e^{i q · r} dr, (1)

where · denotes the standard inner product, r ∈ ℝ³ is a displacement vector, and q ∈ ℝ³ is the gradient wave vector. The gradient wave vector is defined as q := γδG, combining the gyromagnetic ratio γ, the gradient pulse width δ, and the applied gradient G ∈ ℝ³. The measured (non-normalized) signal S̃ is related to S with the normalization factor S̃_0 := S̃(0) according to S = S̃/S̃_0. We take P(r) = P(−r) so that the normalized signal in Eq. (1) is real-valued, where we assume that (active) flow is negligible in our data (Mussel et al., 2017). The probability density function P is also referred to as the average propagator. A typical diffusion-weighted MRI data set consists of a relatively small set of diffusion-weighted measurements acquired with different q and/or different timing parameters. To aid the interpretation and further processing of these measurements we can reconstruct models or representations of the signal S, where the basic fitting procedure is as follows.
Given a set of applied wave vectors {q_i}_{i=1}^N, the corresponding measurements y_i (representing S̃(q_i) or log S̃(q_i)), and a model or basis expansion m_ω (of S̃ or log S̃) with a set of parameters ω, we estimate the optimal parameters by minimizing the sum of squares of the residuals

argmin_ω Σ_{i=1}^N (m_ω(q_i) − y_i)². (2)

Depending on the specific techniques and algorithms used to solve the least squares problem, this formulation can be adapted.
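For models that are linear in the parameters ω, Eq. (2) reduces to an ordinary linear least squares problem. The following minimal sketch illustrates this fitting step with a hypothetical polynomial basis (the basis functions and coefficients below are illustrative only, not one of the models discussed later).

```python
import numpy as np

# Minimal sketch of the least-squares fit in Eq. (2) for a model that is
# linear in its parameters: m_w(q) = sum_j w_j * f_j(q). The basis functions
# f_j below are illustrative placeholders, not a model from this work.
rng = np.random.default_rng(0)

def basis(q):
    """Example basis evaluated at wave vectors q (N x 3): [1, qx^2, qy^2, qz^2]."""
    return np.column_stack([np.ones(len(q)), q[:, 0]**2, q[:, 1]**2, q[:, 2]**2])

q = rng.normal(size=(100, 3))            # applied wave vectors q_i
w_true = np.array([1.0, -0.3, -0.2, -0.1])
y = basis(q) @ w_true                    # noiseless synthetic measurements y_i

# Unconstrained solution of Eq. (2): argmin_w ||M w - y||^2.
M = basis(q)
w_hat, *_ = np.linalg.lstsq(M, y, rcond=None)
assert np.allclose(w_hat, w_true)        # the noiseless fit recovers the parameters
```

With noisy measurements the recovered parameters are no longer exact, which is precisely the situation in which the constraints discussed below become relevant.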

Relaxed non-negativity constraints
When solving the optimization problem (2) for a specific diffusion MRI signal model/expansion, it is often desired that the result is guaranteed to satisfy certain constraints. The fundamental property from which our constraints are derived is non-negativity of the displacement probability density function P, i.e., the requirement that P(r) ≥ 0 for all r ∈ ℝ³. Unlike e.g. constraints based on the physiologically expected values of a parameter, non-negativity of the propagator is essentially a universally accepted, purely mathematical constraint.
Ideally we would like to directly translate the non-negativity of P to a set of constraints on the signal domain parameters ω of a given expression using e.g. Bochner's theorem (Rudin, 1996), but this only leads to practical constraints in exceptional cases. The best known example is that of the diffusion tensor imaging model, where positive-definiteness of the diffusion tensor implies positivity of P (Wang et al., 2004). The vast majority of models, however, allow no useful direct correspondences between non-negativity of P and the model parameters, and in these cases we must resort to increasingly relaxed modifications of the non-negativity constraint.
In order to obtain a practical (but more restrictive) method to guarantee non-negativity, let us consider the problem of constraining (real) polynomials to be non-negative. In this setting we have exact and verifiable conditions for non-negativity of degree 2 polynomials in any number of variables, and for arbitrary degree polynomials of a single variable, but outside of these special cases it is generally an NP-hard problem to determine whether a given polynomial is non-negative (Parrilo, 2003). However, non-negativity of a degree 2d polynomial P is trivially guaranteed if P can be written as a sum of squared polynomials, i.e., if there exists a finite set of polynomials g_1, g_2, … such that P(x) = Σ_k g_k(x)², in which case we say that P is SOS. We could thus optimize over the set of SOS polynomials instead of the set of non-negative polynomials, and obtain a reconstruction that satisfies non-negativity. There is of course a cost associated to this relaxation, which comes from the fact that not every non-negative polynomial can be written as a sum of squared polynomials (Hilbert, 1933). But because non-negative polynomials (on a compact domain) can be approximated to an arbitrary degree of accuracy by SOS polynomials (Berg et al., 1976; Lasserre, 2007), the set of SOS polynomials being dense in the set of non-negative polynomials for the L¹ norm, we consider a restriction to the set of sum of squares polynomials a warranted relaxation. In Fig. 1 we show an example of a sum of squares constrained polynomial fit in one dimension.
The relaxed non-negativity constraints based on SOS polynomials can be enforced in practice as described in the work of Parrilo (2003). Let P_m be a degree 2d polynomial associated to m, for example employed for representing/approximating the propagator P. If P_m can be written as a sum of squared polynomials, there must exist a positive semidefinite matrix A such that

P_m(x) = e(x)^T · A · e(x), (3)

where e is a vectorized polynomial basis for polynomials in x up to degree d. The original optimization problem (2) can thus be reformulated to include the relaxed non-negativity constraint as

argmin_ω Σ_{i=1}^N (m_ω(q_i) − y_i)² subject to P_{m_ω}(x) = e(x)^T · A · e(x), A ≽ 0, (4)

where ≽ 0 denotes positive semidefiniteness. This is a semidefinite programming problem that can be solved efficiently using e.g. interior point methods. Furthermore, if we fix the parameters ω in the constrained problem (4) to the solution of the unconstrained problem (2), then the interior point method will determine whether the selected parameters result in a feasible problem, i.e., if the unconstrained fit already satisfies the proposed SOS polynomial-based constraint. The remaining challenge is now to find polynomials P_m for a given model m of S, such that the solution to problem (4) guarantees that the corresponding probability density function P is non-negative.
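The Gram-matrix form in Eq. (3) can be made concrete with a small hypothetical example: for a one-dimensional quartic that happens to be a perfect square, a positive semidefinite Gram matrix A exists, and an explicit SOS decomposition can be read off from its eigendecomposition. The polynomial and matrix below are illustrative only.

```python
import numpy as np

# Sketch of Eq. (3) for the 1D quartic P(x) = x^4 - 2x^2 + 1 = (x^2 - 1)^2,
# with monomial basis e(x) = (1, x, x^2) and a hand-picked Gram matrix A.
A = np.array([[ 1.0, 0.0, -1.0],
              [ 0.0, 0.0,  0.0],
              [-1.0, 0.0,  1.0]])

def e(x):
    return np.array([1.0, x, x**2])

# Check the representation P(x) = e(x)^T A e(x) at sample points.
xs = np.linspace(-2.0, 2.0, 9)
vals = np.array([e(x) @ A @ e(x) for x in xs])
assert np.allclose(vals, (xs**2 - 1.0)**2)

# A is positive semidefinite, so an SOS decomposition follows from its
# eigendecomposition: P(x) = sum_k (sqrt(lam_k) u_k . e(x))^2.
lam, U = np.linalg.eigh(A)
assert lam.min() > -1e-12                # A >= 0 certifies that P is SOS
g = np.sqrt(np.clip(lam, 0, None)) * U   # columns give the squared terms
sos = np.array([sum((g[:, k] @ e(x))**2 for k in range(3)) for x in xs])
assert np.allclose(sos, vals)
```

In the actual problem (4) the Gram matrix is not hand-picked but left as a decision variable of the semidefinite program, coupled linearly to the model parameters ω.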

Constraints for common diffusion MRI models
We will consider sum of squares constraints for three models/basis expansions for the diffusion-weighted signal, which we consider representative of the many proposed models; see e.g. (Assemlal et al., 2011; Panagiotaki et al., 2012; Ghosh and Deriche, 2016) for a comprehensive overview.

Mean apparent propagator
The first expression we consider is the mean apparent propagator model proposed by Özarslan et al. (2013b). This model expresses the signal S as a weighted sum of Hermite functions (Özarslan et al., 2008, 2013a; Abramowitz and Stegun, 1964), which endow the representation with a number of convenient properties (Özarslan et al., 2013b, Introduction). Most importantly, Hermite functions are eigenfunctions of the Fourier transform that links the signal S to the propagator P, which means that they allow an analytical expression for P given a MAP reconstruction of S.

The order K MAP model has the functional form

S(q) = Σ_{k=0,2,…,K} Σ_{n1+n2+n3=k} a_{n1n2n3} Φ_{n1n2n3}(S · q), (5)

where a_{n1n2n3} := a_{n1n2n3}(S) are the parameters and n_1, n_2, n_3 non-negative integers with k := n_1 + n_2 + n_3. The matrix S is a scaling matrix, typically defined as

S := diag(√λ_1, √λ_2, √λ_3) · Q^T, (6)

with λ the eigenvalues of the cumulant tensor C_2 (cf. (Kiselev, 2010), see Eq. (17) for further details) and Q a matrix of its eigenvectors, so that S^T · S = C_2. With this definition the scaling matrix S ensures that the lowest order term (k = 0) in the expansion (5) corresponds to the expected Gaussian behavior described by e.g. DTI. The Hermite basis functions Φ_{n1n2n3} are defined as

Φ_{n1n2n3}(q) := (i^k / √(2^k n_1! n_2! n_3!)) e^{−½‖q‖²} H_{n1}(q_1) H_{n2}(q_2) H_{n3}(q_3), (7)

with k as before and H_n the n-th order Hermite polynomial

H_n(q) := (−1)^n e^{q²} (dⁿ/dqⁿ) e^{−q²}. (8)

The corresponding expression for the basis that describes the displacement probability density function is

Ψ_{n1n2n3}(r) := ((2π)^{−3/2} / √(2^k n_1! n_2! n_3!)) e^{−½‖r‖²} H_{n1}(r_1) H_{n2}(r_2) H_{n3}(r_3), (9)

so that

P(r) = |det S|^{−1} Σ_{k=0,2,…,K} Σ_{n1+n2+n3=k} a_{n1n2n3} Ψ_{n1n2n3}(S^{−T} · r), (10)

with S^{−T} the transpose of S^{−1}. By making use of the aforementioned fact that Hermite functions are eigenfunctions of the Fourier transform, S and P can thus be described using the same coefficients a_{n1n2n3}. The above expressions are modified somewhat from the expressions given by Özarslan et al. (2013b), since we restrict ourselves to the case P(r) = P(−r) and use a different definition of the gradient wave vector q. Sum of squares constraints can be applied to MAP MRI fairly straightforwardly by defining the polynomial

P_m(r) := Σ_{k=0,2,…,K} Σ_{n1+n2+n3=k} (a_{n1n2n3} / √(2^k n_1! n_2! n_3!)) H_{n1}(r_1) H_{n2}(r_2) H_{n3}(r_3), (11)

whose non-negativity implies non-negativity of the MAP displacement probability density function, Eq. (10). This follows from the fact that the positive factors (2π)^{−3/2} e^{−½‖r‖²} in Eq. (9) and the factor |det S|^{−1} in Eq. (10) can be factored out, leaving non-negativity of the above polynomial as the sole requirement after a change of basis.

Fig. 1. A one dimensional example of a polynomial fit to an artificial data set (blue dots, generated by sampling f(x) = e^x − 2.5x on the range [−4, 4], 100 samples) under sum of squares constraints (red) and without any constraints (black). (a) Although the actual function is strictly positive, an unconstrained fit with a low order polynomial can become negative where the constrained fit remains non-negative. (b) In this noiseless example an unconstrained higher order polynomial fit will also be non-negative, in which case the constrained and unconstrained fits will be the same.
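The Hermite polynomials in Eq. (8) follow the physicists' convention, which is also the convention implemented in NumPy's `numpy.polynomial.hermite` module. The sketch below checks the first few polynomials and the standard three-term recurrence against that implementation.

```python
import numpy as np
from numpy.polynomial import hermite as H

# Physicists' Hermite polynomials, as in the Rodrigues formula
# H_n(q) = (-1)^n e^{q^2} d^n/dq^n e^{-q^2}; numpy's `hermite` module
# uses the same convention.
def hermite_n(n, q):
    coeffs = np.zeros(n + 1)
    coeffs[n] = 1.0
    return H.hermval(q, coeffs)

q = np.linspace(-2.0, 2.0, 21)
assert np.allclose(hermite_n(0, q), np.ones_like(q))   # H_0(q) = 1
assert np.allclose(hermite_n(1, q), 2 * q)             # H_1(q) = 2q
assert np.allclose(hermite_n(2, q), 4 * q**2 - 2)      # H_2(q) = 4q^2 - 2

# Three-term recurrence H_{n+1}(q) = 2q H_n(q) - 2n H_{n-1}(q):
for n in range(1, 6):
    assert np.allclose(hermite_n(n + 1, q),
                       2 * q * hermite_n(n, q) - 2 * n * hermite_n(n - 1, q))
```

The recurrence is also the practical way to evaluate the basis functions in Eqs. (7) and (9) for all orders up to K in a single pass.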

Spherical deconvolution
The second expression for S considered here is the spherical deconvolution model (Tournier et al., 2004). This model assumes that each voxel can be viewed as a large set of separate compartments, which are all indistinguishable except for their orientation. In this scenario the signal in a voxel is generated by the sum of the contributions from these individual compartments, whose orientations are distributed according to an orientation distribution function (ODF) U : S² → [0, ∞). By construction the signal can then be written as a spherical convolution with an axially symmetric convolution kernel K:

S(q) = ∫_{S²} K(q, q̂ · v̂) U(v̂) dσ(v̂), (12)

where the gradient magnitude q = ‖q‖ is typically fixed, where q̂ = q/q, and with dσ the standard Lebesgue measure on the sphere. The kernel K can be postulated, but is more typically estimated by combining data from different positions (Tournier et al., 2007; Tax et al., 2014) or computed from the local signal (Novikov et al., 2019). We may approximate the function U through a weighted sum of spherical harmonics Y_l^m:

U(v̂) ≈ Σ_{l=0,2,…,L} Σ_{m=−l}^{l} u_l^m Y_l^m(v̂), (13)

where the order L determines the number of expansion coefficients used, and thus determines the accuracy of the approximation. If we then approximate the convolution kernel K as Σ_{l=0,2,…,L} k_l^0(q) Y_l^0(v̂) (where q̂ is fixed to be the north pole so that axial symmetry permits a description using a smaller number of coefficients) then Eq. (12) can be evaluated analytically and we obtain the simplified form

S(q) ≈ Σ_{l=0,2,…,L} Σ_{m=−l}^{l} √(4π/(2l+1)) k_l^0(q) u_l^m Y_l^m(q̂). (14)

In the case of spherical deconvolution we may assume that the Fourier transform of the kernel itself is non-negative, and observe that non-negativity of P is then guaranteed by non-negativity of the orientation distribution function U (Tournier et al., 2007). Non-negativity of the spherical function U can be enforced using sum of squares constraints by defining the polynomial

P(v) := Σ_{l=0,2,…,L} Σ_{m=−l}^{l} u_l^m R_l^m(v), (15)

where R_l^m(v) := ‖v‖^l Y_l^m(v̂) are the solid harmonics (Abramowitz and Stegun, 1964). The function P as defined in Eq. (15) is a polynomial in the Cartesian coordinates of v, and because the restriction of this polynomial to the sphere (‖v‖ = 1) is equal to U (Eq. (13)), we find that a SOS condition on P guarantees non-negativity of U. Likewise, non-negativity of U implies that P is non-negative over its domain 0 ≤ ‖v‖ ≤ 1 by the maximum (minimum) principle for harmonic functions (Evans, 2010), and we thus find that U is non-negative if and only if P is non-negative on the unit sphere. The restriction of the non-negativity constraint on P to the ball ‖v‖ ≤ 1 still allows a tractable semidefinite programming formulation, as described for example by Magnani et al. (2005) using results from Schmüdgen (1991).
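The minimum-principle argument can be checked numerically. The sketch below uses a hypothetical two-term harmonic polynomial (an l = 0 constant plus an l = 2 solid harmonic, with illustrative coefficients) and verifies on random samples that no interior point of the unit ball falls below the boundary minimum.

```python
import numpy as np

# Numerical illustration of the minimum principle invoked above: a sum of
# solid harmonics is a harmonic polynomial, so its minimum over the unit
# ball is attained on the boundary sphere. As a toy example we take a
# constant (l = 0) term plus the degree-2 solid harmonic 2z^2 - x^2 - y^2,
# which is harmonic (its Laplacian is 4 - 2 - 2 = 0).
rng = np.random.default_rng(1)

def P(v):
    x, y, z = v[..., 0], v[..., 1], v[..., 2]
    return 0.5 + 2 * z**2 - x**2 - y**2

u = rng.normal(size=(20000, 3))
sphere = u / np.linalg.norm(u, axis=1, keepdims=True)   # points with |v| = 1
ball = sphere * rng.uniform(0, 1, size=(20000, 1))      # points with |v| <= 1

# On the sphere P reduces to 3z^2 - 0.5, whose exact minimum is -0.5 at the
# equator; no interior sample falls below this boundary minimum.
assert P(sphere).min() >= -0.5 and P(sphere).min() < -0.49
assert P(ball).min() >= -0.5
```

This is why checking non-negativity of P on the sphere alone suffices, even though the semidefinite formulation works with the polynomial on the full ball.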

Cumulant expansion
The final expression for S considered here is the cumulant expansion (Liu et al., 2004; Kiselev, 2010), which can be obtained from a Taylor series of log S. To ensure that this series exists, we assume that S has a well-defined moment generating function

M(q) := ∫_{ℝ³} P(r) e^{q · r} dr. (16)

Together with the properties mentioned in Section 2.1, this implies that the domain of S can be extended to complex-valued arguments (Hörmander, 1990), so that we can write for example M(q) = S(iq). The expansion coefficients C_{i1⋯ik} form the fully symmetric positive-definite k-tensors C_k := C_k(Δ) (the cumulant tensors), and the logarithm of the signal is then given by

log S(q) = Σ_{k=2,4,…,K} ((−1)^{k/2} / k!) Σ_{i1,…,ik} C_{i1⋯ik} q_{i1} ⋯ q_{ik}, (17)

where the diffusion time Δ is fixed and with the order K a positive even integer. The cases K = 2 and K = 4 are functionally identical to diffusion tensor imaging (Basser et al., 1994a, b) and diffusional kurtosis imaging (Jensen et al., 2005), where for K = 2 Eq. (1) can be evaluated analytically to obtain

P(r) = ((2π)³ det C_2)^{−1/2} e^{−½ r^T · C_2^{−1} · r}. (18)

The cumulant expansion converges in a bounded set containing the origin (Kiselev, 2010), so this model is typically considered for data sets acquired with fairly weak gradients (b ≲ 2.5 ms/μm²). In addition to having a finite convergence radius, it holds that truncated cumulant expansions with non-zero higher order (k > 2) coefficients never correspond to valid propagators. Combined, this means that the cumulant expansion cannot describe general propagators P in their entirety, and as a result it is impossible to impose our relaxed non-negativity constraint as is.
In lieu of enforcing non-negativity on P directly we therefore proceed with defining a set of necessary but not sufficient ancillary constraints, which must be satisfied in the case of non-negative P but will not in general guarantee non-negativity. These constraints are based on the convexity property of the cumulant generating function C(q) := log S(iq) that follows from non-negativity of P, and which asserts that

C(α q_1 + (1 − α) q_2) ≤ α C(q_1) + (1 − α) C(q_2)

for all q_1, q_2 ∈ ℝ³ and all 0 ≤ α ≤ 1. Convexity can be verified as follows starting from Eq. (1) and using Hölder's inequality, which applies only in the case of non-negative P:

S(i(α q_1 + (1 − α) q_2)) = ∫_{ℝ³} (P(r) e^{q_1 · r})^α (P(r) e^{q_2 · r})^{1−α} dr ≤ (∫_{ℝ³} P(r) e^{q_1 · r} dr)^α (∫_{ℝ³} P(r) e^{q_2 · r} dr)^{1−α} = S(iq_1)^α S(iq_2)^{1−α}. (19)

Taking logarithms on both sides gives the stated inequality. It then follows that every reconstructed model/expansion that corresponds to a non-negative displacement probability density function must also correspond to a convex cumulant generating function. The presented argument can be directly extended to include convexity of the moment generating function M(q) = S(iq).
Unlike the non-negativity constraint, convexity of C or M is naturally expressed in the signal domain, which obviates the need for a practical expression of the propagator. Noting that convexity of a (continuous, twice differentiable) function Q on ℝ³ is equivalent to the requirement that

h(x, s) := s^T · H_Q(x) · s ≥ 0 (20)

for all x, s ∈ ℝ³, where H_Q is the Hessian matrix containing the second order partial derivatives of Q, h is again a polynomial of degree 2d, and s is a dummy variable of the same dimensions as x, we find that a relaxed convexity constraint (Magnani et al., 2005) can be imposed as in (4) by defining a polynomial Q = M (moment generating function) or Q = C (cumulant generating function) and setting P_{m_ω} = h. The relaxed convexity constraint can thus be imposed on the cumulant expansion by applying sum of squares constraints on h as given in Eq. (20), with

h(x, s) = Σ_{k=2,4,…,K} (1 / (k − 2)!) Σ_{i1,…,ik} C_{i1⋯ik} s_{i1} s_{i2} x_{i3} ⋯ x_{ik}. (21)

Note finally that for the cases K = 2 and K = 4 the resulting relaxed constraints are in fact exact and coincide with the usual constraints, e.g. positive-definiteness of the diffusion tensor in the case of DTI.
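A one-dimensional sketch of the convexity constraint in Eq. (20) is instructive: for a hypothetical quartic Q(x) = a x² + b x⁴ (illustrative coefficients, not fitted values) the 'Hessian' is the scalar Q''(x) = 2a + 12b x², and convexity on all of ℝ requires a ≥ 0 and b ≥ 0.

```python
import numpy as np

# 1D analogue of the convexity constraint: for Q(x) = a x^2 + b x^4 the
# Hessian polynomial h(x, s) = Q''(x) s^2 has Q''(x) = 2a + 12 b x^2, which
# is non-negative for all x iff a >= 0 and b >= 0. Coefficients below are
# illustrative only.
def second_derivative(a, b, x):
    return 2 * a + 12 * b * x**2

xs = np.linspace(-3.0, 3.0, 61)

# a, b >= 0: the (relaxed) convexity constraint holds everywhere.
assert np.all(second_derivative(1.0, 0.5, xs) >= 0)

# A negative fourth-order coefficient, as can easily result from an
# unconstrained fit, violates convexity at large |x| even though the
# function looks unproblematic near the origin.
assert np.all(second_derivative(1.0, -0.5, np.array([0.0, 0.1])) > 0)
assert np.any(second_derivative(1.0, -0.5, xs) < 0)
```

This mirrors the behavior seen in practice: unconstrained cumulant fits often appear reasonable at low b-values while violating the constraints further from the origin.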

Implementation
In our experiments we always take m_ω in Eqs. (2) and (4) to be linear in ω so that we can use linear least squares. In the case of the cumulant expansion this requires linearizing both the model and the measurements, as well as the introduction of weights in Eq. (2) to stabilize the measurement variances (Salvador et al., 2005). An initial fit is computed using (iteratively reweighted) linear least squares, and constrained optimization is only performed if the initial fit does not already satisfy the constraints, cf. Section 2.2.
The unconstrained optimum ω*, i.e., the vector of parameters that solves Eq. (2), is used to reduce the complexity of the corresponding constrained optimization problem as described in Appendix A. We further simplify the constrained optimization problems by merging the constraints in Eq. (4) as follows. For a given polynomial P_{m_ω} (Section 2.3) and a given basis e, we parameterize the space of potential A in Eq. (4) as

A(ω, α) = H(ω) + L(α),

where H is any symmetric matrix satisfying e^T · H · e = P_m, and L(α) is any parameterization of the linear space of symmetric matrices satisfying e^T · L · e = 0 with parameters α.
The matrix L(α) is typically very sparse, with its sparsity pattern depending on the choice for the basis e. Although it is possible to use for example a Hermite polynomial basis for e, we always use a monomial basis (Löfberg, 2009) in the experiments presented here. The resulting semidefinite programs are solved using MOSEK 8.1.0.72 (MOSEK ApS, 2018). We compare the proposed semidefinite programming reconstruction of the MAP model, MAP+, to the weakly constrained reconstruction implemented in TORTOISE 3.1.3 (Irfanoglu et al., 2017; Pierpaoli et al., 2010), which enforces non-negativity of the propagator in a large set of individual points in the neighborhood of the origin as described in the original MAP paper (Özarslan et al., 2013b). The spherical deconvolution (CSD+) experiments are compared to the standard constrained spherical deconvolution (CSD) (Tournier et al., 2007) implementation provided by MRtrix 3, which promotes non-negativity through soft constraints as described in the work of Tournier et al. (2007). The convolution kernel K (the single fiber response function) is estimated with the tournier method implemented in MRtrix (Tournier et al., 2013; Tax et al., 2014). The sum of squares constrained reconstructions for CSD+ are computed using the same single fiber response functions as the weakly constrained SD implementation, but in the case of MAP+ we had to recompute the scaling tensor because this information was not provided directly by TORTOISE. The constrained cumulant expansion (CE+) reconstructions are only compared to unconstrained cumulant expansion fits.
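The merged parameterization A(ω, α) = H(ω) + L(α) can be illustrated with a small hypothetical example. For the one-dimensional quartic (x² − 1)² with monomial basis (1, x, x²), the space of matrices L with e^T · L · e = 0 is one-dimensional, and scanning the single parameter α already finds a positive semidefinite Gram matrix, certifying the SOS property. All matrices below are illustrative.

```python
import numpy as np

# Illustration of A(w, alpha) = H(w) + L(alpha) for the 1D quartic
# P(x) = x^4 - 2x^2 + 1 with monomial basis e(x) = (1, x, x^2). H is one
# particular symmetric matrix with e^T H e = P; L spans the (here
# one-dimensional) space of symmetric matrices with e^T L e = 0.
H = np.array([[1.0,  0.0, 0.0],
              [0.0, -2.0, 0.0],
              [0.0,  0.0, 1.0]])
L = np.array([[0.0,  0.0, 1.0],
              [0.0, -2.0, 0.0],
              [1.0,  0.0, 0.0]])   # e^T L e = 2x^2 - 2x^2 = 0

def is_psd(A):
    return np.linalg.eigvalsh(A).min() >= -1e-12

# H itself is not PSD, but the Gram representation is not unique: scanning
# the free parameter alpha finds a PSD matrix, certifying that P is SOS.
assert not is_psd(H)
feasible = [a for a in np.linspace(-2, 2, 401) if is_psd(H + a * L)]
assert len(feasible) > 0             # alpha = -1 gives a PSD Gram matrix
```

In the actual semidefinite programs the search over α (and ω) is of course performed by the solver rather than by a grid scan; the scan here only demonstrates that the feasible set is non-empty.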

Data
The MAP MRI experiments presented here make use of a subject in the MGH Human Connectome Project (HCP) adult diffusion database. This data set has 1.5 mm isotropic voxels, and b = {1, 3, 5, 10} ms/μm² shells with {64, 64, 128, 256} measurements respectively, as well as 10 b = 0 ms/μm² images. Further details can be found in the references (Setsompop et al., 2013).
For the spherical deconvolution experiments we use images of a single subject in the WU-Minn HCP (Van Essen et al., 2013). The data has an isotropic voxel size of 1.25 mm, with 90 approximately uniformly distributed orientations acquired for each of the three shells, b = {1, 2, 3} ms/μm², and 18 baseline images. We will only make use of the b = 3 ms/μm² shell.
For the cumulant expansion experiments we rely on the MASSIVE data set made available by Froeling et al. (2017), which consists of a large number of single-subject measurements distributed over five shells and two Cartesian grids. We generate randomly downsampled data sets for both the Cartesian and the spherically sampled subsets. For a given order K, random samples are generated from the shells with b = 1, 2, …, K/2 ms/μm², and in the Cartesian case we pick random samples such that b < K/2 ms/μm² always. The number of samples is a multiple (the 'sampling rate') of the number of degrees of freedom Σ_{i=0}^{K/2} (i + 1)(2i + 1), excluding four baseline images.
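The degree-of-freedom count that sets the sampling rate can be computed directly; each term (i + 1)(2i + 1) is the number of independent components of a fully symmetric order-2i tensor in three dimensions.

```python
# Number of free parameters of the order-K cumulant expansion used to set
# the sampling rate: sum_{i=0}^{K/2} (i+1)(2i+1), counting the independent
# components of the fully symmetric tensors of orders 0, 2, ..., K.
def cumulant_dof(K):
    assert K % 2 == 0 and K >= 0
    return sum((i + 1) * (2 * i + 1) for i in range(K // 2 + 1))

# K = 2 (DTI): 1 + 6 = 7; K = 4 (DKI): 7 + 15 = 22; K = 6: 22 + 28 = 50.
print([cumulant_dof(K) for K in (2, 4, 6)])  # -> [7, 22, 50]
```

For K = 4 and a sampling rate of 10 this amounts to 220 diffusion-weighted samples, in line with a high quality clinical acquisition.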
To get an idea of the importance of an optimized preprocessing pipeline, we use artificial data sets created by adding Gaussian noise to the function values estimated from maximum-K MAP+, CSD+, and CE+ fits. For the CE+ experiments we use a shell-based scheme with a sampling rate of 10, and for the others we simulate data using the same acquisition schemes as described above. The resulting data sets have ideal noise characteristics, so they can be viewed as the output of a perfect bias correction and artifact removal algorithm. Furthermore, these data sets can act as ground truth data sets for validation purposes, under the assumption that the considered model and its reconstruction are approximately valid for the measured data. We set the standard deviation of the artificial Gaussian noise based on a robust estimate of the standard deviation in the residuals for each diffusion-weighted image (1.4826 times the median absolute deviation, cf. Rousseeuw and Croux, 1993).
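The robust scale estimate is straightforward to compute; the sketch below (with synthetic residuals, illustrative only) shows that it is insensitive to a small fraction of gross outliers, unlike the sample standard deviation.

```python
import numpy as np

# Robust noise estimate used for the artificial data: 1.4826 times the
# median absolute deviation (MAD) of the residuals, a consistent estimator
# of the standard deviation under Gaussian noise.
def robust_sigma(residuals):
    r = np.asarray(residuals)
    return 1.4826 * np.median(np.abs(r - np.median(r)))

rng = np.random.default_rng(2)
r = rng.normal(0.0, 0.05, size=100000)   # Gaussian residuals, sigma = 0.05
r[:100] = 10.0                           # a few gross outliers (0.1%)

# The MAD-based estimate stays close to the true sigma despite the
# outliers, whereas the plain standard deviation is inflated by them.
assert abs(robust_sigma(r) - 0.05) < 0.005
assert r.std() > 0.1
```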

The prevalence of constraint violations
Sum of squares optimization can be used to enforce non-negativity constraints on the propagator for a number of common models (Section 2.3), but comes at significant computational costs (Ahmadi and Majumdar, 2019). In this first set of experiments we investigate the necessity of these constraints in the generic case, i.e., in the case where we have no application-specific arguments to favor speed over reconstruction accuracy. The general methodology that we follow is similar to the work of Veraart et al. (2013): we use a high quality data set suitable for the specific model under consideration, perform an unconstrained or weakly constrained reconstruction, and then determine in what percentage of voxels in this data set we can prove that all constraints are satisfied.
For MAP MRI and spherical deconvolution we compare three implementations: an unconstrained least squares fit of the parameters, the commonly used weakly constrained quadratic programming reconstructions (Özarslan et al., 2013b; Tournier et al., 2007) that enforce non-negativity on a finite set of points, and the here proposed semidefinite programming reconstruction. Fits produced by the newly proposed method always satisfy the constraints. For the cumulant expansion we compare only fully constrained and fully unconstrained reconstructions for different orders. An overview of the percentages of voxels with violated constraints is presented in Table 1, and Fig. 2 shows some examples of the spatial distribution of voxels with unsatisfied constraints for the different models. These results show quite clearly that violated constraints are commonplace in all models and orders except for the K = 2 cumulant expansion (DTI). The commonly used K = 8 MAP and CSD reconstructions produce violated constraints in essentially all voxels.

Impact on applications
Although the results of the previous subsection are quite alarming, a noteworthy caveat in their interpretation is that they do not reflect the severity of the violated constraints. It might be that the negative parts of the propagator have a very small magnitude, or that violations occur in regions that are not necessarily of interest for a specific application. In the case of spherical deconvolution one might for example be interested only in the directions of the ODF maxima, and based on e.g. a visual comparison of ODFs (Fig. 3) one could be tempted to argue that these directions could be obtained reliably under weaker constraints. If it is a priori known that (mild) violations of the constraints do not cause problems in a specific pipeline, then it could be acceptable (but not recommended!) to use unconstrained optimization. To illustrate how the effects of violated constraints can propagate in a given analysis pipeline, we examine this propagation for two typical applications of our models.

Table 1. The percentage of voxels in the brain where the constraints derived in Section 2 are satisfied for different reconstruction methods. Percentages are computed relative to the total number of voxels inside the mask (439 248 for MGH HCP, 936 256 for WU-Minn HCP, 81 423 for MASSIVE). All cumulant expansion computations were performed on a shell-based acquisition (of the appropriate b-values) in the subsampled MASSIVE data (sampling rate 10), corresponding approximately to a high quality clinical acquisition. The standard MAP and CSD results were obtained by verifying the constraints in reconstructions generated by TORTOISE and MRtrix as described in Section 2.4, and the unconstrained reconstructions for all models were computed with custom code. (The unconstrained and standard MAP are based on different scaling tensors, while all SD experiments are performed with the same response function. TORTOISE failed in a number of voxels; these are not counted as violations in this table.) The '+' postfix is used to indicate a reconstruction that imposes sum of squares constraints. This method by definition does not generate results with violated constraints. We have highlighted a number of commonly used configurations: MAP order 8, CSD order 8, and CE orders 2 (DTI) and 4 (DKI). (We limit K to values that we consider to currently be of interest in the literature, but in principle the proposed techniques can be applied to models of any order.)

Fig. 2. The spatial distribution of voxels where constraints are not satisfied for different orders K, for a standard MAP reconstruction (top row), a standard CSD reconstruction (middle row), and an unconstrained CE reconstruction (bottom row). The cumulant expansion reconstruction was computed for a subsampled version (sampling rate 10) of the shell data in the MASSIVE data set. Voxels are colored red if they lie inside the brain mask and if the constraints are not satisfied in the stated reconstruction, and in the background we show fractional anisotropy (MAP) or T1-weighted images. Our results follow the expected trend where the number of voxels with violated constraints generally increases with K for each model, although MAP at K = 4 has a surprisingly low number of violated constraints in the more isotropic gray matter. Overall, voxels with violated constraints appear to be concentrated in more anisotropic regions and near tissue interfaces.
For the cumulant expansion and for MAP MRI we consider the effect of violated constraints on various scalar measures, which are increasingly being used, for example, as summary statistics in population studies. For MAP MRI these measures are the return-to-origin probability (RTOP) and the non-Gaussianity (NG), and for the cumulant expansion we look at the mean kurtosis (MK; K = 4). Common scalar measures for the K = 2 cumulant expansion (DTI), like fractional anisotropy or mean diffusivity, are not considered because the unconstrained reconstruction for K = 2 is practically identical to the constrained reconstruction (recall Table 1). The precise definitions of these scalar indices can be found in the references (Özarslan et al., 2013b; Basser and Pierpaoli, 1996; Jensen et al., 2005; Jensen and Helpern, 2010).
The differences between strictly and weakly constrained reconstructions can already be appreciated visually in the case of MAP MRI, cf. Fig. 4. The impact of a constrained reconstruction varies between scalar indices, with larger deviations materializing in the RTOP measure than in the NG measure. The differences in RTOP are also much more clearly confined to a specific tissue, whereas the differences observed in NG are more homogeneous.
For the cumulant expansion the visual differences between constrained and unconstrained fits are less pronounced, simply because the fits are the same in most voxels. We therefore consider only the 13 106 voxels in our data set where constraints are violated (Fig. 5). Again we observe significant differences between the constrained and unconstrained fits, and, remarkably, we find that all MK values derived from the constrained fit lie in the expected biological range of [0, 3]. Note that this improvement over the unconstrained case is really a reflection of the improved accuracy associated with properly constrained fits; the algorithm does not actually enforce the MK to lie within this range.
Fig. 4. A comparison between scalar measures computed from strictly constrained (left) and weakly constrained reconstructions (middle), together with their relative differences (right). The return-to-origin probability plots are scaled so that white corresponds to an RTOP^(1/3) value of 265 mm^(-1), and in the non-Gaussianity plots white corresponds to an NG of 0.76. (The acquisition parameters are not tailored to examine regions with very large diffusivity, so localized overfitting to Rician noise gives rise to abnormally high NG values in the cerebrospinal fluid.) The difference plots show the absolute differences between the figures in the left and middle columns, relative to the strictly constrained values shown on the left (black is 0, white is 0.5). RTOP appears to be more sensitive to constraint violations than NG, with NG differences distributed rather evenly compared to the RTOP differences, which are focused mainly in white matter. The red arrowhead points to a region where unphysical non-Gaussianity values in the unconstrained map result in black voxels.

As a second application we consider whether constrained reconstruction is necessary if one wishes to do tractography (Jbabdi and Johansen-Berg, 2011), which is commonly based on spherical deconvolution. For now, we are solely interested in seeing whether there will be any significant effect at all when applying more stringent constraints, so we consider first what happens with the orientation distribution function as a whole (relevant for e.g. probabilistic streamline (Friman et al., 2006; Lazar and Alexander, 2005; Berman et al., 2008; Sherbondy et al., 2008; Parker and Alexander, 2005; Hosey et al., 2005), geodesic (Schober et al., 2014; Fuster et al., 2016; O'Donnell et al., 2002; Pichon et al., 2005; Melonakos et al., 2008; Péchaud et al., 2009; Sepasian et al., 2014, 2012), and (to a lesser extent) global tractography approaches (Jbabdi et al., 2007; Christiaens et al., 2015; Reisert et al., 2011, 2014)), as well as what happens with the orientation of its peaks (relevant for e.g. deterministic streamline tractography (Conturo et al., 1999; Wedeen et al., 2008; Basser et al., 2000; Mori et al., 1999)). These results are shown in Figs. 6 and 7, and highlight the severe differences in function values and main orientations that result from small changes in the fitted parameters (see Fig. 8 for a histogram comparing the sum of squared residuals between CSD+ and CSD). These large differences are very likely to have a significant effect on the outcome of common tractography algorithms, in particular because local deviations typically accumulate.

Sampling and preprocessing
In the previous subsections we have used minimally processed data, as is typical in standard analysis pipelines. In this section we build on these results to evaluate the importance of improved preprocessing (bias correction, Gibbs ringing correction, etc.) and the effect of different acquisition design choices (sampling rate and shell- versus grid-based acquisitions). For the former we rely on the artificial data described in Section 2.5, and again look at the percentage of voxels where the unconstrained reconstruction results in constraint violations. For the latter we perform an unconstrained cumulant expansion reconstruction for different subsampled versions of the MASSIVE data set, and then check in what percentage of the voxels in the brain mask the constraints are not satisfied.
The percentages of violated constraints in the artificial data are shown in Table 2. Except for some unexpected differences in the MAP percentages obtained with TORTOISE, these results do not change much compared to the results on in vivo data (Table 1). Overall we find that reconstructions using artificial data are slightly worse than the reconstructions based on the original data, which is likely due to a slight overestimation of the noise standard deviation. As we have access to the ground truth parameters in these artificial data sets, it is possible to demonstrate how much fitting accuracy is actually improved by the proposed constraints, see e.g. Fig. 9. Likewise, we get closer to the true values of quantities derived from the estimated reconstructions, such as the orientations of the dominant peaks from CSD (Fig. 10). Fig. 11 shows graphs of the number of constraint violations observed in CE reconstructions for different numbers of measurements, where data points are computed as the mean numbers of voxels with unsatisfied constraints in 10 randomly generated data sets, recall Section 2.5. As expected, we observe a decreasing trend when increasing the sampling rate, and it turns out that this decrease is significantly more rapid in the case of spherical sampling. Voxels where the constraints are not satisfied appear to be concentrated near interfaces between tissue types (white matter/gray matter/cerebrospinal fluid) in all cases (Fig. 12).

Computational cost
Fig. 6. A plot showing the L² norm of the differences between the ODFs computed with weakly constrained spherical deconvolution and those computed with the proposed semidefinite programming reconstruction, for the same slice as shown in Fig. 2, together with a histogram of the differences over the whole data set. The differences are shown relative to the L² norm of the semidefinite programming reconstruction, with white pixels corresponding to a difference of 100%. Although there is some structure visible in the difference map, significant differences can be observed quite randomly throughout the brain. The large differences visualized here can be explained to some extent by overall size differences between the strictly and weakly constrained reconstructions, recall e.g. Fig. 3.

The sum of squares formulation has many advantages, the foremost being guaranteed non-negativity (Löfberg, 2009) and wide applicability. Its main disadvantage, however, is that it scales quite badly with model complexity, see e.g. the discussions in the work by Ahmadi and Majumdar (2019). This can be seen clearly in Table 3, which contains the wall clock running times for the reconstructions used in the previous subsections. Note that the number of parameters, the number of data points, the number of iterations needed to stabilize the measurement variance, the exact formulation of the constraints, and the number of voxels that require constrained optimization all have an impact on the timings. Although timings increase quickly with model order, we can conclude based on these timings that current techniques are sufficient for state-of-the-art acquisitions, even using our current implementation prototype, which relies on suboptimal MOSEK calls generated through an interface with Wolfram Mathematica (Wolfram, 2003).
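To get a rough feel for this scaling, note that in a generic SOS certificate the Gram (PSD) matrix is indexed by the monomials of degree at most K/2, so its side length grows combinatorially with the model order. The sketch below assumes a generic dense monomial basis in three variables; the formulations used in this work may restrict the basis further.

```python
from math import comb

def gram_side(K, n_vars=3):
    """Side length of the Gram (PSD) matrix in a generic SOS certificate:
    the number of monomials of degree <= K/2 in n_vars variables."""
    d = K // 2
    return comb(d + n_vars, n_vars)

for K in (2, 4, 6, 8):
    print(K, gram_side(K))   # 2 4, 4 10, 6 20, 8 35
```

Since interior-point SDP solvers scale polynomially (with a large exponent) in this matrix dimension, the jump from K = 2 to K = 8 is far more costly than the roughly ninefold growth in matrix size alone suggests.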
Table 1 clearly highlights how crucial constrained reconstruction is under typical experimental conditions: only for the Gaussian diffusion model (the K = 2 cumulant expansion, equivalent to DTI) do we obtain an acceptable percentage of voxels with satisfied constraints. For higher order cumulant expansion reconstructions, as well as for MAP MRI and spherical deconvolution, violated constraints are extremely prevalent. In these cases, regions of high anisotropy in the white matter as well as tissue interfaces seem particularly susceptible (Fig. 2). As the brain white matter is the main object of study in for example tractography, these results strongly imply that non-negativity ought to be enforced through strict constraints to ensure that subsequent data analyses are reliable.

The necessity of strict constraints
We also found that the number of constraint violations in real data remains largely the same if we consider artificial data with ideal noise characteristics (Table 2). From this we conclude that the observed violations are not likely to be caused to any significant extent by errors in our implicit modeling assumptions (Gaussian noise, no artifacts, etc.), but are instead caused predominantly by measurement noise. However, this is partly a reflection of the quality of the public data that we have used in our experiments, and we can assume that in noisier or corrupted data further constraint violations will likely be present.
In the case of appropriately processed, high quality data, it follows that the only effective strategies to decrease the number of constraint violations must necessarily reduce the effective power of the measurement noise, and we can see in Fig. 11 that optimally distributed samples and an increased number of measurements are both advantageous in this regard.

Fig. 7. A comparison between the peaks estimated from standard CSD and from CSD+ in white matter. (a) A plot and histogram of the angular differences between the dominant peak orientations. Because dominant peaks are defined as the orientations where the ODF is maximal, secondary peaks occasionally present as dominant due to noise; this results, for example, in the relatively high number of voxels with angular differences around 90°. (b) The difference with the neighboring peak found through a local maximum search in the weakly constrained reconstruction, initialized at the location of the dominant peak of the strictly constrained reconstruction. The typical angular difference between the two types of constrained reconstructions is only a few degrees, but there are many cases where the angular difference is much larger; differences greater than 5° can be observed in about half of the voxels inside the mask. We can reasonably expect that less prominent (secondary) peaks will suffer from larger deviations than those observed here. Because local errors accumulate in most common tractography methods, a lack of strict constraints will surely have a significant impact in this application.
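The angular comparisons above rely on the antipodal symmetry of fiber orientations (an ODF peak at v represents the same orientation as one at -v). The paper does not spell out the computation, but a plausible sketch of the angular difference between two peak orientations is:

```python
import numpy as np

def angular_difference_deg(u, v):
    """Angle (degrees) between two fiber orientations, treating v and -v
    as the same orientation (antipodal symmetry of the ODF)."""
    u = np.asarray(u, dtype=float) / np.linalg.norm(u)
    v = np.asarray(v, dtype=float) / np.linalg.norm(v)
    c = min(1.0, abs(float(np.dot(u, v))))   # clip rounding noise above 1
    return float(np.degrees(np.arccos(c)))

print(angular_difference_deg([1, 0, 0], [-1, 0, 0]))  # 0.0: same orientation
print(angular_difference_deg([1, 0, 0], [0, 1, 0]))   # 90.0: orthogonal
```

With this convention the maximum possible difference is 90°, which is why a noise-induced swap between a dominant and a secondary peak shows up as a cluster near 90° in the histograms of Fig. 7.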
However, if we extrapolate the plotted trends, it becomes clear that the number of measurements required to obviate the need for a constrained reconstruction quickly becomes untenable. For a K = 4 cumulant expansion (DKI) we would need hundreds of measurements (not unreasonable in a modern preclinical setting, although infeasible in a clinical one), but for K = 6 this number already runs into the thousands. Considering Table 1 we can expect other non-Gaussian models to require similarly large numbers of measurements, and it therefore seems likely that constrained reconstruction will be required in many cases, even taking into account possibly significant future gains in acquisition speed and quality.

Consequences for applications
While constraint violations should typically be avoided, their occurrence is not necessarily fatal in practical applications. We therefore considered a number of example scenarios, and the proposed sum of squares constrained reconstruction method proved to offer considerable practical benefits in all cases. Parameters obtained through constrained optimization differ significantly from the standard weakly constrained and unconstrained reconstructions, with the estimation accuracy improving noticeably in the constrained case (Figs. 5, 9, and 10). Correspondingly, the constrained fits produce more accurate estimates of derived scalar measures, orientation distribution functions, and dominant fiber orientations.
In the case of spherical deconvolution the favorable effects resulting from this accuracy gain can be appreciated immediately from the reduced incidence of spurious peaks (Fig. 3; similar observations were reported by e.g. Cheng et al. (2014) for other strict constraints). As spurious peaks and other inaccuracies in local orientation estimates are detrimental to the performance of, for example, tractography and connectivity algorithms, we expect that enforcing these constraints will have a significant effect on such applications.

Fig. 8. A plot and a histogram of the differences between the sum of squared residuals for CSD+ and standard CSD, shown as percentages relative to the sum of squared residuals of CSD+. Negative values correspond to residuals where CSD+ outperforms CSD, and this is the case in the vast majority of the voxels. This trend was actually unexpected to us, since the SOS constraints are more strict than the constraints employed in CSD, but it can be reasonably explained by implementation-specific issues unrelated to the theoretical performance of the methods. Unconstrained reconstructions have slightly lower residuals still compared to the SOS constrained fits (typically less than 1% relative difference). Based on these observations, we can be confident that our fitting quality is acceptable. In the plot on the left, white corresponds to an absolute difference of 2%, black corresponds to 0%, and the slice is again the same as in Fig. 2. In the case of CSD shown here we clearly find that the largest deviations occur in the highly anisotropic white matter regions.

Table 2. The percentage of voxels in the brain where constraints are satisfied for different reconstruction methods under ideal noise conditions. The artificial data used to compute these results is based on the maximum-K MAP+, CSD+, and CE+ reconstructions obtained in Table 1, and is generated by adding Gaussian noise to the function values computed for those constrained reconstructions using the gradient schemes described in Section 2.5. The standard deviation of the artificial noise is estimated from the residuals in the in vivo data. We have again highlighted commonly used configurations in the table: MAP order 8, CSD order 8, and CE orders 2 (DTI) and 4 (DKI).
For MAP MRI and for the cumulant expansion, our constraints turn out to provide a more natural solution than explicit biologically motivated constraints (e.g. the assumed absence of negative mean kurtosis, cf. Veraart et al., 2011) to the problematic 'black voxels' that are often seen in scalar maps, see for example Fig. 4. Furthermore, we found that although these scalar values clearly still deviate from their correct values when computed under weaker constraints, they are often not obviously wrong, i.e., they typically do not lie outside the range of physically plausible values (Fig. 5). It would thus be difficult to detect these deviations without a properly constrained reconstruction, which highlights the fact that simpler constraints such as the linear/quadratic programming constraints considered by e.g. Veraart et al. (2011) cannot be considered viable alternatives. Note that if there are reasons to enforce such constraints in a specific application, they could still be incorporated straightforwardly in the proposed framework as additional constraints.

Fig. 10. (a) A histogram of the L² norm of the differences between constrained reconstructions and ground truth functions based on artificial data, computed for CSD+ and CSD reconstructions and shown relative to the L² norm of the ground truth functions. (b) A histogram of the angular differences between the reconstructed dominant peak and the ground truth in artificial data, computed for CSD+ and CSD reconstructions in white matter. In both cases we see clearly that the strict constraints imposed in CSD+ contribute to more accurate estimates of function values and peak orientations.

Fig. 11. These plots show the percentage of voxels where the reconstructed model parameters (cumulants up to order K) do not satisfy the constraints derived in Section 2.2. The parameters used here are computed directly from the subsampled MASSIVE data set described in Section 2.5, with no additional preprocessing performed. This percentage decreases as more samples are included, with space-filling (Cartesian) sampling schemes consistently producing more 'incorrect' voxels than spherically sampled data sets. Sampling rates of 15/10/6 respectively correspond approximately to WU-Minn HCP quality data. The error bars show the estimated standard deviations.

Fig. 12. The spatial distribution of voxels where constraints are not satisfied when using an unconstrained reconstruction of the cumulant expansion. The figures are generated using subsampled data sets with a sampling rate of 15 for K = 4 and 45 for K = 6, with the foreground color indicating the number of data sets where the reconstruction failed (varying from black to red, with red indicating that none of the reconstructions satisfied the non-negativity constraints). The orders and data sets used in generating each figure are indicated in the titles. A T1-weighted image is shown in the background for anatomical reference.

Future work
The constrained optimization approach we proposed here extends straightforwardly to a large number of other (linearizable) models, including generalized DTI (as previously considered in related works (Barmpoutis et al., 2009, 2007; Chen et al., 2013)) and multi-shell multi-tissue spherical deconvolution approaches, and we expect that the inclusion of the proposed constraints will offer similar advantages in those cases. Sum of squares techniques can also be used to impose constraints other than non-negativity, as we have seen in the CE case where we essentially enforce convexity, although identifying other useful constraints may prove challenging. Finally, it is at least theoretically possible to extend our approach to problems with non-linear objective functions. This includes sum of squares constrained non-linear least squares reconstructions, for example of many common multi-compartment models (Panagiotaki et al., 2012), as well as maximum likelihood estimation based on a Rician noise model. This may be especially valuable for models that rely on inherently noisier large gradient strength acquisitions, although the running time will likely increase significantly. On the other hand, there is an ongoing effort to improve the scalability of sum of squares optimization techniques (Permenter and Parrilo, 2019; Löfberg and Parrilo, 2004; Ahmadi and Majumdar, 2019), and we generally expect that the benefits of constrained optimization will continue to outweigh the computational costs. For specific applications, we can also investigate whether existing methods capable of strictly enforcing non-negativity (Cheng et al., 2014; Qi et al., 2010; Chen et al., 2013; Schwab et al., 2012; Barmpoutis and Vemuri, 2010) can produce comparable constrained reconstructions at a reduced computational cost.
Square root representations (Cheng et al., 2014) and linear SOS approximations (Barmpoutis and Vemuri, 2010), for example, can generally be expected to be faster than the proposed approach, although care must be taken to ensure that these methods can accurately represent all possible data.
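The square root idea can be sketched in a few lines: instead of constraining a fitted function f to be non-negative, one parameterizes f = g² and fits g with a non-linear least squares solver, so non-negativity holds by construction. A minimal one-dimensional illustration with hypothetical data (not a diffusion model, and not the specific formulation of Cheng et al.):

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 50)
y = (1.0 + x) ** 2 + 0.01 * rng.standard_normal(x.size)  # noisy >= 0 data

def residuals(c):
    g = np.polyval(c, x)       # model f(x) = g(x)^2, non-negative by design
    return g ** 2 - y

fit = least_squares(residuals, x0=np.array([1.0, 1.0]))
f_hat = np.polyval(fit.x, x) ** 2
print(f_hat.min() >= 0)        # True: the representation enforces this
```

The caveat noted above applies even in this toy setting: a low-degree g can only represent perfect squares of that degree, so not every non-negative function is reachable, and the fit is non-convex in the parameters of g.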

Conclusion
In this work we have proposed to use sum of squares optimization to enforce non-negativity of the displacement probability density function in diffusion MRI model reconstruction problems. This versatile approach was developed for three commonly used and quite different classes of models, and can likely be extended to several others. To show the practical benefit and necessity of our approach we investigated how prevalent constraint violations are in different scenarios, and how these violations impact different applications. Based on these results we can draw three conclusions. First, diffusion-weighted MRI parameter estimation schemes that do not use constraints, or that only enforce weak constraints, will routinely produce physically impossible parameter values. A large segment of current approaches to (constrained) model reconstruction is thus insufficient in practice, and should only be trusted in very specific cases. Second, these unphysical fits are likely to have a significant effect in practice, as they occur in clinically relevant regions and can introduce large deviations from the more reliable constrained result. Finally, we note that these constraint violations appear for almost all tested models (the exception being the DTI model), and that improvements to the acquisition pipeline are unlikely to resolve this issue sufficiently. Weaker constraints, bias correction, artifact removal, additional measurements (i.e. noise reduction), and even localized lower model orders: none of these steps were found to offer adequate, practicable solutions. Considering this, it is our recommendation that the weak constraints used in essentially all current implementations be replaced with the strict constraints proposed here, even taking into account that this will inevitably add some computational cost and complexity to those implementations.
We note that computational costs will be relatively small for models where constraint violations are relatively infrequent, and that these costs are rendered insignificant by the absolute necessity for proper constraints in other cases.

Table 3. Approximate wall clock running times for constrained reconstructions based on different models. All computations were done with an implementation prototype on an Intel Core i7-7600U CPU @ 2.80 GHz × 4 laptop. The cumulant expansion computations were performed on a shell-based acquisition in the subsampled MASSIVE data as before (sampling rate 10 for all K), but it should be noted that because of the result obtained in Appendix A the effect of the number of data points on these timings is relatively minor; it is instead the increasing complexity of the constraints that leads to very large semidefinite programs. The time spent on computing the response function in the case of CSD+ and of the scaling matrix in the case of MAP+ is not included in the reported values, but the unconstrained reconstructions that are used to initialize all '+' reconstructions and the checks for constraint violations are counted. The total number of voxels inside the mask for each data set is as follows: 439 248 for MGH HCP, 936 256 for WU-Minn HCP, 81 423 for MASSIVE. We include the timings of our custom unconstrained implementation for comparison.

Acknowledgements
Tom Dela Haije and Aasa Feragen were supported by the Center for Stochastic Geometry and Advanced Bioimaging and by a block stipendium, both funded by the Villum Foundation (Denmark). Evren Özarslan was supported by the Linköping University Center for Industrial Information Technology (CENIIT, Sweden) and by VINNOVA/ITEA3 17021 IMPACT (Sweden/Netherlands). The authors would also like to acknowledge Schloss Dagstuhl and the organizers of Dagstuhl seminars 16142 and 18442 for facilitating discussions that improved the quality of this paper.
Anton Mallasto and Line Kühnel are acknowledged for their help in formalizing the complexity reduction technique presented in Appendix A. The authors thank Chantal Tax and Alard Roebroeck for their suggestions regarding the alternative preprocessing pipeline and ground truth experiments, and Andrea Fuster and Luc Florack are acknowledged for their contributions to preliminary works focused on the cumulant expansion.