Stable Deep MRI Reconstruction Using Generative Priors

Data-driven approaches recently achieved remarkable success in magnetic resonance imaging (MRI) reconstruction, but integration into clinical routine remains challenging due to a lack of generalizability and interpretability. In this paper, we address these challenges in a unified framework based on generative image priors. We propose a novel deep neural network based regularizer which is trained in a generative setting on reference magnitude images only. After training, the regularizer encodes higher-level domain statistics which we demonstrate by synthesizing images without data. Embedding the trained model in a classical variational approach yields high-quality reconstructions irrespective of the sub-sampling pattern. In addition, the model shows stable behavior when confronted with out-of-distribution data in the form of contrast variation. Furthermore, a probabilistic interpretation provides a distribution of reconstructions and hence allows uncertainty quantification. To reconstruct parallel MRI, we propose a fast algorithm to jointly estimate the image and the sensitivity maps. The results demonstrate competitive performance, on par with state-of-the-art end-to-end deep learning methods, while preserving the flexibility with respect to sub-sampling patterns and allowing for uncertainty quantification.


Introduction
A vast literature is dedicated to reducing examination time in magnetic resonance imaging (MRI) while retaining the diagnostic value of the resulting images. On the hardware side, parallel imaging (PI) [1] exploits spatially varying sensitivity maps of coil arrays. Such PI hardware has become standard in clinical systems, but noise amplification limits the potential speed-up with classical reconstruction techniques [2]. On the algorithmic side, compressed sensing (CS) theory [3] and variational approaches enable greater acceleration under the assumption that the reconstruction has a sparse representation in some basis. Often, this basis is hand-crafted and only reflects rudimentary aspects of the prior information in the underlying distribution. As a prominent example, the total variation (TV) assumes sparsity in the spatial image gradients (which translates to piecewise constant image intensities), and has been successfully applied to parallel MRI [4].
Hand-crafted prior information, in general, fails to capture the complexity of the underlying distribution [5]. On the other hand, data-driven approaches have shown promising results in MRI reconstruction, often leading to superior results compared to classical CS techniques [6]-[12]. Here, information is encoded in a deep neural network, which is trained on reference data in an off-line step. Data-driven approaches have successfully been applied as pre-processing steps in k-space [7] as well as post-processing steps in image-space [8]. Variational networks (VNs) [6], [13]-[15] imitate the structure of an iterative reconstruction scheme by unrolling an optimization algorithm. In a dynamical setting, this approach has proven useful for quantifying flow in four-dimensional MRI [16]. At the pinnacle, purely data-driven methods disregard any physical measurement model and instead learn direct maps from k-space to image-space (AUTOMAP, [17]).
Data-driven methods typically require large training datasets. In particular, state-of-the-art networks mimicking iterative schemes like the end-to-end VN [19] as well as AUTOMAP [17] (and related methods such as [10]) require image-data pairs for training. Such image-data pairs are scarce in medical applications, whereas reconstructed images are much more abundantly available in the form of DICOM data [8]. Data-driven methods that only assume access to reference images (but still allow for some kind of data-fidelity) include generative adversarial networks (GANs) [11] and score-based diffusion models [12]. Both methods share the problem that it is not obvious how the encoded prior information is best used: GANs suffer from the range-dilemma [20] and authors have proposed to optimize the parameters of the GAN at inference time [11], effectively turning it into a deep image prior [21]. Diffusion models have shown remarkable results in MRI reconstruction [12], [22], [23], but it is still an open question how to optimally incorporate data-fidelity in the reverse diffusion process (see [24] and the discussion in Section 1.3 for an overview of proposed methods). In addition, (with the exception of score-based diffusion models)
Figure 1: Sketch of the reconstruction algorithm: To jointly reconstruct the spin density u and the sensitivity maps Σ, we impose data-fidelity, image-regularity, and coil-regularity in the iterations of iPALM [18]. The function H incorporates our learned regularizer acting on u. Details are discussed in Section 2.2.
all of the mentioned models effectively act as a point estimator, mapping a k-space datum to an image. However, having access to a distribution of reconstructions and, consequently, being able to quantify uncertainty in the reconstruction is of utmost importance in medical applications. Further, these point estimators typically assume a particular acquisition modality and do not generalize with respect to acquisition masks [12]. In fact, it has been shown that the reconstruction quality of some data-driven approaches deteriorates when more data becomes available [25]. This severely hampers adoption of these methods in clinical practice. In any case, a method that marries the great representation power of data-driven deep neural networks with the interpretability of variational approaches is still sought after.
In this work we pursue a principled approach to parallel MRI that combines the merits of modern data-driven approaches with the benefits of classical variational approaches. In particular, we learn a highly expressive image-space prior in the well-known maximum-likelihood framework. In contrast to most other data-driven approaches, training our model does not require image-data pairs, but only assumes access to a database of reference reconstructions. To solve different PI reconstruction problems, we propose a novel algorithm to jointly estimate the image as well as the sensitivity maps in a sensitivity encoding (SENSE)-type [1] forward model. A sketch of our proposed approach is shown in Fig. 1. We demonstrate that combining this prior with suitable data-likelihood terms yields competitive performance for a multitude of PI problems. Specifically, the variational formulation makes our method agnostic to sub-sampling patterns. Additionally, the accompanying probabilistic interpretation naturally enables experts to explore the full posterior distribution of any reconstruction problem. In particular, we analyze the posterior expectation as well as the pixel-wise marginal posterior variance.

On the Hunt for the Optimal Regularizer
The classical variational approach for solving inverse problems amounts to minimizing an energy functional

x̂ ∈ argmin_x D(x, z) + R(x).    (1)

Here, x is the reconstruction given the datum z. The data-fidelity term D models the physical acquisition process, including stochasticity. For example, if A models the physical acquisition process, D(x, z) = ½∥Ax − z∥²₂ is optimal under the assumption of Gaussian measurement noise. On the other hand, the regularization term R encodes prior knowledge about the solution. Classical regularizers R include magnitude penalization (e.g. the Tikhonov functional x ↦ ∥x∥²₂) and sparsity-promoting functionals such as the TV, R(x) = ∥Dx∥₁, where (Dx) contains the spatial finite differences of x. Often, Eq. (1) is used without fully acknowledging its probabilistic interpretation: It computes the maximum a-posteriori (MAP) estimate of the posterior p(x | z) ∝ exp(−D(x, z) − R(x)). Hand-crafted regularizers, in general, cannot faithfully represent the intricate statistics of the underlying distribution. Since these regularizers are overly simplistic, methods like early stopping of the optimization of Eq. (1) have emerged [26]. Clearly, the need for such heuristics arises only because hand-crafted regularizers are a bad model for natural or medical images.
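As a minimal illustration of the variational formulation Eq. (1), not the learned regularizer proposed in this paper, the following sketch minimizes a least-squares data term plus a Charbonnier-smoothed TV penalty on a 1D signal by plain gradient descent (all function names, step sizes, and parameter values are ours and purely illustrative):

```python
import numpy as np

def tv_smoothed_grad(x, eps=1e-3):
    # Gradient of the Charbonnier-smoothed TV: sum_i sqrt((x[i+1] - x[i])^2 + eps^2)
    d = np.diff(x)
    w = d / np.sqrt(d ** 2 + eps ** 2)
    g = np.zeros_like(x)
    g[:-1] -= w
    g[1:] += w
    return g

def energy(A, x, z, lam, eps=1e-3):
    # E(x) = 1/2 ||A x - z||^2 + lam * TV_eps(x), cf. Eq. (1)
    return 0.5 * np.sum((A @ x - z) ** 2) + lam * np.sum(np.sqrt(np.diff(x) ** 2 + eps ** 2))

def map_estimate(A, z, lam=0.05, eps=1e-3, steps=300):
    # Plain gradient descent; step size from a crude Lipschitz bound
    # (||A||^2 for the data term, 4*lam/eps for the smoothed TV).
    tau = 1.0 / (np.linalg.norm(A, 2) ** 2 + 4.0 * lam / eps)
    x = np.zeros(A.shape[1])
    for _ in range(steps):
        x -= tau * (A.T @ (A @ x - z) + lam * tv_smoothed_grad(x))
    return x
```

With a valid step size, each iteration monotonically decreases the energy, which is exactly the MAP descent the probabilistic interpretation above describes.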
In this paper we pose the following question: Can we learn a regularizer such that p_R is indistinguishable from the underlying reference distribution? We answer this by encoding the regularizer as a neural network endowed with learnable parameters, which we train on reference data using maximum likelihood (see Section 2). This method of learning an unnormalized probability density is known as learning an energy-based model (EBM) [27], [28], and has recently been used in the context of MRI reconstruction by [29]-[31]. After training, the network can be used in arbitrary reconstruction tasks by embedding it in Eq. (1), thereby combining the versatility of variational approaches with the great representation power of deep neural networks.
On one hand, concerning versatility, the strict separation of data likelihood and prior makes this approach agnostic to the number of receiver coils and the sub-sampling pattern. Further, the probabilistic interpretation enables access to a distribution of reconstructions p(x | z), and we can compute different Bayesian estimators. In addition to the MAP estimate (δ_x is the Dirac measure centered at x)

x_MAP ∈ argmax_x p(x | z),    (2)

we exploit the distribution to compute the minimum mean-squared-error (MMSE) estimate

x_MMSE = E[x | z] = ∫ x p(x | z) dx,    (3)

as well as the pixel-wise marginal variance (at the i-th pixel)

Var[x_i | z] = ∫ (x_i − E[x_i | z])² p(x | z) dx.    (4)

On the other hand, the representation power is empirically demonstrated by two experiments: First, our model is capable of synthesizing realistic images without any data. Second, we show that we can faithfully reconstruct images from severely ill-posed problems, e.g. when using random acquisition masks, where profound knowledge of the underlying anatomy is required.

Parallel MRI
In modern parallel MRI systems, the view of Eq. (1) is overly simplistic: These systems typically utilize coil arrays with spatially varying sensitivity maps [1] which are, in general, entirely unknown. In other words, the physical acquisition model is not well specified. Thus, for any real practical reconstruction problem, a great challenge lies in estimating these sensitivity maps accurately.
Many strategies for estimating the sensitivity maps in an off-line step have been proposed (e.g. ESPIRiT [32], [33]). Typically, these require the acquisition of fully sampled auto-calibration lines (ACL) in the k-space center such that, in essence, a low-resolution reconstruction along with the corresponding sensitivity maps can be estimated. The disadvantage of such methods is two-fold: First, assuming a non-Cartesian sub-sampling pattern, the acquisition of the ACL essentially requires an additional (albeit low-resolution) scan. This increases examination time and opens the possibility of patient movement and subsequent misalignment artifacts. Second, high-frequency regions of the k-space are not taken into account for estimating the sensitivity maps. Generally, the sensitivity maps can be assumed to be smooth (indeed we also make this assumption). However, since errors in the sensitivity estimation significantly affect downstream tasks, it is vital to exploit all available data to the best extent.
In this work, similar to [4], [34], [35], we propose to estimate the image and sensitivity maps jointly. This joint estimation hinges on the observation that the sensitivity maps are much smoother than the imaged anatomy. To enforce this smoothness and simultaneously resolve ambiguities, we impose a simple quadratic penalization on the spatial gradient of the sensitivity maps. The joint estimation allows us to utilize all available k-space data to estimate the image as well as the sensitivity maps. The resulting optimization problem can be efficiently solved by accelerated non-convex optimization algorithms. Reconstructing a parallel MRI image takes about 5 s on consumer hardware.

Related Work
In this section we review some related work on learned MRI reconstruction, but restrict our attention to methods that share similarities with our proposed approach. For a more general overview of data-driven MRI reconstruction we refer to [36]. The authors of [25] give a very broad overview of potential risks of using modern techniques for medical image reconstruction in general.

Energy-based Models
After submission of this paper, we were made aware of concurrent works that also utilize EBMs for MRI reconstruction. In [30], the authors propose to learn a regularizer and use proximal gradient descent for inference. The main difference to our approach is that for parallel MRI, their algorithm assumes access to sensitivity maps, which have to be precomputed using, e.g., ESPIRiT [32]. Additionally, their algorithm requires hand-tuning of step sizes and is relatively slow. In contrast, we propose an algorithm for joint reconstruction of image and coil sensitivities which does not require hand-tuning of step sizes and is fast due to acceleration.
The authors extended their work in [31], where they learn an EBM for both image- and k-space. While this obviates the need for sensitivity estimation, it requires training two independent networks that need to be balanced at inference time. Additionally, the k-space EBM requires fully-sampled reference k-space data, which is scarcely available. In contrast, we only require reference DICOM (magnitude) images to train one network. In addition, we perform a data-independent analysis of the learned regularizer as well as an uncertainty analysis.

Diffusion Models
Diffusion models [22], [23], [37], [38] aim to model the gradient of the log-prior while undergoing a diffusion process: Let p_t denote the data distribution at diffusion time t. The aim is to learn a time-conditional score network s_θ such that s_θ(·, t) ≈ ∇ log p_t for all t > 0. This approach shares many similarities with EBMs, where indeed it should hold that ∇R = −s_θ(·, 0).
Diffusion models can be used to generate unconditional samples from the data distribution [37]. This is computationally demanding as it requires solving a stochastic differential equation (SDE) with high accuracy, which typically needs thousands of gradient evaluations [12]. For inverse problems, it is not clear how to optimally incorporate data-fidelity into the SDE. Proposed approaches include data projection [22], annealed Langevin dynamics [38] and diffusion posterior sampling [39]. These approaches require hand-tuning of parameters, sometimes at each step of the reverse diffusion [38], [39]. The recent work [24] showed that none of these methods generate samples from the true posterior distribution and proposed to augment the score models with normalizing flows. While their inference algorithm is parameter-free, the approach is opaque due to the introduction of the normalizing flow and is still computationally demanding.
In contrast, EBMs enjoy a natural probabilistic interpretation with access to an analytic expression of the posterior. In addition, MAP inference does not require solving an SDE, but can be performed by efficient optimization algorithms. Having access to the function value (as opposed to only the gradient) also has practical applications, such as being able to inspect the regularization landscape (see Fig. 2) and to utilize backtracking in optimization algorithms [18].
In the works of [23], [38], parallel imaging is tackled by offline sensitivity estimation, which comes with the drawbacks outlined in Section 1.2. The authors of [12] propose to reconstruct individual coil images using a model trained solely on root-sum-of-squares (RSS) reconstructions. While the results are impressive, the computational cost consequently depends on the number of coils utilized in the physical scanner, and the authors report reconstruction times of up to 10 min. In our joint reconstruction algorithm, the gradient of the network is only evaluated once per iteration, irrespective of the number of coils. Imposing spatial regularity on the coils is extremely fast: at each iteration we only have to compute very few fast Fourier transforms (see Section 2.2).

Joint reconstruction
The common approach to parallel imaging is based on a two-step procedure: Coil sensitivities are estimated (typically from ACL regions) in an off-line step, after which the image is reconstructed. A joint reconstruction was first proposed by [35] based on alternating minimization, where they explicitly parametrize the sensitivities with low-order polynomials. In [34], the authors propose an iteratively regularized Gauss-Newton algorithm that enforces spatial smoothness of the sensitivity maps during the iterations. Their algorithm is not guaranteed to converge for arbitrary initializations, requires hand-tuning of parameters at each update step and leads to noisy solutions when run for too long. This algorithm was extended to incorporate classical variational penalties, such as the TV [40] or the Total Generalized Variation (TGV) [41], by the authors of [4], in order to suppress noisy solutions. Indeed, their approach shares a lot of similarities with ours. The differences can be summarized as follows: Instead of hand-crafted regularizers, we employ modern generative learning techniques to learn an expressive regularizer from data, which leads to state-of-the-art reconstructions. In addition, instead of the iteratively regularized Gauss-Newton algorithm with additional non-trivial sub-problems for the variational penalties, we employ the inertial proximal alternating linearized minimization (iPALM) algorithm [18] for optimization. Thus, we can guarantee convergence and only require hand-tuning of two parameters (see Section 2.2). A sketch of our reconstruction algorithm is shown in Fig. 1.
For completeness, we mention that the end-to-end variational network of [19] also estimates the sensitivities jointly with the image. However, their approach differs significantly from ours in that they learn a mapping from k-space to image-space discriminatively, where they utilize the estimated sensitivity maps to enforce data fidelity. Thus, the network only works well with a particular sub-sampling pattern and coil configuration. In contrast, we learn image features generatively and impose hand-crafted spatial regularity onto the sensitivity maps, and are therefore agnostic to the sub-sampling pattern as well as the coil configuration.

Maximum-likelihood training
To learn a regularizer such that its induced distribution is indistinguishable from the underlying reference distribution we proceed as follows: We equip the regularizer with parameters and denote with {R_θ : θ ∈ Θ} the family of θ-parametrized functions R_θ : X → R, where Θ is a suitably selected set of parameters and X is the space of the underlying distribution. We discuss our particular choice of the θ-parametrized family in Section 2.4. R_θ induces a Gibbs distribution on X with density

p_θ(x) = exp(−R_θ(x)) / Z_θ,    (5)

where Z_θ = ∫_X exp(−R_θ(ξ)) dξ is the partition function. We find the optimal parameters by maximizing the likelihood of reference data (drawn from a reference distribution p_data) under our model:

min_θ L(θ) := −E_{x∼p_data}[log p_θ(x)].    (6)

The loss function Eq. (6) admits the gradient [42]

∇_θ L(θ) = E_{x∼p_data}[∇_θ R_θ(x)] − E_{x∼p_θ}[∇_θ R_θ(x)].

For any interesting regularizer, computing the partition function Z_θ is intractable. Thus, we resort to Markov chain Monte Carlo (MCMC) techniques to approximate E_{p_θ}. In particular, following [27] we use the unadjusted Langevin algorithm (ULA) [43], which iterates

x^(j+1) = x^(j) − ζ ∇R_θ(x^(j)) + √(2ζ) ε^(j),  ε^(j) ∼ N(0, Id),  j ∈ ⟦0, J − 1⟧.    (7)

Here, N(µ, Σ) denotes the normal distribution on X with mean µ and covariance Σ, and ⟦·, ·⟧ denotes an integer interval. Further, ζ > 0 is the discretization time step of the associated continuous-time Langevin diffusion stochastic differential equation, which is known to be p_θ-stationary [43], [44]. The time discretization biases ULA asymptotically, which could be corrected via metropolization. However, the poor non-asymptotic performance of metropolized ULA [45], [46] makes this less useful in practice, and we do not use metropolization. Note that by Eq. (5), ∇ log p_θ = −∇R_θ.
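The ULA iteration (7) can be sketched in a few lines. The toy energy below (a standard Gaussian, so that ∇R(x) = x) stands in for the trained network; the step size and chain count are illustrative:

```python
import numpy as np

def ula(grad_R, x0, zeta, n_steps, rng):
    """Unadjusted Langevin algorithm:
    x_{j+1} = x_j - zeta * grad_R(x_j) + sqrt(2*zeta) * eps_j, eps_j ~ N(0, Id)."""
    x = x0.copy()
    for _ in range(n_steps):
        x = x - zeta * grad_R(x) + np.sqrt(2.0 * zeta) * rng.standard_normal(x.shape)
    return x

# Toy check: R(x) = ||x||^2 / 2 gives p ∝ exp(-R), a standard normal,
# so the chains should settle near zero mean and unit variance.
rng = np.random.default_rng(0)
samples = ula(lambda x: x, x0=np.zeros((5000, 1)), zeta=0.05, n_steps=400, rng=rng)
```

As noted above, the finite step ζ biases the stationary distribution slightly (here the stationary variance is 1/(1 − ζ/2) rather than exactly 1), which is the asymptotic bias the text refers to.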
In general, maximum-likelihood based learning of EBMs is known to be unstable [28], [47] due to the slow convergence of ULA (7). To minimize the burn-in time and aid convergence of our sampler, we utilize persistent initialization [48], [49]: Assuming p_θ changes only slightly during each update to θ, samples from previous learning states are good initial guesses for the current iteration. We implement this idea using a replay buffer that holds samples from previous iterates. After iterating (7) starting from x^(0) drawn from the replay buffer, we write x^(J) back into the buffer with chance 1 − π_reinit. Otherwise, we draw a random sample from p_data and write it to the buffer. Upon reinitialization, to help mode coverage, we randomly permute the pixels of (on average) every second sample. We refer to [27], [28], [42] for more information on maximum-likelihood training of EBMs.
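The replay-buffer logic described above can be sketched as follows (class and parameter names are ours; `data_sampler` stands in for drawing batches from p_data, and the pixel-permutation trick is applied to roughly half of the reinitialized chains):

```python
import numpy as np

class PersistentBuffer:
    """Replay buffer for persistent chain initialization in EBM training."""

    def __init__(self, data_sampler, size, p_reinit=0.01, rng=None):
        self.rng = rng or np.random.default_rng()
        self.data_sampler = data_sampler      # draws n samples from p_data
        self.buf = data_sampler(size)         # warm-start the buffer with data
        self.p_reinit = p_reinit
        self.idx = None

    def draw(self, batch_size):
        # Initial chain states x^(0); remember indices for the write-back.
        self.idx = self.rng.integers(0, len(self.buf), size=batch_size)
        return self.buf[self.idx].copy()

    def write_back(self, x):
        # Keep x^(J) with chance 1 - p_reinit; otherwise replace by a fresh
        # data sample, permuting the pixels of (on average) every second one.
        x = x.copy()
        reinit = self.rng.random(len(x)) < self.p_reinit
        if reinit.any():
            fresh = self.data_sampler(int(reinit.sum()))
            for i in np.flatnonzero(self.rng.random(len(fresh)) < 0.5):
                self.rng.shuffle(fresh[i].ravel())   # in-place pixel permutation
            x[reinit] = fresh
        self.buf[self.idx] = x
```

A sketch only: it assumes one `draw`/`write_back` pair per training step, as in the scheme described above.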

Reconstruction algorithm
We utilize an image-space SENSE-type [1] forward model for parallel MRI. In detail, we relate the real-valued spin density x ∈ R^n_{≥0} to the noisy measurement data z = (z_1, …, z_C)^⊤ ∈ C^m. Here, m = CK combines the K acquired k-space data points (not necessarily on a Cartesian grid) of C ∈ N receiver coils. The forward operator utilizes the sensitivity maps (σ_c)_{c=1}^C of the C receiver coils, as well as the sampling operator F_M : C^n → C^K defined by a k-space trajectory M. In our model, we assume a real-valued spin density and empirically demonstrate good performance on the fastMRI dataset [8] in Section 3.4. In phase-sensitive imaging, however, a complex spin density needs to be assumed. Our approach can be generalized to this setting by, e.g., splitting real and imaginary channels as in [11], [19], and training the regularizer on this data.
To ease notation, we define Σ := (σ_1, …, σ_C)^⊤ and write |Σ|_C := (Σ_{c=1}^C |σ_c|²)^{1/2} for the RSS of the sensitivity maps (| · | is the complex modulus acting element-wise on its argument). In the case of sub-sampling on a Cartesian grid, F_M encodes the Fourier transform followed by multiplication with a binary mask. In such SENSE-type setups, to quantitatively compare to fully-sampled RSS reconstructions, the spin density x has to be re-weighted by |Σ|_C [34]: Denoting the reconstructed image u ∈ R^n_{≥0}, it is given by

u = x ⊙ |Σ|_C,    (9)

which motivates an immediate change of variables. Instead of estimating the spin density (and the sensitivity maps), we propose to directly estimate the image u (and the sensitivity maps). Formally, given the noisy measurement data z = (z_1, …, z_C)^⊤ we propose to find

argmin_{u, Σ} H(u, Σ) + F(Σ),    (10)

where

H(u, Σ) = Σ_{c=1}^C ½ ∥F_M(σ_c ⊙ u ⊘ |Σ|_C) − z_c∥²₂ + λR(u) + δ_{R^n_{≥0}}(u)    (11)

combines data fidelity and image-regularization (⊘ denotes element-wise division), and

F(Σ) = µ/2 Σ_{c=1}^C ∥Dσ_c∥²₂

encodes the smoothness prior on the sensitivity maps. In the above, δ_{R^n_{≥0}} enforces non-negativity on the spin density using the indicator function, and D : R^n → R^{2n} is the discrete gradient operator such that (Dx) contains vertically stacked horizontal and vertical first-order finite differences (see, e.g., [50]). It implements physically motivated Dirichlet boundary conditions (σ_c = 0 outside of the image domain for all c). λ ∈ R_+ and µ ∈ R_+ are scalars trading off the strength of the regularization on the image and the sensitivity maps, respectively. Observe that the division by |Σ|_C in Eq. (11) follows from the identification Eq. (9). This formulation has the advantage that the sensitivity maps (σ_c)_{c=1}^C are implicitly normalized, and we do not have to impose a normalization constraint explicitly during optimization. Further, the existence of a minimizer is ensured by coercivity of Eq. (10), which is a direct result of the enforced Dirichlet boundary conditions.
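For the Cartesian case, the SENSE-type forward operator (coil-wise sensitivity weighting, Fourier transform, binary k-space mask) and its adjoint can be sketched as below. This is a simplified version without the |Σ|_C re-weighting, with function names of our choosing; the adjoint pair is what gradient-based solvers need for the data-fidelity term:

```python
import numpy as np

def sense_forward(u, sens, mask):
    """A(u): coil-wise sensitivity weighting, 2D FFT, binary k-space mask.
    u: (H, W), sens: (C, H, W), mask: (H, W) binary; returns (C, H, W)."""
    return mask[None] * np.fft.fft2(sens * u[None], norm="ortho")

def sense_adjoint(z, sens, mask):
    """A*(z): masked inverse FFT, combine coils with conjugate sensitivities."""
    return np.sum(np.conj(sens) * np.fft.ifft2(mask[None] * z, norm="ortho"), axis=0)
```

The gradient of the quadratic data term ½∥A(u) − z∥² is then A*(A(u) − z); the adjoint identity ⟨A u, z⟩ = ⟨u, A* z⟩ is a cheap correctness check for any such implementation.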
We solve Eq. (10) using the iPALM algorithm [18] with Lipschitz backtracking, summarized in Algorithm 1. The algorithm utilizes the combined fidelity-regularity functional H defined in (11). For our learned regularizer we have R = R_θ as in (16), whereas for a hand-crafted regularizer R may be the TV (18).
All gradients are understood in the CR-sense [51]. Recall that the proximal operator prox_{αG} : H → H of a proper extended real-valued functional G is prox_{αG}(x) = argmin_y G(y) + (1/2α)∥y − x∥². In particular, prox_{δ_{R^n_+}} retrieves the positive part of its argument, that is prox_{δ_{R^n_+}}(x) = (x)_+. prox_{µF} can be computed in closed form by utilizing the discrete sine transform (see [52, Chapter 19.4] for a more rigorous discussion): it applies the map Q_µ : y ↦ S^{−1} diag(ξ_i + µ)^{−1} S(µy) coil-wise, where S is the discrete sine transform and ξ_i are the eigenvalues of the discrete Laplace operator, which are of the form ξ_i = 2 − 2 cos φ_i for equally spaced angles φ_i. We initialize the algorithm with the zero-filled (ZF) RSS reconstruction and the corresponding sensitivity maps (for all c ∈ ⟦1, C⟧). A sketch of the reconstruction algorithm is shown in Fig. 1.
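The closed-form quadratic prox works because the Dirichlet Laplacian is diagonalized by the sine transform with eigenvalues 2 − 2 cos φ_k. A 1D sketch of this idea (the paper's Q_µ acts analogously; here α plays the role of the combined step size and weight, and the function name is ours):

```python
import numpy as np
from scipy.fft import dst

def prox_quadratic_dirichlet(y, alpha):
    """Solve (I + alpha * L) x = y, with L the 1D Dirichlet Laplacian
    (tridiagonal 2, -1), via the DST-I, which diagonalizes L with
    eigenvalues xi_k = 2 - 2*cos(pi*k/(n+1)), k = 1..n."""
    n = len(y)
    xi = 2.0 - 2.0 * np.cos(np.pi * np.arange(1, n + 1) / (n + 1))
    # DST-I with ortho normalization is its own inverse.
    return dst(dst(y, type=1, norm="ortho") / (1.0 + alpha * xi), type=1, norm="ortho")
```

Transform, divide by the eigenvalue-shifted spectrum, transform back: three O(n log n) operations instead of a linear solve, which is why imposing coil smoothness is so cheap in each iPALM iteration.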

Experimental data
For all experiments we utilize the fastMRI knee dataset [53]. Specifically, the training data are the RSS reconstructions of size ñ = 320 × 320 of the multi-coil coronal proton-density weighted (CORPD) training split. We used the central 11 slices to ensure reasonable training data, resulting in a total of 5324 training slices. To have consistent intensity ranges during training, we normalized each slice by x → (x − min_i x_i)/(∥x∥_∞ − min_i x_i) individually (this normalization was not performed for any of the reference methods). For validation and testing, we used the multi-coil CORPD validation split, discarding samples with width w ∉ {368, 372}, leaving 91 scans. The scans were split into 30 validation samples and 61 test samples by lexicographic ordering of the filenames. To be consistent with training, we again restrict our interest to the central 11 slices, resulting in 330 validation slices and 671 test slices. For the out-of-distribution experiments, we used the central 11 slices of the coronal proton-density weighted fat-suppressed (CORPD-FS) scans (again excluding width ∉ {368, 372}) in the fastMRI knee validation dataset.

Network architecture and implementation details
The family of θ-parametrized functions we consider in this work follows a simple structure, with parameter set θ = … ∪ {W_FC}. We do not impose any constraints on any of the parameters; thus Θ ≅ R^{n_p}, where n_p = 21 350 640 is the total number of parameters. Although our network is quite sizable, it has significantly fewer learnable parameters than, e.g., the discriminative end-to-end VN of [19] (3 × 10^7) or the score-based diffusion models of [12] (6.7 × 10^7).
We optimize Eq. (6) with AdaBelief [54] (β_1 = 0.9, β_2 = 0.999). In contrast to most previous works [27], [30], [31], we did not find it necessary to regularize our model by means of, e.g., Lipschitz regularization, magnitude penalization or similar techniques. We use a learning rate of 5 × 10^−4, exponentially decreasing with rate 0.5 at update steps {500, 2000, 3000, 5000, 7000}, and a batch size of 50 for 27 000 parameter updates. To stabilize training, we smooth the data distribution by convolving it with a normal distribution of standard deviation 1.5 × 10^−2. For approximating E_{p_θ}, we run ULA for J_max = 500 steps. To accelerate training in the early stages, we use the exponential schedule J_h = ⌈J_max(1 − exp(−h/1000))⌉ at the h-th parameter update.
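The warm-up schedule for the number of ULA steps is a one-liner (the time constant 1000 follows the formula above):

```python
import math

def ula_steps(h, j_max=500, tau=1000.0):
    """Number of ULA steps at parameter update h: ceil(j_max * (1 - exp(-h/tau)))."""
    return math.ceil(j_max * (1.0 - math.exp(-h / tau)))
```

Early updates run only a handful of Langevin steps, and the count saturates at J_max = 500 after a few thousand parameter updates.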
For persistent initialization, we use a replay buffer holding 8000 images with reinitialization chance π_reinit = 1 %. Training took approximately one month on a machine equipped with one NVIDIA Quadro RTX 8000.

Simulation study and posterior sampling
To construct a real-valued simulation study, we use the RSS validation and test data detailed in Section 2.3, and retrospectively sub-sample the corresponding k-space data. To approximately map the reconstructions to the same intensities seen during training, we normalize the data by z → z/∥u^(0)∥_∞, and normalize the final reconstruction by u* → u* ∥u^(0)∥_∞. To run inference on this data, we utilize Algorithm 1 with one sensitivity map that is fixed at the identity (C = 1, σ_1 = 1 is the one-vector). We found the optimal regularization parameter λ by grid search on the validation dataset.
In addition to computing MAP estimators in the sense of Eq. (10), we also examine the posterior distribution. In particular, we approximate the MMSE (Eq. (3)) and variance (Eq. (4)) integrals by MCMC: We run ULA on the Gibbs density of H, where again C = 1 and σ_1 = 1 is fixed as the one-vector, i.e. p(u | z) ∝ exp(−H(u, 1)). The algorithm is initialized with uniform noise. We discard the first 10 000 samples to reach a steady state, and then save every 15th iterate, for a total of 160 000 iterations (resulting in 10 000 saved samples). We found that the regularization parameter λ barely influenced the results of the posterior sampling and consequently set it to λ = 1 for all sub-sampling patterns.
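The MMSE and pixel-wise variance are then empirical moments over the saved samples. A numerically stable streaming estimator (Welford's algorithm; class name ours) avoids storing all 10 000 samples at once:

```python
import numpy as np

class RunningMoments:
    """Streaming mean and per-pixel variance (Welford) over MCMC samples."""

    def __init__(self, shape):
        self.n = 0
        self.mean = np.zeros(shape)
        self.m2 = np.zeros(shape)     # sum of squared deviations

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)   # second factor uses the updated mean

    @property
    def variance(self):
        return self.m2 / self.n              # population variance
```

Feeding each saved ULA iterate through `update` yields the MMSE estimate in `mean` and the pixel-wise marginal posterior variance in `variance`.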

Parallel imaging
For the parallel imaging experiments, we run Algorithm 1 with K = 100. As in the real-valued simulation study, we normalize the data by z → z/∥u^(0)∥_∞, and normalize the final reconstruction by u* → u* ∥u^(0)∥_∞. To find the optimal regularization parameters, we fix µ = 10 and obtain λ by linear least-squares regression of the initial residuum against argmin_λ ∥u*(z_val,i, λ) − u_val,i∥²₂ (found by grid search) for all image-data pairs (u_val,i, z_val,i) in the validation set. Here, we view the initial image u^(0), the sensitivity maps (σ_c)_{c=1}^C, and the optimal reconstruction u* as maps to make dependencies explicit. This regression is performed only for 4-fold Cartesian sub-sampling with 8 % ACL; other sub-sampling patterns use the same fit. Generalization experiments marked with † use the linear λ-fit calculated on CORPD data. For experiments marked with *, we re-ran the regression on CORPD-FS data with 4-fold Cartesian sub-sampling and 8 % ACL.
A particular characteristic of our reconstruction approach is that its intensities are not quantitatively comparable to the reference. In detail, although we normalize the reconstruction by the RSS of the sensitivity maps, we observed that especially in low-intensity regions (e.g. air) the reconstruction did not match the reference. To remedy this and allow for a fair quantitative evaluation, we utilize the validation data to fit a spline curve (cubic splines, 5 equally spaced knots) against the scatter of reconstructed and reference intensities. For the generalization experiments, we fit the spline curve again on an independent CORPD-FS validation dataset. The spline curves for both CORPD and CORPD-FS are shown in Fig. 8 in the appendix. The insets show that our reconstructions prefer zero intensity in background regions, whereas the reference images have non-zero background intensity.
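Such an intensity calibration can be sketched with a least-squares cubic spline. Knot count and spline order follow the description above; the exact knot placement used in the paper is not specified, so interior knots equally spaced over the data range are an assumption, and the function name is ours:

```python
import numpy as np
from scipy.interpolate import LSQUnivariateSpline

def fit_intensity_spline(recon, ref, n_knots=5):
    """Fit a cubic least-squares spline mapping reconstructed intensities
    to reference intensities (both given as flat arrays of paired pixels)."""
    order = np.argsort(recon)
    x, y = recon[order], ref[order]
    # Equally spaced interior knots strictly inside the data range.
    t = np.linspace(x[0], x[-1], n_knots + 2)[1:-1]
    return LSQUnivariateSpline(x, y, t, k=3)
```

Applying the fitted spline to a reconstruction maps its intensities onto the reference scale before computing quantitative metrics.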

For posterior sampling in the parallel setting, we freeze the sensitivity maps at their estimates (σ_c)_{c=1}^C. This amounts to performing Langevin sampling on the Gibbs density of u ↦ H(u, Σ). The sensitivities may also be included in the Langevin procedure, but we empirically found no noticeable difference to freezing them. We believe that this is due to the strong imposed spatial regularity.
We evaluate the quality of the estimated sensitivity maps by computing the null-space residual [32]: Let u_c = F*(z_c), c = 1, …, C, denote the fully-sampled coil images. The null-space residual

r_c = u_c − σ_c ⊙ Σ_{c'=1}^C σ̄_{c'} ⊙ u_{c'},

where ¯· denotes complex conjugation, should only contain noise, since u_c = σ_c ⊙ u when σ_c is exact. Thus, any residual signal points to sub-optimal sensitivity estimates.
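With normalized sensitivity maps (Σ_c |σ_c|² = 1 pixel-wise, as implied by the formulation in Section 2.2), the residual check amounts to projecting the coil images onto the span of the sensitivities and inspecting what remains; a sketch of our reading of this check:

```python
import numpy as np

def nullspace_residual(coil_imgs, sens):
    """coil_imgs, sens: (C, H, W) complex arrays with sum_c |sens_c|^2 = 1.
    Returns r_c = u_c - sigma_c * (sum_c' conj(sigma_c') * u_c') per coil."""
    u = np.sum(np.conj(sens) * coil_imgs, axis=0)   # combined image estimate
    return coil_imgs - sens * u
```

For exact sensitivities the residual vanishes identically; for estimated ones, any remaining structure (beyond noise) flags sub-optimal sensitivity maps.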

Comparison and evaluation
We compare our approach to the following methods: For a hand-crafted prior, we chose the Charbonnier-smoothed total variation with ϵ = 10^−3. In the real-valued simulation study, we compare against the fastMRI baseline method [8] as well as the diffusion-based approach of [12]. However, due to time and computational constraints, we limit our comparison to one arbitrarily picked image per sub-sampling pattern. The implementation as well as the trained model are taken from their GitHub repository; thus the training database differs from ours as it includes the CORPD-FS data. We use 2000 steps in the reverse diffusion. As a state-of-the-art discriminative approach for parallel MRI, we compare against the end-to-end VN approach from [19]. The implementation was taken from the fastMRI GitHub repository with default parameters. The fastMRI baseline method as well as the VN were trained on the subset of the fastMRI dataset detailed in Section 2.3, with masks generated using random 4-fold Cartesian sub-sampling with 8 % ACL. We compare the reconstructions quantitatively using the peak signal-to-noise ratio (PSNR), normalized mean-squared error (NMSE) and structural similarity (SSIM) [55]. SSIM uses a 7 × 7 uniform filter and parameters K_1 = 0.01, K_2 = 0.03. We define the acceleration factor (denoted Acc. in tables showing quantitative results) as the ratio of the image size and the acquired k-space points on the Cartesian grid.
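The scalar metrics reduce to a few lines; NMSE and PSNR are shown below (SSIM omitted). Using the reference maximum as the peak is an assumption on our part, albeit a common convention for fastMRI-style evaluation:

```python
import numpy as np

def nmse(x, ref):
    """Normalized mean-squared error: ||x - ref||^2 / ||ref||^2."""
    return float(np.sum((x - ref) ** 2) / np.sum(ref ** 2))

def psnr(x, ref):
    """Peak signal-to-noise ratio in dB, with peak = max(ref)."""
    mse = np.mean((x - ref) ** 2)
    return float(10.0 * np.log10(ref.max() ** 2 / mse))
```

For instance, a reconstruction offset from a unit-intensity reference by a constant 0.1 has NMSE 0.01 and PSNR 20 dB.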
In Fig. 2 we show a two-dimensional visualization of the learned energy landscape, where T = 7 was chosen to yield visually pleasing results, along with the Langevin trajectory. The figure shows two interesting properties: First, the samples from the Langevin process are almost indistinguishable from the reference data, which demonstrates that our prior is extremely strong and faithfully represents the distribution of the training data. In particular, this implies that the learned regularizer, by construction, should only be applied to reconstruction tasks where the underlying data is also drawn from the same distribution. The experiments in Section 3.5 show that our regularizer can reasonably reconstruct knee images of different contrast by adapting the regularization strength, but prior work [58] demonstrates that performance quickly degrades when confronted with, e.g., rotated images. Second, in two dimensions, the landscape appears smooth on the considered domain and almost log-concave around the modes of the induced distribution. Although these findings do not transfer directly, they should be taken as evidence that the high-dimensional landscape is also reasonably well-behaved. This is corroborated empirically by the ease of optimization: For all reconstruction tasks, we only need on the order of 10 iterations of iPALM.

Simulation Study
We show qualitative results for the retrospectively sub-sampled real-valued data in Fig. 3 (left), where the rows are ordered by increasing acceleration. In the second row we show the results from 4-fold Cartesian sub-sampling with 8 % ACL. Note that this is the setting on which the U-Net was trained, and indeed it yields satisfactory reconstructions. The TV reconstruction for this task is not able to fully remove the sub-sampling artifacts, but increasing the regularization would lead to a significant loss of detail. Our approach is at least on par qualitatively with the discriminative U-Net, and the quantitative analysis in Table 1 shows superiority over all reference methods.
The third row details the reconstructions for spiral sub-sampling. Here, the acceleration factor is approximately 5, with a more densely sampled k-space center, which is typically advantageous for traditional reconstruction techniques. The TV reconstruction removes most of the sub-sampling artifacts, although some are still visible, especially in the background. The discriminative U-Net approach in this task (and indeed in all tasks apart from the Cartesian sub-sampling on which it was trained) struggles to discriminate between details in the anatomy and sub-sampling artifacts. Thus, it hallucinates details into the reconstruction that are not reflected in the data. In contrast, our approach is able to faithfully reconstruct the knee with no visible artifacts. The results are similar for the pseudo-radial sub-sampling pattern with 45 spokes shown in the fourth row.
The 3-fold random sub-sampling shown in the first row is particularly interesting. Here, the k-space center is not more densely sampled than any other region, which manifests in the zero-filling reconstruction as large-scale intensity shifts. None of the reference methods are able to correct this, since they do not have knowledge of the anatomy of the human knee. On the other hand, our approach, due to its generative nature, can restore the general shape of the knee well, and, due to the variational approach, also faithfully keeps details that are present in the data. The generative approach alleviates the requirement for sampling schemes to densely sample the k-space center, which is in stark contrast to the general theory.
We show a quantitative comparison against the diffusion-based approach of [12] in Table 2, which includes the reconstruction time and the number of trainable parameters. The accompanying qualitative results are shown in Fig. 9 in the appendix. The results are reported separately since, due to time and computational constraints, we only evaluated on one image per sub-sampling pattern, and thus they do not constitute a comprehensive comparison. Our MMSE estimate consistently beats the diffusion-based approach, and our MAP estimate is only inferior for the random sub-sampling pattern. Generating the 10 000 samples for the MMSE estimate takes only about 30 % more time than the one sample generated by the reverse SDE. On the other hand, computing our MAP estimate is about 80 times faster.

Uncertainty Quantification through Posterior Sampling
The natural probabilistic interpretation of our variational approach gives rise to a distribution of reconstructions for any particular reconstruction problem. We exploit this distribution to compute pixel-wise variance maps as well as the MMSE estimate. The results in Fig. 3 (right) show large variance around small structures, and more variance when less data is available: The three-fold random sub-sampling shows the least variance, while the approximately six-fold radial sub-sampling shows the most. The MMSE estimator also yields visually pleasing reconstructions. In fact, Table 1 and Table 2 establish quantitative superiority over the MAP estimate (see Section 4 for a possible explanation).
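Computing these summaries from posterior samples is straightforward; a minimal sketch, where the sample count, image size, and intensities are purely illustrative:

```python
import numpy as np

def posterior_summaries(samples):
    """Given posterior samples of shape (n_samples, H, W), return the MMSE
    estimate (pixel-wise mean) and the pixel-wise variance map."""
    return samples.mean(axis=0), samples.var(axis=0)

# Illustrative draws: concentrated around intensity 5 with std 0.1.
rng = np.random.default_rng(0)
samples = 5.0 + 0.1 * rng.standard_normal((1000, 16, 16))
mmse, var_map = posterior_summaries(samples)
```

In our setting, the samples come from the Langevin process on the posterior, so regions of high `var_map` flag pixels where the data constrains the reconstruction weakly.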

Parallel Imaging
The first row of Fig. 4 shows a classical Cartesian sub-sampling pattern with an acceleration factor of 4 and 8 % ACL. This coincides with the training setup of the discriminative end-to-end variational network approach of [19]. Consequently, it shows the best performance quantitatively as well as qualitatively. Our method also yields competitive results, achieving a PSNR of 35.23 dB. This is in line with our expectations, as in general we cannot expect a generative approach to beat a discriminative counterpart. However, the strength of our approach becomes apparent when we slightly change the sub-sampling pattern: The performance of the end-to-end VN deteriorates significantly when the phase-encoding direction is swapped, or when fewer ACL are acquired. We emphasize that the number of acquired data points is fixed across these tasks. With these minimal changes in the sub-sampling pattern, the method of [19] is no longer able to fully remove back-folding artifacts or introduces severe hallucinations, while our method yields comparable results for all three tasks. This is also reflected in the quantitative evaluation in Table 3, where our method beats the VN approach by a significant margin. Shifting towards radial and 2D Gaussian sub-sampling patterns, we observe that the VN approach introduces severe artifacts. We believe that this is a combination of the sensitivity estimation sub-network failing as well as the image sub-network being confronted with previously unseen sub-sampling artifacts. Indeed, for this task the more general TV approach is superior to the VN, although the reconstructions appear over-smoothed. In line with expectations, our approach reconstructs the image satisfactorily, yielding the best performance quantitatively as well as qualitatively.
To analyze our reconstruction algorithm also with respect to the estimated sensitivity maps, we show example estimations along with ESPIRiT [32] in Fig. 6. The figure shows estimations for the 4-fold Cartesian sub-sampling with 8 % ACL reconstruction problem shown in the first row of Fig. 4. Due to the re-parametrization in Eq. (9), our reconstruction algorithm does not require pixel-wise normalized sensitivity maps. Hence, they look physically plausible and match the reference well.
We additionally analyze the estimation qualitatively by visualizing the root-sum-of-squares (RSS) of the null-space residuals (π_c)_{c=1}^C [32]. In Fig. 7, we compare the sensitivity maps from our joint estimation with ESPIRiT for Cartesian sub-sampling patterns with 8 % ACL and 4 % ACL (first and third row in Fig. 4, respectively).
For both tasks, the sensitivity maps from our joint estimation algorithm can reproduce the data very well. In particular, while the ESPIRiT estimation leads to slightly better results when 8 % ACL are available, the estimation deteriorates for 4 %. In contrast, our estimation remains stable, as it can exploit data that is not in the k-space center.

Generalization
By construction, the learned model encodes the distribution of the training data. In the previous sections we have demonstrated thoroughly that the prior is agnostic to shifts in the sub-sampling pattern, which is an expected consequence of the generative approach. However, a natural question is whether the approach is also robust with respect to shifts in the underlying distribution. To study this, we apply the regularizer to an underlying distribution of fat-suppressed images.
The quantitative and qualitative results in Table 3 and Fig. 5 indicate that our method generalizes better to unseen data than the end-to-end VN approach of [19], although the performance degrades significantly in both cases. To highlight the advantage of the tuneable regularization parameters in our approach, we show results using the λ-fit (see Section 2.6) calculated on non-fat-suppressed data, as well as using parameters adapted to the task: The ability to tune the influence of the regularizer leads to improved performance when confronted with previously unseen data.

Discussion
Fig. 2 demonstrates that our learned regularizer encodes the training distribution. As a consequence, we can reconstruct images from severely ill-posed problems (such as the random sub-sampling pattern) satisfactorily. However, this also means that the performance of our learned regularizer is expected to decline significantly when applied to, e.g., different anatomy (rotation is explored in [58]). Diffusion models [12] are another instance of models that are capable of generating realistic-looking images from the data distribution, but for inverse problems it is not clear whether this is needed. Since our network is purely convolutional, it can easily be "localized" by removing the deeper layers, effectively recovering the fields-of-experts model [29]. This would remove the majority of the trainable parameters (and consequently the training cost), and we hypothesize that it would lead to better generalization, as local features are largely shared between different anatomical structures.
The results in Section 3.4 demonstrate that the regularizer can reconstruct images from multi-coil data satisfactorily, irrespective of the acquisition mask. This is a very significant advantage over the reference methods, since in practice a particular situation might necessitate a different sub-sampling pattern. As an example, the phase-encoding direction is often swapped when a blood vessel is located in a particularly disadvantageous position. In addition, our regularizer does not need re-training when a new sub-sampling pattern is discovered to be advantageous.
The results on multi-coil data also empirically demonstrate that a real-valued regularizer suffices to reconstruct the underlying spin density, at least for the acquisition sequences used in our dataset [8]. Similarly to [22], [23], our approach could easily be extended to account for complex images if needed. In particular, this is the case for contrasts that inherently rely on complex-valued information, such as phase-contrast MRI.
Generalization is not limited to the acquisition mask: The regularizer also performs well in out-of-distribution experiments, which highlight the importance of the ability to control its influence. Table 3 shows that adapting the strength of the regularization to the underlying data strongly improves performance. We note that the regularization strength was always tuned to a 4-fold Cartesian sub-sampling task with 8 % ACL (see Section 2.6). The results for radial sub-sampling indicate that this is a sub-optimal fit for the out-of-distribution data: The fit calculated on the CORPD data († column) performs better than the fit calculated on CORPD-FS data. Fig. 5 suggests that the adapted parameters lead to an over-smoothed reconstruction. Clearly, the performance could be improved by adapting the regularization strength to the CORPD-FS data and the radial sub-sampling. In general, adapting the regularization strength to the acquisition mask as well as the data is greatly beneficial. However, this is typically not possible with other data-driven approaches.
The accompanying probabilistic interpretation is significant in two ways: First, the regularizer can be inspected by data-independent analysis as in Fig. 2, where experts can easily visualize preferred structures. This aids in interpreting the encoded information and thus improves the confidence with which clinicians may view the reconstruction. Second, the variance maps provide the clinician with additional information about the inherent uncertainty in the reconstruction. Fig. 3 (right) clearly shows high variance around small anatomic structures. If the clinician's decision making is based on such regions, they might deem it necessary to acquire additional information.
In the single-coil simulation study, quantitative and qualitative analysis shows the superiority of the MMSE estimate over the MAP estimate. We believe that this is related to the training procedure: During training, the regularizer is only ever confronted with slightly noisy images in the Langevin process (see Section 2.1), and injecting the same noise in the reconstruction algorithm improves overall performance. We are unsure why this performance improvement does not translate to the parallel imaging experiments, but hypothesize that our joint estimation biases the reconstruction such that this effect is no longer observable.
Finally, we emphasize that our algorithm for reconstructing parallel MRI is fast: Algorithm 1 converges in around 100 iterations, which takes around 5 s on an NVIDIA Titan RTX using approximately 2 GB of memory. This is in stark contrast to score-based diffusion models, where the reverse diffusion process typically has to be solved with high accuracy, and consequently the reconstruction time is on the order of 10 min [12]. Speed is advantageous in practice, as images can be viewed while the patient is still in the scanner. This allows fast changes to the sub-sampling pattern, should they be needed.

Conclusion
We utilize modern generative learning techniques to train a regularizer such that it faithfully encodes the underlying distribution of the training data. By embedding the regularizer in a variational reconstruction framework, we can satisfactorily reconstruct single-coil and parallel MRI by adapting the data likelihood term. Quantitative and qualitative analyses indicate competitive reconstruction performance, on par with (or sometimes superior to) fully supervised methods. In addition, our approach is agnostic to changes in the acquisition mask, while the supervised reference methods introduce severe hallucinations when confronted with previously unseen data. The knowledge of the human anatomy encoded in the regularizer even allows us to reconstruct high-quality images from random sampling masks. Finally, on out-of-distribution data, our method is still on par with or better than hand-crafted regularizers such as the TV, as well as other supervised methods. This indicates that our method also generalizes better with respect to shifts in the underlying distribution.
In addition to competitive reconstruction performance, our method has a natural accompanying probabilistic interpretation through statistical modeling and Bayesian inference. This allows experts to explore a distribution of reconstructions, while other data-driven approaches only act as point estimators. Our experiments indicate that this distribution encodes important diagnostic information, such as high uncertainty around small anatomical structures. We believe that this additional information can aid the practitioner's decision making. Further, we can perform a data-independent analysis of the model by visualizing the induced distribution. This allows us to interpret the information encoded in the regularizer, whereas other data-driven methods solely act as black boxes in this respect.
To reconstruct parallel MRI, we propose a fast algorithm to jointly estimate the image and the sensitivity maps. In contrast to off-line sensitivity estimation, our approach does not require auto-calibration lines, can easily be applied to non-Cartesian sampling trajectories, and uses all available data. The resulting optimization problem can be solved efficiently using non-convex optimization algorithms. High-fidelity reconstructions are available in approximately 5 s on consumer hardware.
In general, we believe that reconstruction approaches based on generative priors have huge potential. A natural extension of our work would be to combine the image prior with a learned sensitivity prior. In addition, future research could focus on extending the simple architecture used in this paper with more modern building blocks, such as attention layers. A related direction is to investigate whether local convolutional models can replicate the performance of our generative prior.

or encode smoothness assumptions (e.g. the TV x → Σ_{i=1}^n |(∇x)_i|). The crop operator R^n → R^ñ center-crops the validation images of size n = 640 × w to the size of the training images. The layer S_l : x → W̃_l φ(W_l x + b_l) + b̃_l, l ∈ {1, ..., L}, utilizes the leaky ReLU φ : x → max{γx, x} with leak coefficient γ = 0.05. The weights and biases {W_l, b_l, W̃_l, b̃_l} in the l-th layer encode convolution kernels of size 3 × 3, where W_l has stride 2. The number of features in each of the L = 6 layers follows a geometric progression with common ratio 1.75, starting at 48 features in the first layer. Finally, FC : x → W_FC x is a fully connected layer mapping to a scalar, and U is the absolute value. The entirety of the learnable parameters can be summarized as θ = {W_l, b_l, W̃_l, b̃_l}_{l=1}^L.
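To make the stated hyper-parameters concrete, here is a small numpy sketch of the activation and the feature progression. The rounding of the geometric progression to integer feature counts is our assumption and is not specified above.

```python
import numpy as np

GAMMA = 0.05     # leak coefficient of the leaky ReLU
N_LAYERS = 6     # number of layers L
RATIO = 1.75     # common ratio of the feature progression
F0 = 48          # features in the first layer

def leaky_relu(x, gamma=GAMMA):
    """phi(x) = max(gamma * x, x)."""
    return np.maximum(gamma * x, x)

# Features per layer; rounding to the nearest integer is a hypothetical choice.
features = [round(F0 * RATIO ** l) for l in range(N_LAYERS)]
```

Each layer thus roughly doubles its feature count while the stride-2 convolutions halve the spatial resolution, which is what makes the deeper layers encode higher-level structure.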

Figure 3: Real-valued simulation study results. 1st row: Random sub-sampling with acceleration factor 3; 2nd row: 4-fold Cartesian sub-sampling with 8 % ACL; 3rd row: Spiral sub-sampling (5-fold acceleration); 4th row: Radial sub-sampling using 45 spokes. The inlays show the sub-sampling pattern (white), a detail zoom (blue), and the magnitude of the difference to the reference (red, range 0 to 0.2). Left: Reconstruction results. The numbers show PSNR (over the entire test dataset), with the best method emphasized in red. Right: Uncertainty quantification through the marginal posterior variance (range 0 to 0.0025).

Figure 6: Magnitude of the estimated sensitivities using our joint estimation algorithm (top) versus the ESPIRiT [32] estimation (middle) for the reconstruction problem shown in Fig. 4 (1st row). The bottom row shows the reference coil sensitivities computed from the fully-sampled data.

Figure 7:

Figure 8: Spline fit computed on a validation set for CORPD (top) and CORPD-FS data (bottom). The figures show the log-histogram of reconstructed versus reference intensities, with the insets showing the region around zero, where they deviate the strongest.

Table 1: Quantitative results for the real-valued simulation study using different k-space trajectories. Numbers in parentheses indicate the acceleration factor; bold typeface indicates the best method.

Table 3: Quantitative results for parallel imaging with different sub-sampling patterns on in- and out-of-distribution data. The † column shows results using the CORPD λ-fit, while the * column uses CORPD-FS-adapted parameters (see Section 2.6).