Recurrent Inference Machines as inverse problem solvers for MR relaxometry

In this paper, we propose the use of Recurrent Inference Machines (RIMs) to perform T1 and T2 mapping. The RIM is a neural network framework that learns an iterative inference process based on the signal model, similar to conventional statistical methods for quantitative MRI (QMRI), such as the Maximum Likelihood Estimator (MLE). This framework combines the advantages of both data-driven and model-based methods, and, we hypothesize, is a promising tool for QMRI. Previously, RIMs were used to solve linear inverse reconstruction problems. Here, we show that they can also be used to optimize non-linear problems and estimate relaxometry maps with high precision and accuracy. The developed RIM framework is evaluated in terms of accuracy and precision and compared to an MLE method and an implementation of the ResNet. The results show that the RIM improves the quality of estimates compared to the other techniques in Monte Carlo experiments with simulated data, test-retest analysis of a system phantom, and in-vivo scans. Additionally, inference with the RIM is 150 times faster than the MLE, and robustness to (slight) variations of scanning parameters is demonstrated. Hence, the RIM is a promising and flexible method for QMRI. Coupled with an open-source training data generation tool, it presents a compelling alternative to previous methods.


Introduction
MR relaxometry is a technique used to measure intrinsic tissue properties, such as T 1 and T 2 relaxation times. Compared to qualitative weighted images, quantitative T 1 and T 2 maps are much less dependent on variations of hardware, acquisition settings, and operator (Cercignani et al., 2018). Additionally, because measured T 1 and T 2 maps are more tissue-specific than weighted images, they are promising biomarkers for a range of diseases (Cheng et al., 2012; Conlon et al., 1988; Erkinjuntti et al., 1987; Larsson et al., 1989; Lu, 2019).
Thanks to their low dependence on hardware and scanning parameters, quantitative maps are highly reproducible across scanners and patients (Weiskopf et al., 2013), presenting variability comparable to test-retest experiments within a single center (Deoni et al., 2008). The low variability allows for direct comparison of tissue properties between patients and across time (Cercignani et al., 2018). However, to ensure that quantitative maps are reproducible, mapping methods must produce estimates with low variance and bias.
Conventionally, quantitative maps are estimated by fitting a known signal model to every voxel of a series of weighted images with varying contrast settings. The Maximum Likelihood Estimator (MLE) is a popular statistical method used to estimate parameters of a probability density by maximizing the likelihood that a signal model explains the observed data, and it is extensively used in quantitative mapping (Ramos-Llorden et al., 2017; Smit et al., 2013; Sijbers and Dekker, 2004). Usually, MLE methods estimate parameters independently for each voxel, which may lead to high variability for low-SNR scans. Spatial regularization can be added to the MLE (referred to as the Maximum a Posteriori estimator, MAP) to enforce spatial smoothness, but this demands considerable domain expertise. Additionally, for most signal models, MLE/MAP methods require iterative non-linear optimization, which is relatively slow for clinical applications and may demand complex algorithm development.
Despite the current success of deep learning methods in the medical field, their application to Quantitative MRI (QMRI) is still hampered by the lack of large in-vivo training sets. Specifically in MR relaxometry, the use of neural networks is still limited. Previous works successfully applied deep learning in cardiac (Jeelani et al., 2020) and knee (Liu et al., 2019) MRI, but they required scans of many subjects to train the networks and depended on alternative mapping methods to generate training labels. This limitation was addressed in Cai et al. (2018) and Shao et al. (2020) by using the Bloch equations to generate simulated data to train convolutional neural networks in T 1 and T 2 mapping. However, estimation precision, a central metric in QMRI, was not reported. It is unclear, therefore, how well these methods would perform with noisy in-vivo data.
In this paper, we propose a new framework for MR relaxometry based on Recurrent Inference Machines (RIMs) (Putzky and Welling, 2017). RIMs employ a recurrent convolutional neural network (CNN) architecture and, unlike most CNNs, learn a parameter inference method that uses the signal model, rather than a direct mapping between input signal and estimates. This hybrid framework combines the advantages of both data-driven and model-based methods, and, we hypothesize, is a promising tool for QMRI.
Previously, RIMs were used to solve linear inverse problems to reconstruct undersampled MR images (Lønning et al., 2019) and radio astronomy images (Morningstar et al., 2019). In both works, synthetic, corrupted training signals (i.e. images) were generated from high-quality image labels using the forward model.
A significant limitation on the use of deep learning in MR relaxometry is the lack of large publicly available datasets. The acquisition of in-vivo data is a costly and time consuming process, limiting the size of training datasets and reducing flexibility in terms of the pulse sequence and scanning parameters. Using model-based strategy for data generation (in contrast to costly acquisitions) allows the creation of arbitrarily large training sets, where observational effects (e.g., acquisition noise, undersampling masks) and fixed model parameters are drawn from random distributions. This represents an essential advantage over other methods that rely entirely on acquired data. Yet, the lack of high-quality training labels (i.e. groundtruth T 1 and T 2 maps) limits the variability of training signals. Here, we also generate synthetic training labels to achieve sufficient variation in the training set.
We compared the proposed framework with an MLE method and an implementation of the ResNet as a baseline for conventional deep learning QMRI methods. In contrast to MLE methods with user-defined prior distribution to enforce tissue smoothness, the RIM learns the relationship between neighboring voxels directly from the data, making no assumptions about the prior distribution of values. This might improve mapping robustness to acquisition noise.
We evaluated each method in terms of the precision and accuracy of measurements. First, noise robustness was assessed via Monte Carlo experiments with a simulated dataset with varying noise levels. Second, we evaluated the quantitative maps' quality concerning each method's ability to retain small structures within the brain. Third, the precision and accuracy in real scans were evaluated via a test-retest experiment using a hardware phantom. Lastly, we used in-vivo scans to evaluate precision in a test-retest experiment with two healthy volunteers.

Signal modeling
Let κ be the parameter maps to be inferred, such that κ(x) ∈ R Q is a vector containing Q tissue parameters of a voxel indexed by the spatial coordinate x ∈ N D . Then, we assume that the MRI signal in each voxel of a series of N weighted images S = {S 1 , ..., S N } follows a parametric model f n (κ(x)):

S n (x) = f n (κ(x)) + ε(x),    (1)

where ε(x) is the noise at position x.
For images with a signal-to-noise ratio (SNR) larger than three, the acquired signal at position x can be well described by a Gaussian distribution (Sijbers et al., 1998; Gudbjartsson and Patz, 1995), with probability density function denoted by p(S n (x m ) | f n (κ(x m )), σ), where m ∈ {1, ..., M } is the voxel index, M the number of voxels within the MR field-of-view and σ the standard deviation of the noise.
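Under this Gaussian assumption, the per-voxel negative log-likelihood reduces to a scaled sum of squared residuals plus a normalization constant. A minimal numpy sketch (the function name is ours, not from the paper):

```python
import numpy as np

def negative_log_likelihood(signal, model_signal, sigma):
    """Negative log-likelihood of a voxel's N weighted-image samples under
    i.i.d. Gaussian noise with standard deviation sigma."""
    n = signal.shape[-1]
    resid = signal - model_signal
    return (n / 2) * np.log(2 * np.pi * sigma**2) + np.sum(resid**2, axis=-1) / (2 * sigma**2)

# With a perfect model fit, only the constant normalization term remains.
s = np.array([1.0, 0.5, 0.25])
nll_perfect = negative_log_likelihood(s, s, sigma=0.1)
```

Any misfit between data and model raises this value, which is what makes it usable as a data-consistency term later on.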

Regularized Maximum Likelihood Estimator
The Maximum Likelihood Estimator (MLE) is a statistical method that infers parameters of a model by maximizing the likelihood that the model explains the observed data. Because the MLE is asymptotically unbiased and efficient (it reaches the Cramér-Rao lower bound for a large number of weighted images) (Swamy, 1971), it was chosen as the reference method for this study.
Assume P (S | f (κ), σ) is the joint PDF of all independent voxels in S, from which a negative log-likelihood function L(κ, σ | S) is defined. Additionally, let Ψ(κ) be the log of a prior probability distribution over κ, introduced to enforce map smoothness. Then the ML estimates κ̂ are found by solving

κ̂ = argmin κ { L(κ, σ | S) − Ψ(κ) },    (2)

in which we assume that σ can be estimated by alternative methods and is, therefore, not optimized. Note that, although Eq. 2 strictly defines an MAP estimator, we choose to use the term regularized MLE to emphasize that Ψ(κ) is only applied to promote maps that vary slowly in space. In this work, regularization is used to encourage spatial smoothness of the inversion efficiency map (i.e. B 1 inhomogeneity), while the maps linked to proton density and tissue relaxation times are not regularized and their estimation occurs exclusively at the voxel level. Herein, we refer to this method simply as MLE.
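For a single unregularized voxel under the Gaussian noise model, minimizing L is equivalent to non-linear least squares. A hypothetical sketch for the mono-exponential T 2 model, using scipy's bounded trust-region solver rather than the authors' MATLAB pipeline:

```python
import numpy as np
from scipy.optimize import least_squares

def fit_t2_mle(tau, signal, init=(1.0, 100.0)):
    """Voxel-wise MLE for the model A*exp(-tau/T2): with Gaussian noise,
    maximizing the likelihood equals minimizing the squared residuals."""
    def residuals(p):
        A, T2 = p
        return A * np.exp(-tau / T2) - signal

    return least_squares(residuals, x0=init, bounds=([0.0, 1.0], [np.inf, 5000.0])).x

rng = np.random.default_rng(0)
tau = np.array([10.0, 30.0, 60.0, 100.0, 150.0, 200.0])  # echo times (ms)
signal = np.exp(-tau / 80.0) + rng.normal(0.0, 0.005, tau.shape)
A_hat, T2_hat = fit_t2_mle(tau, signal)
```

In practice each voxel is fit independently, which is precisely what makes the MLE slow for full volumes and sensitive to noise at low SNR.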

ResNet
The Residual Neural Network (ResNet) is a type of feed-forward network that learns to directly map input data to training labels using a concatenation of convolutional layers. It was developed by He et al. (2016) as a solution to the degradation problem that emerges when building deep models (He and Sun, 2014). Skip connections between layers of the network allow the ResNet to fit to the residual of the signal, rather than to the original input, making identity learning simpler, and ensuring that a deeper network will not perform worse than its shallower counterpart in terms of training accuracy (He et al., 2016). For that reason, and because it was shown to be a suitable method for QMRI (Cai et al., 2018), we chose the ResNet as the reference deep learning method for this study.
Let Λ φ : R N → R Q represent a ResNet model for QMRI, parameterized by φ, that maps the acquired signal S to tissue parameters κ, specifically κ̂ = Λ φ (S). The learning task is to find a model Λ φ̂ such that the difference between κ̂ and κ is minimal over the training set, that is

φ̂ = argmin φ Σ ||Λ φ (S) − κ||²,    (3)

where the sum runs over the samples of the training set.

The Recurrent Inference Machine: a new framework for QMRI
In the context of inference learning (Chen et al., 2015;Zheng et al., 2015), the Recurrent Inference Machine (RIM) (Putzky and Welling, 2017) framework was conceived to mitigate limitations linked to the choice of priors and optimization strategy. By making them implicit within the network parameters, the RIM jointly learns a prior distribution of parameters and the inference model, unburdening us from selecting them among a myriad of choices.
With this framework, Eq.2 is solved iteratively, in an analogous way to a regularized gradient-based optimization method. The RIM uses the gradients of the likelihood function to enforce the consistency of the data and to plan efficient parameter updates, speeding up the inference process. Additionally, because this framework is based on a convolutional neural network, it learns and exploits the neighborhood context, providing an advantage over voxelwise methods. Note that, rather than explicitly evaluating Ψ(κ), the RIM learns it implicitly from the labels in the training dataset.
At a given optimization step j ∈ {0, ..., J−1}, the RIM receives as input the current estimate of the parameters, κ̂ j , the gradient of the negative log-likelihood L with respect to κ, ∇ κ L, and a vector of memory states h j that the RIM can use to keep track of the optimization progress and perform more efficient updates. The network outputs an update to the current estimate and the memory state to be used in the next iteration. The update equations for this method are given by

[∆κ̂ j+1 , h j+1 ] = g γ (∇ κ L(κ̂ j | S, σ), κ̂ j , h j ),    (4)
κ̂ j+1 = κ̂ j + ∆κ̂ j+1 ,    (5)

where ∆κ̂ j+1 is the output of the network and denotes the incremental update to the estimated maps at optimization step j+1, and g γ represents the neural network portion of the framework, called RNNCell, parameterized by γ. A diagram of the RIM is shown on the left of Fig. 1a.
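The update equations amount to a generic loop around a learned cell. In the sketch below the callable standing in for the trained RNNCell g γ is a toy fixed-step gradient-descent cell, purely to illustrate the control flow; the real g γ is a recurrent convolutional network:

```python
import numpy as np

def rim_iterations(kappa0, grad_nll, g_cell, J=6):
    """Skeleton of RIM inference (Eqs. 4-5): at each step the cell sees the
    current estimate, the likelihood gradient and its memory state, and
    emits an incremental update."""
    kappa, h = kappa0, None
    for _ in range(J):
        delta, h = g_cell(kappa, grad_nll(kappa), h)
        kappa = kappa + delta
    return kappa

# Toy stand-in for g_gamma: fixed-step gradient descent with no memory.
target = np.array([2.0, -1.0])
grad_nll = lambda k: k - target          # gradient of 0.5*||k - target||^2
cell = lambda k, g, h: (-0.5 * g, h)
kappa_hat = rim_iterations(np.zeros(2), grad_nll, cell, J=20)
```

With this stand-in cell the loop reduces to plain gradient descent on the likelihood; the trained cell instead learns step sizes, directions and spatial regularization from data.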
Predictions are compared to a known ground truth and losses are accumulated at each step, with total loss given by

γ̂ = argmin γ Σ_{j=1}^{J} ||κ̂ j − κ||²,    (6)

where J is the total number of optimization steps and γ̂ parameterizes the optimal inference model given the training data.

It is important to notice that the RIM uses two distinct loss functions. The likelihood function L(κ|S, σ) is used to provide the gradient ∇ κ L to the network and is evaluated in the data input domain (i.e. weighted images). In contrast, Eq. 6 is used to update the network parameters γ, and is evaluated in the parametric map domain (e.g. T 1 or T 2 relaxation maps).
A relevant feature of this framework is that the architecture of the RNNCell, more specifically the number of input features in the first convolutional layer, depends only on Q, not on N. This means the RIM can process series of weighted images [S n ] for any N > 0.

Sequences and parametric models
The choice of parameters κ and the form of the parametric model f n depend on the pulse sequence used for acquisition.
For the T 1 mapping task in this work, we used the CINE sequence (Atkinson and Edelman, 1991), based on a popular fast T 1 quantification method (Look and Locker, 1970). It uses a non-selective adiabatic inversion pulse, applied after the cardiac trigger with zero delay, simulated at a constant rate of 100 beats per minute using a pulse generator developed in-house. For this sequence, a common parametric model is given by

f n (κ(x m )) = |A(1 − B e^(−τ n /T 1 ))|,

where τ n is the n-th inversion time and κ(x m ) = (A, B, T 1 )^T is the tissue parameter vector at position x m , in which A is a quantity proportional to the proton density and receiver gain, B is linked to the efficiency of the inversion pulse and T 1 is the longitudinal relaxation time. The operator | · | represents the element-wise modulus.
For the T 2 experiments, we used the 3D CUBE Fast Spin-Echo sequence (Mugler, 2014) with model given by f n (κ(x m )) = A e^(−τ n /T 2 ), where τ n is the n-th echo time and κ(x m ) = (A, T 2 )^T, with A proportional to the proton density and receiver gain and T 2 the transverse relaxation time.
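The two parametric models translate directly into code. A sketch (function names are ours), assuming the standard inversion-recovery form with the element-wise modulus:

```python
import numpy as np

def t1_model(tau, A, B, T1):
    """Inversion-recovery signal |A(1 - B*exp(-tau/T1))| used for the
    Look-Locker-based CINE sequence."""
    return np.abs(A * (1.0 - B * np.exp(-tau / T1)))

def t2_model(tau, A, T2):
    """Mono-exponential spin-echo decay A*exp(-tau/T2)."""
    return A * np.exp(-tau / T2)

tau = np.array([50.0, 400.0, 1200.0])        # inversion times (ms)
s = t1_model(tau, A=1.0, B=2.0, T1=800.0)    # B = 2 is a perfect inversion
```

With B = 2, the signal nulls at τ = T 1 ln 2, the familiar inversion-recovery zero crossing.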

Generation of simulated data for training
In this work, we opted to generate training data via a model-based simulation pipeline. Training samples are composed of ground-truth tissue parameters κ and their corresponding set of simulated weighted images S. To generate training samples with a spatial distribution that resembles the human brain, ten 3D virtual brain models from the BrainWeb project (Cocosco et al., 1997) were selected. We randomly extract 2D patches from the brain models during training, with patch centers drawn uniformly from the model's brain mask. To mimic tissue properties that are uniform within a subject but distinct between subjects, the parameters in κ were drawn, for each patch and tissue separately, from a normal distribution with values given in Table 1. To enable recovery of intra-tissue variation, voxel-wise Gaussian noise was added to each parameter in κ, except for B. Because the B value is related to the efficiency of the inversion pulse in IR sequences, it is not tissue-specific and, as such, cannot be modeled as above. Its value was simulated as 2 − Γ, where Γ is independently sampled, per patch, from the half-normal distribution (Leone et al., 1961) with standard deviation σ Γ = 0.2.
Using κ, S was simulated via Eq. (1), with ε(x) independent zero-mean Gaussian noise whose standard deviation σ acquisition was drawn, for each patch, from a log-uniform distribution over the range [0.0065, 0.255], corresponding to SNR levels of 100 to 3, respectively.
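The generation pipeline described above can be sketched as follows. Function names, the 10 ms voxel-wise jitter and the toy tissue statistics are illustrative assumptions, not the paper's code; only the half-normal B map, the log-uniform noise range and the overall structure follow the text:

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_patch(labels, tissue_stats, tau, sigma_range=(0.0065, 0.255)):
    """Sketch of the model-based training-data generation: per-patch tissue
    parameter draws, voxel-wise jitter, a half-normal B map and Gaussian
    acquisition noise with a log-uniform standard deviation.
    tissue_stats maps label -> (T1 mean, T1 std) in ms (hypothetical)."""
    T1 = np.zeros(labels.shape)
    for lab, (mu, sd) in tissue_stats.items():
        T1[labels == lab] = rng.normal(mu, sd)        # one draw per tissue
    T1 += rng.normal(0.0, 10.0, T1.shape)             # intra-tissue variation
    A = np.ones_like(T1)
    B = 2.0 - np.abs(rng.normal(0.0, 0.2))            # half-normal, per patch
    lo, hi = np.log(sigma_range[0]), np.log(sigma_range[1])
    sigma = np.exp(rng.uniform(lo, hi))               # log-uniform noise level
    S = np.abs(A * (1.0 - B * np.exp(-tau[:, None, None] / T1)))
    S += rng.normal(0.0, sigma, S.shape)              # Eq. (1) noise term
    return {"A": A, "B": B, "T1": T1}, S

labels = np.zeros((40, 40), dtype=int)
labels[10:30, 10:30] = 1                              # two toy "tissues"
stats = {0: (800.0, 50.0), 1: (1400.0, 80.0)}
kappa, S = simulate_patch(labels, stats, np.linspace(50.0, 3000.0, 23))
```

Because every ingredient is sampled at random per patch, an arbitrarily large and varied training set can be produced without any acquisition.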

Evaluation datasets
We performed all scans on a 3T General Electric Discovery MR750 clinical scanner (General Electric Medical Systems, Waukesha, Wisconsin) with a 32-channel head coil.

Hardware phantom
Phantom scans were carried out using the NIST/ISMRM system phantom (Keenan et al., 2017) with parameters for the acquisition of T 1 weighted (T 1 w) and T 2 weighted (T 2 w) images presented in Table 2 (datasets HP T1 and HP T2 , respectively). The FOV contained the phantom's T 1 array for T 1 w scans and the T 2 array for T 2 w scans. To evaluate the repeatability of each mapping method, C = 4 consecutive acquisitions were performed without moving the phantom and with minimal time interval between scans.

In-vivo
Our Institutional Review Board approved the volunteer study and informed consent was obtained from 2 healthy adults. C = 2 repeated scans per volunteer were acquired for both T 1 and T 2 experiments to evaluate repeatability with in-vivo data. The FOV used was similar for the T 1 and T 2 experiments and was oriented in the axial direction, with the middle slice positioned at the level of the body of the corpus callosum. These datasets, acquired with a slice thickness of 3mm, are referred to as IV T1 and IV T2 , respectively. Details on acquisition settings are given in Table 2. Finally, to evaluate the performance of the estimators under low SNR conditions, we repeated the T 1 w acquisition using a slice thickness of 1.5mm (dataset called IV noisy T1 ), in which a single slice, positioned above the corpus callosum, was acquired. Again, C = 2 repeated scans were acquired for each volunteer to assess each method's repeatability.

Implementation details
The codes for all methods, trained models and the data used in the experiments are available online 1 .

MLE
In the experiments in this study, Ψ(κ) is set as the sum over voxels of the voxel-wise square of the (spatial) Laplacian of B. A weighting term λ B is introduced to control the strength of the regularization and was empirically set to 500 to reduce the variability of the T 1 estimates. The remaining maps in the T 1 and T 2 mapping tasks are not regularized.
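The regularization term can be sketched as a discrete 5-point Laplacian penalty on the B map. This is a numpy illustration with periodic boundaries for brevity, not the authors' MATLAB implementation:

```python
import numpy as np

def laplacian_penalty(B, lam=500.0):
    """lam times the sum over voxels of the squared discrete Laplacian of
    the B map (5-point stencil; periodic boundaries via np.roll)."""
    lap = (-4.0 * B
           + np.roll(B, 1, axis=0) + np.roll(B, -1, axis=0)
           + np.roll(B, 1, axis=1) + np.roll(B, -1, axis=1))
    return lam * np.sum(lap ** 2)

flat = np.full((16, 16), 2.0)      # spatially constant B -> zero penalty
bumpy = flat.copy()
bumpy[8, 8] = 1.5                  # a local inhomogeneity is penalized
```

A spatially smooth inversion-efficiency map is thus cheap under this penalty, while sharp local variations are discouraged.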
1 https://gitlab.com/e.ribeirosabidussi/qmri-t1-t2mapping-rim

To prevent the estimator from getting stuck in a local minimum far from the optimal target, we initialize κ via an iterative linear search within a pre-specified range of values per parameter. Following initialization, parameters are estimated with a non-linear trust-region optimization method. The estimation pipeline was implemented in MATLAB with in-house custom routines (Poot and Klein, 2015).

Network training
To train both neural networks, 7200 2D patches of size 40 × 40 per brain model were generated during training and arranged in mini-batches of 24 samples, for a total of 3000 training iterations.
We used the ADAM optimizer with an initial learning rate of 0.001 and set the initial network weights with the Kaiming initialization (He et al., 2015). PyTorch 1.3.1 was used to implement and train the models. The networks were trained on a GPU Nvidia P100, and all experiments (including timing) were performed on an Intel Core i5 2.7 GHz CPU.

ResNet architecture
Our implementation of the ResNet is a modified version of He et al. (2016). Pooling layers were removed to ensure limited influence between distant regions of the brain, effectively enforcing the use of local spatial context during inference. Additionally, our ResNet does not contain fully connected layers to adapt the network for a voxel-wise regression problem. All convolutions are zero-padded to maintain the patch size.
The first convolutional layer has a 1 × 1 filter and is used to increase the number of features from N (the number of weighted images) to 40. This layer is followed by a batch normalization (BatchNorm) layer and a ReLU activation function. The core component of the network, denoted the residual block (RB), comprises two 3 × 3 convolutional layers, two BatchNorm layers and two ReLU activations, arranged as depicted on the right of Fig. 1b. Within a given RB, the number of features in each convolutional layer is the same. The skip connection is characterized by the element-wise addition of the input and the output of the second BatchNorm layer. In total, G = 12 residual blocks are sequentially linked, with the number of feature channels in each block empirically chosen as [40, 40, 80, 80, 160, 320, 160, 80, 80, 40, 6]. The network architecture is completed by one 1 × 1 convolutional filter, used to reduce the number of features to Q. Details on the general architecture are presented on the left of Fig. 1b.
Note that, due to differences in the inversion times used for the acquisition of T 1 weighted datasets (Table  2), we trained three ResNet models for the T 1 mapping task: (1) Training dataset generated with N = 23 inversion times (ResNet T1:23 ), (2) with N = 25 inversion times (ResNet T1:25 ), and (3) with N = 31 inversion times (ResNet T1:31 ). Finally, a fourth model was trained on the T 2 mapping task, denoted as ResNet T2 , with N = 6 echo times.

RIM architecture
In this work, the RNNCell (shown in detail on the right of Fig. 1a) is composed of four convolutional layers and 2 GRUs. The first 3 × 3 convolutional layer is followed by a hyperbolic tangent (tanh) link function, and its output, with 36 feature channels, is passed to the first GRU, which produces 36 output channels. The output of this unit (h 1 j+1 ), also used as the first memory state, goes through two 3 × 3 convolutional layers with 36 output features, each followed by a tanh activation. The data then passes through a second GRU, which generates the second memory state h 2 j+1 . The last layer is a 1 × 1 convolutional layer used to reduce the dimensionality of the feature channels, and it outputs Q features, corresponding to the number of tissue parameters in κ. All convolutional layers are zero-padded to retain the original image size.
The parameter vector κ̂ was initialized as A = MIP(S), B = 2, T 1 = 1000 ms and T 2 = 100 ms, where MIP is the per-voxel Maximum Intensity Projection over all weighted images in the set. We used J = 6 optimization steps for all RIM models.
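For the T 1 task, this initialization amounts to the following sketch (the function name and the channel ordering of the stacked maps are our choices):

```python
import numpy as np

def initialize_kappa(S, B_init=2.0, T1_init=1000.0):
    """Initial T1-task estimate for the RIM: A from the per-voxel maximum
    intensity projection over the N weighted images, B and T1 constant."""
    A0 = S.max(axis=0)                      # MIP over the weighted-image axis
    return np.stack([A0, np.full_like(A0, B_init), np.full_like(A0, T1_init)])

S = np.random.default_rng(1).uniform(0.0, 1.0, size=(23, 8, 8))
kappa0 = initialize_kappa(S)                # shape (Q, H, W) with Q = 3
```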
Similarly to the ResNet, we trained three RIM models on the T 1 mapping (RIM T1:23 , RIM T1:25 , and RIM T1:31 ) and one model on the T 2 task (RIM T2 ). Notice that, while all T 1 datasets could be processed by a single RIM model, as the number of input features in the first convolutional layer does not depend on N , slight variations in inversion times might affect estimation error. This aspect will be assessed in Section 5, as it supplies information on the RIM's generalizability.

Quantitative evaluation
The prediction accuracy was evaluated in terms of the Relative Bias between the reference parameter values κ and the estimated parameters κ̂ c ∈ {κ̂ 1 , ..., κ̂ C } for each repeated experiment c, defined as

Relative Bias = (1/C) Σ_{c=1}^{C} (κ̂ c − κ) ⊘ κ,

where C is the number of repeated experiments and ⊘ denotes the element-wise division. The Coefficient of Variation (CV) was used to measure the repeatability of the predictions, and is given by

CV = SD c (κ̂) ⊘ ((1/C) Σ_{c=1}^{C} κ̂ c ),

where SD c denotes the standard deviation over the C estimates κ̂ c .
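Both metrics translate directly into numpy (function names are ours):

```python
import numpy as np

def relative_bias(estimates, reference):
    """Mean signed error over the C repetitions, element-wise divided by
    the reference map."""
    return (np.mean(estimates, axis=0) - reference) / reference

def coefficient_of_variation(estimates):
    """Per-pixel standard deviation over the C repeated estimates divided
    by their per-pixel mean."""
    return np.std(estimates, axis=0) / np.mean(estimates, axis=0)

est = np.array([[990.0, 1010.0],       # repetition 1 (two pixels)
                [1010.0, 990.0]])      # repetition 2
rb = relative_bias(est, np.array([1000.0, 1000.0]))
cv = coefficient_of_variation(est)
```

In this toy example the repetitions scatter symmetrically around the reference, so the bias is zero while the CV is 1%.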

Noise robustness
To assess each method's robustness to noise and mapping quality, we generated the simulated T 1 w data with the process described in Section 4.2 using a 2D slice of a virtual brain model not included in the training, matrix size 256 × 256 and inversion times of dataset IV T1 .
For the same ground-truth T 1 , A and B maps, C = 100 realisations of acquisition noise were simulated per SNR ∈ [3, 5, 10, 30, 60, 100]. The Relative Bias and CV were computed per pixel and their distribution over all pixels within a brain mask is shown. The models RIM T1:31 and ResNet T1:31 were used in this experiment.

Blurriness analysis
We assessed the quality of the quantitative maps in terms of blurriness. Here, we define blurriness as the amount of error introduced to a pixel, in terms of Relative Bias and CV, due to the influence of its neighbors, and vice-versa. In this experiment, our interest lies in how well each mapping method preserves the true T 1 value in small structures (e.g., one pixel), specifically hypo- and hyper-intense regions that are at risk of being blurred away by the neural networks.
To simulate the presence of these small anatomical structures, we changed the T 1 value of selected pixels in a ground-truth T 1 map (Fig. 3a), as follows: Ω hypo point is a hypo-intense pixel (T 1 = 400ms) within the gray matter of this map (shown in detail in Fig. 3b); Ω hyper point is a hyper-intense pixel (T 1 = 1200ms) within the white matter (WM); Ω vert line is a hyper-intense vertical line (T 1 = 1200ms) in the WM; and Ω horz line is a hyper-intense horizontal line (T 1 = 1200ms), also in the WM.
We measured the Relative Bias and CV per pixel in a Monte Carlo experiment with C = 100 noise realizations (SNR = 10). Each metric's median and standard deviation are reported for two disjoint regions in the estimated T 1 map, referred to as Structure and Neighborhood (Fig. 3c). This scenario, containing the simulated structures, is called E2 and was compared to the baseline error in the same regions of the original T 1 map (scenario E1). An independent t-test was applied to identify significant differences between E1 and E2. The models RIM T1:31 and ResNet T1:31 were used in this experiment.

Evaluation with hardware phantom
We manually drew ROIs within every sphere in the phantom and calculated the Relative Bias and CV per pixel within each ROI for T 1 and T 2 tasks.
Since nominal parameter values within the spheres, as reported in Keenan et al. (2017) and used as the reference κ, include relaxation times shorter and longer than the τ used for training (Table 2), we calculated the overall accuracy and repeatability as the average Relative Bias and CV over all pixels in spheres with parameter value in between the lowest and highest τ . Because this dataset was acquired with 23 inversion times, models RIM T1:23 and ResNet T1:23 were used.

Evaluation with In-vivo scans
To evaluate the precision of estimates from in-vivo data, we compared T 1 and T 2 maps from all methods in terms of pixel-wise CV for all in-vivo scans. We also performed a visual comparison of the maps.
We evaluated the mapping quality in in-vivo scans regarding the sharpness of the boundary between gray matter and white matter. Twenty lines perpendicular to the tissue interface (Fig. 7a) were manually drawn in the measured quantitative maps. For each line, linear interpolation was used to reconstruct the T 1 values along it, and a sigmoid model, given by y(x) = V/(1 + e^(−υ(x−x 0 ))) + b, was fit using the MSE as objective function. The parameter υ denotes the slope of the fitted sigmoid and was used as a measure of boundary sharpness. A paired t-test was performed to evaluate significant differences between mapping methods.
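The boundary-sharpness fit can be sketched with scipy's least-squares curve fitting; the initial guesses below are our heuristic, not from the paper, and `v` stands for the slope υ in the text:

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(x, V, v, x0, b):
    """Boundary profile y(x) = V / (1 + exp(-v*(x - x0))) + b."""
    return V / (1.0 + np.exp(-v * (x - x0))) + b

def boundary_slope(x, y):
    """Least-squares sigmoid fit (the MSE objective); returns the slope
    parameter used as the sharpness measure."""
    p0 = [y.max() - y.min(), 1.0, float(x.mean()), y.min()]
    popt, _ = curve_fit(sigmoid, x, y, p0=p0, maxfev=10000)
    return popt[1]

x = np.linspace(-5.0, 5.0, 50)
y = sigmoid(x, V=400.0, v=2.0, x0=0.0, b=800.0)   # synthetic GM/WM T1 profile
v_hat = boundary_slope(x, y)
```

A steeper recovered slope indicates a sharper tissue interface, i.e. less blurring by the mapping method.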

Model generalizability
In this experiment, we evaluated how well the RIM can generalize to datasets with different acquisition settings, specifically, the variation of the inversion times in the three T 1 w datasets. In contrast to the ResNet architecture, which depends on the number of weighted images in the series, the RIM can process inputs of any length.
We used the three RIM T1 models (RIM T1:23 , RIM T1:25 and RIM T1:31 ) to infer T 1 maps from each T 1 w dataset, and computed the CV for the repeated experiments in each. The results were compared to the MLE and to dataset-specific ResNet models.

Simulated dataset
Figures 2(a)-(c) show the Relative Bias measured for the A, B and T 1 maps in the experiment with simulated T 1 w data. For most cases where SNR > 3, all methods produced quantitative maps with comparable median Relative Bias, but both neural networks displayed a larger range of values than the MLE. The CV for all SNR levels is shown in Figs. 2(d)-(f) for the same data. The RIM presented lower CV than the other methods for all SNRs. In comparison, the MLE displayed significantly higher CV than the RIM and ResNet, accentuated at low SNR. The results of the experiments with simulated T 2 w data were similar and are shown in Fig. A1 of the Supplementary Results.

Figures 3(d)-(g) show the results of the blurriness analysis. Specifically, Figs. 3(d) and 3(f) depict the Relative Bias and CV measured per pixel within the Structure area. We observe that both neural networks presented increased Relative Bias compared to scenario E1. For the RIM, the highest increase occurred for Ω hypo point , with the Relative Bias going from 0.68% to 3.43%. This difference represents an average error of 11ms over the ground-truth T 1 value of 400ms, or a loss of 0.81% in T 1 contrast between the pixel and its neighbors, which have an average T 1 of 1350ms.
The ResNet showed considerably higher bias than the RIM when small structures were added, while for the MLE the difference between scenarios E1 and E2 is not significant (except for Ω horz line ). The RIM showed increased CV for all structures compared to the baseline, but values were still lower than the MLE's and comparable to the ResNet's. Figures 3(e) and 3(g) show the Relative Bias and CV for the Neighborhood region. We observe higher Relative Bias for the RIM and ResNet than for the MLE, with a wider range of values, but we found no significant differences between E1 and E2 for any of the cases.
The average computing time to produceκ from N = 31 weighted images (with size 256 × 256 pixels) was measured as 3.8s for the RIM T1:31 , 27s for ResNet T1:31 and 575s for the MLE.

Evaluation with hardware phantom
The T 1 quantification results are shown in Fig. 4. In Fig. 4(a) we present the Relative Bias for the different spheres in the phantom. The average Relative Bias was computed over the spheres in the restricted τ domain (full-color lines), in which the RIM T1:23 model shows lower error (1.34%) compared to the MLE (1.71%) and ResNet T1:23 (31.06%). The CV as a function of T 1 values is shown in Fig. 4(b). The average CV over the restricted τ domain was measured as 3.21% for RIM T1:23 , 7.56% for MLE and 7.5% for ResNet T1:23 .
The results for the T 2 mapping task with the hardware phantom are shown in Fig. A2 of the Supplementary Results.

We observe the presence of outliers in the MLE and ResNet T1:31 maps (white arrows in Figs. 5(d)-(f)), while the RIM T1:31 produced a clean T 1 map. The scatter plot in Fig. 5(h) shows that the RIM estimate is nearly unbiased when compared to the MLE's, while the ResNet presented overestimated T 1 values (Fig. 5(g)).
T 1 maps inferred from the noisier dataset IV noisy T1 are shown in Fig. 6(a). The RIM T1:25 showed increased noise robustness compared to the MLE and ResNet T1:25 , clearly outperforming these methods in terms of outliers. The CV maps, computed per pixel, are presented in Fig. 6(b) and show that the RIM T1:25 model produces low-variance quantitative maps, with an average CV over all pixels of 6.4%, compared to 17.1% for the MLE and 11.06% for the ResNet T1:25 .

Figure 7(c) shows the result of the image quality analysis for the in-vivo scans. The figure depicts the distribution of the sigmoid slope υ for each method across all 20 lines. The whiskers indicate the minimum and maximum υ values, the boxes show the lower and upper quartiles and the solid horizontal line their median. The paired t-test shows no significant differences between methods.

Figures 8(a)-(c) show the T 2 maps generated by each mapping method. The RIM T2 predicted T 2 values that are similar to the reference MLE, with an average difference in T 2 of −1.13ms across all pixels in the brain, while the ResNet T2 again showed overestimated relaxation times compared to the MLE, with an average difference of 26.2ms.
Difference maps between the MLE and both neural networks are shown in Figs. 8(d) and 8(e). The scatter plots in Figs. 8(f) and 8(g) depict the agreement between the neural network estimates and the reference MLE.
6.4. Model generalizability
Fig. 9 illustrates the CV of the different models evaluated on all T1w datasets. The graph shows that the RIM produces estimates with lower variance than the MLE and ResNet, regardless of the number of inversion times used to create the training set. Note that, in every case, the RIM trained for the specific data performs slightly better than the other RIM models. However, we found no significant differences in repeatability between these models.

Discussion
This work presented a novel approach for MR relaxometry using Recurrent Inference Machines. Previous work showed that RIMs produce state-of-the-art results when solving linear reconstruction problems. Here, we expanded the framework and demonstrated that it can be successfully applied to non-linear inference problems, outperforming a state-of-the-art Maximum Likelihood Estimator and a ResNet model in T1 and T2 mapping tasks.
In simulated experiments, we observed that the RIM reduces the variance of estimates without compromising accuracy, suggesting higher robustness to acquisition noise than the MLE and attesting to the advantage of using the neighborhood context in the inference process. In addition, at low SNR, the RIM had lower variance than the ResNet, suggesting that the neighborhood context is not solely responsible for the increased quality, and that the data consistency term (likelihood function) in the RIM framework helps produce more reliable estimates.
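To make the role of the data consistency term concrete, the sketch below shows how the gradient of a Gaussian log-likelihood under a simple inversion-recovery signal model can drive an iterative parameter update. In the RIM, a learned recurrent network maps this gradient, together with the current estimate and a hidden state, to the next update; here plain gradient ascent stands in for the learned update, and the signal model, inversion times, and step size are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

def ir_signal(t1, a, tis):
    """Inversion-recovery signal model S(TI) = A * (1 - 2*exp(-TI/T1))."""
    return a * (1.0 - 2.0 * np.exp(-tis / t1))

def log_likelihood_grad_t1(t1, a, tis, data, sigma=1.0):
    """Gradient of the Gaussian log-likelihood with respect to T1."""
    residual = data - ir_signal(t1, a, tis)
    d_signal_d_t1 = a * (-2.0 * np.exp(-tis / t1)) * (tis / t1**2)
    return np.sum(residual * d_signal_d_t1) / sigma**2

# Schematic iteration: in a RIM, a learned update g_phi would consume this
# gradient; here gradient ascent on the log-likelihood stands in for it.
tis = np.array([50.0, 400.0, 1100.0, 2500.0])  # inversion times (ms)
t1_true, a = 900.0, 1.0
data = ir_signal(t1_true, a, tis)  # noiseless measurements for illustration

t1 = 500.0  # initial estimate
for _ in range(200):
    t1 += 1e5 * log_likelihood_grad_t1(t1, a, tis, data)
```

The fixed-step ascent above is exactly what the learned update replaces: the network can adapt its effective step size and direction per iteration and per pixel, which is one source of the RIM's speed advantage over the MLE.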
This showcases a major advantage of the RIM framework over current conventional and deep learning methods for QMRI.
The phantom experiments performed to assess the Relative Bias and CV in real, controlled scans showed that the RIM has the lowest Relative Bias among the evaluated methods. The ResNet presented significantly higher error, which indicates that it does not generalize well to unseen structures, and the use of simulated training data with this model should be carefully considered. Because the RIM generalizes well, training on simulated data offers a significant advantage in dataset flexibility over models trained on real data, since any combination of parameter values can be simulated and the training dataset can be arbitrarily large.
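Generating such simulated training data can be sketched as follows: sample parameter values from chosen ranges, evaluate the signal model, and add noise. The inversion-recovery model, parameter ranges, and Gaussian noise level below are assumptions for illustration, not the settings of the paper's data generation tool.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_ir_batch(n, tis, t1_range=(300.0, 3000.0), noise_sd=0.02):
    """Simulate noisy inversion-recovery signals for random (T1, A) pairs.

    Returns an (n, len(tis)) array of signals and the ground-truth T1 labels.
    Ranges and noise level are illustrative, not the paper's settings.
    """
    t1 = rng.uniform(*t1_range, size=(n, 1))
    a = rng.uniform(0.5, 1.5, size=(n, 1))
    clean = a * (1.0 - 2.0 * np.exp(-tis[None, :] / t1))
    noisy = clean + rng.normal(scale=noise_sd, size=clean.shape)
    return noisy, t1.ravel()

tis = np.array([50.0, 400.0, 1100.0, 2500.0])  # inversion times (ms)
signals, labels = simulate_ir_batch(1000, tis)
```

Because the labels are exact by construction, no alternative mapping method is needed to produce ground truth, and the dataset size is limited only by compute.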
In all in-vivo scans, the RIM produces quantitative maps similar to those from the MLE, with higher robustness to noise. Although the ResNet estimates parametric maps consistent with reported T1 and T2 relaxation times of brain tissues, its values are often overestimated compared to the MLE. In terms of coefficient of variation, the RIM results are superior to those of the other methods, independently of the dataset.
The anatomical integrity of quantitative maps is an essential factor when evaluating the quality of a mapping method. The RIM and the ResNet use the pixel neighborhood's information to infer the parameter value at each pixel, which raises a valid concern regarding the amount of blur introduced by the convolutional kernels. We demonstrated in simulation experiments that, although the RIM does introduce a limited amount of blur to the quantitative maps, small structures are still reliably retained, and the error introduced by the pixel neighborhood does not represent a significant change in the relaxation times of those structures. Additionally, in in-vivo experiments, both deep learning methods produce relaxation maps with structural characteristics similar to the maps inferred by the MLE. More concretely, the T1 relaxation times at the interface between gray and white matter follow a transition pattern similar to the MLE's, further suggesting that the RIM does not introduce enough blur to alter brain structures, even in in-vivo scans.

Conclusion
We proposed a new method for T1 and T2 mapping based on the Recurrent Inference Machines framework. We demonstrated that our method has higher precision than, and similar accuracy to, a Maximum Likelihood Estimator, and both higher precision and higher accuracy than an implementation of the ResNet. The experimental results show that the proposed RIM generalizes well to unseen data, even when acquisition settings vary slightly. This allows the use of simulated data for training, representing a substantial improvement over previously proposed QMRI methods that depend on alternative mapping methods to generate ground-truth labels. Lastly, the RIM reduces the time required to infer quantitative maps 150-fold compared to our implementation of the MLE, showing that our proposed method can be used in large studies at modest computing cost.