Multi-echo gradient echo pulse sequences: which is best for PRFS MR thermometry guided hyperthermia?

Abstract Purpose MR thermometry (MRT) enables noninvasive temperature monitoring during hyperthermia treatments. MRT is already clinically applied for hyperthermia treatments in the abdomen and extremities, and devices for the head are under development. In order to optimally exploit MRT in all anatomical regions, the best sequence setup and post-processing must be selected, and the accuracy needs to be demonstrated. Methods MRT performance of the traditionally used double-echo gradient-echo sequence (DE-GRE, 2 echoes, 2D) was compared to multi-echo sequences: a 2D fast gradient-echo (ME-FGRE, 11 echoes) and a 3D fast gradient-echo sequence (3D-ME-FGRE, 11 echoes). The different methods were assessed on a 1.5 T MR scanner (GE Healthcare) using a phantom cooling down from 59 °C to 34 °C and unheated brains of 10 volunteers. In-plane motion of volunteers was compensated by rigid body image registration. For the ME sequences, the off-resonance frequency was calculated using a multi-peak fitting tool. To correct for B0 drift, the internal body fat was selected automatically using water/fat density maps. Results The accuracy of the best performing 3D-ME-FGRE sequence was 0.20 °C in phantom (in the clinical temperature range) and 0.75 °C in volunteers, compared to DE-GRE values of 0.37 °C and 1.96 °C, respectively. Conclusion For hyperthermia applications, where accuracy is more important than resolution or scan-time, the 3D-ME-FGRE sequence is deemed the most promising candidate. Beyond its convincing MRT performance, the ME nature enables automatic selection of internal body fat for B0 drift correction, an important feature for clinical application.


Introduction
In hyperthermia (HT) treatments, tumor tissue temperature is increased to sensitize it for other forms of cancer treatment such as radio-, chemo-and immunotherapy.Real-time temperature feedback during HT treatments is essential, as clinical outcome depends heavily on the temperature reached [1].The target treatment temperature usually lies between 39 and 43 C.The temperature achieved in the region of interest (ROI) is commonly measured using temperature probes that are placed intraluminally or punctured into tissue in, or close to, the target volume.These probes provide temperature information for a limited region, and may sometimes be difficult or even unfeasible to place [2].MR thermometry (MRT) promises to enable 3D noninvasive, real-time temperature measurement during HT treatments.These temperature maps could visualize the temperature in the target volume and identify hot-and cold-spots during treatments, hence enabling the optimization of treatment quality.Ultimately, good quality thermal dosimetry will facilitate enhanced understanding of the relationship between target temperature and treatment outcome.Hence, MRT carries the potential to make HT treatments more repeatable, safer and more effective for the patient.
In MR imaging, many temperature sensitive magnetic properties can be exploited for temperature measurement.Proton resonance frequency shift (PRFS) is the most frequently used, and considered the most accurate method [3], as it varies linearly over a large temperature range and is independent of tissue type (with the exception of fat) [4].
Measuring the temperature induced changes in the local resonance frequency by phase mapping may be confounded by other changes in the magnetic field occurring simultaneously, such as magnetic field drift [5].Hence, even though MRT is relatively straight forward and regularly used when performing HT in patients with static tumors such as sarcoma [6], it's much more challenging in body sites where more movement can be expected e.g., abdomen, thorax and head&neck [7].
The scientific and clinical communities agree on the importance of having accurate and reliable MRT measurements [8].The desired performances have been clearly defined, but as explained above, successful MRT is largely dependent on the treatment location [9,10].One way to improve MRT is by increasing the SNR, leading to better precision of the phase maps.This is largely made possible by novel MR HT devices that position multiple receiver coils close to the body [11].Other strategies to improve MRT include various motion correction techniques and faster imaging; these are comprehensively discussed in Yuan et al. [12].
For the brain, motion is not a primary concern.Hence, in preparation of developing motion confounder corrections for other body sites, this location is ideal for testing which sequence will provide accurate temperature mapping.Consequently, the results can be of general value also when performing MRT in other (more challenging) anatomies, because the effect of the sequences are not being intermixed with motion effects.As such, any difference in error that will be encountered in other anatomies can be seen as advantages and drawbacks that can be expected in that particular anatomy, rather than a general trend of the sequence.High precision MRT in the brain has been demonstrated in the past [13,14], using the help of field monitoring to correct for non-temperature induced frequency shift.While this method gives good results, it is expensive and thus not an option for most institutions.Also, the MRT approaches presented don't have sufficient spatiotemporal resolution to monitor temperature during thermal therapies [14].
The only clinical standard pulse sequence available for MRT is the DE-GRE sequence [15,16], which hence is the baseline for comparison in this paper.DE-GRE has been compared to other pulse sequences before [17,18], but never on a broader scale.Also multi-echo sequences have been used to calculate MRT, but often rely on creating a library of images as a baseline [19].This takes additional time for imaging at the start of the experiment.Cheng et al. [20] developed a dual-step iterative temperature estimation algorithm, that improved both accuracy and precision.However, this only applies to areas where fat fractions are between about 10% and 90%.In areas with homogeneous water and fat distributions, fat-referenced thermometry has been shown to work well [21].
In this work, we compare the performance of different clinically available ME-GRE sequences for MRT.PRFS MRT performance, especially accuracy, is investigated in phantom and unheated volunteers.By investigating MRT in the brain, we aim to demonstrate the possible improvements also for more challenging treatment areas, whilst keeping the door open for further developments that could build on these results.The sequence used for clinical MRT monitoring during HT at most institutions (including ours) is double-echo gradient-echo, DE-GRE (two echoes, 2D).We hypothesize that alternative multi-echo gradient-echo sequences might allow better monitoring of HT treatments than DE-GRE, and therefore compare it to two other multi-echo sequences: ME-FGRE (11 echoes, 2D) and 3D-ME-FGRE (11 echoes, 3D) [22].MRT of the sequences having more than two echoes are being evaluated with a multi-peak and multi-echo fitting method developed previously: MMT-PRFS [23] (see Appendix A.1).

Phantom design
A phantom was created with different fat percentages, containing T1 and T2 ranges similar to what is found in human tissue [24], following the recipe by Bush et al. [25] (see Figure 1).The water mixture (0% fat), positioned in the center, is the only one being evaluated in this study.It consists of a mixture of distilled water, gadolinium-diethylenetriaminepentacetate (DTPA) contrast agent, water-soluble surfactant, agar, and sodium benzoate, and thus solidifies.The mixtures were filled in 50 ml Polypropylene conical vials (30 Â 115 mm) and placed in a PVC pipe with a lid.The space between tube and pipe was filled with distilled water.Preliminary experiments of the different mixtures excluded susceptibility artifacts and evaluated SNR, and T1 and T2 values.Subsequently, catheters were added to facilitate easy insertion of temperature probes.The temperature probes are part of our PYREXAR BSD2000-3D-MRI deep HT system, and they serve as the ground truth temperature measurements.According to the vendor specification, the thermistor type sensors are non-perturbing and electromagnetically insensitive with an accuracy of ±0.2 C over a range of 25 to 52 C [26].

Theory/method
Post-processing was conducted offline in MATLAB and SPSS. Figure 2 gives an overview of the post-processing pipeline that we developed, and the steps are described below in more detail.The code is publicly available on GitLab: https:// gitlab.com/radiology/quantitative-mri/mr-thermometry.

Inputs
The acquisition of sequences was repeated at multiple time points and the input into the pipeline differed between the sequences.The ME acquisitions were automatically reconstructed by the scanner into DICOM images of the real and imaginary components.For the DE-GRE acquisition the raw multi-channel k-space data was imported into MATLAB.In order to combine the k-space data for the individual coils, for each time point, one set of coil sensitivity maps is obtained by applying the ESPIRiT method [27].The method was applied to each slice, with a calibration area spanning the center 40% of k-space in both in-plane dimensions and using all echoes in the calibration matrix.More details on the corrections applied for DE-GRE images from phantom experiments are described in Appendix A.2.

Image registration
For volunteers it was in hindsight deemed necessary to register the coil-combined complex valued data, to correct for subject motion along the time points.This was deemed especially important for performing the B0 drift correction (Section 2.2.4) using the thin fat layer, see the Supplementary Material for support of this.For our experiments subject motion was not restricted, however patients motion will be severely constrained by the water bolus of the HT device.To compensate motion, we performed a rigid pair-wise image registration of the magnitude images using Elastix [28]. 1  Subsequently the found deformation was applied to the complex valued images.Since the 2D acquisitions had a slice gap, and phase consistency across slices was not ensured, image registration could only be reliably applied in in-plane directions.However, 3D registration was used to quantify the zmotion for all data; scans for which the motion in the z-direction exceeded 20% of the slice thickness were excluded.

Change in off-resonance
For the DE-GRE acquisitions in volunteers both echoes were used, to correct for other not-temperature-related changes such as the conductivity [29].The change in off-resonance Dx of water only voxels, varies lineally with temperature and can be described by Equation (1): where u is the phase at the time point in question, u r is the phase at the reference time point r (here, r ¼ 1 is used) and TE is the echo time of the acquisition.
For the ME-FGRE and 3D-ME-FGRE sequence MMT-PRFS was used to calculate the change in off-resonance frequency, previously introduced by Salim et al. [23].This proposed model is described in detail in Appendix A.1.

B0 drift correction
In order to ensure that the MR temperature reading is not influenced by the B0 field drift of the scanner, a drift correction needs to be applied to all acquired images.Fat experiences a much smaller temperature related off-resonance frequency shift than water and tissues [30]; in the order of a¼À1.8Ã 10 À10 / C [31], compared to a ¼ 1.0 Ã 10 À8/ C of water [32].Therefore, as fat still experiences all other disturbances of the magnetic field, it can be used for field drift correction.Firstly, fat voxels need to be identified.For the phantom experiments, ROIs were drawn in the external fat tubes; For the volunteer data, we developed an automatic fat selection similarly to [33], selecting the internal body fat based on the water-and fat-density maps at zero echo time obtained from MMT-PRFS, see Figure 3.For DE-GRE, the fat mask could not be calculated with the same method and was hence imported from the ME-FGRE acquisition.The process is illustrated with example images and described in more detail in Appendix A.3.

MRT
Once Dx corr is calculated, the change in temperature can be found by Equation ( 2): where c is the gyromagnetic ratio, a is the change coefficient for PRFS, and B 0 is the magnetic field strength of the MRI scanner.The constants used were c ¼ 267.513Ã 10 6 rad/s/T, a¼À0.01 ppm/ C and B 0 ¼1.5T.
To mitigate a possible bias at higher temperatures, the phantom results are also calculated for the clinical temperature range of 37-43 C, referencing to the time point in each sequence that was closest to and above 37 C.
Since our method relies on the changes in the resonance frequency in the hydrogen of water molecules, we exclude voxels containing more than 20% fat as in these voxels the off-resonance frequency estimate might be biased.This selection was made using the fat and water density maps, which contain the signal at zero echo time: q f > 0:2 q w : Additionally voxels with low SNR were excluded; we defined those voxels with q w 10% of the mean q w of the three voxels with the highest q w in that time point and slice.

MRT performance quantities
To give a complete comparison of MRT performance for different sequences, we followed the standard suggested in [9] in order to calculate performance measures for the phantom experiments, i.e., reporting on bias (ME), spatial temperature standard deviation (spatial SD), temporal temperature standard deviation (temporal SD) and accuracy (MAE).The definitions with their respective formulas are shown in Table 1.
The acquired MRT measurements for the phantom experiment were compared to the 'ground truth' obtained from the temperature probes.The slice that was evaluated with MRT corresponded to the depth of the temperature probe.The ROI had a diameter of about 12 pixels (18.0 mm), resulting in 107 voxels selected.This was drawn as large as possible whilst excluding the wall of the vial (Figure 4(A)).For the MRT performance assessment in volunteers an elliptical ROI was drawn in the brain.The ROI was made as large as possible, while keeping it constant for all volunteers (Figure 4B).
Accuracy in volunteers was calculated similarly to the accuracy as described in Table 1: the mean deviation from zero of the apparent temperature change in the ROI for each slice.The mean absolute deviation of each slice was then averaged over all slices and all volunteers.
SD in the volunteer data was calculated as the standard deviation of temperatures measured in each ROI evaluated (one for each slice) and then averaged over all slices and volunteers, just as the spatial precision in the volunteer experiments, defined in Table 1.
The statistical significance of the differences among the sequences is calculated in SPSS, using a two-sided ttest (p < .05).

Experiments
All experiments were conducted using a 1.5 T MR scanner (GE Healthcare, Waukesha, WI, USA) and a 22-channel GE Head&Neck coil.For comparison between sequences, we aimed to match them as closely as possible to DE-GRE, with the settings used in the clinic for MRT in the pelvis, regarding the following properties: range of echo times, spatial resolution and the number of echoes for the ME sequences.Additionally, the SNR estimate for each set-up, as indicated by the command window before scanning, was matched as well.The sequences run for both experiments were DE-GRE (TR620), DE-GRE (TR200), ME-FGRE and 3D-ME-FGRE.Scans were run continuously and the FOVs were copied between the sequences so that they covered the same area.All MRT images were acquired with a spatial coverage of 192 Â 192 mm 2 and resolution 1.5 Â 1.5 Â 5 mm 3 .
Notes: E j is the MRT measurement at time point j, A is the temperature measurement (ground truth), and n is the number of time points.The minimum requirement has been defined previously as being the benchmark for a successful HT treatment; table copied and edited from [9].The SNR of the sequences were measured to be 42 for DE-GRE (both TRs), 47 for MEFGRE, and 64 for 3D-ME-FGRE.

Phantom
The water in between the vials with the different mixtures was replaced with hot water (at 70 C) and a temperature probe was inserted into the vial with the water mixture via the catheter.Two experiments were run, both spanning a time of about 205 min, where the MR sequences were played out continuously whilst the phantom was slowly cooling down.In the first experiment all sequences were run, but only the DE sequences produced usable results, due to an error in the reconstruction of the ME coil-combined images from the scanner.Hence, for the second experiment, ME-FGRE and 3D-ME-FGRE were run again leading to more measurement points.
The acquisition parameters for different sequences are presented in Table 2.

Volunteers
All volunteers signed an informed consent (protocol MEC-2014-096, approved by the Erasmus MC review board) and were screened prior to entering the MR room.For imaging, the volunteers were placed in the scanner in supine position with the head coil around the head.In order to investigate the MRT accuracy in-vivo, we acquired brain images from a total of ten volunteers.Detailed acquisition parameters are shown in Table 3.

Phantom cooling-down experiment
Figure 5 shows example MRT maps of the phantom cooling down over time.We excluded low SNR voxels as well as voxels with more than 20% fat, as temperature estimates are unreliable in those voxels with the current method.Note that some temperature errors at the edges of the phantom and the tubes with water-fat mixtures remain.
Figure 6 displays the temperature changes measured by MRT (as the mean of the ROI evaluated) against the temperature measured by the probe for each time point.The time points in A are referenced to the first time point acquired during the experiment at the highest temperature, and shows the entire temperature range of the experiment (205 min).B only presents the HT clinical temperature range of 37-43 C, and the MRT measurements are referenced to the time point closest to 37 C, as would be the case during treatment, spanning 65 min.
The performances from the phantom experiment are summarized in Table 4. Considering the whole temperature range of the experiment, ME-FGRE had the best accuracy and bias of 0.46 C and À0.08 C. Spatial SD was only sufficient for the DE-GRE sequences and the temporal SD was sufficient in all.Regarding the clinical temperature range of 37-43 C, the 3D-ME-FGRE sequence performed best regarding bias, spatial SD and accuracy.

Volunteer experiment
Figure 7 shows example MRT estimations for one volunteer for all sequences.Also here, low SNR voxels as well as voxels with more than 20% fat were excluded, as temperature estimates are unreliable in those voxels with the current method.It can be seen, that 3D-ME-FGRE estimates the temperature change more homogenous across the slice, as well as closer to 0 C than the other pulse sequences.
The MRT performances for the volunteer experiments in brain are presented in Table 5.One volunteer and one ME-FGRE acquisition of another volunteer was excluded because the motion in the slice direction was larger than 20% of the slice thickness.Excluding individual slices due to too little fat voxels being selected for a B0 drift correction was necessary in two volunteers and resulted in excluding a total of six slices for both DE-GRE sequences, five slices for ME-FGRE and four slices for 3D-ME-FGRE.The effect of the exclusion criteria is shown in Figure 8.
When testing for the statistical significance of the improvements of the accuracy after exclusion, DE-GRE TR200 and ME-FGRE were non-significant (p ¼ .415and p ¼ .088,respectively).3D-ME-FGRE, however, shows a significant improvement in accuracy compared to the original DE-GRE TR620 sequence, with p < .001.Regarding the SD over the investigated ROI, the multi-echo sequences achieved better

Note:
The/shows the differences between the two DE-GRE implementations with 620 ms and 200 ms.
Table 3. Parameters of all the sequences as well as their respective scan times for the volunteer experiment using the multi-channel GE head coil (22 channels).values after exclusion, though these improvements were not significant.

DE-GRE
An aspect that stood out, was that with 3D-ME-FGRE, only 5% of slices had to be excluded due to selecting too few fat voxels, compared to up to 30% for other sequences.

Discussion
In this work the MRT performance of different multi-echo gradient-echo sequences was compared, both in a phantom and in volunteers.This study was aimed to be a first step in improving PRFS-MRT to a more reliable and robust monitoring modality for HT treatments.
The analysis in volunteers presented here was performed only in brain, an ideal testing location due to little motion disturbances, although subject motion had to be compensated, see also the Supplementary Material for an illustration of this.However, we believe that the results can also be useful when applied to other anatomies.This is because when moving to more challenging areas in the future, we know that any difference in behavior and the likely increase in temperature errors will be caused by motion and other magnetic field disturbances, and not be due to the sequences and settings themselves.In other words: it is precisely because the effects of the sequences are not being intermixed with big adjustments in the analysis, that they will likely be helpful when considering different anatomical regions.
Accuracy is the most important measure investigated, enabling reliable detection of small temperature increases during the relatively long HT treatment times.ME-FGRE and 3D-ME-FGRE achieve better accuracy and bias than DE-GRE in phantom over the whole temperature range, consequently satisfying the minimum requirement for successful monitoring during a HT treatment.This improvement in performance was expected, due to the increased number of echoes, which provide more data to estimate the off-resonance frequency.The smaller echo spacing of ME-FGRE also means that we can have a larger bandwidth, which leads to less problems with wrap-around of the phase when dealing with large temperature changes (due to larger ranges being able to unambiguously be identified).This will also affect the B0 compensation positively, especially when fat is sparsely distributed.
Considering the (smaller) clinical temperature range, the errors decrease and all investigated sequences satisfy the minimum requirements.The best accuracy in the clinical temperature range was obtained with 3D-ME-FGRErepresenting a clinically relevant and statistically significant improvement of 46% compared to DE-GRE TR620.The trends observed in phantoms is mirrored by the measurements in volunteers.We expected the multi-echo sequences to achieve better (lower) values of SD's compared to the standard DE-GRE, and indeed observe up to a 22% improvement (not statistically significant).Most impressive is the 61.7% statistically significant increase in accuracy shown by 3D-ME-FGRE compared to the standard DE-GRE TR620 sequence.In general, the reduced range is more indicative of the thermometry performance that can be expected in the clinicalso because it spans a shorter cooling time (65 min, much closer to the HT treatment time) and comparable effects of B0 drift can be expected.
The possibility to implement automatic fat map selection is an advantage of using multi-echo sequences, making the addition of external fat references superfluous.In current clinical MR-compatible hyperthermia applicators, external fat references are attached to the wall of the applicator.As a result, the fat references are not places close to the subject, but often at the boundary of the imaging FOV.Internal body fat selection provides the opportunity to simplify treatment set-up and reduce the FOV for MRT, therefore shortening treatment time.Additionally it provides a reference as close as possible to the tissue of interest, likely improving accuracy of the field drift correction.Additional benefits include the saving the time of manual delineation of fat, as is common practice when using body fat as a fat reference [16,34], and which can be a very time consuming task [15].
One of the explanations for the robustness of 3D-ME-FGRE is the sequence's echo spacing, which is optimized for detecting and quantifying fat by the three-point Dixon method [35], more so than the 2D implementation, making 3D-ME-FGRE superior when selecting the fat mask for reliable B0 drift corrections; something that is of particular  importance for the clinic, where the scanner drift can be substantial during the long treatment times of >60 min.This robustness is reflected in the lower amount of slices that had to be excluded (5%) due to selecting too few fat voxels, compared to other sequences (up to 30%).Furthermore, 3D-ME-FGRE is acquiring in 3D, and hence has 3D phase consistency.This gives an advantage for the HT application, because it would enable estimating the field drift only once for the whole volume, rather than for every slice, which is the case for the 2D sequences.
In regards of the results of both phantoms and volunteers, it is noteworthy that DE-GRE TR200 performs as well as the original DE-GRE TR620, whilst the acquisition time is shortened by more than 65%.The accuracy over both temperature changes investigated are similar, and the difference in accuracy in volunteers is not significant.If there is a dependency in clinical scenarios on using a DE-GRE sequence, reducing the TR would provide an easy way to reduce scan time without compromising MRT performance.
Whilst this study presents a comprehensive comparison of different performance metrics, there are some limitations that need to be mentioned.The study only analyses four different sequences with particular acquisition settings, which were matched as closely as possible to the clinically used DE-GRE, but may not be optimized even for imaging the brain, as was done here.
The automatic fat selection method that we constructed is able to select reliable fat voxels for the B0 drift correction.However, sometimes there are not enough fat voxels selected toward the outer edges of the acquisition volume.It may not always be feasible or desirable to acquire a larger volume to avoid these effects, thus improvements in the fat selection method could benefit the HT application.Regarding this aspect, 3D acquisitions could prove beneficial due to the consistency of the off-resonance frequency among slices.
With some volunteers, we observed substantial motion in the z-direction.Since the 2D acquisitions had a slice gap, 3D image registration was not possible and sets of data were excluded where z-motion was exceeding our criteria.We expect that this exclusion of data due to prominent motion is not going to be an issue during HT treatments, as such motion will be restricted by the water bolus of the HT applicator.Of course, for 3D sequences, motion could be compensated in all three dimensions, potentially providing an additional benefit in HT.
Image processing has been performed offline, and is not yet optimized for performing temperature monitoring during HT treatments in real-time.When undertaking HT treatments with physical probes, the time for transferring the data and analysis takes 1-2 min.Thus, the response time for adjusting the heating is in the order of minutes and relying heavily on patient feedback, due to the point-like temperature readings of the probe.With MRT we can achieve a 3D temperature distribution and potentially react to hot spots before the discomfort of the patient, as well as improving temperature coverage.

Conclusion
Different multi-echo gradient-echo sequences were assessed for their MRT performance in a phantom and volunteers.3D-ME-FGRE offers a statistically significant improvement in  accuracy of 62% and 46% in unheated volunteers and temperature varying phantom (in the range of 37.0 C-43.0 C), respectively, compared to the clinical standard DE-GRE sequence.Thus, 3D-ME-FGRE presents a promising improvement to noninvasive MR temperature monitoring that can be implemented promptly.We believe this to be the first essential step toward improved MRT for HT treatments and anticipate to further investigate this 3D multi-echo sequence for MRT by optimizing the acquisition settings and corresponding post-processing method.

Appendix A A.1. MMT-PRFs
In this section, the multi-peak multi-echo thermometry with PRFS, referred to as MMT-PRFS, will be described in more detail.

A.1.1. Signal model
Yu et al. [36] proposed a signal model to describe the measured signal of a voxel containing a mixture of water and fat in a multi-echo GRE sequence.However, this signal model doesn't take the initial phase offset u 0 into account.We hypothesized that neglecting u 0 may hinder accurate PRFS based thermometry, and thus use the following extended model to describe the signal of a voxel for MMT-PRFS: where q w and q f are the density values of water and fat; K is the total number of fat frequency peaks; a k is the relative amplitude of the kth fat peak with P K 1 a k ¼ 1; Df k is the frequency shift of the kth fat peak relative to that of water; t e is the echo time; x is the off-resonance frequency; and R Ã 2 is the effective relaxation rate of the transverse magnetization.The values for a k were taken from Table 1 from [36], as [0.62 0.15 0.10 0.06 0.03 0.04], corresponding to frequencies (at 1.5 T) of [210 159 -47 236 117 23] Hz.

A.1.2. Maximum likelihood estimator
Based on the work of Poot et al. [37], we developed a maximum likelihood estimator (MLE) to simultaneously fit q w , q f , R Ã 2 , x and u 0 to the complex-valued measurements of multiple echoes with arbitrary echo times.
Let h i denote a vector containing the tissue parameters to be estimated (q w , q f , R Ã 2 , x, u 0 ) at voxel i 2 1, . . ., N f g , and let h represent an array containing the vectors h i of all voxels.Let Y i, j denote the measured signal of voxel i 2 1, . . ., N f gin image j 2 1, . . ., M f g : Since every image corresponds to a single echo time, S te is substituted by S j in the rest of this section.
To distinguish between the real and the estimated values we use the notation h and ĥ, respectively.The estimation procedure uses the model S j ðh i Þ described in Equation ( 4) to predict the intensity of a voxel i in an image j: The MLE procedure is defined as the following minimization problem: where p is the likelihood function of the data from a single measurement, for which we used a complex-valued Gaussian distribution, and X is the spatial domain (region of interest) in which the fit is performed.
For each voxel, the maximum likelihood estimation begins with a grid search, with 2 values of R Ã 2 e 40, 120 ½ s À1 , 101 values of x e À1100, 1100 ½ Hz, and 6 values of u 0 e p 6 , p Â Ã rad: For all combinations of R Ã 2 , x, and u 0 , the least squares solution of q w and q f was implicitly obtained, and the combination with the lowest residual norm was used for subsequent non-linear optimization.The initialization range was determined based on the expected values of the five parameters.The outcome of the algorithm is not very sensitive to the initialization, but a reasonable initialization can improve the processing time.

A.2. Coil sensitivity correction details
This section described the coil sensitivity corrections that were put in place for the phantom DE-GRE experiments, when only using the second echo.
The individual coil sensitivity maps at every time point maximize SNR of the images (and would reduce artifacts in accelerated scans).However, they can differ (slightly) in phase so that the phase of the reconstructed images may vary spatiallywithin the smoothness constraints on the coil sensitivity maps.This is because the measurements depend on the product of coil sensitivity maps with the images and not on the phase images directly.However, as the temperature change measurements are based on phase changes, it is relevant to minimize the phase differences among the coil sensitivity maps of different time points.
Equation (5) shows how this is accomplished: by adjusting the phase of the maps to be as close as possible to the phase of the coil sensitivity map of the reference time point.Specifically, for each voxel x and time point j: where Cj x ð Þ is a column vector with the uncorrected coil sensitivity maps (for every channel) of a location x and time point j: Cr ðxÞ is a column vector with the coil sensitivity maps at the reference time point r (here, r¼1 is used), H is the complex conjugate transpose and C j x ð Þ is a column vector with the corrected coil sensitivity maps.
In order to obtain the complex valued images for all time points, firstly we multiply the complex conjugate of the corrected coil sensitivity maps with the Fourier transform of the fully sampled k-space data.The summation over all channels of that multiplication is equal to the coilcombined complex valued images.

A.3. Internal body fat selection
This section describes in more detail how the internal body fat was selected for the volunteers: 1. Initial mask: An initial fat mask selects voxels in which the apparent density of fat q f is at least four times higher than that of water q w and where the fat density is larger than 10% of the maximum: fat mask ¼ q f > q w Ã4 ð Þ\ q f > 0:1Ãmax ðq f ÞÞ (6) 2. Elimination of water-fat swapped voxels: Sometimes, especially around tissue boundaries, the water and fat components can be swapped.These voxels can visually easily be spotted, as they vary greatly from their correctly identified neighbors.In order to identify these swapped voxels and eliminate them from the fat mask, we applied Gaussian blurring with a sigma of 4 voxels in-plane to the change in off-resonance frequency map Dx: The swapped voxels are then identified as where the absolute difference between the filtered off-resonance frequency and the unfiltered off-resonance frequency is bigger than 200 rad/s (31Hz).These voxels are subsequently excluded from the fat selection.3. Remove low SNR voxels: A signal constraint was also added in order to avoid selecting nonfat voxels in regions of very low signal, that can be missed in step 1.It's effect can be seen in Figure A1, going from subplot 2 to subplot 3.For the good fat-water contrast in the magnitude image, the second echo was chosen for DE-GRE (19.2 ms).For the other two sequences the echo closest to 19.2ms was selected.For every time point and every slice, the signal is taken as the mean of the three voxels with the highest signal.Voxels that are less than 10% of that signal threshold, are excluded from the fat mask.4. Eliminating residual outliers & B0 drift field: A linear bias field is fitted within the fat pixels of the current mask of the change in offresonance frequency map Dx: This method is described in detail in Glover et al. [35].The difference between Dx and the predicted regression on the fat voxels in the current fat mask should not be very large and abruptly changing.This property can be exploited to identify and exclude outliers in the fat mask by applying a threshold to the absolute difference (chosen to be 100 rad/s (16 Hz)).These steps are iterated over all time points and slices until all the outliers have been removed.The resulting fat mask is updated and used to calculate the linear bias field.To obtain the field-drift corrected change in off-resonance maps the bias field is subtracted from Dx:  4) eliminating the residual outliers which is at the same time also the final fat mask.For this particular volunteer there were no residual outliers, so step 3 to 4 was superfluous.

Figure 1 .
Figure 1.Schematic of the home-made phantom.In this study, only the MRT of the vial in the center, containing the water mixture, is analyzed.The outer four fat vials are used for correcting for the B0 drift.

Figure 2 .
Figure 2. Flow chart of post-processing of experimental data.More detailed information on the methods in general can be found in the text, and specifically on MMT-PRFS in Appendix A.1.

Figure 3 .
Figure 3. Example fat mask (of bone marrow) with volunteer 1, slice 1.The fat mask is overlaid on the magnitude image of echo 1 at time point 2.

Figure 4 .
Figure 4. ROIs chosen for MRT evaluations.(A) shows the ROI for the phantom: only the vial in the center containing water mixture was selected; and (B) for the volunteers: encompassing an elliptical volume of brain tissue, as large as possible while constant among all volunteers.

Figure 5 .
Figure 5. Temperature change MRT maps of the phantom cooling down, shown at different time points.

Figure 6 .
Figure 6.Temperature measured by MRT (mean of the ROI evaluated ± SD) vs. the temperature measured by the thermometer probe during the cooling down of the phantom experiment.Each dot represents a temperature reading at a distinct time point for which the acquisition time from the MRI scanner is taken.The dashed line illustrates the perfect scenario, where the temperature of the probe and the temperature measured with the MRI are matching.A shows the entire temperature range, referenced to the first time point acquired during the experiment; and B shows only the clinical temperature range of 37-43 C, referenced to the time point closest to (and above) 37 C.The bias, spatial SD, temporal SD and accuracy are presented in for both the extended range and the reduced treatment relevant range.

Figure 7 .
Figure 7. MR temperature estimations for all sequences (shown for one slice and one volunteer).

Figure A1 .
Figure A1.Example fat mask selection (of bone marrow) with volunteer 1, slice 1, (echo 1 of magnitude image), at time point 2. (1) Shows the initial mask, (2) eliminating fat-water swapped voxels, (3) including a signal threshold and (4) eliminating the residual outliers which is at the same time also the final fat mask.For this particular volunteer there were no residual outliers, so step 3 to 4 was superfluous.

Table 1 .
MRT performance quantities used for comparison between sequences.
n P n j¼1 ðE j À AÞ j 0.5 Cj Spatial precision Spatial standard deviation (spatial SD) SD 2 ¼ 1 n P n j¼1 ðE j À EÞ 2 0.5 C Temporal precision Temporal standard deviation (temporal SD) Variability at different time points over 90 min 0.5 C Accuracy Mean absolute error (MAE) MAE ¼ 1 n

Table 2 .
Acquisition parameters and scan times for different sequences for the phantom experiment.

Table 4 .
MRT performances from the multi-coil phantom experiment for the whole temperature range (top) and the clinically HT range 37-43 C (bottom).Values that are within the required standard are printed in bold.

Table 5 .
MRT performances averages of 10 volunteers in brain over all slices, including all and excluding motion and too little fat voxels.Accuracy achieved for all sequences in the volunteer experiments, including all scans and excluding scans with too much motion and too few fat voxels (9/10 volunteers remaining).Accuracy was calculated as mean absolute deviation over all slices of the mean deviation from zero of the apparent temperature change in the ROI for each slice.