Progressive Pretraining Network for 3D System Matrix Calibration in Magnetic Particle Imaging

Magnetic particle imaging (MPI) is an emerging technique for determining magnetic nanoparticle distributions in biological tissues. Although system-matrix (SM)-based image reconstruction offers higher image quality than the X-space-based approach, the SM calibration measurement is time-consuming. Additionally, the SM should be recalibrated if the tracer's characteristics or the magnetic field environment change, and repeated SM measurements further increase the required labor and time. Therefore, fast SM calibration is essential for MPI. Existing calibration methods commonly treat each row of the SM as independent of the others, but the rows are inherently related through the coil channel and frequency index. As these two elements can be regarded as additional multimodal information, we leverage the transformer architecture with a self-attention mechanism to encode them. Although the transformer has shown superiority in multimodal fusion learning across several fields, its high complexity may lead to overfitting when labeled data are scarce. Compared with labeled (i.e., full-size) SM data, low-resolution SM data can be easily obtained, and fully using such data may alleviate overfitting. Accordingly, we propose a pseudo-label-based progressive pretraining strategy to leverage unlabeled data. Our method outperforms existing calibration methods on the public real-world OpenMPI dataset and a simulation dataset. Moreover, our method improves the resolution of two in-house MPI scanners without requiring full-size SM measurements. Ablation studies confirm the contributions of modeling SM inter-row relations and the proposed pretraining strategy.

Two conventional reconstruction methods [12] for MPI are available: X-space-based [13] and system-matrix (SM)-based [4] methods. Compared with the X-space-based method, the SM-based method achieves higher image quality [14], but the SM measurement is time-consuming. For the SM measurement, a delta MNP sample must be repeatedly moved across each voxel in the field of view (FOV), and the corresponding signals are recorded. Each measurement takes approximately 15 h for an MPI system with a small 3D FOV (30 mm × 30 mm × 30 mm) [15]. Multiple averaging is commonly required to improve the SM measurement quality, significantly increasing the calibration time (averaging ten measurements can take more than 100 h). More importantly, the SM must be recalibrated when the tracer's properties or the magnetic field environment change. Frequent SM recalibration results in excessive labor and time costs. Therefore, fast SM calibration is an area of research interest for MPI. Several compressed sensing (CS)-based [16], [17] and deep learning-based methods [18], [19] have recently been proposed to reduce the SM calibration time. However, despite the success of existing studies on SM calibration, as reviewed in Section II, there is much room for improvement. In this study, we devise SM calibration improvements in two aspects:
1) Introduction of the coil channel and frequency index to model SM inter-row relations. Existing methods often treat an SM row as an independent data point. This modeling approach neglects the SM integrity and the relationships between frequency components. In fact, the SM frequency components are not entirely independent. For example, each frequency component carries two additional information elements: the coil channel (i.e., the receiving coil obtaining a specific frequency component) and the frequency index. These elements can be regarded as additional multimodal information. Consider a result on OpenMPI data (calibration 5) that illustrates the influence of the two elements. The dimension of each SM row is reduced using t-distributed stochastic neighbor embedding (t-SNE), as shown in Fig. 1(a), and the visualization results are shown in Fig. 1(b) and (c). SM rows in the same receiving coil or with a close frequency index are usually clustered. Because the fusion of multimodal information can improve model performance [20], [21], we integrate the coil channel and frequency index as multimodal information into the model to improve the SM calibration accuracy.
2) Use of unlabeled SM data through progressive pretraining. Deep learning methods have achieved great success in fast SM calibration [18], [19]. However, existing supervised models are limited because they require a large labeled dataset (high-resolution SMs). Insufficient labeled data may cause overfitting and poor performance. Because unlabeled SM data (low-resolution SMs) can be obtained relatively quickly and still benefit model training, we use unlabeled data to increase the SM calibration accuracy.
Driven by the abovementioned analysis, we propose a progressive pretraining transformer-based network called ProTSM to handle multimodal information for fast 3D SM calibration. Because the transformer has shown superiority in multimodal fusion learning across many fields [22], [23], we use the self-attention mechanism to integrate coil information. In particular, the coil information is interpreted as tokens by embedding layers and interacts with the SM row through the transformer's self-attention. Additionally, to prevent overfitting owing to the high complexity of the transformer, we propose a pseudo-label-based progressive pretraining strategy that uses unlabeled data. The proposed ProTSM was evaluated on real-world and simulation datasets for 3D SM calibration, and it notably outperformed similar methods.

TABLE I SYMBOLS AND INTERPRETATIONS
The main contributions of this work are summarized as follows:
• We are the first to take the coil channel and frequency index into consideration for SM calibration. Our visualization analysis shows that frequency components are not independent, and we explicitly model their relationships using the transformer to improve the calibration.
• We propose using unlabeled data with a progressive pretraining strategy. We generate pseudo-labels for the isolated unlabeled pretraining dataset. These data are used to train our model, which is then finetuned on accurately labeled data. Our results show that pretraining accelerates model convergence and improves the SM calibration performance.
• We propose a transformer-based 3D SM calibration framework. ProTSM is evaluated on real-world and simulation datasets and outperforms the state-of-the-art methods. Additionally, the proposed ProTSM is embedded into two in-house MPI systems to generate high-resolution images without requiring a full-size SM measurement.

II. RELATED WORK
Interpolation-based methods are straightforward and easy to implement for super-resolution SM calibration. The performance of bicubic and nearest-neighbor interpolation has been investigated in SM calibration [24]. Simple linear interpolation can help resolve high-resolution structures. Additionally, CS-based methods perform admirably in super-resolution SM calibration. Knopp and Weber [25] first used CS to speed up SM calibration. They sparsified the SM using certain basis transformations, such as discrete Fourier and cosine transforms. Accordingly, many CS-based variant methods have been developed [16], [17], [26], [27], [28]. For example, Ilbey et al. [27] proposed a coded calibration scene method, which places multiple MNP samples inside the FOV in each MPI scan instead of using a single MNP sample, as in conventional methods. This operation increases the signal-to-noise ratio and significantly improves conventional CS calibration.
Deep learning has demonstrated its efficacy in both MPI reconstruction [29], [30] and SM calibration [18], [19], [31], [32]. In MPI image reconstruction, Gungor et al. [33] proposed a deep equilibrium-based model using learned data consistency. This method demonstrated excellent generalization and quick imaging. Similarly, deep learning-based methods for SM calibration can benefit from measured high-resolution SMs and integrate prior knowledge of SM calibration through training. Many deep learning models have been proposed for SM calibration. For example, 3dSMRnet was the first convolutional neural network (CNN)-based model for 3D SM calibration [18]. This model improved both SM calibration and image reconstruction.
The transformer architecture has recently emerged for diverse computer vision applications [34], [35]. Despite the success of CNNs, they do not adequately model long-range dependencies. The transformer architecture has also been applied to SM calibration. Gungor et al. [36] introduced a CNN-transformer hybrid model (TranSMS) for 2D SM calibration. TranSMS contains one CNN and one transformer branch for feature extraction. The fused feature maps are then upsampled, and a high-resolution SM is generated through a data consistency module. This model shows a performance improvement compared with CNN-based methods.
Because the SM frequency components are inherently related, we can model these relationships using the multimodal information of the coil channel and frequency index. Several studies have shown that multimodal information fusion improves model performance [20], [21], which is encouraging for SM calibration. In our previous conference paper [37], we preliminarily demonstrated the feasibility of utilizing multimodal information with the transformer. In this study, building on the introduction of multimodal information, we propose a novel pretraining strategy to prevent potential overfitting caused by the high complexity of the transformer architecture. We also provide more extensive experiments and in-depth discussions to confirm the contribution of coil information and the effectiveness of our pretraining strategy. Overall, this study offers valuable insights and a comprehensive evaluation of our proposed method, which may advance current research on fast SM calibration.

III. PROPOSED PROTSM
The architecture of the proposed ProTSM is shown in Fig. 2(a). The transformer encodes the low-resolution SM and the multimodal tokens of the coil channel and frequency index. The encoded hidden representation is then upsampled and passed through successive convolution blocks to predict the high-resolution SM components. The adopted notation is listed in Table I, and details of the proposed model are provided in the following subsections.

B. Progressive Pretraining Strategy
The flowcharts of the proposed pretraining strategy and finetuning process are shown in Figs. 2(b) and (c), respectively. We first collect a large unlabeled dataset, {S_un}, and obtain pseudo-labels {Y} using a super-resolution model. This model can be a simple linear model (trilinear interpolation) or a trained deep learning model. The proposed model is then pretrained on this large dataset and optimized using the pseudo-labels as follows:

θ_pre = θ − η∇_θ L(f(s_i^un; θ), y_i),

where s_i^un and y_i are pretraining data drawn from {S_un} and {Y}, respectively, f(·; θ) denotes the model with parameters θ, L is the training loss, and η is the learning rate. Following pretraining, the model has better initial weight parameters θ_pre than those obtained through random initialization. The model is then finetuned on an accurately labeled dataset, {S_L}, {S_H}, starting from the pretrained initialization and using a smaller learning rate.
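The pseudo-label generation step can be sketched as follows. This is a minimal illustration, not the paper's implementation: nearest-neighbor upsampling stands in for the trilinear or SRCNN super-resolution model, and the array sizes are arbitrary toy values.

```python
import numpy as np

def upsample_nearest_3d(s_lr, r):
    """Nearest-neighbor 3D upsampling by an integer factor r (a crude
    stand-in for the trilinear/SRCNN model used to create pseudo-labels)."""
    return s_lr.repeat(r, axis=0).repeat(r, axis=1).repeat(r, axis=2)

def make_pseudo_labels(unlabeled_sms, r=2):
    """Map the unlabeled low-resolution SM rows {S_un} to pseudo-labels {Y}."""
    return [upsample_nearest_3d(s, r) for s in unlabeled_sms]

# Toy example: 14 unlabeled 10x10x10 SM rows -> 20x20x20 pseudo-labels
unlabeled = [np.random.randn(10, 10, 10) for _ in range(14)]
pseudo = make_pseudo_labels(unlabeled, r=2)
```

The model would then be pretrained on `(unlabeled, pseudo)` pairs before finetuning on the measured high-resolution SM.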
The proposed pretraining strategy achieves the following improvements while fully using low-resolution SM data: 1) The pretrained model performs a weak super-resolution SM calibration, which improves the performance of the SM calibration and serves as a suitable initialization for optimization through supervised learning. 2) Compared with purely supervised methods, our model leverages low-resolution SM data. Hence, the risk of overfitting owing to limited SM data is mitigated. 3) Compared with training from scratch, finetuning simply refines a weak model into a stronger one, hastening training convergence.

C. Transformer Encoder With Coil Embedding
1) Embedding of Coil Channel and Frequency Index: Because p_i and f_i are single numeric variables, we project them onto a vector space for computation. We use the following linear and embedding layers for projection:

e_{p_i} = Emb(p_i) ∈ R^F, e_{f_i} = Linear(f_i) ∈ R^F,

where F denotes the latent representation dimension.
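A minimal numpy sketch of the two projections, assuming (as is conventional but not stated explicitly here) that the categorical coil channel uses an embedding-table lookup and the numeric frequency index uses a linear layer; the number of coils and all weights are illustrative:

```python
import numpy as np

F = 1024          # latent dimension (matches the paper's setting)
NUM_COILS = 3     # hypothetical number of receive channels

rng = np.random.default_rng(0)
coil_table = rng.normal(size=(NUM_COILS, F))      # embedding layer for p_i
W_f = rng.normal(size=(1, F))                     # linear layer for f_i
b_f = np.zeros(F)

def embed_coil(p_i):
    """Embedding-table lookup for the (categorical) coil channel p_i."""
    return coil_table[p_i]

def embed_freq(f_i):
    """Linear projection of the (numeric) frequency index f_i."""
    return np.array([float(f_i)]) @ W_f + b_f

e_p = embed_coil(1)     # shape (F,)
e_f = embed_freq(812)   # shape (F,)
```

Both outputs live in the same F-dimensional space as the image tokens, so they can be concatenated into the transformer input sequence.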
2) SM Component Sequencing: To handle the 3D input s_i^L with the transformer encoder, we first reshape it into a 1D sequence of tokens. A linear layer then projects the tokens into the latent space.
3) Transformer Encoding: Following an existing method [34], we add absolute position embeddings e_pos ∈ R^{(N+2)×F} to label the patch positions, i.e., z_i = [W_p x_i + b_p; e_{p_i}; e_{f_i}] + e_pos. Compared with its relative counterpart, absolute position encoding explicitly indicates the spatial location relationship between image tokens, likely supporting dense prediction (e.g., super-resolution reconstruction).
The transformer encoder contains two modules: multi-head self-attention (MSA) and a multilayer perceptron (MLP). Encoding can be expressed as follows:

z'_l = MSA(z_{l−1}) + z_{l−1}, z_l = MLP(z'_l) + z'_l,

where z'_l and z_l are the hidden result and the output of layer l, respectively. MSA(·) is the key operation of the transformer and can be expressed as

MSA(z_i) = ∥_{h=1}^H softmax(Q(z_i)K(z_i)^T / √d) V(z_i),

where ∥ and H are the concatenation operation and the number of heads, respectively; Q(·), K(·), and V(·) are linear transformation operations, with Q(z_i) = W_q z_i; and d denotes the number of dimensions in each head. The information from e_{p_i} and e_{f_i} is encoded into s_i^L using the multi-head self-attention module. Additionally, each s^L shares the same encoding parameters for p and f. If two SM components have the same coil channel or a similar frequency index, their e_p and e_f are the same or similar, respectively. Thus, we establish the relationship between SM components through the coil channel and frequency index.
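The attention step can be illustrated with a single-head numpy sketch. The token count, dimensions, and weights below are toy values (the paper uses F = 1024 and d = 128 per head); the point is that the two appended coil/frequency tokens exchange information with the SM image tokens through the attention map.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(z, Wq, Wk, Wv):
    """One attention head over the sequence [image tokens; e_p; e_f]:
    every token attends to every other, mixing coil information into
    the SM tokens."""
    Q, K, V = z @ Wq, z @ Wk, z @ Wv
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d))   # (N+2) x (N+2) attention map
    return A @ V, A

rng = np.random.default_rng(0)
N, F, d = 8, 32, 16                     # toy sizes
z = rng.normal(size=(N + 2, F))         # N image tokens plus e_p and e_f
Wq, Wk, Wv = (rng.normal(size=(F, d)) for _ in range(3))
out, attn = self_attention(z, Wq, Wk, Wv)
```

Stacking H such heads and concatenating their outputs gives the MSA operation above.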

D. Decoder
The decoder contains upsampling and convolution blocks. First, the output of the transformer encoder is reshaped back into a 3D feature map x_i^L, which is then upsampled to obtain a high-resolution feature map through 3D pixel shuffling as follows:

x_i^H = PixelShuffle3D(x_i^L, r),

where x_i^H and r denote the hidden representation after upsampling and the downsampling ratio, respectively. The subsequent convolution operations produce the feature map for the prediction header (i.e., a 1 × 1 × 1 kernel convolution).
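The 3D pixel-shuffle (sub-pixel) rearrangement can be sketched in a few lines of numpy. This is a generic implementation of the standard operation, not code from the paper; the channel layout follows the usual sub-pixel convolution convention.

```python
import numpy as np

def pixel_shuffle_3d(x, r):
    """Rearrange a (C*r^3, D, H, W) feature map into (C, D*r, H*r, W*r):
    the 3D analogue of sub-pixel convolution upsampling."""
    c_r3, D, H, W = x.shape
    C = c_r3 // (r ** 3)
    x = x.reshape(C, r, r, r, D, H, W)
    x = x.transpose(0, 4, 1, 5, 2, 6, 3)   # -> (C, D, r, H, r, W, r)
    return x.reshape(C, D * r, H * r, W * r)

# 8 channels of 5x5x5 features collapse into 1 channel of 10x10x10 voxels
x = np.arange(8 * 5 * 5 * 5, dtype=float).reshape(8, 5, 5, 5)
y = pixel_shuffle_3d(x, r=2)
```

No values are created or discarded; the channel dimension is simply traded for spatial resolution, which is why this upsampler is popular in super-resolution decoders.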

E. Skip Connection
To alleviate the potential vanishing-gradient problem in the deep network, we add a skip connection to our model. In particular, we directly upsample the original 3D SM component and extract a shallow feature map x̃_i^H as follows:

x̃_i^H = Conv(Upsample(s_i^L)).

Finally, we aggregate x_i^H and x̃_i^H to predict the high-resolution component ŝ_i^H using the prediction header as follows:

ŝ_i^H = Header(x_i^H + x̃_i^H).

IV. DATASETS AND EXPERIMENTAL SETUP
A. Datasets
In particular, we extracted 20 × 20 × 20 and 10 × 10 × 10 SM samples for downsampling ratios of 2 and 4, respectively. This pretraining dataset contains 14,596 samples. Then, we obtained pseudo-labels using the super-resolution CNN (SRCNN) [40] model trained on the OpenMPI training set.

3) In-House Datasets for Generalization Ability Evaluation:
We evaluated the proposed ProTSM trained on the OpenMPI dataset using two in-house MPI systems: field-free point (FFP) and field-free line (FFL) scanners. The 3D model schematic diagrams of the two scanners are shown in Figs. 3(a) and (b). For the FFP scanner, we measured an SM with a grid size of 10 × 10. For image reconstruction, the frequency components were selected using the formula f = m_x f_x + m_y f_y. In this study, m_x ∈ [1, 13], m_y ∈ [−7, 7], and only frequency components with f < 330 kHz were used. Finally, 195 frequency components were preserved. This FFP instrument uses active compensation techniques to minimize the influence of excitation feed-through, and the base-frequency signal was left unfiltered. The phantom used for imaging is shown in Fig. 3(d). For the FFL scanner, the selection-field gradient was 0.6 T/m along the X-axis, and the drive frequency was 2.51 kHz. For 2D imaging, the object to be imaged rotates about the Z-axis in the FOV. The sampling frequency was 1 MHz. The FFL scanner was rotated in the XY plane from 0° to 180° in increments of 12° (15 measured angles). We measured a square 9 × 9 grid for the SM with a delta sample (3 × 3 mm²). The second through thirteenth frequency components for each angle (totaling 15 × 12 = 180 frequency components) were used for image reconstruction. The phantom used for imaging is shown in Fig. 3(e). We stacked the replicated 2D frequency components along the Z-axis to create 3D data. Then, the predicted 3D high-resolution data (i.e., grid size of 40 × 40 × 40) were averaged along the Z-axis for 2D image reconstruction. We did not measure the high-resolution SM but conducted a qualitative analysis of the reconstructed images.
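The mixing-order selection rule f = m_x f_x + m_y f_y can be sketched as a simple enumeration. The drive frequencies f_x and f_y below are hypothetical (the text does not state them); only the mixing-order ranges and the 330 kHz cutoff come from the paper. Note that the full grid already has 13 × 15 = 195 pairs, matching the number of preserved components.

```python
# Hypothetical drive-field frequencies; the selection rule and ranges
# (m_x in [1, 13], m_y in [-7, 7], f < 330 kHz) follow the paper.
f_x, f_y = 25.0e3, 26.0e3  # Hz, illustrative values only

grid = [(mx, my) for mx in range(1, 14) for my in range(-7, 8)]
selected = [(mx, my) for (mx, my) in grid
            if mx * f_x + my * f_y < 330e3]

print(len(grid))   # 13 * 15 = 195 mixing-order pairs
```

With the scanner's actual f_x and f_y, the cutoff determines which of these pairs survive; in the paper's setup 195 components were preserved.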

B. Implementation Details
The proposed ProTSM contains four transformer layers and four 3D convolutions per upsampling block. In this study, the hidden representation dimension F was 1024. The number of heads was eight, each with 128 dimensions (denoted by d) per head. The number of channels, C, for the convolutions was 64. For pretraining, the batch size was 50 and the learning rate was 5 × 10⁻⁴. We pretrained the model for 50 epochs. For finetuning, the batch size was eight and the learning rate was 1 × 10⁻³ (halved for the encoder). We first trained the model for ten epochs using linear warmup and then for 50 and 100 epochs using a constant learning rate for downsampling ratios of 2 and 4, respectively. We conducted two experiments using different downsampling ratios (2 and 4) on each dataset. The model contained two upsampling blocks for a downsampling ratio of 4. The patch size was set to two and one for downsampling ratios of 2 and 4, respectively. For image reconstruction based on the calibrated SM, we used the kaczmarzReg algorithm with parameter λ = 0.75 over three iterations.
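A minimal sketch of a regularized Kaczmarz sweep, in the spirit of the kaczmarzReg routine referenced above. This is a generic textbook variant on a real-valued toy system, not the MPI toolbox implementation (which operates on complex SMs); the damping parameter plays the role of λ.

```python
import numpy as np

def kaczmarz_reg(A, b, lam=0.75, iters=3):
    """Regularized Kaczmarz: sweep the rows of the system matrix,
    projecting the estimate onto each measurement hyperplane with
    Tikhonov-style damping lam in the denominator."""
    m, n = A.shape
    x = np.zeros(n, dtype=A.dtype)
    row_energy = (np.abs(A) ** 2).sum(axis=1)
    for _ in range(iters):
        for i in range(m):
            resid = b[i] - A[i] @ x
            x = x + resid * A[i].conj() / (row_energy[i] + lam)
    return x

# Toy consistent system: recover c from A @ c
rng = np.random.default_rng(0)
A = rng.normal(size=(40, 10))
c = rng.normal(size=10)
x = kaczmarz_reg(A, A @ c, lam=1e-6, iters=200)
```

On a consistent, well-conditioned system the iterate converges to the true coefficients; in MPI reconstruction, A is the calibrated SM and b the measured voltage spectrum.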

C. Baselines and Evaluation Metrics
• Bicubic Interpolation [24]. Bicubic interpolation is a common super-resolution reconstruction method. However, because it can only process 2D images, we applied bicubic interpolation twice to perform 3D upsampling: we first upsampled the SM component in the XY plane and then along the Z-axis.
• Trilinear Interpolation [41]. Trilinear interpolation calculates the values of points in a cube based on the values of its vertices.
• CS [27]. CS assumes that the SM components are sparse after applying the discrete cosine transform (DCT). We obtained the low-resolution data through Poisson disc sampling and optimized the following problem:

min_{s^H} ∥DCT(s^H)∥_1 s.t. P s^H = s^L,

where P denotes the Poisson-disc sampling operator.
• SRCNN [40]. SRCNN is the first CNN-based super-resolution reconstruction model. It first upsamples low-resolution images through bilinear interpolation before reconstructing high-resolution images using three convolutions.
• VolumeNet [42]. VolumeNet is a CNN-based super-resolution model designed for 3D medical images. It contains several parallel branches for multiscale feature extraction. The features are aggregated to generate a high-resolution image through voxel shuffling.
• 3dSMRnet [18]. 3dSMRnet is a state-of-the-art method for super-resolution 3D SM calibration. It leverages residual-in-residual dense blocks to extract features from low-resolution SM components. Then, it upsamples the feature maps and reconstructs high-resolution SM components using 3D convolutions. We executed the publicly available open-source code.
In addition to the above-mentioned baseline models, we present two competitive baselines that use coil information:
• MetaBlock [43]. MetaBlock uses an attention-based mechanism to enhance image features using non-image data (such as age and gender). In this study, the frequency index and coil channel represent the non-image data.
• IDL [44]. IDL proposes a multistage interactive fusion strategy to convolve image and non-image data. Instead of simple concatenation of multimodal data, this model uses channel-wise multiplication at each feature-map downsampling level.
In our 2D experiments, we selected the same baseline models as the recent work [36]; the additional methods are listed below:
• VDSR [45]. VDSR uses a very deep CNN for super-resolution tasks. This model learns the residual between the low- and high-resolution images to address the gradient vanishing and explosion problem.
• TranSMS [36]. TranSMS is the most recent state-of-the-art model for 2D SM calibration. This model proposes a two-branch architecture with a convolutional branch and a transformer branch. The transformer branch contains a novel transformer block with a convolution-based patch embedding method.
For each experiment, both the baseline models and our proposed model required the same number of calibration measurements. For the SM calibration, we used the normalized root-mean-square error (nRMSE) as the evaluation metric, as in [18]:

nRMSE = ∥|Ŝ^H| − |S^H|∥_F / ∥|S^H|∥_F,

where ∥·∥_F denotes the Frobenius norm, |·| denotes the complex modulus, and ŝ_i^H and s_i^H are converted into the complex format for evaluation.
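The metric can be computed in a couple of lines. Since the exact formula was lost in extraction, this sketch encodes one plausible reading consistent with the definitions above (Frobenius norm of the magnitude error over the Frobenius norm of the reference magnitudes):

```python
import numpy as np

def nrmse(s_hat, s):
    """nRMSE between predicted and measured SM components (both complex):
    ||  |s_hat| - |s|  ||_F  /  ||  |s|  ||_F."""
    err = np.abs(s_hat) - np.abs(s)
    return np.linalg.norm(err) / np.linalg.norm(np.abs(s))

s = np.array([[1 + 1j, 2], [3, 4 - 2j]])
print(nrmse(s, s))   # 0.0 for a perfect prediction
```

A prediction whose magnitudes are exactly double the reference yields an nRMSE of 1.0, which gives a feel for the scale of the 3-4% values reported below.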
To evaluate a reconstructed image, we calculated the peak signal-to-noise ratio (PSNR), structure similarity index measure (SSIM), and nRMSE.

A. SM Calibration
Table II lists the 3D SM calibration results for the two datasets. The proposed ProTSM is highly superior to the other evaluated methods on the OpenMPI dataset in terms of nRMSE (3.08% and 4.10% for downsampling ratios of 2 and 4, respectively), with an improvement of approximately 15% over the best single-modal-based method. Additionally, the proposed ProTSM achieves a relative improvement of approximately 9.5% compared with other multimodal-based methods. ProTSM also performs the best on the simulation dataset, with nRMSE values of 0.72% and 2.70% for downsampling ratios of 2 and 4, respectively.
Fig. 4 shows the center slice of the reconstructed 3D SM data for a qualitative evaluation. Overall, the deep learning

B. Evaluation of Image Reconstruction
We evaluated the image reconstruction performance using a super-resolution calibrated SM. For image reconstruction, we selected the phantom shape and resolution from the OpenMPI dataset. Additionally, we simulated numerical phantom M (see Fig. 3(c)) in the simulation dataset. The corresponding reconstruction results are listed in Tables III and IV.
The results of image reconstruction and SM calibration are consistent. The proposed ProTSM achieves the best performance for the three metrics (nRMSE, PSNR, and SSIM) on both the OpenMPI and simulation datasets. On the OpenMPI dataset, ProTSM outperforms single-modal-based methods at high downsampling ratios. The PSNRs of ProTSM and the best single-modal-based model are 35.90 and 33.46 (7.29% improvement), respectively, for phantom shape at a downsampling ratio of 2, and 31.57 and 28.11 (12.3% improvement) at a ratio of 4. A similar trend is observed for phantom resolution. Our proposed ProTSM still performs better than the two multimodal methods. On the simulation dataset, ProTSM also performs better (PSNRs of 38.25 and 36.53 for downsampling ratios of 2 and 4, respectively). Therefore, ProTSM consistently outperforms the other evaluated methods. Fig. 5 shows two reconstructed images for qualitative evaluation. The figure shows the center slice of the 3D images and the 3D error map averaged along the Z-axis for phantom resolution. All methods provide an acceptable image quality in the center slice for a downsampling ratio of 2. However, the error maps show how poorly the interpolation-based methods perform with 3D images. When the downsampling ratio is 4, the baseline models reconstruct low-quality images polluted with noise and artifacts. Conversely, the proposed ProTSM provides better image quality. These qualitative results demonstrate that ProTSM is robust even at a high downsampling ratio.

C. Comparisons With State-of-the-Art 2D Methods
For comparison with TranSMS, the state-of-the-art model for 2D SM calibration, we adapted the proposed ProTSM to handle 2D data. We first conducted the same experiment using the same dataset as in [36]. We directly referenced that study's results, and the 2D SM calibration comparison results are listed in Table V. ProTSM performs similarly to TranSMS for small downsampling ratios of 2 and 4 and outperforms TranSMS for a high downsampling ratio. However, the SM calibration results of all methods are insufficient for a downsampling ratio of 8, which may mean that the nRMSE metric is not meaningful in this regime.
Four representative methods (bicubic, SRCNN, TranSMS, and ProTSM) were selected for further validation, and another experiment (OpenMPI calibration 7 for training and calibration 6 for testing) was conducted in 2D settings. Table VI and Fig. 6 show the results. For a ratio of 4, ProTSM and TranSMS continue to perform better in terms of SM calibration and image reconstruction. Although bicubic achieves a better nRMSE for SM recovery at a ratio of 8, the metrics of the reconstructed image are lower. All calibrated SMs fail to reconstruct a satisfactory image; therefore, nRMSE may not be able to assess a model's performance in such a scenario.

D. Application to In-House MPI Systems
We applied the proposed ProTSM to in-house MPI systems to improve the quality of the image reconstruction. We estimated high-resolution SMs from measured low-resolution SMs and reconstructed images using both the measured and estimated high-resolution SMs. The corresponding results are shown in Fig. 7. The reconstructed images from two phantoms are shown. We measured the phantom resolution using two parallel cylindrical tubes filled with Perimag MNPs, 3 mm apart, using the FFP scanner (top of Fig. 7), and the phantom vessel using the FFL scanner (bottom of Fig. 7). The boundaries of the reconstructed images appear mixed for the measured low-resolution SM, whereas the image reconstructed using the high-resolution SM shows better quality for phantom resolution. For the phantom vessel, the image reconstructed using the low-resolution SM does not distinguish the vascular bifurcation in the upper-right region, whereas the image generated by the calibrated SM clearly shows that structure. The evaluation results of ProTSM embedded in the in-house FFP and FFL scanners validate our proposal.

E. Ablation Studies
We also investigated the impact of three main design components in the proposed ProTSM: the pretraining strategy, the modeling of coil information, and the transformer architecture. Three ablation models [ProTSM-scratch (ProTSM without the pretraining strategy), ProTSM-w/o coil information (ProTSM without coil information and the pretraining strategy), and ProTSM-CNN (the transformer layers of ProTSM-w/o coil information replaced with an equal number of convolution layers)] were evaluated on the public OpenMPI dataset. The other experimental settings remain as in Section IV-B. Both SM calibration and image reconstruction tasks were conducted, and the corresponding results are shown in Tables VII and VIII and Fig. 8.
Regarding the pretraining strategy, the nRMSE values of ProTSM without pretraining (ProTSM-scratch) are 3.29% and 4.33% for downsampling ratios of 2 and 4, respectively, corresponding to a performance decline of approximately 6%. ProTSM-w/o coil information refers to ProTSM results that do not consider the coil channel and frequency index; the corresponding nRMSE values for downsampling ratios of 2 and 4 are 3.44% and 4.45%, respectively. Finally, to investigate the impact of the transformer, the encoder was replaced with a CNN. Without the transformer, the performance is comparable to that of the CNN-based models (VolumeNet and 3dSMRnet). Therefore, super-resolution SM calibration benefits from the transformer, as discussed in [36].
Additionally, Fig. 8(a) shows the image reconstruction and error-map results using the calibrated SMs for a downsampling ratio of 4. The ProTSM-scratch-reconstructed image contains more surrounding artifacts. Additionally, ProTSM without coil information generates a distorted image, and ProTSM-CNN shows low image quality.
We further highlight the effectiveness of the proposed pretraining strategy. The training loss and test nRMSE curves for ProTSM trained with and without pretraining are shown in Fig. 8(b). Compared with training from scratch, finetuning yields a lower training loss. Furthermore, the test nRMSE indicates that finetuning provides better performance and stability. These results confirm the importance and contribution of the proposed pretraining strategy.

F. Visualization Results
To provide an intuitive understanding, this section visualizes hidden features from the transformer layer. Specifically, we averaged the feature maps over the token dimension after obtaining them from the final transformer layer. We used t-SNE to visualize the representations in Fig. 9(a). Additionally, three examples from the test set demonstrate the impact of coil information. The performance of ProTSM-w/o coil information and ProTSM-scratch is compared, and the attention map is calculated using the frequency index and coil channel as seeds. The attention mask is averaged over the two tokens, and the top 25% activation areas are preserved. The results are shown in Fig. 9(b). The attention mask covers relatively important areas, and the coil information may help ProTSM-scratch perform better. These results show that the SM calibration task may benefit from the coil information.

VI. DISCUSSION
To accelerate 3D SM calibration for MPI, we propose a transformer-based method to model the relationships between SM rows and a pretraining strategy that uses unlabeled data. The estimated time to obtain a high-resolution SM for the OpenMPI dataset is shown in Table IX. The measurement time cost is estimated following [38]. Direct measurement and CS methods take considerable time to recover the SM. Interpolation-based methods notably shorten the calibration time, but the quality of the recovered SM is unsatisfactory, especially at high downsampling ratios. Deep learning-based approaches reduce the calibration time to the hundred-second level. Considering that SM calibration does not need to run in real time, the proposed method, like other deep learning-based approaches, efficiently saves time and labor costs compared with direct measurement. Moreover, in light of the quality of the recovered SM, our proposed method may also strike a more desirable balance between SM recovery accuracy and calibration time.
Existing methods conceptualize SM calibration as a super-resolution task akin to that for natural images, but the required calibration accuracy of the SM frequency components is higher than the reconstruction accuracy expected of natural images. Additionally, the spatial size of the SM rows (32 × 32 × 32) is significantly smaller than that of natural and medical images (e.g., 256 × 256 × 128). In large images, the relationship between distant pixels is relatively weak, while the SM's compact size promotes stronger relationships between its elements. Considering the high accuracy required and the strong relationships between elements, SM calibration may benefit more from modeling long-range dependencies than natural-image reconstruction does. This may explain the notable contribution of the transformer architecture to SM calibration.
To prevent overfitting owing to the high complexity of the transformer, we introduce a pretraining strategy that leverages low-resolution SM data. A low-resolution SM is easily collected during the development of an MPI system: the small SM may be measured repeatedly throughout development to verify system performance. In contrast, measuring the full-size SM during development is not worthwhile because it becomes inaccurate after each system upgrade. Hence, massive low-resolution SM data can be collected during the development process and used for SM calibration.
Despite the success of previous SM recovery studies [16], [17], [18], [36], they may have overlooked the potential benefit of hardware information (e.g., the coil information in this study). Numerous studies have shown the importance of multimodal data fusion learning [46], [47], e.g., non-image data in medical image analysis. However, the effectiveness of multimodal data (i.e., frequency index and coil channel) in the MPI area had not been evaluated. This study introduces previously overlooked hardware information and validates its effectiveness for SM recovery.
One limitation of our study is that the robustness of the proposed method has not been validated in in vivo imaging. Several phantoms were imaged in vitro to assess the image reconstruction task, and we only assessed the performance using nRMSE, PSNR, and SSIM. These metrics evaluate the overall quality of the reconstructed image but may be insufficient for assessing specific image details, especially in in vivo imaging. Different nanoparticle behaviors have been observed between in vitro and in vivo settings because the tracers' signals change when they interact with biological tissue [48], [49]. Therefore, higher metric values (PSNR and SSIM) may not guarantee better performance in in vivo imaging, especially for clinical applications. The solution to this problem remains an open debate. We intend to develop better metrics to address this problem and to validate the effectiveness of our proposed method in in vivo settings in future research.
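For reference, the reported scalar metrics can be computed as in the sketch below. These are common textbook definitions; the paper's exact normalization may differ slightly, and SSIM is omitted because it involves windowed local statistics.

```python
import numpy as np

# Common definitions of the reported metrics; treat these as illustrative,
# since the paper may use a slightly different normalization.

def nrmse(pred, gt):
    """Root-mean-square error normalized by the norm of the ground truth."""
    return np.linalg.norm(pred - gt) / np.linalg.norm(gt)

def psnr(pred, gt, data_range=1.0):
    """Peak signal-to-noise ratio in dB for data scaled to [0, data_range]."""
    mse = np.mean((pred - gt) ** 2)
    return 10 * np.log10(data_range ** 2 / mse)

# Toy example: a square phantom with a uniform 0.1 intensity error.
gt = np.zeros((8, 8)); gt[2:6, 2:6] = 1.0
pred = gt + 0.1
print(round(nrmse(pred, gt), 3), round(psnr(pred, gt), 1))  # 0.2 20.0
```

Both metrics summarize the error over the whole image, which is why fine structural details, particularly those relevant in vivo, can be under-weighted.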
There are two future research directions for improving the current study: 1) Better utilization of multimodal information. We use the coil channel and frequency index for SM calibration, but our integration method may not be optimal. Hence, multimodal information should be more fully used to model the relationships between SM rows and improve the calibration accuracy. For example, graph networks [50], [51] may better model these relationships using graph structures; developing SM calibration methods based on such networks may therefore be a direction worth exploring.
2) More powerful pretraining strategies. We introduce a pseudo-label-based pretraining strategy to use available unlabeled data. More powerful pretraining strategies should be explored and analyzed. For example, more accurate and transferable pseudo-labels should be generated for different downstream datasets. Additionally, self-supervised pretraining has demonstrated its effectiveness on medical data [52], [53]. Fusing such pretraining strategies may further improve SM calibration.

VII. CONCLUSION
We proposed a transformer-based model, ProTSM, for fast 3D SM calibration that uses multimodal information. Additionally, we proposed a pretraining strategy to fully use available unlabeled SM data. Our results on the OpenMPI and simulation datasets demonstrated that ProTSM outperforms other methods. Moreover, the results for in-house MPI systems indicated the applicability and generalization ability of ProTSM.

Fig. 1. Visualization of t-distributed stochastic neighbor embedding (t-SNE) of SM rows. (a) Illustration of SM dimension reduction by the embedding method. (b), (c) Visualization results for OpenMPI calibration dataset 5. Each point represents one SM row; the color indicates its receiving coil in (b) and its frequency in (c).

Fig. 2. (a) Overall framework of the proposed method. (b) Illustration of our proposed pseudo-label-based pretraining strategy. (c) The fine-tuning process after pretraining.
and W_p and b_p are trainable parameters. Before feeding z_i into the transformer encoder, global tokens are prepended so that they can participate in self-attention calculations with the other image tokens. Therefore, the final input is constructed as
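As a minimal sketch of this token construction (all dimensions, token names, and the random initializations are illustrative assumptions, not the paper's actual settings), the two global tokens are simply concatenated in front of the N image tokens:

```python
import numpy as np

# Sketch of building the transformer input: N image tokens from the patch
# embedding plus two global tokens carrying the coil channel and frequency
# index. All dimensions here are illustrative assumptions.
N, F = 512, 128                       # number of image tokens, embedding dim
rng = np.random.default_rng(0)

z = rng.standard_normal((N, F))       # patch embeddings z_i = W_p x + b_p
e_coil = rng.standard_normal((1, F))  # global token for the coil channel
e_freq = rng.standard_normal((1, F))  # global token for the frequency index

# Prepend the global tokens so they attend to every image token.
tokens = np.concatenate([e_coil, e_freq, z], axis=0)
print(tokens.shape)  # (514, 128)
```

In this arrangement, each global token can exchange information with every image token through self-attention, which is how the coil and frequency metadata condition the recovery of each SM row.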

Fig. 3. (a), (b) 3D schematic diagrams of the field-free point (a) and field-free line (b) scanners. (c) The numerical phantom "M" used in the simulation dataset. (d), (e) Phantoms used in the field-free point (d) and field-free line (e) scanners for 2D imaging.
generating high-resolution frequency components through successive 3D convolution blocks. Considering that e_i^p and e_i^f are encoded into image tokens, they are not involved in SM construction during decoding. Let z_i^L ∈ R^(N×F) be the output of the transformer encoder without the coil tokens. We reshape z

we used SM calibration experiment 7 with Synomag-D MNPs (Micromod GmbH, Germany) to construct the training set and evaluated the model performance on calibration experiment 6 with Perimag MNPs (Micromod GmbH, Germany). This setting was intended to evaluate the generalization ability across different MNP types. In both the training and test sets, we only preserved the SM rows with a signal-to-noise ratio SNR > 3, leaving 4129 and 3290 rows for the training and test sets, respectively.
• Simulation dataset. We rewrote a 3D version of the SM simulation based on a code1 and [39]. The FOV size was 40 mm × 40 mm × 40 mm, and the grid size was 40 × 40 × 40. The sampling frequency was 1 MHz. The drive frequencies along the X, Y, and Z axes were 24.51 kHz, 26.04 kHz, and 25.25 kHz, respectively. The MNP temperature was 300 K, and the Boltzmann constant k_B was set to 1.38 × 10^−23 J/K. We evaluated the model generalization performance for different MNP diameters and selection field gradients. In particular, the training set included three 3D SMs (gradients of 0.5 T/m, 1 T/m, and 5 T/m) with an MNP diameter of 25 nm. For the test set, the SM gradient and MNP diameter were 1 T/m and 12.5 nm, respectively. The remaining training and test sets contained 3933 and 1311 rows, respectively. The phantom used for imaging is shown in Fig. 3(c).
2) Pretraining Dataset: We obtained low-resolution SM data from OpenMPI calibration experiments 7, 8, and 9.
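The simulation parameters above (temperature, Boltzmann constant, MNP diameter) typically enter such SM simulators through the Langevin model of equilibrium MNP magnetization. The sketch below evaluates it with those parameters; the saturation magnetization M_s is an illustrative assumption (a typical magnetite value), not a value from the paper.

```python
import numpy as np

# Langevin model of MNP magnetization, commonly used in MPI SM simulations.
# Temperature, Boltzmann constant, and diameter follow the simulation setup
# above; the saturation magnetization M_s is an illustrative assumption.
k_B = 1.38e-23        # Boltzmann constant, J/K
T = 300.0             # temperature, K
mu0 = 4e-7 * np.pi    # vacuum permeability, T*m/A
d = 25e-9             # MNP core diameter, m (training-set value)
M_s = 474e3           # saturation magnetization, A/m (assumed)

m = M_s * np.pi / 6 * d ** 3          # magnetic moment of one particle, A*m^2

def langevin(H):
    """Mean magnetization fraction L(x) = coth(x) - 1/x for field H in A/m."""
    x = mu0 * m * np.asarray(H, dtype=float) / (k_B * T)
    with np.errstate(divide="ignore", invalid="ignore"):
        full = 1.0 / np.tanh(x) - 1.0 / x
    # Series expansion x/3 near zero avoids the coth singularity.
    return np.where(np.abs(x) < 1e-6, x / 3.0, full)

H = np.linspace(-20e3, 20e3, 5)       # applied field sweep, A/m
print(langevin(H))                    # odd, monotonic, bounded in (-1, 1)
```

The nonlinearity (saturation) of this curve is what generates the harmonics recorded in the SM, which is why the MNP diameter and the field gradient change the SM and motivate the generalization experiments above.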
) and 3(b), respectively. For the FFP scanner, the selection field gradient was (−1.7, −1.7, 3.4) T/m along the X, Y, and Z axes. The excitation frequency along the X axis was 25 kHz, and the driving frequency along the Y axis was 20 Hz. A Cartesian trajectory was used to scan the FOV. The sampling frequency was 2.5 MHz. The FOV of the MPI scanner was 20 mm × 20 mm. A delta sample (2 mm × 2 mm) filled with Perimag MNPs was used to measure the low-resolution

Fig. 5. Image reconstruction results for the resolution and shape phantoms in the OpenMPI dataset. The first row shows the reconstructed images, and the second row shows the corresponding 3D error maps averaged along the Z-axis. The numbers "2" and "4" indicate the downsampling ratio. The GT image is reconstructed using the measured full-size SM.

Fig. 6. 2D image reconstruction results of four representative methods for the resolution phantom in the OpenMPI dataset at ratio 4.

Fig. 7. Images reconstructed with the raw measured low-resolution SM and the predicted high-resolution SM for two in-house MPI instruments. The first and second rows show the image reconstruction results of the FFP (resolution phantom of two parallel cylindrical tubes with 3 mm spacing) and FFL (vessel phantom) instruments, respectively.

Fig. 8. (a) Reconstructed images based on SMs predicted by ProTSM variant models. (b) Variation in training loss and test nRMSE for the fine-tuning and train-from-scratch modes across training epochs.
(a). ProTSM-rand.init denotes the ProTSM model without training (i.e., with randomly initialized model parameters). The low-resolution SM rows are intermixed before training, and they are clustered more closely by frequency index after training. This demonstrates that the calibration may help the low-resolution SM rows regain the coil-related properties.

Fig. 9. (a) t-SNE visualization of SM rows generated by the model. The color of each point represents the frequency index. ProTSM-rand.init indicates the ProTSM model without training. (b) Qualitative visualizations of ProTSM-scratch and ProTSM-w/o-coil-information for three representative SM rows. The attention mask indicates the most attentive areas with the coil information as the seed.

TABLE II
SM CALIBRATION RESULTS ON THE OPENMPI AND SIMULATION DATASETS

TABLE III
IMAGE RECONSTRUCTION RESULTS BASED ON THE CALIBRATED SM ON THE OPENMPI DATASET AND IMAGE RECONSTRUCTION RESULTS OF THE FOUR REPRESENTATIVE METHODS ON THE OPENMPI DATASET. THE NRMSE METRIC IS USED TO ASSESS SM RECOVERY, AND THE PSNR AND SSIM METRICS ARE USED TO ASSESS THE QUALITY OF THE IMAGES RECONSTRUCTED WITH THE SM

TABLE VII
ABLATION RESULTS ON THE OPENMPI DATASET FOR SM CALIBRATION. THE NUMBERS INDICATE THE NRMSE METRIC

TABLE VIII
ABLATION RESULTS ON THE OPENMPI DATASET FOR IMAGE RECONSTRUCTION. THE DOWNSAMPLING RATIO IS 4

TABLE IX
COMPARISON OF ESTIMATED TIMES (IN SECONDS) FOR THE HIGH-RESOLUTION SM IN THE OPENMPI DATASET