Generating synthetic computed tomography for radiotherapy: SynthRAD2023 challenge report

However, no significant correlation was found between image similarity metrics and dose accuracy, emphasizing the need for dose evaluation when assessing the clinical applicability of sCT. SynthRAD2023 facilitated the investigation and benchmarking of sCT generation techniques, providing insights for developing MRI-only and CBCT-based adaptive radiotherapy. It showcased the growing capacity of deep learning to produce high-quality sCT, reducing reliance on conventional CT for treatment planning.


Introduction
More than half of cancer patients receive radiotherapy as the standard of care, providing effective local treatment (Chandra et al., 2021). Radiotherapy is typically delivered daily over several weeks (Mitchell, 2013), aiming to deliver a high radiation dose to the target while minimizing the dose to the surrounding healthy tissue. To achieve conformal radiation treatment, obtaining an electron density map of the patient's anatomy is crucial to determine beam attenuation and local dose deposition (Grégoire and Mackie, 2011). This electron density information is currently obtained through computed tomography (CT) (Seco and Evans, 2006). However, tumors are not always clearly visible on CT, and magnetic resonance imaging (MRI) has been proposed instead, as its superior soft-tissue contrast offers improved visibility of tumor boundaries and organs-at-risk (OARs) (Schmidt and Payne, 2015). Moreover, patient anatomy may vary throughout the treatment course. In adaptive radiotherapy, new treatment plans are generated weekly or daily, while the patient is on the treatment couch, to maintain dose conformality. During adaptive radiotherapy, typically cone-beam CT (CBCT) (Nijkamp et al., 2008) or MRI (Lagendijk et al., 2014) is the sole imaging modality at hand. However, neither MRI nor CBCT allows for direct treatment plan optimization, as accurate electron density information is lacking. Techniques have been developed to generate synthetic CT (sCT), also called pseudo-CT, virtual CT, or surrogate CT, from MRI and CBCT to aid in determining local beam attenuation and dose deposition for treatment planning (Edmund and Nyholm, 2017). sCT generation has paved the way for MRI-based treatment planning (MRI-only radiotherapy) and CBCT-based adaptive radiotherapy, which avoid additional radiation exposure from imaging and reduce treatment centers' workload by omitting unnecessary scans.
Although several approaches for obtaining sCT exist, including bulk density override and atlas-based methods, deep neural networks have recently shown promise in generating sCT (Spadea et al., 2021). Neural networks can be broadly categorized into convolutional neural networks (CNNs), e.g., U-net (Ronneberger et al., 2015); generative adversarial networks (GANs) (Goodfellow et al., 2014), e.g., cycleGAN (Zhu et al., 2017) and pix2pix (Isola et al., 2017); and, more recently, (vision) transformers (Vaswani et al., 2017; Dosovitskiy et al., 2020) and diffusion models (Ho et al., 2020). Paired (supervised) and unpaired (unsupervised) training approaches have been suggested, depending on the network architecture. Models have been trained using 2-dimensional (2D) slices or 3D CT and MRI/CBCT volumes. Moreover, 2.5D approaches considering neighboring slices or perpendicular planes have been introduced to handle spatial information and coherence while maintaining performance and feasible memory use. Most of these papers claim that their sCT generation method outperforms others. However, networks are often trained on different datasets and anatomies and evaluated using different metrics, making consistent methodological comparison difficult. Moreover, most sCT methods are evaluated based on image similarity metrics, whereas what ultimately matters is the effect of sCT on the treatment plan dose distribution, and image metrics do not necessarily reflect dose accuracy (Kieselmann et al., 2018). This lack of a fair comparison hinders the identification of the best network design choices for clinical sCT tools.
To address these issues and provide a fair comparison, we organized the SynthRAD2023 Grand Challenge, held in conjunction with MICCAI 2023. In the challenge, we provided ground truth data and developed methods to facilitate fair model comparisons and increase the understanding of how different network designs influence performance. The challenge encouraged the development and evaluation of state-of-the-art algorithms for generating accurate and clinically relevant sCT images from MRI and CBCT data. Two tasks were defined based on a new publicly available dataset (Thummerer et al., 2023a): (1) MRI-to-CT generation for MRI-only and MRI-guided radiotherapy and (2) CBCT-to-CT generation for image-guided adaptive radiotherapy (IGART) and online adaptive radiotherapy.
This paper reviews the challenge participation, evaluation, and ranking of the submitted algorithms based on image similarity and dose assessment for sCTs compared to ground truth CTs. The analysis explores trends in the submitted algorithms and their correlation with overall performance, focusing on the impact of variation within the dataset and the metrics chosen for evaluation, and examining ranking stability.

Fig. 1. The SynthRAD2023 pipeline. Left: the participants' algorithms generate sCT from input MRI or CBCT images. Middle: the obtained sCT is evaluated with image similarity metrics (comparing sCT images to ground truth CT images) and dose metrics (comparing dose distributions recalculated on sCT and ground truth CT for pre-planned photon and proton treatment plans). Right: after calculating the metrics, the winner is determined by applying a ranking approach.

Challenge setup
The SynthRAD (Synthesizing Computed Tomography for Radiotherapy) challenge allowed teams to test and compare their sCT algorithms. The challenge was hosted on the Grand Challenge website https://synthrad2023.grand-challenge.org. It consisted of two tasks: task 1 involved generating sCT from MRI data, while task 2 focused on generating sCT from CBCT data. Each task comprised two subtasks involving the brain and pelvis regions.
The organizing team arose from the ''Image synthesis & reconstruction'' expertise subgroup of the Dutch deep learning in radiotherapy initiative www.DLinRT.org. The organizing group encompasses early-stage researchers, PhD candidates, postdocs, four assistant professors, and one associate professor from five Dutch University Medical Centers and three Dutch Technical Universities.
Fig. 1 presents an overview of the SynthRAD2023 Grand Challenge design, including the algorithms developed by the participants and the evaluation and ranking procedures performed by the organizers. Participants were tasked with developing and training models capable of generating accurate sCT images using only input MRI or CBCT, and could take part in task 1, task 2, or both. Only fully automated methods trained from scratch on the provided data were allowed; in other words, pre-trained models were not permitted. The submissions were automatically evaluated in the Grand Challenge environment. Further details regarding participation rules and policies can be found in Appendix A. As the sCTs are intended for radiotherapy, we analyzed photon and proton dose metrics alongside image similarity metrics, as described in Section 2.5. To determine the winner of the challenge, we ranked the teams based on these metrics, as further explained in Section 2.6. To ensure transparency and enable further exploration of the methods employed during the challenge, the data preprocessing and evaluation code can be accessed at https://github.com/SynthRAD2023.

Challenge phases
The challenge was divided into four phases: training, validation, preliminary test, and test. Teams had two months to familiarize themselves with the challenge and begin training their algorithms, as the training data was released on April 1, 2023. The validation phase began on June 1, 2023, and was a Type-1 challenge in which participants were required to run inference locally and submit the corresponding sCTs. This phase allowed up to two submissions every four days, and the submitted sCTs were automatically assessed using image similarity metrics. The results were then updated on an open leaderboard, allowing real-time comparison between participating teams. The ground truth CT images used for validation were not shared with the participants to prevent biased results. The final test phase was a Type-2 challenge in which teams had to upload a Docker image containing their method, which was run and evaluated on the Grand Challenge platform. The test data and ground truth CTs were kept hidden. To familiarize participants with a Type-2 challenge, we introduced the preliminary test phase, which started on May 1, 2023. The preliminary test phase used six cases, and only image similarity metrics were evaluated. The final test phase started on July 16, 2023, and lasted five weeks. The preliminary test phase and test phase ended on August 22, 2023. Teams were required to upload a Docker image of their algorithm and a description of their methods. To minimize algorithm tweaking to the test data, each team could submit only twice during the test phase, and only the last submission was counted. The second submission allowed participants to correct potential errors arising during the first submission. During this phase, the generated sCT images underwent an image similarity evaluation and a photon and proton dose evaluation to verify the most relevant metrics for radiotherapy. The image similarity metrics were calculated online on the platform provided by Grand Challenge, and
the dose evaluation was performed offline due to the computational resources required. At the end of the test phase, the final ranking was published to show the performance of the participating teams. After the challenge, a post-challenge test phase was opened, and the preliminary and validation phases were reopened to enable continuous evaluation of algorithms until September 20, 2028.

Dataset
Data from 1080 patients undergoing radiotherapy treatment were included in the SynthRAD2023 dataset. The dataset consisted of imaging data from three Dutch University Medical Centers. Both task 1 (MRI-to-CT) and task 2 (CBCT-to-CT) included data from 270 patients for both the brain and pelvis anatomies (leading to 2 × 2 × 270 image pairs). The 270 cases were divided into a training, validation, and test set of 180, 30, and 60 patients, respectively. The dataset consisted primarily of adult patients, with no gender restrictions applied. Only patients for whom the MRI or CBCT was acquired within two months of the CT were included, to limit anatomical changes. It should be emphasized that the datasets for task 1 and task 2 did not contain the same patients. A detailed dataset description can be found in the publication by Thummerer et al. (2023a). Ethical approval was obtained from the data-providing institutes' internal review boards/medical ethical committees. The data was released under the CC BY-NC (Creative Commons Attribution-NonCommercial) license and made available via Zenodo at https://zenodo.org/doi/10.5281/zenodo.7835406 (train), https://zenodo.org/doi/10.5281/zenodo.7868168 (validation), and https://doi.org/10.5281/zenodo.10514185 (test, available from 01-01-2028).
The imaging protocols used to acquire the MRI and CBCT adhered to the clinical routines of the individual centers. As a result, variations in the MRI, CBCT, and CT imaging protocols were present between centers and between datasets, which is representative of real-world application scenarios. A comprehensive table detailing the imaging parameters was provided alongside the dataset (Thummerer et al., 2023a). For task 1, MRIs were acquired with scanners from two different vendors using different settings per site. Centers A and C used MRI scanners with field strengths of 1.5 T and 3 T, while center B exclusively utilized a 1.5 T scanner. T1-weighted gradient echo was selected for all brain data. The datasets from centers B and C included T1-weighted MRI acquired after Gadolinium contrast agent injection, whereas those from center A were acquired without contrast agent. The pelvis data comprised two-thirds T1-weighted gradient echo sequences and one-third T2-weighted spin echo sequences. For task 2, CBCTs were acquired with Linacs from two different vendors. The two sites that scanned CBCTs with Linacs from the same vendor used different acquisition protocols.
As described by Thummerer et al. (2023a), the data was preprocessed by resampling the voxel size to 1 × 1 × 1 mm³ for the brain and 1 × 1 × 2.5 mm³ for the pelvis patients. For brain cases, the face was removed to protect patient privacy. The patient outline was automatically segmented on the MRI/CBCT using thresholding and morphological operations, followed by a dilation of 20 voxels in the axial plane and 2 voxels in the superior-inferior direction. To ensure alignment between the MRI/CBCT and the CT, the field of view of both was adjusted based on the patient outline, and rigid registration was performed. The resulting mask, including surrounding air, was provided and could be used by the participants for preprocessing.
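The anisotropic mask dilation described above can be sketched with standard tools; this is a minimal illustration, not the organizers' exact code, and the (z, y, x) axis order and use of scipy are our assumptions:

```python
import numpy as np
from scipy.ndimage import binary_dilation

def dilate_body_mask(mask):
    """Dilate a binary body mask by 20 voxels in the axial (y, x) plane and
    2 voxels in the superior-inferior (z) direction, mimicking the
    SynthRAD2023 preprocessing. Assumes axis order (z, y, x)."""
    # 20 iterations with a full in-plane 3x3 neighborhood (z fixed)
    in_plane = np.ones((1, 3, 3), dtype=bool)
    mask = binary_dilation(mask, structure=in_plane, iterations=20)
    # 2 iterations along the superior-inferior axis only
    along_z = np.ones((3, 1, 1), dtype=bool)
    mask = binary_dilation(mask, structure=along_z, iterations=2)
    return mask
```

Applied to a single seed voxel, this grows a 41 × 41 in-plane square extended over 5 slices, matching the stated dilation distances.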

Baseline algorithms
Two bulk-assignment baseline sCT models were used to provide insight into the evaluation metrics: ''water'' and ''stratified''. The water approach assigned 0 HU to voxels within the dilated body contour mask and −1000 HU (air) outside the mask. As suggested by Maspero et al. (2017), the stratified approach was employed to obtain images resembling bulk-assigned sCT without geometrical deformations by starting from the ground truth CT. Stratified sCTs were obtained by classifying the ground truth CT voxels into five categories based on HU intensity and assigning each class a population-derived bulk HU value (Maspero et al., 2017), denoted as [lower bound, upper bound) HU → xx HU: 'air' ((−∞, −210) HU → −968 HU), 'adipose tissue' ([−210, −20) HU → −86 HU), 'soft tissue' ([−20, 120) HU → 42 HU), 'bone marrow' ([120, 555) HU → 198 HU), and 'cortical bone' ([555, ∞) HU → 949 HU). The bone segmentation was further refined using a binary hole-filling algorithm to avoid soft tissue and air voxels within bone structures.
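The stratified bulk assignment can be illustrated as follows; a sketch using the HU ranges quoted above (the function name `stratified_sct` is ours, not from the challenge code):

```python
import numpy as np
from scipy.ndimage import binary_fill_holes

# Class boundaries and bulk HU values quoted above (Maspero et al., 2017)
BOUNDS = (-np.inf, -210.0, -20.0, 120.0, 555.0, np.inf)
BULK_HU = (-968.0, -86.0, 42.0, 198.0, 949.0)  # air, adipose, soft, marrow, cortical

def stratified_sct(ct):
    """Bulk-assign each CT voxel to the population-derived HU of its class,
    then fill holes in the bone class so that trapped soft-tissue/air voxels
    inside bone structures also receive the cortical-bone value."""
    sct = np.empty(ct.shape, dtype=np.float64)
    for lo, hi, hu in zip(BOUNDS[:-1], BOUNDS[1:], BULK_HU):
        sct[(ct >= lo) & (ct < hi)] = hu
    bone_filled = binary_fill_holes(ct >= 555.0)
    sct[bone_filled] = BULK_HU[-1]
    return sct
```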

Evaluation
The sCTs generated by the participants were compared to the ground truth CTs based on metrics comparing image similarity and dose accuracy.

Image similarity
During the validation and test phases of the SynthRAD2023 Grand Challenge, the accuracy of the generated sCT images was evaluated using image similarity metrics within the dilated body contour masks M = {i | mask(i) = 1} provided with the dataset. This evaluation aimed to assess how closely the sCTs resembled the reference CTs. The mean absolute error (MAE), peak signal-to-noise ratio (PSNR), and structural similarity index measure (SSIM) were considered as image similarity metrics, as they are commonly used in medical image synthesis (Spadea et al., 2021).
Masked MAE was calculated to measure the average absolute difference between corresponding voxels in the sCT and CT, defined as

MAE = (1 / |M|) ∑_{i∈M} |CT(i) − sCT(i)|,

in which the sum runs over the voxels inside the body contour M and is normalized by the total number of masked voxels |M|.
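A masked MAE of this form can be computed directly; an illustrative sketch, not the official evaluation code:

```python
import numpy as np

def masked_mae(ct, sct, mask):
    """Mean absolute error in HU over the voxels inside the body-contour mask."""
    m = mask.astype(bool)
    return np.abs(ct[m].astype(np.float64) - sct[m]).mean()
```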
Masked PSNR was calculated to quantify the ratio of the maximum signal intensity over the noise level in the sCT compared to the CT, defined as

PSNR = 10 · log10(Q² / MSE),

where MSE is the mean squared error between CT and sCT within the mask and Q is the dynamic range of the voxel intensities ([−1024, 3000] HU, i.e., Q = 4024 HU). The CT and sCT were clipped to this dynamic range before calculating the masked PSNR.
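The masked PSNR with the stated clipping can be sketched as (illustrative, not the official evaluation code):

```python
import numpy as np

HU_MIN, HU_MAX = -1024.0, 3000.0
Q = HU_MAX - HU_MIN  # dynamic range: 4024 HU

def masked_psnr(ct, sct, mask):
    """PSNR in dB inside the mask, after clipping both volumes to [-1024, 3000] HU."""
    m = mask.astype(bool)
    a = np.clip(ct, HU_MIN, HU_MAX)[m].astype(np.float64)
    b = np.clip(sct, HU_MIN, HU_MAX)[m].astype(np.float64)
    mse = np.mean((a - b) ** 2)
    return 10.0 * np.log10(Q ** 2 / mse)
```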
Masked SSIM was calculated to assess the structural similarity between CT and sCT. The SSIM for a voxel i between two images x and y is computed as

SSIM(i) = [(2 μ_x μ_y + c1)(2 σ_xy + c2)] / [(μ_x² + μ_y² + c1)(σ_x² + σ_y² + c2)],

where μ_x, μ_y and σ_x², σ_y² are the means and variances of x and y within a k × k × k window centered on voxel i, and σ_xy is the covariance of x and y within the same window. Here k = 7 is the window size, and c1 = (0.01 · Q)² and c2 = (0.03 · Q)² are normalization constants, with Q = (3000 − (−1024)) HU the dynamic range of the volumes. The final masked SSIM value is the mean of SSIM(i) over all voxels in the mask, where the intensities of both the CT and sCT were clipped to [−1024, 3000] HU and then adjusted to be non-negative by adding 1024 HU.
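A masked SSIM in this spirit can be sketched using a uniform box filter for the local window statistics; details such as edge handling may differ from the official evaluation code:

```python
import numpy as np
from scipy.ndimage import uniform_filter

HU_MIN, HU_MAX = -1024.0, 3000.0
Q = HU_MAX - HU_MIN
C1, C2 = (0.01 * Q) ** 2, (0.03 * Q) ** 2

def masked_ssim(ct, sct, mask, win=7):
    """Mean SSIM over masked voxels; local statistics from a win^3 box filter."""
    x = np.clip(ct, HU_MIN, HU_MAX).astype(np.float64) - HU_MIN  # shift to >= 0
    y = np.clip(sct, HU_MIN, HU_MAX).astype(np.float64) - HU_MIN
    mu_x, mu_y = uniform_filter(x, win), uniform_filter(y, win)
    var_x = uniform_filter(x * x, win) - mu_x ** 2
    var_y = uniform_filter(y * y, win) - mu_y ** 2
    cov = uniform_filter(x * y, win) - mu_x * mu_y
    ssim_map = ((2 * mu_x * mu_y + C1) * (2 * cov + C2)) / (
        (mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2))
    return ssim_map[mask.astype(bool)].mean()
```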

Dose distribution similarity
Photon and proton intensity-modulated treatment plans were optimized on the reference CT using the matRad treatment planning system (Wieser et al., 2017). For simplicity, the dose was prescribed to the planning target volume (PTV) in both modalities, i.e., no robust optimization was performed for proton plans, with specific doses and isodose levels for the brain and pelvis. Only co-planar plans were considered, with photon plans utilizing 9-13 equi-angled 6 MV beams from a generic Linac model and proton plans utilizing 3-4 beams (from bilateral and oblique angles) from a generic proton system available in matRad. To reduce the dose to healthy tissues and to ensure plan uniformity between patients, we used the same objective functions and constraints available in matRad per treatment site. OAR dose limits were treated as hard constraints whenever possible and were revised on a patient-specific basis when hard constraints were not achievable. For a few patients, the number of beams and some optimization parameters (e.g., optimizer, maximum number of iterations, and objective weights) were also fine-tuned to meet dose prescriptions and OAR limits. All planning goals and OAR dose limits were based on international guidelines for the brain (Lambrecht et al., 2018) and pelvis (Hall et al., 2021) and are summarized in Table 1. Throughout the dose evaluation process, the dose was recalculated on each sCT for both proton and photon treatment plans. This recalculation was carried out without propagating organ delineations or replanning, a deliberate measure to avoid potential differences arising from plan optimization. Subsequently, the differences between the planning dose distributions, originally calculated on CT, and the recalculated dose distributions on the sCT for both photon and proton plans were quantified using three specific metrics. To ensure high reproducibility and facilitate fair comparisons on the SynthRAD2023 test set, the offline dose evaluation will be available at https://doi.org/10.5281/zenodo.10514185 at the time of the release of the test set.
The relative mean absolute dose difference within the high-dose region H = {i | D_CT(i) ≥ 0.9 · D_presc} was calculated to assess the difference in received dose in and around the target, defined as

MAE_dose = (1 / |H|) ∑_{i∈H} |D_CT(i) − D_sCT(i)| / D_presc,

with D_CT (D_sCT) being the dose distribution calculated on the CT (sCT) and D_presc the prescribed dose. Dose-volume histogram (DVH) parameters were calculated to assess the differences in the doses received by the PTV and OARs: the near-minimum dose in the PTV (D98,PTV), the PTV volume receiving at least 95% of the prescribed dose (V95,PTV), the near-maximum dose of a given OAR (D2,OAR), and the mean dose received by a given OAR (Dmean,OAR). The use of the near-minimum and near-maximum dose was suggested by ICRU report 83 (https://www.fnkv.cz/soubory/216/icru-83.pdf). For each parameter p, we included the relative absolute difference |p_CT − p_sCT| / (p_CT + ϵ), where ϵ = 1e−12 avoids division by zero; the OAR terms were averaged over the N_OARs organs at risk. For each patient, we used the three OARs (if available) with the highest average of D2,OAR and Dmean,OAR to analyze dose differences in organs close to the target. We summed the four terms to obtain one final value for the DVH metric. Gamma pass rates were calculated to compare the 3D spatial dose distributions from the sCTs with the dose obtained from the CT. This calculation followed the 3D gamma evaluation described by Low et al. (1998) with a dose-difference criterion of 2% and a distance-to-agreement criterion of 2 mm. The gamma value at each position in the sCT dose was determined by comparison with the CT dose, and gamma pass rates were evaluated within regions receiving doses ≥ 10% of the prescribed dose (Ezzell et al., 2009).
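The DVH parameters named above can be read off dose arrays directly; an illustrative sketch (the percentile convention for the near-minimum/near-maximum and the function name are our assumptions):

```python
import numpy as np

def dvh_parameters(dose, structure, d_presc):
    """ICRU-83-style DVH parameters for one structure.
    D98: near-minimum (dose received by at least 98% of the volume);
    D2: near-maximum; Dmean: mean dose; V95: volume fraction >= 95% of Rx."""
    d = np.asarray(dose, dtype=np.float64)[np.asarray(structure, dtype=bool)]
    return {
        "D98": float(np.percentile(d, 2)),
        "D2": float(np.percentile(d, 98)),
        "Dmean": float(d.mean()),
        "V95": float(np.mean(d >= 0.95 * d_presc)),
    }
```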

Eligibility and ranking
The nine metrics defined above were calculated for each test case and aggregated across all test cases for each participating team (mean ± standard deviation). Teams were not considered in the ranking if their method did not outperform the water baseline on all three individual image similarity metrics. Moreover, a participant's method had to complete the generation of a single sCT within 15 min on the Grand Challenge platform, as described in Appendix A.
Several methods exist for creating a ranking for a challenge with multiple metrics, including (1) calculating the mean over all metrics and ranking the aggregated scores (MeanThenRank), (2) calculating the median over all metrics and ranking the aggregated scores (MedianThenRank), (3) calculating the ranking for each metric and computing the mean of the aggregated ranks (RankThenMean), and (4) calculating the ranking for each metric and computing the median of the aggregated ranks (RankThenMedian). Directly applying MeanThenRank and MedianThenRank to the nine metrics is inappropriate due to their lack of normalization and their differing orderings (ascending or descending). To fairly rank the submissions, each metric was normalized and scaled between zero (indicating the worst average team performance) and one (indicating the best average team performance). Subsequently, the normalized metrics were used to calculate the mean or median and rank the aggregated score.
In the SynthRAD2023 challenge, the MeanThenRank approach was employed to determine the winners. This method accounts for variations in team performance, enabling a fair evaluation considering the diverse clinically relevant aspects of image similarity, photon dose, and proton dose metrics. To analyze biases introduced by the ranking method, we also studied how the other ranking approaches would affect the outcome, to assess ranking stability.
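The normalize-then-MeanThenRank procedure can be sketched as follows; a minimal illustration, not the challenge's exact ranking code:

```python
import numpy as np

def mean_then_rank(scores, higher_is_better):
    """scores: (teams, metrics) array of mean metric values per team.
    Each metric is min-max normalized so 1 = best and 0 = worst average
    team performance; teams are then ranked by the mean normalized score."""
    s = np.asarray(scores, dtype=np.float64).copy()
    # Flip descending metrics (e.g., MAE: lower is better) so larger = better
    for j, hib in enumerate(higher_is_better):
        if not hib:
            s[:, j] = -s[:, j]
    lo, hi = s.min(axis=0), s.max(axis=0)
    norm = (s - lo) / (hi - lo)
    agg = norm.mean(axis=1)
    order = np.argsort(-agg)           # best aggregated score first
    ranks = np.empty(len(agg), dtype=int)
    ranks[order] = np.arange(1, len(agg) + 1)  # rank 1 = winner
    return ranks, agg
```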

Overall sCT performance
Besides computing the aggregated metrics per submission (mean ± standard deviation), we analyzed the significance of one team outperforming another in terms of individual metrics. To do so, we used the Wilcoxon signed-rank test (Wilcoxon, 1945) with Holm's adjustment for multiple testing (Holm, 1979) for each metric separately, offering insights into the pairwise performance differences between teams. The significance level for this test was set at α = 0.05. Additionally, we recorded the inference time of the participants' methods (mean ± standard deviation) to synthesize the CT from the CBCT or MRI data on the Grand Challenge infrastructure.
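The Wilcoxon-plus-Holm procedure can be sketched with scipy; the helper names and the per-pair bookkeeping are illustrative, not the paper's code:

```python
import numpy as np
from itertools import combinations
from scipy.stats import wilcoxon

def holm_adjust(pvals):
    """Holm's step-down adjustment for multiple testing."""
    p = np.asarray(pvals, dtype=np.float64)
    order = np.argsort(p)
    m = len(p)
    adj = np.empty(m)
    running = 0.0
    for rank, idx in enumerate(order):
        running = max(running, (m - rank) * p[idx])  # enforce monotonicity
        adj[idx] = min(1.0, running)
    return adj

def pairwise_tests(per_team_scores, alpha=0.05):
    """per_team_scores: dict team -> per-case metric values (paired cases).
    Returns Holm-adjusted p-values and significance flags for all team pairs."""
    pairs = list(combinations(sorted(per_team_scores), 2))
    raw = [wilcoxon(per_team_scores[a], per_team_scores[b]).pvalue for a, b in pairs]
    adj = holm_adjust(raw)
    return {pair: (p, p < alpha) for pair, p in zip(pairs, adj)}
```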

Model design predictors
We thoroughly evaluated the model design choices adopted by participating teams, aiming to identify the impact of these choices on overall ranking and performance. Statistical significance of the differences in SSIM performance within each subtask was determined using the Mann-Whitney U-test (Mann and Whitney, 1947) (α = 0.01), chosen for its suitability in comparing two independent samples that may not adhere to a normal distribution. This test is particularly robust in the context of our analysis, providing reliable insights into the performance disparities associated with distinct design choices. To define predictors for sCT performance, we analyzed the different design choices and categorized them into seven key aspects: (1) model and anatomy, (2) backbone architecture, (3) supervision, (4) spatial configuration, (5) preprocessing, (6) data augmentation, and (7) postprocessing.

Model and anatomy. Teams used different strategies to handle brain and pelvis data. Some teams used one collective model trained on both brain and pelvis patients (''One model'') or conditioned the collective model on the anatomical region (''One model, anatomy conditional''). In contrast, others trained the same model separately for the brain and pelvis subsets (''Two identical models''). Additionally, some teams used the same or a similar backbone architecture for both regions with distinctions in training parameters or network layers (''Two identical backbones, different training param.'' and ''Two similar models'', respectively). Others employed entirely different models for the two regions (''Two different models'').
Backbone architecture. The basis of most synthesis models was a CNN encoder-decoder, a standard choice for image reconstruction and translation within deep learning. Teams also explored alternative architectures, such as GAN-based models that introduce a discriminator network and adversarial loss, transformer-based architectures that emphasize attention in the synthesis process, and diffusion-model-based approaches that rely on an iterative diffusion process during inference. Moreover, some teams used an ensemble of multiple models to produce the final output.
Supervision. Each team reported the supervision approach adopted. Supervised (paired) training was guided by directly comparing predictions (sCT) to ground truth (CT) from the same cases. Unsupervised (unpaired) training was guided by cycle-consistency, as introduced by Zhu et al. (2017).
Spatial configuration. The implementation of sCT generation models varied in spatial configuration. Fully 3D models, which consider the entire image volume as input, were possible but often restricted by available computing resources; therefore, many teams employed 3D patch-based approaches, 2.5D models considering multiple consecutive 2D slices or a combination of orthogonal slices, full-slice 2D models, or 2D patch-based models.
Preprocessing. A range of preprocessing techniques was used in the submitted algorithms, focusing on resizing and intensity normalization, which are necessary for stable and optimal model training. Resizing was used to achieve a desired voxel size, as in the case of iso-resampling, or the desired model input size. Intensity normalization was implemented linearly at the population level or at the patient level, or as standardization to a specific mean and standard deviation. Furthermore, some teams used intensity clipping to remove outliers, applied histogram matching, or provided a specialized pipeline for the specific modality or anatomical region being processed.
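The two normalization flavors mentioned above can be sketched as follows (illustrative helpers, not any particular team's pipeline):

```python
import numpy as np

def zscore_normalize(img, mask=None):
    """Patient-level standardization: zero mean, unit standard deviation,
    with statistics optionally computed only inside a body mask."""
    vals = img[mask.astype(bool)] if mask is not None else img
    return (img - vals.mean()) / vals.std()

def linear_normalize(img, lo, hi):
    """Linear rescaling of intensities in [lo, hi] to [0, 1], with clipping."""
    return (np.clip(img, lo, hi) - lo) / (hi - lo)
```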
Data augmentation. Various data augmentation techniques ensured a diverse training set, potentially making the models more robust to unseen cases in the test set. Teams introduced randomness through random crop or patch selection, flipping, rotation, blurring, noise addition, and intensity transformations such as bias field or contrast adjustments. Random deformations, whether affine or elastic, were also applied to enhance the diversity of the training set.

Postprocessing. Some teams that implemented patch-based models averaged overlapping patches at test time. The multiple outputs of ensembled methods could be combined into a single sCT. Additionally, specific postprocessing steps were implemented considering prior knowledge of the modality, such as noise and artifact removal. The inversion of original preprocessing steps, such as normalization and padding, was crucial to obtaining the final sCT with accurate dimensions and voxel values in Hounsfield units (HU).

Data influence
By examining the teams' performances, we analyzed the test dataset to identify the characteristics and features of the samples correlating with synthesized image quality. The analysis compared image similarity and dose metrics (mean ± standard deviation) averaged per task, center, and anatomy within the test set. In addition, for task 1, the influence of the MR acquisition protocol and magnetic field strength on performance was investigated. Statistical significance between groups was established using the Mann-Whitney U-test (Mann and Whitney, 1947) (α = 0.01). Lastly, we extended our analysis to the patient level, allowing for a detailed evaluation of low-performing patients.

Metric correlations
For clarity throughout the paper, we defined the term 'metric group' to refer to one of the three categories of evaluation metrics: image similarity metrics (MAE, PSNR, and SSIM), photon dose metrics (photon MAE_dose, DVH_metric, and gamma pass rate), and proton dose metrics (proton MAE_dose, DVH_metric, and gamma pass rate).
Our objective was to analyze the correlations within and between metric groups. To achieve this, we employed visual assessments to illustrate correlations within a metric group. We used the Spearman rank correlation coefficient ρ (Spearman, 1904) to quantify correlations between all metrics. This coefficient considers the ordinal relationship between the ranks of single test case performances, providing robustness against variations in the scale and direction of the values.

Ranking stability and correlations
We used Kendall's τ correlation coefficient (Kendall, 1938) between the approaches to analyze the effect of the choice of ranking approach. This coefficient quantifies the correlations between the ranking approaches outlined in Section 2.6, assessing the similarity in the relative ordering of elements across different rankings.
In addition, we investigated the stability of the final rankings at the patient level, as recommended by Wiesenfarth et al. (2021). This involved bootstrapping to examine variations in the ranking positions of all teams. The ranking process was iteratively applied to 1000 bootstrap sets, each consisting of 120 patients randomly drawn from the test set with replacement. The MeanThenRank approach was employed to rank the teams, normalizing the metric values based on the best and worst average performance of each metric per bootstrap sample.
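The bootstrap ranking-stability analysis can be sketched as follows; an illustrative implementation in which the array layout and function name are our assumptions:

```python
import numpy as np

def bootstrap_rank_stability(per_case_scores, higher_is_better,
                             n_boot=1000, n_cases=120, seed=0):
    """per_case_scores: (teams, cases, metrics) array. For each bootstrap
    sample, draw cases with replacement, average per team, min-max normalize
    each metric across teams, and rank by the mean normalized score
    (MeanThenRank). Returns an (n_boot, teams) array of rank positions."""
    rng = np.random.default_rng(seed)
    t, c, m = per_case_scores.shape
    sign = np.where(np.asarray(higher_is_better), 1.0, -1.0)
    ranks = np.empty((n_boot, t), dtype=int)
    for b in range(n_boot):
        idx = rng.integers(0, c, size=n_cases)          # sample with replacement
        means = per_case_scores[:, idx, :].mean(axis=1) * sign  # larger = better
        lo, hi = means.min(axis=0), means.max(axis=0)
        norm = (means - lo) / np.where(hi > lo, hi - lo, 1.0)
        agg = norm.mean(axis=1)
        order = np.argsort(-agg)
        r = np.empty(t, dtype=int)
        r[order] = np.arange(1, t + 1)                  # rank 1 = best
        ranks[b] = r
    return ranks
```

The spread of each team's rank across bootstrap samples then indicates how stable its final position is.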

Participation
The SynthRAD2023 Grand Challenge witnessed substantial participation from research teams worldwide, showcasing various techniques and methodologies for sCT generation. By the end of the test phase, the training dataset had been downloaded 1797 times, and 617 researchers had registered for the challenge, forming 94 teams with 429 individual participants. Participation in the challenge phases decreased over time, resulting in 22 and 17 successful submissions in the test phase for tasks 1 and 2, respectively. Based on the criteria described in Section 2.6, 18 and 14 teams were included in the analysis for tasks 1 and 2, respectively (Table 2). Note that due to an unexpectedly large matrix size for one patient in the test set, the inference time limit was raised to accommodate sCT generation. Nine of the included teams participated in both tasks, primarily utilizing the same or similar models for both. Tables 3 and 4 show an overview of the proposed methods of all teams for tasks 1 and 2, respectively. More detailed descriptions of the methods implemented by the top five teams for both tasks are presented in Sections 3.1 to 3.7. Detailed method descriptions of all other teams can be found in supplementary document A.

SMU-MedVision (task 1 & 2)
SMU-MedVision employed a hybrid 3D patch-based CNN and transformer U-net with multi-scale structure extraction and preservation (MSEP) for task 1 (Chen et al., 2023; Zhong et al., 2023). In the encoder, they employed channel- and spatial-wise attention to extract spatial information, allowing for varying input sizes. Additionally, a residual dilated Swin transformer (RDSformer) was integrated into each skip connection of the U-net to enhance the preservation of structural information in cross-modal features (Liu et al., 2021). Two identical models were created for the two anatomical regions, trained with the masked MAE and VGG19 perceptual loss (Johnson et al., 2016). Preprocessing involved Z-score normalization tailored to individual patient statistics, and random horizontal and vertical flipping was used for data augmentation. At test time, overlapping patches were created by selecting every 80,000th voxel within the body mask as the central point of a patch; these overlapping patches were averaged to produce the full sCT. The model was trained for 200 epochs using the Adam optimizer with a learning rate of 2e−4 and a poly decay scheduler. The epoch used at test time was selected based on the best MAE on a sub-validation set created from the training set.
For task 2, SMU-MedVision implemented a 2.5D UNet++ (Zhou et al., 2018) with a ResNeXt101 backbone, with a loss function combining a masked MAE loss, the VGG19 perceptual loss (Johnson et al., 2016), and L2 regularization. The model was trained using brain and pelvis data and then fine-tuned per region. Preprocessing involved resizing, clipping, and linear normalization of the CT. Training data augmentation included shift-scale rotations with horizontal and vertical flipping, while test data underwent augmentation via horizontal and vertical flipping. Slices of 5 × 384 × 384 voxels were used for collective pretraining, and the model input sizes for fine-tuning were 5 × 288 × 288 voxels for the brain and 5 × 416 × 416 voxels for the pelvis. Postprocessing included the inversion of the test-time augmentations. The model was collectively trained for 40 epochs and then fine-tuned using 5-fold cross-validation for 50 epochs for the brain and 40 epochs for the pelvis, optimized using an AdamW optimizer with a stepped decay learning rate schedule. The final result was based on an ensemble of all five folds (each at its best validation MAE) and a model trained on the complete provided training set (for the number of epochs mentioned above).

Jetta_Pang (task 1)
Jetta_Pang implemented two 3D patch-based nnU-Net (Isensee et al., 2021) models with an MSE loss for task 1: Model-Brain and Model-Pelvis. Preprocessing involved Z-score normalization for MRI and no normalization for CT images. No resizing, rescaling, or data augmentation was applied. Model input sizes were 64 × 128 × 224 voxels for brain patches and 112 × 160 × 128 voxels for pelvis patches.
Inference utilized nnU-Net's default sliding window with half-patch overlap, and no postprocessing steps were needed since the sCT was already expressed in HU. The models were trained for 1000 epochs using an SGD optimizer with a Nesterov momentum of 0.99, an initial learning rate of 1e−2, and a polyLR scheduler.

GEneRaTion (task 2)
GEneRaTion employed a 2D restoration approach using a Swin transformer (Liang et al., 2021) combined with a pre-trained masked autoencoder (He et al., 2022) for task 2. The SwinV2 architecture (Liu et al., 2022) was enhanced by incorporating group propagation blocks (Yang et al., 2022). Depending on the training stage, the model used either an L1, MSE, or perceptual loss (Johnson et al., 2016). Two identical models were created for the brain and pelvis. The (CB)CT was linearly normalized between [−1000, 3000] HU for the brain and [−1000, 2000] HU for the pelvis. In a self-supervised pretraining phase, an L1 loss and a learning rate of 1e−4 were applied to 8 × 8 random patches with at least 75% of the patch within the provided body mask. Random 90° rotations and horizontal or vertical flipping were part of this pretraining step. Subsequently, the models were fine-tuned on axial slices randomly cropped to 160 × 160 voxels, utilizing three stages. These stages involved training with (1) an L1 loss and a learning rate of 1e−4, (2) an MSE loss with a learning rate of 2e−5, and (3) a perceptual loss with a learning rate of 1e−5. During test-time ensembling, the three sCTs were combined through a weighted average, with weights calculated by −mean(⋅)∕max(⋅), and the preprocessing steps were inverted.

FAYIU (task 1 & 2)
Team FAYIU implemented a patch-based 3D Swin UNETR (Hatamizadeh et al., 2021) in MONAI (Cardoso et al., 2022) for both tasks, trained separately per region. The Swin UNETR architecture, incorporating a vision transformer-based encoder and a CNN-based decoder, enabled the processing of 3D patches. The models used a masked L1 loss. MRI inputs were normalized by dividing by 1000, while (CB)CT inputs were first made non-negative and subsequently divided by 2000.
For training, 20 random patches of 32 × 96 × 96 voxels were selected per patient, and no other data augmentation techniques were applied. At inference time, patches overlapping by 28 × 72 × 72 voxels were selected, and overlapping regions were averaged in a weighted manner, with the weights for adjacent patches decreasing linearly as the distance into the overlap increased. Furthermore, the CT normalization was inverted to yield an sCT in HU. The models were trained using the Adam optimizer and a step-wise learning rate decay from 5e−4 to 5e−5.
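The linearly decaying overlap weighting described above can be illustrated with a minimal 1-D sketch. The function name and the triangular weight profile are our own simplification for illustration, not FAYIU's actual 3D implementation:

```python
def blend_patches_1d(patches, starts, length, patch_len):
    """Blend overlapping 1-D patches with linearly decaying edge weights.

    Each patch gets a triangular weight profile: highest at the patch
    centre and falling linearly towards the edges, so adjacent patches
    cross-fade smoothly in their overlap region.
    """
    centre = (patch_len - 1) / 2.0
    weights = [1.0 - abs(i - centre) / (centre + 1.0) for i in range(patch_len)]
    acc = [0.0] * length   # weighted sum of predictions per voxel
    wsum = [0.0] * length  # sum of weights per voxel
    for patch, start in zip(patches, starts):
        for i, value in enumerate(patch):
            acc[start + i] += weights[i] * value
            wsum[start + i] += weights[i]
    # Normalize by the accumulated weight wherever a patch contributed.
    return [a / w if w else 0.0 for a, w in zip(acc, wsum)]
```

Blending two constant-valued overlapping patches reproduces the constant everywhere, which is the key sanity check for any weighted-averaging scheme.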

iu_mia (task 1 & 2)
Team iu_mia employed a 3D patch-based ShuffleUNet (Chatterjee et al., 2021) conditioned on the anatomical region, with an L1 loss for both tasks. This model incorporates specialized 3D pixel unshuffling and shuffling modules to effectively handle the 3D nature of medical imaging data. Z-score normalization was applied to the 3D MRI volumes, while (CB)CT volumes underwent linear scaling by ((CB)CT − 1024)∕4024. Random patches of 96 × 96 × 96 voxels were selected for training, and no other data augmentation techniques were applied. At test time, sCTs were generated from patches with a 62.5% overlap, averaged using Gaussian weighting (σ = 0.125), and the normalization was inverted. The models were trained for 3000 epochs using the Adam optimizer with a linear learning rate scheduler initialized at 1e−3.
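As an illustration of Gaussian patch weighting, the following 1-D sketch builds a normalized importance map with σ expressed relative to the patch size, analogous to (but not taken from) the team's implementation:

```python
import math


def gaussian_weights(patch_len, sigma_rel=0.125):
    """Gaussian importance map for blending overlapping patches.

    sigma_rel is the standard deviation expressed as a fraction of the
    patch size; voxels near the patch centre receive weight close to 1,
    voxels near the edges close to 0.
    """
    sigma = sigma_rel * patch_len
    centre = (patch_len - 1) / 2.0
    weights = [math.exp(-0.5 * ((i - centre) / sigma) ** 2)
               for i in range(patch_len)]
    peak = max(weights)
    return [w / peak for w in weights]  # normalize so the centre weight is 1
```

In an actual 3D pipeline the per-axis profiles would be combined as an outer product and used exactly like the linear weights above, i.e., multiply each patch by its map, accumulate, and divide by the accumulated weights.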

Table 3
Ranking and model details for task 1 (MRI-to-CT synthesis). When a check is used, the step is applied to both MRI and CT and to brain (br) and pelvis (pel); otherwise, it is specified per subgroup. All distinctions listed in the first two rows are described in Section 2.7.2.

Table 4
Ranking and model details for task 2 (CBCT-to-CT synthesis). When a check is used, the step is applied to both CBCT (CB) and CT and to the brain and pelvis (pl.); otherwise, it is specified per subgroup. All distinctions listed in the first two rows are described in Section 2.7.2.

Elekta (task 1)
Team Elekta only participated in task 1, where they employed a 2.5D pix2pix (Isola et al., 2017) model with a ResUnet (Zhang et al., 2018) generator and a discriminator implemented similarly to the encoding part of the ResUnet. Spectral normalization (Miyato et al., 2018) was applied after each convolutional layer, and instance normalization replaced group normalization. Two identical models were created per anatomical region, using the least-squares GAN loss (Mao et al., 2017) with L1 regularization (weight of 50). MRI and CT intensities were linearly scaled to the range [−1, +1], with source ranges determined by percentiles for MRI and fixed at [−1000, +2200] HU or [−1000, +3000] HU for CT. Two networks were trained for both regions, one covering the full CT intensity range and the other focusing on a narrower range. Training involved randomly selecting axial patches of 5 × 192 × 192 voxels, augmented for MRI using affine transformations, synthetic multiplicative bias fields, blurring, sharpening, gamma contrast adjustments, and linear intensity transformations. For CT, only affine transformations were applied. During inference, patches overlapping by 4 × 96 × 96 voxels were combined through weighted averaging, with higher weights assigned to pixels near the center of a patch and lower weights to those near the edge. The model was trained using the Adam optimizer with learning rates of 1e−4 and 5e−5 for the generator and discriminator, respectively. Additionally, a slow-moving exponential moving average (EMA) of the generator parameters was tracked during training and used as the final model for inference. Each model was trained six times, and the final sCT was obtained by averaging the results of the six models.
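A parameter EMA of the kind Elekta used for the generator can be sketched as follows. This is a toy dictionary-based version; the class name and decay value are illustrative assumptions, not the team's code:

```python
class EmaParams:
    """Exponential moving average of model parameters.

    After each optimizer step, call update() with the current parameter
    values; the smoothed 'shadow' copy changes slowly and is used for
    inference instead of the raw weights.
    """

    def __init__(self, params, decay=0.999):
        self.decay = decay
        self.shadow = dict(params)  # smoothed copy, initialized at start

    def update(self, params):
        d = self.decay
        for name, value in params.items():
            # shadow <- d * shadow + (1 - d) * current
            self.shadow[name] = d * self.shadow[name] + (1.0 - d) * value
```

With a decay close to 1, the shadow weights average over many recent training steps, which tends to smooth out the oscillations typical of adversarial training.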

Pengxin Yu (task 2)
Pengxin Yu employed a 3D patch-based model inspired by Ge et al. (2019) for task 2, implemented separately for the brain and pelvis. The model architecture featured consecutive multi-scale residual blocks, effectively extracting fine-grained spatial structures, and integrated stereo-correlation and image-expression constraints alongside the L1 reconstruction loss to guide structural detail and scene content. CBCT was linearly normalized between [−1000, 2000] HU, and for CT, center- and region-specific windows were set: brain center A: [0, 3000] HU, brain centers B and C: [−1000, 2000] HU, pelvis center A: [0, 2000] HU, and pelvis centers B and C: [−1000, 1000] HU, after which the intensities were linearly normalized. During training, patches of 8 × 180 × 180 voxels were created by random resizing, cropping, and horizontal flipping. At test time, overlapping patches were selected with an overlap of 2 × 32 × 48 voxels. The models were trained for 1000 epochs with the AdamW optimizer, starting at a learning rate of 3e−4 and reducing it by a factor of 10 whenever the validation loss had not decreased for 10 consecutive epochs. The epoch used at test time was selected based on the best PSNR on a sub-validation set created from the training set.

Overall sCT generation performance
Table 5 presents the final ranking and quantitative results of the 18 eligible teams for task 1 and the 14 eligible teams for task 2, along with the two baseline algorithms. All eligible teams outperformed the water baseline in both tasks based on the image similarity metrics. Almost all teams also outperformed the water baseline based on the dose metrics; however, one team (X-MAN) did not outperform it when considering the photon gamma pass rate and proton DVH metrics in task 1 and the photon dose MAE and photon gamma pass rate in task 2. On the other hand, 11/18 and 14/18 teams outperformed the stratified baseline based on image similarity for tasks 1 and 2, respectively. Regarding the dose metrics in task 1, 10/18 and 14/18 teams outperformed the stratified baseline for the photon and proton gamma pass rates, respectively. In task 2, the stratified baseline outperformed all teams based on the photon gamma pass rate, while 11/15 teams achieved a higher proton gamma pass rate than the stratified baseline. Interestingly, a higher image similarity did not automatically lead to an improved dose distribution. For example, comparing SMU-MedVision (rank 1) and FGZ Medical Research (rank 6) for task 2, we observe a large difference in MAE of 49.95 ± 11.78 versus 60.65 ± 12.56 HU, while only a subtle difference in photon gamma pass rates of 99.49 ± 1.65 versus 99.57 ± 1.07 is seen. Such differences motivated us to perform an in-depth statistical analysis examining the significance of one team outperforming another based on individual metrics (Figures 1 and 2 in supplementary document B). Based on the image similarity metrics, high-ranking teams robustly outperform lower-ranked teams. Statistically significant improvements were observed when comparing all image metrics between a team and another team ranked at least seven places lower for task 1, or six places lower for task 2.
However, for the dose metrics, this relation is weaker. In task 1, no statistically significant differences were observed between the top fourteen teams regarding the photon dose metrics and the top eleven teams regarding the proton dose metrics. In task 2, no statistically significant differences were observed between the top eight teams regarding the photon and proton dose metrics, except for the fifth team (Pengxin Yu), which significantly outperformed the seventh team (KoalAI) regarding the proton DVH metric.
Overall, the teams successfully generated high-quality sCTs, accurately synthesizing soft-tissue density. However, the visual examples in Fig. 2 show more pronounced errors at transitions between tissue densities, such as the boundaries between air and soft tissue or between soft tissue and bone. These boundary errors between the input and the ground truth CT appear consistent across teams and lead to increased dose errors when a beam passes through these regions. Moreover, in the pelvic cases, the anatomy does not always fit within the field-of-view of the CBCT, requiring participants to synthesize anatomy not present in the model input.
The average inference time per case was 5.2 ± 2.8 min, with teams utilizing an average of 4.0 ± 4.8 GB of GPU RAM. The maximum observed inference time for a single case was 21.8 min. There was a notable spread in resource usage between teams; a detailed overview per team per subtask is available in Figure 3 in supplementary document B.

Model design predictors
Of all the teams that participated in both tasks, the challenge winner, SMU-MedVision, was the only team to implement a different model architecture for each task. Most teams used the same model architecture for the brain and the pelvis but trained it separately for both regions. Consequently, the limited number of teams that chose different models or parameters for the brain and pelvis did not allow for visible trends in the rankings (Tables 3 and 4). Nevertheless, teams that used one model conditioned on the anatomy consistently secured relatively high ranks in both tasks, while the team that trained one collective model without conditioning on the anatomical region ranked last in both tasks.
In addition, plain CNN encoder-decoder and GAN-based models were prevalent among the teams. However, the teams that placed first and third in task 1 and second and fourth in task 2 used transformer-based approaches. These transformers showed significantly better performance in both regions for task 1 and in the brain for task 2, achieving average SSIM values of 0.88 ± 0.03 for task 1 and 0.90 ± 0.03 for task 2 (Fig. 3). Following the transformers, CNN encoder-decoder models were the next best-performing, yielding SSIM values of 0.85 ± 0.04 and 0.89 ± 0.04 for tasks 1 and 2, respectively. Conversely, teams using GANs tended to rank lower (Tables 3 and 4), with SSIM values of 0.83 ± 0.07 for task 1 and 0.87 ± 0.05 for task 2. Notably, GANs showed a significant performance drop, especially for the pelvis cases (Fig. 3).

Table 5
All quantitative metrics (mean ± standard deviation) produced by every participant in task 1 (MRI-to-CT) and task 2 (CBCT-to-CT). There were three image-based and six dose-based metrics: three for photon treatment and three for proton treatment. The best results per task per metric are marked in boldface.
Metric table 1: the quantitative metrics for task 1 (MRI-to-CT). Participants who scored worse than the water baseline on one image metric were excluded from the final ranking.

Finally, the diffusion model, rarely adopted in this challenge, achieved SSIM values of 0.82 ± 0.06 and 0.88 ± 0.04 for tasks 1 and 2, respectively. Only one team (RRRocket_Lollies, task 2) implemented an unsupervised approach, which placed them close to the bottom of the ranking (12th out of 14). Owing to the scarcity of unsupervised methods, we could not extend the analysis to the level of supervision.
No notable differences in the choices of preprocessing, data augmentation, and postprocessing, and consequently no trends in ranking, were observed (Tables 3 and 4). The numerous combinations of processing steps and the substantial differences in model design prevent definitive conclusions about the importance of specific processing steps.

Data influence
The image quality of the brain patients was significantly different between the centers (Fig. 5). In task 1, the participants generated sCTs for centers A, B, and C with an SSIM of 0.857 ± 0.052, 0.831 ± 0.056, and 0.852 ± 0.050, respectively. For task 2, the participants generated sCTs for centers A, B, and C with an SSIM of 0.883 ± 0.039, 0.921 ± 0.034, and 0.897 ± 0.035, respectively. No statistically significant differences in image similarity were observed between centers for the pelvis data. The sCTs in task 2 showed a better image similarity than those in task 1, with an MAE of 79.40 ± 28.30 HU for task 1 versus 63.50 ± 24.34 HU for task 2. When considering the dose metrics for brain cases in task 1, center B (photon gamma pass rate: 92.03 ± 6.84) underperforms compared to centers A (99.65 ± 1.09) and C (99.93 ± 0.17). On the other hand, for pelvis cases in task 1, center A (photon gamma pass rate: 98.29 ± 3.00) underperforms relative to center C (99.55 ± 0.58). For brain cases in task 2, minor dose differences were observed between the centers, with proton gamma pass rates of 98.80 ± 2.83, 96.87 ± 4.75, and 97.34 ± 4.94 for centers A, B, and C, respectively.
For task 1, each center employed consistent MRI scanning protocols for each anatomical region. Consequently, a comparison at the level of the MRI scan sequence yields identical results, as illustrated in Fig. 5. Moreover, the absence of variability in magnetic field strengths for centers B and C constrained this analysis to center A (Figure 4 in supplementary document B). For the brain, the only significant difference was observed for the photon gamma pass rate, which decreased from 98.99 ± 1.43 for 1.5T to 97.33 ± 3.23 for 3T. In contrast, for the pelvis, a significant increase in performance was observed for 3T compared to 1.5T. Specifically, the SSIM increased from 0.83 ± 0.05 to 0.84 ± 0.05, the photon gamma pass rate increased from 97.51 ± 3.45 to 98.75 ± 2.59, and the proton gamma pass rate increased from 93.29 ± 4.05 to 95.64 ± 3.42.
A further investigation of the performance at the patient level, including a visual analysis of outlier patients, is presented in section 1.2 of supplementary document B.

Metric correlations
Fig. 6 highlights the correlations within the three metric groups. We observe strong correlations within the image similarity metric group, with absolute inter-metric Spearman correlation coefficients |ρ| ranging from 0.88 to 0.96 (Fig. 7), indicating that all three image metrics consistently measure the same underlying aspects. In contrast, the photon and proton metrics show weaker correlations within their groups. Among the dose metrics, the MAE dose (photon) shows the highest correlation with the other dose metrics, such as the gamma pass rate, with coefficients of −0.66 and −0.75 for photons and protons, respectively. While the correlation between the MAE dose (photon) and DVH (photon) is strong (0.76), the correlation between MAE dose (photon) and DVH (proton) is considerably lower (0.26). The proton DVH metric shows poor correlation with all other metrics, highlighting the complex relationship between these metrics (Figs. 6(b), 6(c) and 7).
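For reference, the Spearman coefficient used in these inter-metric comparisons ranks both variables and computes the Pearson correlation of the ranks. A minimal pure-Python sketch (ignoring the tie correction that a library routine such as scipy.stats.spearmanr applies) is:

```python
import math


def spearman_rho(x, y):
    """Spearman rank correlation between two equal-length sequences.

    Simplified version without tie handling: convert values to ranks,
    then compute the Pearson correlation of the rank vectors.
    """
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0.0] * len(values)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)
```

Because only ranks enter the computation, any strictly monotone relation between two metrics yields |ρ| = 1, which is why Spearman is well suited to metrics on different scales such as SSIM and MAE.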
Furthermore, the metric groups correlate only moderately with each other. The average absolute coefficients between image similarity metrics and photon metrics were 0.40 ± 0.03, while those with proton metrics were 0.47 ± 0.08. Moreover, the average absolute correlation coefficient between photon and proton metrics was 0.50 ± 0.23, with the large standard deviation introduced by a correlation coefficient of zero between DVH (proton) and the gamma pass rate (photon). Overall, the results strongly suggest that an sCT similar to the ground truth CT does not directly translate into a dose distribution similar to the reference distribution, highlighting that the different metrics focus on different aspects in evaluating sCTs.

Ranking stability and correlations
Fig. 8 illustrates that the challenge winner secured the top position for both tasks under the three other ranking approaches as well, and teams at the bottom of the rankings were also stable across the ranking approaches. However, the middle-ranked teams experienced notable shifts. For task 1, transitioning from MeanThenRank to MedianThenRank caused substantial changes for UKA (9 → 13) and mriG (12 → 8). Conversely, task 2's largest shifts occurred when changing from MeanThenRank to RankThenMedian for FGZ Medical Research (6 → 2) and iu_mia (3 → 6). Despite these variations, all approaches correlated strongly with the MeanThenRank approach, as indicated by Kendall's τ correlation coefficient (Table 6).
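Kendall's τ, used here to compare the ranking approaches, counts concordant minus discordant pairs of teams. A minimal sketch of the tau-a variant (no tie handling, unlike library implementations such as scipy.stats.kendalltau) is:

```python
def kendall_tau(a, b):
    """Kendall's tau-a between two rankings of the same items.

    For every pair of items, the pair is concordant if both rankings
    order it the same way and discordant otherwise; tau is the
    normalized difference of the two counts.
    """
    n = len(a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (a[i] - a[j]) * (b[i] - b[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

Identical rankings give τ = 1, fully reversed rankings give τ = −1, and the values near 1 reported in Table 6 indicate that swapping the aggregation scheme reorders only a few pairs of teams.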
Fig. 9 demonstrates that the final rankings (determined by MeanThenRank) were relatively stable. There was high confidence in top-performing teams securing higher ranks and underperforming teams obtaining lower ranks, with the teams showing a maximum shift of 4 and 3 positions for tasks 1 and 2, respectively. SMU-MedVision had a 63.7% certainty of being the winner for task 1 and 99.7% for task 2, while the certainties for the second to fifth places were lower, ranging from 45.0% to 66.7% for task 1 and from 30.5% to 60.4% for task 2. Teams in the middle of the rankings again showed some level of uncertainty, while teams at the bottom of the ranking almost inevitably received the correct rank. This specifically holds for the last five teams for task 1 and the last six teams for task 2, with average certainties of 97.0 ± 3.5% and 99.8 ± 0.4%, respectively.

Fig. 8. Stability of the chosen ranking approach (MeanThenRank) compared to the other three. For all approaches, we used the mean over all test patients to obtain one average value per metric per team.
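The ranking-certainty estimates of this kind are typically obtained by bootstrapping over test patients. The following is a hedged sketch of such an analysis for a single "lower is better" metric (function and variable names are ours; the challenge's exact procedure may differ):

```python
import random


def rank_certainty(scores_per_team, target_rank_of, n_boot=2000, seed=0):
    """Estimate, via bootstrap, how often each team keeps its final rank.

    scores_per_team maps team name -> per-patient metric values
    (lower is better, e.g. MAE); target_rank_of maps team name -> its
    final rank. Patients are resampled with replacement n_boot times,
    teams are re-ranked on each resample, and we count how often each
    team lands on its original rank.
    """
    rng = random.Random(seed)
    teams = list(scores_per_team)
    n_pat = len(next(iter(scores_per_team.values())))
    hits = {t: 0 for t in teams}
    for _ in range(n_boot):
        idx = [rng.randrange(n_pat) for _ in range(n_pat)]  # resample patients
        means = {t: sum(s[i] for i in idx) / n_pat
                 for t, s in scores_per_team.items()}
        ranking = sorted(teams, key=lambda t: means[t])  # rank 1 = lowest mean
        for rank, t in enumerate(ranking, start=1):
            if rank == target_rank_of[t]:
                hits[t] += 1
    return {t: hits[t] / n_boot for t in teams}
```

Well-separated teams reach certainties near 1.0, while teams with overlapping per-patient score distributions, like the middle of the challenge rankings, show lower values.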

Discussion
SynthRAD2023 allowed the comparison of deep learning techniques for synthesizing CT from MRI or CBCT. It is the first large-scale, multi-center challenge for generating in-vivo synthetic CT and garnered significant participation from the community, with 617 registered participants generating 39 valid submissions. The participants were generally able to synthesize high-quality sCT, outperforming the baseline algorithms in terms of image quality and dose accuracy.
The top five teams performed well, with SSIM values of at least 0.87 and 0.90 for tasks 1 and 2, respectively. Additionally, they exceeded gamma pass rates (2 mm/2%) of 98.07% for photon and 97.25% for proton treatment plans in task 1, and of 98.99% for photon and 97.00% for proton treatment plans in task 2. These results indicate a high level of correspondence to the ground truth CTs. Nevertheless, despite the excellent performance, challenges remain for image synthesis. Difficulties in MRI-to-CT synthesis were encountered at air-tissue boundaries, potentially due to low MRI signal and magnetic susceptibility artifacts (Krupa and Bekiesińska-Figatowska, 2015). Additionally, in our dataset, the limited field-of-view of CBCT compared to CT introduced challenges in accurately synthesizing the complete body contour in the sCT.
Our analysis revealed that transformers (Vaswani et al., 2017) outperform CNN encoder-decoder models (e.g., U-Net; Ronneberger et al., 2015), which in turn outperform GANs (Goodfellow et al., 2014). Notably, recent architectures like diffusion models (Ho et al., 2020) and transformer-GAN combinations performed worse than the architectures mentioned above. These findings contrast with recent reviews on sCT generation (Spadea et al., 2021; Dayarathna et al., 2023), which either found no correlation between model architecture and performance or suggested that diffusion models hold promise in this field. Despite the statistically significant performance differences observed in our challenge, the differences were marginal, and the sample size was limited. In addition to comparing model architectures, it is important to acknowledge the potential impact of variations in training methodologies, including the thoroughness of hyperparameter searches, on the observed performance differences. Therefore, whether the observed differences stem solely from architectural choices or are significantly influenced by other aspects of the complex end-to-end pipeline, including preprocessing, data augmentation, postprocessing, and training procedures, remains inconclusive. For instance, previous literature suggests that data augmentation generally benefits generative models, indicating that this step may play an essential role in model performance (Taylor and Nitschke, 2018; Steiner et al., 2021).
The 2D models outperformed the 2.5D and (patch-based) 3D models for MRI-to-CT synthesis, while the 3D models outperformed the 2(.5)D approaches for CBCT-to-CT synthesis. These results hold for both pelvis and brain cases in both tasks. However, it has previously been shown that for MRI-to-CT synthesis, 2.5D (multi-view) models outperform 2D models (Spadea et al., 2019; Maspero et al., 2020), and that 3D models outperform 2D models (Sun et al., 2022). We did not identify the cause of these contrasting results between MRI-to-CT and CBCT-to-CT synthesis. Future work could investigate why the impact of the spatial dimension differs between these imaging modalities for synthetic CT generation.
We found that the image similarity metrics are highly correlated among themselves (|ρ| ≥ 0.88) and that the MAE of the photon and proton dose distributions is moderately correlated with the respective gamma pass rates (|ρ| ≥ 0.66). In contrast, the photon and proton DVH metrics are only weakly correlated with the respective gamma pass rates (|ρ| ≤ 0.42) (Fig. 7). Furthermore, the average correlations between the image similarity metrics and the dose metrics are low (|ρ| ≤ 0.47), despite their shared goal of measuring correspondence between the sCT and ground truth CT. The difference in correlations observed within and between metric groups may be attributed to the distinct regions where each metric is measured: image similarity was assessed within the dilated body contour, while dose metrics were calculated within high-dose regions or specific organs. These findings suggest that image similarity metrics should not be solely relied upon to determine the clinical suitability of a model, as they are not a reliable surrogate for clinically relevant dose metrics. Previous literature corroborates the poor correlation between image similarity and dose accuracy (Kieselmann et al., 2018; Peng et al., 2020). This highlights the need to perform thorough dose evaluations when clinically testing sCT generation approaches.
Two teams, UKA and PSCICP_4AI4PT, scored unexpectedly low in the final test phase compared to the validation phase due to implementation errors or misinterpreted data details. After (re-)opening the post-challenge phases, the two teams submitted corrected versions of their algorithms. Based on image similarity metrics alone, UKA climbed seven positions (9 → 2) in the rankings for both tasks. Similarly, PSCICP_4AI4PT climbed six positions (10 → 4) for task 1. During the test phase itself, the teams could not resubmit their algorithms, as these had run successfully on the platform. The low scores of the erroneous algorithms underscore the fairness of the adopted rules.

Clinical impact
Despite the similarity between the sCTs generated by the participants and the ground truth CTs, there remains a lack of consensus regarding the criteria determining the clinical acceptability of an sCT (Vandewinckele et al., 2020). In radiotherapy, treatment planning is defined to meet specific dose prescriptions and constraints; in this sense, dose-related metrics may be considered clinically significant. Some works have investigated clinical acceptance criteria for synthetic CTs. For example, Olberg et al. (2019) considered photon gamma pass rates greater than 98% acceptable using the 2 mm/2% criterion. On the other hand, Korsholm et al. (2014) propose that treatments with a DVH difference of <2% are clinically acceptable. However, these criteria were proposed for breast, head-and-neck, and thorax sCT generation; it is unclear whether they translate to other anatomical regions. Before addressing the clinical impact of the challenge results, it is crucial to consider the quality of the treatment plans adopted for the SynthRAD2023 evaluation. Treatment planning techniques may differ between institutes. The planning techniques chosen here were based on constraints adopted in clinical guidelines (Hall et al., 2021; Lambrecht et al., 2018), making the challenge results clinically relevant. The Linac and proton systems used in the treatment planning were generic; however, studies have demonstrated their adequacy by showing that gamma pass rates deviate by a maximum of 0.5% when compared to dose engines adopted in clinical systems, independent of the irradiation type (Wieser et al., 2017).
To indicate clinically acceptable sCT, we propose considering an average gamma pass rate (2 mm/2%) above 99% for photon and above 97.5% for proton irradiation in regions receiving ≥10% of the prescribed dose. For the SynthRAD2023 challenge, only one team (Jetta_Pang) met these criteria for the MRI-to-CT task. For the CBCT-to-CT task, one team (SMU-MedVision) met both criteria, while five other teams (GEneRaTion, FAYIU, Pengxin Yu, FGZ Medical Research, and Breizh-CT) met only the photon criterion. As previously mentioned, the evaluation was affected by differences in patient positioning between the imaging sessions. Still, when considering the results at a population level, we do not expect to observe any systematic dose differences unless the sCT generation method introduced geometrically consistent distortion (Adjeiwaah et al., 2019). The lack of systematic dose differences suggests that the solutions offered by the participants are promising and of high quality. Before implementing any proposed solution clinically, it is advisable to evaluate it according to the clinical standards specific to each facility using the commissioned treatment planning system.
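For intuition, a global gamma analysis combines a dose-difference tolerance with a distance-to-agreement (DTA) tolerance. The following simplified 1-D sketch (brute-force search without the interpolation used in clinical 3-D implementations) evaluates only points receiving at least the stated fraction of the prescribed dose, as in the criterion above:

```python
def gamma_pass_rate(ref, evl, spacing_mm, dta_mm=2.0, dd_frac=0.02,
                    prescribed=None, threshold=0.10):
    """Simplified 1-D global gamma pass rate (2 mm / 2% by default).

    For each reference point above the dose threshold, find the minimum
    combined distance/dose discrepancy over all evaluated points; the
    point passes if that gamma value is <= 1.
    """
    if prescribed is None:
        prescribed = max(ref)
    dd = dd_frac * prescribed        # absolute dose tolerance (global)
    cutoff = threshold * prescribed  # evaluate only the high-dose region
    passed = total = 0
    for i, dose_ref in enumerate(ref):
        if dose_ref < cutoff:
            continue
        total += 1
        gamma_sq = min(
            ((i - j) * spacing_mm / dta_mm) ** 2
            + ((dose_evl - dose_ref) / dd) ** 2
            for j, dose_evl in enumerate(evl)
        )
        if gamma_sq <= 1.0:
            passed += 1
    return 100.0 * passed / total if total else float("nan")
```

Identical dose profiles trivially give a 100% pass rate; small spatial shifts within the DTA tolerance also pass, which is exactly why gamma is more forgiving than a voxel-wise dose difference.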
Commercial solutions to generate sCT from MRI or CBCT are currently available (Köhler et al., 2015; van Stralen et al., 2019; Cronholm et al., 2020; Archambault et al., 2020). It would be interesting to compare the algorithms submitted to the SynthRAD2023 challenge with these commercial solutions. Some of them require a dedicated imaging protocol to generate accurate sCT data (Florkow et al., 2020; Bratova et al., 2019; Liu et al., 2023), making the comparison challenging. Exploring the necessity of specialized imaging protocols, or, in simpler terms, assessing the ability of sCT algorithms to generalize across different input variations as in Nijskens et al. (2023), could be worthwhile.

Limitations of the SynthRAD2023 dataset and setup
A substantial multi-center dataset was gathered for the SynthRAD2023 challenge. However, despite its size and diversity, the dataset can be further improved. For example, it consists solely of patients treated at Dutch hospitals, which may limit its heterogeneity and possibly result in low performance for outlier cases in the data distributions. Additionally, it is important to acknowledge that the included MRIs represent only a subset of the magnetic field strengths commonly used in clinical practice (1.5T and 3T). This limits the generalizability of the findings to the broader spectrum of clinical MRI applications, where other field strengths are routinely employed. While the inclusion of three centers represents a commendable starting point, extending the dataset with international data may improve the generalization capabilities of the submitted models and increase the clinical impact.
Ideally, a model should generalize across different centers without center-conditional fine-tuning, as current commercial solutions are not center-specific. While some participants incorporated center-based prediction and optimization using information shared in the training and validation sets, effective models should extend beyond the provided centers to make a clinical impact. Future challenges may consider withholding such information to encourage more general approaches.
Furthermore, the SynthRAD2023 dataset contained rigidly registered image pairs, resulting in residual anatomical mismatch after registration, as mentioned above. Reducing the registration error, e.g., by resorting to deformable registration, may improve performance (Florkow et al., 2020). However, deformable registration may also mask geometrical distortions of the input images introduced by the models, which is undesirable in a clinical scenario (Pappas et al., 2017). The impact of residual misregistration is tied to the paired nature of the dataset and could be mitigated by performing unpaired synthetic CT generation. However, unpaired sCT generation prohibits a dosimetric evaluation of the generated sCT and limits the use of established image similarity metrics, such as the SSIM or PSNR.
An additional dataset limitation stems from the automated processing pipeline: for all brain patients from center B in task 1, the treatment table was included within the dilated body contour, whereas the other centers successfully excluded the table, leading to an inconsistent table representation in the dose evaluations. Moreover, for two out of sixty pelvis patients in task 2, the field-of-view of the CBCT was smaller than the body contour. Such patients can be considered outliers, and for future challenges, it would be beneficial to revise the case selection and exclude them from the test set. The inclusion of the table in the mask and the limited CBCT field-of-view had a minor impact on the image similarity evaluation, which was computed within the provided mask, but could be more substantial for the dose evaluation due to beam attenuation. Note that the inconsistency was present for all teams, leaving the challenge ranking unbiased.
Another limitation arose from the absence of dose evaluation during the validation phase, preventing teams from optimizing their models for this radiotherapy-related metric. On the other hand, the lack of dose metrics during validation may have compelled participants to develop general methods that function irrespective of the chosen planning strategy.

Future direction
SynthRAD2023 set out to advance the state of the art in MRI-to-CT and CBCT-to-CT generation. While the results are promising, these tasks were not conclusively solved during this challenge. The dataset included only brain and pelvis patients; other, possibly more challenging, anatomical regions could benefit from sCT generation, such as the thorax, head-and-neck, breast, or abdomen (Spadea et al., 2021). In addition, it would be of interest to examine the generalizability of the models by including test data from centers that were not present in the training data (Texier et al., 2023).
The positive reception of SynthRAD2023 has spurred the development of SynthRAD2025, which aims to expand the challenge beyond the Dutch national domain into less explored anatomical regions, such as the head-and-neck and abdomen.
Furthermore, addressing the limitations in data preparation and image registration discussed earlier will enhance the analysis of future challenges.
Lastly, we anticipate that the post-challenge phases will offer opportunities to validate and strengthen the statistical robustness of the challenge's conclusions, enabling other researchers to compare their methods against the SynthRAD2023 results.

Fig. 2 .
Fig. 2. Examples of synthetic CTs for task 1 (MRI-to-CT; a) and task 2 (CBCT-to-CT; b). The model input is shown in the upper left, and the ground-truth CT in the center left. The sCTs of the top five participants for task 1 and task 2 are shown in the top row. The difference from the ground-truth CT is shown in the middle row. The bottom left shows the planned irradiation based on the CT for a photon (a) and proton (b) plan. The bottom row shows the dose difference when the treatment plan is applied to the sCT (CT dose − sCT dose). All values outside the body contour were masked.

Fig. 3 .
Fig. 3. Boxplots of SSIM values for all patients in each subtask, i.e., task 1 (MRI-to-CT) or task 2 (CBCT-to-CT) and brain or pelvis, grouped by each team's model backbone choice. The number in the boxes indicates the number of teams represented in that box. An asterisk indicates significant differences within one subtask.

Fig. 4 .
Fig. 4. Boxplots of SSIM values for all patients in each subtask, i.e., task 1 (MRI-to-CT) or task 2 (CBCT-to-CT) and brain or pelvis, grouped by the spatial configuration of each team's model. The number in the boxes indicates the number of teams represented in that box. An asterisk indicates significant differences within one subtask.

Fig. 5 .
Fig. 5. Boxplots of the teams' performance in terms of SSIM and gamma pass rates for photons and protons, grouped by different subsets of the dataset, analyzing the differences between task, anatomical region, and center. Asterisks indicate significant differences.

Fig. 6 .
Fig. 6. Correlation plots among metrics in the three categories: (a) image metrics, (b) photon metrics, and (c) proton metrics. Each data point indicates a team's performance for one patient in either task 1 or task 2. Note that some metrics are presented on a logarithmic scale, and one extreme outlier for the proton DVH metric (on the order of 1 × 10^5) is excluded from the plot.

Fig. 7 .
Fig. 7. Spearman rank correlation coefficients (ρ) between the different metrics. Note that the interpretation of the correlation coefficient is contingent upon whether both compared metrics exhibit concordant trends.
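The Spearman coefficient reported in Fig. 7 is the Pearson correlation of the rank-transformed values. A minimal sketch, assuming no tied values, is shown below; `spearman_rho` is a hypothetical helper, and library implementations (e.g., scipy.stats.spearmanr) additionally handle ties via average ranks:

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks.
    Simplified version assuming no ties (average-rank handling omitted)."""
    rx = np.argsort(np.argsort(x)).astype(float)  # 0-based ranks of x
    ry = np.argsort(np.argsort(y)).astype(float)  # 0-based ranks of y
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx * ry).sum() / np.sqrt((rx ** 2).sum() * (ry ** 2).sum()))
```

Because only ranks enter the computation, any monotone relation between two metrics yields |ρ| = 1, which is why Spearman's ρ is suited to comparing metrics on different scales.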

Fig. 9 .
Fig. 9. Visualization of ranking stability.Blob size is proportional to the frequency of the rank achieved based on bootstrapping (N = 1000).
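The bootstrapping behind Fig. 9 can be sketched as follows: patients are resampled with replacement, teams are re-ranked on each resample (here using MeanThenRank, i.e., mean score per team, then rank), and the distribution of ranks indicates stability. This is a minimal numpy sketch assuming higher scores are better; `bootstrap_ranks` is a hypothetical helper, not the challenge's actual evaluation code:

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ranks(scores, n_boot=1000):
    """scores: (n_teams, n_patients) array, higher is better.
    Returns an (n_boot, n_teams) array of ranks, where 1 = best."""
    n_teams, n_patients = scores.shape
    ranks = np.empty((n_boot, n_teams), dtype=int)
    for b in range(n_boot):
        idx = rng.integers(0, n_patients, size=n_patients)  # resample patients
        means = scores[:, idx].mean(axis=1)                 # mean per team
        order = np.argsort(-means)                          # best team first
        ranks[b, order] = np.arange(1, n_teams + 1)         # assign ranks
    return ranks
```

Teams whose rank distribution is concentrated on a single value (large blobs in Fig. 9) are stable under resampling; spread-out distributions indicate rank uncertainty.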

Table 2
Details on the challenge participation.Participants without a team are displayed as a one-person team.

Table 2:
The quantitative metrics for task 2 (CBCT-to-CT). Participants who scored worse than the water baseline on one image metric were excluded from the final ranking.

Table 6
Kendall's τ correlation coefficients for the ranking obtained from MeanThenRank compared to the other three ranking approaches.
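Kendall's τ compares two rankings via concordant and discordant pairs: +1 for identical rankings, −1 for fully reversed ones. A minimal sketch for rankings without ties is given below; `kendall_tau` is a hypothetical helper, and library versions (e.g., scipy.stats.kendalltau) also handle ties:

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau between two rankings of the same teams (no ties):
    (concordant pairs - discordant pairs) / total pairs."""
    n = len(rank_a)
    s = 0
    for i, j in combinations(range(n), 2):
        prod = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        s += 1 if prod > 0 else -1 if prod < 0 else 0  # concordant vs. discordant
    return s / (n * (n - 1) / 2)
```

A τ close to 1 between MeanThenRank and an alternative ranking approach indicates that the final challenge ranking is robust to the choice of aggregation scheme.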