Fibroglandular tissue segmentation in breast MRI using vision transformers: a multi-institutional evaluation

Accurate and automatic segmentation of fibroglandular tissue in breast MRI screening is essential for the quantification of breast density and background parenchymal enhancement. In this retrospective study, we developed and evaluated a transformer-based neural network for breast segmentation (TraBS) in multi-institutional MRI data, and compared its performance to the well-established convolutional neural network nnUNet. TraBS and nnUNet were trained and tested on 200 internal and 40 external breast MRI examinations using manual segmentations generated by experienced human readers. Segmentation performance was assessed in terms of the Dice score and the average symmetric surface distance. The Dice score for nnUNet was lower than for TraBS on the internal test set (0.909 ± 0.069 versus 0.916 ± 0.067, P < 0.001) and on the external test set (0.824 ± 0.144 versus 0.864 ± 0.081, P = 0.004). Moreover, the average symmetric surface distance was higher (i.e., worse) for nnUNet than for TraBS on the internal test set (0.657 ± 2.856 versus 0.548 ± 2.195, P = 0.001) and on the external test set (0.727 ± 0.620 versus 0.584 ± 0.413, P = 0.03). Our study demonstrates that transformer-based networks improve the quality of fibroglandular tissue segmentation in breast MRI compared to convolution-based models like nnUNet. These findings might help to enhance the accuracy of breast density and parenchymal enhancement quantification in breast MRI screening.


Introduction
Breast cancer is the most frequent type of cancer in the female population and represents the second leading cause of cancer death among women in the United States [1]. New guidelines for breast cancer screening recommend the use of MRI for women with dense breast tissue, whereas X-ray mammography was previously the primary imaging modality for breast cancer screening [2; 3]. Deep learning-based tools for the assessment of breast density on mammography have already been developed [4], yet a consistent and reliable automated assessment of breast density, defined as the ratio of fibroglandular tissue (FGT) to the breast volume, on MRI examinations is still lacking. Besides breast density, background parenchymal enhancement (BPE), i.e., the contrast enhancement of fibroglandular tissue, has also emerged as a promising marker for the early detection of breast cancer [5; 6]. However, reliable automated assessment of BPE is also lacking. The development of a machine learning algorithm capable of segmenting the FGT is an important first step towards an automatic quantification of breast density and BPE in breast MRI examinations.
Several research studies have investigated this problem by training convolutional neural networks (CNNs) on manually segmented breast MRI examinations and evaluating their performance on single-center test sets [7; 8; 9]. The high level of agreement between human- and machine-generated segmentation maps in all of these publications demonstrates the potential of CNNs. However, there is an important impediment to the widespread introduction of such algorithms: MRI examinations are not standardized. Different clinical centers use diverse MRI protocols and sequences for the diagnosis of breast cancer. None of the studies we found tested their CNN architecture on independent data that did not belong to the institution where the algorithms were developed.
Transformer-based models have proven to be more robust, generalizable, and attack-proof than CNNs in other applications of medical image analysis [10; 11]. They have achieved state-of-the-art results for natural language processing [12; 13], mainly because of their capability to handle long-term dependencies and self-supervised pre-training for downstream tasks. Therefore, we aimed to develop and test a robust and accurate segmentation method based on the transformer architecture that could generalize well to multi-institutional data. We compare our model against the current state-of-the-art CNN-based model on both an internal dataset and an external breast MRI dataset from Duke University [14]. Our hypotheses were that the new transformer-based model outperforms the current state of the art and that it generalizes better to external data.

Ethics Statement
Local institutional review board approval was obtained (EK028/19).

Datasets
In this retrospective study, two breast MRI datasets were used, which we will refer to as "UKA" and "DUKE". First, UKA was collected between 2010 and 2019 at the University Hospital Aachen, Germany [15]. UKA comprises a total of 9751 breast MRI examinations of 5086 women. Among this set, a total of 200 examinations from 200 women were chosen, comprising 104 carcinomas and 55 fibroadenomas. Dynamic contrast-enhanced (DCE) MRI studies of the breast had been performed according to a standardized protocol [16] on a 1.5-T system (Achieva and Ingenia; Philips Medical Systems) by using a double-breast four-element surface coil (Invivo) with two paddles being used to immobilize the breast in the craniocaudal direction (Noras). See Table 1 for a detailed description of the acquisition parameters. Second, DUKE was collected between 2000 and 2014 at the Duke Hospital, USA, and is publicly available [14]. All 922 cases have biopsy-confirmed invasive breast cancer and were acquired with either a 1.5 Tesla, 2.9 Tesla, or 3.0 Tesla scanner from General Electric or Siemens. The MRI protocol consisted of a T1-weighted fat-suppressed sequence (one pre-contrast and four post-contrast scans) and a non-fat-suppressed T1-weighted sequence. For evaluation, 40 cases were randomly selected and manually segmented as detailed below.

Ground Truth Segmentation of Fibroglandular Tissue
Both the whole breast volume and the fibroglandular tissue were segmented by F.M. and E.K. using the software ITK-SNAP [17], quality controlled by L.H. and V.R. with six and three years of experience in breast MRI, respectively, and corrected if necessary. Segmentation masks were generated for the UKA subset of 200 MRI examinations and for the 40 randomly sampled cases of DUKE. The breast outline was defined as the tissue volume located anterior to the pectoralis muscle. Sample manual segmentations are shown in Supplemental Figures 6 and 7.

Data Processing Pipeline
The processing pipeline comprised two consecutive stages: the first stage performed the segmentation of the whole breast, while the second stage segmented the FGT only (Figure 1). A neural network could be used in both stages. However, the manual (ground truth) segmentations were used in the first stage because we wanted to compare the network architectures on FGT segmentation only.
For the second stage of the segmentation pipeline, the segmentation masks from the first stage were used to create a crop of the left and right breast. The non-enhanced and the contrast-enhanced images were stacked along the channel dimension, and both breast sides were subsequently fed into the neural network. The intensity distributions of all images were z-score normalized (mean = 0, standard deviation = 1). The segmentation pipeline was implemented with PyTorch [18] on a computer equipped with an NVIDIA GeForce RTX 3090. Please refer to the SwinUNETR publication [19] for an in-depth explanation.
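The preprocessing described above can be illustrated with a minimal NumPy sketch. It is not taken from the published source code: the function names (`zscore`, `crop_to_mask`, `prepare_input`) are hypothetical, the crop is assumed to be the bounding box of the breast mask, and normalization is assumed to be applied per image after cropping, which the text does not specify.

```python
import numpy as np

def zscore(volume, eps=1e-8):
    """Scale a volume to zero mean and unit standard deviation."""
    volume = volume.astype(np.float32)
    return (volume - volume.mean()) / (volume.std() + eps)

def crop_to_mask(volume, mask):
    """Crop a 3D volume to the bounding box of a binary mask (assumption)."""
    idx = np.argwhere(mask)
    lo, hi = idx.min(axis=0), idx.max(axis=0) + 1
    return volume[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]]

def prepare_input(pre, post, breast_mask):
    """Crop both images to one breast, normalize each, stack as channels."""
    channels = [zscore(crop_to_mask(img, breast_mask)) for img in (pre, post)]
    return np.stack(channels, axis=0)  # shape: (2, depth, height, width)
```

Calling `prepare_input` once with the left-breast mask and once with the right-breast mask would yield the two network inputs per examination.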

Model Architecture
In the following, we refer to our new transformer-based model as TraBS (SwinTransformer for fibroglandular Breast tissue Segmentation). TraBS was built upon SwinUNETR [19] with 2, 4, and 8 heads and 24, 48, 96, and 192 embedding features in stages 1 to 4. Inspired by the way nnUNet handles the non-isotropic resolutions typical of MRI images, we replaced the uniform 2x2x2 patch sizes and 3x3x3 kernels in the two uppermost layers with non-isotropic 1x2x2 patches and 1x3x3 kernels.
In addition, 1x1x1 convolutions were added to supervise the deeper layers (Figure 2). To establish a baseline, we employed the state-of-the-art nnUNet [20]. The model had two max-pooling layers with 1x2x2 strides and 1x3x3 kernels, followed by two max-pooling layers with 2x2x2 strides and 3x3x3 kernels, following a previous publication for FGT segmentation [8].

Training and Testing of the Framework
The UKA subset was randomly divided into training and test sets using five-fold cross-validation. The training set within each fold was further subdivided into a dedicated training set (80%) and a validation set (20%). The training of the FGT segmentation models was performed for each of the five folds with the manual segmentation masks as ground truth. AdamW with a learning rate of 0.0001 was used to optimize the sum of DiceLoss and CrossEntropy, following previous recommendations for medical image segmentation [21]. Following the nnUNet implementation, the loss function was additionally calculated at the lower resolutions of the decoder path (multi-scale supervision) in the TraBS model. Using early stopping, training of each model was halted as soon as the loss on the validation set had not decreased for 30 epochs. To increase the diversity of the training set and thus prevent overfitting, the following data augmentation operations from the TorchIO framework [22] were applied: flipping, affine transformation, ghosting, Gaussian noise, blurring, bias field, and gamma augmentation. During training, a random region of 256x256x32 voxels within the left and right crops was selected. A sliding window of 256x256x32 voxels with an overlap of 50% was used during inference. Random flipping along all axes was used as test-time augmentation. The source code is publicly available at https://github.com/mueller-franzes/TraBS.
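To make the sliding-window inference concrete, the following sketch enumerates window positions for a 256x256x32 window with 50% overlap. It is an illustration under assumptions, not the published implementation: the helper names (`window_starts`, `patch_slices`) are hypothetical, the last window along each axis is assumed to be aligned with the volume border, and the way overlapping predictions are merged is not covered because the text does not specify it.

```python
from itertools import product

def window_starts(size, window, overlap=0.5):
    """Start indices of windows of length `window` covering `size` voxels."""
    if size <= window:
        return [0]  # one window; padding would be needed if size < window
    step = max(1, int(window * (1 - overlap)))
    starts = list(range(0, size - window, step))
    starts.append(size - window)  # final window aligned with the border
    return starts

def patch_slices(shape, patch=(256, 256, 32), overlap=0.5):
    """All 3D window positions as tuples of slices, one slice per axis."""
    axes = [window_starts(s, p, overlap) for s, p in zip(shape, patch)]
    return [
        tuple(slice(a, a + p) for a, p in zip(start, patch))
        for start in product(*axes)
    ]
```

For a crop of 512 voxels along one axis, `window_starts(512, 256)` gives the starts 0, 128, and 256, so neighbouring windows share half of their extent.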

Statistical Analysis
We used five-fold cross-validation on the internal UKA dataset to examine the performance of the models on unseen test data. For the external DUKE dataset, an ensemble of the five FGT segmentation models from the cross-validation training was applied. Majority voting was used to combine the five segmentation masks. Segmentation performance was assessed by calculating the Dice similarity coefficient (DSC) [23] and the average symmetric surface distance (ASSD) [24]. Breast density and BPE are both clinically relevant metrics related to breast cancer risk [5; 6], and their quantitative assessment depends on the FGT segmentation. Therefore, we measured these two metrics both for the manual and the automated segmentations and calculated the Pearson correlation coefficients between manually and automatically derived metrics. Note that the BPE was defined here as the percentage change of the FGT between the post- and pre-contrast image. Bootstrapping was employed to calculate confidence intervals and permutation testing was used to calculate p-values. Following the guidance of Amrhein et al. [25], we did not apply thresholds for statistical significance to the p-values.
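The ensemble and the evaluation metrics can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the code used in the study: all function names are hypothetical, and the BPE computation assumes that "percentage change of the FGT" refers to the mean signal intensity inside the FGT mask, which the text does not state explicitly.

```python
import numpy as np

def majority_vote(masks):
    """Foreground where more than half of the binary masks agree."""
    stacked = np.stack([np.asarray(m, dtype=bool) for m in masks], axis=0)
    return stacked.sum(axis=0) > len(masks) / 2

def dice(pred, truth):
    """Dice similarity coefficient between two binary masks."""
    pred, truth = np.asarray(pred, dtype=bool), np.asarray(truth, dtype=bool)
    denom = pred.sum() + truth.sum()
    if denom == 0:
        return 1.0  # both masks empty: treated as perfect agreement
    return float(2.0 * np.logical_and(pred, truth).sum() / denom)

def breast_density(fgt_mask, breast_mask):
    """Ratio of FGT volume to whole-breast volume."""
    return float(np.count_nonzero(fgt_mask) / np.count_nonzero(breast_mask))

def bpe_percent(pre, post, fgt_mask):
    """Percentage change of mean FGT intensity, post vs. pre (assumption)."""
    fgt_mask = np.asarray(fgt_mask, dtype=bool)
    pre_mean = pre[fgt_mask].mean()
    return float(100.0 * (post[fgt_mask].mean() - pre_mean) / pre_mean)
```

With five cross-validation models, `majority_vote` marks a voxel as FGT when at least three of the five masks do, and the resulting mask can be passed directly to `dice`, `breast_density`, and `bpe_percent`.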

Patient Characteristics
The study included only female patients, with a mean age of 56±10 years (range 19-91) and a mean weight of 75±27 kg for the UKA data, and a mean age of 53±11 years (range 22-90) and a mean weight of 76±18 kg for the DUKE data.

Overall segmentation performance
We tested the segmentation performance of TraBS as compared to nnUNet in terms of overlap between the ground truth and the automated segmentation. nnUNet achieved a mean Dice score of 0.909±0.069 for the FGT segmentation on our internal dataset, see Table 2. Our improved model TraBS achieved consistently better results with a mean Dice score of 0.916±0.067, P<0.001. TraBS also demonstrated a lower average symmetric surface distance (ASSD) (0.548±2.195) than nnUNet (0.657±2.856, P=0.001), indicating that finer details are more accurately captured by TraBS. By trend, segmentation performance as measured by the Dice score was lower for both models when breasts were less dense, i.e., when the fractional volume of FGT within the breast was lower, see Figure 3.

Fine Details and overall Structure are better captured by TraBS
In addition to quantitative assessment, an expert radiologist visually assessed the segmentation quality and found that TraBS performed better in capturing both overall structure and fine details compared to nnUNet. Specifically, TraBS was better at differentiating between breast implants and FGT and distinguishing between lesions and normal breast tissue, as shown in Figure 4.

Transformer-based segmentation translates to more accurate Clinical Measures
To investigate how the segmentation performance relates to clinical measurements used to assess patients' risks such as breast density, we examined the correlation between such measures when calculated based on the ground truth segmentation and on the automated segmentation. Both nnUNet and TraBS demonstrated an almost perfect correlation to the manually derived breast density and BPE, as shown in Table 2. Although the correlations were almost perfect for both models, TraBS showed slightly higher correlations (P=0.11 and P=0.06).
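The agreement analysis can be illustrated with a short sketch of the Pearson correlation and a percentile bootstrap confidence interval. This is a sketch under assumptions: the function names are hypothetical, cases are resampled with replacement, and the number of resamples and the 95% level are illustrative choices that the text does not specify.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two 1D sequences."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

def bootstrap_ci(x, y, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the correlation (resampling cases)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(x), size=len(x))  # sample with replacement
        stats.append(pearson(x[idx], y[idx]))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lo), float(hi)
```

Here `x` would hold the manually derived breast density (or BPE) per examination and `y` the corresponding value derived from the automated segmentation.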

Fine Details and overall Structure are better captured by TraBS
Visual inspection of the segmentations in the external dataset confirmed the superior performance of TraBS, as it was better able to capture fine details and overall structure compared to nnUNet. Sample images are given in Figure 5.

Transformer-based segmentation translates to more accurate Clinical Measures
Despite the limited overall segmentation quality on the external DUKE dataset, both TraBS and nnUNet still demonstrated good correlations with manual segmentations for breast density and BPE (Table 3). However, TraBS achieved higher correlations than nnUNet (P=0.007 and P=0.24).

Discussion
In this study, we propose a novel network architecture, TraBS, for segmenting fibroglandular tissue (FGT) in breast MRI images. We demonstrate that TraBS outperforms the previous state of the art in both internal and external validation sets. Breast density and BPE are important factors in determining patients' cancer risk. Thus, accurate and reliable methods for the automated extraction of quantitative markers such as breast density and BPE are needed. Our work advances the field in four aspects. First, all groups who have applied neural networks to FGT segmentation have only evaluated their algorithms on internal test sets, i.e., examinations that are similar in appearance to the examinations upon which the algorithm was trained, see Table 4 for an overview of previous research. This is a shortcoming that needs to be addressed in view of the plethora of MRI scanner protocols that are currently in clinical use. We addressed this gap by evaluating our proposed TraBS model on an external dataset, and we demonstrated that the new transformer-based architecture exhibits better generalization performance than nnUNet. Second, we examined the Dice score as a function of breast density and found that lower FGT density results in a lower Dice score. This partly explains the spread of reported Dice scores in the literature (Table 4), as the test set that is used for evaluation has a large effect on the Dice metric: if segmentation algorithms are tested on breast MRI examinations with high amounts of FGT, Dice scores are higher by trend. This is an important finding for future studies. We therefore suggest that future work on FGT segmentation should report the mean FGT density of the test set or include a graph similar to Figure 3.
Third, we make the manual segmentations for the DUKE data publicly available to serve as a reference standard for future evaluations. We expect that this can contribute to independent external evaluations of segmentation algorithms for breast MRI. Fourth, we demonstrate the overall better performance of our transformer-based model TraBS as compared to the previous state-of-the-art architecture for breast tissue segmentation in all selected performance metrics. We make our code publicly available, alongside the trained model, to further advance the field and to bridge the gap to clinical application. Our work has limitations that relate to the fact that manual segmentations are extremely time-consuming to obtain. First, even though we evaluated the model on external test data, we did not include any external training data. Thus, the segmentation performance decreases when the model is applied to external data, and even though TraBS is more robust to domain shift, its performance could be increased by including additional multi-domain data during training. Future work should focus on this in order to make the segmentation performance more robust. Second, we only included 40 external examinations as test cases from one external institution. Even though this is progress as compared to previous research, the database for a broad multi-institutional evaluation can and should be extended to provide a global perspective on the performance in so far underrepresented patient groups. Third, we did not investigate inter-rater variability due to the lack of multiple segmentations by multiple readers on the same examinations. This should be done in future studies in order to evaluate the accuracy of the segmentations that so far serve as ground truth, i.e., the human-generated segmentations.

Conclusion
In conclusion, our proposed TraBS network demonstrates excellent performance in segmenting FGT in breast MRI images. This paves the way for routine automated FGT segmentation and automatic quantification of breast density and BPE.

Supplemental Material
Table 5: Mean DSC values within the five-fold cross-validation.

Figure 1: Illustration of the segmentation framework. Non-enhanced T1- or T2-weighted images (depending on the availability within the MRI protocol) were used to manually segment the breast. The manual breast segmentation is used to crop the subtraction image of the dynamic contrast-enhanced (DCE) sequence and the pre-contrast image. Based on the cropped sequences, the neural network created a segmentation mask of the fibroglandular tissue.

Figure 3: Dice Similarity Coefficient (DSC) and Average Symmetric Surface Distance (ASSD) between the automated and manual segmentations for all examined neural network architectures. Independent of the neural network used, DSC was lower in examinations of low-density breasts, while ASSD was not influenced by breast density.

Figure 4: Sample MRI examinations of the internal UKA dataset. The two leftmost columns show the contrast-enhanced subtraction and non-enhanced T1-weighted image. The third column shows the ground truth segmentation by the radiologists and the two remaining columns show the segmentations by the neural networks. Correct segmentations are displayed in green and incorrectly labeled regions in red. Blue arrows denote challenging regions such as lesions (Patient A and B) or breast implants (Patient C).

Figure 6: Illustration of the manual breast volume segmentation in the UKA dataset.

Table 2: Comparison of the Dice score (DSC, higher is better), the average symmetric surface distance (ASSD, lower is better), and the Pearson correlation coefficients between the quantitative estimates by radiologists and by the neural networks for breast density (ρ_Dense, higher is better) and background parenchymal enhancement (ρ_BPE, higher is better) on the UKA dataset. The best performance is denoted in bold. Columns: DSC, ASSD [mm], ρ_Dense, ρ_BPE.

Table 4: Comparison of studies using neural networks for fibroglandular tissue (FGT) segmentation in MRI. The number of patients refers to the test set(s). The agreement between the manual and neural network segmentation was measured based on the Dice Similarity Coefficient (DSC), the Pearson correlation coefficient (ρ_FGT) of the FGT volume, the breast density (ρ_Dense), and the correlation (ρ_BPE) of the background parenchymal enhancement (BPE). Note: *BPE was estimated qualitatively by radiologists, **Spearman correlation coefficient.