A Two-Step Feature Selection Radiomic Approach to Predict Molecular Outcomes in Breast Cancer

Breast Cancer (BC) is the most common cancer among women worldwide and is characterized by intra- and inter-tumor heterogeneity that strongly contributes towards its poor prognosis. The Estrogen Receptor (ER), Progesterone Receptor (PR), Human Epidermal Growth Factor Receptor 2 (HER2), and Ki67 antigen are the most examined markers depicting BC heterogeneity and have been shown to have a strong impact on BC prognosis. Radiomics can noninvasively predict BC heterogeneity through the quantitative evaluation of medical images, such as Magnetic Resonance Imaging (MRI), which has become increasingly important in the detection and characterization of BC. However, the lack of comprehensive BC datasets in terms of molecular outcomes and MRI modalities, and the absence of a general methodology to build and compare feature selection approaches and predictive models, limit the routine use of radiomics in the BC clinical practice. In this work, a new radiomic approach based on a two-step feature selection process was proposed to build predictors for ER, PR, HER2, and Ki67 markers. An in-house dataset was used, containing 92 multiparametric MRIs of patients with histologically proven BC and all four relevant biomarkers available. Thousands of radiomic features were extracted from post-contrast and subtracted Dynamic Contrast-Enanched (DCE) MRI images, Apparent Diffusion Coefficient (ADC) maps, and T2-weighted (T2) images. The two-step feature selection approach was used to identify significant radiomic features properly and then to build the final prediction models. They showed remarkable results in terms of F1-score for all the biomarkers: 84%, 63%, 90%, and 72% for ER, HER2, Ki67, and PR, respectively. When possible, the models were validated on the TCGA/TCIA Breast Cancer dataset, returning promising results (F1-score = 88% for the ER+/ER− classification task). The developed approach efficiently characterized BC heterogeneity according to the examined molecular biomarkers.


Introduction
Breast Cancer (BC) is the most commonly diagnosed cancer type in the world. The most recent global cancer statistics estimate that there are about 2.3 million incident BC cases and that the disease is the leading cause of cancer mortality in women worldwide [1]. Currently, radiographic evaluation followed by a histological confirmation of malignancy on biopsy samples is used to make the early diagnosis of BC [2,3]. Although this method allows practitioners to safely and effectively characterize the molecular changes in breast tissue, it has intrinsic drawbacks because of the accessibility and heterogeneity of the tumors and the risks associated with the bioptic process [4].

Study Design
The goal of the proposed methodology was to build robust predictors for the four biomarkers commonly used in BC molecular profiling. Two comprehensive datasets were exploited: the MOLIM ONCO BRAIN dataset (DS M ) for model development and the TCGA-BRCA (DS T ) for model validation. Both included: (i) mpMRI preoperative images of BC patients (both functional and morphological sequences), (ii) clinical information related to at least three biomarkers (ER, PR, HER2), and (iii) BC tumour segmentations. Figure 1 shows the steps of the radiomic pipeline used to carry out the model development and validation: mpMRI images acquisition (Section 2.3), tumor segmentation (Section 2.4), radiomic feature extraction (Section 2.5), a two-step feature selection and classification (Section 2.6), and best model selection and validation using an external dataset (Section 2.7). Figure 1. The steps of the adopted radiomics pipeline. They include MRI acquisition, tumor segmentation, feature extraction, feature selection, and model analysis.

Patients
The patient cohort included two datasets of BC patients. For the DS M , 92 MRI preoperative examinations of patients with BC (93 lesions) were collected from February 2017 to February 2020 and retrospectively evaluated. Inclusion criteria were the following: (1) age >18 years and (2) patients with histologically proven BC. Patients were excluded if the histological report was unavailable and the MRI images were significantly affected by motion artifacts. All patient information was de-identified before the data were stored in the collection of BCU Imaging Biobank [22]. The study was approved by the Ethical Committee IRCCS Pascale (Prot. 12/19 OSS SDN), and written informed consent was obtained from all participants.
The DS T was used to perform model validation. It included 164 MRI studies (Digiatl Imaging and COmmunications in Medicine-DICOM-format, 88.1 GB) of 139 Breast Cancer patients from several American hospitals and clinics. The clinical, genetic, and pathological data were acquired from the Genomic Data Commons Data Portal [23]. To reduce potential image acquisition variation, only breast MRI studies that were similar in acquisition and technique (namely, MRIs that were acquired on a 1.5 T magnet strength using GE (GE Medical Systems, Milwaukee, WI, USA) scanners and protocols) were analyzed. This selection procedure resulted in a total of 93 patients. For these cases, tumor segmentations were available in binary format [24]. One subject with missing DCE images and one with missing genotyping data were excluded from the study. Finally, the DS T consisted of 91 BC patients. The images and segmentations were downloaded and converted in NIfTI (Neuroimaging Informatics Technology Initiative) format.

MRI Acquisition
MRI examinations of the DS M were performed using a 3 T Biograph mMR (Siemens Healthcare, Erlangen, Germany) with a dedicated breast surface coil. T2 Turbo spin-echo (TSE T2) sequence was acquired on an axial plane before contrast-agent injection, and DWI with b values of 50, 500, and 800 s/mm 2 was acquired on the axial plane with their corresponding ADC maps. DCE-MRI studies were obtained with intravenous administration of paramagnetic contrast agent (Prohance, Bracco Imaging, Italy) 0.3 mmol/kg, a flow rate of 3.5 mL/s, injected after six pre-contrast transaxial T1 Vibe with flip angles of 2 • , 5 • , 8 • , 12 • ,15 • , and 20°, followed by a T1 Vibe axial dynamic (TR/TE = 5.47/1.75) with 60 measurements over a 10 min period and a temporal resolution of 9.6 s. Subtracted DCE images (SUB) were obtained automatically by subtracting pre-contrast images from the post-contrast (PC) images. Finally, an axial high-resolution T1 Vibe with fat suppression (HR Vibe T1-w fat sat) was acquired. Technical details of MRI sequences are shown in Table 1.

Image Processing and 3D ROI Segmentation
ADC images were non-rigidly coregistered on SUB PC DCE-MRI images using Elastix software (v4.9.0) to correct for typical spatial distortion arising from DWI acquisition. T2 images were all resliced on DCE-MRI images. Lesion segmentation was performed on SUB DCE images by an experienced radiologist using an in-house developed software for region labeling. During the segmentation procedure, the radiologist was blinded to both the histological results and all clinical information relative to the retrospective breast mpMRI images. The delineated ROIs were then copied and pasted into the PC DCE-MRI, registered ADC, and resliced T2 images (refer to Figure 2 for an example of primary BC lesion). Before radiomic feature extraction, normalization was applied on T2 and PC image intensities. Specifically, intensities were normalized by centering them at their respective mean value with a standard deviation of all grey values in the original image [25].

Two-Step Feature Selection and Learning
Since the number of extracted radiomics features was very high, using all of them for the classification step was generally ineffective because these features are redundant and highly correlated. Moreover, when the number of features is much higher than the number of samples, the classification process might yield low-quality results due to the so-called curse of dimensionality. Thus, a feature selection process has been applied to remove redundancies while preserving features that might give greater contributions in terms of classification [30].
Seven feature selection methods, described in Table 2, were used. These techniques were chosen mainly because of their popularity in literature, simplicity, and computational efficiency [31]. Table 2 classifies the methods based on their type, i.e., ranker or subset, relation with the subsequent classification approach, and returned results [32].
Before the classification step, the Synthetic Minority Oversampling Technique (SMOTE) algorithm [33] was applied to overcome the over-fitting problem that might arise when an unbalanced set of data is used. New samples in feature space were produced through data interpolation among the instances that lie together, obtaining a more balanced set. Finally, to perform classification, six well-known ML algorithms have been exploited [34]: K-Nearest Neighbors (KNN) [35], Naive Bayes (NB) [36], Support Vector Machine (SVM) [37], Decision Tree (DT) [38], Multi-Layer Perceptron (MLP) [39], and Random Forest (RF) [40]. There is not a universally recognized ideal approach that could be considered as a standard choice for feature selection: indeed, different methods, combined with various classification algorithms, might give very different results on the same dataset, as well as in terms of the generalization ability of the extracted feature subset on new data. Moreover, the datasets examined in this work were characterized by a small number of patients with respect to the number of available features. To better exploit the available data, a crossfold validation approach was chosen rather than dividing the dataset into the training, validation, and test set. This choice, in turn, raised the problem of merging the features selected over the different folds.
To address the problem, a two-step approach was adopted: in the first step, complete filter methods were exploited to greatly reduce the amount of features used for classification, whereas, in the second step, more complex algorithms were employed to fine-tune the selection of the most representative features. This process was inspired by similar methodologies, such as those proposed by Ge et al. and Yang et al. [41,42], designed to address the same issue of having a higher number of features than the input samples, a very typical condition when dealing with biomedical data. In the first step of the pipeline, the complete ranker filter methods (Chi Squared, Fisher Score, Gini Index, and ReliefF) were used to delete the most redundant features. In the second step, three different approaches were considered to boost diversity: i) the Least Absolute Shrinkage and Selection Operator (LASSO) Regression with Recursive Feature Elimination (LR-RFE), the Mutual Information (MI) method, and Correlation-based Feature Selection (CFS). The proposed two-step feature selection pipeline is fully described in Figure 3. In the first learning step, the complete ranking methods were applied on all the extracted folds and using a predefined range of feature numbers set by the threshold t 1 . The highest classification performance, expressed in terms of F1-score, was evaluated among all the feature selection and classification method combinations and for the different thresholds t 1 . Then, the associated feature subset must be combined over the different folds used for the classification process. Similar to homogeneous ensemble learning, in which solutions belonging to different data splits are combined, the feature selected over the folds must be aggregated to produce a final reference set. Following Bolon et al. [43], an aggregation strategy based on tracking the minimal rank of each feature (min pos ) and the number of times it has been chosen in that position (num pos ) over the different folds was used. The features were ordered by min pos first and num pos after, thus obtaining a list of ordered features over all the folds. The features reaching the position specified by the threshold t 1 at least once were selected and became the new feature set on which the second step of the pipeline was then performed. The strategy was very similar to the previous step, with the slight difference that the number of the extracted features had to be directly specified to the selection algorithms (except for CFS, in which the number was automatically determined). Once again, the features were ordered as described above, but they were filtered using a lower threshold t 2 . It is worth noting that the defined thresholds were applied on the minimum position reached: the final amount of features selected might be greater than the threshold itself, owing to the different features extracted across the folds in the validation steps. Figure 3. A schematic view of the ensemble feature selection and classification steps underlying the building process of each predictive model. The first two steps involve 10-fold cross-validation on the training dataset. The last step is based on a leave-one-out cross-validation (LOOCV) approach to validate the combination of selected features and classification models. The best F-score values drive the choice of the classification algorithm at each step.

Model Selection and Validation
At the end of the two-step learning phase, the list of the most relevant features was obtained. Since the dataset size did not allow for a separate test set, and an external dataset with the same characteristics (e.g., same image type and modalities, same annotation markers) was missing, an LOOCV approach was applied to the original dataset to build the predictors. Hence, to further validate the results and simultaneously determine the best classification algorithm, a final step was performed using all the classification algorithms. The final feature list and the chosen classification algorithm constituted the final predictor. The bottom section of Figure 3 depicts this last step.
To assess the generalization abilities of the models and, more generally, the validity of the proposed two-step approach, the predictors were tested on the DS T .Unfortunately, only the predictors for the ER and PR markers can be tested since the available information regarding HER2 was partial or totally missing as in the case of Ki67. In addition, only the T1 and T2 images were present, whereas the ADC and SUB were unavailable. We used the same approach described above to assign samples to the classes whenever the needed information was available and the positive/negative official label otherwise. We ended up with 91 patients for both the ER and PC markers, with the classes distributed as follows: ER−/ER+ = (15.4%/84.6%) and PR−/PR+ = (20.9%/79.1%).
Therefore, only the ER and PR detection tests have been performed using the feature subset selected in the last step of the pipeline.

Experiments and Settings
The proposed pipeline has been applied separately to the four molecular markers. Data from the DS M comply with the common practice in markers annotation, with ER, Ki67, and PR expressed in percentages (of involved cells), whereas the HER2 had discrete values followed, in this case, by one or more plus signs. To train the classification models, the markers' expression values were binarized according to the following criteria: the ER and PR markers were considered negative if their value was lower than 10% and positive otherwise; the Ki67 marker was considered negative if it had a value lower than 14% and positive otherwise; and HER2 was considered negative if its value was 0 or 1 and positive otherwise [44]. The distribution of data classes obtained following those criteria is reported in Table 3. Except for the PR marker, the class distribution is strongly unbalanced. In the two-step feature classification process, the threshold t 1 , which should provide a coarse-grained feature subset, was set to t 1 = [5, 10, 15, ..., 50], whereas the t 2 threshold was used to extract a fine-grained feature subset and was set to t 2 = [1, ..., 10]. These values were selected for: (i) having a comparable number of input samples and features in the classification steps and (ii) trying to extract a smaller, meaningful representative subset able to generalize across datasets. Due to the reduced number of input samples, a 10-fold cross-validation approach was used in the first two steps of the feature selection pipeline. Generally, when working on classification models in which the dataset is unbalanced, the F1-score, which combines precision and recall into a single metric, is a suitable measure. Thus, the F1-score was used to select the best training models. In the next subsections, the results of the first and second steps of the feature selection are reported separately. Moreover, different model settings were taken into account: Regarding implementation details, both classifiers and feature selection algorithms have been implemented in Python 3.7, using the Scikit-learn framework [45] and scikitfeature package [46], respectively. As for the over-sampling algorithm SMOTE, we adopted the implementation provided by the Imbalanced-learn library [47].

Results of the First Feature Selection Step
All the experiments performed for the first step of the feature selection are available in the Supplementary Materials (Tables S1-S4). However, to give the reader an idea of the available information, some results are reported in Table 4. For each of the different input combinations, the best F1-score is shown. Features from different image modalities were taken all together or used separately. Considering the ER marker, the T2 modality emerged as the one performing better when using only radiomics features and also with a combination of radiomics and clinical data. When all the image modalities were used together, the performances decreased. Considering the HER2 marker, ADC was the modality giving the best results, whereas for Ki67, the best performances were obtained using all the image modalities and only the radiomics features. Finally, for the PR marker, the PC modality gave the best results both with clinical data and without them.
Referring to the Supplementary Tables S1-S4, the classification results obtained with the complete set of features were generally much worse than the ones obtained after the feature selection. This result was expected since the number of features was much higher than the number of input data. Hence, feature selection was confirmed to be a mandatory step when performing classification in these conditions. Table 4. Best results of the first feature selection step for all markers. For each combination of features (i.e., single radiomics, single radiomics with clinical information) and for each combination of modalities (i.e., all together (ALL) and single), the best mean F1-score obtained over the folds is shown. Bold font indicates the best results for each marker. The Supplementary Materials show the results obtained for all the possible combinations of features, namely those using all the radiomics modalities (thus totaling more than four thousand features for each patient), using both the radiomics and the clinical information, and using single radiomics modalities (about one thousand features for each patient). The two best results were considered for each feature combination, and the feature threshold and selection algorithm used for each of them was reported. On the feature subsets obtained from the two best results, the second pipeline step was then applied. In all the cases, the only clinical feature available (that is, the patient's age) did not emerge as an important one. Hence, only the radiomics features were used. There is no preferred feature selection algorithm over the four biomarkers and the different feature combinations. The Fisher Score method was one of the most used, thus suggesting that it could be a good starting choice if one needs to perform a single feature selection step. When the imaging modalities were used together, the ReliefF method generally gave the best results due to its higher robustness to noise and redundancy. Nonetheless, both for the HER2 and Ki67 markers, on which ReliefF and Fisher Score gave the best results, the subsequent feature selection step performs better when using the second best (i.e., Gini Index for HER2 and Chi Squared method for Ki67, as reported in the following subsection). This might be due to the classification algorithm overfitting the selected features owing to the small dataset size. Regarding the radiomics modalities, the SUB one always gave poor results, whereas the T2 and PC were often preferred. The ADC alone never emerged, but it proved useful in combination with other modalities. In addition, for the t 1 threshold, there is no preferred value. The t 1 value giving the best results changes widely along the different modalities, features, and classification algorithms.

Results of the Second Feature Selection Step
The final classification results for all the markers obtained after the second feature selection step are shown in terms of F1-score in Table 5 and in terms of accuracy, precision, and recall in Table 6. In addition, in the second pipeline step, there was no clearly winning feature selection algorithm. LR-RFE and CFS gave the best results, with the MI approach performing generally worse. MLP is the classification algorithm more frequently used, although high results were also obtained with the DT and the KNN models (for the HER2 and PR markers, respectively). In the final step, the best-performing classification algorithm was chosen to obtain the final predictor. There was no preferred one, with MLP usually performing as well as the SVM, with the second being more prone to overfitting or learning a single class in the most unbalanced cases (as in Ki67). Table 5. Results for the four markers after the proposed two-step pipeline and the model validation and selection. The features and the learning algorithm obtained at the final LOOCV step are used to build the model predictors. Only the best result (in terms of F1-score) is shown here for each marker. Since the classes were unbalanced (except for the PR marker), the F1-score obtained considering both the label values as the positive class is reported. FSA: Feature Selection Algorithm; LA: Learning Algorithm. The F1-scores for the LOOCV step were calculated considering the two labels in turn as the positive class. Indeed, for these markers, both the negative and positive states are important in defining the cancer's molecular subtype. Especially in the case of the HER2 marker, whose class distribution was highly unbalanced, with 71% of samples belonging to the negative class, it was very important to understand how the predictor behaved. Being trained on a majority of negative samples, it performed better in detecting negative samples. However, it had good abilities also with a positive sample. When the classes were balanced, as in the PR case, there was no difference between the two values. Confusion matrices associated with performances of the predictors at the end of the two-step pipeline were reported in Figure 4. The final feature number was obtained in the LOOCV step, and the list of the extracted features list is available in Table 7. For HER2, the best results were obtained with a combination of features coming from two different image modalities (i.e., ADC, T2). For the Ki67 and PR markers, the features all belong to the PC modality, whereas for the ER marker, they belong to the T2 one. Concerning the validation performed on the DS T , a noteworthy F1-score of 0.88 was obtained for the ER marker. Since other works report their performances based on the Area Under the ROC curve (AUC), this measure was also computed for the ER marker, obtaining an AUC = 0.77. For the PR marker, results were less encouraging. The F1-score was 0.88, but only a single class (the positive one) was predicted on those data, thus resulting in an AUC = 0.63, as expected for this condition. Figure 5 shows the curves obtained for the two markers.

Discussion
This study aimed at comparing the performance of different radiomic models in the prediction of the four most widely used molecular markers (ER, HER2, Ki67, PR) in BC management, using a two-step feature selection radiomic approach to extract meaningful mpMRI feature subsets. Predictions were performed under different settings, i.e., with or without clinical features with radiomics and, for the latter, in all the mono-and multimodality combinations of MRI sequences.
The results obtained in this study demonstrated that the proposed approach was an accurate method to pre-operatively predict the most relevant molecular markers, with the best resulting models composed of only radiomic features and reaching F1-scores up to 0.9. In particular, for HER2, the best results were obtained with an SVM model built with three T2 texture features and the minimum value of ADC from the LLH wavelet transformed ADC map. Other studies found that MRI-based features were associated with the HER2 status of patients with BC [48]. For the Ki67 and PR markers, the best results were obtained with an MLP model built with features all belonging to the PC modality, which currently represents the clinical standard for characterizing BC lesions [49]. These results are partially in accordance with some previous studies investigating the power of radiomics for Ki67 and PR status prediction [50][51][52], although they also found promising results arising from different sequences. Of note, shape flatness was the only shape feature that contributed both in the PR prediction model and in the best-performing model for the ER marker (RF), which was surprisingly built entirely on T2 features. Shape flatness characterizes the shape of the tumor, and in particular, a small flatness value indicates an irregular tumor shape. This feature has been shown to have power in the prognostic prediction of BC patients [53,54]. It is worth noting that, except for the prediction of HER2+ status, radiomics features are critical for model construction derived from a single MRI sequence.
From the methodological point of view, our two-step pipeline is novel, although it shares some similarities with approaches used in other works. In particular, in Xie et al. [55], a two-step pipeline was proposed to filter features, distinguishing between coarse-grained and fine-grained feature subsets with the classification target of simultaneously classifying the four immunohistochemically derived cancer subtypes. They reported a mean accuracy of 0.72 on a private dataset. Apart from being different in the imaging modalities and the learning targets used, they relied on a single statistical method for the first feature selection step, whereas we exploited several different algorithms to choose the most suitable one for the problem at hand. We suggested the importance of exploiting features coming from different imaging modalities, as also reported in Liu et al. [56], where they use conventional T2, DWI, and T1w DCE imaging to predict cancer subgroups and in particular to distinguish between HER2-positive/negative receptor status. They evaluated the performances on a private dataset and reported good results in the training phase (AUC = 0.78) and lower in the testing phase (AUC = 0.62), with better performances for the models exploiting multimodal features than monomodal ones. Notably, we found that for HER2, a multimodal feature set is needed to classify patients properly. This intuition was confirmed by the drop in F1-score to 0.57 when the ADC feature is removed. This result also underlines the critical importance of DWI for BC characterization, as also reported in previous studies [57]. Unfortunately, it was not possible to directly compare the results obtained with the proposed methodology with similar strategies on the same learning target, owing to the different, not publicly available dataset used. However some information could be extracted by looking at large-scale studies such as [58]: the authors exploited feature selection and machine learning (ML) approaches on a large private DCE-MRI dataset, obtaining an AUC = 0.65 for the ER status using training and test sets extracted from the same dataset. The result obtained in this study was higher, suggesting the good generalization abilities of the proposed predictor. In Li et al. [19], about forty radiomics features were extracted from the same TCGA-BRCA dataset we used, and the prediction ability of the features was assessed on the four clinical biomarkers through statistical analyses. Considering the ER biomarker prediction, they reached an AUC = 0.89. In Guo et al. [14] the authors used logistic regression to predict different outcomes, including the ER marker on which they obtained an AUC = 0.79. Again, a direct comparison was not possible since the predictors were built directly on the TCGA-BRCA data, while in this work, this dataset was only used as a test set (DS T ). However, it is essential to emphasize that the relevance of the obtained result lies in having the models trained on a dataset completely different from the one used to test them. Concerning the PR marker, the results from the already cited works reported AUCs of 0.62, 0.69, and 0.69, respectively. In this study, the predictor was not able to distinguish among the PR− /PR+ classes and assigned all the patients to the PR+ class. This was somehow expected given the differences between the used datasets and the small size of the training data and deserves to be further explored with the usage of additional data for the training phase. Referring to the Supplementary Materials, Table S5 reports a comparison of the experimental design adopted by the aforementioned works.
Despite the interesting results obtained, this study suffers from some limitations. First, the sample size for the analysis was small and, except for the PR+/PR− classification task, unbalanced. A larger and more balanced study group is needed to perform a better radiomic analysis and build more robust prediction models. Although the model's performance was corrected by using 10-fold CV in the main classification step and LOOCV for the building of prediction models, such imbalance might have influenced the development of the ML model and the results [59]. Second, the study was retrospective and needed to be validated with other comprehensive external cohorts to determine the value of the developed model in clinical practice and improve the confidence of performance. Furthermore, prospective and multicentric studies need to be performed to define a potential standardization of the proposed approach. Moreover, the lack of standardization in radiomic investigations, in terms of image acquisition, processes, segmentation methods, and radiomics analysis tools, could lead to discrepancies in radiomic feature measurements that are not due to underlying biological variations.
Reproducibility of radiomic features is of crucial importance to clinical applications in the field of BC [60]. Of note, to extract radiomic features, we used the PyRadiomics software [61] which: (i) is compliant with the Image Biomarker Standardization Initiative (IBSI) guidelines that promoted the standardization of radiomic analysis [62], (ii) allows for a reproducible extraction of radiomic features due to the parameter files that could be shared and re-used, and (iii) can also be used starting from DICOM input images with the file name pointing to a DICOM Segmentation Image object, thus automatically obtaining radiomic features without any intermediate steps. This choice allows for a reproducible feature extraction under real clinical conditions that usually involve DICOM objects [27]. In addition, according to Lambin et al. [63], a detailed report of all the steps of the radiomic workflow performed in the study was carried out to improve both clinical translation in this emerging field and the reproducibility of study outcomes. Another limitation affecting this study concerned the use of manual segmentation for the VOIs' delineation, which is time-and labor-consuming and prone to user variability. More accurate and automatic tumor segmentation tools are needed to improve the quality of the radiomic analysis in future works. On a positive note, in this study, 3D ROIs were used for lesion segmentation. The aim was to decrease inter-reader variability by eliminating the requirement to choose a single-slice corresponding to a portion of a lesion. Hence, a more thorough description of the lesion is obtained by an increase in the number of points considered for feature computation, which improved the accuracy of characterization of heterogeneous lesions and lowered the sampling errors [64].

Conclusions
The MRI-based radiomic approach developed in this work, built on a comprehensive BC dataset including MRI sequences and molecular outcomes, can efficiently characterize BC heterogeneity according to the most examined biomarkers (ER, PR, HER2, and Ki67). This methodology might be of great support for BC management for the following reasons: (i) it has the advantage of being developed on an appropriate two-step feature selection and classification technique; (ii) it implements an effective comparison of different models and feature selection approaches); (iii) it is externally validated whenever possible; and (iv) it addresses the well-known issues arising from the lack of available BC datasets by exploiting a comprehensive dataset of molecular markers and MRI modalities. Moreover, the developed two-step pipeline is general enough to be used on similar classification problems on different cancer types. Our results also highlighted the potential and strength of using only mpMRI data for high-quality BC radiomics analysis. Further prospective and multicentric studies need to be performed to define a potential standardization of our approach. In the future, larger BC cohorts will be investigated to validate our results more extensively.

Institutional Review Board Statement:
This study was conducted in accordance with the Declaration of Helsinki, and the study protocol was approved by the ethics committee of the Istituto Nazionale Tumori "Fondazione G. Pascale" (protocol number 12/19 OSS SDN).

Informed Consent Statement:
Informed consent was obtained from all subjects involved in the study.

Data Availability Statement:
The DS M used in this paper come from the MOBL − OB collection of BCU Imaging Biobank, available under request.
Acknowledgments: Some of the results here shown are based on the data provided by the TCGA Research Network: The Cancer Genome Atlas Program, https://www.cancer.gov/tcga, accessed on 1 September 2021.

Conflicts of Interest:
The authors declare that they have no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: