Comparison of current deep convolutional neural networks for the segmentation of breast masses in mammograms

Breast cancer causes approximately 684,996 deaths worldwide, making it the leading cause of female cancer mortality. However, these figures can be reduced with early diagnosis through mammographic imaging, allowing for the timely and effective treatment of this disease. To establish the best tools for contributing to the automatic diagnosis of breast cancer, different deep learning (DL) architectures were compared in terms of breast lesion segmentation, lesion type classification, and degree of suspicion of malignancy tests. The tasks were completed with state-of-the-art architectures and backbones. Initially, during segmentation, the base UNet, Visual Geometry Group 19 (VGG19), InceptionResNetV2, EfficientNet, MobileNetv2, ResNet, ResNeXt, MultiResUNet, LinkNet-VGG19, DenseNet, SEResNet and SEResNeXt architectures were compared, where “Res” denotes a residual network. In addition, training was performed with 5 of the most advanced loss functions and validated by the Dice coefficient, sensitivity, and specificity. The proposed models achieved Dice values above 90%, with the EfficientNet architecture achieving 94.75% and 99% accuracy on the two tasks. Subsequently, classification was addressed with the ResNet50V2, VGG19, InceptionResNetV2, DenseNet121, InceptionV3, Xception and EfficientNetB7 networks. The proposed models achieved 96.97% and 97.73% accuracy through the VGG19 and ResNet50V2 networks on the lesion classification and degree of suspicion tasks, respectively. All three tasks were addressed with open-access databases, including the Digital Database for Screening Mammography (DDSM), the Mammographic Image Analysis Society (MIAS) database, and INbreast.

At the beginning of 2021, the World Health Organization (WHO) reported this disease as the most common cancer worldwide, surpassing lung cancer [6]. Currently, these statistics continue to grow, and an increase of 50% is estimated over the next two decades as a consequence of increased life expectancy, unhealthy diets, insufficient physical activity, and the consumption of harmful substances such as alcohol [7]. This demonstrates the need for research at all stages related to breast cancer, from prevention to timely diagnosis and treatment [7], [8].
Breast cancer typically manifests as a mass or lump sensation, which can be detected by breast self-examination [9]. However, not all lumps are synonymous with cancer, i.e., there are benign and malignant lumps.
Various studies suggest that the incidence rates in low-income countries are lower than those in high-income countries. However, in the latter group, the mortality rate is lower, and the incidence rate (despite being higher) has been decreasing, while in the former, it has been progressively increasing [10]. These trends may be due to risk factors inherent to the socioeconomic positions of these countries, where one of the highest risk factors is the lack of access to early breast cancer detection [11]. In addition, this may be accompanied by other factors, such as age, ethnicity, breast characteristics, reproductive patterns, hormonal and environmental factors, and alcohol and tobacco consumption [12]. However, the probability of survival depends mainly on the stage and subtype of breast cancer. Detection at early stages can reduce the mortality rate from 40% to 15% [13], so it is vital to develop systems for the early and accurate detection of breast cancer.
There are many tools for diagnostic assistance in different areas of medicine [14]. Breast cancer is no exception to this rule, where technological evolution has allowed the integration of complex tools such as mammography, magnetic resonance (MR) imaging, positron emission tomography (PET), computed tomography (CT), and single-photon emission computed tomography (SPECT) [15]. These techniques have made it possible to evaluate and detect breast cancer with a high percentage of accuracy. However, the costs of these pieces of equipment prevent them from being integrated into health systems, especially in low-income countries or regions with difficult access [16], [17]. In this sense, conventional mammography is usually the most economical and viable solution. In addition, it is one of the most efficient tools for early breast cancer diagnosis [18]. In such mammographic images, benign masses appear as regular shapes, while those with irregular borders are usually malignant [19]. Furthermore, research has shown that annual mammograms can help detect abnormalities even before the patient or physician can perceive a significant change [20]. Consequently, mammography plays a primary role in the early detection of breast cancer, increasing the likelihood of curing the disease and the success of breast-conserving surgeries [20], [21]. In fact, this examination's effectiveness can decrease mortality from 40% to 20% and increase the 5-year relative survival rate to 99% in screened women [22]-[24]. In this sense, mammography's potential has encouraged multiple applications ranging from interpretation and analysis to visualization approaches for medical data [25]. Moreover, state-of-the-art artificial intelligence techniques have enabled the integration of complex tasks such as detection, segmentation, and classification with speeds that exceed human performance [26], [27].
However, much work is needed to develop and refine these systems, particularly in mammography cases, where the structure of the breast is quite complex. Additionally, traditional medical segmentation techniques are time-consuming and knowledge-intensive processes that can lead to errors or subjective diagnoses [28].
As mentioned above, lesions or masses are the main signals utilized for breast cancer diagnosis [8]. The boundary information in the affected regions reflects the growth pattern and biological characteristics of the disease [29]. Therefore, poor mass segmentation limits the classification of these masses (benign or malignant), making segmentation one of the most important processes in new diagnostic aid systems for breast cancer detection.
On the other hand, in the last decade, various artificial intelligence techniques for segmenting medical images or environments with objects of interest have been studied [30], [31]. The implementations range from basic image processing techniques to current deep learning (DL) algorithms, where the latter have exhibited exponential growth in the areas of health informatics and medical imaging [31]-[33]. DL includes architectures such as the convolutional neural network (CNN), which is inspired by the primary visual cortex [34]. In particular, the CNN design can extract complex features at the same level as humans, giving it a more efficient generalization capability than that of conventional machine learning methods. Furthermore, DL can be performed on raw data, i.e., it is unnecessary to preprocess the input images or to know the background of the problem in detail [35]. Moreover, the paradigm shift toward automatic diagnostic aid systems is a reality inherent to technological evolution due to the generation of large datasets and the development of state-of-the-art computers [36]. The implications of using artificial intelligence range from reducing radiologist workloads and aiding diagnosis to improving response times and even providing information that is not perceptible to the naked eye during breast mass segmentation; therefore, AI systems are handy tools in daily medical practice [37], [38].
Following the above considerations and with the aim of improving the accuracy of automatic breast exam segmentation, we perform a comparative analysis of 12 DL networks with the latest backbones and architectures in this paper. The models are among the most efficient and most widely used methods in classification and segmentation tasks. Additionally, we propose studying these models under the five most-used loss functions to better compare them. The analysis includes architectures such as the original UNet, Visual Geometry Group 19 (VGG19), InceptionResNetV2, EfficientNet, MobileNetv2, ResNet, ResNeXt, MultiResUNet, LinkNet-VGG19, DenseNet, SEResNet and SEResNeXt, where "Res" denotes a residual network. The model training process is performed with five loss functions: binary cross-entropy, weighted binary cross-entropy, Dice, focal Tversky, and log-cosh Dice.
Mainly, the following elements are highlighted in the study.
• A model is found that achieves scores exceeding those reported for state-of-the-art methods in similar works regarding the segmentation of breast lesions.
• A comparative analysis of the 12 state-of-the-art architectures with respect to the segmentation task is conducted.
• The state-of-the-art architectures and backbones presented before or during 2021 are included.
• Different loss functions are compared to determine which one performs best on the segmentation task.
• The importance of resolution for achieving strong evaluation metrics in the segmentation scenario is highlighted.
Finally, the paper is organized as follows. Initially, work related to the segmentation and classification of breast lesions is addressed. The main DL techniques that have been used to address these problems are highlighted, followed by a brief review of the most recent works. Next, details of the methodology used to explore the different networks are given, and the main characteristics of the utilized materials and methods are shown. Subsequently, the results are shown while each of the findings is discussed, leading to a general discussion that highlights the most relevant elements of the study. Finally, the main conclusions are presented.

II. RELATED WORK
Early concepts in mammographic image anomaly automation date back to the 1960s [39]. Originally, developments were focused on minimizing errors due to fatigue or inherent in human execution [39], and since that time, research and developments have included techniques ranging from basic image processing methods to recent DL techniques [40].
DL has exhibited exponential growth in recent years, and there are even recent reviews highlighting the use of CNNs for different tasks and datasets. For example, Abdelrahman et al. [41] reported advances with modern architectures such as ResNet, UNet, DenseNet, and attention mechanisms. The results demonstrated excellent performance on tasks such as classification, detection, and segmentation. However, the survey lacked heterogeneity between the reviewed models and techniques, as the studies used different databases and evaluation metrics, limiting the ability to conduct an objective comparison between architectures [41]. On the other hand, diagnostic aid approaches can also be performed with other techniques.
For example, Zhou et al. defined a series of image intensity steps, consisting of background removal, pectoralis muscle removal, and a technique based on a regularized distance level to segment masses [42]. Similarly, Sadeghi et al. used image intensity with a new adaptive thresholding method based on variable-size windows. This method allows for the exact location of a mass to be calculated, reducing the possibility of generating false positives [43]. Salih et al. proposed mass segmentation through classical and diffuse morphological techniques. Their method processes the breast's internal structures generated from a thresholding process, allowing for highlighting and extracting the lesion of interest [44]. In other more novel approaches, Kamil et al. implemented two clustering techniques as segmentation methods. In the first technique, they employed the K-means method, and in the second approach, they employed the fuzzy c-means (FCM) algorithm. In both cases, the techniques were integrated with the lazy snapping algorithm as an additional step, improving the segmentation of abnormal areas [45]. These techniques achieved accuracies of 91.18% and 94.12%, respectively.
Although the above methods are promising, most researchers focus on the versatility, performance, and advantages of recent DL algorithms. For example, Li, Abdelhafiz, De Moor, and Zhu et al. approached the problem of breast lesion segmentation with CNNs [28], [46]-[48]. Li et al. combined the densely connected UNet (DenseNet) with attention gates (AGs). The model was trained under the cross-entropy loss function, and its performance reached 82.24%, 77.89%, and 78.38% in terms of the F1-score, sensitivity, and accuracy metrics, respectively [28]. Similarly, Abdelhafiz et al. used UNet as a base network in two different mass segmentation studies. In the first one, UNet was integrated with residual attention blocks (RUNet). The network was trained with the Dice loss function, and its segmentation ability was validated with the accuracy metric, reaching a value of 98.7% [46]. The second used a version of UNet enhanced with batch normalization layers, dropout layers, and additional convolutional layers. Again, the network was trained with the Dice loss and achieved an accuracy of 92.6% [47]. De Moor et al. used the base UNet for segmentation and evaluated it through free-response receiver operating characteristic (FROC) analysis; in their study, they achieved a maximum sensitivity of 0.94 [48]. Finally, Zhu et al. implemented a fully convolutional network (FCN) to model a potential function, followed by the use of a conditional random field (CRF) to perform structured learning. The design was trained with the maximum likelihood loss function (a distribution-based loss), achieving a Dice score of 91.30% [49].
Finally, Salama et al. [50] segmented and classified mammograms by implementing DenseNet121, ResNet50, VGG16, and MobileNetV2 models for classification and a UNet model for breast region segmentation. The results achieved a maximum accuracy of 98.87% for the classification case.

A. DATASET
Three sets of data were taken for the different tasks performed on the mammograms, which are described below.

1) SEGMENTATION
Segmentation was performed on the public database "Curated Breast Imaging Subset of the Digital Database for Screening Mammography" (CBIS-DDSM) [51]-[53]. The data contain the digital mammograms of several subjects with corresponding segmentation masks produced by expert radiologists. Only mammograms confirmed as normal, benign, or malignant cases with verified pathological information were included. Therefore, a total of 714 randomly distributed images were used for training, validation, and testing, divided into groups of 499, 72, and 143 images, respectively. It should be clarified that each model was run 20 times, and during each run, the training, validation, and test data were randomly selected to obtain more accurate descriptions of the architectures and ensure that the results did not depend on the splitting of the data. The process is similar to cross-validation and is known as Monte Carlo cross-validation.
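The Monte Carlo cross-validation scheme described above can be sketched as follows. This is a minimal illustration; the function and variable names are ours, not taken from the original pipeline.

```python
import random

def monte_carlo_split(indices, n_train=499, n_val=72, n_test=143, seed=None):
    """Randomly re-partition the 714 images into train/val/test for one run.

    Re-drawing the split on every run (Monte Carlo cross-validation) keeps
    the reported metrics from depending on any single partition of the data.
    """
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:n_train + n_val + n_test]
    return train, val, test

# One of the 20 runs: a fresh random partition of the 714 image indices.
train, val, test = monte_carlo_split(range(714), seed=0)
```

Repeating this with a different seed for each of the 20 runs yields the per-run splits over which the metrics are averaged.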

2) CLASSIFICATION
The mammograms were classified in two different ways. The first method was based on the types of lesions, i.e., whether they were calcifications, well-defined or circumscribed masses, spiculated masses, other ill-defined masses, masses with architectural distortion, asymmetric masses or normal areas. In this case, 322 images from the mini-Mammographic Image Analysis Society (MIAS) database of mammograms were used [54]. Second, classification by degree of suspicion (BI-RADS) was performed using 410 images from INbreast, a full-field digital mammographic database [55]. The two datasets were divided into proportions of 60, 20, and 20% for training, validation, and test data, respectively, as shown in Table I.
Similar to segmentation, during classification, the data were taken randomly in each run, ensuring that the results obtained did not depend on the split.

B. PREPROCESSING
The main feature of DL is that it can work with raw data [35]. For this reason, the mammograms were subjected to only two processes. First, they were normalized, converting the intensity values to a scale from 0 to 1. It should be clarified that the normalization was performed because neural networks work more efficiently with such values and floating-point data; it does not imply a reduction in or loss of image information.
Second, due to the large sizes of the images and the differences between them, the images were downsampled as follows: the segmentation images were resized to 512 × 512 to reduce the computational load. Similarly, the images for classification were also reduced in size; however, their sizes were set to 256 × 256 both to reduce the computational load and to increase the number of images for use with the new data augmentation method explained in the next section.
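The two preprocessing steps (intensity normalization and downsampling) can be sketched as below. The nearest-neighbour indexing used for resizing here is an illustrative stand-in; the original pipeline could use any standard image-resizing routine, and the function name is ours.

```python
import numpy as np

def preprocess(image, target=(512, 512)):
    """Normalize a mammogram to [0, 1] float32 and downsample it.

    Target is 512x512 for the segmentation images and 256x256 for the
    classification images. Nearest-neighbour index mapping stands in for
    a proper resampling filter in this sketch.
    """
    img = image.astype(np.float32)
    img = (img - img.min()) / (img.max() - img.min() + 1e-8)  # scale to [0, 1]
    rows = np.arange(target[0]) * image.shape[0] // target[0]
    cols = np.arange(target[1]) * image.shape[1] // target[1]
    return img[np.ix_(rows, cols)]

# Example: a synthetic 12-bit mammogram downsampled for segmentation.
x = preprocess(np.random.randint(0, 4096, (3000, 2400)), target=(512, 512))
```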

C. DATA AUGMENTATION
Data augmentation techniques were integrated into this study to increase the size of the training set and avoid overfitting the model. For the case with the segmentation images, a total of 7 additional images were generated for each image, yielding 3992 mammograms for training. The images were generated by inverting the pixels from right to left and rotating the images at 90-degree angles in all possible positions, as illustrated in Figure 1. Similarly, this process was performed for the true segmentation of the breast lesions. The two databases used in the classification process had unbalanced data, i.e., some classes had few subjects, while other classes had significant numbers of subjects. Therefore, data augmentation was performed in two different ways. In the first approach, eight images per subject were augmented regardless of the number of subjects per class.
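The eightfold augmentation described above (a left-right mirror combined with all 90-degree rotations, applied identically to image and mask) can be sketched as follows; the function name is ours.

```python
import numpy as np

def augment_eightfold(image):
    """Generate the 8 orientation variants of an image (and, identically,
    of its segmentation mask): the 4 rotations by 90 degrees of the image
    and the 4 rotations of its left-right mirrored copy."""
    variants = []
    for base in (image, np.fliplr(image)):
        for k in range(4):  # k quarter-turns: 0, 90, 180, 270 degrees
            variants.append(np.rot90(base, k))
    return variants

masks = augment_eightfold(np.eye(4))
```

Each original image thus contributes itself plus 7 new views, turning the 499 training mammograms into 3,992.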
In the second case, data augmentation was performed so that all classes had the same number of images, generating up to a maximum of 16 images per subject. The 16 possible images were generated by resizing each image to 1024 × 1024 and viewing it as a 256 × 256 grid of quadrants, each of size 4 × 4 pixels. Each augmented image was then formed by taking, from every quadrant, the pixel at the same position, yielding 16 images of 256 × 256 pixels. The process is illustrated in Figure 2.
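The quadrant-based augmentation is equivalent to strided slicing, as the following sketch shows (the function name is ours):

```python
import numpy as np

def quadrant_augment(image):
    """Split a 1024x1024 image into sixteen 256x256 sub-images.

    The image is viewed as a 256x256 grid of 4x4 quadrants; picking the
    pixel at the same (i, j) position inside every quadrant is equivalent
    to strided slicing and yields 16 slightly shifted low-resolution
    copies. Applied to minority classes, this balances the class sizes.
    """
    assert image.shape == (1024, 1024)
    return [image[i::4, j::4] for i in range(4) for j in range(4)]

subs = quadrant_augment(np.arange(1024 * 1024).reshape(1024, 1024))
```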

D. SEGMENTATION WITH DL NETWORKS
In medical imaging, the segmentation process consists of classifying each pixel into all possible elements of interest (e.g., background and affected tissues). UNet [49] is one of the most popular networks for this task and was originally created with a focus on medical imaging. The structure consists of convolutional layers interspersed with pooling layers, forming an encoder-decoder design. The low-level layer features are combined with the high-level layers (see Figure 3), preserving some of the spatial information. The encoder extracts the features, and the decoder performs upsampling. The major difference between UNet and other segmentation networks is that UNet adopts splicing and fusion in the channel dimension. Additionally, the network can be trained with a small amount of data since its structure converges quickly [56]. This design allows UNet to be highly efficient in the segmentation process. However, recent studies have presented variations of the network (e.g., with backbones from other models), which could be more efficient for this task. In this sense, we proposed evaluating 11 of the most novel convolutional networks, plus the base UNet as a reference. The implemented networks are shown in Table II with their relevant characteristics.
The networks are made up of several layers, and each convolutional layer uses filters to extract the desired features once the model is trained. The following mathematical model governs the convolutional layers.
$$y_j^{(l)} = f\left( \sum_{i=1}^{m^{(l-1)}} y_i^{(l-1)} * w_{ij}^{(l)} + b_j^{(l)} \right)$$

where $y_j^{(l)}$ is the feature map (output) of the l-th convolutional layer associated with the j-th convolutional filter $w_{ij}^{(l)}$, and $y_i^{(l-1)}$ is the i-th output of the previous layer, i.e., the input to the l-th layer.
$b_j^{(l)}$ is the bias, and $m^{(l-1)}$ is the number of feature maps in the previous layer. Additionally, f denotes a nonlinear activation function, which usually consists of a rectified linear unit (ReLU). Although all of these neural networks are based on convolutional layers, their behaviors can vary significantly due to their structural designs, i.e., the depth of the network (number of layers); the number of filters, connections, or trajectories; and other specific features, as described in Table II.

TABLE II
STATE-OF-THE-ART CNN ARCHITECTURES

Network                   Date of publication   Remarks
VGG19 [57]                2015                  Sequential convolutional architecture with a depth of 19 weight layers.
ResNet [58]               2016                  First network with residual connections between convolutional layers.
InceptionResNetV2 [59]    2017                  Modified multipath convolutional network with residual connections.
DenseNet [60]             2017                  Architecture with direct-access connections throughout the network, i.e., a densely connected structure.
LinkNet-VGG19 [61]        2017

The input image passes through the architecture, generating the output (see Figure 3). Finally, the training results are validated using the associated loss function, and the model parameters are iteratively adjusted until the best performance is obtained.
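The convolutional-layer model above can be sketched numerically as follows. This is a minimal single-channel, single-filter example with a ReLU activation; a real layer applies many filters across many input channels, and deep learning frameworks typically implement the sliding-window product as cross-correlation, as done here.

```python
import numpy as np

def conv2d_valid(x, w, b=0.0):
    """One 'valid' feature map from a single-channel input, followed by
    a ReLU activation, mirroring the layer equation above."""
    kh, kw = w.shape
    H, W = x.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Sliding-window weighted sum plus bias.
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w) + b
    return np.maximum(out, 0.0)  # ReLU

# A 3x3 averaging filter over a 4x4 input yields a 2x2 feature map.
fmap = conv2d_valid(np.arange(16.0).reshape(4, 4), np.ones((3, 3)) / 9.0)
```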

E. LOSS FUNCTIONS
As mentioned above, image segmentation consists of classifying pixels into different types of elements, usually those associated with the background and the object of interest (e.g., a breast lesion). The difference between the regions spanned by the elements (data imbalance) usually causes the networks to be biased toward the larger element. However, some loss functions can circumvent this problem. These can be classified into four different types: distribution-based, region-based, boundary-based, and composite loss functions [67]. Therefore, we proposed using five of the most common loss functions to fit the segmentation models to the training data. The utilized loss functions are described in detail below.

1) BINARY CROSS ENTROPY
Binary cross entropy is a loss function that is commonly used to measure the difference between two probability distributions. This principle can be applied to individual pixels in images, classifying elements into two possible values: the background and the object of interest [68]. The binary cross-entropy loss ($L_{BCE}$) is mathematically defined as:

$$L_{BCE} = -\left[ y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \right] \tag{1}$$

where $y$ is the true value (label) and $\hat{y}$ is the predicted probability of the label for the same element in the dataset. It should be clarified that the binary cross entropy over a dataset is defined as the average over all the elements that compose it.
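A NumPy sketch of the binary cross-entropy loss follows; a framework implementation (e.g., in Keras or PyTorch) would operate on tensors, but the arithmetic is the same.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Per-pixel binary cross-entropy, averaged over the image.

    y_true holds the ground-truth labels (0 = background, 1 = lesion)
    and y_pred the predicted lesion probabilities.
    """
    p = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

loss = binary_cross_entropy(np.array([1.0, 0.0]), np.array([0.9, 0.1]))
```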

2) WEIGHTED BINARY CROSS ENTROPY
As in the previous case, the weighted binary cross-entropy loss is used to measure the difference between two distributions. However, this variant weights the sets, eliminating the bias induced by imbalanced data [69]. The weighted binary cross-entropy loss is mathematically defined as:

$$L_{wBCE} = -\left[ \beta\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \right] \tag{2}$$

Here, $y$ is the true value (label), $\hat{y}$ is the predicted probability of the label, and $\beta$ is the weighting coefficient used to adjust for false positives or false negatives.
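A sketch of the weighted variant follows; the value of the weighting coefficient here is illustrative, not the one used in the study.

```python
import numpy as np

def weighted_binary_cross_entropy(y_true, y_pred, beta=5.0, eps=1e-7):
    """Weighted binary cross-entropy.

    beta scales the positive-class term: beta > 1 penalizes false
    negatives more heavily, countering the imbalance between the small
    lesion and the large background. beta=5.0 is an illustrative value.
    """
    p = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(beta * y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

loss = weighted_binary_cross_entropy(np.array([1.0, 0.0]), np.array([0.5, 0.5]))
```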

3) DICE LOSS
The Dice coefficient is a statistic used to calculate the similarity between two samples. Its use can be extended to images by comparing the similarity between spatially matched pixels [70]. The coefficient has also been included in a loss function, which is mathematically defined as:

$$L_{Dice} = 1 - \frac{2\, y \hat{y} + 1}{y + \hat{y} + 1} \tag{3}$$

where $y$ is the true value (label) and $\hat{y}$ is the predicted probability of the label. It should be noted that equation (3) is modified with a 1 in the numerator and denominator, ensuring that the function is defined even in extreme cases where $y$ and $\hat{y}$ are equal to zero.
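A sketch of the Dice loss over whole masks follows, with the +1 smoothing terms that keep the ratio defined for empty masks.

```python
import numpy as np

def dice_loss(y_true, y_pred, smooth=1.0):
    """Dice loss with the smoothing terms from equation (3), which keep
    the ratio defined even when both masks are empty."""
    intersection = np.sum(y_true * y_pred)
    dice = (2.0 * intersection + smooth) / (np.sum(y_true) + np.sum(y_pred) + smooth)
    return 1.0 - dice

loss = dice_loss(np.zeros(10), np.zeros(10))  # defined even for empty masks
```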

4) FOCAL TVERSKY LOSS
The Tversky index is a measure of asymmetric similarity between sets [71]. This function can be viewed as a generalization of the Dice coefficient, and it is mathematically expressed as follows:

$$TI = \frac{y\,\hat{y}}{y\,\hat{y} + \beta\,(1 - y)\,\hat{y} + (1 - \beta)\,y\,(1 - \hat{y})} \tag{4}$$

Equation (4) weights the false positives and false negatives through the coefficient $\beta$. Similar to the Dice coefficient, the Tversky index can also be fitted to a loss function as follows [72]:

$$L_{T} = 1 - TI \tag{5}$$

The loss function can be modified into a focal loss by reducing the weights of individual examples and focusing the training process on hard negatives through a modulation factor [73], as shown below:

$$L_{FT} = \left(1 - TI\right)^{\gamma} \tag{6}$$

Here, the modulation factor must meet the condition $\gamma > 0$.
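A sketch of the focal Tversky loss follows; the beta and gamma values are illustrative, not those tuned in the study.

```python
import numpy as np

def focal_tversky_loss(y_true, y_pred, beta=0.7, gamma=0.75):
    """Focal Tversky loss (illustrative beta and gamma).

    beta trades off false negatives against false positives in the
    Tversky index; raising (1 - TI) to the power gamma > 0 down-weights
    easy examples and focuses training on hard ones.
    """
    tp = np.sum(y_true * y_pred)
    fn = np.sum(y_true * (1 - y_pred))
    fp = np.sum((1 - y_true) * y_pred)
    ti = (tp + 1e-7) / (tp + beta * fn + (1 - beta) * fp + 1e-7)
    return (1.0 - ti) ** gamma

loss = focal_tversky_loss(np.array([1.0, 1.0, 0.0]), np.array([1.0, 1.0, 0.0]))
```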

5) DICE LOG-COSH LOSS
The Dice coefficient is widely used in computer vision for conventional images. However, due to its nonconvex nature, a smoothed version using the logarithm of the hyperbolic cosine has recently been proposed [67]. The loss function is defined as follows:

$$L_{lc\text{-}Dice} = \log\left( \cosh\left( L_{Dice} \right) \right) \tag{7}$$

Here, $L_{Dice}$ is the Dice loss established in equation (3).
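The smoothed variant is a one-line wrapper around the Dice loss, as sketched below.

```python
import numpy as np

def log_cosh_dice_loss(y_true, y_pred, smooth=1.0):
    """Smooth the nonconvex Dice loss with log(cosh(.)): near zero the
    function behaves like L_Dice**2 / 2, giving a smoother landscape."""
    intersection = np.sum(y_true * y_pred)
    dice = (2.0 * intersection + smooth) / (np.sum(y_true) + np.sum(y_pred) + smooth)
    return np.log(np.cosh(1.0 - dice))

loss = log_cosh_dice_loss(np.ones(8), np.ones(8))  # perfect match -> 0
```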

F. EVALUATION METRICS
As an important part of the objective model comparison, our approach was based on five evaluation metrics: the Dice coefficient, sensitivity, specificity, accuracy, and F1-score. The five metrics are mathematically expressed as follows:

$$Dice = \frac{2\,TP}{2\,TP + FP + FN} \tag{8}$$

$$Sensitivity = \frac{TP}{TP + FN} \tag{9}$$

$$Specificity = \frac{TN}{TN + FP} \tag{10}$$

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \tag{11}$$

$$F1 = \frac{2 \times Precision \times Sensitivity}{Precision + Sensitivity}, \qquad Precision = \frac{TP}{TP + FP} \tag{12}$$

where the five metrics are established in terms of the true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).
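The five metrics can be computed directly from the pixelwise confusion counts of two binary masks, as sketched below (the function name is ours).

```python
import numpy as np

def segmentation_metrics(y_true, y_pred):
    """Compute the five evaluation metrics from the pixelwise confusion
    counts of two binary masks (1 = lesion, 0 = background)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)  # recall
    return {
        "dice": 2 * tp / (2 * tp + fp + fn),
        "sensitivity": sensitivity,
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }

m = segmentation_metrics(np.array([1, 1, 0, 0]), np.array([1, 0, 0, 0]))
```

Note that for binary masks the F1-score and the Dice coefficient coincide.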

1) SEGMENTATION
A comparative analysis among the models in terms of their automatic anomaly segmentation performance in mammographic images was proposed. The analysis addressed the study of the five loss functions and the 12 architectures described above. The performance of the networks was monitored during training with the Dice coefficient to determine the architecture with the best performance, and validation was likewise performed only through the Dice coefficient. Finally, all networks were evaluated (after training) on the test set under the Dice coefficient, sensitivity, specificity, and accuracy metrics. Figure 4 shows a graphical summary of the experimental design. It started with the CBIS-DDSM dataset, whose images were normalized and divided into three datasets at proportions of 70, 10, and 20% for training, validation, and testing, respectively. The training dataset was augmented with the proposed method, and subsequently, the models were trained on all possible combinations of the five loss functions (see Equations (1), (2), (3), (6), and (7)) and the 12 deep architectures (see Table II). Each model was trained for 150 epochs with the training dataset, and its parameters were adjusted to the optimal values. At each epoch, the models were evaluated on the training and validation data. Finally, the trained models were used to generate predictions for the test data, and the resulting scores were calculated through the true segmentations and evaluation metrics. Each network was run 20 times on average under the same set of hyperparameters.
The performance metrics provide detailed descriptions of the implemented models. However, in discrete space, each metric's effectiveness is subject to the size of the element of interest. For example, in equation (8), a difference of only one pixel between the actual and predicted segmentation results (false positives or false negatives) would not generate a low Dice score for a large region (many TPs).
In contrast, small regions would generate low values based on a difference of only one pixel. Therefore, to obtain a more detailed description of the Dice score as a function of size, the best network scores were compared with scores generated from dilated and eroded masks. These processes introduced minimal error into the real regions by increasing or decreasing the perimeter by one pixel (dilation and erosion, respectively).
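The size dependence of the Dice score under a one-pixel perimeter perturbation can be demonstrated as below. A pure-NumPy shift-union stands in for a morphological dilation (e.g., `scipy.ndimage.binary_dilation` with a cross-shaped structuring element); the function names are ours.

```python
import numpy as np

def shift_union(mask):
    """One-pixel binary dilation via the union of 4-neighbour shifts."""
    out = mask.copy()
    out[1:, :] |= mask[:-1, :]
    out[:-1, :] |= mask[1:, :]
    out[:, 1:] |= mask[:, :-1]
    out[:, :-1] |= mask[:, 1:]
    return out

def dice(a, b):
    return 2 * np.sum(a & b) / (np.sum(a) + np.sum(b))

# A large lesion loses little Dice score under a one-pixel perimeter
# error, while a small lesion is penalized much more heavily.
big = np.zeros((64, 64), dtype=bool)
big[10:50, 10:50] = True
small = np.zeros((64, 64), dtype=bool)
small[30:33, 30:33] = True
d_big = dice(big, shift_union(big))
d_small = dice(small, shift_union(small))
```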

2) CLASSIFICATION
Similar to the previous case, a comparative analysis of state-of-the-art CNNs was proposed for breast lesion classification. In this case, the most common classification schemes were used: classification by lesion type and by degree of suspicion. In the two classification processes, seven CNNs were addressed under the cross-entropy loss function. The performance of the networks was monitored during training through accuracy and validated with the same metric. Finally, all models were evaluated (after training) with the F1-score, accuracy, sensitivity, specificity, and precision metrics.
The process is shown in Figure 4. Initially, the datasets were split and preprocessed; data augmentation was performed using the proposed method, and the different networks were trained. It is worth noting that each network was run an average of 40 times under the following hyperparameters:
• Loss function: cross entropy.

A. SEGMENTATION
This section shows the results generated by the models under the different loss functions. The tables are presented with percentage values, while the graphs contain scores in fractional form, i.e., values ranging from 0 to 1 that are equivalent to 0 to 100%. Table III shows the maximum metric values achieved by the 12 deep CNNs. In this case, the EfficientNet architecture delivered the best result, reaching a Dice score of 94.75%. Moreover, the model achieved the highest sensitivity with a value of 95.21%, ensuring a low false negative rate, i.e., a small loss of lesioned regions. Likewise, the specificity score was 99.99%, indicating a low probability of generating false regions as lesions.
Additionally, although EfficientNet was not the architecture with the highest number of convolutional layers, it had the highest number of training parameters, i.e., this architecture had more filters per convolutional layer, allowing it to obtain a higher number of feature maps per layer.
Similarly, the InceptionResNetV2 network exhibited behavior similar to that of EfficientNet. The results show that the same specificity value was achieved, while the sensitivity differed by almost 1%. The Dice score also differed by less than 1%, i.e., the network had high performance (slightly lower than that of EfficientNet) but with a smaller number of training parameters, which implies that a lower computational load is required when implementing this model.
In contrast, the base UNet was the model with the lowest number of training parameters and the lowest number of convolutional layers, which largely explains its low performance compared to that of the other architectures. Similarly, Table IV indicates the highest scores achieved by the loss functions, where binary cross entropy generated the highest Dice score, sensitivity, and specificity. Conversely, the Dice, focal Tversky, and log-cosh Dice losses were expected to be more efficient since they are region-based functions [67]. However, their results were more than 4% below those of the binary cross-entropy loss function and its weighted version. This finding implies that small-element segmentation is not performed well with region-based functions and is performed better with distribution-based losses. Figure 5 shows two examples of the segmentations generated by the 12 models with the highest Dice scores (see Table III). The process was performed on a large and a small lesion. The results were obtained with the respective loss functions that performed best for each network and were generated with the test data. Each example contains a mammogram, an enlarged image of the region of interest (ROI), the actual segmentation, and the probability or prediction map generated by the model. Among the predictions, it can be seen that EfficientNet presented a heat map that was similar to the real region, even with the small ramifications presented by the lesion. Additionally, the network approached the real region in the small lesion and presented more defined edges.
Likewise, the InceptionResNetV2 network generated a probability map similar to that of the EfficientNet network.
The central region appeared with high probability values (close to 1), guaranteeing low uncertainty in this region. In particular, the differences occurred at the edges, where the probability decreased because the network could not confidently classify the pixels as lesion or nonlesion. It should be noted that this same effect was present in EfficientNet, making it difficult to discern the visual differences between the two results.
In contrast, the basic UNet displayed diffuse edges in both cases, generating a poor segmentation result for the element of interest and producing the lowest metric scores, as shown in Table III. As mentioned above, EfficientNet achieved the best performance; therefore, Figure 6 shows the training (Figure 6a) and validation (Figure 6b) results of this model only. The graphs are the training averages for all loss functions, where the translucent bands are the 95% confidence intervals. Figure 6a shows that the Dice coefficient increased, indicating better performance in each successive training epoch. Similarly, Figure 6b presents the same behavior, with the same values reached at the end of the 150 epochs. This behavior reveals that the model did not overtrain (overfit) for any of the 5 loss functions, and again, it can be seen that the binary cross-entropy and weighted binary cross-entropy loss functions achieved the best performance. It should be noted that all models presented the same behavior, i.e., they did not overfit the data. The results of the other models were made available to the public in the GitHub repository (https://github.com/Qsinap/Breast-cancersegmentation). Figure 7 presents the overall results of all the training processes. Figure 7a shows the distribution of the Dice scores generated by all the training sessions. The figure shows that EfficientNet had the most homogeneous distribution with the highest values. This behavior suggests a higher probability of obtaining a high-scoring model when training with this network. Similarly, the EfficientNet network exhibited a more homogeneous sensitivity distribution than those of the other networks (see Figure 7b).
In the case of specificity, all the networks had distributions above 99.8%, with values driven by the background's large presence and the low probability of generating false positives, i.e., background pixels classified as lesions. Finally, Figure 7d shows that the 5 loss functions exhibited the same behavior regarding the metrics. The specificity had homogeneous distributions close to 1, while the Dice score and sensitivity distributions displayed more heterogeneous behaviors. Additionally, in this last graph, the binary cross-entropy and weighted binary cross-entropy loss distributions are above the other losses. In fact, the first quartile of the first loss function is above the last three, showing a marked difference between these functions. The test set consisted of 143 images of patients with different lesion sizes. Consequently, Figure 8 compares the real breast lesion areas with the areas generated by the best and worst models, i.e., EfficientNet and the base UNet (see Table III). The results show high agreement for EfficientNet, even for small lesions (see Figure 8a). Likewise, the InceptionResNetV2 network exhibited similar behavior to that of the EfficientNet network (see Figure 8b); however, there were some divergences for small lesions, limiting the segmentation ability of this model. UNet presented large differences for small lesions; however, the area was closer to the real values when the lesion was larger (see Figure 8c). Although EfficientNet presented an average Dice score of 94.75% (see Table III), Figure 8d shows that lesions with larger areas yielded scores above 95%. In contrast, smaller lesions led to values below 95% and even below 90%.
Additionally, Figure 8d introduces the Dice score generated from the dilated and eroded masks, i.e., the score between the real mask and a copy of it with an error introduced by a morphological transformation. As expected, the introduced error affected the behavior of the Dice score: the coefficient decreased by up to 40% for smaller lesions. In contrast, the Dice score retained high values for larger lesions even though they were subjected to the same error type. Theoretically, erosion and dilation affect the inner and outer perimeters, respectively. Consequently, the outer perimeter was expected to be larger than the inner perimeter, generating a larger error and affecting the Dice score to a greater extent under dilation. However, Figure 8d shows that erosion created the greater reduction in the Dice score. The metric's behavior versus the induced error reveals the dependence of the scores on the sizes of the segmented regions. In other words, even if the segmentation results are good, the corresponding scores can decline significantly for small elements.
Finally, EfficientNet maintained high values despite this inherent limitation of the Dice coefficient in discrete space.
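The size dependence described above is easy to reproduce: applying the same one-pixel morphological error to a small and a large synthetic lesion costs the small one far more Dice score. A minimal sketch with illustrative disk sizes, not the actual lesion masks:

```python
def disk_mask(n, r, cx, cy):
    # Binary n x n mask containing a filled disk of radius r centered at (cx, cy).
    return [[1 if (x - cx) ** 2 + (y - cy) ** 2 <= r * r else 0
             for x in range(n)] for y in range(n)]

def morph(mask, grow):
    # One-pixel dilation (grow=True) or erosion (grow=False), 4-neighbourhood.
    n = len(mask)
    def nb(y, x):
        vals = [mask[y][x]]
        if y > 0: vals.append(mask[y - 1][x])
        if y < n - 1: vals.append(mask[y + 1][x])
        if x > 0: vals.append(mask[y][x - 1])
        if x < n - 1: vals.append(mask[y][x + 1])
        return max(vals) if grow else min(vals)
    return [[nb(y, x) for x in range(n)] for y in range(n)]

def dice(a, b):
    # Dice coefficient between two binary masks.
    inter = sum(p and q for ra, rb in zip(a, b) for p, q in zip(ra, rb))
    return 2 * inter / (sum(map(sum, a)) + sum(map(sum, b)))

small = disk_mask(64, 3, 32, 32)
large = disk_mask(64, 20, 32, 32)
# The identical one-pixel erosion error costs the small lesion far more Dice,
# because the perturbed boundary ring is a larger fraction of its area.
d_small = dice(small, morph(small, False))
d_large = dice(large, morph(large, False))
```

The large disk keeps a Dice score near 0.95 under the perturbation while the small one drops sharply, mirroring the behavior reported for the induced dilation and erosion errors.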

FIGURE 7. Distributions of scores generated by training the 12 deep CNNs for the a) Dice coefficient, b) sensitivity, and c) specificity metrics. d)
Distributions of scores as functions of the loss functions with the same three metrics. The results were derived from the test data.

FIGURE 8. Results as a function of breast lesion size for the 143 test subjects. Comparison between the actual areas and the automatic segmentations of a) EfficientNet, b) InceptionResNetV2 and c) UNet (base). d) Dice scores as functions of the area generated by EfficientNet and by the induced dilation and erosion errors. A yellow line corresponding to a score of 0.95 is included in the graph.

Figure 9 shows the segmentation performed by EfficientNet on two lesions: a large lesion and a small lesion. The network prediction map closely resembles the actual segmentation; however, the Dice coefficient varies drastically between these two examples, confirming the impact that the size of the element of interest has on the Dice coefficient. Figure 10 shows the average time requirements of the 12 models for automatic segmentation. In this case, the base UNet network presented the shortest segmentation time; however, the other architectures had comparable times, with values below 15 milliseconds. Finally, as shown in Table V, EfficientNet achieved better scores than other similar works, guaranteeing better segmentation of masses in mammographic images. The UNet network is the most straightforward deep network among the fully convolutional models, and its segmentation capability is limited relative to that of modern models. It is trained only with the weighted logistic loss function, a measure based on the input data distribution, and is evaluated through sensitivity, a metric that does not account for the imbalance between the background and the element of interest. Although there are innovative state-of-the-art DL architectures, it makes little sense to evaluate the models on the basis of their structures, since each comprises complex elements that are not directly comparable. In this sense, the performance of the models is directly summarized by their evaluation metrics, i.e., a model is efficient if its evaluation metrics are high relative to those of other models. Consequently, the results in Table V clearly show that one of the proposed networks outperformed the results reported to date in terms of the segmentation of mammographic images.
This is very useful for finding the affected regions (breast lesions) in short times and with high performance, making this network a handy tool in clinical settings.
The results show that the proposed network generated higher scores than related approaches. For example, regarding the accuracy metric, EfficientNet reached a score of 99.96%, surpassing the maximum score achieved by the models of Abdelhafiz et al. [46]. In the case of sensitivity, EfficientNet scored 95.21%, surpassing the maximum score reported by de Moor et al. [48] (94%). Similarly, the Dice coefficient reached 94.75%, exceeding the value reached by Zhou [49] with multiscale networks by almost 4%.
Finally, the results clearly show that the new DL architectures are at the technological forefront in terms of screening breast masses. However, there are some inherent limitations relative to the problem at hand. For example, the Dice metric is the most reported measure in the literature. However, the size dependence of the element of interest creates a bias that limits the objective evaluation of this work and existing work that has been evaluated with the same metric. In this sense, an approach could be sought to adjust the Dice metric to avoid the drops in the coefficient due to small regions. On the other hand, from a methodological point of view, the study was based on an extensive dataset; however, this does not cover all possible considerations of breast examinations, as several protocols generate different types of mammographic images.
Additionally, it is necessary to include outlier images to obtain more detailed descriptions of the networks in the face of these drawbacks. These challenges could be overcome by searching for and including new databases with different characteristics, where transfer learning could be conducted from the current methods to the models with the new databases. This transfer would enable the models to avoid starting with random parameters and allow them to reach lower loss function values more quickly.

B. CLASSIFICATION
This section shows the results obtained for the two types of classification problems, i.e., classification by the types of lesions and classification by the degree of suspicion in BI-RADS. As in the previous case, the tables present the results in percentage values, while the graphs show the values in their fractional forms.
Initially, classification by lesion type was performed with the 7 CNNs: VGG19, ResNet50V2, DenseNet121, InceptionV3, InceptionResNetV2, EfficientNetB7 and Xception. Table VI shows the overall results obtained for the test data in terms of the five different metrics, organized from the highest to the lowest F1-score. The F1-score provides a better description of this case given the imbalance among the images of the six classes.
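The per-class F1-score used to rank the networks can be computed from one-vs-rest confusion counts. A minimal sketch; the class names and label counts below are illustrative, not the actual data:

```python
def f1_per_class(y_true, y_pred, classes):
    # One-vs-rest F1 for each class: harmonic mean of precision and recall.
    scores = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores[c] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return scores

# Toy imbalanced test set: 8 normal mammograms vs 2 of a rare class.
y_true = ["NORM"] * 8 + ["MISC"] * 2
y_pred = ["NORM"] * 9 + ["MISC"] * 1

per_class = f1_per_class(y_true, y_pred, ["NORM", "MISC"])
# Macro averaging weights every class equally, so the rare class's poor
# score pulls the summary down instead of being hidden by the majority class.
macro_f1 = sum(per_class.values()) / len(per_class)
```

Unlike accuracy, the macro F1 cannot be inflated by simply predicting the majority class, which is why it describes the imbalanced six-class problem better.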
In particular, the results show that the VGG19 network achieved the maximum F1-score on the test data. This network achieved the best performance even though it was one of the worst performers in the segmentation task. This result confirms the need to carefully search for a network for each specific task; that is, if a network has the best performance on one task, this does not guarantee that this behavior will be maintained on other types of tasks.
On the other hand, Table VI shows how misleading the accuracy metric can be. The metric was above 95% for all networks; however, the sensitivity dropped to 14% for the Xception network. That is, the network achieved 95% accuracy but had a low ability to identify true positives. Additionally, it is worth noting that the marked difference between accuracy and sensitivity is due to the class imbalance problem, where there are far fewer true positives than true negatives.
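The gap between accuracy and sensitivity under class imbalance is easy to reproduce. A minimal sketch with illustrative numbers, not the actual test data:

```python
def confusion(y_true, y_pred, positive):
    # Confusion counts for one positive class, one-vs-rest.
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fn, fp, tn

# 5 positives among 100 samples; the classifier finds only 1 of them.
y_true = ["mass"] * 5 + ["normal"] * 95
y_pred = ["mass"] * 1 + ["normal"] * 99

tp, fn, fp, tn = confusion(y_true, y_pred, "mass")
accuracy = (tp + tn) / len(y_true)   # 0.96: looks excellent
sensitivity = tp / (tp + fn)         # 0.20: a nearly useless detector
specificity = tn / (tn + fp)         # 1.00: negatives dominate the data
```

A classifier that misses 4 of 5 lesions still reports 96% accuracy, which is exactly the effect observed for the Xception network.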
The results in Table VI also show the high effectiveness of the ResNet50V2 network. Although the network was 5% below VGG19, it remained among the most efficient networks for both classification and segmentation. Table VII shows the maximum classification scores achieved for each of the lesion types. The results contain high F1-scores for the normal class; that is, the models were highly efficient in discriminating mammograms without any abnormalities. However, it is also worth noting that these results could partly reflect data imbalance. For example, the normal class yielded the highest scores, but this class occurred more frequently than the others (see Table I). Similarly, the MISC lesions (ill-defined masses, others) had the lowest frequency; consequently, this was the class with the worst scores. Table VIII shows the results for the case of classification by the degree of suspicion (BI-RADS). The networks presented similar behaviors to those in the previous case. The ResNet50 and VGG19 networks generated the best performances; however, in this case, the ResNet50 network outperformed the VGG19 network by more than 6% in terms of the F1-score.
Similarly, the accuracy metric did not present significant differences between the networks, and all values exceeded 97%. Again, the differences were found among the sensitivities of the networks, where there was a difference of approximately 47%. In other words, the ResNet50 network performed better than EfficientNet in terms of discriminating true positives. It should be noted that all the networks were excellent at determining true negatives (high specificity), which could be attributed to the greater probability of encountering a true negative. Table IX shows the scores achieved according to the different grades of suspicion regarding the test data. Class 1 yielded a marked difference relative to the other classes, i.e., negative mammograms were clearly distinguishable, unlike the other classes. In fact, the findings highly suggestive of malignancy (class 5) produced the second-best results, but with an almost 47% difference from the first-place results. In this case, the marked difference between the classes could not be directly attributed to class imbalance, since the benign class (2) had the highest number of mammograms (see Table I), but its F1-score reached only 11.43%. In summary, Table IX shows the high performance of the models in detecting true negatives (specificity) but their low ability to determine true positives. Figure 11 shows the training and validation results of the VGG19 network as a function of the 40 epochs. In this case, the different training runs did not present significant differences, since the error band (translucent color) was small. That is, the VGG19 network exhibited stability during training, guaranteeing convergence to similar training and validation scores. Moreover, the training and validation curves converged above 90%, guaranteeing a low degree of overtraining, which agreed with the test results shown in Table VI.
On the other hand, both the accuracy and loss curves showed slight divergences between training and validation, which appeared near epoch 30, i.e., the models required approximately 30 epochs to reach their best performance without overfitting. Similarly, Figure 12 shows the training and validation of the ResNet50 network in terms of the accuracy metric and the model loss. The network exhibited stability between the different training runs, generating a reduced error band and converging close to 0.8. This result guarantees that the model was not overtrained, and it agrees with the results obtained in Table VIII. On the other hand, the training and validation losses exhibited similar behaviors, corroborating the fact that the model did not overfit. However, the curves do not show any apparent divergence, so it is possible that training the model over a greater number of epochs would yield a better result. As mentioned above, the models were run by randomly selecting data. Therefore, the box-and-whisker plots corresponding to the different obtained scores are shown in Figures 13 and 14. In addition, the difference between the datasets with and without the proposed data augmentation approach is also presented. Figure 13a shows the distribution of the accuracy metric as a function of the seven CNNs. The results corroborate the fact that there was a higher probability of arriving at a high-performance network through VGG19 than through the other networks for the lesion type classification case. Moreover, it is more than evident that an increase in data contributes to better results for all networks. Indeed, this behavior can be seen in both the sensitivity (Figure 13b) and specificity (Figure 13c) metrics.
In Figure 13d, it can be seen once more that all classes exhibited high specificity, i.e., the models were highly efficient in identifying true negatives. In contrast, the detection rate of true positives declined significantly. Again, this trend can be attributed to the low chance of finding a true positive versus the high probability of finding a true negative. Furthermore, in this plot, it can be observed that the normal class yielded the three highest metrics, confirming that the class with the highest number of samples (see Table I) was discriminated best by the networks. The classification by degree of suspicion (BI-RADS) behaved similarly to the classification by lesion type. Figure 14a shows that the increase in data generated a better score distribution, reaching values close to 1, i.e., 100% probability. Likewise, the networks that achieved the best performance were the VGG19 and ResNet50 networks, where the latter produced the highest scores for classifying the degree of suspicion. Again, specificity was observed to be the highest metric even without data augmentation, and it was difficult to see significant differences in most networks except for the VGG19 network (Figure 14c). In other words, the performance of the networks is subject to their sensitivity. That is, most of the networks managed to clearly identify the true negatives, as these were found in a higher proportion, but were limited in correctly identifying the true positives (sensitivity), as shown in Figure 14b. We highlight that this effect could have been generated by the distribution of the data.
Furthermore, since there were several classes, the probability of finding a true positive of a given class decayed in inverse proportion to the number of classes being classified. That is, in this particular case, there were eight different classes; therefore, the probability of finding a true positive of class 1 was 1/8. Additionally, this problem becomes more acute when the classes do not have the same numbers of images, i.e., when the data are unbalanced.
The accuracy, sensitivity, and specificity were maintained for the eight different classes. All classes yielded specificity distributions close to 100%, and accuracy produced high values. However, the distributions declined for sensitivity except for that of class 1.
In the classification by lesion type scenario, the class with the highest number of samples presented the best results; in this case, however, this characteristic was not preserved. Table I shows that the benign class (2) had a higher number of images, but the distribution of this class was similar to those of the classes with less data (see Figure 14d). Finally, to describe the runtime performance of the networks, Figure 15 shows the average classification time per subject for each of the seven utilized networks. In this case, the ResNet50 network required the shortest classification time; this agrees with the general description of residual connection networks, which improve their training by forcing the learning process to follow a residual mapping F(x) = H(x) − x, which is easier to learn when the ideal mapping H(x) is the identity function H(x) = x [58]. Similarly, the other networks also had relatively short run times for classifying subjects, with most requiring below 200 milliseconds per classification.
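The residual mapping of ResNet-style networks can be written H(x) = F(x) + x, where F is the learned residual [58]. A minimal toy sketch with plain Python vectors, not the actual ResNet50 block:

```python
def residual_block(x, f):
    # A residual block outputs H(x) = F(x) + x, so the layers inside f only
    # have to learn the residual F rather than the full mapping H.
    return [fi + xi for fi, xi in zip(f(x), x)]

# When the ideal mapping is the identity, the residual branch only needs to
# drive its output toward zero, which is easier for gradient-based training
# than reproducing x exactly through a stack of nonlinear layers.
identity_easy = residual_block([1.0, 2.0, 3.0], lambda v: [0.0] * len(v))
# identity_easy == [1.0, 2.0, 3.0]
```

The skip connection also shortens the gradient path, which is consistent with the comparatively fast training and inference observed for ResNet50 here.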
It is also worth noting that all models behaved similarly in the two classification cases. In fact, the ResNet50 network presented the best execution time, while the EfficientNet network generated the longest execution times in both cases.
Finally, as previously mentioned, this study focused on a comparative analysis of different CNNs originally developed for conventional images. The approach sought to determine the behaviors of state-of-the-art DL networks on medical images, specifically mammograms. The results showed the high effectiveness of EfficientNet regarding the segmentation of breast lesions, even for small lesions, despite their inherent constraints. In addition, the study revealed the need to search for the network that best fits the specific task, i.e., although the MultiResUNet network is one of the newer architectures for segmentation, its performance metrics remained below those of EfficientNet. The review of state-of-the-art approaches uncovered new elements for the case of breast lesions. However, most of the networks used with conventional images were shown to generate good results without the need for significant modifications beyond the hyperparameters used during training. EfficientNet even managed to surpass the state-of-the-art methods in terms of the segmentation of breast lesions. In the same sense, in the lesion classification and degree of suspicion tasks, the newest state-of-the-art networks did not generate the best performances. In fact, in this particular case with mammograms, the optimal classification results were obtained with the VGG19 network, even though this was one of the first deep CNNs to be developed. Therefore, among the different types of available networks, it is necessary to test them to accurately establish which one is best suited for the specific task encountered in medical imaging.
The CNNs performed well in the segmentation and classification tasks, surpassing the state-of-the-art methods. However, the study presents some limitations that should be addressed in future studies. Initially, the databases remained one of the main limitations in the implementation of the DL algorithms. In this case, the CBIS-DDSM, MIAS, and INbreast databases, three of the main open-access databases in mammography, were used. However, all three databases lack heterogeneity, and each has features that address different problems. In other words, it is necessary to attempt to evaluate the results with external databases containing the same labels or segmentations of the same regions to reach an objective conclusion. For example, Salama et al. [50] addressed the segmentation problem. However, their research focused on the segmentation of the breast and not lesions.
On the other hand, as mentioned above, the accuracy metric is not the most suitable for unbalanced data, as it can generate high values despite having very low sensitivity. To avoid this drawback, this study was performed with different evaluation metrics. However, a comparison with previous work evaluated with the accuracy metric might not reveal significant differences.

V. CONCLUSION
A comparative analysis of methods for the segmentation and classification of breast lesions on digital mammograms was proposed. Initially, we performed a comparative analysis of 12 state-of-the-art DL networks under five loss functions to improve the automatic segmentation of breast examination images. The proposed convolutional models were built with the base UNet and the most recently developed networks with building blocks, squeeze-and-excitation blocks, residual connections, large numbers of deep layers, and novel architectures designed for segmentation or conventional classification, i.e., for problems other than medical imaging. The results showed that EfficientNet, together with the binary cross-entropy loss function, achieved an accuracy of 99.96%, outperforming the most recently developed approaches. Additionally, this model presented the most homogeneous distribution, with higher scores than those of the other architectures. EfficientNet generated training and validation curves that converged with a Dice score close to 95%, indicating that the model was not overtrained. The architecture was validated with test data containing lesions of different sizes, where the generated segmentations had areas close to the real areas, even for minor lesions. Similarly, in the segmentation tests, it was possible to observe the details generated at the edges of the lesions, demonstrating the high effectiveness of EfficientNet. The model's effectiveness provides a detailed view of the morphological characteristics of breast masses, allowing their structures to be compared with theoretical bases for an objective assessment of the morphological aggressiveness of the masses and, consequently, allowing for their pathological characterization as potentially malignant or benign.
On the other hand, the comparative method based on the Dice coefficient and morphological transformations of the real segmentation masks allowed us to observe the effect that image discretization has on the evaluation of small regions, i.e., the validation metrics lose objectivity as the segmented element becomes smaller.
Finally, a comparative analysis was performed for the classification process through two different databases: MIAS and INbreast. The images were classified by the type of lesion and by the degree of suspicion of malignancy. In the first case, the classification by lesion type yielded 96.97% accuracy. However, the results were subject to the distribution of the input data because there was a greater probability of finding true negatives than true positives, which is a critical limitation of the classification process. Similarly, the classification by degree of suspicion produced 97.73% accuracy but was limited by the same drawbacks due to the distribution of the data across classes. The results showed that although new developments and techniques have emerged in the architectures of DL models, it is necessary to explore different networks to arrive at the one that best fits the desired task. For example, in this case, we obtained the best results through the VGG19 and ResNet50 networks, where the former is one of the oldest DL networks available.