Attention-Map Augmentation for Hypercomplex Breast Cancer Classification

Breast cancer is the most widespread neoplasm among women and early detection of this disease is critical. Deep learning techniques have become of great interest to improve diagnostic performance. However, distinguishing between malignant and benign masses in whole mammograms poses a challenge, as they appear nearly identical to an untrained eye, and the region of interest (ROI) constitutes only a small fraction of the entire image. In this paper, we propose a framework, parameterized hypercomplex attention maps (PHAM), to overcome these problems. Specifically, we deploy an augmentation step based on computing attention maps. Then, the attention maps are used to condition the classification step by constructing a multi-dimensional input comprising the original breast cancer image and the corresponding attention map. In this step, a parameterized hypercomplex neural network (PHNN) is employed to perform breast cancer classification. The framework offers two main advantages. First, attention maps provide critical information regarding the ROI and allow the neural model to concentrate on it. Second, the hypercomplex architecture has the ability to model local relations between input dimensions thanks to hypercomplex algebra rules, thus properly exploiting the information provided by the attention map. We demonstrate the efficacy of the proposed framework on both mammography and histopathological images. We surpass attention-based state-of-the-art networks and the real-valued counterpart of our approach. The code of our work is available at https://github.com/ispamm/AttentionBCS.


I. INTRODUCTION
Breast cancer is the most common cancer in women worldwide and, despite the reduction in mortality achieved by mammography screening, the incidence rates of this disease have been slowly increasing since the mid-2000s [1]. Indeed, the mammography exam is by no means a perfect imaging test, as it is characterized by a high rate of false positives and related false positive biopsies. Traditional computer-aided detection (CAD) algorithms fail to improve diagnostic performance and lead to high recall rates [2]. On the other hand, deep learning-based CAD systems have been shown to succeed in assisting clinicians during the reading process, reaching a higher diagnostic accuracy [3]. This has led to an increased interest in this research area, with a variety of open problems, from reducing the number of false positives [4] to exploiting the multi-view nature of mammography [5], [6] and so on [7].
Nonetheless, applying deep learning techniques to this kind of problem still presents challenges. (Authors are with the Department of Information Engineering, Electronics and Telecommunications (DIET), Sapienza University of Rome, Italy. Corresponding author's email: eleonora.lopez@uniroma1.it.)
For starters, the task itself is much more difficult compared to the classification of natural images. Indeed, discriminating between benign and malignant tumors requires trained and expert radiologists, thus being a far from trivial problem also for neural networks [8]. In addition, the region of interest (ROI) comprises a tiny portion of the entire mammogram, which makes it hard to detect and, even more importantly, to concentrate on it and learn to pay particular attention to such a small patch [9]. Finally, the fine-grained detail that characterizes high-resolution mammograms, which is crucial to identifying and correctly diagnosing masses, is lost due to image resizing, a necessary step in order to train the neural model efficiently [5].
The attention mechanism has become a successful technique to handle these kinds of challenges, as it allows the network to concentrate on the most critical regions of the input [10]. Typically, it is found inside transformer-like architectures [11], [12]; however, new strategies are being developed to endow convolutional neural networks (CNNs) with attention layers [13], [14].
In this paper, we introduce a novel approach to exploit the information learned by attention layers to address the aforementioned challenges related to breast cancer screening. That is, we propose a framework that consists of an attention-map augmentation step and a parameterized hypercomplex network as backbone model for breast cancer classification. For simplicity, we refer to it as the parameterized hypercomplex attention maps (PHAM) framework. In detail, the attention-map augmentation step consists in computing attention maps for each cancer image with an already existing model, e.g., PatchConvNet [13], such as the ones displayed in Fig. 1, and then employing them as additional input to the hypercomplex backbone. In this way, we perform a sort of conditioning on the attention map during the classification step, thus providing the neural network information regarding the abnormal regions of the breast cancer image, which are most significant for diagnosis. We exploit the new multi-dimensional input through parameterized hypercomplex networks, on account of the ability of such architectures to model correlations in multi-dimensional data while also obtaining a more lightweight network [6], [15]. As a matter of fact, quaternion and generalized hypercomplex networks have gained much interest in the past few years. The success of quaternion neural networks (QNNs) is owed to quaternion algebra, which allows modeling both global and local relations in input 4D data and thus learning a more powerful representation in the latent space [16]–[20]. Thereafter, parameterized hypercomplex neural networks (PHNNs) were introduced in order to bring the advantages of QNNs to general input domains of any dimensionality n [15], [21].
Thus, in this work, we employ parameterized hypercomplex (PH) ResNets as the backbone of our framework in order to capture the correlations between the original breast cancer image and the corresponding attention map thanks to hypercomplex algebra properties. Through an experimental analysis conducted on public benchmark datasets of mammograms, i.e., CBIS-DDSM [22] and INbreast [23], and histopathological microscopic images, i.e., BreakHis [24], we show how the proposed method is able to outperform the real-valued counterpart as well as attention-based state-of-the-art models.
The rest of the paper is organized as follows. Section II lays out an overview of the current related works. Section III gives a detailed description of the proposed framework and the theory behind hypercomplex networks. Section IV provides technical details regarding data, training, and experimental results. Finally, conclusions are drawn in Section V.

II. RELATED WORKS
Prior works adopt several different approaches for the task of breast cancer classification. Owing to the aforementioned problems, many studies focus on the classification of single patches [25]–[27] instead of the whole mammogram. In fact, by considering the single patch containing the ROI, there is no need to find a way to make the network focus on it, since it will be the main object in the image. Moreover, the details would be clearly visible even after image resizing, thus making the classification task much easier compared to whole-mammogram classification [9]. On the other hand, methods that directly process the whole mammography image usually adopt a pretraining strategy based on either patch-level classification or on natural images, i.e., ImageNet, in order to alleviate these problems [4]–[6], [9], [28].
More recently, with the success of the transformer [10] and vision transformer (ViT) [29] architectures, the medical imaging community has also started to develop architectures based on these models [11], [12]. Indeed, these methods aim to focus the attention of the neural model on the ROI, which is often small in the medical scenario, through the self-attention mechanism. However, transformer-based models are characterized by far fewer inductive biases, which instead are inherent to convolutions, i.e., locality and translation equivariance [30], [31]. For this reason, many works started to investigate strategies to incorporate self-attention into convolutional neural networks (CNNs), with the aim of maintaining their intrinsic inductive biases while additionally gaining the advantages of the attention mechanism. One of the first popular works introduced a convolutional bottleneck attention module (CBAM), which can be easily plugged into any CNN to refine feature maps through the inferred channel and spatial attention maps [14]. Then, a more recent study proposed an extension of this module specifically designed for breast cancer classification that is able to exploit cross-view information [32]. Finally, Touvron et al. [13] introduced an attention-based aggregation layer to augment any CNN, and further proposed a patch-based architecture, PatchConvNet, that allows obtaining higher quality attention maps by keeping the input resolution constant throughout the network. In this paper, we take a different approach and employ the pretrained PatchConvNet to obtain attention maps for mammography images, which we then exploit through parameterized hypercomplex layers, in order to overcome the problems presented in Section I.

III. PROPOSED METHOD
In this section, we expound on the proposed method, which is the parameterized hypercomplex attention maps (PHAM) framework for breast cancer detection, depicted in Fig. 2. First, we introduce the theoretical background of hypercomplex neural networks and, second, we give a detailed description of the framework we bring forward.

A. Parameterized hypercomplex models
Quaternion neural networks (QNNs) are models that operate in an extension of complex numbers $\mathbb{C}$, namely the quaternion domain $\mathbb{Q}$. A quaternion is defined as $q = q_0 + q_1\hat{i} + q_2\hat{j} + q_3\hat{k}$, in which $q_i \in \mathbb{R}$, with $i = 0, \ldots, 3$, are the real coefficients and $\hat{i}, \hat{j}, \hat{k}$ are the imaginary units. The product between two imaginary units is not commutative, thus the Hamilton product has been introduced to properly model the multiplication of two quaternions. Thanks to this product, the weight matrix and the input can be encapsulated into quaternions as

$$W = W_0 + W_1\hat{i} + W_2\hat{j} + W_3\hat{k}, \qquad x = x_0 + x_1\hat{i} + x_2\hat{j} + x_3\hat{k}. \quad (1)$$

Then the convolution between them is defined following the Hamilton product:

$$W \ast x = \begin{bmatrix} W_0 & -W_1 & -W_2 & -W_3 \\ W_1 & W_0 & -W_3 & W_2 \\ W_2 & W_3 & W_0 & -W_1 \\ W_3 & -W_2 & W_1 & W_0 \end{bmatrix} \ast \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix}. \quad (2)$$

Indeed, in this way, the filter submatrices $W_i$, $i = 0, \ldots, 3$ in Eq. (2) are shared among the dimensions of the input $x$. This property brings two main advantages. Firstly, the number of parameters is reduced to 1/4 of the real-valued counterpart, thus yielding a much lighter model. Secondly, by sharing weights between input dimensions, the neural model is endowed with the ability to model local relations. Therefore, QNNs, in addition to modeling global relations as standard neural networks do, also capture correlations among channels by treating the input as a unique entity. Instead, real-valued networks assign different weights to different dimensions, thus treating them independently when they are actually correlated. This additional information, which networks in the real domain fail to grasp, allows quaternion models to learn a more powerful representation of the data and, as a result, yield more accurate predictions.
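The weight sharing induced by the Hamilton product in Eq. (2) can be illustrated with a minimal numpy sketch (not the paper's actual implementation; shapes and helper names are illustrative):

```python
import numpy as np

def hamilton_weight(W0, W1, W2, W3):
    """Build the real-valued block matrix implementing the Hamilton
    product W * x of Eq. (2) for a quaternion-valued linear operation."""
    return np.block([
        [W0, -W1, -W2, -W3],
        [W1,  W0, -W3,  W2],
        [W2,  W3,  W0, -W1],
        [W3, -W2,  W1,  W0],
    ])

# Toy example: each sub-filter maps 3 input features to 2 output features.
rng = np.random.default_rng(0)
subs = [rng.standard_normal((2, 3)) for _ in range(4)]
W = hamilton_weight(*subs)   # shape (8, 12): 4 stacked output components
x = rng.standard_normal(12)  # 4 stacked input components of size 3
y = W @ x                    # quaternion-style output, size 8

# Only 4 sub-filters of 2x3 = 24 free parameters back the full 8x12
# matrix, i.e., 1/4 of an unconstrained real-valued layer.
assert sum(s.size for s in subs) == W.size // 4
```

Convolution works the same way, with each $W_i$ being a convolutional kernel instead of a dense matrix.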
However, in order to extend this approach to inputs of any dimensionality $n$D, instead of being limited to 4D inputs, a generalization in the hypercomplex domain $\mathbb{H}$ was introduced, i.e., parameterized hypercomplex neural networks (PHNNs) [15], [21]. In this case, the filter weight matrix $W$ is expressed as a parameterized sum of Kronecker products:

$$W = \sum_{i=1}^{n} A_i \otimes F_i, \quad (3)$$

where $n$ defines the number domain in which the network operates (e.g., $n = 4$ corresponds to the quaternion domain) and can be chosen freely in order to best represent the input data. The matrices $A_i$ encode the algebra rules and the matrices $F_i$ are the weight filters; both are learned during training. With this formulation, the advantages of quaternion models are maintained, thus still leveraging hypercomplex algebra properties to model latent relations among channels, while reducing the number of free parameters to 1/n.

B. Parameterized hypercomplex attention maps (PHAM)
We design the parameterized hypercomplex attention maps (PHAM) framework as follows. We define an augmentation operation based on the computation of attention maps for each input image. Then we use them to condition the hypercomplex model during training to improve breast cancer classification performance. The relations among the original images and the relative attention maps are exploited through parameterized hypercomplex models thanks to their aforementioned properties.
Indeed, any neural method can be easily defined in the hypercomplex domain [15]; therefore, in this paper, we deploy parameterized hypercomplex ResNets (PHResNets). In such a manner, the standard residual block $y = \mathcal{F}(x) + x$, where $\mathcal{F}$ is composed of interleaved convolutional layers, batch normalization (BN) and ReLU activation functions, becomes:

$$y = \mathcal{F}_{\text{PHC}}(x) + x, \quad (4)$$

whereby PHC refers to parameterized hypercomplex convolutions, with the weight matrix defined as in Eq. (3), and $x$ is the multi-dimensional input composed of the breast cancer image and the corresponding attention map.
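As a toy sketch of the residual computation of Eq. (4), the block below chains two PH operations (built as in Eq. (3)) with a ReLU in between; batch normalization and the 2D convolutional structure are omitted for brevity, and all names are illustrative:

```python
import numpy as np

def ph_op(A, F, x):
    """Apply a PH weight W = sum_i kron(A_i, F_i) to x."""
    W = sum(np.kron(A[i], F[i]) for i in range(len(A)))
    return W @ x

def ph_residual_block(x, A1, F1, A2, F2):
    """Sketch of y = F_PHC(x) + x: two PHC operations, a ReLU, and a skip."""
    h = np.maximum(ph_op(A1, F1, x), 0.0)  # PHC + ReLU
    return ph_op(A2, F2, h) + x            # PHC + skip connection

n, k = 2, 2
rng = np.random.default_rng(0)
A1, F1 = rng.standard_normal((n, n, n)), rng.standard_normal((n, k, k))
A2, F2 = rng.standard_normal((n, n, n)), rng.standard_normal((n, k, k))
x = rng.standard_normal(n * k)
y = ph_residual_block(x, A1, F1, A2, F2)
assert y.shape == x.shape  # the skip connection preserves the shape
```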
Attention maps are a visualization of what the attention layer has learned during training [13]. We propose to exploit this information in image form, i.e., we augment the dataset with the attention maps, inspired by the similar utilization of heatmaps in recent works [5], [33]. More in detail, to compute the attention maps we deploy the recent PatchConvNet [13], since it allows us to obtain high-resolution attention maps thanks to its non-hierarchical design. Because the network is trained on ImageNet, it cannot be directly utilized for medical images; thus, we first perform a fine-tuning step on a breast cancer dataset and thereafter apply the fine-tuned model to infer attention maps on other breast cancer databases. Then, we construct the augmented dataset in which each sample is composed of the original mammogram or histopathological image and the corresponding attention map, considering them as a single multi-dimensional input. In this way, the attention map conditions the training process by emphasizing the most critical portion of the image. This allows the neural network to focus on these crucial areas, thereby enhancing its predictive capabilities and leading to improved performance.
To conclude, when processing images corresponding to mammograms, the input $x$ in Eq. (4) has two dimensions, where the first represents the mammogram and the second corresponds to the attention map, thus we set the hyperparameter n = 2 and operate in the complex domain $\mathbb{C}$ in order to capture relations between them. On the other hand, when considering histopathological images, we set n = 4, thus operating in the quaternion domain, given that histology images are saved in RGB, and therefore are composed of 3 channels, plus the additional attention map. According to the aforementioned discussions, by endowing the architecture with PHC layers, we can better leverage the attention maps. Indeed, thanks to hypercomplex algebra properties, the hypercomplex network has the ability to capture local relations between original breast cancer images and the respective maps, which real-valued models fail at modeling, thus truly exploiting such additional information [16].
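The construction of the multi-dimensional input, and the corresponding choice of n, can be sketched as follows (a minimal illustration with random arrays standing in for real images):

```python
import numpy as np

H = W = 384  # the input resolution used in this work
rng = np.random.default_rng(0)

# Mammography: grayscale image + attention map -> 2 channels, n = 2.
mammogram = rng.random((H, W))
attn_map = rng.random((H, W))
x_mammo = np.stack([mammogram, attn_map])        # shape (2, 384, 384)

# Histopathology: RGB image + attention map -> 4 channels, n = 4.
rgb = rng.random((3, H, W))
x_histo = np.concatenate([rgb, attn_map[None]])  # shape (4, 384, 384)

assert x_mammo.shape == (2, 384, 384)
assert x_histo.shape == (4, 384, 384)
```

The PHC layers then treat these stacked channels as a single hypercomplex entity rather than independent planes.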
IV. EXPERIMENTAL EVALUATION

A. Datasets
The Curated Breast Imaging Subset of the Digital Database for Screening Mammography (CBIS-DDSM) [22] consists of scanned film mammography images standardized in the DICOM format. It provides biopsy-proven pathology labels, which can be either benign or malignant. We utilize the official training and testing data splits and obtain the validation split by performing an additional stratified partition of the training set, for a total of 991 images for training, 240 for validation and 361 for testing.
On the other hand, INbreast [23] is a much smaller database consisting of 410 full-field digital mammography images. Pathology labels are not available; instead, it provides BI-RADS classifications, from which we extract binary labels by considering categories 1 and 2 as negative, and 4, 5 and 6 as positive, discarding category 3 [9]. We split the dataset in a stratified fashion and take 20% of images for validation and testing, respectively.
The third dataset, the Breast Cancer Histopathological Image Classification (BreakHis) [24], instead consists of 9,109 microscopic images of breast tissue at four different magnifying factors, i.e., 40X, 100X, 200X and 400X. BreakHis is divided into two main groups, that is, benign and malignant tumors. We create different sets for training, validation and testing by splitting the dataset patient-wise in order to avoid information leakage, taking 20% for each of the latter two sets.
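A patient-wise split like the one used for BreakHis can be sketched as below (a hedged illustration with made-up patient IDs, not the exact splitting code of this work): the key point is that all images of a given patient land in the same set, so no patient leaks across train, validation and test.

```python
import numpy as np

def patient_wise_split(patient_ids, val_frac=0.2, test_frac=0.2, seed=0):
    """Split sample indices so that no patient appears in more than one
    set, avoiding information leakage between train/val/test."""
    rng = np.random.default_rng(seed)
    patients = np.unique(patient_ids)
    rng.shuffle(patients)
    n_val = int(len(patients) * val_frac)
    n_test = int(len(patients) * test_frac)
    val_p = set(patients[:n_val])
    test_p = set(patients[n_val:n_val + n_test])
    split = {"train": [], "val": [], "test": []}
    for idx, pid in enumerate(patient_ids):
        key = "val" if pid in val_p else "test" if pid in test_p else "train"
        split[key].append(idx)
    return split

# Toy example: 10 images from 7 hypothetical patients.
ids = ["p1", "p1", "p2", "p3", "p3", "p4", "p5", "p5", "p6", "p7"]
split = patient_wise_split(ids)
sets = [{ids[i] for i in split[k]} for k in ("train", "val", "test")]
assert not (sets[0] & sets[1]) and not (sets[0] & sets[2]) and not (sets[1] & sets[2])
```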
The same preprocessing is applied for all datasets: images are resized to 384 × 384 and standardized. Data augmentation operations are applied for training images only, i.e., random horizontal and vertical flips and a random rotation of degrees taken from (−10°, +10°).
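The preprocessing and flip augmentations can be sketched in plain numpy as follows; resizing to 384 × 384 and the ±10° rotation are typically handled by an image library and are omitted here, so this is only a partial illustration:

```python
import numpy as np

def augment(img, rng):
    """Standardize an image and apply the random flips used at training
    time (resizing and random rotation omitted for brevity)."""
    img = (img - img.mean()) / (img.std() + 1e-8)  # standardization
    if rng.random() < 0.5:
        img = img[:, ::-1]                          # random horizontal flip
    if rng.random() < 0.5:
        img = img[::-1, :]                          # random vertical flip
    return img

rng = np.random.default_rng(0)
out = augment(rng.random((384, 384)), rng)
assert out.shape == (384, 384)
assert abs(out.mean()) < 1e-6  # standardized to (approximately) zero mean
```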

B. Validation metrics
To validate the experiments conducted on datasets of mammogram images, we utilize the area under the ROC curve (AUC), as it is the primary metric employed in the literature [4], [5], [9]. It provides a measure of the predictive ability of a classifier at different probability thresholds, taking into account the trade-off between true positive and false positive rates. On the other hand, for the second set of experiments, the metric adopted in the original paper is the accuracy [24], thus we also utilize it for evaluating the performance of the proposed approach on the BreakHis dataset.
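The AUC admits an equivalent rank-based formulation that makes its threshold-free nature explicit: it is the probability that a randomly chosen positive sample is scored higher than a randomly chosen negative one. A small self-contained sketch (in practice one would use a library routine such as scikit-learn's roc_auc_score):

```python
import numpy as np

def auc(labels, scores):
    """Rank-based AUC: fraction of (positive, negative) pairs in which
    the positive sample receives the higher score; ties count half."""
    labels, scores = np.asarray(labels), np.asarray(scores)
    pos, neg = scores[labels == 1], scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# 3 of the 4 positive/negative pairs are ranked correctly -> AUC = 0.75.
assert auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]) == 0.75
```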

C. Training and architecture details
ResNet-based architectures are trained with the Adam optimizer [34], a learning rate of 10⁻⁵ and a weight decay of 5 × 10⁻⁴. The number of epochs is set to 100, but in order to avoid overfitting, we early stop the training when the AUC on the validation data does not improve for 20 epochs. Instead, PatchConvNet is fine-tuned with the Lamb optimizer [35], a learning rate of 5 × 10⁻⁴ and a weight decay of 10⁻², following the recipe for fine-tuning experiments of the original paper. Moreover, we employ the s60 configuration of PatchConvNet, which consists of an embedding dimension of 384 and 60 repeated blocks in the trunk [13]. Finally, for hypercomplex models, we employ ResNet18 and ResNet50 in the hypercomplex domain for the first and second sets of experiments, respectively, as BreakHis is a much larger dataset compared to INbreast.
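The early-stopping criterion described above can be sketched as a simple patience loop (epoch_fn is a hypothetical callback standing in for one epoch of training plus validation):

```python
def train_with_early_stopping(epoch_fn, max_epochs=100, patience=20):
    """Run up to max_epochs, stopping when the validation AUC has not
    improved for `patience` consecutive epochs; returns the best AUC.
    `epoch_fn(epoch)` is assumed to train one epoch and return val AUC."""
    best, since_best = -1.0, 0
    for epoch in range(max_epochs):
        val_auc = epoch_fn(epoch)
        if val_auc > best:
            best, since_best = val_auc, 0
        else:
            since_best += 1
            if since_best >= patience:
                break  # no improvement for `patience` epochs
    return best

# Toy run: AUC improves for 5 epochs, then plateaus -> training stops early.
history = [0.6, 0.7, 0.75, 0.8, 0.82] + [0.81] * 50
best = train_with_early_stopping(lambda e: history[e])
assert best == 0.82
```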

D. Experiments and results
The first set of experiments is conducted on datasets of mammography exams. Specifically, in the first step, we fine-tune the PatchConvNet architecture on CBIS-DDSM utilizing the available ImageNet weights [13]. As a second step, we operate the model at inference time to obtain the attention maps on INbreast. Finally, we train the hypercomplex network, i.e., PHResNet18 (n = 2), with the conditioning provided by the attention map, which from here on we denote with AM. We compare the results of our approach against a baseline ResNet18 without AM, state-of-the-art methods, and the real-valued counterpart of the proposed framework, i.e., ResNet18 with AM. Notably, we do not need to test PHResNet18 without AM, as in that case the parameter n would be set to 1, which is equivalent to its real-valued counterpart, i.e., ResNet18 [15]. The first state-of-the-art method we consider is the cross-view attention module (CvAM) designed for multi-view analysis of mammography, which is integrated inside a ResNet architecture [32]. In order to compare it with our method, we utilize the single-view equivalent of this approach, which is the CBAM module integrated in the same fashion as described in the original paper [32]. The second state-of-the-art network for comparison is PatchConvNet itself, directly fine-tuned on INbreast. The mean AUC over 5 runs is reported in Tab. I together with the number of parameters of each network. Evidently, our approach in both hypercomplex and real domains outperforms both baseline and state-of-the-art methods, showing that the proposed strategy for exploiting such knowledge is effective, thus resulting in more performant classifiers. Moreover, the proposed framework produces the most discriminant model, i.e., PHResNet with AM, which yields an AUC of 0.852. It is also important to note that it achieves such a result with just 5M parameters, that is, 1/5 of PatchConvNet. This is thanks to hypercomplex algebra rules, which allow modeling local relations between input dimensions,
thus grasping correspondences between the mammogram and the attention map. To conclude, even though the main gain is attained by including attention maps, introducing hypercomplex algebra still slightly improves performance, with a much lighter model. Thereafter, we also perform the same experimental evaluation as described above but with the two datasets switched. Thus, PatchConvNet is fine-tuned on INbreast, then the attention maps for CBIS-DDSM are inferred and used to train the different networks. The results reported in the bottom part of Tab. I support our theory, this time attaining an even larger gain from the introduction of hypercomplex algebra, i.e., from 0.694 to 0.725.
The second set of experiments is conducted on histopathological microscopic images of tumor tissue at different magnifying factors. In detail, we fine-tune PatchConvNet on images at a magnification factor of 40X and then utilize it at inference time to compute attention maps for the remaining magnifying factors, i.e., 100X, 200X, and 400X. Finally, we train PHResNet50 with AM on the latter datasets and compare it against a vanilla ResNet50, PHResNet50 with n = 3 (since they are RGB images, i.e., with three channels), and the real-valued counterpart of our method, i.e., ResNet50 with AM. The results are reported in Tab. II, showing the average of the accuracy over 3 runs and the standard deviation. Firstly, as expected, the advantage brought by hypercomplex algebra can be seen in both scenarios with and without AM. In both cases, the mean accuracy is improved and the models comprise only 5M and 4M parameters, for n = 3 and n = 4, respectively. Secondly, the experiments demonstrate how attention maps generalize across different magnifying factors, thus improving the performance in every case, with the hypercomplex network yielding the best accuracy for each scenario with a quarter of the parameters with respect to its counterpart in the real domain. Thus, we further demonstrate the efficacy of the proposed framework on microscopic images of tumor tissue at different magnification factors, in addition to X-ray mammogram exams.
To conclude, Table I also includes different ablation experiments, as the gain from each proposed component, i.e., conditioning on attention maps (AM) and hypercomplex algebra (PH), is shown. In fact, we test a baseline ResNet, then ResNet with AM and finally, we add hypercomplex algebra with PHResNet with AM. Table II also shows the experiment with hypercomplex algebra and without AM, i.e., PHResNet with n = 3.

V. CONCLUSIONS
In this paper, we have proposed a novel framework to exploit the information learned by the attention mechanism; that is, we build an augmented dataset where each sample comprises the original image and the respective attention map. This new multi-dimensional input is used to condition the training of a PHResNet, which handles it as a single unit and, thanks to hypercomplex algebra properties, has the capacity to capture latent relations between the original breast cancer image and the attention map. In this way, we effectively exploit the additional information regarding the location of the tumor region provided by conditioning on the attention map, shifting the focus of the network onto it. We demonstrate the validity of the proposed framework on breast cancer datasets, comprising first mammography exams and second histopathological microscopic images, outperforming attention-based state-of-the-art architectures and the real-valued counterpart of the proposed technique.

Fig. 1 .
Fig. 1. Top rows: attention maps of INbreast obtained from PatchConvNet fine-tuned on CBIS-DDSM. Bottom rows: attention maps of CBIS-DDSM obtained from PatchConvNet fine-tuned on INbreast. The left column comprises mammograms with a malignant finding, while the right presents negative/benign mammograms.

Fig. 2 .
Fig. 2. PHAM framework. On the left, the attention-map augmentation step is depicted. Herein, attention maps are computed offline with the fine-tuned PatchConvNet model. Then, they are used to perform a sort of conditioning on the hypercomplex model. On the right, a PHResNet with n = 2 for mammography images (n = 4 for histopathology images) is employed as the backbone to perform breast cancer classification. By defining the model in the hypercomplex domain, it gains the capacity to process the original image and the relative attention map as a unique entity, modeling relations between them, as can be seen in the visualization of the parameterized hypercomplex convolutional (PHC) layer.

TABLE I
RESULTS ON THE TEST SETS OF THE INBREAST AND CBIS-DDSM DATASETS. ATTENTION MAPS (AM) ARE OBTAINED FROM PATCHCONVNET FINE-TUNED ON CBIS-DDSM (TOP) AND INBREAST (BOTTOM), RESPECTIVELY. RESULTS IN BOLD AND UNDERLINED CORRESPOND TO THE BEST AND SECOND BEST, RESPECTIVELY.

TABLE II
RESULTS ON THE TEST SET OF THE BREAKHIS DATASET AT MAGNIFICATION FACTORS 100X, 200X, AND 400X. ATTENTION MAPS (AM) ARE OBTAINED FROM PATCHCONVNET FINE-TUNED AT MAGNIFICATION FACTOR 40X. RESULTS IN BOLD AND UNDERLINED CORRESPOND TO THE BEST AND SECOND BEST, RESPECTIVELY.