A hybrid deep learning model for breast cancer diagnosis based on transfer learning and pulse-coupled neural networks

Radiology experts often face difficulties in mammography mass lesion labeling, which may lead to conclusive yet unnecessary and expensive breast biopsies. This paper focuses on building an automated diagnosis tool that supports radiologists in identifying and classifying mammography mass lesions. The paper’s main contribution is a hybrid model based on Pulse-Coupled Neural Networks (PCNN) and Deep Convolutional Neural Networks (CNN). Because training and tuning CNNs requires large datasets, which are not available for medical images, Transfer Learning (TL) was exploited in this research. TL can be an effective approach when working with small-sized datasets. The implementation was evaluated on four public benchmark datasets: DDMS, INbreast, and BCDR for training and testing, and MIAS for testing only. The results indicate the enhancement that PCNN provides when combined with CNN, compared to other methods on the same public datasets. The hybrid model achieved 98.72% accuracy for DDMS, 97.5% for INbreast, and 96.94% for BCDR. To check for overfitting, the proposed hybrid model was tested on the unseen MIAS dataset, achieving 98.77% accuracy. Other evaluation metrics are reported in the results section.


Introduction
In 2018, approximately 18 million new cancer cases were added to the population living with cancer worldwide, and over half of them resulted in death [1]. According to statistical evidence, breast cancer is one of the most common types of cancer among women. Around 1 in 8 women in the USA, approximately 12%, will develop breast cancer over the course of their lifetime [2]. Studies and statistics indicate that if breast cancer patients are diagnosed at an early stage, the five-year survival rate may reach 90%, while it would not exceed 20% in the terminal stage [1]. Mammography generally provides high-resolution, clear, and accurate imaging for breast cancer examination. Accordingly, an intelligent computer-aided diagnosis system can help radiologists use mammography to make a precise diagnosis. Previous research trials indicate that computer-aided diagnosis systems can enhance the diagnostic success rate and reduce the probability of misdiagnosis [2][3][4][5]. The main goal of most computer-aided diagnosis platforms is to perform labeling and to differentiate between malignant and benign lesions.
Recently, Deep Learning (DL) systems have outperformed other Machine Learning (ML) systems in computer vision problems [6][7][8][9]. This is especially the case for Convolutional Neural Networks (CNN) [10,11]. DL techniques have been exploited in diverse medical domains and applications, including pulmonary peri-fissural classification [12] and interstitial lung disease lymph node identification [13]. Most of the implemented systems that have used CNNs have applied a "vanilla" approach. In particular, the extracted CNN features are used alone or in combination with other handcrafted descriptors to perform the classification [14,15]. Meanwhile, the most valuable characteristic of CNNs is that they remove the need for feature engineering in the classification process and operate on raw images. The CNN architecture is specially designed to benefit from the 2D structure of the input image. More importantly, it can generalize to other recognition problems [14,15]. To benefit from these characteristics, large annotated datasets should be used in training, which are not available in the medical field, especially for breast cancer. Furthermore, training CNNs from scratch consumes intensive computational power and memory. To overcome this issue, Transfer Learning (TL) may be used [16], where the idea is to take a model pre-trained on available images and fine-tune it for the classes at hand [17]. TL is widely applied in DL applications and has shown effectiveness in the medical field, particularly in domains where data are normally limited. Another neural network that can be used as a transformation, expressing image contents without segmentation, is the Pulse-Coupled Neural Network (PCNN) [18]. The PCNN can convert a 2D image into a 1D signal that can be perceived as an "Image Signature" [19]. The PCNN is a neural network model that was first proposed by Eckhorn [19] around 1990.
The core idea of PCNN comes from the study of synchronous pulse bursts in the cat's visual cortex.

Why PCNN?
• Models involving only CNNs may be sensitive to image acquisition settings, scanning type, and pre-processing steps. PCNNs can neutralize these effects through the iterative pulses.
• PCNNs offer a method for prioritizing the features and can act as a feature selector.
The main contributions of our work are as follows:
• Extracting mammogram features more precisely, capturing both local and global features and using different filter scales.
• Combining a PCNN model with a pre-trained CNN model to extract and select features before classification in the dense layer.
• Mitigating the problem of overfitting caused by relatively small datasets.
• Supporting generalization by exposing the proposed model to multiple datasets.
The rest of this paper is organized as follows: Section 2 discusses related work published in the scientific community; Section 3 presents the scientific background and describes the datasets; Section 4 discusses the proposed system; and Section 5 presents the experimental results and concludes the findings.

Related work
In the medical community, examinations and screening tests for detecting and diagnosing breast cancer by domain experts are part of an extremely sensitive, time-consuming, and costly process. The efficiency of the diagnosis procedure can be increased by exploiting technology and intelligent software components. Using these tools directly reduces cost and diagnostic effort. Accordingly, a substantial number of studies have been undertaken based on AI and ML approaches. One of these approaches, presented in Akay [20], used Support Vector Machines (SVM) based on a feature selection algorithm to detect breast cancer. The implemented system achieved 99.51% classification accuracy. Karabatak [21][22][23][24][25] proposed a hybrid system using an expert system, association rules, and a feedforward neural network to detect breast cancer. The role of the association rules was to reduce the number of neural network input features, while the neural network was trained for the classification task. The system's overall classification accuracy was greater than 95%. A hybrid approach combining evolutionary techniques and fuzzy systems was proposed in [26]. One advantage of this system was that it provided explainable rules for the domain experts. Recently, Deep Neural Networks (DNNs) have been introduced in the biomedical field to detect cancers, including breast cancer. Table 1 lists previous trials that have been undertaken using deep learning.
In the past few years, the use of ML and DL in breast cancer research has grown worldwide, and many researchers have applied different techniques and achieved competitive results [29][30][31][32][33][34].
Transfer Learning (TL) is needed in the medical field because data are rare and costly, and few datasets are publicly available. Furthermore, collecting and annotating datasets is a very time-consuming task for radiology experts [35]. At the same time, training a CNN from scratch is computationally intensive and costly in terms of memory resources. The first layers of a pre-trained CNN capture generic features such as edge detectors. These are useful for many tasks, but in the later layers the features are combined in such a way that they become more focused on the details of the label classes in the training set.
Until a solution is devised to the problem of limited availability of large-scale medical datasets for mammography, the methodology of merging Transfer Learning and data augmentation seems to be a promising approach for CNN training.
The extracted features can show whether a model has successfully learned or not, allowing a decision to be made about whether to halt the training process early. This is referred to as early stopping. In this work, four publicly available datasets were used: 1 for testing only and 3 for training and testing. The main reason for keeping one dataset for testing only is to ensure that the proposed model does not overfit. One way to boost the learning task is to merge all 3 training datasets into a single one.

DDMS
The DDMS dataset [36] includes more than 2650 instances ordered by the severity degree of the case findings. Each case instance has 4 view mammograms for the same person with the associated ground truth (Figure 1), along with other related information on the following:
• Patient's age.
• Breast density (determined by the specialist both for benign and malignant cases).

INbreast
INbreast contains a total of 115 cases [37]. Both craniocaudal and mediolateral oblique views are included (Figure 2). Overall, it contains a total of 410 images. INbreast contains the following information:
• Acquisition date.
• Existence of previous cases in family.
INbreast contains a wide diversity of cases, including different types of lesions. In this work, we focus on benign and malignant cases.

MIAS
Mammographic Image Analysis Society (MIAS) is a commonly known dataset that includes 322 film mammograms [38]. It also contains expert-validated ground truth (Figure 4).

Convolutional neural network (CNN)
Convolutional Neural Networks (CNN) were proposed by Yann LeCun et al. in 1998 [39]. A CNN is a multistage combination of convolutional layers and fully-connected layers. Generally, a convolution process is used to minimize the memory required, and it is computed on limited regions to enhance performance.
There are four main operations in a CNN [15]:
• Convolution.
Starting from a 2D input matrix such as an image, a CNN is composed of layers. Each layer contains various filters (kernels) (Figure 5). During the forward pass, the input is convolved with each filter, and dot products between the entries of the filters and the input are computed. A feature map is obtained by repeatedly applying a function across sub-regions of the entire image (convolution) with a linear filter, adding a bias term, and then applying a non-linear function. If we denote the $k$-th feature map at a given layer as $h^k$, whose filters are determined by the weights $W^k$ and bias $b_k$, then the feature map is obtained as follows [40]: $h^k_{ij} = f\big((W^k * x)_{ij} + b_k\big)$, where $x$ is the layer input and $f$ is the non-linear function. To capture a richer representation of the input, each hidden layer is composed of many feature maps, $\{h^{(k)}, k = 0, \ldots, K\}$.
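The feature-map computation described above (a linear filter slid over the image, a bias term, then a non-linearity) can be sketched in a few lines. This is a minimal, dependency-free illustration rather than the authors' implementation: the function name `feature_map`, the choice of `tanh` as the non-linearity, and the toy image and filter are all assumptions made here. As in most deep learning frameworks, the filter is applied without flipping (cross-correlation).

```python
import math


def feature_map(x, w, b, activation=math.tanh):
    """Compute h[i][j] = activation(sum_{u,v} w[u][v] * x[i+u][j+v] + b).

    x is a 2D list (the input), w a 2D list (one filter), b a scalar bias.
    Only positions where the filter fits entirely inside x are evaluated.
    """
    kh, kw = len(w), len(w[0])
    out_h = len(x) - kh + 1
    out_w = len(x[0]) - kw + 1
    h = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Dot product between the filter and the image patch at (i, j)
            s = sum(w[u][v] * x[i + u][j + v]
                    for u in range(kh) for v in range(kw))
            row.append(activation(s + b))
        h.append(row)
    return h


# Toy 3x4 image with a vertical edge, and a 1x2 horizontal-difference filter
image = [
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]
edge_filter = [[-1.0, 1.0]]
h = feature_map(image, edge_filter, b=0.0)
```

In this toy case the response is non-zero only in the column where the intensity changes, which is the sense in which early CNN filters behave like edge detectors.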

Transfer learning
Most computer vision studies that have used TL have relied on the ImageNet dataset for pre-training [41]. There are several commonly known pre-trained CNN models, including AlexNet [42], VGG16 [43], ResNet50 [44], and GoogLeNet [45]. Each model was pre-trained using ImageNet, and each one is intended for a 1,000-class classification task. Until high-volume datasets of mammography images are available, combining TL and data augmentation is an extremely promising CNN training methodology.
AlexNet:
• 64-384 filters
• Fully-connected layer size is 1000
• Input is 224 x 224
As mentioned before, these TL architectures were trained on the 1,000-class object recognition task of the ImageNet database. ImageNet includes 1.2 million images as a training set, 50,000 images for validation purposes, and 100,000 images as a test set. All of these models use data augmentation together with dropout layers to reduce the probability of overfitting.

Pulse-coupled neural networks
PCNNs constitute a biologically inspired model based on a cat's visual system. The model consists of a single layer of neurons, each of which is connected directly to one input image pixel. Each neuron accepts 2 different inputs: the linking input and the feeding input. The feeding input receives an external signal in addition to an internal one, whereas the linking input accepts only local signals [18]. The internal stimulus is perceived from the neighborhood within the feeding radius around the neuron, while the external stimulus comes from the intensity of the associated pixel in the input image. The linking and feeding inputs are then combined to form the internal potential that determines the output firing of the neuron (Figure 6) [18].

Modified PCNN
There are many variations of the PCNN algorithm. These variations seek to mimic the physiological pulse-coupled neuron; they apply different simplifications to the computations while retaining the main features of the theory. One of these variations is the "Modified PCNN". In this model, each neuron is governed by the following sequence of equations [18,19]: where $L(i)$ represents the linking input, $F(i)$ represents the feeding input, $P$ is the image's pixel intensity, $U(i)$ is the neuron activation potential, and $\theta(i)$ is the neuron threshold. There are multiple hyperparameters, including the decay coefficients $\alpha_L$, $\alpha_F$, and $\alpha_\theta$, as well as $\beta$, which represents the linking coefficient. Finally, $V_L$ and $V_F$ are the linking and threshold potential hyperparameters. Regarding the neuron output, $Y$ carries the firing information of the surrounding neurons, while $Y_o$ conveys the firing information of the neuron itself. Additionally, $W$ is the weight matrix. The "Modified PCNN" neuron architecture is illustrated in Figure 7 [18].
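For orientation, the widely used standard discrete-time PCNN update is given below. It is a reference form under assumed notation and not necessarily the exact "Modified PCNN" equations of [18,19]: the neuron indices $(i,j)$, the iteration index $n$, the feeding kernel $M$, and the threshold potential $V_\theta$ are assumptions introduced here following common usage, while $F$, $L$, $U$, $Y$, $\theta$, $P$, $W$, $\beta$, and the decay coefficients match the symbols described above.

```latex
\begin{aligned}
F_{ij}[n] &= e^{-\alpha_F}\, F_{ij}[n-1] + V_F \sum_{k,l} M_{ijkl}\, Y_{kl}[n-1] + P_{ij} \\
L_{ij}[n] &= e^{-\alpha_L}\, L_{ij}[n-1] + V_L \sum_{k,l} W_{ijkl}\, Y_{kl}[n-1] \\
U_{ij}[n] &= F_{ij}[n]\,\bigl(1 + \beta\, L_{ij}[n]\bigr) \\
Y_{ij}[n] &= \begin{cases} 1, & U_{ij}[n] > \theta_{ij}[n-1] \\ 0, & \text{otherwise} \end{cases} \\
\theta_{ij}[n] &= e^{-\alpha_\theta}\, \theta_{ij}[n-1] + V_\theta\, Y_{ij}[n]
\end{aligned}
```

In this form, a neuron fires when its internal potential exceeds its threshold; firing raises the threshold sharply, after which the threshold decays until the neuron can fire again, which produces the pulse trains the model is named for.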
The output of each iteration (the firing status of each neuron) is used to form a time-series-like signature. The standard feature generation formula $G(n)$ for iteration $n$ sums the firing outputs of all neurons [46][47][48]: $G(n) = \sum_{i,j} Y_{ij}(n)$. In this work, we use the enhanced formula proposed by Elons [18]. This improves signature quality by adding a weighting factor known as a "Continuity Factor".
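How a firing pattern turns into a 1D signature can be illustrated with a small, dependency-free sketch. It implements the standard PCNN update with a uniform 8-neighbourhood and counts the neurons that fire at each iteration; it does not implement the Modified PCNN or the Continuity Factor of [18]. The function name `pcnn_signature`, the initial threshold, and every default hyperparameter value are illustrative assumptions, not values from this work.

```python
import math


def _zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]


def pcnn_signature(image, iterations=10, alpha_f=0.1, alpha_l=1.0,
                   alpha_t=0.5, beta=0.2, v_f=0.5, v_l=0.2, v_t=20.0):
    """Return G, where G[n] is the number of neurons firing at iteration n.

    image is a 2D list of pixel intensities, assumed to lie in [0, 1].
    """
    rows, cols = len(image), len(image[0])
    feeding = _zeros(rows, cols)
    linking = _zeros(rows, cols)
    fired = _zeros(rows, cols)
    theta = [[1.0] * cols for _ in range(rows)]  # assumed initial threshold
    signature = []
    for _ in range(iterations):
        previous, fired = fired, _zeros(rows, cols)
        for i in range(rows):
            for j in range(cols):
                # Firings of the 8 surrounding neurons at the previous step
                neighbours = sum(
                    previous[i + di][j + dj]
                    for di in (-1, 0, 1)
                    for dj in (-1, 0, 1)
                    if (di or dj) and 0 <= i + di < rows and 0 <= j + dj < cols
                )
                feeding[i][j] = (math.exp(-alpha_f) * feeding[i][j]
                                 + v_f * neighbours + image[i][j])
                linking[i][j] = math.exp(-alpha_l) * linking[i][j] + v_l * neighbours
                potential = feeding[i][j] * (1.0 + beta * linking[i][j])
                # Compare against the threshold from the previous iteration
                fired[i][j] = 1.0 if potential > theta[i][j] else 0.0
                theta[i][j] = math.exp(-alpha_t) * theta[i][j] + v_t * fired[i][j]
        signature.append(int(sum(map(sum, fired))))
    return signature


image = [
    [0.1, 0.9, 0.2],
    [0.3, 0.8, 0.1],
    [0.0, 0.7, 0.4],
]
signature = pcnn_signature(image, iterations=8)
transposed = [list(column) for column in zip(*image)]
signature_transposed = pcnn_signature(transposed, iterations=8)
```

Because the neighbourhood weights are uniform and the signature only counts firings, the transposed image yields the same signature as the original, a small-scale instance of the geometric invariance that motivates using PCNN signatures as features.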

Proposed system
The aim of this system is to build a classification model that assigns each input image one of the following possible labels: Normal, Abnormal Benign, and Abnormal Malignancy. Different Transfer Learning (TL) models are used to provide enhanced features for the dense layers that classify the input image. The main challenge is to neutralize the effect of acquisition quality on the model, which is the main motivation for using PCNNs. PCNNs generate image signatures without the need for segmentation or pre-processing steps. This signature is fed into feedforward neural layers alongside the CNN features (Figure 8). As mentioned above, different TL models were tested on different datasets. Both the TL model and the PCNN act as feature extractors, and their feature vectors are fused into a unified feature vector. The fused feature vector is fed into a deep feedforward network for classification.
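The fusion step can be pictured as joining the two feature vectors before the dense layers. The sketch below assumes fusion by concatenation and adds an L2 normalization of each vector so that neither source dominates by scale; both choices, and the names `l2_normalize` and `fuse_features`, are assumptions for illustration, since the text only states that the two vectors are fused into one.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (left unchanged if all zeros)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else list(vector)


def fuse_features(cnn_features, pcnn_signature):
    """Concatenate the normalized CNN features and PCNN signature."""
    return l2_normalize(cnn_features) + l2_normalize(pcnn_signature)


# Toy vectors standing in for a CNN embedding and a PCNN signature
fused = fuse_features([3.0, 4.0], [0.0, 5.0, 0.0])
```

The length of the fused vector is the sum of the two input lengths, which fixes the input size of the first feedforward layer.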
Most previous researchers have used a segmentation step to identify the Region of Interest (ROI). The main motivation for this is to reduce the computational effort of the CNN and eliminate the impact of limited training data size. Meanwhile, the proposed system accepts the complete image without segmentation or any pre-processing steps. This idea mainly depends on the PCNN's ability to generate image signatures that are invariant to acquisition quality, scaling, rotation, and translation.
The initial layers of a CNN capture generic features and typically perform tasks such as edge detection. The successive layers, in turn, attend to details that are more specific to the classes.
Due to differences between the ImageNet dataset and the datasets at hand, model fine-tuning is proposed to adjust the features to the designated datasets.
The superiority of DL and PCNN compared to the other ML methods stems from the capability of multi-level feature representation in any hidden layer. Additionally, DL and PCNN can significantly neutralize the effects of low-quality imaging resolution and expected noise.

Results and configurations
This research was conducted using the PyTorch platform and the Python environment. From the DDMS dataset, a subset of 900 images from 450 patients was used, including 150 malignant cases, 150 benign cases, and 150 normal cases. Meanwhile, we selected 75 cases from INbreast containing a total of 300 images: 100 images from 25 normal cases, 100 from 25 benign cases, and 100 from 25 malignant cases. From BCDR, a subset of 450 patients was selected with the same distribution. The final dataset was a combination of all the subsets selected. The input was "augmented" using a sequence of random transformations, so that the designated models would not receive the same image as an input multiple times. Shifting was applied by a fraction of up to 0.3 of the image's overall height and width. A rotation was also applied at random in the range between 0 and 45 degrees. Shear was applied with a range of 0.5 and zoom in the range (0.5, 1.5). In the preparation phase, class balance was preserved, and the dataset was separated into a training set (80%) and a test set (20%).
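The augmentation ranges quoted above can be made concrete by sampling one set of transformation parameters per image. The sketch below only draws the parameters and leaves the image warp to whichever imaging library is in use. Treating the shift and shear ranges as symmetric around zero is an assumption made here, as are the function name `sample_augmentation` and the 224 x 224 image size.

```python
import random


def sample_augmentation(rng, height, width):
    """Draw one random set of affine augmentation parameters."""
    return {
        "shift_x": rng.uniform(-0.3, 0.3) * width,    # pixels, up to 30% of width
        "shift_y": rng.uniform(-0.3, 0.3) * height,   # pixels, up to 30% of height
        "rotation_deg": rng.uniform(0.0, 45.0),       # degrees
        "shear": rng.uniform(-0.5, 0.5),              # shear intensity (assumed symmetric)
        "zoom": rng.uniform(0.5, 1.5),                # scale factor
    }


rng = random.Random(0)  # fixed seed so the sketch is reproducible
params = sample_augmentation(rng, height=224, width=224)
```

Drawing fresh parameters every time an image is loaded is what makes it unlikely that the network sees exactly the same input twice.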
Since the data have no significant imbalance, accuracy can serve as a key performance indicator. The experimental procedure can be described as a grid search to select the optimal set of hyper-parameters. The model is evaluated on each dataset separately and on the merged dataset.
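A grid search of this kind evaluates every combination of candidate values and keeps the best-scoring one. The following sketch shows the mechanics only; the grid contents and the scoring function `toy_score` are hypothetical stand-ins, whereas in practice the score would be the validation accuracy of a trained model.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Return (best_params, best_score) over the Cartesian product of the grid."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid; the values are illustrative, not those used in this work
grid = {"learning_rate": [1e-2, 1e-3, 1e-4], "dropout": [0.3, 0.5]}


def toy_score(params):
    """Stand-in for validation accuracy, peaking at lr=1e-3 and dropout=0.5."""
    return -abs(params["learning_rate"] - 1e-3) - abs(params["dropout"] - 0.5)


best, score = grid_search(grid, toy_score)
```

The number of evaluations is the product of the list lengths (6 here), which is why the cost grows quickly as hyper-parameters are added.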
Since the model's final output is a probability distribution and the data do not suffer from severe imbalance, we rely on the default classification threshold. Model uncertainty should nevertheless be considered in situations with biased or imbalanced classes.
Selecting the correct hyperparameters when conducting fine-tuning is a sensitive process. The optimization step was performed using Stochastic Gradient Descent (SGD) rather than any other optimizer. The main motivation for this was to ensure that the learning rate magnitude was limited. The feedforward layers used the adaptive Adam optimizer as the training algorithm.
Moreover, to improve the experimental results and to reduce the probability of overfitting, data augmentation was used together with L2 regularization, and dropout was added. At the outset, the labels of the dataset (training and testing) were encoded using one-hot encoding. This meant that the last layer of the feedforward network consisted of 3 neurons. A softmax activation function was used in that layer, while ReLU was used in all hidden layers. Selecting hyperparameters for a PCNN is a complex process; the simulations were conducted with multiple values set for the hyperparameters.
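The one-hot encoding and the 3-neuron softmax output described above fit together as follows. This is a dependency-free sketch with shortened class names and made-up logits, both of which are assumptions for illustration; the cross-entropy loss is included because it is the usual companion of a softmax output, not because the text specifies it.

```python
import math

CLASSES = ["normal", "benign", "malignant"]  # shortened, illustrative label names


def one_hot(label):
    """Encode a class label as a vector with a single 1 at the class position."""
    return [1.0 if name == label else 0.0 for name in CLASSES]


def softmax(logits):
    """Convert raw scores to probabilities; subtracting the max avoids overflow."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, probs):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


probs = softmax([2.0, 0.5, -1.0])            # made-up output-layer scores
predicted = CLASSES[probs.index(max(probs))]
loss = cross_entropy(one_hot("normal"), probs)
```

Taking the class with the highest probability corresponds to the default decision rule; any class-specific threshold would replace the `max` step.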
To evaluate performance, the model was tested both on the test split of the merged dataset and on the completely unseen MIAS dataset. Models were trained and tested using stratified 10-fold cross-validation. Different measurements were collected, including mean accuracy, standard deviation, and training time for each TL model. Results on the merged dataset's test set are given in Table 2. To verify the main idea of this research, the experiments were conducted twice: the first without segmentation before the model (the main idea) and the second with segmentation before the model. These results were obtained with 3 hidden layers in the feedforward network, of sizes 1000×1000×200. After the first experiment, the TL models were scored and tested on the unseen MIAS dataset. Table 3 provides an overview of the results.
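Stratified k-fold cross-validation keeps the class proportions of the full dataset in every fold. A minimal sketch of how such folds can be built is shown below; the function name `stratified_k_fold`, the round-robin assignment, and the toy label list are assumptions for illustration rather than the procedure used in the experiments.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs with class balance in each fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the samples of each class round-robin across the folds
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    for test_fold in folds:
        held_out = set(test_fold)
        train = [i for i in range(len(labels)) if i not in held_out]
        yield train, sorted(test_fold)


# Toy label list with three balanced classes
labels = ["normal"] * 30 + ["benign"] * 30 + ["malignant"] * 30
splits = list(stratified_k_fold(labels, k=10))
```

Each sample appears in exactly one test fold, so the mean and standard deviation of the per-fold accuracies summarize performance over the whole dataset.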
The parameters of a PCNN are difficult to optimize, and the grid search operation is a very computationally intensive process. Therefore, the values were set based on the best practices established in other studies. For comparison, an evaluation on the same dataset by Peng et al. [9] yielded 96% accuracy and 0.92 AUC. Alexander et al. [49] reported molecular predictions reaching balanced accuracies of up to 78%, with accuracies of over 95% achievable for subgroups of patients. Yu-Dong Zhang et al. [50] ran the BDR-CNN-GCN algorithm 10 times on the breast mini-MIAS dataset and achieved an accuracy of 96.1%.
The experimental results lead to two important insights:
• Using PCNN can neutralize the effect of image differences and can eliminate any preprocessing steps.
• GoogLeNet outperforms other TL models as a pre-trained model due to its ability to generalize.
Based on the hyperparameter values, the overall system training time was measured between 20 and 30 hours, considering the grid search process to optimize model hyper-parameters.
Finally, the advantages and disadvantages can be summarized as follows.
Advantages:
• The model does not require any preprocessing or segmentation steps.
• It demonstrated generalization capability by being tested on multiple datasets.
• It did not show signs of overfitting.
• It achieved better results than previous research trials.
Disadvantages:
• The model is complex, as it combines multiple models.
• It is hard to maintain in a production environment.
Uncertainty quantification (UQ) is a key factor in reducing uncertainty in both optimization and decision-making processes. UQ can be applied to a diverse set of practical applications; hence, it will be addressed in future work by exploring well-established techniques [51,52].

Conclusions
In this research, different pre-trained deep learning models were used in combination with PCNNs to classify breast cancer candidate images. The model was trained and tested on different datasets, achieving 0.99 AUC and 98.77% accuracy. The implementation of the proposed model indicates that using PCNNs can eliminate the segmentation preprocessing step, and that the PCNN image signature is an important discriminant feature for classification. The model was tested on a completely unseen dataset to check that it does not overfit as a result of the limited dataset size. One of the major beneficial aspects of the proposed model is that it can potentially be generalized to other computer vision applications. PCNNs and CNNs can be used as complementary feature extractors for images and videos. In future work, new ideas to enhance performance can be introduced. One is to explore other datasets and expand the merged dataset. Another is to introduce more complex and deeper models. Finally, the authors will continue to monitor the model to ensure that it avoids overfitting.