Double-Shot Transfer Learning for Breast Cancer Classification from X-Ray Images

Abstract: Differentiation between benign and malignant breast cancer cases in X-ray images can be difficult due to their similar features. In recent studies, the transfer learning technique has been used to classify benign and malignant breast cancer by fine-tuning various pre-trained networks, such as AlexNet, the visual geometry group network (VGG), GoogLeNet, and the residual network (ResNet), on breast cancer datasets. However, these pre-trained networks were trained on large benchmark datasets such as ImageNet, which do not contain labeled images related to breast cancer, which leads to poor performance. In this research, we introduce a novel technique based on the concept of transfer learning, called double-shot transfer learning (DSTL). DSTL is used to improve the overall accuracy and performance of pre-trained networks for breast cancer classification. DSTL updates the learnable parameters (weights and biases) of a pre-trained network by fine-tuning it on a large dataset that is similar to the target dataset; the updated network is then fine-tuned on the target dataset. Moreover, the number of X-ray images is enlarged by a combination of augmentation methods, including different variations of rotation, brightness, flipping, and contrast, to reduce overfitting and produce robust results. The proposed approach demonstrates a significant improvement in the classification accuracy and performance of the pre-trained networks, making them more suitable for medical imaging.


Introduction
Recently, various machine learning algorithms have been used to develop computer-aided diagnosis (CAD) systems to enhance the diagnosis of breast cancer in medical images. These algorithms are mainly based on traditional classifiers that rely on hand-crafted features to solve a particular machine learning task. Such methods are therefore considered tedious and time-consuming, and they require field experts, especially for the feature extraction and selection tasks [1]. Recent studies have shown that deep learning methods can produce promising results on tasks such as image classification, detection, and segmentation in different areas of computer vision and image processing. However, training these deep learning algorithms from scratch so that they produce accurate results and avoid overfitting remains an issue due to the lack of medical images available for experiments [2]. In recent years, techniques such as transfer learning and image augmentation have shown promising opportunities for increasing the amount of training data, overcoming overfitting, and producing robust results [3]. Several interesting studies have used datasets such as the breast cancer digital repository (BCDR) to differentiate benign from malignant breast cancer. The advantage of DSTL over the single-shot transfer learning (SSTL) technique is that DSTL can improve the overall accuracy, sensitivity, specificity, area under the curve (AUC), training time, epoch number, and iteration number. The contributions of this paper can be summarized as follows:
1. An effective technique based on the concept of transfer learning, called double-shot transfer learning (DSTL), is introduced to improve the overall accuracy and performance of pre-trained networks for breast cancer classification. This technique makes these pre-trained networks more suitable for medical image classification purposes. More importantly, DSTL can speed up convergence significantly.
2. DSTL can update the learnable parameters (weights and biases) of any pre-trained network by fine-tuning them on a large dataset that is similar, but not identical, to the target dataset. The proposed DSTL adds new instances (CBIS-DDSM) to the source domain (D_s) that are similar to the target domain (D_t) to update the weights of the parameters in the pre-trained models and form a distribution similar to D_t (the MIAS and BCDR datasets).
3. The number of X-ray images is enlarged by a combination of effective augmentation methods that are carefully chosen based on the most common image display functions performed by doctors and radiologists during diagnostic image viewing. These augmentation methods include different variations of rotation, brightness, flipping, and contrast, and they reduce overfitting and produce robust results.
4. The proposed DSTL provides a valuable solution to the problem of differing source and target domains in transfer learning.

Dataset Description
In this research, three publicly available breast cancer datasets have been used to assess the effectiveness of the proposed method and validate the experimental results. These three datasets include CBIS-DDSM, MIAS, and BCDR.

CBIS-DDSM Dataset
DDSM is a public resource that provides the research community with mammographic images to facilitate and enhance the development of computer algorithms and training aids for effective CAD systems. It is a collaborative effort between Massachusetts General Hospital, Sandia National Laboratories, and the University of South Florida Computer Science and Engineering Department [17]. The curated breast imaging subset of DDSM (CBIS-DDSM) is an updated version of the DDSM. This dataset contains normal, benign, and malignant cases with verified pathology information. The CBIS-DDSM collection contains a subset of the DDSM data curated by professional radiologists. It also contains bounding boxes, pathological diagnoses, and ROI segmentations for the training data. After eliminating corrupted and noisy images, as shown in Figure 1, the number of images was reduced to 7277 abnormal images [18,19], including 4009 benign and 3268 malignant cases.

MIAS Dataset
The Mammographic Image Analysis Society (MIAS) is an organization of UK research groups interested in the understanding of mammograms. MIAS has created a database of digital mammograms taken from the UK National Breast Screening Programme. The database contains 322 digitized films and is available on a 2.3 GB 8 mm (ExaByte) tape. In total, 114 of the images are abnormal, of which 63 are benign and 51 are malignant. The database also includes radiologists' annotations of the locations of cancers. The abnormalities are divided into six classes, namely calcification, well-defined/circumscribed masses, spiculated masses, ill-defined masses, architectural distortion, and asymmetry. The database images have been reduced to a 200-micron pixel edge and padded/clipped so that all images are 1024 × 1024. The mammographic images can be accessed from the Pilot European Image Processing Archive at the University of Essex [19,20]. The totals of benign and malignant images before applying the augmentation methods are 63 and 51, respectively.

BCDR Dataset
The breast cancer digital repository (BCDR) project has two main objectives: (1) establishing a reference for exploring computer-aided detection and diagnosis techniques, and (2) offering teaching opportunities to medical students. The BCDR has been publicly available since 2012 and is still under development. BCDR provides comprehensive patient cases of breast cancer, including mammography lesion outlines, prevalent anomalies, pre-computed features, and related clinical data. Patient cases are BI-RADS classified, biopsy proven, and annotated by specialized radiologists. The bit depth is 14 bits per pixel, and the images are saved in the TIFF format [21]. In this research, a total of 159 abnormal images have been used, consisting of 80 benign and 79 malignant cases.
It is worth noting that all images were converted into PNG format and resized to 224 × 224 and 227 × 227 to fit each pre-trained network. Figure 2 shows some samples of the CBIS-DDSM, MIAS, and BCDR datasets, including benign and malignant findings. With limited training data, one of the common problems deep learning algorithms face is overfitting [22]. Overfitting occurs when the training set is too small, which can leave the model unable to generalize. In other words, an overfitted model may be good at detecting or classifying features that were included in the training samples, but unable to detect or classify features it was not trained on [23]. Since only a small number of breast X-ray training images is available, new images were generated from the available images using image augmentation methods and included in the training samples. The most common image display functions performed by doctors and radiologists during diagnostic image viewing were adopted as the augmentation methods in this research; these methods are mainly inspired by the doctors' behavior in interpreting medical images [24]. Table 1 shows the size of every dataset before and after applying the image augmentation techniques. Tables 2-4 show the distribution of the datasets after applying the image augmentation techniques to every dataset.

Table 3. MIAS dataset distribution after the four augmentation methods combined.

                     Benign   Malignant   Total
Training samples      857       694       1551
Validation samples    214       173        387
Testing samples        63        51        114
Total                1134       918       2052

Table 4. BCDR dataset distribution after the four augmentation methods combined.
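As an illustration of the four augmentation families described above (rotation, flipping, brightness, and contrast), the following sketch applies one example of each to an X-ray image stored as a NumPy array. The parameter values here are hypothetical examples; the exact variations used in the paper may differ.

```python
import numpy as np

def rotate90(img, k=1):
    """Rotate the image by k * 90 degrees."""
    return np.rot90(img, k)

def flip(img, horizontal=True):
    """Flip the image horizontally or vertically."""
    return np.fliplr(img) if horizontal else np.flipud(img)

def brightness(img, delta=0.1):
    """Shift pixel intensities, keeping values in [0, 1]."""
    return np.clip(img + delta, 0.0, 1.0)

def contrast(img, factor=1.2):
    """Stretch intensities around the image mean, keeping values in [0, 1]."""
    return np.clip((img - img.mean()) * factor + img.mean(), 0.0, 1.0)

rng = np.random.default_rng(0)
img = rng.random((224, 224))          # stand-in for a resized mammogram
augmented = [rotate90(img), flip(img), brightness(img), contrast(img)]
print(len(augmented), augmented[0].shape)  # → 4 (224, 224)
```

Each original image thus yields several augmented variants, which is how the dataset sizes in Tables 1-4 are enlarged.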

Pre-Trained Networks
In this research, the proposed DSTL technique has been tested on most of the pre-trained networks that have been used in the breast cancer classification literature. Each pre-trained network used in this research is briefly explained below.

AlexNet
AlexNet is one of the most popular CNNs and has achieved high accuracy in various object detection and classification tasks. AlexNet was trained on the ImageNet data used in the ImageNet Large-Scale Visual Recognition Challenge 2010 (ILSVRC-2010) and ILSVRC-2012 competitions. AlexNet is 8 layers deep and can classify images into 1000 object classes. It contains five convolutional and three fully-connected layers, and its input image size is 227 × 227 × 3. AlexNet uses the dropout technique, which reduces overfitting significantly [11].

GoogLeNet
GoogLeNet achieved a new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC-2014). It is a model of great computational efficiency that can run on a single device with limited computing resources while increasing both the depth and width of the network. It utilizes the concept of inception blocks, which reduce the number of parameters. An average pooling layer is also used before the classification layer, in addition to an extra linear layer that makes the network more convenient to fine-tune. An average pooling layer with a 5 × 5 filter size and stride 3 is applied, and a 1 × 1 convolution with 128 filters is followed by a rectified linear activation function. Finally, a fully connected layer, a dropout layer, and a softmax layer are added [13].

VGG
The Visual Geometry Group (VGG) at the University of Oxford proposed the VGG model in 2014. The VGG architecture and configurations are inspired by AlexNet. The input to the first convolutional layer is a fixed-size 224 × 224 RGB image. The network has three fully connected layers: the first two have 4096 channels each, and the third has 1000 channels. A softmax layer is the final layer of the network. VGG-16 is 16 layers deep and has a total of 138 million parameters. Like VGG-16, VGG-19 was trained on the ImageNet dataset, which contains more than a million images classified into 1000 object classes. VGG-19 is 19 layers deep and contains a total of 144 million parameters [12]. In the ImageNet 2014 challenge, the VGG team won first place in localization and second place in classification.

MobileNet-v2
MobileNet-v2 is a network architecture that uses depthwise separable convolutions as building blocks. It is an efficient model for mobile applications, especially in the field of image processing. In MobileNet-v2, the convolution in each building block is split into two separate layers. MobileNet-v2 uses linear bottlenecks between its layers and shortcut connections between those bottlenecks. The architecture contains a convolution layer with 32 filters, followed by 19 residual bottleneck layers and a ReLU activation function. The filter size used in the network architecture is 3 × 3. Finally, dropout and batch normalization are utilized within the architecture [15].
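The efficiency gain of depthwise separable convolutions comes from simple parameter arithmetic: a standard K × K convolution mapping C_in to C_out channels uses K·K·C_in·C_out weights, whereas the separable version uses K·K·C_in (depthwise) plus C_in·C_out (pointwise) weights. The channel counts below are illustrative, not taken from the MobileNet-v2 architecture.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias terms ignored)."""
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    """Weights in a depthwise (k x k per channel) plus pointwise (1 x 1) pair."""
    return k * k * c_in + c_in * c_out

k, c_in, c_out = 3, 32, 64
std = standard_conv_params(k, c_in, c_out)    # 9 * 32 * 64 = 18432
sep = separable_conv_params(k, c_in, c_out)   # 288 + 2048  = 2336
print(std, sep, round(std / sep, 1))  # → 18432 2336 7.9
```

For a 3 × 3 kernel, the separable form is roughly 8-9 times cheaper, which is why such blocks dominate mobile-oriented architectures.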

ResNet
Microsoft Research introduced ResNet (Residual Network) and won first place in ILSVRC 2015. ResNet uses a technique called skip connections, which allows training deeper networks with more than 150 layers. The ResNet model reduces the effect of the vanishing gradient problem significantly and lowered the error rate on the ImageNet dataset from the 6.7% obtained by GoogLeNet to 3.57%. In this work, we have focused on ResNet-50 and ResNet-101. ResNet-50 is 50 layers deep and has 25.6 million parameters. Its architecture consists of 5 stages with a residual block in each stage; these residual blocks use a shortcut identity function that helps to skip one or more layers. ResNet-101, on the other hand, is 101 layers deep and consists of 44.6 million parameters. The input image size in ResNet is 224 × 224 × 3 [14].
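The identity shortcut can be illustrated with a toy residual block: the output is F(x) + x, so when the learned transform F contributes nothing, the block degenerates to the identity and gradients can flow straight through. This is a deliberately simplified stand-in (a single matrix multiply plus ReLU for F), not the actual convolutional bottleneck used in ResNet.

```python
import numpy as np

def residual_block(x, weight):
    """Toy residual block: output = F(x) + x with F(x) = ReLU(x @ weight)."""
    fx = np.maximum(0.0, x @ weight)   # stand-in for the stacked conv layers
    return fx + x                      # identity shortcut connection

x = np.ones((1, 4))
w = np.zeros((4, 4))                   # F(x) = 0: the block reduces to identity
out = residual_block(x, w)
print(np.array_equal(out, x))  # → True
```

Because the all-zero weights already realize the identity mapping, extra residual layers cannot make the network worse in principle, which is the intuition behind training very deep ResNets.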

ShuffleNet
The Megvii Inc. group introduced the ShuffleNet model in 2017, which uses two new operations, pointwise group convolution and channel shuffle, to reduce computation cost while maintaining high accuracy. In the recent trend of constructing deeper networks, CNNs require billions of floating point operations to attain better accuracy; ShuffleNet requires only about 10-150 MFLOPs, which makes it more suitable for mobile devices with limited computing power. ShuffleNet has 50 layers and 1.4 million parameters. The ShuffleNet model overcomes the side effects of group convolutions with its special channel shuffle operation [16]. Table 5 lists the properties of every pre-trained model, including the model depth, size, number of parameters, and image input size. In addition, a validation accuracy monitoring algorithm was implemented to obtain the optimal hyper-parameters for every pre-trained model: the hyper-parameter values that give the highest accuracy on the validation dataset were selected. The steps of the algorithm are shown in Algorithm 1. Table 6 presents the training options and hyper-parameter values used in the training process for all pre-trained models.
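The validation accuracy monitoring in Algorithm 1 can be sketched as a simple early-stopping loop. This is a hypothetical simplification (not the MATLAB implementation used in this work): `val_accuracies` stands in for the accuracy measured at each validation check during training.

```python
def train_with_early_stop(val_accuracies, patience=10):
    """Stop when validation accuracy has not improved for `patience` checks."""
    best_acc, lag = None, 0
    for step, acc in enumerate(val_accuracies, start=1):
        if best_acc is None or acc > best_acc:
            best_acc, lag = acc, 0        # new best: reset the validation lag
        else:
            lag += 1                      # no improvement at this check
        if lag >= patience:
            return best_acc, step         # early-stop trigger
    return best_acc, len(val_accuracies)

# Accuracy improves for 3 checks, then stalls: training stops 10 checks later.
accs = [0.60, 0.72, 0.80] + [0.79] * 12
best, stopped_at = train_with_early_stop(accs, patience=10)
print(best, stopped_at)  # → 0.8 13
```

This is why the epoch and iteration counts in Table 9 differ across models: each training run stops as soon as the validation lag reaches the patience threshold.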

Double-Shot Transfer Learning
Transfer learning is a powerful technique that allows knowledge to be transferred across different neural network tasks. In transfer learning, a pre-trained network that has already learned informative features from one image classification task is used as a starting point to learn a new task from a smaller number of training samples. Knowledge transfer can be done by fine-tuning some layers in the pre-trained network, such as the input layer, fully-connected layer, and classification layer, and training the pre-trained network on a new dataset. Fine-tuning a pre-trained network usually produces better accuracy, and it is faster than training a new network from scratch [25]. Previous work has shown that transfer learning is very effective when the source and target domains/tasks are similar. In previous studies, instead of learning from scratch, SSTL takes advantage of knowledge that comes from previously learned datasets, especially when the training samples in the target domain are scarce. Unfortunately, SSTL has been applied without taking into account that these pre-trained models were trained on ImageNet, which has a different feature space and distribution from our target datasets. In other words, previous works did not consider the relationship between the source and target domains when applying SSTL. A domain can be represented as D = {X, P(X)}, where X is the feature space, P(X) is the probability distribution function, and a sample X = {x_1, ..., x_n} is drawn from X. A task can be represented as T = {Y, f(·)}, where Y is the label space and f(·) is the objective predictive function. T can be learned from the training data, which consists of pairs {(x_i, y_i)}, where x_i ∈ X and y_i ∈ Y. The function f(·) predicts the corresponding label f(x) of a new instance x; f(x) can also be interpreted as a conditional probability P(y|x) [26].
In SSTL, a source domain D_s with a learned task T_s can help to learn a target task T_t in the domain D_t. In most cases, D_s ≠ D_t and/or T_s ≠ T_t. DSTL, in contrast, aims to bring the marginal probability distributions of the two domains D_s and D_t close to each other, D_s ≈ D_t, by providing D_s with a large number of instances that are similar to D_t, especially when D_t has insufficient training samples. Hence, the performance of the prediction function f_T(x) for the learning task T_t can be improved. In most cases, the D_s data are larger than the D_t data. Unlike TrAdaBoost [27], which filters out source-domain instances that are dissimilar to the target domain, the proposed DSTL adds new instances to the source domain D_s that are similar to D_t, updating the weights of the pre-trained models and forming a distribution similar to the target domain. Figures 3 and 4 show sketches of instance transfer in SSTL and DSTL, respectively.

Definition 1 (Standard Transfer Learning).
Given a source domain D_s and learning task T_s, and a target domain D_t and learning task T_t, transfer learning aims to improve the learning of the target predictive function f(·) in D_t using the knowledge in D_s and T_s, where D_s ≠ D_t and/or T_s ≠ T_t. D_s ≠ D_t implies that either X_s ≠ X_t or P_s(X) ≠ P_t(X). T_s ≠ T_t implies that either Y_s ≠ Y_t or P(Y_s|X_s) ≠ P(Y_t|X_t).
Definition 2 (DSTL). Given a source domain D_s and learning task T_s, and a target domain D_t and learning task T_t, transfer learning aims to improve the learning of the target predictive function f(·) in D_t using the knowledge in D_s and T_s, where D_s ≈ D_t and T_s ≈ T_t. D_s ≈ D_t implies that X_s ≈ X_t and P_s(X) ≈ P_t(X). T_s ≈ T_t implies that Y_s ≈ Y_t and P(Y_s|X_s) ≈ P(Y_t|X_t).
In our context, the learning task is image classification (benign or malignant), and each pixel or weight is taken as a feature; hence X is the space of all pixel vectors, x_i is the ith pixel vector corresponding to some image, and X is a specific learning sample. Additionally, Y is the set of all labels, which is {Benign, Malignant} for this classification task, and y_i is either "Benign" or "Malignant". In our context, D_s can be a set of weight vectors together with their associated benign or malignant class labels. Based on the DSTL definition above, a domain is a pair D = {X, P(X)}; hence, the condition D_s ≈ D_t implies that X_s ≈ X_t and P_s(X) ≈ P_t(X). This indicates that the image features or their marginal distributions in D_s and D_t are related. Similarly, a task is defined as a pair T = {Y, P(Y|X)}; hence, the condition T_s ≈ T_t implies that Y_s ≈ Y_t and P(Y_s|X_s) ≈ P(Y_t|X_t). When D_t = D_s and T_t = T_s, the learning task becomes a traditional machine learning task. Moreover, when D_t ≠ D_s, then either (1) X_t ≠ X_s or (2) X_t = X_s but P(X_s) ≠ P(X_t), where x_s,i ∈ X_s and x_t,i ∈ X_t. In our case, situation (1) arises when one set of images consists of medical images and the other of natural images. Situation (2) corresponds to the case where the D_s and D_t images come from different patients or sources. Since medical images share many features in common compared to natural images, the DSTL technique creates an implicit relationship between D_s and D_t and extracts better feature maps than pre-trained models that have only been trained on natural images. DSTL can be considered a new strategy for adjusting the weights of pre-trained models by mapping the instances from D_s and D_t to a new domain space. The new space contains instances from both D_s and D_t, making it domain invariant.
In this research, various pre-trained models were fine-tuned on 98,967 augmented images of the CBIS-DDSM dataset and saved under the name of the original pre-trained network followed by the symbol (+) to distinguish them from the original pre-trained networks that were trained only on the ImageNet dataset. Next, the updated pre-trained models (+) were fine-tuned a second time on the augmented images of the target MIAS and BCDR datasets. Figure 5 illustrates the DSTL process. All the pre-trained networks used in this research share three common layers, namely the input layer, the FC layer, and the classification layer. By fine-tuning these three layers on the CBIS-DDSM dataset first, we update all the learnable parameters, and then fine-tune the networks on the target MIAS and BCDR datasets. Figure 6 shows the fine-tuned layers, where the input layer size is kept the same as the original 224 × 224, except for AlexNet, where the input size is 227 × 227. The FC and classification layers were fine-tuned in every pre-trained model: the output size of these layers was changed from the 1000 ImageNet object categories to the 2 classes benign and malignant. The classification layer computes the cross-entropy loss with mutually exclusive classes; it takes the output of the softmax layer and allocates each input to one of the K mutually exclusive classes using the cross-entropy function. Figure 6 shows only the layers that have been fine-tuned. It is worth mentioning that although the CBIS-DDSM, MIAS, and BCDR datasets are similar, they come from different sources; thus, these medical datasets have not been combined into a single dataset for the experiments in this research.
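The two-stage procedure above can be sketched schematically. The functions `replace_head`, `fine_tune`, and `dstl` below are hypothetical stand-ins (a model is represented as a plain dictionary, and fine-tuning merely records which datasets updated the weights); this is not the MATLAB training code used in the paper, only an illustration of the control flow.

```python
def replace_head(model, num_classes):
    """Swap the 1000-way FC/classification head for a num_classes-way one."""
    model = dict(model)
    model["fc_out"] = num_classes
    return model

def fine_tune(model, dataset_name):
    """Stand-in for a training loop: records which datasets shaped the weights."""
    model = dict(model)
    model["trained_on"] = model["trained_on"] + [dataset_name]
    return model

def dstl(pretrained, bridge_dataset, target_dataset, num_classes=2):
    # Shot 1: fine-tune on the large dataset similar to the target
    # (CBIS-DDSM), producing the "+" variant of the pre-trained network.
    plus_model = fine_tune(replace_head(pretrained, num_classes), bridge_dataset)
    # Shot 2: fine-tune the updated "+" network on the target dataset.
    return fine_tune(plus_model, target_dataset)

alexnet = {"trained_on": ["ImageNet"], "fc_out": 1000}
alexnet_dstl = dstl(alexnet, "CBIS-DDSM", "MIAS")
print(alexnet_dstl["trained_on"])  # → ['ImageNet', 'CBIS-DDSM', 'MIAS']
```

Dropping the `bridge_dataset` stage recovers ordinary SSTL, which makes the comparison between the two techniques explicit.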
This sheds light on the use of the DSTL technique in various other medical image classification tasks, such as liver, lung, and kidney cancer classification, where the collection of high-quality annotated images is very expensive. In this work, owing to the availability of breast cancer datasets, DSTL has been applied to breast cancer classification.

Execution Environment
All the experiments were performed on a PC with an Intel® Core™ i5-8400 CPU @ 2.80 GHz × 6 and 23 GB of RAM, together with an NVIDIA® TITAN Xp GPU with 12 GB of memory. The software environment was MATLAB R2019b with CUDA V10.2 and cuDNN 7.6.5, running on 64-bit Ubuntu 18.04.3.

Results
The most common performance evaluation metrics in the fields of computer vision and image processing were used to evaluate the performance of the pre-trained models with SSTL and DSTL for classifying benign and malignant breast X-ray images. The evaluation metrics include sensitivity, specificity, classification accuracy, and the receiver operating characteristic curve [28][29][30][31]. Finally, the performance of the different pre-trained networks is analyzed, including training time, epoch number, and iteration number.

Sensitivity
Sensitivity is also called the true positive (TP) rate; TP corresponds to malignant cases in this research. It is the number of true positive predictions over the number of actual positives, i.e., true positives plus false negatives (FN), defined as:

Sensitivity = TP / (TP + FN) (1)

Specificity
Specificity is also called the true negative (TN) rate; in this paper, TN corresponds to benign cases. It is the proportion of actual negative cases that are predicted as negative. The specificity formula is defined as:

Specificity = TN / (TN + FP) (2)

Accuracy
Accuracy, or overall accuracy, is the number of correctly predicted cases over all cases. It can be formulated as:

Accuracy = (TP + TN) / (TP + TN + FP + FN) (3)
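The three metrics above reduce to simple ratios over the confusion-matrix counts, as the following sketch shows. The confusion counts here are hypothetical illustrative numbers, with malignant as the positive class and benign as the negative class, matching the definitions above.

```python
def sensitivity(tp, fn):
    """True positive rate: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: TN / (TN + FP)."""
    return tn / (tn + fp)

def accuracy(tp, tn, fp, fn):
    """Overall accuracy: (TP + TN) / all cases."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical confusion counts for 100 test images.
tp, fn, tn, fp = 45, 5, 38, 12
print(sensitivity(tp, fn))       # → 0.9
print(specificity(tn, fp))       # → 0.76
print(accuracy(tp, tn, fp, fn))  # → 0.83
```

Reporting sensitivity and specificity alongside accuracy matters here because the benign/malignant classes are not perfectly balanced, so accuracy alone can hide a weak true positive rate.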

Receiver Operating Characteristic (ROC)
In this paper, the ROC curve is used to evaluate the performance quality of the pre-trained models and to present the area under the curve (AUC).

Figure 9. The ROC curve of various pre-trained models with SSTL for breast cancer classification using the BCDR dataset.

Figure 10. The ROC curve of various pre-trained models with DSTL for breast cancer classification using the BCDR dataset.

Table 7 summarizes the results of the pre-trained models using single-shot transfer learning on the CBIS-DDSM dataset. It can be noted from Table 7 that most of the pre-trained models produced reasonable results due to the large number of training samples they were trained on. Table 8 compares the pre-trained models with the SSTL and DSTL techniques. Table 9 compares the performance of the different pre-trained models in terms of training time, number of epochs, and number of iterations. As can be seen from Tables 8 and 9, the DSTL technique improved the accuracy and performance of the pre-trained networks significantly. In this research, instead of reinventing the wheel, the existing pre-trained models were used to evaluate transfer learning from ImageNet (SSTL) against our proposed technique (DSTL). We did not consider training from random initialization, as in [32]. The results in Table 8 show that DSTL enhances the performance of lightweight and non-lightweight models alike, and Table 9 shows that it provides faster convergence. Training from random initialization can be analyzed in future work.
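The AUC values reported with the ROC curves can be illustrated via the probabilistic (Mann-Whitney) form of the statistic: AUC equals the probability that a randomly chosen malignant case receives a higher predicted score than a randomly chosen benign case. The scores and labels below are made-up illustrative values, not results from the paper.

```python
def auc(scores, labels):
    """AUC = P(score of a random positive > score of a random negative),
    counting ties as half a win (positive label = 1, negative = 0)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]   # malignant scores
    neg = [s for s, y in zip(scores, labels) if y == 0]   # benign scores
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]   # predicted malignancy scores
labels = [1,   1,   1,    0,   0,   0]     # ground-truth classes
print(auc(scores, labels))  # → 0.8888888888888888 (= 8/9)
```

An AUC of 1.0 means every malignant case outranks every benign case, while 0.5 corresponds to random ranking, which is why AUC complements the threshold-dependent metrics above.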
In Table 9, the number of iterations and epochs differs for each model because we use the validation accuracy monitoring algorithm, which reduces the number of iterations and epochs by keeping track of the best validation accuracy and the number of validation checks; when the validation accuracy has not improved for some time (validation lag), an early stop of the training process is triggered. For example, if the validation accuracy does not improve for 10 iterations, the training process stops automatically.