Spatially localized sparse approximations of deep features for breast mass characterization

We propose a deep feature-based sparse approximation technique for classifying breast masses as benign or malignant in film-screen mammograms. This is a significant application, as breast cancer is a leading cause of death in the modern world and improvements in diagnosis may help to decrease mortality rates for large populations. While deep learning techniques have produced remarkable results in the field of computer-aided diagnosis of breast cancer, several aspects of this field remain under-studied. In this work, we investigate the applicability of deep-feature-generated dictionaries to sparse approximation-based classification. To this end, we construct dictionaries from deep features and compute sparse approximations of Regions Of Interest (ROIs) of breast masses for classification. Furthermore, we propose block and patch decomposition methods to construct overcomplete dictionaries suitable for sparse coding. The effectiveness of our deep feature spatially localized ensemble sparse analysis (DF-SLESA) technique is evaluated on a merged dataset of mass ROIs from the CBIS-DDSM and MIAS datasets. Experimental results indicate that dictionaries of deep features yield more discriminative sparse approximations of mass characteristics than dictionaries of imaging patterns and dictionaries learned by unsupervised machine learning techniques such as K-SVD. Of note is that the proposed block and patch decomposition strategies may help to simplify the sparse coding problem and to find tractable solutions. The proposed technique achieves performance competitive with state-of-the-art techniques for benign/malignant breast mass classification, using 10-fold cross-validation on merged datasets of film-screen mammograms.


Introduction
Cancer is one of the leading causes of death worldwide [1]. Among women, breast cancer is the most commonly diagnosed type of cancer [2]. It is projected that about 12% of all women in the U.S. will be diagnosed with breast cancer in their lifetime [3,4]. Mammography is one of the main imaging modalities used initially to diagnose breast cancer and is a standard preventive measure [5]. However, mammography examination by radiologists with variable clinical experience and training raises the issue of variability in interpretive performance [6]. Thus, an automated computer-aided diagnosis (CAD) system would be a useful assistive tool in modern medicine and a second opinion for medical professionals. Such systems can ultimately improve diagnostic accuracy and reduce the time and expense of diagnostic workflows.
Two important stages of conventional image classification systems are feature extraction and implementation of a classifier. Feature extraction is the process of finding image descriptors (features), ideally those with the most discriminative power. CAD systems have used handcrafted features, such as texture features and statistical features, to train a classifier [7][8][9][10]. In [8], feature extraction based on a statistical t-test, with the number of features optimized through thresholding, was used to distinguish between benign and malignant tumors. The extracted image features were fed into a support vector machine (SVM) classifier to perform class assignment.
Another category of techniques leverages the inherent sparsity of signals in nature to produce signal representations suitable for coding, super-resolution and classification [11][12][13][14]. These techniques aim to approximate test signals by linear combinations of column vectors (or atoms) chosen from dictionary matrices, seeking the sparsest coefficient vector that satisfies a constraint on the approximation residual. These matrices may be directly sampled from the training set, or learned from it, using dictionary learning approaches typically based on the K-SVD algorithm [11].
Furthermore, state-of-the-art machine learning systems, like deep neural networks, have shown considerable applicability in medical imaging classification tasks. An ideal dataset of medical images used in machine learning would have physician annotations and contain a sufficient number of data samples (i.e., millions of medical images) [15]. However, medical image datasets with these ideal components are not yet widely available. Considering the lack of sufficient data, several deep neural network classification approaches apply pretrained networks to medical imaging data [16,17]. Convolutional neural networks (CNNs) have achieved impressive performance in medical image classification, and recent work concentrates on mammography data [2][18][19][20][21][22][23][24][25][26]. In [18], transfer learning and data augmentation were used to overcome the challenge of limited mammography data. The authors validated end-to-end CNN classification on a shallow CNN architecture and two early CNNs, AlexNet and Googlenet. The authors in [23] proposed deep learning lesion detection and CNN classification. They compared the performance of a basic CNN to modified ResNet50 and InceptionResNetV2 architectures for breast lesion classification. Their experimental results showed that deep learning CAD systems achieved better accuracy than conventional systems such as SVM. In [20], transfer learning and fine-tuning were applied to the AlexNet and Googlenet architectures for breast mass classification. The authors observed that transfer learning outperformed a shallow CNN trained from scratch. The feature extraction capacity of CNNs, combined with a traditional classifier, is evaluated in [19]. The authors apply an ensemble classifier approach to classify benign and malignant mass ROIs from full field digital mammograms. The ensemble classifier averaged the output of an SVM classifier trained on CNN features and an SVM classifier trained on analytically extracted features. The ensemble classifier outperformed the individual SVM classifiers.
Despite the significant progress in this field, there is still a need for effective and interpretable classification models that may use the representation capacity of deep learning and be trainable on small-sized datasets. In this work, we leverage the inductive representation capacity of features computed by deep convolutional networks to form dictionaries for sparse approximation-based classification of breast masses in mammograms. An original contribution of this work is that it investigates the suitability of deep feature maps as dictionaries to be used for sparse approximations. In addition, the sparse coding module can be used for visualization and interpretability of deep learning. For example, we can display the atoms/training samples that best approximate an unknown sample. Furthermore, we develop and evaluate block and patch selection, reconstruction and decision fusion techniques to increase the number and diversity of dictionary atoms for sparse analysis. We apply our deep feature spatially localized ensemble sparse analysis method (DF-SLESA) on a merged dataset of breast mass regions of interest (ROIs) from mammograms for separation of benign and malignant masses. We compare this technique to fully sparse-based classifiers and to end-to-end CNN classification. Experimental results suggest that deep feature-based dictionaries yield more discriminant sparse approximations of mass characteristics than pixel intensity-based dictionaries and K-SVD learned dictionaries [11], and improve classification performance.

Methods
In this section, we first describe the CNN architectures that will be interfaced with the sparse approximation stage. We then detail our Deep Feature-Spatially Localized Ensemble Sparse Analysis (DF-SLESA) technique.

Convolutional neural network descriptions
2.1.1. Googlenet-The Googlenet introduced a network architecture that utilizes the Inception module. The Googlenet made its debut as a submission in the ILSVRC14 competition and outperformed the revolutionary AlexNet network in accuracy with 10 times fewer parameters. It was designed with practical use and computational expense in mind. This network first employs two convolutional layers, each followed by max pooling. Next, the architecture includes 3 stages of Inception modules, each followed by max pooling. The final Inception module is followed by an average pooling layer. Googlenet introduces a 1 × 1 convolutional kernel that computes reductions before convolutions in the Inception module to reduce computational expense. Convolution kernels of size 1 × 1, 3 × 3, and 5 × 5 are employed within an Inception module and the outputs of all convolutions are concatenated to produce the final feature map. To tackle the vanishing gradient problem during training, auxiliary classifiers are added to intermediate layers. A linear layer with a softmax loss is used as the classifier [27].

2.1.2. InceptionV3-InceptionV3 shares similarities with the Googlenet as it includes a sequence of convolutional layers followed by Inception modules and a linear softmax classifier. Some of the unique properties of the InceptionV3 network are the factorization into smaller convolutions using 3 × 3 kernels, asymmetric convolutions to reduce the number of parameters, and batch normalization in the fully connected layer of the auxiliary classifier. InceptionV3 produced the lowest Top-1 and Top-5 errors on the ImageNet dataset at the time it was published [28].

2.1.3. ResNet architectures-In 2015, a new network architecture known as Residual Network, or ResNet, emerged in deep learning and introduced residual connections, which were beneficial for training deeper networks. The main concept is that shortcut connections (or skip connections) can be added to a plain network to facilitate learning of the deeper layers. Skip connections essentially allow activations from a layer to be fed to a layer deeper in the network. The ResNet family of CNNs follows a structure of an initial convolution stage without skip connections, followed by multiple stages of convolutional layers with residual connections, average pooling, and ending with a fully connected layer. In [29], deeper ResNets such as ResNet50 and ResNet101 yielded significantly better classification accuracy than the baseline ResNet18 and ResNet34 models.

2.1.4. DenseNet-Recent studies have shown that shorter connections between layers near the input and layers near the output improve training efficiency and accuracy, and support increased network depth. The authors in [30] proposed a densely connected convolutional neural network, DenseNet, that has L(L + 1)/2 direct connections, unlike traditional L-layer CNNs that have L connections. DenseNets use a simple connectivity pattern in which all layers are connected directly to each other. The feed-forward nature is maintained by ensuring that every layer receives additional inputs from all preceding layers and passes its own feature maps on to all subsequent layers. Features are combined through concatenation (unlike summation as in ResNets) before passing on to subsequent layers. An important difference between DenseNet and other networks is that DenseNet can have very narrow layers, controlled by a growth rate hyperparameter.

2.1.5. InceptionResNetV2-The InceptionResNetV2 [31] network is an extension of the InceptionV3 network. InceptionResNetV2 incorporates residual connections into the Inception architecture. The experimental results presented in [31] show that the use of residual connections in the Inception architecture accelerates training. The InceptionResNetV2 architecture begins with a stem block that has a series of 3 × 3 and 1 × 1 convolutions of different strides, filter concatenation and max pooling. Subsequently, this design employs multiple Inception-ResNet-A modules followed by a reduction layer, and multiple Inception-ResNet-B modules also followed by a reduction layer. The network architecture ends with multiple Inception-ResNet-C modules, average pooling and a softmax layer for classification.
2.1.6. Xception-The Xception (or Extreme Inception) network proposed in [32] applies depthwise separable convolution layers. The Xception network architecture has a three-stage model that consists of an entry flow, a middle flow and an exit flow. In the entry flow, two initial convolution layers perform 3 × 3 convolutions, each followed by a ReLU layer. Subsequently, three separable convolution blocks follow and the entry flow outputs 728 feature maps of size 19 × 19. The middle flow has one separable convolutional block that is repeated 8 times. The exit flow takes as input the output of the middle flow and performs additional separable convolutions, global average pooling, and lastly logistic regression.
The depthwise separable convolutional layers function like Inception modules. Experimental results reported in [32] show that when residual connections are added to the Xception architecture there is a significant boost in accuracy.

Deep feature-spatially localized ensemble sparse analysis
We combine localized sparse approximations and the feature extraction capabilities of convolutional neural networks to classify breast masses into malignant or benign states.
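The DF-SLESA method workflow has three major stages, namely, deep feature extraction, deep dictionary construction, and ensemble classification and decision making [33]. Further details of the DF-SLESA method and its main stages are given in the following sections. In the deep feature extraction stage, the activations $y_m^{(L)}$ at layer $L$ of a deep CNN are computed recursively from those of layer $L-1$ through a nonlinear activation function $\varsigma(\cdot)$. This process can be expressed mathematically as

$$y_m^{(L)} = \varsigma\left( \sum_{n=1}^{N^{(L-1)}} w_{m,n}^{(L)} * y_n^{(L-1)} + b_m^{(L)} \right), \tag{2.1}$$

where $L$ is the layer number, $N^{(L-1)}$ represents the number of kernels at the $L-1$ layer, $m$ represents the sample id, $w$ is the weight kernel, $b$ is the bias and '*' denotes convolution.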
Deep features for each ROI were extracted from the layer that precedes the first fully connected layer. Average pooling (AP) is applied to reduce the deep feature vector length to $l$. Once all training mass ROIs undergo deep feature extraction, together they form a deep feature dictionary specific to the employed convolutional neural network.
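The pooling and dictionary-forming step can be illustrated with a minimal NumPy sketch. It assumes non-overlapping averaging windows of equal width, which the text does not specify, and the names `average_pool_1d` and `build_deep_dictionary` are hypothetical helpers introduced only for illustration.

```python
import numpy as np


def average_pool_1d(feature, target_len):
    """Shrink a 1-D deep feature vector to `target_len` by window averaging.

    Assumption: non-overlapping windows of equal width, so the input
    length must be a multiple of `target_len`.
    """
    feature = np.asarray(feature, dtype=float)
    if feature.size % target_len != 0:
        raise ValueError("feature length must be a multiple of target_len")
    # Each row of the reshaped array is one pooling window.
    return feature.reshape(target_len, -1).mean(axis=1)


def build_deep_dictionary(train_features, target_len):
    """Stack pooled training features as columns (atoms) of a dictionary."""
    atoms = [average_pool_1d(f, target_len) for f in train_features]
    return np.stack(atoms, axis=1)  # shape: (target_len, number of samples)
```

Under these assumptions, pooling a hypothetical feature vector of length 2048 to $l = 256$ would average windows of 8 consecutive values, and the resulting dictionary would have one column per training ROI.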

2.2.2. Deep dictionary construction using BlockBoosting-BlockBoost decomposition constructs dictionaries of spatial information from specific regions of the training feature vectors. Each training feature vector is divided into blocks, that is, spatially localized patches, of length $s$. Combining all blocks with spatial index $i$ from the training features generates a block-specific dictionary of the form

$$D_i = [d_i^1, d_i^2, \ldots, d_i^k],$$

where $d_i^n$ denotes the $i$-th block of the $n$-th training feature vector and $k$ is the number of training samples. Sparse representations of each block $y_i^j$ of a test sample $y^j$ are computed by optimizing the following objective function given a noise margin $\epsilon$:

$$x_i^j = \arg\min \|x_i^j\|_1 \quad \text{subject to} \quad \|y_i^j - D_i x_i^j\|_2 < \epsilon, \tag{2.2}$$

where $j$ represents the test sample id.
In this manner, sparse representation classification for each test sample is achieved using the corresponding block dictionary from the training set. We denote the number of patches per sample as $n_p = \lfloor l/s \rfloor$. Therefore, $n_p$ unique sparse coders are applied to make a classification decision for a single test sample, as shown in Figure 1.
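The BlockBoosting steps described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: a greedy orthogonal matching pursuit (OMP) routine stands in for the $\ell_1$ minimization, with the same residual threshold playing the role of the noise margin, and all function names are hypothetical.

```python
import numpy as np


def split_blocks(feature, s):
    """Split a feature vector into n_p = floor(l / s) blocks of length s."""
    feature = np.asarray(feature, dtype=float)
    n_p = feature.size // s
    return feature[: n_p * s].reshape(n_p, s)


def block_dictionaries(train_features, s):
    """Return one dictionary per block index; each has shape (s, k)."""
    blocks = np.stack([split_blocks(f, s) for f in train_features])  # (k, n_p, s)
    return [blocks[:, i, :].T for i in range(blocks.shape[1])]


def omp(D, y, eps=1e-6, max_atoms=None):
    """Greedy stand-in for the l1 problem: add atoms until residual < eps."""
    D = np.asarray(D, dtype=float)
    y = np.asarray(y, dtype=float)
    max_atoms = max_atoms or min(D.shape)
    norms = np.linalg.norm(D, axis=0)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero atoms
    x = np.zeros(D.shape[1])
    support = []
    residual = y.copy()
    while np.linalg.norm(residual) >= eps and len(support) < max_atoms:
        corr = np.abs(D.T @ residual) / norms
        if support:
            corr[support] = -1.0  # never pick the same atom twice
        support.append(int(np.argmax(corr)))
        # Least-squares refit on the atoms selected so far.
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
        x = np.zeros(D.shape[1])
        x[support] = coef
    return x


def code_test_sample(test_feature, dictionaries, s, eps=1e-6):
    """Apply the i-th sparse coder to the i-th block of a test sample."""
    blocks = split_blocks(test_feature, s)
    return [omp(D_i, y_i, eps) for D_i, y_i in zip(dictionaries, blocks)]
```

Each call to `code_test_sample` returns $n_p$ coefficient vectors, one per block-specific coder, which is the quantity that the subsequent decision stage consumes.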

2.2.3. Deep dictionary construction using PatchSampling-PatchSampling decomposition begins in a similar fashion to BlockBoosting by dividing the deep feature vectors into 1-D blocks of length $s$. The major difference in this decomposition method is that it forms a single deep feature dictionary that is composed of all training patches. A test sample block feature $y_i^j$ is classified by finding sparse solutions, given the deep feature dictionary. A dictionary of this form is not index specific; therefore, it contains many more atoms than the block-specific dictionary $D_i$. The PatchSampled dictionary is defined as

$$D = [D_1, D_2, \ldots, D_{n_p}],$$

where $n_p = \lfloor l/s \rfloor$ represents the number of feature patches per subject. The following objective function is optimized to find sparse solutions $x_i^j$:

$$x_i^j = \arg\min \|x_i^j\|_1 \quad \text{subject to} \quad \|y_i^j - D x_i^j\|_2 < \epsilon. \tag{2.3}$$

We illustrate in Figure 2 the class prediction flowchart of a single test sample when PatchSample decomposition is applied. In contrast to BlockBoosting, the PatchSampled deep dictionary $D$ requires just a single sparse coder to approximate each test patch. However, the same number of patch decision scores as in BlockBoosting is combined to make a classification decision.
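A short NumPy sketch of how such a pooled patch dictionary could be assembled is given below. The function name `patch_dictionary` and the per-atom label array are illustrative assumptions; the columns here are ordered sample by sample rather than block by block, which does not affect sparse coding because atom order is arbitrary.

```python
import numpy as np


def patch_dictionary(train_features, labels, s):
    """Pool every length-s block of every training feature into one dictionary.

    Returns D with shape (s, k * n_p) and the class label of each atom.
    """
    train_features = np.asarray(train_features, dtype=float)
    k, l = train_features.shape
    n_p = l // s  # number of patches per sample
    # Row-major reshape keeps the n_p patches of one sample adjacent.
    patches = train_features[:, : n_p * s].reshape(k * n_p, s)
    atom_labels = np.repeat(np.asarray(labels), n_p)
    return patches.T, atom_labels
```

With $k$ training samples the dictionary holds $k \cdot n_p$ atoms, compared with $k$ atoms in each block-specific dictionary, which makes the larger size of the PatchSampled dictionary concrete.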

2.2.4. Ensemble classification-Our approach uses a block-based log-likelihood decision score to make an ensemble classification decision. The log-likelihood approximation decision function is defined as

$$LLS(x_i^j) = -\log \frac{r_m(x_i^j)}{r_n(x_i^j)}, \tag{2.4}$$

where $r_m(x_i^j)$ and $r_n(x_i^j)$ are the approximation residuals using the $m$th and $n$th class-specific atoms, respectively. For example, $r_m(x_i^j) = \|y_i^j - D_i x_i^{jm}\|_2$ is the residual of approximation using dictionary atoms and solution coefficients $x_i^{jm}$ from the $m$th class only.
To estimate the ensemble score $ELLS(x^j)$, we average the individual scores over the $n_p$ blocks,

$$ELLS(x^j) = \frac{1}{n_p} \sum_{i=1}^{n_p} LLS(x_i^j). \tag{2.5}$$

We employ the sign function to determine the class prediction $\omega_c$, that is

$$\omega_c = \begin{cases} m, & \operatorname{sign}\left(ELLS(x^j)\right) \geq 0, \\ n, & \text{otherwise}. \end{cases} \tag{2.6}$$
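The decision rule can be sketched in a few lines of Python. The sketch assumes that the score is the log ratio of the class-$n$ residual to the class-$m$ residual, so that a non-negative ensemble score favors class $m$; this sign convention, the small constant `tiny` that guards against division by zero, and the function names are assumptions made for illustration.

```python
import numpy as np


def class_residual(D, y, x, atom_labels, cls):
    """Residual of approximating y with the atoms of one class only."""
    x_cls = np.where(np.asarray(atom_labels) == cls, x, 0.0)
    y = np.asarray(y, dtype=float)
    return np.linalg.norm(y - np.asarray(D, dtype=float) @ x_cls)


def lls(D, y, x, atom_labels, cls_m, cls_n, tiny=1e-12):
    """Log ratio of class residuals for one block (assumed sign convention)."""
    r_m = class_residual(D, y, x, atom_labels, cls_m)
    r_n = class_residual(D, y, x, atom_labels, cls_n)
    return float(np.log((r_n + tiny) / (r_m + tiny)))


def ensemble_decision(block_scores, cls_m, cls_n):
    """Average the block scores and predict from the sign of the mean."""
    ells = float(np.mean(block_scores))
    return (cls_m if ells >= 0 else cls_n), ells
```

A block whose class-$m$ atoms explain it ten times better than its class-$n$ atoms contributes a score of $\log 10 \approx 2.3$, so a few confident blocks can outweigh several weakly opposing ones in the average.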

Experiments
We utilized 10-fold cross-validation to evaluate the performance of our DF-SLESA method for classification of breast masses as benign or malignant. For comparison, we performed conventional sparse representation classification (SRC), spatially localized ensemble sparse analysis (SLESA) [34,35], label-specific dictionary learning SLESA (LS-SLESA) [36] and the aforementioned end-to-end CNNs on the MergedBreast dataset. In conventional SRC experiments, when no block decomposition is applied, we apply dimensionality reduction via principal component analysis (PCA). Table 1 lists the length of the deep features extracted by each convolutional neural network.
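For readers unfamiliar with the protocol, a minimal sketch of 10-fold index splitting follows. It assumes a plain shuffled split without class stratification, since the text does not state how folds were drawn, and `k_fold_indices` is a hypothetical helper.

```python
import random


def k_fold_indices(n_samples, n_folds=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Assumption: a shuffled, non-stratified split with a fixed seed.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding assigns samples to folds whose sizes differ by at most one.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx
```

Every sample appears in exactly one test fold, so the reported rates are computed over the full dataset while each prediction comes from a model that never saw that sample during training.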

Evaluation measures
There are several common performance measures used to assess a classifier, such as precision, recall, accuracy, and area under the ROC curve. In this work, we measure the classification performance by calculating the true positive rate (TPR), true negative rate (TNR), classification accuracy (ACC), and area under the receiver operating characteristic curve (AUC).
Classification accuracy indicates how often a classifier makes a correct prediction. Accuracy is determined using the following equation:

$$ACC = \frac{TP + TN}{TP + TN + FP + FN}, \tag{3.1}$$

where $TP$, $TN$, $FP$, and $FN$ represent the number of true positives, true negatives, false positives, and false negatives, respectively. The true positive rate (TPR) (or recall/sensitivity) is an indication of how well our classifier correctly predicts malignancy,

$$TPR = \frac{TP}{TP + FN}. \tag{3.2}$$

Similarly, the true negative rate (TNR) indicates how well our classifier correctly classifies benign masses,

$$TNR = \frac{TN}{TN + FP}. \tag{3.3}$$

The AUC is the area under the receiver operating characteristic (ROC) curve. The ROC curve graphs TPR versus the false positive rate (FPR), where FPR = 1 − TNR.
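These measures can be computed directly from predictions, as in the following plain-Python sketch. The function names are illustrative, malignant is assumed to be the positive class, and the AUC is obtained through the equivalent rank formulation (the fraction of positive/negative pairs ranked correctly, with ties counted as one half).

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN with `positive` as the malignant label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn


def summary_rates(tp, tn, fp, fn):
    """Accuracy, true positive rate and true negative rate."""
    return {
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "TPR": tp / (tp + fn),
        "TNR": tn / (tn + fp),
    }


def auc_from_scores(pos_scores, neg_scores):
    """AUC as the probability that a positive outscores a negative."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))
```

Reporting TPR and TNR alongside ACC matters here because a classifier can reach a reasonable accuracy while favoring one class, which is the kind of imbalance discussed in the results.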

Data description
We formed a merged dataset, which we refer to as MergedBreast, by combining benign and malignant mass ROIs from two publicly available datasets, Mammographic Imaging Analysis Society (MIAS) [37,38] and the Curated Breast Imaging Subset of DDSM (CBIS-DDSM) [39]. The MIAS dataset contains 322 mediolateral oblique (MLO) view mammograms from a total of 161 patients. The centroid and radius of each mass are provided from radiologist examination readings, and these values are used to generate a bounding box for cropping regions of interest (ROIs). The CBIS-DDSM is an updated version of the Digital Database for Screening Mammography (DDSM) dataset and contains 10,239 mammogram images from 1,566 patients. This dataset provides craniocaudal (CC) and MLO mammogram views of breast masses and calcifications with verified pathology. The CBIS-DDSM dataset is considered a standardized version of DDSM, as questionable cases were removed and image processing steps such as image decompression, noise reduction, and image cropping were applied. The CBIS-DDSM dataset provides ROIs that were cropped by a bounding box about the center of each abnormality. The experimental procedures involving human subjects were approved by the Institutional Review Boards of the institutions where the data were acquired.
We combined the mass ROIs from the MLO view images in the mass training subset of CBIS-DDSM with the ROIs from MIAS to form the MergedBreast dataset. A total of 388 malignant masses and 434 benign masses were included in the MergedBreast dataset. The numbers of malignant and benign samples from both datasets are listed in Table 2. To prepare the MergedBreast ROIs for DF-SLESA, SRC and other SLESA methods, the ROIs are downsampled or upsampled to a fixed size.

Pre-processing
To improve the contrast between the masses and their surrounding tissues and to reduce the noise level, we applied image enhancement techniques to the complete mammograms and then cropped the mass ROIs. The mammogram image enhancement pipeline begins with median filtering and artifact removal (i.e., removal of label annotations). It then applies unsharp masking, Gaussian filtering and morphological edge enhancement. Finally, it employs wavelet frames for reconstruction [40] and contrast-limited adaptive histogram equalization (CLAHE) to increase the image contrast.

Results
In Figure 4 (first row) and Table A1 we report the classification performance of conventional SRC and SRC with dictionary learning (LS-SRC) on the MergedBreast dataset. No block decomposition is applied in these experiments, thus providing a baseline for BlockBoosting and PatchSampling performance. We resized the ROIs to dimensions of 128 × 128, 64 × 64 and 32 × 32 pixels in this set of experiments. We then vectorized the ROIs and applied PCA for dimensionality reduction. The classification rates indicate that dictionary learning with resizing to 32 × 32 produces the best classification accuracy. In general, the use of dictionary learning through K-SVD [12] slightly improves the ACC and AUC rates.
BlockBoosting and PatchSampling SLESA methods do not improve performance when compared to SRC, as shown in Figure 4 (second row) and Table A2. The significant disparity between TPR and TNR is a consistent trend in BlockBoosting and more so in PatchSampling SLESA. Dictionary learning through K-SVD improves TPR rates in all PatchSample SLESA experiments, thus providing a better balance between TPR and TNR performances.
As a baseline comparison for our DF-SLESA experiments, we also evaluated the end-to-end classification performances of the same CNNs that we used in DF-SLESA. We applied fine-tuning to networks that were pre-trained on ImageNet. We employed Bayesian optimization to tune the minibatch size (8 to 128) and number of epochs (2 to 80). Geometric transforms, such as rotation and random horizontal and vertical reflection, were used for data augmentation. The network weights were updated in the training stage of CNN classification using Adam optimization with an initial learning rate of $10^{-3}$, a learning rate drop factor of 0.95 per epoch, momentum of 0.9 and $\ell_2$ regularization of $10^{-4}$.
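The learning rate schedule implied by these settings can be written out explicitly. The sketch below assumes that the drop factor is applied once after every completed epoch, which the text suggests but does not spell out; `learning_rate` is a hypothetical helper.

```python
def learning_rate(epoch, initial_lr=1e-3, drop_factor=0.95):
    """Learning rate at a zero-based epoch under a per-epoch multiplicative drop."""
    return initial_lr * drop_factor ** epoch
```

Under this assumption the rate decays geometrically, so runs at the upper end of the tuned epoch range finish with a learning rate far below the initial value, while short runs stay close to $10^{-3}$.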
In our DF-SLESA experiments, we tested block lengths of 512, 256 and 128 in block decomposition. Since ResNet18 deep features have a dimensionality of 512, a block length of 512 yields a single block, so block decomposition is not applicable in that case. In Table 3, we report the performance of SRC using deep features. Before block decomposition, we see that deep features improve SRC classification performance significantly, by approximately 10%. Block decomposition further improves classification performance, as seen in Tables A3 and A4. We denote the DF-SLESA method by the convolutional neural network name followed by -SLESA. Furthermore, Figure 5 provides a summary of classification accuracy and area under the ROC curve of DF-SLESA methods using BlockBoosting and PatchSampling. BlockBoosting and PatchSampling performance in DF-SLESA experiments indicates the limited scalability of sparse classifiers. As the number of training samples increases through block decomposition with smaller block lengths, the performance of PatchSampling generally declines. Dimensionality reduction by block decomposition is seemingly most efficient with a reasonable training set size for sparse solvers. While BlockBoosting uses block-specific sparse coders and PatchSampling uses a single sparse coder for all test patches, the size of the dictionary seems to have a greater impact on performance than the number of sparse coders used in representation. Convolutional neural networks, on the other hand, thrive with larger amounts of data, thus having greater scalability.

Discussion
Comparing end-to-end CNN classification rates (Table 3) with DF-SLESA classification performance (Figure 5, and Tables A3 and A4) indicates that our sparse ensemble classifier outperforms the neural network classifiers for the majority of CNNs. For instance, SLESA with PatchSample decomposition on InceptionV3 deep features significantly outperforms end-to-end classification using InceptionV3 alone. Densenet201 end-to-end classification produced the top accuracy among end-to-end CNN classifiers, achieving an ACC of 72.29% and AUC of 77.51%. However, DF-SLESA produces rates competitive with those of CNNs. DF-SLESA methods outperform CNN classification for Googlenet, InceptionV3, ResNet50, ResNet101 and InceptionResNetV2 features. Overall, the use of block decomposition yields improvement in DF-SLESA performance for most network deep features, as it reduces the dimensionality of the approximation problem. In addition, the InceptionV3-SLESA method achieved the highest ACC of 72.31% and second best AUC of 77.04%. Figure 6 displays the ACC, AUC, TPR and TNR measures produced by DF-SLESA with average pooling of 256 and BlockBoosting with block length 128, corresponding to the top InceptionV3-SLESA performer.
Furthermore, comparisons with related work may not be straightforward, as the reported results of mammography classification seem to be highly heterogeneous. The major factors that may cause the variability of results are the type of mammograms (digital or film screen) and views (MLO and/or CC), the classification task (normal versus abnormal tissue, benign versus malignant, masses and/or calcifications), and the use of ROIs versus complete mammograms [2,21,22]. In a directly comparable work published in [41], the authors employ an ad hoc CNN architecture for classification of benign and malignant masses from the CBIS-DDSM dataset and achieve 71.19% ACC. In [42], the authors performed a similar task of benign versus malignant mass classification using sparse representation classification on the DDSM dataset. Their patch ROI characterization and classification strategy was assessed on 80 masses with various mass margins and obtained 70.00% classification accuracy.

System interpretability.
The subject of system interpretability becomes increasingly important as machine learning is employed in multiple domains [43]. In clinical applications, interpretability may help to increase the level of confidence in the applicability of a system. A strength of sparse approximation-based approaches is their interpretability, as it is possible to monitor the sparse solutions, the approximations, and the residuals.
We first aim to obtain insight into the deep dictionaries that we use for sparse approximations. In Figure 7, we display t-SNE clustering plots using 5-D embeddings of InceptionV3 deep feature data. We observe, through the single feature histograms and pair-wise scatterplots, significant similarity of benign and malignant deep features for the complete feature vector and for the first block features. However, some feature components show greater separability, such as the second component of the complete InceptionV3 feature data. We computed the Kullback-Leibler (KL) divergence to measure the class discrimination for each feature component. The greatest KL divergence measure for the complete feature vector data, 0.3304, is produced by component two. The greatest KL divergence measure for the first block feature data, 0.1843, is produced by component two.
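One way to estimate such a per-component KL divergence is from class-wise histograms, as in the NumPy sketch below. The text does not describe the estimator, so the shared binning, the number of bins, the smoothing constant `eps` and the direction of the divergence (first argument relative to second) are all assumptions of this illustration.

```python
import numpy as np


def kl_divergence(samples_a, samples_b, bins=20, eps=1e-10):
    """Histogram estimate of KL(a || b) for one embedding component."""
    a = np.asarray(samples_a, dtype=float)
    b = np.asarray(samples_b, dtype=float)
    # Shared bin edges so that both histograms are directly comparable.
    lo = min(a.min(), b.min())
    hi = max(a.max(), b.max())
    p, _ = np.histogram(a, bins=bins, range=(lo, hi))
    q, _ = np.histogram(b, bins=bins, range=(lo, hi))
    # Smooth and normalize so that empty bins do not produce log(0).
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))
```

Applied to the benign and malignant values of each embedding component in turn, the component with the largest value would be the one whose class distributions overlap least under this estimator.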
Next, we explore the decision function of DF-SLESA techniques defined in (2.4), which uses log ratios of residuals, by displaying histograms of the decision scores. We calculated the histogram of scores for a single test fold in 10-fold cross-validation. We used InceptionV3-SLESA to explore the cases of a single block, PatchSampling, and BlockBoosting. Figure 8 displays the histograms of decision scores for each case. We observe that the scores approximately form two distributions that correspond to each mass type. We note that the scores of malignant masses are more dispersed than those of the benign masses. This explains the lower true positive rate in our results. In addition, the probability of error generally decreases as the score magnitude increases. This indicates that ELLS values could be used to produce heat maps, or to provide a level of confidence that could help to make a decision in a diagnostic workflow.

Conclusions
Benign and malignant mass separation is considered a more challenging task in machine learning than natural image classification. In this work, we use the inductive representation capacity of CNNs to form dictionaries for sparse approximation-based classification of mass ROIs in a merged mammography dataset. Our aim is to show that this approach yields an effective and interpretable classification technique. Our results indicate that deep features produce numerically feasible sparse approximations, which ultimately improves the performance of our sparse analysis methods.

Figure 1. Class prediction flow diagram for BlockBoost decomposition. SC_i corresponds to the sparse coder specific to block i.

Figure 6. Individual DF-SLESA (BB) performances using a fixed average pooling length of 256 and block length of 128.

2.2.1. Deep feature extraction-For an L layer deep CNN, a recursive nonlinear activation function ς(·) is used to compute the activations $y_m^{(L)}$ produced by the training sample

Figure 3. Examples of pre-processing enhancement on benign and malignant mass ROIs. (a) a benign mass ROI before and after enhancement (left-to-right) from the MIAS dataset, (b) a malignant mass ROI before and after enhancement (left-to-right) from the MIAS dataset, (c) a benign mass ROI before and after enhancement (left-to-right) from the CBIS-DDSM dataset, (d) a malignant mass ROI before and after enhancement (left-to-right) from the CBIS-DDSM dataset.

Figure 4. Classification accuracy and area under the ROC curve of SRC and LS-SRC (first row) and SLESA and LS-SLESA (second row).

Figure 5. Classification accuracy and area under the ROC curve of DF-SLESA methods using BlockBoosting and PatchSampling.

Figure 7. t-SNE clustering plots with 5-D embedding of InceptionV3 deep features produced by DF-SRC (top) and DF-SLESA with BlockBoosting (bottom). The greatest KL divergence measure for DF-SRC, 0.3304, is produced by component two. The greatest KL divergence measure for DF-SLESA(BB), 0.1843, is produced by component two.

Figure 8. Decision scores for a single test fold in 10-fold cross-validation, by class, using InceptionV3 deep features with (a) no block decomposition, (b) PatchSampling with a block length of 512, (c)-(d) BlockBoosting with a block length of 512, (e) combined scores for both blocks for BlockBoosting with a block length of 512.

Table 4 provides a comparison summary of the top performances of all methods. Considering the high level of difficulty of benign and malignant mass separation in mammograms, our DF-SLESA methods achieved promising classification performance. The InceptionV3 deep features in conjunction with our SLESA model produce the best performance with the use of block decomposition. InceptionV3-SLESA with PatchSampling, average pooling of 256 and a block length of 128 produced the best classification performance of 72.31% ACC and 77.04% AUC. InceptionResNetV2-SLESA with BlockBoosting is the second top performing DF-SLESA method. Given that deep features are extracted after 94 convolutional layers in the InceptionV3 network and after 244 convolutional layers in the InceptionResNetV2 network, we note that the number of convolutional layers appears to play a major role in the quality of the deep features produced.

Table 1. Dimensionality of extracted deep features by convolutional neural network.

Table 2. Number of malignant and benign samples from both datasets.