Adaptive learning cost-sensitive convolutional neural network

Real-world classification often encounters a problem called class imbalance. When some classes have far more data than others, traditional classifiers usually bias their decision boundaries towards the majority classes. Most existing cost-sensitive strategies either ignore the hard-to-learn examples or require a large number of hyper-parameters. This article proposes an adaptive learning cost-sensitive convolutional neural network to solve this problem. During training, the proposed method embeds a class-dependent cost for each class into the global error, biasing the decision boundary towards the minority classes. Meanwhile, a distribution weight is assigned to each example to enhance the learning of the hard-to-learn examples. Both the class-dependent costs and the distribution weights are learnt automatically in the network. This cost-sensitive approach makes the algorithm focus on the examples in the minority classes as well as the hard-to-learn examples in each class. Moreover, the approach can be applied to both binary and multi-class image classification problems without any modification. Experiments are conducted on four image classification datasets to evaluate this algorithm. The experimental results show that the proposed method achieves better performance than the baseline algorithms and several other algorithms.


| INTRODUCTION
A dataset is imbalanced if some of its classes have fewer data than the others. This situation arises in many real-world applications, such as financial crisis prediction [1], medical diagnosis [2] and e-mail filtering [3]. In these scenarios, the minority class examples are so rare that it is difficult for researchers to collect sufficient data to represent them. Since the data distribution of the training set is asymmetric, traditional algorithms that aim to minimise the global error often misclassify the minority class examples into the majority classes. Yet the minority classes are often the ones that researchers are more interested in [1][2][3].
Many approaches have been proposed to solve the imbalance problem. These approaches can be divided into two categories: data-level approaches and algorithm-level approaches. Data-level approaches change the data distribution by under-sampling the majority classes or over-sampling the minority classes. Among them, the under-sampling methods [4,5], which solve the class imbalance problem by removing examples from the majority classes, may lose useful information. The over-sampling methods, which solve the class imbalance problem by replicating examples from the minority classes, may cause the model to over-fit during training. To overcome over-fitting, Chawla et al. [6] proposed the synthetic minority over-sampling technique (SMOTE), which generates synthetic minority class examples by linear interpolation. However, SMOTE may suffer from a problem called overgeneralization, since some of its generated minority class examples lie inside the convex hull of the majority class examples. To solve this overgeneralization problem, several variants of SMOTE have been proposed, such as SL-SMOTE [7], RSB-SMOTE [8], B-SMOTE [9] and LN-SMOTE [10]. However, a main shortcoming of these variants is their large computational cost. Algorithm-level approaches modify traditional classifiers designed for balanced datasets so that the modified algorithms can be applied to imbalanced datasets. A popular algorithm-level approach is the cost-sensitive strategy, which assigns different misclassification costs to each class in the imbalanced dataset. Kukar et al. [11] proposed four cost-sensitive approaches to modify neural networks. Wang et al. [12] proposed a boosting-SVM with an asymmetric cost algorithm to solve the imbalance problem. Arya et al. [13] proposed a cost-sensitive support vector machine algorithm by extending the SVM hinge loss with cost information. Sun et al.
[14] proposed three cost-sensitive boosting algorithms by inserting cost items into the adaptive-boosting algorithm (AdaBoost). Zhang et al. [15] proposed an evolutionary cost-sensitive extreme learning machine algorithm, which can adaptively learn the cost matrix through an evolutionary backtracking search algorithm. Ting et al. [16] introduced an instance-weighting method to induce cost-sensitive trees. Zhang et al. [17] integrated cost information into a deep belief network and used an adaptive differential evolution method to estimate the misclassification costs based on the training set. However, none of these methods can handle the class imbalance problem for the convolutional neural network (CNN), which has become the most popular tool for image classification. In recent years, new algorithms have been proposed to insert class-dependent costs into deep CNNs. Wang et al. [18] proposed a loss function called mean false error, together with its improved version mean squared false error, to train deep neural networks on imbalanced datasets. Raj et al. [19] integrated a class-to-class separability score into a deep CNN to improve the classification performance on an imbalanced dataset. However, these two methods mainly focus on the binary classification problem and cannot be applied to the multi-class classification problem directly. Chung et al. [20] designed a smooth one-sided regression loss function for the multi-class classification problem, which embeds the cost information into the training process of deep learning. However, the misclassification cost for each class is assigned based on expert judgement, which becomes a tedious task for a large number of classes. Khan et al. [21] proposed a cost-sensitive deep neural network in which the costs are determined by the data distribution, the classification errors and the class separability.
Although Khan's method can jointly optimise the neural network parameters and the class-dependent costs, implementing it is a complicated task. Besides, none of these cost-sensitive methods takes into account the differences between examples in the same class, which is an important factor affecting the performance of classifiers [22][23][24]. In recent years, other methods, such as online hard example mining (OHEM) [22], focal loss (FL) [25], the gradient harmonising mechanism (GHM) [23] and the distributional ranking loss (DR) [26], have been proposed to solve the imbalance problem in object detection. These methods take into account the imbalance between majority and minority classes as well as that between easy and hard examples. However, they have many hyper-parameters, and tuning them is a tedious task during training.
An adaptive learning cost-sensitive convolutional neural network (Alcs-CNN) approach is proposed, which avoids setting the costs artificially. Alcs-CNN incorporates a class-dependent cost for each class into the global error of the CNN, so that more attention is paid to the minority classes. Meanwhile, to enhance the learning of the hard-to-learn examples in each class, a distribution weight is assigned to each example during the training process. Both the class-dependent costs and the distribution weights are updated automatically.
The innovations of this article are as follows: 1) A cost-sensitive strategy for CNN that avoids setting the costs artificially is proposed. The proposed strategy can be applied to both binary and multi-class image classification. 2) The strategy is embedded into several baseline CNNs and a large number of experiments are conducted on four image classification datasets. The results show that this approach performs better than the baseline CNNs and several other algorithms.
This article is organised as follows. Section 2 recalls the CNN and the cost-sensitive strategy. Section 3 describes our Alcs-CNN in detail. Section 4 compares our proposed approach with other algorithms, and Section 5 concludes the article.

| Convolutional neural network
Convolutional neural network is a feed-forward artificial neural network, which has been successfully used in classification [27], image recognition [28] and segmentation [29]. As the network avoids complex hand-crafted feature extraction, the original images can be input directly for processing. The core idea of a convolutional neural network is to obtain an optimal model θ* by minimising the global error E(θ) in the back-propagation process:

θ* = argmin_θ E(θ)   (1)
Consider a training set T = {(X_i, d^{(i)})}_{i=1}^{N}, where X is the given data, d is the actual label and N is the number of examples in the training set. The global error of the training set T can be expressed as Equation (2):

E(θ) = (1/N) ∑_{i=1}^{N} l(d^{(i)}, y^{(i)})   (2)
where y^{(i)} is the predicted label for example X_i and l(·) is the loss function. Several loss functions are commonly used, including the mean squared error (MSE) loss, the cross-entropy (CE) loss and the SVM hinge loss.
(1) MSE loss: The MSE loss is the squared error between the actual label and the predicted label. The mathematical formula can be expressed as Equation (3):

l(d, y) = ∑_{k=1}^{K} (d_k − y_k)^2   (3)
where K is the number of classes in the training set, the actual label d ∈ {0, 1}^{1×K} (s.t. ∑_k d_k = 1) and the predicted label y_k is given by the softmax function

y_k = exp(o_k) / ∑_{l=1}^{K} exp(o_l)   (4)
where o k is the output of the previous layer.
(2) CE loss: The CE loss measures the proximity between the actual label and the predicted label. The mathematical formula can be expressed as Equation (5):

l(d, y) = −∑_{k=1}^{K} d_k log y_k   (5)
where the predicted label y_k is given by the softmax function of Equation (4).
(3) SVM hinge loss: The SVM hinge loss measures the margin between each pair of classes. The mathematical formula can be expressed as Equation (7):

l(d, y) = ∑_{k≠t} max(0, 1 − (o_t − o_k))   (7)

where t is the index of the actual class (d_t = 1).
where the predicted label y_k is given directly by the raw output, y_k = o_k. In artificial neural networks, back-propagation is applied to calculate the gradient that is needed to update the network weights. The weights θ are updated according to the following equation:

θ ← θ − η_θ · ∂E(θ)/∂θ
where η θ is the learning rate of θ.
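As an illustration, the softmax output of Equation (4), the MSE and CE losses, and a single gradient-descent weight update can be written as a minimal pure-Python sketch (the numeric values are invented for the example, not taken from the article):

```python
import math

def softmax(o):
    # Equation (4): turn raw outputs o_k into a predicted distribution y_k.
    exps = [math.exp(v) for v in o]
    total = sum(exps)
    return [e / total for e in exps]

def mse_loss(d, y):
    # Equation (3): squared error between one-hot label d and prediction y.
    return sum((dk - yk) ** 2 for dk, yk in zip(d, y))

def ce_loss(d, y):
    # Equation (5): cross entropy, -sum_k d_k * log(y_k).
    return -sum(dk * math.log(yk) for dk, yk in zip(d, y) if dk > 0)

o = [2.0, 0.5, 0.1]   # outputs of the previous layer (toy values)
d = [1, 0, 0]         # one-hot actual label
y = softmax(o)

# One gradient-descent step on a single weight theta with gradient g,
# mirroring the weight-update rule above:
theta, g, eta = 0.3, 0.12, 0.01
theta = theta - eta * g
```

In a real CNN the same update is applied to every weight, with the gradients obtained by back-propagation.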

| Cost-sensitive strategy
During the training of a CNN, the network weights are updated to minimise the global error of the training set. However, since the minority classes are underrepresented in an imbalanced dataset, they account for only a small percentage of the global error, which may cause the classifier to ignore them. It is undesirable, and sometimes dangerous, to misclassify the minority class examples. The cost-sensitive strategy is an effective way to deal with this imbalance problem [11][12][13][14][15][16][17][18][19][20][21]. It improves the accuracy of the minority classes by penalising the misclassification of each class differently based on a given cost matrix C. The cost matrix C is a K-by-K matrix, where each element C(y, k) ∈ [0, ∞) denotes the cost of classifying a y-th class example as the k-th class. In this cost matrix, the cost of misclassifying a majority class example is less than that of misclassifying a minority class example. In particular, C(y, y) = 0 in the matrix C. Such a cost matrix makes the classifier focus more on the minority classes. In most cases, the value of each element in the cost matrix is designed based on expert judgement, which is a tedious task for a dataset with a large number of classes. Here we try to solve the imbalance problem in a novel and simple way that avoids setting the costs artificially.
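As a toy illustration of such a cost matrix (the values below are invented for the example, not taken from the article), misclassifying the minority class 2 is made more expensive than misclassifying the majority classes:

```python
# C[y][k] is the cost of classifying a y-th class example as class k.
# The diagonal C[y][y] is zero; class 2 plays the minority class here.
C = [
    [0, 1, 1],
    [1, 0, 1],
    [5, 5, 0],
]

def example_cost(true_class, predicted_class):
    # Look up the misclassification cost for one example.
    return C[true_class][predicted_class]
```

Weighting each example's loss by `example_cost` makes errors on the minority class dominate the global error, which is the effect the cost-sensitive strategy aims for.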

| Evaluation criterion
Average accuracy is the most popular criterion for evaluating the classification performance of a classifier. To overcome the shortcomings of average accuracy on imbalanced data, another evaluation criterion, the G-mean, is introduced to measure the classification performance of a classifier on imbalanced datasets [30]:

G-mean = (∏_{i=1}^{K} R_i)^{1/K}
where R i is the recall value of i-th class.
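The G-mean can be computed directly from the per-class recalls; a minimal sketch (illustrative helper, not the article's code):

```python
def g_mean(actual, predicted, num_classes):
    # Recall R_i of each class, then the geometric mean of the recalls.
    recalls = []
    for c in range(num_classes):
        idx = [i for i, a in enumerate(actual) if a == c]
        hits = sum(1 for i in idx if predicted[i] == c)
        recalls.append(hits / len(idx))
    prod = 1.0
    for r in recalls:
        prod *= r
    return prod ** (1.0 / num_classes)

# Toy two-class example: class 0 fully recalled, class 1 half recalled.
actual = [0, 0, 1, 1]
predicted = [0, 0, 1, 0]
g = g_mean(actual, predicted, 2)
```

Because it is a geometric mean, a single class with low recall pulls the G-mean down sharply, which is why it is preferred over average accuracy on imbalanced data.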

| THE PROPOSED METHOD
In a traditional CNN, the global error is the average error over all examples in the training set. Since the minority classes are underrepresented, they account for only a small percentage of the global error, which makes the minority class examples easy to misclassify during training. To bias the CNN towards the minority classes, we introduce Global Mean Square Error Separation (GMSES) [31] to differentiate the errors of different classes. The core idea of GMSES is to achieve the expected sensitivity and specificity by computing a weighted sum of the gradients of each class. In GMSES, the global error is the weighted sum of the errors of all classes in the training set:
E(θ) = ∑_{k=1}^{K} C_k E_k

where K is the number of classes in the training set T, C_k is the class-dependent cost of the k-th class and E_k is the error of the k-th class.
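The weighted global error above reduces to a single dot product; a one-line sketch (illustrative values only):

```python
def gmses_error(class_errors, class_costs):
    # GMSES-style global error: E = sum_k C_k * E_k.
    return sum(c * e for c, e in zip(class_costs, class_errors))

# Toy two-class case: the minority class (error 0.8) carries the larger cost.
errors = [0.2, 0.8]
costs = [0.3, 0.7]
E = gmses_error(errors, costs)
```

With a uniform cost of 1/K this reduces to the ordinary average error; raising a class's cost raises its share of the global error and thus of the gradient.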

Algorithm 1 The Pseudo-code of Alcs-CNN
Input: Training set: T= (X T , d T ); learning rate: η θ ; the number of epochs: M ep and batch size: B.
Step 5. Output the parameters θ

Algorithm 2 The subprogram for calculating G-mean
Input: Actual labels: d T ; predicted labels: y. Output: G-mean.
Output the value: G-mean.
Since the minority classes are easily misclassified, their errors are usually larger than those of the majority classes. Here, we assign the class-dependent costs directly according to the error of each class. The goal is to increase the proportion of the minority classes in the global error, forcing the classifier to focus on those minority classes. The details of Alcs-CNN are shown in Algorithm 1. Note that the class-dependent costs are updated iteratively throughout the training process. The initial cost of each class is set uniformly to 1/K. The class-dependent costs for the next epoch are updated according to the class errors in the previous net:

C_k = ε_k / ∑_{j=1}^{K} ε_j
where ε_k is the mean error of the k-th class.
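Assuming the costs are normalised so that they sum to 1 (matching the uniform 1/K initialisation; this normalisation is one plausible reading of the update, not quoted from the article), the cost update can be sketched as:

```python
def update_costs(mean_class_errors):
    # Classes with larger mean error get proportionally larger cost;
    # the costs sum to 1, consistent with the 1/K initialisation.
    total = sum(mean_class_errors)
    return [e / total for e in mean_class_errors]

# Toy case: the second class (often a minority class) has a larger error,
# so its cost grows in the next epoch.
new_costs = update_costs([1.0, 3.0])
```

The update needs no hyper-parameters: the class errors themselves drive the costs, which is the "adaptive learning" aspect of Alcs-CNN.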
Here w(X_i) is the weight of example X_i in the error of its class. In contrast to AdaBoost, this algorithm updates the distribution weight of each example according to the G-mean value of the previous net. The initial weight of each example in the training set is set uniformly to 1/N. The updating process is shown in Equation (16):

W_{e+1}(X_i) = W_e(X_i) exp(−α · 1(d^{(i)}, y^{(i)})) / Z_e   (16)

where α = 0.5 ln(G-mean/(1 − G-mean)) is the weight-updating parameter, Z_e = ∑_i W_e(X_i) exp(−α · 1(d^{(i)}, y^{(i)})) is a normalisation factor and 1(d, y) is an indicator function:

1(d, y) = 1 if d = y, and 0 otherwise.
As Equation (16) shows, the weights of the correctly classified examples are reduced by a factor of exp(−α), while the weights of the misclassified examples are left unchanged before normalisation. Note that the distribution weights are only updated when G-mean > 0.5. In this approach, the minority classes, which are often misclassified, obtain larger class-dependent costs than the majority classes. Moreover, the hard-to-learn examples are assigned larger distribution weights than the other examples. This forces the classifier to focus on the minority classes and on the hard-to-learn examples of each class during the next training epoch.
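The update of Equation (16) can be sketched as follows (a hedged reading: the `correct` flags play the role of the indicator 1(d, y)):

```python
import math

def update_weights(weights, correct, g_mean):
    # AdaBoost-style update: only applied when G-mean > 0.5; correctly
    # classified examples are shrunk by exp(-alpha), misclassified ones
    # keep their raw weight, then all weights are renormalised.
    if g_mean <= 0.5:
        return weights
    alpha = 0.5 * math.log(g_mean / (1.0 - g_mean))
    raw = [w * (math.exp(-alpha) if c else 1.0)
           for w, c in zip(weights, correct)]
    z = sum(raw)  # normalisation factor Z_e
    return [w / z for w in raw]

# Four examples with uniform initial weights 1/N; the last one is
# misclassified, so its relative weight grows after the update.
weights = [0.25] * 4
correct = [True, True, True, False]
new_weights = update_weights(weights, correct, 0.8)
```

With G-mean = 0.8, exp(−α) = 0.5, so the misclassified example ends up with twice the weight of each correctly classified one after renormalisation.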
Since the global error is just the weighted sum of errors of all classes, the gradient computation of a single example at each neuron is the same as in the original case. The details of the computational process are as follows.
(4) MSE loss: For the MSE loss, the gradient at each output neuron is calculated by

∂l(d, y)/∂o_j = ∑_{k=1}^{K} 2(y_k − d_k) · ∂y_k/∂o_j   (18)

The y_k value is given by Equation (4), whose derivative is ∂y_k/∂o_j = y_k(1(k = j) − y_j). Therefore, the gradient in Equation (18) can be expressed as

∂l(d, y)/∂o_j = 2 ∑_{k=1}^{K} (y_k − d_k) y_k (1(k = j) − y_j)

(5) CE loss: For the CE loss, the gradient at each output neuron is calculated by

∂l(d, y)/∂o_j = −∑_{k=1}^{K} (d_k / y_k) · ∂y_k/∂o_j   (20)

The partial derivative in Equation (20) is solved in two cases, k ≠ j and k = j:

∂y_k/∂o_j = −y_k y_j,  s.t. k ≠ j;
∂y_k/∂o_j = y_j(1 − y_j),  s.t. k = j.

Therefore, the gradient in Equation (20) can be expressed as

∂l(d, y)/∂o_j = ∑_{k=1}^{K} d_k (y_j − 1(k = j))   (21)

Since ∑_k d_k = 1, Equation (21) reduces to ∂l(d, y)/∂o_j = y_j − d_j, where 1(·) is an indicator function.
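The y_j − d_j result is easy to verify numerically with central finite differences (a self-contained check, not part of the article):

```python
import math

def softmax(o):
    exps = [math.exp(v) for v in o]
    total = sum(exps)
    return [e / total for e in exps]

def ce(o, d):
    # Cross-entropy of softmax(o) against the one-hot label d.
    y = softmax(o)
    return -sum(dk * math.log(yk) for dk, yk in zip(d, y) if dk > 0)

o = [1.2, -0.3, 0.4]   # toy output-layer values
d = [0, 1, 0]          # one-hot actual label
y = softmax(o)
eps = 1e-6

# Central finite difference of the CE loss w.r.t. each output o_j ...
numeric = []
for j in range(3):
    op, om = o[:], o[:]
    op[j] += eps
    om[j] -= eps
    numeric.append((ce(op, d) - ce(om, d)) / (2 * eps))

# ... should match the analytic gradient y_j - d_j.
analytic = [yj - dj for yj, dj in zip(y, d)]
```

This simple y − d form is why the CE loss combines so cleanly with softmax outputs in back-propagation.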

| Network architecture
Experiments were conducted on three networks: VGG-19, Resnet-18 and Densenet (L = 40, k = 12). Their network architectures follow the works in [33][34][35], respectively. Their error functions are modified by inserting the cost-sensitive strategy; the modified architectures are identical to the baseline CNNs except for the error functions in the back-propagation process. All the networks are trained using stochastic gradient descent (SGD). Furthermore, all the experiments are conducted with the CE loss function, as it has been shown that the CE loss performs better than the MSE loss and the SVM hinge loss [36].
In the experiments, the weight decay is 10^−4, the momentum is 0.9, the batch size is 200 and the number of epochs is 163. A dropout layer is added after each convolutional layer (except the first one) and the dropout rate is set to 0.5. The initial learning rate is 0.01 and is decreased by a factor of 0.1 after the 50th epoch and again after the 80th epoch. For image preprocessing, the data are normalised using the channel means and standard deviations. To facilitate the comparison with the data-level methods, the experiments use VGG as the baseline network.

| Datasets
The experiments in this article are conducted on two balanced datasets, CIFAR10 and MNIST, and two imbalanced datasets, the Edinburgh dermofit image library (DIL) and moorea labelled corals (MLC). CIFAR10 is a well-known image classification dataset which contains 10 classes such as animals, ships and cars. MNIST is a handwritten digit dataset which contains 10 classes of digits from 0 to 9. For these two datasets, the training and test splits follow the source websites. One of the most important uses of inserting a cost-sensitive strategy into classifiers is to improve the accuracy of the minority classes in imbalanced datasets. Nevertheless, CIFAR10 and MNIST are roughly balanced. To evaluate the algorithm on imbalanced data, corresponding imbalanced datasets are constructed from these two balanced ones: for each dataset, an imbalanced variant is built by removing 70% or 90% of the examples that belong to the even-numbered classes. For convenience, these imbalanced datasets are named CIFAR10_imb70, MNIST_imb70, CIFAR10_imb90 and MNIST_imb90, respectively. DIL is an imbalanced dataset used for melanoma detection, which contains 960 skin lesion images belonging to five classes (45 actinic keratosis, 239 basal cell carcinoma, 331 melanocytic nevus/mole, 88 squamous cell carcinoma, 257 seborrhoeic keratosis). The dataset is split into training and testing sets according to [37].
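The construction of the imbalanced variants can be sketched on toy labels (the helper `make_imbalanced` is illustrative, not the authors' code):

```python
import random

def make_imbalanced(labels, keep_fraction=0.1, seed=0):
    # Keep only `keep_fraction` of the examples from even-numbered classes,
    # mimicking the *_imb90 construction (90% of those examples removed);
    # odd-numbered classes are kept in full. Returns kept example indices.
    rng = random.Random(seed)
    kept = []
    for i, lab in enumerate(labels):
        if lab % 2 == 0 and rng.random() > keep_fraction:
            continue
        kept.append(i)
    return kept

# Toy dataset: 4 classes with 100 examples each.
labels = [c for c in range(4) for _ in range(100)]
kept = make_imbalanced(labels, keep_fraction=0.1)
```

After the call, classes 1 and 3 retain all 100 examples while classes 0 and 2 keep roughly 10 each, yielding the heavy imbalance the experiments target.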

| Experimental results
The CIFAR10 dataset and its two imbalanced variants were used to compare our Alcs-CNN with the baseline CNN (VGG-19) [33]. Table 2 shows the accuracy and G-mean results of the comparisons; the best result in each experiment is highlighted in bold. From Table 2, it is observed that all the accuracy and G-mean values decrease as the degree of imbalance increases. This method has slightly higher accuracy than the baseline CNN. Meanwhile, it achieves a G-mean gain of 3% and 5% over the baseline CNN on CIFAR10_imb70 and CIFAR10_imb90, respectively. Figure 1a,b shows the confusion matrices of the baseline CNN and Alcs-CNN on CIFAR10_imb90, respectively. From the figure, a boost in performance for the even-numbered classes is observed, for example, aeroplane, bird, deer, frog and ship. For the MNIST dataset and its two imbalanced variants, experiments following those on the CIFAR10 dataset were conducted. Note that MNIST images are grey-scale, so the number of colour channels of the first convolution layer was changed to one in the network architectures. Table 3 shows the experimental results. From Table 3, it is observed that this method achieves higher accuracy and G-mean values than the baseline CNN on the imbalanced MNIST datasets. Their confusion matrices are shown in Figure 1c,d.
Experiments were also conducted on two real-world imbalanced datasets, DIL and MLC. For DIL, threefold cross-validation was performed on the five classes, similar to the work of Ballerini et al. [37]. For MLC, experiments were conducted following the work of Khan et al. [21]. Table 4 shows the experimental results. From Table 4, it is observed that our method achieves higher accuracy and G-mean values than the baseline CNN on these imbalanced datasets. From Figure 1e,f, a boost in performance for the least frequent classes is observed, for example, actinic keratosis, squamous cell carcinoma, turf, macro and monti.
This method is also compared with two data-level methods (RUS [4] and SMOTE [6]) and three cost-sensitive deep learning methods (CSNN [19], Sosr-CNN [20] and Cosen-CNN [21]) on these imbalanced datasets. Besides, it is also compared with two other methods used for object detection: FL [25] and GHM-C [23]. For fairness, the experimental process was kept as close as possible to that of the proposed Alcs-CNN. For the two data-level methods, a 2048-dimensional feature extracted from the pre-trained deep CNN [33] was used as the input feature. Note that this 2048-dimensional feature is the input of the first fully connected layer in the pre-trained deep CNN. Furthermore, a three-layered neural network, identical to the last three fully connected layers in VGG19, was used for classification with these data-level methods. For CSNN, the one-vs-all method was used to make it applicable to multi-class classification. For Sosr-CNN, the proposed smooth one-sided regression (SOSR) loss was incorporated as the last layer of the baseline CNN in these experiments. The fixed cost matrix of Sosr-CNN was generated according to the approach in [38]. For Cosen-CNN, the two extra fully connected layers before the output layer were removed in the experiments.
For FL, the hyper-parameters are set as γ = 2 and α = 0.25. For GHM-C, the hyper-parameters are set as α = 0.75 and M = 30. Table 5 shows the experimental G-mean results of the comparisons. From Table 5, it is observed that the proposed method has an improvement of G-mean over all of these methods.
In Alcs-CNN, the class-dependent costs are updated according to the errors of the different classes in each epoch. To validate the effectiveness of this adaptive update, experiments were also run with the class-dependent costs fixed in two ways. In the first case, the cost of each class was fixed to 1/K. In the second case, the cost of each class was set according to its own proportion in the training set (PT). For convenience, the former is named Alcs-CNN_1/K and the latter Alcs-CNN_PT. The experimental G-mean results are shown in Table 6. The results show that the adaptive cost-sensitive strategy achieves better performance than these two fixed-cost strategies.
Moreover, to further validate the effectiveness of the proposed cost-sensitive strategy, Resnet [34] and Densenet [35] were also used as the baseline algorithms for experiments. Their cost-sensitive variants were named as Alcs-Resnet and Alcs-Densenet, respectively. In the experiments, Resnet and its variant have 18 layers while Densenet and its variant have the configurations of L = 40, k = 12. Table 7 shows the experimental G-mean results of the comparisons. From Table 7, it is observed that the proposed cost-sensitive strategy can improve the performance of both Resnet and Densenet.
Finally, the training and validation loss curves are shown in Figure 2. As the figure shows, both loss curves flatten as the number of epochs increases.

| CONCLUSION
Imbalanced classification is an important task in many domains. In this article, we proposed a novel cost-sensitive convolutional neural network to solve the imbalance problem. The proposed method replaces the global error function of the deep CNN with GMSES so that more attention is paid to the minority classes. Meanwhile, the method enhances the learning of the hard-to-learn examples in each class by assigning a distribution weight to each example. Experiments were conducted on four image classification datasets to evaluate the method. From the experimental results, it can be concluded that the method improves the performance of the classifier on imbalanced datasets. In comparison with other class-imbalance methods, this cost-sensitive strategy takes into account not only the imbalance between classes (class-class imbalance) but also the imbalance between examples (easy/hard example imbalance). Besides, the proposed strategy avoids setting the hyper-parameters artificially. However, for a training set containing noisy labels, the strategy will continuously increase the distribution weights of the noisy examples over the following epochs, which may degrade the performance of the classifier. This issue will be explored in future work.

FIGURE 2 The training/validation loss curves of adaptive learning cost-sensitive convolutional neural network (Alcs-CNN) on CIFAR10_imb90