Hard Sample Aware Noise Robust Learning for Histopathology Image Classification

Deep learning-based histopathology image classification is a key technique to help physicians improve the accuracy and promptness of cancer diagnosis. However, noisy labels are often inevitable in the complex manual annotation process, and they mislead the training of the classification model. In this work, we introduce a novel hard sample aware noise robust learning method for histopathology image classification. To distinguish the informative hard samples from the harmful noisy ones, we build an easy/hard/noisy (EHN) detection model by using the sample training history. Then we integrate the EHN into a self-training architecture to lower the noise rate through gradual label correction. With the obtained almost clean dataset, we further propose a noise suppressing and hard enhancing (NSHE) scheme to train the noise robust model. Compared with the previous works, our method can save more clean samples and can be directly applied to the real-world noisy dataset scenario without using a clean subset. Experimental results demonstrate that the proposed scheme outperforms the current state-of-the-art methods on both the synthetic and real-world noisy datasets. The source code and data are available at https://github.com/bupt-ai-cz/HSA-NRL/.


I. INTRODUCTION
CANCER is a serious threat to people's life and health. The studies [1] confirmed that the early screening of cancer is crucial for enhancing the survival rate. The pathological examination is the gold standard of early cancer detection, which can determine the tissue source, nature, and scope of the tumor relying on the visual observation of pathologists. However, there are still many challenges to overcome. During the actual diagnosis process, pathologists analyze the overall tissue along with nuclei organization, density, and variability, which is a tedious workload. The diagnosis accuracy can be negatively affected by many factors, such as pathologist fatigue and distraction, and the complexity of the tissue structure [2].
The first two authors contributed equally to this work. This work was supported in part by the National Natural Science
Deep learning (DL) techniques, such as convolutional neural networks (CNNs), have been widely used in histopathology image analysis [3], [4]. These efforts are designed to help physicians improve the accuracy and promptness of cancer diagnosis. One typical task in histopathology image analysis is image classification. Several works [5], [6] build deep learning models to classify histopathology images. However, all these works assume that the dataset used for model training is clean.
Actually, it is very expensive and difficult to collect a large dataset with clean labels [7]. In real medical diagnosis scenarios, noisy labels are often inevitable in manual annotation for the following reasons: 1) expert domain knowledge is required to perform labeling; 2) manual annotation suffers from large intra- and inter-observer variability even among experts; 3) it is time-consuming and tedious to annotate a large number of patches. Therefore, designing algorithms that are robust to noisy labels is of great significance [8].
In the literature, many approaches have been proposed, which can generally be classified into three categories: estimating the noise transition matrix [9]-[11], designing noise-robust loss functions [12]-[14], and sample correcting/selecting [15]-[20]. The schemes based on transition matrix estimation try to capture the transition probability between the noisy label and the true label [18]. Different transition matrix estimation methods were proposed in works [9]-[11], such as using an additional softmax layer [9], utilizing trusted samples in a data-efficient manner [10], and a two-step estimating scheme [11]. However, these transition matrix estimations fail on real-world datasets where the utilized prior assumption is no longer valid [18]. Being free of transition matrix estimation, the second category targets designing loss functions that are more noise-tolerant. Work [12] adopted the mean absolute error (MAE) function, which demonstrates more noise robustness than the cross-entropy loss. In work [13], the authors combined Generalized Cross Entropy (GCE) loss and MAE to address the slow convergence speed of work [12]. Recently, the authors in work [14] proposed a determinant-based mutual information loss which can be applied to any existing classification neural network regardless of the noise pattern. However, this kind of method suffers a loss of generalization performance due to the low quality of the validation sets [21].
The third category targets correcting noisy labels or selecting the possibly clean samples. Bootstrap in work [15] adopted the predicted correcting labels together with the raw labels to lower the interference from the noisy samples. In work [16], the authors utilized some clean annotations to reduce the noise in the large dataset. However, it is difficult to obtain the required certain amount of clean data in some cases. In work [17], a joint optimization framework was proposed to gradually estimate the true labels. The self-learning framework was applied to train the label correction network without extra supervision [18]. Some works just selected the clean data by dropping the noisy data directly to avoid the estimation of true labels. Co-teaching has appeared in the literature recently [19], [20], which trains two deep neural networks simultaneously and makes them teach each other by selecting some data of possibly clean labels. Compared with the first two types of methods, the third category is more general and can be integrated into many image classification tasks.
Although many studies have been proposed to suppress noisy labels for the general image classification problem, there are few works on the classification of noisy histopathology images. Work [22] proposed online uncertainty sample mining and individual re-weighting methods to train their network. In work [23], a double-softmax classification module was adopted to prevent overfitting to the noisy labels, and a teacher-student module was used to strengthen the effect of clean labels. Unfortunately, almost all these approaches fail to distinguish informative hard samples from harmful mislabeled ones. Although the work in [22] realized the significance of hard samples, their algorithm still did not separate hard samples from noisy ones, and thus many important hard samples were mistakenly discarded [24].
On the one hand, hard samples can make training more effective and efficient [25]. On the other hand, deep neural networks can easily overfit to label noise, which causes performance degradation [26]. How to involve hard samples in training while reducing noise interference at the same time is of great significance. Many works found that deep models first memorize the easy training data with clean labels and then memorize the hard or noisy data [27], [28]. This phenomenon can be used to distinguish the easy clean data from the noise; however, how to distinguish the hard clean data from the noise is still not clear.
In this work, we strive to reconcile this gap by proposing a hard sample aware noise robust learning algorithm. Our analysis reveals that the prediction history for each sample can be used as guiding information for distinguishing the hard and noisy samples. A deep model for hard and noisy sample detection is thus designed and integrated into our noisy label correction architecture. In the architecture, self-training is applied to conduct the label correction automatically. Based on the corrected data, the noise suppressing and hard enhancing (NSHE) scheme is designed to further enhance the hard sample and weaken the possible noisy sample. Our key contributions are summarized as follows.
• We proposed a two-phase hard sample aware noise robust learning algorithm for histopathology image classification. Our method can save more clean samples by detecting the hard samples and the noise in the label correction phase. The hard samples can be further enhanced in our NSHE phase.
• We built an EHN (easy/hard/noisy) detection scheme and integrated it into our self-learning label correction flow. We found that hard and noisy samples can be recognized using the sample prediction history. Different from the previous works, our scheme can save more hard samples and discard more noisy samples.
• In the NSHE phase, we smoothly trained our model to further suppress the interference from noisy samples and enhance the hard samples based on our proposed co-learning architecture.
• Our proposed method can be directly applied to the real-world dataset without using clean annotations. The experimental results verify that the proposed algorithm can achieve superior performance on our collected clinical pathology data from one top hospital in Beijing, China.

A. Noisy Label Correction
Noisy label correction aims at improving the quality of the raw data by replacing the noisy labels with their true labels. Generally, the true labels are predicted by an extra model that is trained on a subset of clean data, such as work [29] and [16]. However, for the real-world dataset, the required clean dataset is not available and thus these methods will fail in this case.
To get rid of the dependence on clean samples, several noisy label correction methods [17], [18] were proposed based on self-learning by pseudo-labels. In fact, pseudo-labeling is a type of self-training which is often used in semi-supervised learning scenario [30] with many unlabeled data [17]. In the semi-supervised learning scenario, the pseudo-labels are initially assigned to unlabeled data by predictions of a model that is trained on a labeled subset. This process is repeated and pseudo-labels are thus updated gradually. However, when processing the noisy datasets with labels, the challenge comes from the uncertainty about what is correct and what is incorrect in the data. In work [17], the authors replaced all the labels with pseudo-labels to improve the quality of the original noisy dataset. The authors in work [18] proposed a self-learning with multi-prototypes (SMP) scheme to train a robust model on the real-world noisy data.
In the above self-learning based methods, all the pseudo labels are involved in the correction model training. However, misleading samples are inevitable, which will thus ruin the performance of the obtained correction model. To address this problem, we proposed an EHN detection scheme based on the prediction history of each sample to recognize the possible noisy samples. Then, we will remove these noisy samples in the correction model training and thus improve the label correction quality. Note that the output of EHN is also used in guiding the noisy sample discarding module (post-processing), which can save more hard samples for our NSHE scheme.

B. Learning to Teach
Learning to teach refers to the schemes that consist of two networks, the teacher and student networks [31], [32]. The teacher tries to choose more informative samples to guide the training of student networks. However, these methods cannot process the dataset with noisy labels.
To make the above learning to teach algorithms be capable of processing noisy data, the authors in work [33] proposed a novel MentorNet to supervise the training of the student network by focusing on the probably correct samples. However, the designed MentorNet suffers the disadvantage of accumulating error introduced by the sample-selection bias [19]. Another method called Decoupling proposed by work [34] trains two models simultaneously and updates the models by sampling with different predictions. However, in the selected subset with disagreement labels, there are still some noisy ones, which will decrease the performance of the trained model [19]. To solve these problems, based on Co-training of work [35], the authors in [19] proposed a learning scheme called Co-teaching, which can train the model successfully even in the extremely noised dataset. Co-teaching also includes two models, and each model adopts the samples with small losses to train its peer network. Through the prediction information exchange, the error flows can be reduced accordingly. The authors in [20] tried to improve the performance by proposing an Iterative Noisy Cross-Validation (INCV) method with the seriously noised dataset.
The above Co-teaching based methods try to conduct noise-robust learning by selecting as many clean samples with a small loss as possible. However, both hard and noisy samples have a large loss, so this selection inevitably ignores the informative hard samples. We attempt to solve the problem in two aspects. First, we pre-discard most of the noisy samples with our hard-sample aware post-processing module. Second, we enhance most of the hard samples while suppressing the few remaining noisy ones at the same time with our NSHE scheme.

III. METHOD
The overall architecture contains two main phases: 1) label correction and 2) the NSHE scheme, as depicted in Fig. 1. The label correction takes the noisy data as the input and generates almost clean data. The NSHE scheme takes the corrected almost-clean dataset as the input and produces the final robust classification model.
For the label correction, we designed a hard sample aware self-learning flow to obtain high-quality pseudo-labels, and further cleaned our dataset by post-processing, which drops the possibly noisy samples before NSHE. The target of the label correction phase is to restore as many clean samples as possible. After label correction, the NSHE scheme aims at further reducing the impact of the noisy samples and emphasizing the hard samples at the same time. Specifically, we smoothly trained our model based on the proposed co-learning architecture.

B. Label Correction
The proposed label correction architecture mainly consists of the classification/correction model, easy/hard/noisy (EHN) detection scheme, and post-processing component.
Firstly, the classification model is trained based on noisy data. By using this model, the prediction behavior for all the training samples can be obtained. Note that the prediction behavior means the prediction history for one sample through all the k (such as 30) training epochs. Then, the EHN detection scheme is applied to divide the dataset into three parts: easy, hard, and noisy. With the obtained easy and hard samples, the correction model is trained and used to correct the noisy data. Through repeating the above flow, the dataset quality is thus improved gradually. Finally, the dataset is further filtered by getting through the hard-aware post-processing component. In the following section, we will focus on the details of the EHN detection scheme, correction model, and post-processing component.
EHN Detection Scheme. Previous work [27] showed that CNNs tend to memorize simple samples first, and then gradually learn all the remaining samples, even including the noisy samples, due to their high representation capacity. However, overfitting to the noise leads to poor generalization performance. To avoid the memorization of noisy data, work [33] selected the samples with small loss to train the model, where such samples are treated as clean ones.
A sample with a small loss is one for which the prediction probability of the model output is close to the supervising label. However, the normalized probability is much easier to analyze than the loss value. Different from work [33], we apply the mean prediction probability over the training history of a sample in our EHN detection scheme. Fig. 2 shows the mean prediction probability histogram of clean and noisy samples. The figure shows that most of the clean samples have higher mean prediction probabilities than the noisy ones. Therefore, we can set a threshold (such as the red dotted line in Fig. 2) to preliminarily extract some clean samples, and we call the clean samples whose mean prediction probability exceeds the threshold easy samples. However, a part of the clean samples still falls below the threshold, and we cannot distinguish them from the noisy ones by this threshold alone. We define this part of the clean data as hard samples in our work. As far as we know, there are no existing schemes that can distinguish the hard samples from the noisy ones.
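The easy-sample selection described above can be sketched in a few lines. This is only a minimal illustration, assuming the prediction history is available as one list of per-epoch probabilities of the labeled class for each sample; the function name `select_easy` and the toy values are hypothetical and are not taken from the released code.

```python
def select_easy(prob_history, easy_ratio):
    """Split sample indices into (easy, rest) by mean prediction probability.

    prob_history[i][t] is the predicted probability of sample i's given
    label at epoch t. easy_ratio plays the role of the easy sample ratio:
    the top easy_ratio fraction of samples (by mean probability) is 'easy'.
    """
    mean_probs = [sum(h) / len(h) for h in prob_history]
    n_easy = int(round(easy_ratio * len(mean_probs)))
    # Rank samples from the highest to the lowest mean probability.
    ranked = sorted(range(len(mean_probs)), key=lambda i: mean_probs[i], reverse=True)
    easy = sorted(ranked[:n_easy])
    rest = sorted(ranked[n_easy:])  # candidates that are either hard or noisy
    return easy, rest, mean_probs


# Toy history over 3 epochs for 4 samples (hypothetical values).
history = [
    [0.6, 0.8, 0.9],
    [0.2, 0.3, 0.4],
    [0.5, 0.7, 0.9],
    [0.1, 0.1, 0.2],
]
easy, rest, mean_probs = select_easy(history, easy_ratio=0.5)
print(easy, rest)
```

Selecting by ratio rather than by an absolute probability value mirrors the remark that a higher ratio corresponds to a lower threshold: enlarging `easy_ratio` simply moves the cut further down the ranked list.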
We constructed our EHN detection scheme based on the prediction history of the training samples, as depicted in Fig. 3. For the training set D with N samples, we gradually obtain the corresponding N prediction probability maps through the training of a CNN classification model for k epochs. Our EHN detection scheme first selects the easy samples D_e by using the mean prediction probabilities according to the threshold T. For convenience, the threshold T is implemented as the selected easy sample ratio τ_e in this paper. A higher τ_e corresponds to a lower threshold T, and vice versa. Then we add noise to D_e to obtain D_a by switching the labels of some samples in D_e, and record whether each sample is noise or not as R. The ratio of the added noise is the same as the noise ratio of the original dataset, which can be estimated by the noise cross-validation algorithm of [20]. After that, we train the same classification model by using D_a and record the training history again. Then we discard the "easy samples" of D_a according to the mean prediction probabilities and use the remaining samples as training data for a Multi-Layer Perceptron (MLP) classifier. In this way, we obtain an MLP classifier that takes the prediction probability maps of the training history as input and outputs whether a sample is a hard sample or a noisy one. Finally, we feed the samples in D \ D_e into the MLP classifier and get the hard sample set D_h and the noisy set D_n. Algorithm 1 shows the details of the EHN detection scheme.
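The two data-preparation steps of this scheme, building D_a with the record R and assembling the MLP training set, can be sketched as follows. The CNN and MLP training themselves are omitted. The function names, the uniform choice of the wrong class, and the toy histories are assumptions made for illustration only.

```python
import random


def make_artificial_noisy_set(easy_labels, noise_ratio, num_classes, seed=0):
    """Build D_a from D_e: flip a fraction of labels and record R (True = noise)."""
    rng = random.Random(seed)
    n = len(easy_labels)
    n_noisy = int(round(noise_ratio * n))
    flipped = set(rng.sample(range(n), n_noisy))
    noisy_labels, is_noise = [], []
    for i, y in enumerate(easy_labels):
        if i in flipped:
            # Assumption: the wrong label is drawn uniformly from the other classes.
            noisy_labels.append(rng.choice([c for c in range(num_classes) if c != y]))
            is_noise.append(True)
        else:
            noisy_labels.append(y)
            is_noise.append(False)
    return noisy_labels, is_noise


def mlp_training_set(prob_history_a, is_noise, easy_ratio):
    """Discard the easy samples of D_a and label the rest: 1 = noisy, 0 = hard."""
    means = [sum(h) / len(h) for h in prob_history_a]
    ranked = sorted(range(len(means)), key=lambda i: means[i], reverse=True)
    n_easy = int(round(easy_ratio * len(means)))
    kept = sorted(ranked[n_easy:])
    return [(prob_history_a[i], int(is_noise[i])) for i in kept]


labels_e = [0, 1] * 10                       # labels of D_e (binary toy example)
labels_a, R = make_artificial_noisy_set(labels_e, noise_ratio=0.3, num_classes=2)
# Hypothetical training histories on D_a: flipped samples get low probabilities.
history_a = [[0.2, 0.3] if r else [0.8, 0.9] for r in R]
train = mlp_training_set(history_a, R, easy_ratio=0.5)
print(len(train), sum(target for _, target in train))
```

Each returned pair holds one prediction history and its hard/noisy target, which is the input/output format the MLP classifier is described as using.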
Label Correction Model. Our correction model is trained by using D_e ∪ D_h from the EHN detection scheme. After training, the model has some ability to correct noisy labels. Therefore, the labels of the samples in D_h ∪ D_n are replaced by the pseudo labels generated by the correction model, where the pseudo label is the class with the highest probability in the model output. The reason we also pass the hard samples through the correction model is that we cannot completely trust the result of the MLP classifier in our EHN detection scheme.
The steps above form our self-learning flow: it feeds the original dataset into the classification model and passes the result through the EHN detection scheme by using Algorithm 1. Then, it trains the correction model and updates some sample labels with the pseudo ones. Finally, it iterates over the above steps to further purify our dataset.
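The label replacement step can be illustrated with a short sketch. It assumes the correction model's outputs are already available as per-class probabilities; the function name `correct_labels` and the toy numbers are hypothetical.

```python
def correct_labels(labels, probs, to_correct):
    """Replace the labels of the selected samples by the arg-max pseudo label.

    labels     : current (possibly noisy) labels
    probs      : probs[i][c] is the correction model's probability of class c
    to_correct : indices of the samples in the hard and noisy sets
    """
    corrected = list(labels)
    for i in to_correct:
        p = probs[i]
        corrected[i] = max(range(len(p)), key=lambda c: p[c])
    return corrected


labels = [0, 1, 1, 0]
probs = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.4, 0.6]]
hard_or_noisy = [1, 2, 3]          # sample 0 is easy and keeps its label
corrected = correct_labels(labels, probs, hard_or_noisy)
print(corrected)
```

Easy samples are never touched, so a mistake of the correction model can only affect samples that were already flagged as hard or noisy.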
Post-processing Component. Our post-processing component drops the noisy samples which still cannot be corrected after the processing of EHN and label correction. In D_n, we drop the samples whose labels were not changed by the correction model. In D_h, we drop the samples whose labels were changed by the correction model. In other words, a sample d_i is discarded if (y_i = g_i and d_i ∈ D_n) or (y_i ≠ g_i and d_i ∈ D_h), where y_i and g_i denote the labels of d_i before and after correction. Algorithm 2 shows the details of the post-processing component. After post-processing, we obtain the almost-clean dataset.
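The discarding rule of the post-processing component can be written as a small filter. This is a sketch of the rule as stated in the text, not a copy of Algorithm 2; the function name and the toy data are hypothetical.

```python
def post_process(original, corrected, hard_idx, noisy_idx):
    """Return the indices kept after hard-aware post-processing.

    A sample is dropped if it is in the noisy set and its label was NOT
    changed by the correction model, or if it is in the hard set and its
    label WAS changed. Easy samples are always kept.
    """
    hard, noisy = set(hard_idx), set(noisy_idx)
    kept = []
    for i, (y, g) in enumerate(zip(original, corrected)):
        unchanged = (y == g)
        if (i in noisy and unchanged) or (i in hard and not unchanged):
            continue
        kept.append(i)
    return kept


original = [0, 1, 1, 0, 1]
corrected = [0, 0, 1, 1, 1]
kept = post_process(original, corrected, hard_idx=[1, 2], noisy_idx=[3, 4])
print(kept)
```

The rule keeps a sample only when the MLP classifier and the correction model agree: a "hard" sample whose label survives correction, or a "noisy" sample whose label was actually replaced.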

C. Noise Suppressing and Hard Enhancing (NSHE)
Here we developed our robust NSHE algorithm by using the almost-clean dataset. The overview of the NSHE phase is shown in Fig. 4. We found in the experiments that different samples may have completely opposite optimization directions for the model parameters, which leads to frequent oscillation of the parameters during training and thus to poor results. This phenomenon is even more serious on a noisy dataset, where the noisy samples mislead the model training. Inspired by MoCo [36], we initialized two models M_1 and M_2 with the same backbone and parameters. Formally, denoting the parameters of M_1 as θ_1 and those of M_2 as θ_2, we update θ_2 by:
$$\theta_2 \leftarrow m\,\theta_2 + (1 - m)\,\theta_1. \tag{1}$$
Here m ∈ [0, 1) is a momentum coefficient. Only the parameters θ_1 are updated by back-propagation. The momentum update in (1) makes θ_2 evolve more smoothly than θ_1. Since the almost-clean dataset still has some noisy samples, we ranked the samples according to the prediction probabilities of the labeled class at each epoch, and excluded a very small ratio of the samples with the smallest prediction probabilities from back-propagation. To avoid confirmation bias, we proposed the co-learning architecture based on Co-teaching [19]: the probabilities are computed by M_2, that is, the sample selection information is given by M_2. To further emphasize the significance of hard samples, we used focal loss [37] to strengthen hard samples. The loss function is defined as follows:
$$L = -(1 - p_t)^{\gamma} \log(p_t), \tag{2}$$
where p_t is the predicted probability of the correct class and γ is the focusing parameter of the focal loss [37].
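The three ingredients of the NSHE phase (momentum update, discarding the lowest-probability samples, and focal loss) can be illustrated on plain Python lists. This is a minimal sketch under stated assumptions: parameters are flat lists of floats, the momentum value 0.9 and the toy probabilities are made up for the example, and the function names are hypothetical.

```python
import math


def momentum_update(theta_1, theta_2, m):
    """theta_2 <- m * theta_2 + (1 - m) * theta_1, applied element-wise."""
    return [m * t2 + (1.0 - m) * t1 for t1, t2 in zip(theta_1, theta_2)]


def focal_loss(p_t, gamma=2.0):
    """Focal loss for one sample, with p_t the probability of the labeled class."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def select_for_backprop(probs_from_m2, discard_ratio):
    """Keep all samples except the discard_ratio fraction with the lowest
    labeled-class probability, as judged by the smoothly updated model M_2."""
    n_drop = int(discard_ratio * len(probs_from_m2))
    ranked = sorted(range(len(probs_from_m2)), key=lambda i: probs_from_m2[i])
    return sorted(ranked[n_drop:])


theta_2 = momentum_update(theta_1=[1.0, 0.0], theta_2=[0.0, 0.0], m=0.9)
selected = select_for_backprop([0.9, 0.05, 0.6, 0.3], discard_ratio=0.25)
losses = [focal_loss(p) for p in (0.9, 0.6, 0.2)]
print(theta_2, selected, losses)
```

With γ = 0 the focal loss reduces to the ordinary cross-entropy term −log(p_t); larger γ shrinks the loss of well-classified samples much more than that of samples with low p_t, which is how the hard samples are emphasized.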

A. Dataset
We extensively validate our method on five datasets: 1) DigestPath2019, 2) Camelyon16, 3) Chaoyang, 4) CIFAR-10, and 5) Webvision. The images of the Webvision dataset were crawled from websites, and its training set contains many real-world noisy labels. Since the dataset is quite large, for quick experiments, we follow the previous work [20] and only use the first 50 classes of the Google image subset. Finally, it contains 65944 samples for training and 2500 samples for testing. Among them, 1) to 3) are medical scenario datasets; 4) and 5) are natural computer vision datasets. We randomly added different ratios (10%, 20%, 30%, and 40%) of noise to DigestPath2019 and Camelyon16. Because these two datasets have only two classes, the noise is added by simply flipping a label to the other class. As the CIFAR-10 dataset originally does not contain label noise, following previous work [20], we experiment with two types of label noise: symmetric and asymmetric. Symmetric noise is generated by randomly replacing the labels of a percentage of the training data with any of the other classes, and asymmetric noise is generated by replacing the labels only with the adjacent class. Following work [20], we tested noise ratios of 20%, 50%, and 80% for symmetric noise, and a noise ratio of 40% for asymmetric noise. The Chaoyang and Webvision datasets are constructed in real scenarios, and their noise consists of samples that were actually mislabeled rather than artificially corrupted.
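The two synthetic noise types can be sketched as follows. The exact definition of "adjacent class" is not given in the text, so the sketch assumes it means the next class index, wrapping around at the last class; this choice, the function name, and the toy labels are assumptions for illustration only.

```python
import random


def corrupt_labels(labels, ratio, num_classes, kind="symmetric", seed=0):
    """Return a copy of labels in which a `ratio` fraction has been corrupted.

    symmetric  : the new label is drawn uniformly from all other classes
    asymmetric : the new label is the adjacent class, assumed here to be
                 (label + 1) modulo num_classes
    """
    rng = random.Random(seed)
    n_noisy = int(round(ratio * len(labels)))
    chosen = set(rng.sample(range(len(labels)), n_noisy))
    noisy = []
    for i, y in enumerate(labels):
        if i not in chosen:
            noisy.append(y)
        elif kind == "symmetric":
            noisy.append(rng.choice([c for c in range(num_classes) if c != y]))
        elif kind == "asymmetric":
            noisy.append((y + 1) % num_classes)
        else:
            raise ValueError("kind must be 'symmetric' or 'asymmetric'")
    return noisy


clean = [i % 10 for i in range(100)]               # toy 10-class label list
sym = corrupt_labels(clean, 0.2, 10, kind="symmetric")
asym = corrupt_labels(clean, 0.4, 10, kind="asymmetric")
print(sum(a != b for a, b in zip(clean, sym)), sum(a != b for a, b in zip(clean, asym)))
```

For a two-class dataset both modes coincide, since the only "other" class is also the adjacent one, which matches the simple label flipping used for DigestPath2019 and Camelyon16.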

B. Implementation and Parameter Settings
For the medical scenario datasets (DigestPath2019, Camelyon16, and Chaoyang), we used the Resnet-34 as the backbone and trained it using Adam with a momentum of 0.9 and a batch size of 96. During the label correction phase, the network was trained for 30 epochs. We set the initial learning rate as 0.001, and linearly reduced it after 15 epochs. For the NSHE phase, the networks were trained for 40 epochs. We set the initial learning rate as 0.001, and linearly reduced it after 15 epochs.
For natural computer vision datasets (CIFAR-10, Webvision), we followed the same settings in work [20]. For CIFAR-10, we used the Resnet-32 as the backbone and trained it using SGD with a momentum of 0.9, a learning rate of 0.02, and a batch size of 128. The networks were trained for 300 epochs both in the label correction phase and the NSHE phase. For Webvision, we used the Inception-Resnet v2 [40] as the backbone and trained it using SGD with a momentum of 0.9, a learning rate of 0.01, and a batch size of 32. The networks were trained for 80 epochs both in the label correction phase and the NSHE phase.
For all the datasets, τ_e was set to 0.1 for the 80% noise ratio, and to 1 − 1.5ρ for the other noise ratios, where ρ is the dataset noise ratio. The parameter γ in the focal loss was set to 2, and the discarding ratio τ was set to 0.1ρ. For the real-world datasets Chaoyang and Webvision, ρ was estimated by the noise cross-validation algorithm of [20].
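As a quick worked example of these settings, the following sketch computes τ_e and τ from a given noise ratio ρ. The helper name `ehn_parameters` is hypothetical; only the two formulas and the special case for 80% noise come from the text.

```python
def ehn_parameters(rho):
    """Return (tau_e, tau) for a dataset noise ratio rho, following the text:
    tau_e = 0.1 at 80% noise and 1 - 1.5 * rho otherwise; tau = 0.1 * rho."""
    if abs(rho - 0.8) < 1e-9:
        tau_e = 0.1
    else:
        tau_e = 1.0 - 1.5 * rho
    tau = 0.1 * rho
    return tau_e, tau


for rho in (0.1, 0.2, 0.3, 0.4, 0.8):
    tau_e, tau = ehn_parameters(rho)
    print(f"rho={rho:.1f}  tau_e={tau_e:.2f}  tau={tau:.2f}")
```

Note that 1 − 1.5ρ would be negative at ρ = 0.8, which is presumably why a fixed value is used for that case.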

C. Evaluation Criteria
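We used Accuracy (ACC), Precision, Recall, F1 Score (F1), AUC, and ROC curve as evaluation criteria. Their definitions are as follows:
$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN} \tag{3}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{4}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{5}$$
$$\mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{6}$$
where TP, TN, FP, and FN represent true positives, true negatives, false positives, and false negatives, respectively. The ROC curve is the receiver operating characteristic curve. Its abscissa is the false positive rate and its ordinate is the true positive rate. AUC is the area under the ROC curve.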
For multi-classification tasks, we compute the Precision, Recall, F1 Score, ROC curve, and AUC for each class and average them by using macro-average.
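The macro-averaging of Precision, Recall, and F1 Score can be sketched as follows: each metric is computed per class in a one-vs-rest fashion and the per-class values are averaged with equal weight. Returning 0 for an undefined ratio is a convention assumed here, and the function names and toy predictions are hypothetical.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_prf(y_true, y_pred, num_classes):
    """Macro-averaged precision, recall and F1 over all classes."""
    precisions, recalls, f1s = [], [], []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0   # assumed convention for 0/0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = float(num_classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
acc = accuracy(y_true, y_pred)
macro_p, macro_r, macro_f1 = macro_prf(y_true, y_pred, num_classes=3)
print(acc, macro_p, macro_r, macro_f1)
```

Because every class contributes equally to the average, a rare class with poor recall lowers the macro score as much as a frequent one would.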

D. Objective Comparison
We compare our method with the following methods using the same network architecture. Among them, works 1) to 5) are recent state-of-the-art methods for general noisy data processing; works 6) and 7) are state-of-the-art methods proposed for medical data scenarios. We chose these schemes as baselines to demonstrate the advantage of our method. The tests of 2) to 6) are based on the open-source code from the authors. We re-implemented and tested 1) and 7) based on the settings from the original papers.
For the experiments on the medical scenario datasets, Table I and Table II show the test ACC, AUC, F1 Score, Precision, and Recall on DigestPath2019 and Camelyon16 with different levels of label noise ranging from 10% to 40%. Our method outperforms the state-of-the-art methods at almost all noise ratios. Table III shows these metrics on the Chaoyang dataset. Our method outperforms all other methods by a large margin in every criterion. For the experiments on the natural computer vision datasets, Table IV shows the test ACC, AUC, F1 Score, Precision, and Recall on CIFAR-10 with different levels and different types of label noise ranging from 20% to 80%. Our method outperforms the state-of-the-art methods across all noise ratios. Table V shows these metrics on the Webvision dataset. Our method consistently outperforms all other methods. Besides, we show the mean ROC curves of all five datasets in Fig. 6.

E. Ablation Study
We study the effect of removing different components to provide insights into what makes our method successful. Fig. 7 and Fig. 8 show the ablation study results in different noise ratios. The result details are shown in Tables VI and VII, and we discuss them below.
Fig. 6. (a) DigestPath2019 dataset average ROC curve from 10% to 40% noise ratios (macro-average used). (b) Camelyon16 dataset average ROC curve from 10% to 40% noise ratios (macro-average used). (c) Chaoyang dataset average ROC curve from 4 classes (macro-average used). (d) CIFAR-10 dataset average ROC curve from 10 classes with 20% to 80% noise ratios (macro-average used). (e) Webvision dataset average ROC curve from 50 classes (macro-average used).
To study the effects of the NSHE scheme, we removed the NSHE scheme (w/o NSHE), that is, we trained a single model on the dataset produced by label correction. The results show that the hard samples play a significant role in training the final models. By removing the NSHE scheme, the test accuracy decreased by an average of about 1.5%.
To study the effects of the EHN detection scheme, we removed both the EHN detection scheme and the NSHE scheme (w/o (EHN + NSHE)). In this situation, following work [17], we directly used the classification model as the correction model, and the dataset is processed only by the correction model. The results show that the EHN detection scheme is very effective at saving more hard samples and filtering out more noisy ones. Without the EHN detection scheme, the test accuracy further decreased by an average of about 2.8%.
To study the effects of the correction model, we further removed the whole label correction phase (w/o whole), that is, we trained the model on the original data. To make the model converge within the same number of epochs, we adjusted the learning rate, which smooths the test accuracy in the last epochs but affects the highest accuracy. By removing the correction model, the test accuracy further decreased by an average of about 1.2%.
Among the NSHE scheme, EHN detection scheme, and correction model, the EHN detection scheme introduces the maximum performance gain, followed by the NSHE scheme. All components have a certain gain at any noise ratio.
To analyze the effect of self-training rounds, we recorded the training time and test accuracy for different numbers of self-training rounds. Fig. 9 shows the results on the DigestPath2019 dataset with 20% noise ratio. Training more rounds consumes more computing resources but brings little gain. Therefore, we chose to train only one round in our experiments. We also studied how much higher a noise ratio our scheme can tolerate while reaching a performance similar to that of a simple model. We first directly trained a simple Resnet-34 under 5 different noise ratios (0%, 10%, 20%, 30%, and 40%) on the DigestPath2019 dataset. Then we carried out experiments with our algorithm (with a Resnet-34 backbone) at 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, and 49% noise ratios (note that we selected the 49% noise ratio because it is close to the theoretical limit of 50% for this type of noise on a binary classification task). The results are shown in Fig. 10.
According to Fig. 10, to reach the same performance, for 0%, 10%, 20%, 30%, and 40% noise ratios, our scheme can tolerate about up to 22%, 34%, 40%, 44%, and 49% noise ratio, respectively. We also find that in the noise ratios of 0% to 20% on DigestPath2019 dataset, the results of our method are even better than the trained simple Resnet-34 on a completely clean dataset. We believe that this phenomenon is due to our enhancement of the information of hard samples.

F. Analysis of EHN Detection Scheme
Effectiveness and Convergence Analysis. To analyze the effectiveness and convergence of our EHN detection scheme, we trained on the DigestPath2019 dataset with different noise ratios and plotted the curves of test ACC vs. iterations (the "test" set here means D \ D_e) of M_m. The results are shown in Fig. 11. According to Fig. 11, M_m has converged in the later training stage at each noise ratio, so the convergence of M_m is relatively stable. Also from Fig. 11, M_m can achieve 98%, 97%, 96%, and 90% classification accuracy at the 10% to 40% noise ratios, respectively. Thus, M_m can effectively distinguish the hard and noisy samples. We also recorded the confusion matrix of the EHN detection scheme in Fig. 12 (clean samples are divided into easy and hard in this test, as depicted in Fig. 2 (b)).
Fig. 12. Confusion matrix and accuracy (ACC) of EHN detection scheme in DigestPath2019 dataset with different noise ratios.
Parameter Analysis. To analyze how sensitive the EHN detection scheme is to the number of training epochs k and the easy sample ratio τ_e, we trained with different k and τ_e on the DigestPath2019 dataset with 30% noise ratio, recorded the confusion matrix, and calculated the ACC of the EHN detection scheme. Specifically, we first adjusted the value of k with fixed τ_e = 0.55, and thus obtained the sensitivity of EHN to k. Then we adjusted the value of τ_e with fixed k = 30, and thus obtained the sensitivity of EHN to τ_e. There are a total of 9 different parameter configurations. The results are shown in Fig. 13. We also plotted the ACC vs. parameters graph in Fig. 14. These results show that as k increases from 0, the performance continues to increase. When k reaches about 30, the performance achieves its maximum accuracy. As k increases further, the performance begins to decrease slightly. Overall, however, once k exceeds a certain threshold, the performance of EHN is relatively stable. For the parameter τ_e, changing the value within a reasonable range has little impact on the performance. According to Fig. 14, the parameters we used are effective.
Fig. 13. Confusion matrix and accuracy (ACC) of EHN detection scheme on different "k" and "τ_e" in DigestPath2019 dataset with 30% noise ratio.
Why does D_a work? To study why M_m trained on the artificially created D_a can recognize hard and noisy samples in the original dataset D, we plotted the histograms of the mean prediction probability of the clean and noisy samples for both the DigestPath2019 dataset D (40% noise ratio) and the corresponding D_a. The results are shown in Fig. 15. According to Fig. 15 (b), there are also some clean samples that cannot be distinguished from the noisy ones by mean prediction probability. Although the samples in D_e are all easy ones with respect to dataset D, some of them become hard ones with respect to dataset D_a. Fig. 15 shows that the mean prediction probabilities of samples trained on D_a have a distribution similar to that of the original dataset D, which is why our EHN detection scheme works with the artificially created D_a.
Fig. 16. The diagram of "Learning" and "Forgetting" event.
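To make the quantity behind these histograms concrete, here is a minimal sketch that computes the mean prediction probability of each sample from its training history and bins the values separately for clean and noisy samples. The histories, the bin count, and the function names are assumptions made for illustration.

```python
def mean_prediction_probability(history):
    """history: labeled-class probability of one sample, one value per epoch."""
    return sum(history) / len(history)


def histogram(values, num_bins=5):
    """Count values in equal-width bins over [0, 1]; 1.0 goes in the last bin."""
    counts = [0] * num_bins
    for value in values:
        index = min(int(value * num_bins), num_bins - 1)
        counts[index] += 1
    return counts


# Toy training histories, invented for illustration only.
clean_histories = [
    [0.5, 0.75, 1.0, 0.75],   # easy clean sample: high mean
    [0.25, 0.5, 0.5, 0.25],   # hard clean sample: low mean
]
noisy_histories = [
    [0.0, 0.25, 0.125, 0.125],
]

clean_means = [mean_prediction_probability(h) for h in clean_histories]
noisy_means = [mean_prediction_probability(h) for h in noisy_histories]
clean_hist = histogram(clean_means)
noisy_hist = histogram(noisy_means)
print(clean_hist, noisy_hist)
```

In this toy data the hard clean sample lands in a low bin close to the noisy sample, which is the kind of overlap that makes the mean alone insufficient to separate the two groups.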

G. Hard and Noisy Sample Analysis
Behavior Analysis. To analyze the training behavior of the hard samples and the noisy samples, inspired by [45], we calculated the frequency of their "Learning" and "Forgetting" events over all training epochs. A "Learning" event at epoch t occurs when the prediction probability of the labeled class is less than 0.5 at epoch t − 1 and greater than 0.5 at epoch t. A "Forgetting" event at epoch t occurs when the prediction probability of the labeled class is greater than 0.5 at epoch t − 1 and less than 0.5 at epoch t. Fig. 16 shows the diagram of the "Learning" and "Forgetting" events. The statistical results are shown in Fig. 17 (a). These results show that the behaviors of the hard samples and the noisy samples differ greatly as training proceeds. In the early stage of training, the hard samples tend to have more learning events, while the noisy samples tend to have more forgetting events.
In the late stage of training, the hard samples tend to have more forgetting events, while the noisy samples tend to have more learning events. On the whole, the learning and forgetting events of the hard samples are more frequent than those of the noisy samples over the whole training process. We also calculated the frequency histogram of the gradient absolute value, that is, the absolute value of the difference between the prediction probabilities of adjacent epochs. As shown in Fig. 17 (b), the gradient absolute value of the hard samples tends to be higher than that of the noisy samples. We believe the reason is that the hard samples are difficult to fit, which makes the model's prediction probability jump frequently. Some noisy samples, however, lie conspicuously at the center of other classes; they are "super hard" for the model to fit. Their effect on the optimization of model parameters would be suppressed by the surrounding clean samples. So in the latter part of the training process, their output probabilities would [...]
In Fig. 18, the noisy samples are scattered throughout the dataset. When these samples fall in the intersection area of two categories, their training behaviors are more similar to those of the hard samples; when they fall in the center area of other classes (yellow circles in Fig. 18), their training behaviors differ distinctly from those of the hard samples. To further support this conjecture, we selected samples from the Camelyon16 dataset with 20% noise ratio, extracted their features with a ResNet-34 model (pre-trained on ImageNet), and visualized them with t-SNE [46] in Fig. 19.
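The per-sample statistics used in the behavior analysis can be sketched directly from their definitions: a "Learning" event is an upward crossing of 0.5 between adjacent epochs, a "Forgetting" event is a downward crossing, and the gradient absolute value is the absolute difference between the prediction probabilities of adjacent epochs. The histories below are toy values invented for illustration, and the function names are assumptions.

```python
def count_events(history, threshold=0.5):
    """Return (learning, forgetting) counts for one sample.

    history: labeled-class probability per epoch. A value exactly equal
    to the threshold triggers neither event, following the strict
    inequalities in the definitions.
    """
    learning = forgetting = 0
    for previous, current in zip(history, history[1:]):
        if previous < threshold < current:
            learning += 1
        elif previous > threshold > current:
            forgetting += 1
    return learning, forgetting


def gradient_absolute_values(history):
    """Absolute change of the prediction probability between adjacent epochs."""
    return [abs(current - previous)
            for previous, current in zip(history, history[1:])]


# Toy histories, invented for illustration only.
hard_history = [0.25, 0.75, 0.25, 0.75, 0.875]    # jumps across 0.5
noisy_history = [0.25, 0.125, 0.125, 0.25, 0.125]  # stays below 0.5

print(count_events(hard_history), count_events(noisy_history))
print(gradient_absolute_values(hard_history))
print(gradient_absolute_values(noisy_history))
```

Aggregating these counts and differences over all hard samples and all noisy samples would give frequency statistics of the kind reported in Fig. 17.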
We selected the noisy samples that fall in the center of other classes in Fig. 19 and plotted their training history alongside that of the hard samples in Fig. 20. These noisy samples are indeed difficult to predict in the latter part of the training, which confirms our previous analysis.
Benefit Analysis. To show the gain of our hard-aware label correction phase more intuitively, we counted the remaining noise in the processed dataset and compared it with the baseline in Table VIII. The baseline updates the labels with the classification model and drops samples according to the mean prediction probability of the training history. The results show that, with the same number of remaining samples, our strategy eliminates more noisy samples while protecting hard samples as much as possible, and thus generates a higher-quality dataset for the second phase of training. Besides, for the real medical scenario dataset "Chaoyang", our hard-aware label correction phase drops only 191 of the 4021 samples (about 4.8%), which shows that our method reduces noise with little damage to the original dataset.
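A minimal sketch of how the remaining noise of a processed dataset can be counted is given below. It assumes the true labels are known, which holds only for synthetic noise; the label lists are toy values and the function name is hypothetical. The last lines reproduce the dropped fraction for the Chaoyang figures quoted in the text.

```python
def remaining_noise(true_labels, processed_labels, kept):
    """Count kept samples whose processed label still differs from the truth.

    kept: one boolean per sample, False if the sample was dropped.
    Returns (number of remaining noisy samples, number of remaining samples).
    """
    noisy = sum(
        1
        for truth, label, keep in zip(true_labels, processed_labels, kept)
        if keep and truth != label
    )
    remaining = sum(1 for keep in kept if keep)
    return noisy, remaining


# Toy data, invented for illustration only.
true_labels = [0, 1, 1, 0, 1, 0]
processed_labels = [0, 1, 0, 0, 1, 1]   # labels after correction
kept = [True, True, False, True, True, True]

noisy_count, remaining = remaining_noise(true_labels, processed_labels, kept)
print(noisy_count, remaining)

# Dropped fraction for the Chaoyang numbers quoted in the text.
dropped_fraction = 191 / 4021
print(round(dropped_fraction, 4))
```

Comparing two strategies at the same value of `remaining` isolates how many noisy samples each one removes, which is the comparison described for Table VIII.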

V. CONCLUSION
Deep learning-based histopathology image classification can improve the accuracy of cancer diagnosis, but it is difficult to collect a large clean dataset for training such a classification model. The existing noisy label correction methods fail to distinguish the hard samples from the noisy samples and thus degrade the model performance. In this study, we proposed a hard sample aware noise robust learning method for histopathology image classification that saves more clean samples and thus boosts the model performance. We found that the training prediction history can be used to distinguish the hard samples from the noisy samples. By integrating our EHN detection scheme into noise removal, more hard clean samples can be saved. Besides, in our NSHE scheme under the co-learning architecture, we adopted different parameter updating speeds for the two models, which further suppresses the interference of noisy samples. Our results show compelling performance on noisy datasets, and the proposed method can be directly applied to real-world noisy scenarios. In the future, we will study hard sample aware semantic segmentation, such as malignant tissue segmentation for histopathology images.