A Weakly Supervised Method for Mud Detection in Ores Based on Deep Active Learning

Automatically detecting mud in bauxite ores is important and valuable, as it can improve productivity and reduce pollution. However, distinguishing mud from ore in a real scene is challenging because of their similarity in shape, color, and texture. Moreover, training a deep learning model needs a large number of exactly labeled samples, which is expensive and time consuming. Aiming at this challenging problem, this paper proposes a novel weakly supervised method based on deep active learning (AL), named YOLO-AL. The method uses the YOLO-v3 model as the basic detector, which is initialized with the weights pretrained on the MS COCO dataset. Then, an AL framework-embedded YOLO-v3 model is constructed. In the AL process, it iteratively fine-tunes the last few layers of the YOLO-v3 model with the most valuable samples, which are selected by a Less Confident (LC) strategy. Experimental results show that the proposed method can effectively detect mud in ores. More importantly, the proposed method can obviously reduce the number of labeled samples without decreasing the detection accuracy.


Introduction
Bauxite is usually mixed with a large amount of mud lump, which is the main impurity in alumina ore. Removing the mud requires a large dose of chemical reagents (such as alkali), which increases the production cost and environmental pollution. More seriously, the mud is highly viscous, so it is likely to block production equipment and affect the stability of production. At present, mud removal still relies on traditional manual operation. So, automatically detecting and removing mud from ores with AI technology is important and valuable for reducing both production cost and environmental pollution.
However, it is challenging to distinguish the mud and the ore in a real scene. The reason lies in several aspects. (1) Since the mud and the ore both appear as lumps, the shape difference is not obvious. (2) Since ore usually cannot be cleaned thoroughly, there is little difference between the mud and the ore in color and texture (see Figure 1). Even experienced experts need careful examination to distinguish them. (3) One image often contains multiple pieces of mud whose sizes vary significantly (diameter from 50 mm to 500 mm). (4) More seriously, since the ores from different mines have different compositions and contents, their color and texture differ obviously between mines.
Currently, there is no dedicated method for mud detection in ores. But we can benefit from common object detection methods, which are usually based on deep neural networks and trained with a large number of exactly labeled samples.
There are two typical kinds of methods: the region-proposal-based methods and the regression-based methods. The former are also called two-stage methods. A region proposal algorithm finds candidate object regions in the first stage, and then a CNN extracts features and classifies the candidate objects in the second stage. These methods include the R-CNN [1], the Fast R-CNN [2], the Faster R-CNN [3], the SPP-NET [4], the SSD [5], the R-FCN [6], and the newest Cascade R-CNN [7]. The latter treat object detection as a regression problem and predict the location and category at the same time.
The most representative ones are the YOLO deep neural networks, including the YOLO [8], the YOLO9000 [9], and the YOLO-v3 [10]. Comparing the two typical kinds of methods, which emerged around the same time, the former is more accurate while the latter is faster overall.
There are several problems in directly using the aforementioned methods for mud detection. Firstly, since the mud and the ore are difficult to distinguish, a common object detection method cannot give a highly accurate result; a specialized and finer model is needed for higher accuracy. Secondly, the aforementioned methods are strongly supervised and need a large number of exactly labeled samples to train a model. Since mud and ore are similar in the real scene, even experienced experts need careful examination to distinguish them. So, it is expensive and time consuming to exactly label a large number of samples. Last but not least, since the ores from different mines have different colors and textures, a model that can be transferred easily from one mine to another is important for mud detection.
To solve this challenging problem, this paper proposes a weakly supervised method based on deep active learning (AL), named YOLO-AL. The method uses the YOLO-v3 model as the basic detector, which is initialized with the weights pretrained on the MS COCO dataset. Then, an AL framework-embedded YOLO-v3 model is constructed. In the AL framework, it iteratively fine-tunes the last few layers of the YOLO-v3 model with the most important samples.
This paper focuses on the important problem of automatically detecting mud in ores, which is rarely studied. The contributions are summarized in three aspects. (1) We propose a weakly supervised method based on deep active learning for detecting mud in ores, which extensively reduces the human labor for annotating training data while achieving performance comparable with fully supervised learning approaches. (2) We propose a sample selection method based on the Less Confident (LC) strategy, which selects the most valuable samples according to their confidences. The confidence of an object is calculated with the scores predicted by the YOLO-v3 detector. (3) Since the proposed method only fine-tunes the last few layers of the YOLO-v3 model with the most valuable samples, it can easily be transferred from one mine to another.

Related Work
Active learning [11, 12] assumes that the ground-truth labels of unlabeled instances can be queried from a database [13]. For simplicity, it assumes the labeling cost depends only on the number of queries. Thus, the goal of active learning is to minimize the number of queries, so that the labeling cost for training a good model can be minimized. Given a small set of labeled data and abundant unlabeled data, active learning attempts to select the most valuable unlabeled instances to query [13].
Active learning is often used in scenarios where data collection is convenient while sample labeling is expensive. Kapoor et al. [14] combined active learning with Gaussian stochastic processes for object categorization. Yang et al. [15] used AL to train a group of fully convolutional networks (FCN) for biomedical image segmentation. Sun et al. [16] proposed an AL framework based on the MRF model for the spectral-spatial classification of hyperspectral imagery. Yang et al. [17] proposed a semisupervised batch mode multiclass active learning algorithm for visual concept recognition, which combines uncertainty sampling with diversity maximization. Dutt Jain and Grauman [18] proposed an active learning method for natural scene image segmentation, which achieves state-of-the-art performance using significantly less training data.
Recently, weakly supervised learning, in which training sets require only binary labels indicating whether an image contains the object or not, has attracted considerable attention. Han et al. proposed a novel object detection framework by combining weakly supervised learning and high-level feature learning [19]. Zhou et al. proposed a rotation-invariant CNN for object detection in remote sensing images [22]. However, there are few papers that use active learning for object detection, especially for low-distinguishable objects (such as mud and ore). This paper proposes an AL-method-integrated YOLO-v3 model for mud detection in ores. The YOLO-v3 model detects the mud and predicts its class confidence and bounding box confidence. Based on these confidences, the AL selects the most valuable samples to be labeled. With the gradually increasing labeled samples, a more accurate YOLO-v3 model is trained. The method brings at least two benefits. (1) Only the most valuable samples are sent to the expert for labeling, which reduces the number of training samples. (2) Since the expert only needs to check and modify the predicted labels instead of labeling from scratch, the labeling work is reduced further.

Figure 1: The mud in ores. There are only slight differences between mud and ore in shape, color, and texture, and their scales vary significantly. The red squares with confidences are mud.

The Overall Framework.
The overall framework of the proposed deep active learning method, named YOLO-AL, is shown in Figure 2. It contains four basic modules: YOLO-v3 model fine-tuning, object detection, sample selection, and expert verification.
As shown in Figure 2, the proposed method is actually an iterative training process. At the beginning, we initialize a YOLO-v3 model with the weights pretrained on the MS COCO dataset. Then, the last few layers are fine-tuned with a small amount of labeled mud and ore samples to get a new mud detector. With the mud detector, all the unlabeled samples are tested, and a confidence is given to each object. Based on the confidence, a sample selection method selects the most valuable samples, which are sent to the expert. The expert checks and modifies the labels of these samples and adds them to the labeled training set. With the updated training set, the YOLO-v3 model is fine-tuned again. The process iterates with gradually increasing labeled samples until the termination condition is reached.
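The iterative process above can be sketched as a small driver loop. This is a minimal illustration, not the authors' implementation; `fine_tune`, `predict`, the `oracle`, and the `select` strategy are hypothetical stand-ins for the four modules of Figure 2:

```python
def active_learning_loop(model, labeled, unlabeled, oracle,
                         select, increment=50, max_rounds=10):
    # One AL iteration: fine-tune -> detect -> select -> expert verification.
    for _ in range(max_rounds):
        if not unlabeled:
            break
        model.fine_tune(labeled)                 # fine-tune last layers only
        preds = {x: model.predict(x) for x in unlabeled}
        batch = select(preds, increment)         # most valuable samples
        for x in batch:
            labeled.append(oracle(x, preds[x]))  # expert checks/modifies label
            unlabeled.remove(x)
    return model, labeled
```

The loop terminates either when the unlabeled pool is exhausted or after a fixed number of rounds, matching the termination condition mentioned above.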

YOLO-v3 Model Fine-Tuning.
The YOLO-v3 model was proposed in [10] for general object detection in natural scenes. Due to its excellent speed and accuracy, we use the YOLO-v3 as the detector. We initialize the YOLO-v3 model with the weights pretrained on the MS COCO dataset (http://images.cocodataset.org). Then, we fine-tune the last layers of the YOLO-v3 model iteratively in the AL framework.
As Yosinski et al. [23] pointed out, fine-tuning a deep neural network can preserve general features and overcome the difference between datasets to extract specialized high-level features, which helps to quickly construct a new model on a new dataset. In this paper, we freeze the Darknet-53 backbone of the YOLO-v3 and fine-tune the last layers shown in Figure 3. It is worth noting that the Darknet-53 has far more layers and weight parameters than the layers in the dashed box.
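The freezing step can be illustrated with a framework-agnostic sketch, assuming parameters are tracked in a name-to-flag map; the layer names used here are illustrative, not actual YOLO-v3 identifiers:

```python
def freeze_all_but_head(params, head_prefixes=("head.",)):
    # Keep only detection-head parameters trainable; everything in the
    # backbone (e.g. names starting with "darknet53.") stays frozen.
    for name, p in params.items():
        p["trainable"] = name.startswith(head_prefixes)
    return params
```

In a real deep learning framework the same effect is achieved by disabling gradient updates for the backbone parameters before training.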
In Figure 3, the DBL unit consists of three layers: a convolutional (Conv) layer, a Batch Normalization (BN) layer, and a Leaky ReLU activation layer. The ResUnit is a unit with the residual structure. The Resn block is composed of a zero-padding layer, a DBL, and n ResUnits. The DBLU unit consists of a DBL and an upsampling layer, while the DBLC consists of a DBL and a convolutional layer. The Concat layer combines features at different scales. The loss function of the YOLO-AL consists of a coordinate loss, a confidence loss, and a classification loss:

L_coord = λ_coord Σ_{i=1}^{S²} Σ_{j=1}^{B} 1_{i,j}^{obj} [(b_x − b̂_x)² + (b_y − b̂_y)² + (b_w − b̂_w)² + (b_h − b̂_h)²], (1)

L_conf = λ_obj Σ_{i=1}^{S²} Σ_{j=1}^{B} 1_{i,j}^{obj} BCE(C_{i,j}, Ĉ_{i,j}) + λ_noobj Σ_{i=1}^{S²} Σ_{j=1}^{B} (1 − 1_{i,j}^{obj}) BCE(C_{i,j}, Ĉ_{i,j}), (2)

L_class = Σ_{i=1}^{S²} Σ_{j=1}^{B} 1_{i,j}^{obj} Σ_c BCE(p_c, p̂_c), (3)

L = L_coord + L_conf + L_class, (4)

where b_x and b_y are the location and b_w and b_h are the width and the height of the predicted box (hats denote predictions). S² is the number of grid cells, usually set as 13 × 13, 26 × 26, and 52 × 52 for scales from coarse to fine. B is the number of predicted boxes per grid cell. 1_{i,j}^{obj} equals 1 if the j-th box of grid cell i is responsible for an object and 0 otherwise. BCE is binary cross entropy, BCE(p, p̂) = −p log p̂ − (1 − p) log(1 − p̂). C_{i,j} is the objectness confidence, and p_c is the object class probability. λ_coord, λ_obj, and λ_noobj are the weights of the three parts.
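The binary cross entropy term that appears in the confidence and classification losses can be computed with a small helper; this is a generic numeric sketch, not the authors' code:

```python
import math

def bce(p, p_hat, eps=1e-7):
    # Binary cross entropy between target p and prediction p_hat;
    # p_hat is clamped to (eps, 1 - eps) to avoid log(0).
    p_hat = min(max(p_hat, eps), 1.0 - eps)
    return -p * math.log(p_hat) - (1.0 - p) * math.log(1.0 - p_hat)
```

A maximally uncertain prediction (p_hat = 0.5) yields BCE = log 2 regardless of the target, while a confident correct prediction drives the loss toward zero.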

The Sample Selection.
The sample selection strategy is the core of AL [24]. Since the proposed AL method shown in Figure 2 is used for object detection, the sample selection strategy is defined based on the predicted results of the YOLO-v3 detector. The detector predicts the class probability of each object, from which we can calculate the confidence. As shown in Figure 4, the YOLO-v3 predicts a vector containing 3 boxes for each grid cell in the feature map. Each box is a prediction of an object, where p_0 is the objectness score and p_1 ∼ p_n are the class scores for n classes. Here, we only consider 3 classes, namely, the mud, the ore, and others. The objectness score p_0 indicates the possibility that the box contains an object, namely, p(object), while the class score is the posterior probability p(class | object). So, the confidence of a box can be calculated as follows:

p(class) = p(object) · p(class | object) = p_0 · max_{1≤k≤n} p_k. (5)

In AL, the sample selection strategy decides which samples to query, i.e., to be labeled by experts. In this paper, we consider two sample selection strategies: random selection (RS) and less confident (LC). The RS method is referred to as a passive selection method, in contrast to the active selection methods. In the RS method, unlabeled candidates are selected randomly without any active criterion. The RS method often serves as the baseline to be compared with the active selection methods. The LC method selects the samples with the least confidence based on the posterior probabilities of all the classes. When using a probabilistic model for binary classification, the LC method selects the samples whose posterior probability is near 0.5.
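The per-box confidence calculation can be sketched as follows; `box_confidence` is a hypothetical helper name, and the scores in the usage example are illustrative:

```python
def box_confidence(p0, class_scores):
    # p0 is the objectness score p(object); class_scores are the
    # per-class posteriors p(class | object), e.g. (mud, ore, others).
    best = max(range(len(class_scores)), key=lambda k: class_scores[k])
    return p0 * class_scores[best], best
```

For a box with objectness 0.8 and class scores (0.7, 0.2, 0.1), the confidence is 0.8 × 0.7 = 0.56 for class index 0.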
The LC criterion can be written as

x̂_LC = argmin_{x_i ∈ D_u} max_{m ∈ L} p(y_i = m | x_i), (6)

where max_{m ∈ L} p(y_i = m | x_i) means that m is the most possible label of sample x_i and D_u is the unlabeled dataset. Specific to the problem of mud detection based on YOLO-v3, p(y_i = m | x_i) is actually p(class) in formula (5). Denoting this value by p_uncertain, a candidate object is selected when

p_uncertain < θ, (7)

where θ is an uncertainty threshold. If a candidate object satisfies condition (7), it is considered to be an object containing the most useful information and should be labeled as a training sample. Considering the efficiency of the model training, we sort the mud objects in ascending order according to p_uncertain and take the first n mud objects to be labeled. If there is an uncertain object in an image, the image with its labels is put into a dataset I_uncertain, which is recommended to the expert to check and modify the labels.
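The thresholding and ascending sort described above can be sketched as below; the function name and the default threshold value are illustrative assumptions, not the authors' settings:

```python
def select_uncertain(confidences, n, theta=0.5):
    # confidences maps an object id to its p(class) from the detector.
    # Keep objects whose confidence falls below theta, sort them in
    # ascending order of confidence, and take the first n.
    uncertain = [i for i, c in confidences.items() if c < theta]
    uncertain.sort(key=lambda i: confidences[i])
    return uncertain[:n]
```

The least confident objects come first, so the expert always sees the samples the current model understands worst.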
Since the images recommended to the expert already contain predicted objects with labels, the experts only need to modify the class labels or bounding boxes instead of labeling from scratch, which further reduces the work of sample labeling.

The Algorithm.
The YOLO-v3-based AL framework is provided in Algorithm 1.

Dataset Description.
The open MS COCO [25] dataset is used to pretrain the YOLO-v3 model. It is downloaded from http://images.cocodataset.org and contains 12 major classes and 80 subclasses. To form a comparison with the mud and the ore in the Ore Dataset, we only focus on the dog and the cat in the MS COCO 2017 dataset. The dog and the cat belong to the same major class and are highly similar to each other, much like the mud and the ore. The training set of the MS COCO contains 4385 images with 5508 labeled boxes for the dog and 4114 images with 4768 labeled boxes for the cat. Each image is 608 × 608 pixels.
The Ore Dataset is used to fine-tune the YOLO-v3 model. It is collected from a real mine and labeled by experienced workers. Since the actual production is more focused on larger objects, there are no labels for objects with diameters less than 50 mm. It contains 5683 images, and each image is 720 × 640 pixels. The Ore Dataset is also organized in the format of the MS COCO. Different from the MS COCO, each image in the Ore Dataset contains ore objects but only partly contains mud objects. The details are shown in Table 1.
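For illustration, a COCO-style annotation record for the Ore Dataset might look like the following; the file name, ids, and box coordinates are invented for the example and are not taken from the actual dataset:

```python
import json

# Minimal COCO-format record: one 720 x 640 image with one mud box.
ore_record = {
    "images": [{"id": 1, "file_name": "ore_0001.jpg",
                "width": 720, "height": 640}],
    "categories": [{"id": 1, "name": "mud"}, {"id": 2, "name": "ore"}],
    "annotations": [{"id": 1, "image_id": 1, "category_id": 1,
                     "bbox": [120, 80, 210, 180]}],  # [x, y, w, h]
}
serialized = json.dumps(ore_record)
```

Keeping the Ore Dataset in this format lets the same data-loading code serve both the MS COCO pretraining and the fine-tuning stage.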
As shown in Figure 5, the scene of the Ore Dataset is more complex than that of the MS COCO. One image often contains multiple pieces of mud and ore with large scale changes. The inhomogeneous slurry makes the background more complex. Nonuniform illumination and the occlusion between ore and mud further complicate the scene. So, detecting mud from the ores is more challenging than dog or cat detection.

Experiment Design.
In order to verify the effectiveness of the proposed method, this paper designs comparison experiments with the YOLO-v3, YOLO-AL (RS), and YOLO-AL (LC) methods. (1) YOLO-v3. The fully supervised YOLO-v3 is trained with all labeled samples; its results are listed in Table 2 and are marked with a red * in Figure 6. It is worth noting that all samples were used for model training and testing at one time, which is significantly different from the gradually increasing training samples in the YOLO-AL methods.
(2) YOLO-AL (RS). The YOLO-AL (RS) method with the RS strategy randomly selects samples for training. It treats the training samples indiscriminately, the same as in experiment (1), so the two are essentially equivalent; the only difference is that the RS increases the labeled samples gradually, while the YOLO-v3 uses all samples at one time. However, the RS can still reveal how many training samples are enough for training a model. On the cat and dog dataset and the Ore Dataset, the YOLO-AL (RS) was trained with the sample increment h = 50. For simplicity, the expert verification is

Results and Analysis.
To assess the effectiveness of the proposed YOLO-AL model, the average precision (AP) and mean average precision (mAP) are adopted. The AP measures the quality of bounding box predictions in the test set. If the IoU of a predicted box with the ground truth is larger than 0.5, the prediction is considered a true positive [26]. In order to reduce the randomness of the detection performance, we performed five runs for each method. Then, we calculated the mean and standard deviation of the mAP to form Figures 6 and 7, where the colored background areas represent the standard deviation floating range.
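The IoU test underlying the AP computation can be sketched with a standard helper (a generic implementation, not tied to the authors' evaluation code):

```python
def iou(a, b):
    # Boxes given as (x1, y1, x2, y2); returns intersection over union.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union
```

A prediction counts as a true positive when `iou(pred, gt) > 0.5` and the predicted class matches the ground truth.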
In Figures 6 and 7, the x coordinate is the sample number with the increment h = 50, while the y coordinate is the mAP. As shown in Figure 6, the converged mAPs of the three methods have no obvious difference on the dog and cat dataset, which can also be seen from Table 2. However, the proposed methods require much fewer training samples than the YOLO-v3, as shown in Table 3. The required samples of the YOLO-AL (LC), the YOLO-AL (RS), and the YOLO-v3 are 2350, 3100, and 4400, respectively. The YOLO-AL (LC) needs about 53.4% of the samples required by the YOLO-v3 and about 75.8% of those required by the YOLO-AL (RS).
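The reported labeling savings can be checked with quick arithmetic on the required-sample counts from Table 3:

```python
# Required training samples reported for each method (Table 3).
required = {
    "dog_cat": {"LC": 2350, "RS": 3100, "YOLO": 4400},
    "ore":     {"LC": 1950, "RS": 2650, "YOLO": 4400},
}

def lc_fraction(counts):
    # Fraction of the fully supervised sample budget that LC needs.
    return counts["LC"] / counts["YOLO"]
```

For example, `lc_fraction(required["dog_cat"])` is 2350 / 4400 ≈ 0.534, matching the reported 53.4%, and on the Ore Dataset 1950 / 4400 ≈ 0.443 matches the reported 44.3%.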
On the Ore Dataset, the YOLO-AL (RS) is not obviously different from the YOLO-AL (LC). However, the accuracy of the YOLO-AL (LC) is about 1.5% higher than that of the YOLO-AL (RS). This result is remarkable given the complex scene and the low discrimination between ore and mud. The required samples of the YOLO-AL (LC), the YOLO-AL (RS), and the YOLO-v3 are 1950, 2650, and 4400, respectively. The YOLO-AL (LC) needs about 44.3% of the samples required by the YOLO-v3 and about 73.6% of those required by the YOLO-AL (RS). The following conclusions can be drawn. (1) The detection accuracy of the proposed method is no less than that of the YOLO-v3. (2) The proposed method requires obviously fewer training samples than the YOLO-v3. (3) The proposed method can be easily transferred from one mine to another. On the one hand, the proposed method uses the most valuable samples to fine-tune a model, which needs fewer labeled samples. On the other hand, the difference between mines is smaller than that between the Ore Dataset and the COCO dataset, so fewer labeled samples are needed to transfer the model from one mine to another. The main reason may lie in the training process. The AL (LC) can pick the most uncertain samples, which are the most valuable for model training. In other words, a sample that cannot be accurately "understood" by the current model can provide meaningful information for improving model accuracy.
In contrast, the samples that can be accurately "understood" by the current model provide only a little meaningful information and can even be ignored.
Another reason may be that the AL with the LC strategy can prevent the selected samples from being overconcentrated in a certain area of the feature space, which could otherwise lead to biased estimates.
With the model trained by the proposed method, the Ore Dataset was tested. Partial results of the mud detection are shown in Figure 5. For the sake of clarity, the bounding boxes of the ore objects are hidden. Although the scene is complex and the mud and ore differ only slightly in color, texture, and shape, the proposed method can effectively distinguish ore and mud.

Conclusions and Future Work
Automatically detecting mud in bauxite ores is valuable and challenging. This paper proposed a novel weakly supervised method which combines deep active learning and the YOLO-v3 model. To select the most valuable samples, it adopts the Less Confident (LC) strategy according to the confidences of objects predicted by the YOLO-v3 detector. Then, it fine-tunes the model in the AL process with the most valuable samples. More importantly, the proposed method needs much fewer labeled samples than the YOLO-v3 without decreasing the detection accuracy, which extensively reduces the human labor for annotating training data. Also, the proposed method can be easily transferred from one mine to another, which is important for the practical application of mud detection.
In future work, we will study more appropriate sample selection strategies to further reduce the labeling cost. In addition, ores containing more types of impurities will be considered.
Data Availability

The ore images and labeled data used to support the findings of this study are currently under embargo while the research findings are commercialized. Requests for data, 12 months after publication of this article, will be considered by the corresponding author.

Conflicts of Interest
The authors declare that they have no conflicts of interest.