Deep neural learning on weighted datasets utilizing label disagreement from crowdsourcing

Experts and crowds can work together to generate high-quality datasets, but such collaboration is not feasible for a large-scale pool of data. In other words, training on a large-scale dataset depends more on crowdsourced datasets with aggregated labels than on labels intensively checked by experts. However, the limited amount of high-quality data can be used as an objective test dataset to build a connection between disagreement and aggregated labels. In this paper, we claim that the disagreement behind an aggregated label indicates more semantics of an instance (e.g., ambiguity or difficulty) than just spam or error assessment. We attempt to take advantage of the informativeness of disagreement to assist the learning of neural networks by computing a series of disagreement measurements and incorporating disagreement with distinct mechanisms. Experiments on two datasets demonstrate that taking disagreement into account, i.e., treating training instances differently, yields improved performance.


Introduction
A well-labeled dataset is essential for training neural networks, where all instances are usually assumed to be correctly labeled. However, researchers have noticed that collected datasets more or less suffer from inevitable labeling errors that can mislead training. Our goal is to take the labeling process into account in order to reduce the impact of wrong labels. A few works try to use selective attention, e.g., [1], over multiple instances to automatically learn to reduce the weights of noisy instances. They achieve this by applying a conditional probabilistic prediction with an accumulative probabilistic optimization function. However, this approach lacks a sound explanation and offers no insight into the labeling process.
Recently, most large-scale datasets are assessed on crowdsourcing platforms, where a labeling task can be set up with each instance being assessed by, e.g., three assessors. A deterministic label is then aggregated for each instance through majority voting or averaging, sometimes with experts involved. As a result, a dataset with one single label per instance is presented. In other words, the disagreement involved in the human assessments is hidden in the dataset.
Though crowd and experts can work together to give rise to better-quality data, it is not realistic to obtain such labels for a large pool of data. In other words, training on a large dataset depends more on crowdsourced data with aggregated labels. Hereby, we claim that the small high-quality dataset can be used as an objective test dataset that enables us to gain insight into how to take full advantage of the crowdsourced dataset together with the disagreement information.
In this paper, instead of dropping these conflicting human assessments, we take advantage of the disagreement during aggregation as a complementary weighting strategy. The underlying assumption is that the disagreement is an important indicator of label quality that can be used to improve learning for a better prediction. Specifically, high disagreement in the labeling of certain instances can be caused by:
• the instance itself being wrong or carrying erroneous information (Error),
• the assessors producing spam or having different background knowledge (Spam),
• the instance itself being ambiguous among several labels (Ambiguity), or
• correct labeling requiring too much effort (Difficulty).
Hence, instances with conflicting labeling could mislead the learning of neural networks. Assuming that these raw assessments are available before aggregation, we propose to take advantage of the disagreement in the labeling as a weight of importance.

Contribution
The main contributions of this paper are as follows:
• We first measure the disagreement with entropy and the Gini index, then normalize them with Gaussian or rank-based weight decision mechanisms.
• Consequently, we define an adapted loss function where the weight decision is adopted in a neural network model, as shown in Fig. 1.
• The experimental results on two scenarios, ideal and realistic, show a promising improvement with our method.
With this work, we encourage dataset providers to keep the raw assessments before aggregation, since they contain more semantics than only spam or noise, from which the dataset-consuming community may benefit.

Organization
This paper is organized as follows: Section 2 discusses related work on label disagreement. Section 3 details the proposed methodology, consisting of the problem formulation, disagreement computation, label distance, entropy-based disagreement, Gini-based disagreement, and other approaches. Experimental results can be found in Section 4, followed by the conclusion in Section 5.

Related works
Imbalanced datasets have been investigated at the class level [2-6]. For instance, in [3], the class distribution is measured and taken into account to weight imbalanced datasets, and a modified kNN algorithm is proposed to demonstrate the resulting improvement. Similarly to kNN, [4] employs k-means to compute the weights for training samples. [2] focuses on reducing the influence of the minority class by using a mixed-kernel based weighted extreme learning machine (MK-WELM).
Other researchers proposed an alternative decision tree model [7] that improves classification accuracy by assigning a weight to each training instance using a naive Bayes classifier.
[8] introduced an active learning method for classification that handles label noise without relying on crowdsourcing. The basic idea is to select instances of high influence and eliminate noisy labels to assist classification. This is the opposite of our method, while being somewhat complementary to it. All of the above-mentioned works rely on datasets with aggregated labels (one instance, one label).
The closest work to ours is the CrowdTruth measures for language ambiguity [9] at the instance level. It shows the benefit that a classifier can gain from ambiguity measures of weighted labels. However, the weights are normalized and re-scaled in a simple linear way, and the adoption of the threshold requires elaborate manipulation of the crowdsourced raw data. More importantly, they only focus on improving the quality of the dataset rather than on the classification kernels, while our work focuses on the formalization of disagreement and its adoption in neural learning models.
Supervised learning models depend solely on ground truth annotated by humans. However, these ground truths can be very noisy, often coming from crowdsourcing platforms like Amazon Mechanical Turk (https://www.mturk.com/). Generally, multiple labels are collected for each example, and the outputs are then combined to alleviate the noise. In this way, unnecessary annotation takes place at the cost of insufficient labeled instances. Two very basic questions arise: how can we learn from noisy workers in the best way, and how can we manage the annotation budget to improve classifier performance? [10] proposed a novel algorithm for modeling workers' quality and labels from noisy crowdsourced data. The proposed model uses the current model to evaluate worker quality from disagreement, and the model is then updated by optimizing the loss function responsible for the current evaluation of worker quality.
The essence of repeated labeling has also been analyzed in several papers. For example, [11] analyzed the consequences of repeated labeling in depth and demonstrated that its value likely depends on the cost of labeling as well as the respective cost of obtaining an unlabeled instance. [12] demonstrates that repeated labeling is essential if the worker quality is below a threshold value. [13,14] mention that the expressiveness of the classifier, as well as several other factors, also plays an important role.

Problem formulation
Assume there is a set of $N$ labeled instances $D = \{(x_i, y_i, A_i)\}$, where $x_i$ is the instance and $y_i$ is its deterministic label aggregated from the assessment set $A_i = \{a_{ij}\}$; that is, for each instance, $m$ workers each give an assessment $a_{ij} \in L$ from the label set $L = \{l_1, \dots, l_k\}$. We also introduce a disagreement value $d_i$ for each instance.

Given a neural network with parameters $\theta$, the softmax layer that outputs the probability distribution $p(y_i \mid x_i, \theta)$ can be expressed as

$$p(y_i \mid x_i, \theta) = \sigma(o), \qquad \sigma(o_i) = \frac{e^{o_i}}{\sum_{j} e^{o_j}}, \qquad o = W\,h(x_i) + b, \quad (1)$$

where $o$ is the output of the neural network corresponding to all labels in $L$, the softmax function $\sigma$ returns a probability vector over the labels, $h$ is the hidden layer (e.g., CNNs, RNNs, or Transformers), and $b$ is a bias. The predicted label is the one with the maximum probability, denoted $\hat{y}_i$. Then we define the loss function as

$$J(\theta) = -\sum_{i=1}^{N} h(d_i) \log p(y_i \mid x_i, \theta), \quad (2)$$

where we incorporate the weight decision $h(d_i)$ for each instance, i.e., the bigger the weight, the more influence the instance has. Subsequently, under our assumption, the label of a training instance is a probability distribution over the labels instead of a single aggregated one. Therefore, we define the new version of the loss function as Eq. (3), where we replace $y_i$ with the label distribution $q_i$ over the label set $L$:

$$J(\theta) = -\sum_{i=1}^{N} h(d_i) \sum_{l \in L} q_i(l) \log p(l \mid x_i, \theta). \quad (3)$$
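The weighted loss of Eq. (2) can be sketched in a few lines of NumPy. This is a minimal illustration with toy values; the function and variable names are ours, not from the paper's implementation.

```python
import numpy as np

def weighted_cross_entropy(probs, targets, weights):
    """Weighted negative log-likelihood in the spirit of Eq. (2).

    probs:   (N, k) predicted label distributions from the softmax layer
    targets: (N,)   aggregated label indices y_i
    weights: (N,)   per-instance weight decisions h(d_i)
    """
    # Negative log-probability of each instance's aggregated label.
    nll = -np.log(probs[np.arange(len(targets)), targets])
    # Each instance's contribution is scaled by its weight decision.
    return float(np.sum(weights * nll))

# Toy check: down-weighting a (hypothetically high-disagreement) instance
# reduces its contribution to the total loss.
probs = np.array([[0.9, 0.1],
                  [0.4, 0.6]])
targets = np.array([0, 0])   # the second instance is poorly predicted
equal = weighted_cross_entropy(probs, targets, np.array([1.0, 1.0]))
damped = weighted_cross_entropy(probs, targets, np.array([1.0, 0.2]))
```

With equal weights this reduces to the standard cross-entropy loss; the weight decision only rescales each instance's term.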
As a result, we focus on two factors: the computation of the disagreement $d_i$ (discussed in Section 3.2) and the weight decision $h(d_i)$ based on the disagreement (introduced in Section 3.3).

Disagreement computation
We employ a variety of variants to compute the disagreement $d_i$, including entropy, Gini, deviation, etc., as shown in Table 1. Ordinal or graded labels are widely used in different datasets; therefore, we take label distances into account when computing disagreement. We achieve this by providing alternative versions of the disagreement formulas with or without considering label distance. For theoretical completeness, we discuss the ordinal type with label distance as well, but our experiments only cover the binary and categorical types; we leave the ordinal type for future work.

Label distance
For the ordinal type, we first define the average labeling distance for $A_i$ as

$$dist(A_i) = \frac{1}{m} \sum_{j=1}^{m} |a_{ij} - y_i|, \quad (4)$$

which computes the distance between each assessor's label $a_{ij}$ and the deterministic label $y_i$, similar to a deviation formula. Then we normalize the distance into $[0, 1]$ to obtain the distance indicator $\delta(A_i)$.

Entropy based disagreement
For categorical labels, the disagreement is defined as the entropy among the assessments $A_i$:

$$ent(A_i) = -\sum_{l \in L} p(l) \log p(l), \quad (5)$$

where $p(l)$ is the relative frequency of label $l$ accumulated over the $m$ workers. For datasets with ordinal labels, we have

$$ent'(A_i) = ent(A_i) \cdot \delta(A_i), \quad (6)$$

where $\delta(A_i)$ is the normalized distance indicator: the more similar the labeling among assessors, the smaller the distance and the smaller the disagreement. For example, if two instances with aggregated label 1 come from the three assessments $\{1, 2, 1\}$ and $\{1, 5, 1\}$ respectively, the entropy values will be the same, but we can observe that the former case has smaller disagreement than the latter. Thus, the ordinal version penalizes the latter case, which has a larger label distance.
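The paper's $\{1, 2, 1\}$ versus $\{1, 5, 1\}$ example can be reproduced with a short sketch. The label-distance scale (the maximum possible distance, here assumed to be 4 for labels ranging over 1..5) is our assumption for normalizing $\delta(A_i)$.

```python
import math
from collections import Counter

def entropy_disagreement(assessments):
    """Entropy over one instance's assessments, as in Eq. (5)."""
    m = len(assessments)
    freqs = [c / m for c in Counter(assessments).values()]
    return -sum(p * math.log(p) for p in freqs)

def label_distance(assessments, y, scale):
    """Normalized average distance to the aggregated label y.
    `scale` (the assumed maximum label distance) maps the result into [0, 1]."""
    avg = sum(abs(a - y) for a in assessments) / len(assessments)
    return avg / scale

# Both instances aggregate to label 1 and have identical entropy...
e1 = entropy_disagreement([1, 2, 1])
e2 = entropy_disagreement([1, 5, 1])
# ...but the ordinal variant ent' = ent * delta penalizes the larger
# label distance of {1, 5, 1}.
d1 = e1 * label_distance([1, 2, 1], 1, scale=4)
d2 = e2 * label_distance([1, 5, 1], 1, scale=4)
```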

GINI based disagreement
For categorical labels, we have

$$gini(A_i) = 1 - \sum_{l \in L} p(l)^2. \quad (7)$$

For labels that are ordinal or graded, we have

$$d_i = gini'(A_i) = gini(A_i) \cdot \delta(A_i). \quad (8)$$

Note that entropy-based disagreement has a smoother range than the Gini-based version.
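The categorical Gini measure is a one-liner over label frequencies; a minimal sketch (function name ours):

```python
from collections import Counter

def gini_disagreement(assessments):
    """Gini impurity over one instance's assessments, as in Eq. (7)."""
    m = len(assessments)
    return 1.0 - sum((c / m) ** 2 for c in Counter(assessments).values())

# Unanimous assessments give 0 disagreement; a three-way split among
# three workers gives the maximum of 2/3.
g_unanimous = gini_disagreement([1, 1, 1])
g_split = gini_disagreement([1, 2, 3])
```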

Deviation based disagreement
For completeness, we introduce deviation-based disagreement as well, but leave its experimental validation for future work. Standard deviation is only applicable to ordinal (graded) or numeric labels. It is defined as

$$std(A_i) = \sqrt{\frac{1}{m} \sum_{j=1}^{m} (a_{ij} - \bar{a}_i)^2},$$

where $\bar{a}_i$ is the average of the $m$ assessments of the instance.
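Under the assumption that the deviation is taken per instance over its $m$ assessments, the standard library covers this directly:

```python
import statistics

def deviation_disagreement(assessments):
    """Population standard deviation of one instance's ordinal/numeric
    assessments (a sketch; the per-instance reading is our assumption)."""
    return statistics.pstdev(assessments)

# {1, 5, 1} spreads further from its mean than {1, 2, 1}, so it gets a
# larger deviation-based disagreement.
```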

Confidence based disagreement
Confidence is calculated by some crowdsourcing websites, e.g., Figure Eight. The confidence is similar to a reversed version of disagreement, but it is calculated with distinct parameters, using internal worker trust computed from each worker's historical performance. There are three steps to calculate confidence with Figure Eight. First, $T(l)$ is defined as the sum of the trust of the workers who assessed an instance as $l$:

$$T(l) = \sum_{j=1}^{m} t_j \cdot \mathbb{1}[a_{ij} = l], \quad (9)$$

where $t_j$ is the internal Figure Eight trust for worker $j$ and $\mathbb{1}[a_{ij} = l]$ indicates whether $a_{ij} = l$. Then, the sum of all trust is defined as

$$T_{all} = \sum_{j=1}^{m} t_j. \quad (10)$$

Finally, the confidence of a label is the ratio of the two sums, $T(l) / T_{all}$.
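A sketch of the trust-weighted ratio described by Eqs. (9) and (10); the trust values below are made-up stand-ins for Figure Eight's internal worker-trust scores, which are not published:

```python
def confidence(assessments, trusts, label):
    """Trust-weighted share of workers agreeing on `label`:
    T(label) / T_all, with per-worker trusts as assumed inputs."""
    agree = sum(t for a, t in zip(assessments, trusts) if a == label)
    return agree / sum(trusts)

# Two of three workers say "yes"; the dissenter carries high trust,
# so the confidence in "yes" is pulled below 2/3.
c = confidence(["yes", "yes", "no"], [0.9, 0.5, 0.8], "yes")
```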

Weight decision based on disagreement
In this section, we introduce approaches to compute the weight $h(d_i)$: normalized, Gaussian-distribution, and rank-based weighting.

Normalized weighting
One simple way is to normalize the disagreement values to obtain the weight:

$$h(d_i) = 1.0 - norm(d_i), \quad (12)$$

where $norm(d_i)$ is the normalized disagreement value. The subtraction from 1.0 turns the disagreement into an ''agreement'' value, since it works as an influence weight in Eq. (2).
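A minimal sketch of this weighting; min-max scaling is our assumed choice of normalization, since the paper does not fix one:

```python
def normalized_weights(disagreements):
    """h(d_i) = 1 - norm(d_i), with min-max scaling as the assumed norm."""
    lo, hi = min(disagreements), max(disagreements)
    span = (hi - lo) or 1.0   # avoid division by zero when all values tie
    return [1.0 - (d - lo) / span for d in disagreements]

# Most-disagreed instance gets weight 0, least-disagreed gets weight 1.
w = normalized_weights([0.8, 0.5, 0.2])
```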

Gaussian distribution weighting
Assuming $d_i$ follows a Gaussian (normal) distribution, we consider rewarding some instances while penalizing others by defining the weight as

$$h'(d_i) = e^{-d_i}, \quad (13)$$

where $e^{-d_i} \in (0, 1]$ and $e$ is the natural constant. The formula gives an instance less weight the more disagreement it contains.
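The mapping itself is a one-liner (function name ours):

```python
import math

def exp_weight(d):
    """h'(d_i) = e^{-d_i}: maps disagreement d_i >= 0 into (0, 1],
    so more disagreement means less weight."""
    return math.exp(-d)
```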

Rank based weighting
As a complement, we propose another alternative weighting formula based on rank position. The underlying assumption is that disagreement values vary case by case, especially for datasets with few assessors, while their ranks reflect a simple but stable pattern. Therefore, we normalize the weight based only on the rank position of the disagreement value. We turn the disagreement values into a ranked permutation $\{d_{\pi(1)}, d_{\pi(2)}, \dots, d_{\pi(N)}\}$ in descending order, from most disagreement (worst quality) to least disagreement (best quality). Note that we take ties into account; for example, the values $[0.8, 0.7, 0.7, 0.5]$ are turned into the ranks $[1, 2, 2, 3]$. Then, the weight value is defined as

$$h'(d_i) = \log_N(rank_i), \quad (14)$$

where the base of the logarithm is $N$ so that the weight falls in the range $[0, 1]$. For instance, the first instance has weight $\log_N 1 = 0$, while the last instance has weight $\log_N N = 1$.
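A sketch of rank-based weighting with tie-aware (dense) ranks, using the paper's example values; taking the logarithm base as the largest rank, so the endpoints land exactly on 0 and 1, is our assumption:

```python
import math

def rank_weights(disagreements):
    """Tie-aware rank weighting: dense ranks in descending order of
    disagreement, then h'(d_i) = log(rank_i) with base = largest rank,
    so the noisiest instance gets 0 and the cleanest gets 1.
    Assumes at least two distinct disagreement values."""
    distinct = sorted(set(disagreements), reverse=True)
    rank = {d: r + 1 for r, d in enumerate(distinct)}
    base = len(distinct)
    return [math.log(rank[d], base) for d in disagreements]

# The paper's example: [0.8, 0.7, 0.7, 0.5] -> ranks [1, 2, 2, 3].
w = rank_weights([0.8, 0.7, 0.7, 0.5])
```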

Models
We choose the most representative deep neural networks, i.e., convolutional neural networks (CNNs) and Transformers (DistilBERT) in our experiment to conduct an apples-to-apples comparison.
For CNNs, we use the multi-scale CNNs in line with the work [15], which concatenates a set of convolutional kernels with different kernel sizes.For Transformer, we use DistilBERT, which is a fast and lightweight version of BERT.

RNN, CNN and transformers
RNNs [16-18] and CNNs [19-21] are two well-studied, successful neural network frameworks that perform effectively on a variety of tasks across different datasets. RNNs have the advantage of capturing long-term dependencies in sequential data, for tasks that can be shaped as ''what will happen next, given. . .''. In NLP scenarios, the words of a given sentence are often treated as such a sequence. In contrast, CNNs are notably successful on image classification tasks and have been adapted to text classification as well, with competitive performance [22-27].
A more recent and more powerful deep learning framework is the Transformer, used by models such as BERT (Bidirectional Encoder Representations from Transformers) [28]. The core component of BERT is the Transformer encoder, which is pre-trained as a bidirectional encoder representation on unlabeled texts with masks. Therefore, BERT is also called a masked language model.

Experiment
Taking reality into consideration, we conduct our experiments in two scenarios: the first on an ideal dataset that has both the raw crowdsourcing information and the deterministic labels after aggregation, and the other on a realistic dataset with only aggregated labels and a confidence score for each label. In real-world settings, the latter case is more common. We discuss both within our framework. Although we introduce an ordinal version of disagreement for the completeness of our method, we do not validate it in this section but leave that as future work.

Dataset description
For the ideal scenario, we use the CrowdTruth medical relation extraction dataset [9], where the labels and the disagreement weights are aggregated with different strategies. There is a total of 3984 instances, of which 1043 contain treatment relations and 1787 contain causal relations. The agreement of crowd and expert in sentences for the negative and positive thresholds for the Cause and Treat relations can be seen in Figs. 2 and 3. We use the provided train, validation, and test partitions. In particular, the labels we use in the train and validation sets are expected to be ''relatively'' correct, since they are aggregated from the crowd, while the labels in the test set are expected to be ''absolutely'' correct, obtained by selecting those with more than 75% agreement between crowd and expert.
For a more realistic scenario, we use the ''Sentiment Analysis - Global Warming/Climate Change'' dataset from Figure Eight. The global warming dataset assesses tweets for belief in the existence of global warming or climate change. The label is ''Yes'' if the tweet suggests global warming is occurring, ''No'' if the tweet suggests global warming is not occurring, and ''I cannot tell'' if the tweet is unrelated or ambiguous with respect to global warming. The dataset also includes a confidence score for the classification of each tweet. There is a total of 6090 instances. Following tradition, we randomly partition the dataset into 80% for training and 20% for testing.

Hardware setting
The hardware setting is listed in Table 2. It is worth mentioning that we used a normal CPU server instead of a GPU server.

Results and discussion
For the ideal datasets, the results are shown in Table 3. ''Baseline'' is the performance for labels (positive or negative) aggregated for each instance by the distant supervision method, based on whether the relation is expressed between the two terms in the sentence; ''Expert'' is the performance for labels based on an expert's judgment as to whether the baseline label is correct; and ''Crowd'' is the performance for the score used to train the relation extraction classifier [29] with crowd data.
We can observe that for the ''cause'' relation dataset with the CNN encoder, we achieve an F1 of 0.73 with an absolute improvement of 9.2%, and an F1 of 0.858, a slight improvement of 0.2%, for the treat relation. For the former, we have high recall and satisfactory precision, while for the latter, the recall is lower but the precision is satisfactory. To examine a different encoder, we show the results with DistilBERT on the same datasets in Table 4.
We can observe that the best improvement is almost the same or similar, with an F1 of 0.72. The consistent finding, however, is that the adoption of disagreement leads to improvements.
Among our methods, we find that the combination of entropy-based disagreement and rank-based weighting gains better performance. This may be because entropy has a smoother range as a disagreement measurement, while rank-based weighting stays within a controlled range across distinct datasets.
For the realistic dataset with three-class prediction, the results are shown in Table 6. In this case, we do not compute a disagreement but only a weight decision. The baseline is the result without any weighting, treating each instance as equally important. We can observe that rank-based weighting achieves the best performance, with an absolute improvement of 2.39% accuracy. This gives us the hints that (1) even lacking the raw data, we can still use the confidence to assist learning, and (2) different weighting mechanisms over the confidence value lead to different performance (see Table 5).

Conclusion
We claim that the disagreement behind the aggregated label of an instance contains more semantics than merely spam or noise, and that it can be employed to assist the learning of neural networks. Therefore, we propose to incorporate the disagreement as an instance weight in an adapted loss function for deep neural networks. To achieve this, we measure the disagreement with distinct mechanisms, including entropy and the Gini index, followed by a Gaussian or rank-based weighting decision.
The design has the advantage of avoiding threshold analysis on the raw annotation data. We validate our method in two scenarios: one on an ideal dataset with information from crowdsourcing through aggregation (medical information), and the other on a close-to-reality dataset with only the aggregated labels and confidence scores.
The experiments demonstrate that the weighted decision improves performance by an absolute 7.19% (F1) on the ideal dataset and an absolute 2.39% (accuracy) on the realistic dataset.

Fig. 1 .
Fig. 1.The scheme of the training on a weighted network.

Fig. 2 .
Fig. 2. Agreement of crowd and expert in sentences for negative and positive threshold for Cause.

Fig. 3 .
Fig. 3. Agreement of crowd and expert in sentences for negative and positive threshold for Treat.

Table 1
Label types that different disagreement measurements support.

Table 2
Hardware settings.

Table 3
Performance of different methods on medical info dataset for cause relation with CNN encoder.

Table 4
Performance of different methods on medical info dataset for cause relation with DistilBERT encoder.

Table 5
Performance of different methods on medical info dataset for treat relation.

Table 6
Performance of different methods on global warm dataset.