Active^2 Learning: Actively reducing redundancies in Active Learning methods for Sequence Tagging and Machine Translation

While deep learning is a powerful tool for natural language processing (NLP) problems, successful solutions to these problems rely heavily on large amounts of annotated samples. However, manually annotating data is expensive and time-consuming. Active Learning (AL) strategies reduce the need for huge volumes of labeled data by iteratively selecting a small number of examples for manual annotation based on their estimated utility in training the given model. In this paper, we argue that since AL strategies choose examples independently, they may potentially select similar examples, all of which may not contribute significantly to the learning process. Our proposed approach, Active^2 Learning (A^2L), actively adapts to the deep learning model being trained to eliminate such redundant examples chosen by an AL strategy. We show that A^2L is widely applicable by using it in conjunction with several different AL strategies and NLP tasks. We empirically demonstrate that the proposed approach further reduces the data requirements of state-of-the-art AL strategies by ≈3-25% on an absolute scale on multiple NLP tasks, while achieving the same performance with virtually no additional computation overhead.


Introduction
Active Learning (AL) (Freund et al., 1997; McCallum and Nigam, 1998) reduces the need for large quantities of labeled data by intelligently selecting unlabeled examples for expert annotation in an iterative process. Many Natural Language Processing (NLP) tasks like sequence tagging (NER, POS), Neural Machine Translation (NMT), etc., are very data-intensive and require a meticulous, time-consuming, and costly annotation process. On the other hand, unlabeled data is practically unlimited.
Due to this, many researchers have explored applications of active learning for NLP (Thompson et al., 1999; Figueroa et al., 2012). A general AL method proceeds as follows: (i) the partially trained model for a given task is used to (possibly incorrectly) annotate the unlabeled examples; (ii) an active learning strategy selects a subset of the newly labeled examples via a criterion that quantifies the perceived utility of examples in training the model; (iii) the experts verify/improve the annotations for the selected examples; (iv) these examples are added to the training set, and the process repeats. AL strategies differ in the criterion used in step (ii). We claim that all AL strategies select redundant examples in step (ii): if one example satisfies the selection criterion, then many other similar examples will also satisfy it (see the next paragraph for details). As the examples are selected independently, AL strategies redundantly choose all of these examples even though, in practice, it is enough to label only a few of them (ideally just one) for training the model. This leads to higher annotation costs, wastage of resources, and reduced effectiveness of AL strategies. This paper addresses this problem by proposing a new approach called A^2L (read as active-squared learning) that further reduces the redundancies of existing AL strategies.
Any approach for eliminating redundant examples must have the following qualities: (i) the redundancy should be evaluated in the context of the trained model; (ii) the approach should apply to a wide variety of commonly used models in NLP; (iii) it should be compatible with several existing AL strategies. The first point merits more explanation. As a model is trained, depending on the downstream task, it learns to focus on certain properties of the input. Examples that share these properties (for instance, the sentence structure) are similar from the model's perspective. If the model is confused about one such example, it will likely be confused about all of them. We refer to a similarity measure that is computed in the context of a model as a model-aware similarity (Section 3.1).
Contributions: (i) We propose a Siamese-network-based method (Bromley et al., 1994; Mueller and Thyagarajan, 2016) for computing model-aware similarity to eliminate redundant examples chosen by an AL strategy. This Siamese network actively adapts itself to the underlying model as the training progresses. We then use clustering based on the similarity scores to eliminate redundant examples. (ii) We develop a second, computationally more efficient approach that approximates the first one with a minimal drop in performance by avoiding the clustering step. Both of these approaches have the desirable properties mentioned above. (iii) We experiment with several AL strategies and NLP tasks to empirically demonstrate that our approaches are widely applicable and significantly reduce the data requirements of existing AL strategies while achieving the same performance. To the best of our knowledge, we are the first to identify the importance of model-aware similarity and exploit it to address the problem of redundancy in AL.

Related Work
Active learning has a long and successful history in the field of machine learning (Dasgupta et al., 2009; Awasthi et al., 2017). However, as the learning models have become more complex, especially with the advent of deep learning, the known theoretical results for active learning are no longer applicable (Shen et al., 2018). This has prompted a diverse range of heuristics to adapt the active learning framework to deep learning models (Shen et al., 2018). Many AL strategies have been proposed (Sha and Saul, 2007; Blundell et al., 2015; Gal and Ghahramani, 2016a; Haffari et al., 2009; Bloodgood and Callison-Burch, 2010); however, since they choose the examples independently, the problem of redundancy (Section 1) applies to all of them.
We experiment with various NLP tasks like named entity recognition (Nadeau and Sekine, 2007), part-of-speech tagging (Marcus et al., 1993), neural machine translation (Hutchins, 2004; Nepveu et al., 2004; Ortiz-Martínez, 2016), and so on (Tjong Kim Sang and Buchholz, 2000; Landes and Leacock, 1998). The tasks chosen by us form the backbone of many practical problems and are known to be computationally expensive in both training and inference. Many deep learning models have recently advanced the state-of-the-art for these tasks (Siddhant and Lipton, 2018; Lample et al., 2016; Bahdanau et al., 2014). Our proposed approach is compatible with any NLP model, provided it supports the usage of an AL strategy. Many recent attempts at applying active learning to sequence tagging and NMT have been made (Siddhant and Lipton, 2018; Peris and Casacuberta, 2018); however, the issue of redundancy (Section 1) has largely been ignored. Existing approaches have used model-independent similarity scores to promote diversity in the chosen examples. For instance, in Chen et al. (2015), the authors use cosine similarity to pre-calculate pairwise similarity between examples. We instead argue in favor of model-aware similarity scores and learn an expressive notion of similarity using neural networks. We compare our approach with a modified version of this baseline that uses cosine similarity on InferSent embeddings (Conneau et al., 2017).

Proposed Approaches
We use M to denote the model being trained for a given task. M has a module, called the encoder, for encoding the input sentences; for instance, the encoder in M may be modeled by an LSTM (Hochreiter and Schmidhuber, 1997).

Model-Aware Similarity Computation
A measure of similarity between examples is required to discover redundancy. The simplest solution is to compute the cosine similarity between input sentences (Chen et al., 2015; Shen et al., 2018) using, for instance, the InferSent encodings (Conneau et al., 2017). However, sentences that have a low cosine similarity may still be similar in the context of the downstream task. Model M has no incentive to distinguish among such examples. A good strategy is to label a set of sentences that is diverse from the perspective of the model. For example, it is unnecessary to label sentences that use different verb forms but are otherwise similar if the task is agnostic to the tense of the sentence. A straightforward extension of cosine similarity to the encodings generated by model M achieves this. However, such a simplistic approach would likely be incapable of discovering complex similarity patterns in the data. Next, we describe two approaches that use more expressive model-aware similarity measures.
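As a minimal sketch of this straightforward extension, the snippet below computes cosine similarity on encoder outputs; `encoder` is a hypothetical callable standing in for model M's encoder (any function mapping a sentence to a fixed-size vector):

```python
import math

def cosine(u, v):
    # Cosine similarity between two encoding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(a * a for a in v))
    return dot / (nu * nv)

def model_aware_cosine(encoder, sent_a, sent_b):
    # `encoder` stands in for model M's encoder (a hypothetical interface):
    # similarity is computed on the model's encodings, not on raw sentences.
    return cosine(encoder(sent_a), encoder(sent_b))
```

The point of the sketch is only that the same `cosine` becomes model-aware the moment its inputs come from the task model's encoder rather than from a task-agnostic sentence embedder.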

Model-Aware Siamese
In this approach, we use a Siamese network (Bromley et al., 1994) to compute the pairwise similarity between encodings obtained from model M. The Siamese network consists of an encoder (called the Siamese encoder) that feeds on the output of model M's encoder. The outputs of the Siamese encoder are used for computing the similarity between each pair of examples a and b as:

$\mathrm{sim}(a, b) = \exp(-\lVert o_a - o_b \rVert_1), \quad (1)$

where $o_a$ and $o_b$ are the outputs of the Siamese encoder for sentences a and b, respectively. Let N denote the number of examples chosen by an AL strategy. We use the Siamese network to compute the entries of an N × N similarity matrix S, where the entry $S_{ab} = \mathrm{sim}(a, b)$. We then use the spectral clustering algorithm (Ng et al., 2002) to cluster the examples based on S and select a fixed number of examples from each cluster for manual annotation. We train the Siamese encoder to predict the similarity between sentences from the SICK (Sentences Involving Compositional Knowledge) dataset (Marelli et al., 2014) using the mean squared error. This dataset contains pairs of sentences with manually annotated similarity scores. The sentences are encoded using the encoder in M and then passed on to the Siamese encoder for computing similarities. The encoder in M is kept fixed while training the Siamese encoder. The trained Siamese encoder is then used for computing the similarity between sentences selected by an AL strategy for the given NLP task, as described above. As M is trained over time, the distribution of its encoder's output changes, and hence we periodically retrain the Siamese network to sustain its model-awareness.
The number of clusters and the number of examples drawn from each cluster are user-specified hyper-parameters. The similarity computation can be done efficiently by computing the output of the Siamese encoder for all N examples before evaluating equation (1), instead of running the Siamese encoder O(N^2) times. The clustering algorithm runs in O(N^3) time. For an AL strategy to be useful, it should select a small number of examples to benefit from interactive and intelligent labeling. We expect N to be small for most practical problems, in which case the computational complexity added by our approach is only a small fraction of the overall computational complexity of training the model with active learning (see Figure 1).
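The select-then-deduplicate step can be sketched as follows. We assume an exp(−‖·‖₁) similarity on Siamese outputs (in the style of Mueller and Thyagarajan (2016), which the paper cites) and, to keep the sketch dependency-free, a greedy threshold grouping in place of spectral clustering; `threshold` is an illustrative parameter, not from the paper:

```python
import math

def siamese_similarity(o_a, o_b):
    # An exp(-||o_a - o_b||_1)-style score in (0, 1]: identical Siamese
    # outputs give 1, distant outputs approach 0.
    return math.exp(-sum(abs(x - y) for x, y in zip(o_a, o_b)))

def select_representatives(outputs, threshold=0.5):
    # Greedy stand-in for the spectral-clustering step: an example joins an
    # existing cluster if it is similar enough to that cluster's first
    # member, otherwise it starts a new cluster. One representative per
    # cluster is kept for manual annotation.
    clusters = []  # each cluster is a list of example indices
    for i, o in enumerate(outputs):
        for c in clusters:
            if siamese_similarity(o, outputs[c[0]]) >= threshold:
                c.append(i)
                break
        else:
            clusters.append([i])
    return [c[0] for c in clusters]
```

In the paper's pipeline the grouping is done by spectral clustering on the full N × N similarity matrix; the greedy pass above only illustrates how near-duplicate Siamese outputs collapse to a single annotation candidate.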

Integrated Clustering Model
While the approach described in Section 3.2 works well for small to moderate values of N, it suffers from a computational bottleneck when N is large. We integrate the clustering step into the similarity computation step to remedy this (see Figure 1) and call the resultant approach the Integrated Clustering Model (Int Model). Here, the output of model M's encoder is fed to a clustering neural network C that has K output units with the softmax activation function. These units correspond to the K clusters, and each example is directly assigned to one of the clusters based on the softmax output.
To train the network C, we choose a pair of similar examples (say a and b) and randomly select a negative example (say c). We experimented with both the SICK and the Quora Pairs datasets. All examples are encoded via the encoder of model M and then passed to network C. The unit with the highest probability value for example a is treated as the ground-truth class for example b. The objective is to maximize the probability of b belonging to its ground-truth class while minimizing the probability of c belonging to the same class:

$\mathcal{L} = -\lambda_1 \log p^b_{i_a} + \lambda_2 \log p^c_{i_a} + \lambda_3 \sum_{j=1}^{K} \bar{p}_j \log \bar{p}_j. \quad (2)$

Here $\lambda_1$, $\lambda_2$, and $\lambda_3$ are user-specified hyper-parameters, $p^x_j$ is the softmax output of the j-th unit for example x (j = 1, 2, ..., K; x = a, b, c), $i_a = \arg\max_{j \in \{1, 2, \ldots, K\}} p^a_j$, and $\bar{p}_j$ denotes the average of $p^x_j$ over examples. The third term encourages the utilization of all K units across examples in the dataset. As before, a trained network C is used for clustering examples chosen by an AL strategy, and we select a fixed number of examples from each cluster for manual annotation.
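The objective can be sketched on plain softmax vectors as below; the negative-entropy form of the third term and the λ defaults are our assumptions, not values from the paper:

```python
import math

def int_cluster_loss(p_a, p_b, p_c, lam1=1.0, lam2=1.0, lam3=0.1):
    # p_a, p_b, p_c: softmax outputs (lists summing to 1) of network C for a
    # pair of similar examples (a, b) and a negative example c.
    i_a = max(range(len(p_a)), key=lambda j: p_a[j])  # ground-truth cluster for b
    # Pull b toward a's cluster, push c away from it.
    loss = -lam1 * math.log(p_b[i_a]) + lam2 * math.log(p_c[i_a])
    # Third term (our reconstruction): negative entropy of the mean
    # assignment, encouraging the use of all K units.
    mean = [(pa + pb + pc) / 3.0 for pa, pb, pc in zip(p_a, p_b, p_c)]
    loss += lam3 * sum(m * math.log(m) for m in mean if m > 0)
    return loss
```

With this form, a negative example that lands in the same cluster as the positive pair raises the loss, which is the behavior the text describes.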
It is important to note that: (i) these methods are not AL strategies; rather, they can be used in conjunction with any existing AL strategy, and, given a suitable Siamese encoder or clustering network C, they apply to any model M; (ii) our methods compute model-aware similarity, since the input to the Siamese or clustering network is encoded using the model M, and the proposed networks also adapt to the underlying model as the training progresses. Algorithm 1 describes our general approach, called Active^2 Learning.
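The overall Active^2 Learning loop can be sketched as pure control flow; all five callables below are hypothetical stand-ins for the paper's components (the AL strategy, the redundancy filter, the annotator, and the training step):

```python
def active2learning_round(pool, al_select, deduplicate, annotate, train):
    # One A^2L iteration: (1) the AL strategy picks candidates from the
    # unlabeled pool, (2) the model-aware redundancy filter keeps one
    # representative per group of similar candidates, (3) only the survivors
    # go to the human annotator, (4) the model is trained on the new labels.
    candidates = al_select(pool)
    keep = deduplicate(candidates)
    labeled = [annotate(x) for x in keep]
    train(labeled)
    return keep
```

The key structural point is that `deduplicate` sits between selection and annotation, so annotation cost scales with the number of clusters rather than with the raw number of AL-selected examples.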

Experiments
We establish the effectiveness of our approaches by demonstrating that they: (i) work well across a variety of NLP tasks and models; (ii) are compatible with the most popular AL strategies; and (iii) further reduce the data requirements of existing AL strategies while achieving the same performance. In particular, we experiment with two broad categories of NLP tasks: (a) sequence tagging and (b) neural machine translation. Table 1 lists these tasks and information about the corresponding datasets (including the two auxiliary datasets for training the Siamese network; Section 3.2) used in our experiments. We begin by describing the AL strategies for the two kinds of NLP tasks.

Active Learning Strategies for Sequence Tagging
Margin-based strategy: Let $s(y) = P_\theta(Y = y \mid X = x)$ be the score assigned by a model M with parameters θ to output y for a given example x. The margin is defined as the difference between the scores of the best-scoring output $y_{\max}$ and the second-best-scoring output:

$M_{\text{margin}} = s(y_{\max}) - \max_{y \neq y_{\max}} s(y), \quad \text{where } y_{\max} = \arg\max_{y} s(y).$

The strategy selects examples for which $M_{\text{margin}} \leq \tau_1$, where $\tau_1$ is a hyper-parameter. We use the Viterbi algorithm (Ryan and Nudd, 1993) to compute the scores s(y).
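A toy sketch of margin-based selection; `score_fn` is a hypothetical hook returning s(y) for a candidate output, and the outputs are enumerated directly here, whereas the paper obtains the best and second-best scores via Viterbi decoding:

```python
def margin_select(score_fn, candidates, outputs, tau):
    # Select inputs whose gap between the best and second-best output
    # scores is at most tau (small margin = uncertain model).
    selected = []
    for x in candidates:
        scores = sorted((score_fn(x, y) for y in outputs), reverse=True)
        if scores[0] - scores[1] <= tau:
            selected.append(x)
    return selected
```

For sequence tagging the output space is exponential in sentence length, so the explicit enumeration above is only viable for toy label sets; dynamic programming replaces it in practice.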
Entropy-based strategy: All the NLP tasks that we consider require the model M to produce an output for each token in the sentence. Let x be an input sentence that contains n(x) tokens, and define $s_j = \max_{o \in O} P_\theta(y_j = o \mid X = x)$ to be the probability of the most likely output for the j-th token in x. Here O is the set of all possible outputs and $y_j$ is the output corresponding to the j-th token in x. We define the normalized entropy score as:

$M_{\text{entropy}} = -\frac{1}{n(x)} \sum_{j=1}^{n(x)} \log s_j.$

The length normalization by n(x) is added to avoid a bias due to example length, as it may be undesirable to preferentially annotate longer examples (Claveau and Kijak, 2017). The strategy selects examples with $M_{\text{entropy}} \geq \tau_2$, where $\tau_2$ is a hyper-parameter.
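A sketch of this score under our reading of the formula (the average negative log-probability of the per-token best outputs):

```python
import math

def entropy_score(token_probs):
    # token_probs[j] = s_j, the probability of the most likely output for
    # token j. The score is high when the model is unsure even about its
    # best guesses, and is normalized by sentence length.
    n = len(token_probs)
    return -sum(math.log(s) for s in token_probs) / n
```

A sentence whose best per-token outputs all have probability 1 scores 0, while flat per-token distributions push the score up regardless of sentence length.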
Bayesian Active Learning by Disagreement (BALD): Due to stochasticity, models that use dropout (Srivastava et al., 2014) produce a different output each time they are executed. BALD (Houlsby et al., 2011) exploits this variability in the predicted output to compute model uncertainty. Let $y^{(t)}$ denote the best-scoring output for x in the t-th forward pass, and let N be the number of forward passes with a fixed dropout rate; then:

$M_{\text{bald}} = 1 - \frac{\operatorname{count}\big(\operatorname{mode}(y^{(1)}, \ldots, y^{(N)})\big)}{N}.$

Here the mode(·) operation finds the output that is repeated most often among $y^{(1)}, \ldots, y^{(N)}$, and the count(·) operation counts the number of times this output was encountered. This strategy selects examples with $M_{\text{bald}} \geq \tau_3$ (a hyper-parameter).
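This score can be sketched directly from the mode/count definition, with the stochastic forward passes abstracted into a list of predictions:

```python
from collections import Counter

def bald_score(predictions):
    # predictions: best-scoring outputs y^(1..N) from N stochastic forward
    # passes with dropout kept active. The score is 1 - count(mode)/N:
    # 0 when all passes agree, approaching 1 as they disagree.
    n = len(predictions)
    mode_count = Counter(predictions).most_common(1)[0][1]
    return 1.0 - mode_count / n
```

In a real pipeline each prediction would come from running model M with dropout enabled at inference time; here the passes are just precomputed values.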

Active Learning Strategies for Neural Machine Translation
Least Confidence (LC): This strategy estimates the uncertainty of a trained model on a source sentence x by calculating the conditional probability of the prediction ŷ given the source sentence (Lewis and Catlett, 1994):

$M_{\text{lc}} = -\frac{1}{L} \log P_\theta(\hat{y} \mid x),$

where the length normalization by L (the length of the predicted translation ŷ) avoids a bias toward longer sentences.

Table 1 (fragment): datasets include (Tjong Kim Sang and Buchholz, 2000), SEMCOR, Europarl (Koehn, 2005), SICK (Marelli et al., 2014), and Quora Pairs.
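A sketch of the score; `logprob_fn` is a hypothetical hook returning log P(ŷ | x) from the NMT model:

```python
def least_confidence(logprob_fn, src, pred):
    # logprob_fn(src, pred) -> log P(pred | src), the model's log-probability
    # of its own translation. The score is the length-normalized negative
    # log-probability: higher means less confident.
    L = len(pred)
    return -logprob_fn(src, pred) / L
```

Because the normalization divides by the translation length, a long sentence with high per-token confidence does not outrank a short sentence the model genuinely struggles with.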
Coverage Sampling (CS): A translation model is said to cover the source sentence if it translates all of its tokens. Coverage is estimated by mapping a particular source token to its appropriate target token, without which the model may suffer from under-translation or over-translation issues (Tu et al., 2016). Peris and Casacuberta (2018) proposed to use translation coverage as a measure of uncertainty:

$M_{\text{cs}} = \frac{1}{n(x)} \sum_{j=1}^{n(x)} \log\left(\min\left(\sum_{i} \alpha_{i,j},\ 1\right)\right),$

where $\alpha_{i,j}$ denotes the attention probability assigned by the model to the j-th source word when predicting the i-th target word. Note that the coverage score is 0 for samples whose source sentences the model fully covers.
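A sketch of the coverage score computed from a raw attention matrix:

```python
import math

def coverage_score(attn):
    # attn[i][j]: attention weight on source token j when predicting target
    # token i (rows: target positions, columns: source positions).
    # The score is 0 when every source token is fully covered, and grows
    # more negative as source tokens are ignored.
    n_src = len(attn[0])
    total = 0.0
    for j in range(n_src):
        cov = min(sum(row[j] for row in attn), 1.0)
        total += math.log(max(cov, 1e-12))  # guard against a fully ignored token
    return total / n_src
```

The `min(·, 1)` cap means over-attended tokens contribute log(1) = 0, so only under-covered source tokens drag the score down.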
Attention Distraction Sampling (ADS): Peris and Casacuberta (2018) observed that, when translating an uncertain sample, the model's attention mechanism tends to be distracted (dispersed throughout the sentence). Such samples yield attention probability distributions with light tails (e.g., the uniform distribution), which can be detected by computing the kurtosis of the attention weights for each target token $y_i$:

$\operatorname{Kurt}(y_i) = \frac{\frac{1}{n(x)} \sum_{j} \left(\alpha_{i,j} - \frac{1}{n(x)}\right)^4}{\left(\frac{1}{n(x)} \sum_{j} \left(\alpha_{i,j} - \frac{1}{n(x)}\right)^2\right)^2},$

where 1/n(x) is the mean of the distribution of the attention weights (for a target word) over the source words. The kurtosis value is lower for distributions with light tails, so the average of the negative kurtosis values over all words in the target sentence is used as the distraction score.
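A sketch of the kurtosis computation and the resulting distraction score; the zero return for perfectly uniform attention is our guard for the degenerate case, not from the paper:

```python
def attention_kurtosis(weights):
    # weights: attention distribution over n source tokens for one target
    # token; its mean is 1/n because the weights sum to one.
    n = len(weights)
    mu = 1.0 / n
    m2 = sum((w - mu) ** 2 for w in weights) / n
    m4 = sum((w - mu) ** 4 for w in weights) / n
    if m2 == 0.0:
        return 0.0  # perfectly uniform attention: treat as minimal kurtosis
    return m4 / (m2 ** 2)

def distraction_score(attn):
    # Average negative kurtosis over target positions: higher means the
    # attention is more dispersed (lighter tails), i.e. more "distracted".
    return sum(-attention_kurtosis(row) for row in attn) / len(attn)
```

A sharply peaked attention row yields high kurtosis and thus a low (very negative) distraction score, while a spread-out row yields a higher score, matching the selection rule described above.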

Details about Training
For sequence tagging, we use two kinds of architectures: a CNN-BiLSTM-CRF model (CNN for character-level encoding and BiLSTM for word-level encoding) and a BiLSTM-BiLSTM-CRF model (BiLSTM for both character-level and word-level encoding) (Lample et al., 2016; Siddhant and Lipton, 2018). For the translation task, we use an LSTM-based encoder-decoder architecture with Bahdanau attention (Bahdanau et al., 2014). These models were chosen for their performance and ease of implementation.
The Siamese network used for model-aware similarity computation (Section 3.2) consists of two bidirectional LSTM (BiLSTM) encoders. We pass each sentence in a pair from the SICK dataset through model M and feed the resulting encodings to the Siamese BiLSTM encoder. The output is a concatenation of the terminal hidden states of the forward and backward LSTMs, which is used to compute the similarity score using equation (1). As noted before, we keep model M fixed while training the Siamese encoders, and use the trained Siamese encoders for computing the similarity between examples chosen by an AL strategy. We maintain model-awareness by retraining the Siamese network after every 10 iterations.
The architecture of the clustering model C (Section 3.3) is similar to that of the Siamese encoder. Additionally, it has a linear layer with a softmax activation function that maps the concatenation of the terminal hidden states of the forward and backward LSTMs to K units, where K is the number of clusters. To assign an input example to a cluster, we first pass it through the encoder in M and feed the resulting encodings to the clustering model C. The example is assigned to the cluster with the highest softmax output. This network is also retrained after every 10 iterations to retain model-awareness.
The initial data splits used for training the model M were set at 2% of randomly sampled data for sequence tagging (20% for NMT). These are in accordance with the splitting techniques used in the existing literature on AL. The model is then used to provide input for training the Siamese/clustering network using SICK/Quora Pairs. At each iteration, we gradually add another 2% of the data for sequence tagging (5% for NMT) by passing randomly picked samples through the A^2L pipeline (which includes the low-confidence examples extracted in the AL step). We average the results over five independent runs with randomly chosen initial splits. See Appendix C for details on hyper-parameters.

Baselines
We claim that A^2L mitigates the redundancies in existing AL strategies by working in conjunction with them. We validate our claims by comparing our approaches with three baselines that highlight the importance of the various components.
Cosine: Clustering is done based on cosine similarity between the last output encodings (corresponding to the sentence length) from the encoder in M. Although this similarity computation is model-aware, it is simplistic and shows the benefit of using a more expressive similarity measure.
None: In this baseline, we use the AL strategy without applying Active^2 Learning to remove redundant examples. This validates our claim about redundancy in the examples chosen by AL strategies.
Random: No active learning is used, and random examples are selected each time.

Ablation Studies
We perform ablation studies to demonstrate the utility of model-awareness using the following baselines.
InferSent: Clustering is done based on cosine similarity between sentence embeddings (Chen et al., 2015) obtained from a pre-trained InferSent model (Conneau et al., 2017). This similarity computation is not model-aware and shows the utility of model-aware similarity computation.
Iso Siamese: To show that the Siamese network alone is not sufficient and that model-awareness is needed, in this baseline we train the Siamese network directly on GloVe embeddings of the words rather than on the output of model M's encoder. This similarity, which is not model-aware, is then used for clustering.

Results
Figure 2 compares the performance of our methods with baselines.It shows the test-set metric on the y-axis against the percentage of training data used on the x-axis for all tasks.See Figures 6 and 7 in Appendix for additional results.
1. As shown in Figure 2, our approach consistently outperforms all baselines on the chosen tasks. Note that one should observe how fast the performance increases with the addition of training data (and not just the final performance), as we are trying to evaluate the effect of adding new examples. Our ablation studies in Figure 3 show the utility of using model-aware similarity. An interpretation of the plot in the top left corner of Figure 2 (CoNLL 2003 (POS), BALD) is given in Table 3 of the Appendix.
2. In sequence tagging, we match the performance obtained by training on the full dataset using only a smaller fraction of the data (≈3-25% less data compared to state-of-the-art AL strategies; see Table 2). On a large NMT dataset (Europarl), A^2L requires ≈4300 fewer sentences than the Least Confidence AL strategy to reach a BLEU score of 12.
3. While comparing different AL strategies is not our motive, Figure 2 also demonstrates that, using the proposed A^2L framework, one can achieve performance comparable to a complex AL strategy like BALD with simple AL strategies like margin and entropy.
4. Additionally, from Figure 1, it can be observed that for one step of data selection: (i) the proposed MA Siamese model adds minimal overhead to the overall AL pipeline, since it takes fewer than 5 additional seconds (≈1/12 of the time taken by the AL strategy); (ii) by approximating the clustering step, the Integrated Clustering (Int) Model further reduces this overhead to about 2 seconds. However, owing to this approximation, MA Siamese is observed to perform slightly better than the Int Model (Fig. 3). A comparison of training times for the various stages of the A^2L pipeline is provided in Figure 4 of the Appendix.
In Appendix A, we provide a qualitative case study that demonstrates the problem of redundancy. It should be noted that the reported improvement numbers are not relative to any baseline; they represent an absolute improvement and are very significant in the context of similar performance improvements reported in the literature.
Conclusion

In this paper, we showed that one can further reduce the data requirements of Active Learning strategies with our proposed method, A^2L, which uses model-aware similarity computation. We empirically demonstrated that our proposed approaches consistently perform well across many tasks and AL strategies. We compared the performance of our approach with strong baselines to ensure that the role of each component is properly understood.
Active 2 Learning: Actively reducing redundancies in Active Learning methods for Sequence Tagging and Machine Translation: Appendix

A Understanding Redundancy and Model Aware Similarity
To convey the notion of redundancy and the idea of model-aware similarity, in this section we examine some example sentences that were deemed similar by the model-aware Siamese in our proposed approach. To obtain these examples, we followed the training procedure outlined in Section 4.3 for the NER task on the CoNLL 2003 dataset. After the model had seen roughly 10% of the data, we collected examples that were: (i) selected by the AL strategy (BALD) as examples on which the model has low confidence, and (ii) grouped by the clustering procedure into the same cluster based on model-aware Siamese similarity scores. We present two sentences each from some randomly chosen clusters below. Ground-truth tags are reported alongside the words, except for words that belong to the "Other" class. For the sake of comparison, we also provide examples from two clusters that were obtained by using the cosine similarity metric on InferSent embeddings (the InferSent baseline described in Section 4.4). As in the previous case, these examples were selected by the AL strategy (BALD) for the same task and dataset as before. Note that the similarity computation is not model-aware in this case.
As expected, when cosine similarity is used, sentences that have roughly similar content are assigned to the same cluster. However, when model-aware similarity is used, in addition to having similar content, the sentences also have a similar tagging structure. As the InferSent-based similarity is agnostic of the downstream task, it cannot predict similarity between sentences based on that task, unlike the model-aware Siamese approach. For the NER task, however, it is sensible to eliminate sentences with similar tagging structures, as they are redundant as far as learning on the downstream task is concerned.
This example not only supports our claim that AL strategies choose redundant examples, but also highlights the utility of model-aware similarity computation.

B Additional Remarks
In this section, we make a number of additional remarks about the proposed approach.

B.1 What is the significance of our work?
Obtaining labeled data is both time-consuming and costly. Active learning is employed to minimize the labeling effort. However, as we point out in Section 1, existing techniques may select redundant examples for manual annotation. Due to this redundancy, there is scope for improvement in the performance of active learning strategies, and our proposed approach fills this gap. Since we demonstrate that our method is compatible with many active learning strategies and deep learning models currently in use, it can be applied in a wide range of contexts and is likely to be useful for many sub-communities within natural language processing, without adding significant complexity to existing systems.

B.2 How do we validate our claim regarding the sub-optimality of standard AL strategies due to redundancy?
The comparison of our approach with the None baseline suggests that performance comparable to the state-of-the-art can be achieved using fewer labels if one incorporates the second step, which eliminates allegedly redundant examples, even when every other aspect of training is exactly the same (same model, AL strategy, and dataset). Thus, we can say that the discarded examples were of no additional help to the model and hence were redundant. Avoiding annotation of such samples saves time and brings down both computational and annotation costs. This can be especially effective in, for instance, the medical domain, where high expertise is required.

C Hyper-parameters and other Implementation Details

Similar hyper-parameter values work across all the tasks. Hence, the same values were used for all experiments, and these values were determined using the validation set of the CoNLL 2003 dataset for the NER task. We use two different sequence tagging architectures: a CNN-BiLSTM-CRF model (CNN for character-level encoding and BiLSTM for word-level encoding) and a BiLSTM-BiLSTM-CRF model (Lample et al., 2016) (BiLSTM for both character-level and word-level encoding). The CNN-BiLSTM-CRF architecture is a light-weight variant of the model proposed in Siddhant and Lipton (2018), having one layer in the CNN encoder with two filters of sizes 2 and 3, followed by a max pool, as opposed to three layers in the original setup. This modification was found to improve the results. We use GloVe embeddings (Pennington et al., 2014) for all datasets. We apply normal dropout in the character encoder instead of the recurrent dropout (Gal and Ghahramani, 2016b) used in the word encoder of the model presented in Siddhant and Lipton (2018), owing to an improvement in performance. For numerical stability, we use log probabilities; thus, the value of the margin-based AL strategy's threshold lies outside the interval [0, 1]. We use the spectral clustering algorithm (Ng et al., 2002) to cluster the sentences chosen by the AL strategy.

Figure 1 :
Figure 1: Comparison of the time taken for one data selection step in the NMT task by the Model Aware (MA) Siamese and the Integrated Clustering (Int) Model across different AL strategies. It can be observed that A^2L adds a negligible overhead (≈1/12 of the time taken by the AL strategy) to the overall process.

Figure 2 :
Figure 2: [Best viewed in color] Comparison of our approach (A^2L) with baseline approaches on different tasks using different active learning strategies. 1st row: POS, 2nd row: NER, 3rd row: SEMTR, 4th row: NMT. In the first three rows, from left to right, the three columns represent the BALD, Entropy, and Margin AL strategies. The 4th row represents AL strategies for NMT, from left to right (LC: Least Confidence, CS: Coverage Sampling, ADS: Attention Distraction Sampling). Legend: {100% data: full-data performance, A^2L (MA Siamese): Model Aware Siamese, A^2L (Int Model): Integrated Clustering Model, Cosine: cosine similarity, None: AL strategy without the clustering step, Random: random split (no active learning applied)}. See Section 4.4 for more details on the baselines. All results were obtained by averaging over 5 random splits. These plots have been magnified to highlight the regions of interest; for the original plots, refer to Fig. 7 in the Appendix.

Figure 3 :
Figure 3: [Best viewed in color] Ablation studies on the POS task using different active learning strategies. From left to right, the three columns represent the BALD, Entropy, and Margin-based AL strategies. Legend: {100% data: full-data performance, A^2L (MA Siamese): Model Aware Siamese, A^2L (Int Model): Integrated Clustering Model, Iso Siamese: model-isolated Siamese, InferSent: cosine similarity based on InferSent encodings}. See Figure 6 in the Appendix for experiments on other tasks. All results were obtained by averaging over 5 splits.
Model-aware Siamese clusters:
1. Cluster 1:
• Russian (B-MISC) double Olympic (B-MISC) swimming champion Alexander (B-PER) Popov (I-PER) was in a serious condition on Monday after being stabbed on a Moscow (B-LOC) street.
• Vitaly (B-PER) Smirnov (I-PER), president of the Russian (B-MISC) National (I-MISC) Olympic (I-MISC) Committee (I-MISC), said President Boris (B-PER) Yeltsin (I-PER) had given the swimmer Russia's (B-LOC) top award for his Olympic (B-MISC) performance.
2. Cluster 2:
• The newspaper said the Central (B-ORG) Bank (I-ORG) special administration of Banespa (B-ORG) ends on December 30, and after that the bank has to be liquidated or turned into a federal bank, since there are no conditions to return Banespa (B-ORG) to the Sao (B-LOC) Paulo (I-LOC) state government.
• The newspaper said Bamerindus (B-ORG) has sent to the Central (B-ORG) Bank (I-ORG) a proposal for restructuring combined with a request for a 90-day credit line, paying four percent a year plus the Basic Interest Rate of the Central (B-ORG) Bank (I-ORG) (TBC (B-ORG)).
InferSent-baseline clusters:
1. Cluster 1:
• "His condition is serious," said Rimma (B-PER) Maslova (I-PER), deputy chief doctor of Hospital (B-LOC) No (I-LOC) 31 (I-LOC) in the Russian (B-MISC) capital.
• Popov (B-PER) told NTV (B-ORG) television on Sunday he was in no danger and promised he would be back in the pool shortly.
2. Cluster 2:
• MOTORCYCLING - JAPANESE (B-MISC) WIN BOTH ROUND NINE SUPERBIKE RACES.
• Honda's (B-ORG) Takeda (B-PER) was pursued past Corser (B-PER) by the Yamaha (B-ORG) duo of Noriyuki (B-PER) Haga (I-PER) and Wataru (B-PER) Yoshikawa (I-PER), with Haga (B-PER) briefly taking the lead in the final chicane on the last lap.

Figure 4 :
Figure 4: Comparison of the training time for one epoch of full NMT training by the various models at different stages of the pipeline, namely the (Base) LSTM encoder-decoder translation model with attention, the Model Aware (MA) Siamese, and the Integrated Clustering (Int) Model. It can be observed that A^2L adds a negligible overhead to the overall training time as well (≈0.45% of the time taken by the base model).

Figure 5 :
Figure 5: Modeling similarity using the Siamese encoder (enclosed by dotted lines). A pair of sentences from the SICK dataset is fed to the pretrained sequence tagging model. The output of the word encoder is then passed to the Siamese encoder. The last hidden state of the Siamese encoder, corresponding to the sequence length of the sentence, is used for assigning a similarity score to the pair.

Figure 6 :
Figure 6: [Best viewed in color] Ablation studies on different tasks using different active learning strategies. 1st row: NER, 2nd row: SEMTR, 3rd row: CHUNK, 4th row: NMT. In the first three rows, from left to right, the three columns represent the BALD, Entropy, and Margin AL strategies. The 4th row represents AL strategies for NMT, from left to right (LC: Least Confidence, CS: Coverage Sampling, ADS: Attention Distraction Sampling). Legend: {100% data: full-data performance, A^2L (MA Siamese): Model Aware Siamese, A^2L (Int Model): Integrated Clustering Model, Iso Siamese: model-isolated Siamese, InferSent: cosine similarity based on InferSent encodings}. See Section 4.5 for more details. All results were obtained by averaging over 5 random splits.

Figure 7 :
Figure 7: [Best viewed in color] Comparison of our approach (A^2L) with baseline approaches on different tasks using different active learning strategies. 1st row: POS, 2nd row: NER, 3rd row: SEMTR, 4th row: CHUNK. In each row, from left to right, the three columns represent the BALD, Entropy, and Margin-based AL strategies. Legend: {100% data: full-data performance, A^2L (MA Siamese): Model Aware Siamese, A^2L (Int Model): Integrated Clustering Model, Cosine: cosine similarity, None: AL strategy without the clustering step, Random: random split (no active learning applied)}. See Section 4.4 for more details. All results were obtained by averaging over 5 random splits.

Table 2 :
Fraction of data used for reaching full-dataset performance, and the corresponding absolute percentage reduction in the data required, over the None baseline (the AL strategy without the A^2L step), for the best AL strategy (BALD in all cases). Refer to Fig. 7 in the Appendix for the CHUNK plots.

Table 3 :
Interpretation of the plot in the top left corner of Fig. 7 (CoNLL 2003 (POS), BALD) in the Appendix. The values in the cells are F-scores on the test set after training on the corresponding percentage of the data. It can be seen that, as the percentage of labeled data increases, A^2L (MA Siamese) consistently performs better than the other baselines.