On the current state of reproducibility and reporting of uncertainty for Aspect-based Sentiment Analysis

Abstract

For the latter part of the past decade, Aspect-Based Sentiment Analysis has been a field of great interest within Natural Language Processing. Supported by the Semantic Evaluation Conferences in 2014–2016, a variety of methods competing to improve performance on benchmark data sets has been developed. Exploiting the transformer architecture behind BERT, results improved rapidly, and efforts in this direction continue today. Our contribution to this body of research is a holistic comparison of six different architectures which achieved (near) state-of-the-art results at some point in time. We utilize a broad spectrum of five benchmark data sets and introduce a fixed setting with respect to the preprocessing, the train/validation splits, the performance measures and the quantification of uncertainty. Overall, our findings are two-fold: First, we find that the results reported in the scientific articles are hardly reproducible, since in our experiments the observed performance (most of the time) fell short of the reported one. Second, the results are burdened with notable uncertainty (depending on the data splits), which is why reporting uncertainty measures is crucial.

On the other hand, data in written form is available in huge amounts and thus might be an important source of valuable information. For instance, the internet is full of comparison portals, forums, blogs and social media posts where people share their opinions.

Joint and collapsed models differ in their labeling mechanisms: there are two label sets for joint models, one to indicate whether a word is part of an aspect term and the other one to state its polarity. For collapsed models, a unified labeling scheme indicates whether a word is part of a positive, negative or neutral aspect term or not.
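As a brief illustration of the two schemes (the tag names below are merely illustrative assumptions of ours, not the exact label sets of the models discussed):

```python
# Minimal sketch of the two labeling schemes for ATE+ATSC.
# The tag names (B/I/O, POS/NEG/NEU) are illustrative assumptions,
# not the exact label sets used by the reproduced models.
tokens = ["The", "battery", "life", "is", "great"]

# Joint scheme: two parallel label sequences, one marking aspect
# term boundaries, one marking the polarity of those tokens.
aspect_tags   = ["O", "B",   "I",   "O", "O"]
polarity_tags = ["O", "POS", "POS", "O", "O"]

# Collapsed scheme: a single unified label sequence that encodes
# both pieces of information at once.
collapsed_tags = ["O", "B-POS", "I-POS", "O", "O"]
```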
We re-evaluate four different models for ATSC, covering a variety of different architectures (RNNs, Capsule networks, LCF-based, BERT-based), as well as two different ATE+ATSC models, one of which is a pipeline approach while the other one works in a collapsed fashion. All models are retrained five times using five different train/validation splits (identical across all models) and tested on the respective test sets in order to (i) compare them on common ground and (ii) quantify the epistemic uncertainty associated with the architectures and the data.
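Schematically, this evaluation protocol looks as follows; every name in the sketch is a placeholder of ours, not code from the reproduced repositories:

```python
import random

# Illustrative sketch of the re-evaluation protocol; all names are
# placeholders, not code from the reproduced repositories.
MODELS = ["MGATN", "CapsNet-BERT", "BERT+TFM", "GRACE"]  # subset, for brevity
SPLITS = range(5)  # indices of the five fixed train/validation splits
RUNS = range(5)    # five random initializations per split

def train_and_evaluate(model, split_id, seed):
    """Stand-in for: train `model` on split `split_id` with the given
    random seed, then return its score on the official test set."""
    random.seed(hash((model, split_id, seed)))
    return random.uniform(0.6, 0.8)  # dummy score instead of real training

# 5 splits x 5 runs = 25 scores per model and data set: the basis for
# the uncertainty quantification described above.
results = {
    (m, s, r): train_and_evaluate(m, s, r)
    for m in MODELS for s in SPLITS for r in RUNS
}
```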

Related work
Related experiments were conducted by Mukherjee et al. (2021), yet with a different focus. On the one hand, the authors also try to reproduce results on the SemEval-14 benchmark data sets about Restaurants and Laptops. However, they selected six models other than ours, for which the implementations are provided in a single repository. For these, the authors observed a consistent drop of 1-2% with respect to both accuracy and the macro-averaged F1-score (F1-macro).

ARTS The Aspect Robustness Test Set (ARTS; Xing et al., 2020) augments the SemEval-14 test sets using three strategies. The first, REVTGT ("reverse target"), aims to reverse the sentiment of the chosen aspect term (also called the "target aspect"). This is achieved by flipping the opinion using antonyms or by adding negation words like "not". Additionally, conjunctions may be changed in order to make the sentences sound more fluent. Another strategy to augment the test set is REVNON ("reverse non-target"), for which the sentiments of non-target aspects are (i) reversed if they have the same sentiment as the target aspect or (ii) exaggerated if the non-target aspect is of a differing polarity. The third strategy, called ADDDIFF ("add different sentiment"), adds non-target aspects with an opposite sentiment, which is intended to confuse the model. These non-target aspects are selected from a set of aspects collected from the whole data set and appended to the end of the sentence. ARTS provides test sets only, to be used after training an architecture on the respective SemEval-14 training sets. The test sets for both Restaurants and Laptops are publicly available. During the preparation of the ARTS data for CapsNet-BERT, we noticed that the start and end positions of some aspect terms were not correct. We changed them in order to make the code work properly, and we also deleted duplicates (cf. Xue and Li (2018)).
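To illustrate the three augmentation strategies, consider the following toy example; it is our own construction, not an actual ARTS instance:

```python
# Toy illustration of the three ARTS strategies (our own example).
# Target aspect: "battery life" (positive); non-target aspect: "screen".
original = "The battery life is great and the screen is dull."

# REVTGT: reverse the target aspect's sentiment via antonyms/negation.
revtgt = "The battery life is terrible and the screen is dull."

# REVNON: the non-target aspect has a differing (negative) polarity,
# so its sentiment is exaggerated rather than reversed.
revnon = "The battery life is great and the screen is extremely dull."

# ADDDIFF: append a non-target aspect with the opposite sentiment.
adddiff = "The battery life is great and the screen is dull, but the keyboard is awful."
```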
For these specific test sets, the Aspect Robustness Score (ARS) was introduced by Xing et al. (2020) in order to measure how well models can deal with such variations of sentences. For this purpose, each sentence and all its variations are regarded as one unit, for which the prediction is only considered correct if the predictions for all variations are correct. These units, alongside their corresponding predictions, are then used to compute the regular accuracy on the unit level.
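A minimal sketch of this unit-level computation (function and variable names are ours):

```python
from collections import defaultdict

def aspect_robustness_score(predictions):
    """Unit-level accuracy as described above: a unit (one source
    sentence plus all of its ARTS variations) only counts as correct
    if every single prediction within the unit is correct.
    `predictions` is an iterable of (unit_id, is_correct) pairs."""
    units = defaultdict(list)
    for unit_id, is_correct in predictions:
        units[unit_id].append(is_correct)
    unit_correct = [all(flags) for flags in units.values()]
    return sum(unit_correct) / len(unit_correct)

# Toy example: one unit fully correct, one unit with a single error.
preds = [("s1", True), ("s1", True), ("s2", True), ("s2", False)]
print(aspect_robustness_score(preds))  # 0.5
```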
More Data Sets Recently, more data sets have been published in addition to the ones mentioned beforehand. Mukherjee et al. (2021) proposed two new data sets about Men's T-Shirts and Television.
The YASO data set (Orbach et al., 2020) has a different structure, as it is a multi-domain collection. This is an interesting approach, yet also the reason why we do not consider it in our experiments: this data set is far better suited for cross-domain analyses, which are out of the scope of this work.

Models
MGATN A multi-grained attention network (MGATN) was proposed by Fan et al. (2018). Its multi-grained attention mechanism is able to take into account the interactions between aspects. We chose MGATN since it is reported to be the best-performing RNN-based model on the SemEval-14 data sets.
CapsNet-BERT Capsule Networks were initially proposed for the field of Computer Vision (Hinton et al., 2011; Sabour et al., 2017), with the so-called capsules being responsible for recognizing certain implicit entities in images. Each capsule performs internal calculations and returns a probability that the corresponding entity appears in the image. A variation of Capsule Networks for ATSC and its combination with BERT was introduced by Jiang et al. (2019). It was reported to outperform all other capsule networks with respect to accuracy on the SemEval-14 Restaurants data. Additionally, it performed second-best on MAMS, which is why we selected it for this study.
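As a brief, generic illustration of the capsule idea: the squashing non-linearity of Sabour et al. (2017) rescales a capsule's output vector so that its length lies in [0, 1) and can be read as a presence probability. This sketch is ours and not part of the CapsNet-BERT code:

```python
import numpy as np

def squash(s, eps=1e-8):
    """Squashing non-linearity from Sabour et al. (2017): rescales a
    capsule's raw output vector so that its length lies in [0, 1) and
    can be interpreted as the probability that the entity is present."""
    norm_sq = np.sum(s ** 2)
    return (norm_sq / (1.0 + norm_sq)) * s / np.sqrt(norm_sq + eps)

capsule_output = squash(np.array([0.5, 1.0, -0.3]))
presence_prob = np.linalg.norm(capsule_output)  # vector length ~ probability
print(presence_prob)
```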
Furthermore, we assumed their results on the SemEval-14 Restaurants data to be for three-class classification, as all the other results they refer to are also three-class. Yet, this is not fully clear to us, which makes this experiment even more interesting.

… approaches. Yet, this only holds for the variant that is trained using additional domain adaptation.

CapsNet-BERT Comparing all the selected models on the ATSC task, CapsNet-BERT performed best on all data sets with respect to all metrics, except for the ARS accuracy on the ARTS Restaurants data (cf. Tab. 4). For ARTS, it seems as if the reported ARS accuracy for Laptops matched our result for Restaurants, and vice versa, as Fig. 3 illustrates.
As far as we can tell, we did not mix up the data sets in our calculations, which makes this look quite peculiar. The difference between the reported and reproduced values on the SemEval-14 Restaurants data (as shown in Fig. 1b) may be explained by the fact that we performed three-class classification and only assumed this setting for the reported value.

On Restaurants, our results were better than the reported ones (cf. Fig. 6b), while on Laptops we could not quite reach the reported performance (cf. Fig. 6a). In the latter case, the results of single runs were better than (or at least equal to) the reported one, which is symptomatic of the problem at hand: if we had only reported the best of all runs, our conclusion would have been that we were able to outperform the original model. However, as …

B Complete results
The following tables show the quantitative results of our experiments. For SemEval-14, five train/validation splits were created out of the original training set. On each split pair, five runs were performed, which leads to split-specific means and standard deviations. The overall mean and standard deviation include all runs of all splits. Consequently, they are based on 25 values for the SemEval-14 and ARTS data and on five values for the MAMS data (as no splits were applied there).
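The aggregation itself can be sketched as follows; the array layout is our assumption:

```python
import numpy as np

# Sketch of the aggregation described above; the array shape is our
# assumption: scores[i, j] holds the test metric of run j on
# train/validation split i (5 x 5 = 25 runs for SemEval-14).
scores = np.random.rand(5, 5)  # placeholder for real run results

split_means = scores.mean(axis=1)        # one mean per split
split_stds = scores.std(axis=1, ddof=1)  # one std per split

overall_mean = scores.mean()             # based on all 25 runs
overall_std = scores.std(ddof=1)
print(split_means, overall_mean)
```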
Natural Language Processing (NLP) has profited a lot from technical and algorithmic improvements within the last years. Before the successful times of Machine Learning and Deep Learning, NLP was mainly based on what linguists knew about how languages work, i.e., grammar and syntax. Thus, primarily rule-based approaches were employed in the past. Nowadays, far more generalized models based on neural networks are able to learn the desired language features.
Mukherjee et al. (2021) reported a doubling of this drop when using 15% of the training data as validation data. On the other hand, they carried out additional tasks, which included the creation of two new data sets about Men's T-Shirts and Television as well as the evaluation of the models on them. Furthermore, they also experimented with cross-domain training and testing. Yet, several important points are not addressed by their work, which is why we investigate them here. First, while they mostly care about comparing different types of architectures (Memory Networks vs. BERT), we instead focus on comparing the best-performing models for different tasks (ATSC vs. ATE+ATSC). Moreover, we cover a larger variety of architecture types by selecting the best-performing representatives of several different types. Second, we stick closer to the original implementations (by using them when available), whereas they exclusively rely on community-designed implementations, which adds a further layer of uncertainty.

BERT+TFM The approach described by Li et al. (2019) is the collapsed model in our selection and was reported to deliver state-of-the-art results at the point of its introduction. There were also models using other layers on top of BERT instead of the Transformer layer, but our variant of choice was TFM, as it produced slightly better results than the rest.

GRACE GRACE, a Gradient Harmonized and Cascaded Labeling model introduced by Luo et al. (2020), belongs to the category of pipeline approaches. It includes a post-training step of the pre-trained BERT (Devlin et al., 2019) model using Yelp and Amazon data (He and McAuley, 2016). The post-trained model then shares its first l layers between the ATE and the ATSC task. The remaining layers are only used for the former. They are followed by a classification layer for the detected aspect terms. These classification outputs are then used again as inputs for a Transformer decoder which performs the sentiment classification. The principle of using the first set of labels as input for the second one is called Cascaded Labeling and is assumed to deal with interactions between different aspect terms. Gradient Harmonization is applied in order to cope with imbalanced labels during training. GRACE appears to be the best of the pipeline models according to the literature. Furthermore, it is reported to be the best ATE+ATSC model on both SemEval-14 data sets. However, these successes have to be interpreted with care, as the results are based on four-class classification. This means that, in contrast to the other authors' settings, they did not exclude conflicting reviews from the SemEval-14 data. Thus, our analysis contributes to comparability even more, since it has not been established yet for our model-data combinations.

Experiments

We re-evaluate six models (cf. Sec. 3.2) on the five data sets presented in Sec. 3.1. Our overall goals are to establish comparability between the models, to examine whether the reported performance can be reproduced, and to quantify the epistemic model uncertainty that might exist due to the lacking knowledge about the train/validation splits. First, we re-use the implementations provided by the authors and try to reproduce their results on the data sets they used. Second, we adapt their code to the remaining data sets and conduct the necessary modifications, again sticking as closely as possible to the original hyperparameter settings (cf. Appendix A). The biggest change we made was drastically increasing the number of training epochs and adding an early stopping mechanism. For all ATSC models, we selected the optimal model during the training process based on the validation accuracy and/or F1-macro; for GRACE, model selection was based on ATSC-F1-micro in order to match the calculation used by BERT+TFM. For performing the experiments, we had a Tesla V100 PCIe 16GB GPU at our disposal.

Data Preparation Unlike the other data sets, both SemEval-14 data sets come without an official validation split. Thus, we created five different train/validation splits (90/10) for each of the two SemEval-14 training sets. For each split, five training runs with different random initializations were conducted per model. The resulting 25 different versions per model per data set were subsequently evaluated on the two official SemEval-14 test sets as well as on the ARTS test sets. In Sec. 5, we report overall means per model per test set as well as means and standard deviations per model and test set for each of the different splits.
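For illustration, the split creation can be sketched as follows, assuming scikit-learn; the concrete seeds and tooling of our pipeline may differ:

```python
from sklearn.model_selection import train_test_split

# Placeholder data standing in for one of the two SemEval-14 training
# sets; in practice each element is an annotated sentence.
training_set = [f"sentence_{i}" for i in range(1000)]

splits = []
for seed in range(5):
    # Five different 90/10 train/validation splits, one per seed. The
    # actual seeds may differ; what matters is that the same five
    # splits are reused for every model.
    train, val = train_test_split(training_set, test_size=0.1, random_state=seed)
    splits.append((train, val))
```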

Figure 1: Comparison of reported and reproduced performance. The reproduced value is the mean of all 25 runs per model in total. Further, 95% bootstrap (n = 2000) confidence intervals are displayed. Note that the absolute performance of GRACE (four classes) and BERT+TFM cannot be compared to the other models due to the different tasks. No F1-micro was reported for CapsNet-BERT on SemEval-14 Laptops.
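For reference, such intervals can be obtained with a percentile bootstrap along these lines; this is a sketch of ours, not the original plotting code:

```python
import numpy as np

def bootstrap_ci(values, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean, matching
    the n = 2000 resamples mentioned in the caption (our sketch,
    not the original plotting code)."""
    rng = np.random.default_rng(seed)
    means = [rng.choice(values, size=len(values), replace=True).mean()
             for _ in range(n_boot)]
    lower = np.percentile(means, (1 - level) / 2 * 100)
    upper = np.percentile(means, (1 + level) / 2 * 100)
    return lower, upper

run_scores = np.random.rand(25)  # placeholder for the 25 run results
print(bootstrap_ci(run_scores))
```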

It is also interesting to see how different runs can lead to rather broad ranges of results, even though only five training runs per model and data split were performed. An example of this phenomenon is the accuracy of MGATN on SemEval-14 Laptops (cf. Fig. 2). We do not give a similar figure for MAMS or ARTS, as there are not enough reported values to form a good graph.

For the first, the fourth and the fifth split, all of the values lie very close together (within mean ± std), whereas the results of the other two splits show a rather high variance.

MGATN For MGATN, our reproduced results fell short of the reported accuracy values by around five to ten percentage points on SemEval-14 Laptops and Restaurants, respectively (cf. Tab. 4). Fig. 2 depicts the results on Laptops; the difference between reported and reproduced performance on the Restaurants data (not shown) looks similar. A reason for this behavior might be that we could not use the official implementation of the authors. In terms of ARS accuracy on ARTS Restaurants, MGATN was the only model that reached merely a single-digit value, which means that it is not good at dealing with perturbed sentences.

Figure 2: Example for high differences between data splits: Accuracy of MGATN on SemEval-14 Laptops.

Table 1: Descriptive statistics for the five utilized data sets. "Multi-sentiment sentences" are those with at least two different polarities after removing the "conflict" polarity. "Aspect terms in total" also excludes "conflict".