Multi-Task Attentive Residual Networks for Argument Mining

We explore the use of residual networks and neural attention for multiple argument mining tasks. We propose a residual architecture that exploits attention, multi-task learning, and makes use of ensemble, without any assumption on document or argument structure. We present an extensive experimental evaluation on five different corpora of user-generated comments, scientific publications, and persuasive essays. Our results show that our approach is a strong competitor against state-of-the-art architectures with a higher computational footprint or corpus-specific design, representing an interesting compromise between generality, performance accuracy and reduced model size.


I. INTRODUCTION
ARGUMENT mining (AM) is an area of natural language processing (NLP) defined by a variety of tasks that aim to extract and structure arguments from unstructured text [1]. Examples include argument detection, stance classification, and topic-based argumentative content retrieval [2]. The problem we address in this work is to assemble the structure of the argumentation behind a given input document. This problem can be broken down into multiple tasks, such as the detection of argument components, as well as the classification of links between them. The latter is known to be a challenging task, whose outcome may be a complicated graph.
There are many possible definitions of argument. According to Walton [3], an argument is made of three components: (i) a claim, or assertion, about a given topic; (ii) a set of premises supporting the claim; and (iii) the inference between the premises and the claim. Relations between arguments, or argument components, typically consist of either support or attack links. Argument components and relations may be implicit, which contributes to the difficulty of the task at hand. Moreover, not all argument definitions fit all genres. In fact, AM approaches are very often tailored to specific corpora or genres [4], [5], with solutions that are seldom general enough to be directly applicable to different data sets. Indeed, many AM systems build upon sets of handcrafted features which encode information about the underlying argument model, genre or topic of interest, and make assumptions on the argumentative structure of the input document, thus constraining the resulting argument graph.
On the other end of the spectrum, we find increasingly many solutions that do not rely on feature engineering, but on huge neural architectures with millions of trainable parameters. These models are usually very accurate, but also very expensive, especially in terms of the carbon footprint resulting from the huge energy cost of training and fine-tuning [6]. So much so that a significant part of the NLP community is now promoting the vision of a Green AI, whereby more effort must be spent on simpler and more efficient solutions, suited for low-resource settings [7], [8], [9], [10].
In recent years, the availability and diversity of AM corpora have considerably increased [2], [11], [12]. However, most AM models are tested only on a few popular benchmarks, typically neglecting less known datasets, and only reporting on positive results. This phenomenon, which is not limited to AM research but has been observed in other communities as well, has often been criticized because it may hinder the development of new ideas [13] and promote the development of models that generalize poorly to the real world [14].
This work presents a general-purpose, domain-agnostic neural architecture that does not rely on genre-specific or topic-dependent features, and its evaluation on five different datasets. The architecture is smaller than state-of-the-art models by various orders of magnitude. It exploits neural attention and multi-task learning, jointly addressing the problems of identifying the category of argument components and predicting their relations. Experimental results on a variety of different corpora show that the model is robust, can be applied to many different domains, and achieves good performance across the considered data sets. Our main contributions are:
- A novel approach to AM, which extends our previous work [15] by introducing an attention module and using ensemble learning. The model jointly performs multiple AM tasks, and does not rely on ad-hoc features or rich contextual information, but only on GloVe embeddings and on a widely applicable notion of distance.
- A model with a much smaller computational footprint than state-of-the-art neural approaches.
- An analytical evaluation of the contribution of each added module through an ablation study and a validation of our model on a challenging corpus.
- A set of experiments to assess generality, whereby we test our approach on four different corpora with various domains, writing styles, formatting, lengths, and annotation models. To the best of our knowledge, we are the first to validate a new AM method on as many corpora.
- A negative result on a fifth corpus, which highlights the limitations of our approach, as suggested in [16].
With respect to our previous work [15], this article extends the neural architecture with attention and ensemble learning, and presents a more extensive experimental evaluation. All the code used in our experiments is publicly available.¹
The article is organized as follows. Section II presents background and related work. Section III introduces our architectures. Section IV describes the data used for evaluation. Sections V and VI illustrate the experimental setting and discuss results. Section VII concludes.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

II. BACKGROUND
The adoption of deep learning approaches in AM is relatively recent, compared to other areas of NLP. That is probably a consequence of a lack of large AM corpora, considering the complexity and peculiarities of the tasks at hand. Indeed, the annotation of large corpora for AM system evaluation and training proved to be challenging, as demonstrated by relatively low Inter-Annotator Agreement (IAA) indicators and several unsatisfactory attempts at crowdsourcing annotations. That is especially true for some genres like user-generated content [17]. Reasons for that are the nature of the task, which is intellectually demanding, and the lack of a unified argument model, as "arguments" may take very different shapes in different genres. This also leads to a trade-off between the expressiveness of the argument model on the one hand, and the complexity of the annotation process and the availability of relevant data points on the other, often resolved in favor of simple argument models [1]. Earlier research mainly focused on the definition of features for specific genres or even for specific corpora. The differences between corpora, both regarding the domain and the theoretical framework followed during the annotation process, forced researchers to test a model on the same corpora on which it was trained, and to the best of our knowledge, transfer learning approaches have not seen wide experimentation. These two elements led to the common practice of defining a method or a model and validating it only on a single corpus or on a few corpora [1].

A. Multi-Task Learning and Joint Learning for AM
Since AM includes many subtasks that are strongly interrelated, a recent trend of this research field is to address many of them at the same time using multi-task or joint learning techniques. The aim of such approaches is to transfer knowledge from the auxiliary tasks to the main one, or to obtain coherent results on multiple tasks performed at once. Stab and Gurevych [4] jointly address component classification and link prediction on persuasive essays, using Integer Linear Programming and a rich set of specific features, such as lexical, structural, and contextual information. Various neural architectures are tested in [18], including the deep biLSTM multi-task learning (MTL) setting of [19], using sub-tasks as auxiliary tasks. They conclude that neural networks can outperform feature-based techniques in argument mining tasks. Schulz et al. [20] investigate MTL settings addressing component detection on five datasets as five different tasks. Their architecture is composed of a CRF layer on top of a biLSTM, whose recurrent layers are shared across the tasks. They obtain positive results, and the MTL setting proves to be beneficial especially for small datasets, even if the auxiliary AM tasks involve different domains and even different component classes. Lauscher et al. [21] analyze an MTL setting where rhetorical classification tasks are performed along with component detection. They use a hierarchical attention-based model to perform both word-level and sentence-level tasks with the same neural architecture. The results show improvements in the rhetorical tasks, but not in AM. Accuosto and Saggion [22] experiment with MTL and sequential transfer learning, improving performance on AM through discourse parsing tasks.
¹ [Online]. Available: https://github.com/AGalassi/StructurePrediction18.
In [23], a structured learning framework based on factor graphs is used to jointly classify all the propositions in a document and determine which ones are linked together. The models heavily rely on a priori knowledge, encoded as factors and constraints, designed to enforce adherence to the desired argumentation structure, according to the argument model and domain characteristics. The authors discuss experiments with six different models, which differ by complexity and by how they model the factors, using RNNs and SVMs. Their best result is obtained by using the same set of features used in [4], resulting in a total feature size of around 7,000 for propositions and 2,100 for links. Finally, another approach based on factor graphs is DRAIL [24], a neuro-symbolic framework that allows the structure and the constraints of the graphs to be specified through first-order logic clauses.

B. Neural Attention for AM
Neural attention is a mechanism widely used in NLP to improve performance and interpretability of neural networks, and it is the core of many NLP architectures like RNNsearch [25], Pointer Networks [26], and Transformer [27]. Given an input sequence, and possibly a query element, attention consists in the computation of a set of weights that represent the importance of each element of the sequence, which can be further used to create a compact representation of such an input. There are many different ways to compute such weights. A taxonomy of attention models is proposed in our survey [28].
Among the AM systems that use neural attention, the one in [29] integrates hierarchical attention and biGRU for the analysis of the quality of the argument, the one in [30] uses attention to integrate a sentiment lexicon, while in other works [31], [32], [33] attention modules are stacked on top of recurrent layers. The use of Pointer Networks for AM has also been investigated [34]. Biaffine attention has been used by Morio et al. [35] along with task-specific parametrization (TSP-PLBA) and a mixture of symbolic and sub-symbolic input features. Chen et al. [36] address the task of inferring the agreement between sentences using fine-grained co-attention between the two sentences.
Transformer-based approaches in AM use language representation models such as BERT [37] and ELMO [38] to create contextualized word embeddings. Specifically, Reimers et al. [39] address component classification and argument clustering, a related task whose aim is to identify similar arguments. Similarly, Lugini and Litman [40] use BERT embeddings alongside other contextual information to perform component classification, and Wang et al. [41] use them to train a different model for each type of component. Trautmann et al. [42] use pre-trained BERT models to perform word-level classification of the stance of components regarding a given topic, while Poudyal et al. [43] use RoBERTa [44], an improved version of the original BERT. BERT is also used by Opitz [45], who formulates relation classification as a plausibility ranking task by exploiting hypothetical discourse contexts. Bao et al. [46] propose a neural transition-based model which incrementally builds an argumentation graph, using a combination of fine-tuned BERT embeddings and other symbolic features as input. More recently, Srivastava et al. [47] use BERT to classify arguments, then rely on the trained weights and self-attention to predict links.
Mayer et al. [48] present and conduct extensive experimentation on the AbstRCT corpus, addressing four AM subtasks with a pipeline scheme. They analyze the impact of various BERT models, which are pre-trained on other corpora and then fine-tuned on the corpus at hand. Segmentation and component classification are performed as sequence tagging with BIO scheme. Link prediction and relation classification follow, taking into account all the pairs of components obtained in the first step and classifying their relations as attack, support, or non-existing. Their architecture is based on bi-directional transformers followed by a softmax layer and various encoders. Their approach is completely distance-independent, but since they compare every possible pair of components, the size of the dataset grows quadratically with the number of components in the document, which makes it hardly scalable to large documents. Another approach, consisting of predicting at most one related component for each component, and then classifying their relation, has been tested but yields worse results. The architectures that yield the best results are BioBERT [49], which is pre-trained on a large-scale biomedical corpus, SciBERT [50], which is pre-trained on scientific articles of various nature, and RoBERTa.
Looking outside the context of AM, BERT is a very popular model across all the NLP tasks [51], but it is also very resource-demanding, consisting of more than 110 million parameters in its base implementation. For this reason, there is an active effort in assessing when its use is really necessary [8].

C. Residual Networks
Residual networks [52] are a family of deep neural networks that achieved outstanding results in many machine learning tasks across many different domains related to NLP [27], [53], [54]. The core idea behind residual networks is to create shortcuts that link neurons belonging to distant layers (see Fig. 1), whereas standard feed-forward networks typically link neurons belonging to subsequent layers only. This kind of architecture usually results in a more efficient training phase, making it possible to train networks with considerably more layers and reducing the overall computational footprint. A similar principle is followed also in the design of dense and highway networks [55], [56]. The intuition behind residual networks is that if a function H(x) can be approximated by multiple non-linear layers, then they can also approximate its residual function F(x) = H(x) − x. It is therefore possible to obtain the original function by simply adding the residual value:
H(x) = F(x) + x
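As an illustration of the residual principle H(x) = F(x) + x, the following minimal sketch implements a residual block in plain Python. The helper names (`dense`, `residual_block`, `random_layer`), the layer sizes, and the random initialization are assumptions made for this example only and do not reproduce the layers used in our architecture.

```python
import random


def dense(x, weights, bias):
    """Affine layer followed by ReLU: one output value per weight row."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


def residual_block(x, layers):
    """Compute H(x) = F(x) + x, where F is a stack of dense layers."""
    out = x
    for weights, bias in layers:
        out = dense(out, weights, bias)
    if len(out) != len(x):
        raise ValueError("F(x) must have the same size as x")
    # The shortcut connection adds the block input to the block output.
    return [f + xi for f, xi in zip(out, x)]


def random_layer(n_in, n_out, rng):
    """Hypothetical initializer: small Gaussian weights, zero bias."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
x = [1.0, -2.0, 0.5, 3.0]
# F narrows the representation to 2 units and then restores size 4.
layers = [random_layer(4, 2, rng), random_layer(2, 4, rng)]
h = residual_block(x, layers)
```

When all the weights of F are zero, the block reduces to the identity, which gives an intuition of why adding such blocks does not make a deep network harder to train than its shallower counterpart.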

III. MODEL
The architecture we propose makes use of the dense residual network model, along with a Long Short-Term Memory (LSTM) network [57], and an attention module [28]. The network is trained to jointly perform three argument mining sub-tasks: argument component classification, link prediction, and relation classification.
More specifically, our approach operates on sentence pairs, does not rely on document-level global optimization, and does not enforce model constraints induced, for example, by domain- or genre-specific background knowledge. This makes our approach amenable to a possible integration within more complex and sophisticated systems.
We performed model selection and hyper-parameter tuning on a single corpus (CDCP, see Section IV) and we collected results on validation data in order to tune the whole architecture. There are two reasons for this choice: on the one hand, we aim to show the robustness of the approach across different corpora, while on the other hand we believe it is important to limit the footprint of these experiments, an issue that is receiving growing attention in the community [6].

A. Model Description
In order to achieve a general method which may be applicable in any domain, our approach does not rely on a specific argument model, but rather it reasons in terms of abstract entities, such as argumentative components and links among them. We instantiate such abstract entities into concrete categories given by annotations, such as claims and premises, supports and attacks, as soon as we apply the method to a specific corpus whose annotations follow a concrete argument model.
The detection of argumentative content in text is one typical stage of AM systems [1]. Other works only focus on AM tasks that assume that argumentative components and their boundaries are already identified in the data. Such is the case with Niculae et al. [23], whose CDCP dataset only consists of argumentative elements, and with others [31], [40], [58] who simply ignore the non-argumentative elements of the input text. Accordingly, we define a document D as a sequence of argumentative components and disregard the rest of the input text. An argumentative component in turn is a sequence of tokens, i.e., words and punctuation marks, representing an argument, or part thereof. The labeling of components is induced by the chosen argument model. Such a labeling associates each component with the corresponding category C of the argument component it contains. For this reason, we will use the terms component, sentence, and proposition as equivalent, and implying them as being argumentative by assumption.
Given two argumentative components a and b belonging to the same document, we represent a directed relation from the former (source) to the latter (target) as a → b. Reflexive relations (a → a) are not allowed.² Any pair of components is characterized by four labels: the types of the two components (C_a and C_b), the Boolean link label L_{a→b}, and the relation (type) label R_{a→b}. The link label indicates the presence of a link, and is therefore true if there exists a directed link from a to b, and false otherwise. The relation label instead contains information on the nature of the link connecting a and b. It represents the relationship between the two components, according to the links that connect a to b or b to a. Its domain is composed, according to the underlying argument model, not only of all the possible link types, but also of their opposite types (e.g., attack and attackedBy), as well as of a special category, None, meaning no link in either direction. One reason to introduce opposite relation types is to mitigate the imbalance caused by the limited number of instances each relation type typically has, compared with the number of instances belonging to the None class. Likewise, we speculate that the introduction of additional labels may contribute positively to the optimization process. We remark that opposite relation labels are exploited during training, but they are discarded in the test phase, where they are simply substituted with the None label, consistently with previous work.
We use a multi-objective learning setting where multiple tasks are performed jointly for each possible input pair of components (a, b) belonging to the same document D. Our main focus is the identification of the link label L_{a→b} for each possible input pair of propositions (a, b) belonging to the same document D. Our first objective is thus a link prediction task, which can be considered as a sub-task of argument structure prediction. A second objective is the classification of the two components,³ and our final objective is the classification of the relationship between such components, i.e., the prediction of labels C_a, C_b, R_{a→b}. A common issue in the classification of pairs of document components is the fact that the number of pairs grows quadratically with the number of components, causing a large imbalance against the negative class [43], [48]. One way of dealing with that issue is to limit the possible pairs by setting a maximum distance, thus obtaining a number of pairs proportional to the number of components. Such a distance is a hyper-parameter, and as such it may be empirically determined [43].
² We will partially consider reflexive relations for the UKP dataset for a specific reason explained in Section V.
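The pair construction and labeling scheme described above can be sketched as follows. This is only an illustrative sketch: the function names, the `OPPOSITE` mapping, and the use of the difference between component indices as distance are assumptions introduced for the example, not a description of our released code.

```python
# Hypothetical mapping from each link type to its opposite relation type.
OPPOSITE = {"support": "supportedBy", "attack": "attackedBy"}


def candidate_pairs(n_components, max_distance=None):
    """Ordered pairs (a, b) with a != b, optionally limited by |a - b|."""
    pairs = []
    for a in range(n_components):
        for b in range(n_components):
            if a == b:
                continue  # reflexive relations are not allowed
            if max_distance is not None and abs(a - b) > max_distance:
                continue  # pair discarded by the distance threshold
            pairs.append((a, b))
    return pairs


def pair_labels(a, b, links):
    """Return (link label, relation label) for the ordered pair (a, b).

    `links` maps (source, target) index pairs to a link type.
    """
    if (a, b) in links:
        return True, links[(a, b)]
    if (b, a) in links:
        # Only the opposite direction is linked: no link from a to b,
        # but the relation label records the opposite type.
        return False, OPPOSITE[links[(b, a)]]
    return False, "None"


links = {(0, 1): "support"}
all_pairs = candidate_pairs(100)        # grows quadratically: 9900 pairs
near_pairs = candidate_pairs(100, 5)    # grows linearly: 970 pairs
```

With 100 components, the unrestricted setting produces 9,900 ordered pairs, whereas a maximum distance of 5 leaves 970, which shows how the threshold turns quadratic growth into linear growth.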

B. Embeddings and Features
Faithful to the main purpose of this work, of evaluating the effectiveness of deep residual networks and attention for AM without resorting to domain- or genre-specific information, our system relies on a minimal set of widely applicable features.
Words are encoded using pre-trained GloVe embeddings [59] of size 300. Since punctuation may play a key role in the semantics of the sentences [60], we have decided to keep punctuation tokens as well. Input sequences are zero-padded to the length of the longest sequence in the datasets (henceforth T). Out-of-vocabulary terms are handled by creating random embeddings.
In our previous work, we empirically assessed how the distance between two components may be a relevant feature for AM in the CDCP corpus [15]. The same observation has been recently made also with reference to other corpora [18], [43], [48]. Similarly to what has been done in [18], we define the number of argumentative components separating source and target as argumentative distance, using the positive sign when the source precedes the target, and the negative sign otherwise. Inspired by works in other domains [61], [62], [63], we encode such a scalar number in a 10-bit array, using the first 5 bits for those cases where the source precedes the target, and the other 5 bits for the opposite case. The number of consecutive "1" values encodes the value of the distance, with a maximum value of 5. For example, if the argumentative distance is −3, the encoding is 00111 00000; if the argumentative distance is 2, the encoding is 00000 11000.
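The distance encoding can be illustrated with the short sketch below. The bit layout is an assumption derived from the two worked examples given above (a distance of −3 yields 00111 00000 and a distance of 2 yields 00000 11000); the function names are hypothetical, and the treatment of a zero distance as an all-zero array is likewise an assumption.

```python
def encode_distance(distance, half=5):
    """Encode a signed argumentative distance as a list of 2 * half bits."""
    magnitude = min(abs(distance), half)  # larger distances saturate
    if distance < 0:
        # Negative distance: ones are right-aligned in the first half.
        first = [0] * (half - magnitude) + [1] * magnitude
        second = [0] * half
    else:
        # Positive (or zero) distance: ones are left-aligned
        # in the second half.
        first = [0] * half
        second = [1] * magnitude + [0] * (half - magnitude)
    return first + second


def as_text(bits, half=5):
    """Render the bit array as two space-separated groups."""
    left = "".join(str(bit) for bit in bits[:half])
    right = "".join(str(bit) for bit in bits[half:])
    return left + " " + right


example_negative = as_text(encode_distance(-3))
example_positive = as_text(encode_distance(2))
```

In this layout the ones always grow outward from the boundary between the two halves, so that distances of similar magnitude and equal sign share most of their bits.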

C. The RESARG Architecture
We use our own previous system [15] as a baseline. We refer to it as RESARG. Its architecture, depicted in Fig. 2(a), is based on residual networks [52] and comprises the following macro blocks:
- two deep embedders, one for sources and one for targets, that manipulate token embeddings;
- a dense encoding layer that reduces the dimensionality of the features;
- a biLSTM that processes the sequences;
- a final residual network that processes the concatenation of the source and target representations and of the distance encoding.
The purpose of the deep embedders is to fine-tune the pre-trained embeddings, a common procedure in deep learning-based NLP solutions [64] whose usefulness was confirmed by preliminary experiments. Each embedder is composed of a single residual block consisting of four pre-activated time-distributed dense layers. Accordingly, each layer applies the same transformation to each embedding, regardless of its position inside the sentence. All the layers have 50 neurons, except for the last one, which has 300 neurons.
The dense encoding layer is necessary to reduce the parameters in the following biLSTM, thus reducing the time needed for training, and limiting overfitting. It applies a time-distributed dense layer, which reduces the embedding size to 50, and a time average-pooling layer [65], which reduces the sequence size by a factor of 10. The resulting sequences are then given as input to the same biLSTM, producing a single representation of size 50 for each component.
Source and target are processed in parallel in the first three blocks, then concatenated together, along with the encoding of the distance, and given as input to the final residual network. The first level of the final residual network is a dense encoding layer with 20 neurons, while the residual block is composed of a layer with 5 neurons and one with 20 neurons. The outputs of the first and the last layers of the residual networks are summed up and provided as input to the classifiers.
The final stage of RESARG are three independent softmax classifiers used to predict the source, the target, and the relation labels. Each classifier, which predicts a label for a dedicated task, contributes simultaneously to our learning model. The link classifier is obtained by summing the relevant scores produced by the relation classifier, aggregating the probability assigned to the relation labels into a single link label.
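The derivation of the link label from the relation classifier can be sketched as follows. The label set and the choice of which labels count as "relevant" (here, the non-opposite link types) are assumptions made for illustration; the actual labels depend on the argument model of each corpus.

```python
import math

# Hypothetical relation label set for a corpus with two link types.
RELATION_LABELS = ["support", "attack", "supportedBy", "attackedBy", "None"]
# Assumption: only the direct link types contribute to the link score.
DIRECT_LABELS = {"support", "attack"}


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def link_probability(relation_probs):
    """Sum the probabilities of the relation labels that imply a link."""
    return sum(p for label, p in zip(RELATION_LABELS, relation_probs)
               if label in DIRECT_LABELS)


relation_probs = softmax([2.0, 0.0, 0.0, 0.0, 1.0])
link_score = link_probability(relation_probs)
```

Since the link score is a sum of softmax outputs, it is itself a probability, and no additional trainable parameters are needed for the link prediction task.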
All the dense layers use the rectifier activation function [66], and they randomly initialize weights with He initialization [67]. The application of all non-linear functions is preceded by batch-normalization layers [68] and by dropout layers [69], with probability p = 0.1. The resulting architecture has about 130,000 trainable parameters.

D. The RESATTARG Architecture
Motivated by the remarkable results obtained by attention-based architectures in NLP tasks, we have extended RESARG by including a neural attention block after the bi-LSTM module. To better exploit the new attention module, we removed the time pooling layer from the dense encoding block, so as to avoid loss of information along the temporal axis, and to maintain the whole output sequence from the LSTM. Therefore, in this new model, the input and the output of the LSTM module have size (T, 50). The resulting architecture, named RESATTARG, is depicted in Fig. 2(b).
The attention module is implemented as coarse-grained parallel co-attention [28], to consider both components at the same time while computing attention on each of them, and its structure is illustrated in Fig. 2(c). Our method consists of exploiting the average embedding of one proposition as a query element while computing attention on the other, similarly to what has been done in [70]. Specifically, calling K_s and K_t the outputs of the bi-LSTM obtained from, respectively, the processing of the source and the target propositions, we compute the (masked) average of K_t, obtaining a single embedding g_t of size 50 (1). This embedding is used as query element to compute additive soft attention [28] on K_s (2), obtaining a set of attention weights a_{si} that represent the relevance of an element (3), and then a single source context vector c_s of size 50 (4). The details of this process are described in the following Equations, where the matrices W_1, W_2 and the vectors b, w_3 are learnable parameters.
An equivalent symmetric procedure is used to compute attention on K t so as to obtain c t . The output of this block are two embeddings of size 50, as in our previous architecture.
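The four steps of the attention computation (masked average, additive scoring, normalization, and context vector) can be sketched in plain Python as follows. This is a sketch under stated assumptions: the tanh activation inside the scoring function, the softmax normalization, the function names, and the use of separate parameter tuples for the two directions are all choices made for this example, and every sequence is assumed to contain at least one non-padding element.

```python
import math


def masked_average(K, mask):
    """Step 1: average the rows of K whose mask entry is 1."""
    count = sum(mask)
    return [sum(row[j] for row, m in zip(K, mask) if m) / count
            for j in range(len(K[0]))]


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def additive_attention(K, mask, query, W1, W2, b, w3):
    """Steps 2-4: scores, attention weights, and context vector."""
    q_proj = matvec(W2, query)
    scores = []
    for row, m in zip(K, mask):
        if not m:
            scores.append(float("-inf"))  # padding gets zero weight
            continue
        k_proj = matvec(W1, row)
        hidden = [math.tanh(kp + qp + bi)
                  for kp, qp, bi in zip(k_proj, q_proj, b)]
        scores.append(sum(w * h for w, h in zip(w3, hidden)))
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    context = [sum(a * row[j] for a, row in zip(weights, K))
               for j in range(len(K[0]))]
    return weights, context


def co_attention(K_s, mask_s, K_t, mask_t, params_s, params_t):
    """Each proposition is attended using the other's average as query."""
    g_t = masked_average(K_t, mask_t)
    g_s = masked_average(K_s, mask_s)
    _, c_s = additive_attention(K_s, mask_s, g_t, *params_s)
    _, c_t = additive_attention(K_t, mask_t, g_s, *params_t)
    return c_s, c_t
```

Because the query is a single averaged vector rather than every element of the other proposition, the cost of each attention computation is linear in the sequence length T.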
Our method resembles the approach of Chen et al. [36], but with two important differences. First of all, they use fine-grained co-attention [28] instead of coarse-grained, so they consider each element of a sentence with respect to each element of the other sentence, leading to a higher computational footprint. The second difference is that they use multiplicative attention instead of the additive one: while the former is more indicated for tasks where it is important to consider the similarities between two inputs (as in the agreement inference task) the latter is more suitable for tasks where representations of relevant elements are unavailable [28], as in component classification and link prediction.
The resulting architecture has about 140,000 trainable parameters. If compared with other state-of-the-art neural architectures, such as BERT BASE and its 110 M parameters, RESATTARG is considerably smaller, and accordingly it is less computationally demanding.

E. Optimization Model and Ensemble Learning
We consider a multi-task formulation for our learning problem. The loss function is given by the weighted sum of four different components: the categorical cross-entropy on three labels (source and target categories, link relation category) and an L 2 regularization on the network parameters.
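The structure of this loss can be sketched for a single training pair as follows. The task weights, the L2 coefficient, and the function names are hypothetical placeholders chosen for the example and are not the values used in our experiments.

```python
import math

TASKS = ("source", "target", "relation")


def cross_entropy(probs, true_index):
    """Categorical cross-entropy for one example with a one-hot target."""
    return -math.log(probs[true_index])


def multitask_loss(pred, gold, task_weights, params, l2_weight):
    """Weighted sum of three cross-entropy terms plus an L2 penalty.

    pred maps each task to a probability distribution, gold maps each
    task to the index of the correct class, and params is a flat list
    of network parameters.
    """
    classification = sum(task_weights[t] * cross_entropy(pred[t], gold[t])
                         for t in TASKS)
    l2_penalty = sum(p * p for p in params)
    return classification + l2_weight * l2_penalty


pred = {"source": [0.5, 0.5], "target": [0.5, 0.5], "relation": [0.5, 0.5]}
gold = {"source": 0, "target": 1, "relation": 0}
weights = {"source": 1.0, "target": 1.0, "relation": 1.0}
loss = multitask_loss(pred, gold, weights, [1.0, 2.0], 0.1)
```

The link prediction task does not appear as a separate term, since the link label is derived from the output of the relation classifier.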
Since the training of neural models is non-deterministic, the results of a single training procedure are influenced by the random seed that is used, thus they may not be reliable or reproducible [71], [72]. This problem also affects our previous results [15], since they were obtained from a single training experiment.
We have decided to replicate that experiment by repeating the training procedure 10 times, with different seeds, obtaining 10 trained neural networks for each configuration. We will evaluate our models in two different ways. At first, we will consider the average of the scores obtained by every single network for each metric. Then, we evaluate the predictions obtained using all the 10 models in ensemble voting.
In our ensemble setting, the class of each entity is assigned as the class voted by the majority of the networks. This technique is similar to the concept of bootstrap aggregating, also known as bagging [73]. However, while in standard bagging each model is trained on a random sample of the training set, here we train all the models on the same training set, since stochastic elements are already present in the training procedure itself. Indeed, the training process does involve non-deterministic steps, such as the initialization of the networks' weights, the selection of the elements for each batch, and the application of dropout. We have chosen this ensemble method for the sake of simplicity, but more advanced techniques do exist and may yield better results [74].
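The two evaluation modes described above, averaging the scores of the individual networks and majority voting over their predictions, can be sketched as follows. The function names and the toy predictions are hypothetical, and the tie-breaking rule (the label encountered first among the most voted ones) is an assumption, since ties are not discussed here.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model predictions into one label per item.

    predictions[m][i] is the label assigned by model m to item i.
    Ties are resolved in favor of the label encountered first.
    """
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*predictions)]


def average_score(scores):
    """Average a metric computed separately for each trained network."""
    return sum(scores) / len(scores)


# Three hypothetical networks labeling three components.
predictions = [
    ["POLICY", "VALUE", "FACT"],
    ["POLICY", "FACT", "FACT"],
    ["VALUE", "FACT", "VALUE"],
]
ensemble = majority_vote(predictions)
mean_f1 = average_score([0.60, 0.70, 0.80])
```

Majority voting requires no additional training, which keeps the cost of the ensemble equal to that of the repeated training runs that are already needed to report averaged scores.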

IV. CORPORA
We validate our approach on five corpora differing from each other in various dimensions: the domain of the documents, their average length, the formatting, and the argumentative model followed for the annotations.

A. CDCP
The Cornell eRulemaking Corpus (CDCP) [23] consists of user-generated documents in which specific regulations are discussed. The authors have collected user comments from an eRulemaking website on the topic of Consumer Debt Collection Practices rule. The corpus contains 731 user comments, for a total of about 4,700 components, all considered to be argumentative.
As typical of user-generated data, the comments are not structured, and often present grammatical errors, typos, and do not follow usual writing conventions (such as the blank space after the period mark). This complicates pre-processing, since most of the off-the-shelf tools turn out to be inaccurate even in simple tasks such as tokenization.
Annotations follow the argument model proposed in [75], where links are constrained to form directed graphs. The corpus is suitable both for component and relation classification, since it presents five classes of propositions and two types of links. We will use the version of CDCP without nested propositions and guaranteed transitive closure.
Components are addressed as propositions, and they consist of a sentence or a clause. Propositions are divided into POLICY (815), VALUE (2,160), FACT (746), TESTIMONY (1,026), and REFERENCE (32). Only 3% of more than 43,000 possible proposition pairs are linked; almost all links are labeled as REASON (1,292), whereas only a few are labeled as EVIDENCE (46).
The unstructured nature of documents, the strong unbalance between the classes, and the presence of noise make the corpus particularly challenging for all the subtasks of argument mining, especially those that involve the relationships between components.

B. AbstRCT
The AbstRCT Corpus [48] consists of abstracts of scientific papers regarding randomized control trials for the treatment of specific diseases (i.e., neoplasm, glaucoma, hypertension, hepatitis B, diabetes). The final corpus contains 659 abstracts, for a total of about 4,000 argumentative components. AbstRCT is divided into three parts: neoplasm, glaucoma, and mixed. The first one contains 500 abstracts about neoplasm, divided into train (350), test (100), and validation (50) splits. The remaining two are designed to be test sets. The glaucoma part contains 100 abstracts for that disease, the mixed one contains 20 abstracts for each disease.⁴
Components are labeled as EVIDENCE (2,808) and CLAIM (1,390), while relations are labeled as SUPPORT (2,259) and ATTACK (342).⁵ About 10% of about 25,000 possible component pairs have a labeled relationship. The argumentative model chosen for annotation enforces only one constraint: claims can have an outgoing link only to other claims.
With respect of CDCP, this corpus is less noisy and the distribution of the classes is more balanced. We have chosen this as a benchmark to demonstrate that our approach is independent of the domain and of the argument model.

C. DrInventor
The Dr. Inventor Argumentative Corpus (DrInventor) [76] is the result of an extension of the Dr. Inventor corpus [77], which includes an annotation layer containing argumentative components and relations. DrInventor consists of 40 scientific publications from computer graphics, which contain about 12,000 argumentative component labels, as well as annotations for other tasks.
The classes of argumentative components are DATA (4,093), OWN CLAIM (5,445), and BACKGROUND CLAIM (2,751). The first two are related to the concepts of premises and claims, while the last is something in between, since it is a claim related to some background knowledge, such as that made by another author in a previous work. The relation classes are SUPPORTS (5,790), CONTRADICTS (696), and SEMANTICALLY SAME (44), the last of which exists because it is common practice in scientific publications to reiterate the same claim (or more rarely the same data) multiple times.
Since DrInventor includes documents where the structure of the discourse is complex, and data are often presented along with claims, it makes argument mining more challenging: in more than 1,000 cases some components are split into multiple text sequences, located in non-contiguous parts of the documents. This phenomenon mostly concerns claims, but data are affected too, in fewer cases. This introduces the difficulty of recognizing different segments of the documents as part of a single component and makes link prediction more difficult to address through non-pipeline approaches.
The unbalanced distribution between the three classes and the presence of split components make this corpus quite challenging for link prediction, a difficulty that is also highlighted by the low inter-annotator agreement reported in the original article.

D. SciDTB
The SciDTB Argumentative Corpus [22] consists of 60 scientific abstracts from the ACL anthology, for a total of 353 argumentative components. Components can span multiple sentences and can belong to six classes: PROPOSAL (110), ASSERTION (88), RESULT (64), OBSERVATION (11), MEANS (63), DESCRIPTION (7). The annotation scheme imposes that each component can be linked to only one other component, and presents only one class of argumentative relationship: SUPPORT. Out of the 1,884 possible pairs of components, only 126 (6.69%) are linked together by a SUPPORT. The challenging aspects of this corpus are its small size, which allows us to test our method in a low-resource setting, and the imbalance in the distribution of its component classes.

E. UKP-PE
The Persuasive Essays Corpus (UKP-PE) [4] consists of 402 documents from an online community where users post essays and other material, provide feedback, and advise each other. The dataset is divided into a test split of 80 essays and a training split with the remaining documents.
UKP-PE defines three classes of argumentative components: MAJOR CLAIM (751), CLAIM (1,506), and PREMISE (3,832). PREMISES may be linked to CLAIMS through relations of SUPPORT (3,613) or ATTACK (219). MAJOR CLAIMS are not linked to other components. The classes of argument components are similar to those in other datasets. However, what distinguishes the UKP-PE corpus from the others is a more regular argumentation model, which is specific to this corpus alone. All argument graphs are trees. All roots are claims. All tree components belong to the same paragraph. Each premise has exactly one outgoing relation. Claims do not have outgoing relationships; they can only be supported or attacked by premises. The structure of the argumentation is also fairly regular. For example, major claims are usually present in the introduction or conclusion of an essay, and they are often the only argumentative component in the paragraph.
Thanks to the highly regular nature of the UKP-PE data, strong baselines heavily rely on document structure, like the position of the sentence in the essay, or whether a component is in the introduction or conclusion, or in the first or the last sentence of a paragraph [78]. At the same time, including such highly regular data in our analysis enables us to gain further insights on the strengths and limitations of a structure-agnostic approach like ours.

V. EXPERIMENTAL SETTING
We initially evaluate our new architecture against our previous model and the structured learning approach of [23] on CDCP, presenting an ablation study of the new components we have introduced. Then, we extend the evaluation to four other datasets, for which we compare our approach against the state of the art. In our approach each component is involved in many pairs, both as a source and as a target, and accordingly it is classified multiple times by the same network. The model assigns the label by considering the average probability computed by the ensemble for each class and choosing the class with the highest score. An alternative approach would be to assign the class that turns out to be the most probable in most of the cases, thus relying on a majority vote. A further option would be to simply consider the label with the highest confidence. However, the latter procedure might be more sensitive to outliers, because the misclassification of a component in just one pair would lead to the final misclassification of the component, regardless of all the other pairs. A deeper analysis of different techniques to address these issues is left to future research. Table I reports summary statistics of the datasets we use. For UKP-PE, we consider paragraphs containing arguments as "documents" and we do not include "self pairs" in the count of component pairs. Our architecture allows us to use the CDCP, AbstRCT, and SciDTB datasets directly, without need for further pre-processing.
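The three aggregation strategies discussed above can be illustrated with a minimal sketch. It is not the implementation used in our experiments: the function name `aggregate_label`, the `strategy` argument, and the choice of pooling all pairs and all networks into a single list of probability rows are assumptions made here for illustration only.

```python
from collections import Counter


def aggregate_label(prob_rows, strategy="mean"):
    """Pick one class index for a component from many predictions.

    prob_rows holds one list of class probabilities per prediction, i.e. one
    row for every (network, pair) in which the component was classified.
    Pooling pairs and networks in a single list is an assumption of this sketch.
    """
    classes = range(len(prob_rows[0]))
    if strategy == "mean":
        # Average probability per class over all predictions.
        scores = [sum(row[c] for row in prob_rows) / len(prob_rows) for c in classes]
    elif strategy == "vote":
        # Each prediction votes for its own most probable class.
        votes = Counter(max(classes, key=row.__getitem__) for row in prob_rows)
        scores = [votes[c] for c in classes]
    elif strategy == "max":
        # Keep only the single most confident prediction for each class.
        scores = [max(row[c] for row in prob_rows) for c in classes]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    # Ties are broken in favour of the lowest class index.
    return max(classes, key=scores.__getitem__)


# Two moderately confident predictions for class 0, one confident outlier for class 1.
rows = [[0.60, 0.40], [0.55, 0.45], [0.05, 0.95]]
chosen = {s: aggregate_label(rows, s) for s in ("mean", "vote", "max")}
```

With these toy rows, averaging and highest-confidence selection both choose class 1, whereas the majority vote chooses class 0, which shows how a single confident prediction can dominate the outcome.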

A. Data Preparation
As for DrInventor, instead, specific data pre-processing is needed to address two aspects of this dataset: the presence of lengthy documents and split components. Lengthy documents make it inconvenient to consider all the possible pairs of argumentative units. Doing so would not only be infeasible with regular computational resources, but it would also yield an extremely unbalanced dataset for link prediction, with less than 1% of pairs linked. We thus filtered out all the pairs whose components did not appear in the same section of the document, as well as those whose argumentative distance does not fall between −10 and +10. A second peculiarity of this dataset is the presence of components that include non-argumentative material. These "split components" are made of two sequences x and y separated by a third, non-argumentative sequence z. In those cases, we split x and y into two unrelated components, and assigned them the same label, the same links, and the same argumentative relations with the other components. The resulting dataset consists of about 8,700 links out of 240,000 possible pairs, which amounts to roughly 3.6%. Among these links, SUPPORTS amount to 89%, CONTRADICTS to 10%, and the remaining 1% are SEMANTICALLY SAME relations.
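The pair filtering described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the dictionary keys `id`, `section`, and `position` are hypothetical, and the argumentative distance is assumed to be the difference between the positions of the two components in the sequence of argumentative components of the document.

```python
from itertools import permutations


def candidate_pairs(components, max_distance=10):
    """Return the ordered (source, target) pairs kept for link prediction.

    A pair is kept only if both components lie in the same section and their
    argumentative distance lies in [-max_distance, +max_distance].
    """
    kept = []
    for src, tgt in permutations(components, 2):
        same_section = src["section"] == tgt["section"]
        # Assumed definition: signed difference of component positions.
        distance = tgt["position"] - src["position"]
        if same_section and -max_distance <= distance <= max_distance:
            kept.append((src["id"], tgt["id"]))
    return kept


# Toy document: 12 components in one section, 3 in the next one.
components = [
    {"id": i, "section": "intro" if i < 12 else "method", "position": i}
    for i in range(15)
]
pairs = candidate_pairs(components)
```

With a window of 10 on each side, every component is paired with at most 20 others, so the number of candidate pairs grows linearly with the number of components rather than quadratically.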
Regarding UKP-PE, like others did before us [4], [23], we consider exclusively pairs of components that belong to the same paragraph. However, many paragraphs contain only a single component. That is the case, for instance, with about 400 paragraphs containing a single major claim. To include these components as well in our pair-based classification method, we decided to introduce "self pairs" into our dataset, which are instances where the same component acts both as source and target. This significantly increases the number of pairs (from 22,000 to 28,000). Therefore, to improve optimization and enable a comparison with previous approaches, we did not consider these pairs for link prediction and relation classification in validation and testing.
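A minimal sketch of how self pairs enter the pair construction is given below. The function name `paragraph_pairs` and its `include_self` flag are hypothetical, introduced only to illustrate the idea; they do not describe our actual data pipeline.

```python
from itertools import product


def paragraph_pairs(component_ids, include_self=True):
    """Build ordered (source, target) pairs among the components of a paragraph.

    With include_self=True, each component is also paired with itself, so that
    a paragraph with a single component still yields one instance.
    """
    return [
        (src, tgt)
        for src, tgt in product(component_ids, repeat=2)
        if include_self or src != tgt
    ]


# A paragraph holding only a major claim produces no ordinary pair,
# but it does produce one self pair.
single = paragraph_pairs(["major_claim_1"])
```

Setting `include_self=False` reproduces the pair set that would be used for link prediction and relation classification at validation and test time, where self pairs are not considered.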

B. Comparison With Other Methods
Not all the approaches to AM are easily compared against one another. This is the case, for example, of approaches that perform only a few tasks versus end-to-end systems, or pipeline versus joint learning approaches. Since we perform component classification on propositions or sentences, to make our results comparable with architectures that perform it token-wise, we split each classified component into tokens that share the same label, and compute the evaluation of token-wise classification. Since the tokenization method may not be the same one used by other approaches, the final results may not be perfectly comparable, but we believe that this minor difference will not introduce appreciable errors.
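The conversion from component-level to token-level labels can be sketched as follows. The whitespace tokenizer used by default is an assumption of this sketch, chosen only to keep the example self-contained; any other tokenizer can be passed in its place.

```python
def to_token_labels(components, tokenize=str.split):
    """Expand component-level labels into one label per token.

    components is a list of (text, label) tuples; every token of a component
    inherits the label of that component.
    """
    labels = []
    for text, label in components:
        labels.extend([label] * len(tokenize(text)))
    return labels


gold = to_token_labels([("ban the hidden fee", "POLICY"), ("it is unfair", "VALUE")])
pred = to_token_labels([("ban the hidden fee", "VALUE"), ("it is unfair", "VALUE")])
token_accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
```

Because longer components contribute more tokens, a mistake on a long component weighs more in the token-wise score than a mistake on a short one.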
We shall also remark that in our approach we consider argumentative components as already selected and perfectly bounded. Therefore, we perform component classification only between argumentative classes and we do not consider the "non-argumentative" class as a possibility. This makes our figures not directly comparable with those obtained by architectures that address both component identification and classification at once, such as [48], since they include "non-argumentative" among the possible classes and thus address a harder problem. A similar consideration holds for pipeline approaches that evaluate each step based on the result of the previous one instead of using the gold standard. In this case, the errors introduced by early steps introduce noise which may affect the evaluation of subsequent steps. This is once again the case of [48], where errors made during the first step may introduce noise in the link prediction and relation classification tasks, even if the authors report that such an error is negligible. We could not find a solution to these problems, but we argue that, nevertheless, a qualitative evaluation of our method can still benefit from a comparison with these other approaches.
Due to these difficulties, in most cases adapting existing techniques to a new corpus would be a very demanding task. For this reason, we compare our approach only against approaches that have already been tried on the same corpus.

C. Optimization
For each corpus, we train the models on the corresponding training split. Since each corpus is characterized by different classes for component and relation classification, it would be impossible to test a model trained on a different corpus. Nonetheless, it would be possible to use the same model for the task of link prediction.
We shall remark that the hyper-parameters of the architecture and of the learning model have been tuned on the validation set of CDCP. It is also important to highlight that we use the same set of hyper-parameters in all the experiments. Our purpose is to test whether our approach can yield satisfactory results across different and heterogeneous corpora without the need for re-tuning, thereby limiting its cost and its environmental impact [6]. Nonetheless, we are aware that performing a specific calibration for each corpus would probably improve our results.
[Table caption] From top to bottom: the best results of structured approaches based on SVM and RNN, two recent approaches, our previous result obtained with a single training of ResArg, the average scores of the same architecture trained 10 times, the scores of the ensemble learning setting of the same model, and finally the average and the ensemble scores of the new attention-based architecture ResAttArg. When previous works do not report a score, we use the symbol "-".
We use the Adam optimizer [79] with parameters $b_1 = 0.9$ and $b_2 = 0.9999$, applying proportional decay of the initial learning rate $\alpha_0 = 5 \times 10^{-3}$. The weights of the four components of the loss function are set to 1 for the cross entropy of source and target, 10 for the cross entropy of relation, and $10^{-4}$ for the regularization. Training was stopped early after 100 epochs with no improvement in the F1 score of the Link class computed over validation data, except for DrInventor, where we used a patience of 20 epochs due to the dataset's size and much heavier computational footprint.
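The loss weighting and the early-stopping rule just described can be sketched in a framework-independent way. This is a simplified illustration, not our training code: the names `LOSS_WEIGHTS`, `total_loss`, and `stopping_epoch` are hypothetical, and the individual loss terms are assumed to be already computed as plain numbers.

```python
# Weights of the four loss components, as stated in the text.
LOSS_WEIGHTS = {"source": 1.0, "target": 1.0, "relation": 10.0, "regularization": 1e-4}


def total_loss(terms, weights=LOSS_WEIGHTS):
    """Weighted sum of the loss terms (terms maps a component name to its value)."""
    return sum(weights[name] * terms[name] for name in weights)


def stopping_epoch(val_f1_per_epoch, patience=100):
    """Return (epoch at which training stops, epoch of the best validation F1).

    Training stops once `patience` consecutive epochs bring no improvement
    over the best validation F1 of the Link class seen so far.
    """
    best_f1, best_epoch = float("-inf"), -1
    for epoch, f1 in enumerate(val_f1_per_epoch):
        if f1 > best_f1:
            best_f1, best_epoch = f1, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_f1_per_epoch) - 1, best_epoch


loss = total_loss({"source": 0.5, "target": 0.5, "relation": 0.2, "regularization": 100.0})
stop, best = stopping_epoch([0.10, 0.30, 0.20, 0.25, 0.28, 0.29], patience=3)
```

In the toy history above, the best score is reached at epoch 1 and training stops at epoch 4, after three epochs without improvement. A patience of 100 (or 20 for DrInventor) would be passed in the actual setting.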

VI. RESULTS AND DISCUSSION
This section presents the experimental results on each corpus. To assess the contribution of the attention module, we compare RESATTARG with RESARG. Moreover, we study the performance gain introduced by the ensemble approach. To be consistent with other works in this research field, we measure all the performances using the F1 metric and report the values as percentages. For component and relation classification we consider the macro-averaged score. For link prediction, we consider the score of the positive class.
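For clarity, the two scores can be computed as in the following sketch. It is a plain reference implementation of the standard definitions, with hypothetical function names; returning 0 for a class that never appears in the gold labels or in the predictions is a convention assumed here.

```python
def f1_for_class(gold, pred, cls):
    """F1 score of a single class, from true/false positives and false negatives."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == cls and p == cls)
    fp = sum(1 for g, p in pairs if g != cls and p == cls)
    fn = sum(1 for g, p in pairs if g == cls and p != cls)
    denom = 2 * tp + fp + fn
    # Convention assumed here: a class with no gold and no predicted instances scores 0.
    return 2 * tp / denom if denom else 0.0


def macro_f1(gold, pred, classes):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(gold, pred, c) for c in classes) / len(classes)


gold = ["link", "none", "none", "none", "link"]
pred = ["link", "link", "none", "none", "none"]
link_f1 = f1_for_class(gold, pred, "link")         # score used for link prediction
macro = macro_f1(gold, pred, ["link", "none"])     # score used for classification tasks
```

The macro average gives every class the same weight regardless of its frequency, which is why rare classes such as EVIDENCE in CDCP have a strong influence on the reported figures.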
We structure the analysis of results on each corpus in separate subsections. Two more subsections are devoted to the analysis of computational costs and of the attention module.

A. CDCP
We used the same validation set as in our earlier work [15], which was created by randomly selecting documents from the original training split with 10% probability. We used the remaining documents as training data and the original test split as is. To provide a summary evaluation, following [23], we measured the performance of the models by computing the F1 score for links, propositions, and the average between the two. More specifically, for the links we measured the F1 of the positive classes, whereas for the propositions we used the score of each class and then we computed the macro-average. We also reported the F1 score for each relation class, alongside their macro-average. The NONE class of relation classification corresponds to the negative class of link prediction.
The first question we addressed was whether our results with a single model were solid, or to what extent they were influenced by the nondeterministic nature of the training procedure. We compared our baseline model with the average scores obtained by 10 networks, with our ensemble setting, and against the structured approach used in [23]. The results are shown in Table II. The average computed over the 10 networks leads to a worse performance on link prediction with respect to our previous results, which suggests that our previous results were due to a particularly "lucky" training. Nonetheless, the average score on the two tasks remains similar (between 47 and 48), just a few points below the state of the art. The ensemble approach substantially improves the results, outperforming the structured learning approach on both tasks. The results on link prediction are still below those obtained in the first experiment, if only by less than 1%.
Introducing the attention module in the architecture leads to appreciable improvements for both the average and the ensemble approach. In particular, the latter outperforms our previous result in all three tasks. As for relation label prediction, our approaches fail to predict the EVIDENCE relation. This is a negative result, but hardly surprising, since EVIDENCE is a rather rare class in this dataset (less than 1% of all relations). Also, we can see in Table X that the use of attention strongly reduces the number of training epochs (−56%) as well as the standard deviation.
Our results on component classification are similar to the ones obtained by using TSP and PLBA [35], while Transition-based BERT [46] greatly surpasses our approach in both tasks. It is worth remarking that both these approaches use a mixture of symbolic and subsymbolic features, and that BERT has about 1,000 times more trainable parameters than one of the networks in our ensemble. Bao et al. [46] report that they fine-tuned their model for 50 epochs with an early stopping strategy. In our experiments, the average time required to train a BERT model for a single epoch is remarkably higher than the time required to train our models (more details on the computational cost are given in Section VI-F).
To estimate the agreement among the networks in the ensemble architecture, and have a measure of the robustness against the implicit randomness of the training procedure, we have computed Krippendorff's alpha [80] for the three tasks. We obtained α = 0.70 for component classification, and α = 0.44 for both link prediction and relation classification. These values are similar to the IAA obtained by the authors of the corpus, and confirm the difficulty of the link prediction task. Fig. 3 shows confusion matrices for component classification on CDCP. Unsurprisingly, the most common mistake is the prediction of facts as values, since VALUE is by far the largest class in the corpus and is therefore affected by many false positives. Such an ambiguity between the two classes has also been reported during the annotation process.
[Figure caption fragment: "[15], RESARG used in ensemble fashion, and RESATTARG used in ensemble fashion."]
Interestingly, the confusion matrices of the structured approach and of our methods are quite similar. We speculate that our networks may have learned a behavior similar to that produced by the structured approach, with no need to receive any of the constraints or information regarding the argumentative structure that are instead injected in the structured approach.
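The agreement among the networks of the ensemble reported above is measured with Krippendorff's alpha. As an illustration, the following sketch computes alpha for nominal labels through the coincidence matrix. It assumes that every item is labeled by at least two networks and that there are no missing values; the function name is hypothetical and this is not the code used to produce our figures.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data without missing values.

    units is a list of lists: each inner list holds the labels that the
    different networks assigned to the same item.
    """
    # Coincidence matrix: every ordered pair of labels given to the same item
    # by two different networks contributes 1 / (m - 1), m = labels per item.
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # items labeled only once carry no agreement information
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b] for a in totals for b in totals if a != b) / (n - 1)
    if expected == 0:
        return 1.0  # a single label was used everywhere: no disagreement is possible
    return 1.0 - observed / expected


# Toy example with two raters and ten binary items.
rater_a = [0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
rater_b = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]
alpha = krippendorff_alpha_nominal([list(pair) for pair in zip(rater_a, rater_b)])
```

In the toy example the two raters disagree on four items out of ten and alpha is close to 0.1, far below the values we observe among our networks.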

B. AbstRCT
As for AbstRCT, we compare our architectures against the best methods presented by its authors [48], whose results are reported in the first rows of Tables III and IV. We trained and validated our model on the respective splits of the Neoplasm dataset, using the remainder of the dataset for testing. For the reasons explained in Section V-B, the approach presented by Mayer et al. [48] is not directly comparable with ours, so the comparison can only be qualitative. To ease comparison with future approaches, we report in Table V some additional details on our results. As for component classification, RESATTARG with ensemble yields the best result, performing comparably with the state of the art. Our approaches obtain substantially better scores for EVIDENCE than for CLAIM on all datasets. Similarly to the Transformer-based approaches, our architectures perform better on the mixed test set than on the neoplasm one. We obtain better results on all datasets in terms of micro F1 score. In terms of macro F1, however, although our architecture improves on the previous approaches on Neoplasm, it is outperformed by BioBERT on Glaucoma and Mixed. In relation classification, RESATTARG with ensemble outperforms all the other models on Neoplasm, and it performs about 2% worse than the state of the art on Mixed and Glaucoma. It is interesting to notice that in this task BioBERT is largely outperformed by our approach. Almost all the metrics confirm that the introduction of attention and ensemble improves our architectures. The agreement between the RESATTARG networks is very high for token-wise component classification in each dataset (0.81 ≤ α ≤ 0.83), and lower but still acceptable for the other two tasks (α = 0.67 on neoplasm and α = 0.62 for the other two). The introduction of attention has considerably reduced the number of training epochs (−29%) and the standard deviation.
On this corpus, RESARG requires about half the number of training epochs it required on CDCP, while RESATTARG requires a few fewer.
These good results indicate that our method may be a valuable approach with well-structured corpora. Moreover, such results are attained without resorting to contextual embeddings or pretraining on domain-related corpora, but by only relying on noncontextual, general-purpose embeddings.

C. DrInventor
To the best of our knowledge, the only approach tested on this corpus is the architecture for token-wise component classification used by Lauscher et al. [21], which makes use of GloVe embeddings and a Bi-LSTM followed by a feed-forward neural network with a single hidden layer as classifier. We thus consider such an approach as a baseline. Like Lauscher et al., we reserved 30% of the documents of the DrInventor corpus as test set, and 20% of the remaining part as validation set. It is worth remarking that for the tasks of link prediction and relation classification we are considering a limited number of pairs.
Tables VI and VII include a detailed report of our performance on the dataset. We outperform the baseline by a wide margin. Moreover, we address two additional tasks, link prediction and relation classification, thus offering a benchmark for future work. These results confirm once more that attention and ensemble together give a crucial contribution to the classifier. Differently from previous experiments, the agreement between the RESATTARG networks is similar for all the tasks, with only α = 0.56 for component classification and α = 0.60 for the remaining tasks. The agreement for component classification is lower than on the previous datasets and may suggest that this dataset is more challenging. This is the only case where RESARG requires fewer training epochs than RESATTARG, but the difference is negligible.
Our model is incapable of classifying the SEMANTICALLY SAME relation and also has difficulties with CONTRADICTS. That is hardly surprising, if we consider that these are the two least represented classes in this dataset. It is less straightforward to understand why the model is better at classifying BACKGROUND CLAIM than DATA, even though the latter is more represented than the former. We speculate it may be related to the fact that in some instances data may amount to citations or text other than proper sentences.

D. SciDTB
Previous experiments on this dataset were conducted using BiLSTM and CRF, exploiting syntactical, positional, and discourse features [22]. The authors performed component classification at token level using BIO tagging and validated it through a 10-fold cross-validation setting. We decided to pursue a different experimental setting: we randomly split the corpus into train, validation, and test folds with an approximate ratio of 60%, 20%, 20%, imposing the constraint that each fold must contain at least one instance of each component class. We are aware that this decision makes our results impossible to compare with previous approaches, but we are convinced that this is the best approach for the task at hand. Indeed, since some component classes are under-represented in the dataset (for example, there are only 11 instances of OBSERVATION and 7 of DESCRIPTION), with 10-fold cross-validation some folds would not contain any instance of them, and therefore some of the tests would completely ignore those classes, resulting in unreliable measures. As for link prediction, previous experiments were conducted also considering non-argumentative relationships, so it is impossible to compare those results to ours.
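One simple way to obtain a split that satisfies this constraint is rejection sampling: shuffle the documents, cut them according to the desired ratio, and retry until every fold covers every class. The sketch below illustrates the idea. Splitting at document level, the function name `constrained_split`, and the retry limit are assumptions of this sketch rather than a description of the exact procedure we followed.

```python
import random


def constrained_split(classes_per_doc, ratios=(0.6, 0.2, 0.2), seed=0, max_tries=1000):
    """Randomly split documents into train/validation/test folds.

    classes_per_doc maps a document id to the set of component classes it
    contains. A split is accepted only if every fold covers every class.
    """
    rng = random.Random(seed)
    docs = sorted(classes_per_doc)
    all_classes = set().union(*classes_per_doc.values())
    n_train = round(ratios[0] * len(docs))
    n_val = round(ratios[1] * len(docs))
    for _ in range(max_tries):
        rng.shuffle(docs)
        folds = (docs[:n_train], docs[n_train:n_train + n_val], docs[n_train + n_val:])
        covered = [set().union(*(classes_per_doc[d] for d in fold)) for fold in folds]
        if all(c >= all_classes for c in covered):
            return folds
    raise RuntimeError("no valid split found; relax the constraint or the ratios")


# Toy corpus: a rare class appears in only three of ten documents.
toy = {f"doc{i}": {"PROPOSAL"} | ({"DESCRIPTION"} if i < 3 else set()) for i in range(10)}
train, val, test = constrained_split(toy)
```

The rarer a class is, the more attempts are typically needed, and with fewer than three documents containing a class no valid three-way split exists at all.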
Due to the small size of the corpus, the architectures overfitted on both component classification and link prediction, obtaining a perfect score on the training set. The results on the test set are shown in Table VIII. The performance on component classification is comparable to that obtained on DrInventor, and the architectures clearly have difficulty recognizing MEANS and DESCRIPTION. Surprisingly, OBSERVATION, the second least represented class, is always correctly classified. While the use of ensemble plays a key role, once again it is the combination of attention and ensemble that leads to the best result. The agreement between the networks is extremely low (α < 0.30), and the measure is deeply affected by the error on the DESCRIPTION class. As for link prediction, similar considerations can be drawn, except that the agreement is considerably higher, but still unreliable (0.43 < α < 0.46). The number of necessary training epochs is higher than in previous corpora for both models. Once more, RESATTARG requires fewer epochs than RESARG, but the standard deviation for both models is similarly high.

E. UKP-PE
UKP-PE comes with two strong baselines: the ILP joint model proposed by the authors of the dataset [4] and the structured learning approach by Niculae et al. [23]. They heavily rely on corpus-specific structural features. For example, it is possible to obtain a 47.7 F1 score for major claim detection by only using structural information [78]: something that has never even been attempted on other datasets, because it would not make sense to do so. We compare our results based on the original test split of the dataset, using about 10% of the documents of the training split as a validation split. As shown in Table IX, our approach is largely below the baselines, with a difference in F1 scores between 20% and 30%. Both the use of attention and ensemble bring consistent improvements with respect to the base model. The agreement between the networks is also low, with α = 0.57 for component classification and α = 0.38 for link prediction, which makes it nearly acceptable for the first task but unreliable for the others. The number of training epochs required is the highest for both models and the standard deviation is also very large. This suggests that the training process is probably unstable and that it is difficult to make the model converge. Investigating the type of errors, we see that most CLAIMS are predicted as PREMISES and that most MAJOR CLAIMS are predicted as CLAIMS. As previously noted, failure to outperform the baselines while ignoring dataset-specific structural knowledge is not surprising, but it gives us a valuable indication of the limits of a structure-agnostic approach like ours.

F. Analysis of Computational Costs
We shall now turn to the analysis of the computational cost. Table X reports the average number of epochs required to train our models on each dataset (we do not include the epochs of patience during which there was no improvement on the validation [...] on the same hardware. We considered two typical training regimes: fine-tuning of the entire model (BERT fine-tuned), and training of the classification head only (BERT freezed). Training BERT freezed for one epoch takes ten times as long as training RESATTARG for one epoch. Training BERT fine-tuned for one epoch takes thirty times as long. The total cost of training will depend on the epochs required to reach the result. Standard practice when using BERT models in AM is to fine-tune for just 3 epochs (see, e.g., [48] and references therein). However, 3 epochs is a lower bound. For instance, the best results on the CDCP dataset so far have been obtained by fine-tuning a BERT-based model for 50 epochs with an early stopping strategy [46]. These results could not be outperformed by training a plain BERT model for component classification for 3 epochs only [47]. Moreover, transformer models like BERT are often just the backbone of more sophisticated AM architectures. Clearly, a backbone with such a large computational cost per single epoch greatly limits the possibility to test different architectures or to perform model selection and hyperparameter tuning. In conclusion, training a single RESATTARG architecture requires less time than fine-tuning a BERT model for 3 epochs (2856 vs 3399 seconds), let alone 50 epochs. Moreover, an ensemble of smaller architectures could better exploit parallel training and, importantly, can have a much lower computational footprint at inference time (see Section III-D: 110 M parameters for a plain BERT model vs 1.4 M parameters for an ensemble of 10 RESATTARG models).
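The figures quoted above allow a rough back-of-the-envelope comparison, sketched below. The extrapolations are assumptions and not measurements: we assume that the per-epoch time of BERT fine-tuning is constant, and that each of the ten networks of the ensemble takes the same 2856 seconds when trained one after the other. The function name `cost_summary` is hypothetical.

```python
def cost_summary(resattarg_single_s=2856, bert_3_epochs_s=3399,
                 bert_params=110e6, ensemble_params=1.4e6, n_networks=10):
    """Derive rough cost figures from the numbers reported in the text."""
    # Assumption: BERT fine-tuning time grows linearly with the number of epochs.
    bert_epoch_s = bert_3_epochs_s / 3
    return {
        "bert_epoch_s": bert_epoch_s,
        "bert_50_epochs_s": 50 * bert_epoch_s,
        # Assumption: the ten networks are trained sequentially, 2856 s each.
        "ensemble_sequential_s": n_networks * resattarg_single_s,
        "param_ratio": bert_params / ensemble_params,
    }


summary = cost_summary()
```

Under these assumptions, even a sequentially trained ensemble of ten networks would take about half the time of a 50-epoch BERT fine-tuning, while a plain BERT model has roughly 79 times as many parameters as the whole ensemble.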

G. Analysis of Attention
The attention module, besides improving the overall performance of our architecture, can also be used to enhance the interpretability of the underlying neural model. To this aim, the normalized weights $a_s$ attributed by the module to each token are typically used as indicators of which words the model considers important for the tasks. However, whether this mechanism is really helpful for improving the explainability of neural models is still a matter of discussion [28], [81], [82]. Fig. 4 provides some visualizations of such scores.
Footnote 8: We used the bert-base-uncased model.
In an attempt to assess the impact of attention on interpretability, we analyzed RESATTARG by measuring which are the tokens that, on average, receive the most attention. Tables XII and XIII report the 20 tokens from the CDCP and SciDTB corpora, respectively, that receive on average the most attention both in the general case (leftmost part in the tables), and in case the network predicts a link (rightmost part).
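The ranking reported in these tables can be obtained with a simple aggregation, sketched below. The input format (one list of tokens and one list of normalized attention weights per instance) and the function name are assumptions made for the sake of the example.

```python
from collections import defaultdict


def top_attended_tokens(examples, k=20):
    """Rank tokens by their mean attention weight over all their occurrences.

    examples is an iterable of (tokens, weights) pairs, where weights are the
    normalized attention scores assigned to the tokens of one instance.
    """
    total = defaultdict(float)
    count = defaultdict(int)
    for tokens, weights in examples:
        for token, weight in zip(tokens, weights):
            total[token] += weight
            count[token] += 1
    mean = {token: total[token] / count[token] for token in total}
    # Sort by decreasing mean weight; ties are broken alphabetically.
    return sorted(mean.items(), key=lambda item: (-item[1], item[0]))[:k]


examples = [
    (["please", "stop", "?!"], [0.2, 0.1, 0.7]),
    (["stop", "it"], [0.4, 0.6]),
]
ranking = top_attended_tokens(examples, k=2)
```

Restricting `examples` to the instances for which the network predicts a link yields the second ranking. Note that a token occurring only once can reach the top of the list on the basis of a single instance, which may partly explain the presence of seemingly unrelated words among the top-ranked tokens.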
This kind of analysis gives mixed results. In the general case, tokens should give us hints about which elements are used by the network to distinguish between different components. The CDCP corpus stands out since in the first position there is the punctuation sign "?!". This sign, as well as other tokens (e.g., "please", "seriously"), seems to be a good indicator of emotional involvement. We speculate that they may be useful for the network to discriminate objective components from subjective ones (e.g., TESTIMONIES from VALUES, CLAIMS from [...]). These results seem to confirm previous findings regarding the importance of subjectivity analysis [83] for the task of argument mining [84]. As for link prediction, among the top-ranked tokens some discourse markers can be observed (e.g., "depending", "owing", "whenever", "firstly", "second", "while", "even"), as well as verbs in present continuous form, but also words that do not seem correlated with the task, such as "spam" or "phased". These results provide an additional example of the ambivalence of the study of attention weights as a means to explain the classifications performed by black-box models.

VII. CONCLUSION
In this article we presented RESATTARG, a new neural architecture for argument mining based on residual networks, multi-task learning, neural attention, and ensemble learning. Our approach does not rely on dataset-specific architectural choices such as structural features or encodings. On the contrary, it only uses general-purpose embeddings and a broadly applicable distance feature, making it suitable for any domain and argumentative model. Moreover, RESATTARG is considerably smaller than other state-of-the-art approaches, making it less expensive to train, and more sustainable from an environmental perspective [6], [7].
In spite of its lower computational footprint, RESATTARG equals or outperforms state-of-the-art architectures on a variety of tasks and datasets, with a notable exception of a dataset whose structural properties are crucial for a correct identification of argumentative components. We conducted ablation studies to evaluate the performance gain yielded by the attention module and the ensemble learning addition and to compare to our previous work [15]. The use of ensemble also increases the robustness of this approach against the intrinsic randomness of neural architecture training. The attention module could be instrumental to interpreting the behavior of the model, although our analysis gives mixed results on that.
The main limitations of RESATTARG are its limited scalability to large documents, and its failure to accommodate task-specific structural constraints. The latter is a design choice, and our results show that highly regular datasets and tasks are best addressed with dedicated architectures. Alternatively, neural-symbolic approaches [24], [85] may enable a systematic and modular integration of background knowledge. Such knowledge would contribute during the optimization process, so as to influence and improve the training, without compromising the generality of the neural architecture. However, these approaches risk exacerbating the existing scalability issues [86]. When facing very high numbers of argument pairs, we addressed scalability by limiting the range of argumentative relationships using a fixed-size window, which, although cognitively plausible and in agreement with annotated dataset statistics, meant imposing an additional constraint on the model of the argument. Alternatives to our pair-based approach include multiple-choice classifiers [48], pointer networks [34], and sequence labelling [18]. Such methods should scale better, but they enforce a constraint on the argument model as well, imposing that any component can have only one outgoing relationship, which makes them unsuitable for some corpora. Finally, since all the datasets are strongly unbalanced, the use of a weighted loss [48] or augmentation techniques [87], [88] may contribute a further performance gain.