Comparing and extending the use of defeasible argumentation with quantitative data in real-world contexts

Dealing with uncertain, contradicting, and ambiguous information is still a central issue in Artificial Intelligence (AI). As a result, many formalisms have been proposed or adapted so as to consider non-monotonicity, yet only a limited number of works and researchers have performed any sort of comparison among them. A non-monotonic formalism is one that allows the retraction of previous conclusions or claims, drawn from premises, in light of new evidence, offering some desirable flexibility when dealing with uncertainty. This research article focuses on evaluating the inferential capacity of defeasible argumentation, a formalism particularly envisioned for modelling non-monotonic reasoning. In addition, fuzzy reasoning and expert systems, extended for handling non-monotonicity of reasoning, are selected and employed as baselines, due to their vast and accepted use within the AI community. Computational trust was selected as the domain of application of such models. Trust is an ill-defined construct; hence, reasoning applied to the inference of trust can be seen as non-monotonic. Inference models were designed to assign trust scalars to editors of the Wikipedia project. In particular, argument-based models demonstrated more robustness than those built upon the baselines, regardless of the knowledge bases or datasets employed. This study contributes to the body of knowledge through the exploitation of defeasible argumentation and its comparison to similar approaches. The practical use of such approaches, coupled with a modular design that facilitates similar experiments, was exemplified, and their respective implementations were made publicly available on GitHub [120,121]. This work adds to previous works, empirically enhancing the generalisability of defeasible argumentation as a compelling approach to reason with quantitative data and uncertain knowledge.


Introduction
Representing and manipulating knowledge with computers is still one of the main challenges in AI. This knowledge includes the ability to perform common sense reasoning, which is often non-monotonic [21]. Non-monotonic reasoning allows additional information to invalidate old claims or conclusions [118,106,91,73,79,21]. The classic non-monotonic reasoning example is given by 'birds fly'. It is reasonable to assume that a particular bird, Tweety, flies, unless it is an exceptional bird: an ostrich, a duck, a penguin, and so on [118]. This type of reasoning can be modelled in AI by several non-monotonic formalisms [21], such as inheritance networks with exceptions [63], semantic networks using Dempster's rule [54], non-monotonic logics [91,96,117] and knowledge-based systems [4]. Still, to the best of the authors' knowledge, there is an absence in the literature of comparisons among some of these formalisms. This research article focuses on the comparison of knowledge-based, non-monotonic systems, which make use of rules or arguments supporting or contradicting certain conclusions to formalise non-monotonic reasoning. Examples include fuzzy reasoning [142] and expert systems [48] with the addition of non-monotonic layers, and computational argumentation, also referred to as defeasible argumentation [24,113]. All these approaches have led to the development of non-monotonic reasoning models usually built upon knowledge bases provided by human experts. Therefore, their performance depends on the amount and quality of knowledge available. However, they also allow such knowledge, possibly fragmented, vague, and non-algorithmic, to be represented in a natural and structured way [59]. Intuitively, this provides a higher degree of interpretability and transparency to the reasoning process. This is because these models attempt to use human language and to follow the way humans reason.
This attempt potentially increases their explainability, which is essential for their adoption and usage. Nonetheless, these advantages have not been sufficient to increase the use of defeasible argumentation technology when performing quantitative reasoning under uncertainty. In this case, quantitative reasoning is understood as reasoning built with domain knowledge and performed on quantitative data, thus being able to provide numerical inferences. Bench-Capon and Dunne [16] identified a set of challenges that need to be overcome for achieving this goal, including the lack of a strong link between argumentation and other formalisms and the lack of engineering solutions for the application of argumentation. Another problem arises from the early stage of research on adding quantitative approaches to argumentation [80,56]. Often, quantitative approaches in AI are deemed limited for their inability to provide justifiable conclusions [35]. Hence, this study attempts to empirically evaluate the inferential capacity of defeasible argumentation models against other similar reasoning approaches. In addition to defeasible argumentation, non-monotonic fuzzy reasoning and expert systems are selected and employed as baselines, due to their vast and accepted use within the AI community.
To perform this comparison, the problem of representing the construct of computational trust has been chosen. Trust is a crucial human construct investigated by several disciplines, such as psychology, sociology, and philosophy [128]. It is an ill-defined construct, whose application lies in the domain of knowledge representation and reasoning. In this research article, the modelling of reasoning applied to the inference of computational trust is proposed in the context of the Wikipedia project. The goal is to design non-monotonic, knowledge-based models capable of assigning a trust value in the range [0, 1] ⊂ R to Wikipedia editors on a case-by-case basis. A value of 1 means complete trust should be assigned to an editor, while 0 means an absence of trust. These models are built upon domain knowledge and instantiated by quantitative data, and can thus provide numerical inferences. Moreover, the domain knowledge employed contains pieces of evidence and arguments that can be withdrawn in light of new information, allowing the proposed assignment of trust to be seen as a form of defeasible reasoning activity. It is expected that the comparison of non-monotonic reasoning approaches applied in the domain of computational trust will improve the perception of defeasible argumentation in relation to other similar alternatives. Moreover, such an experiment will also add to previous works that have made similar empirical comparisons in different domains of application [122,86,124,125], potentially enhancing the generalisability of defeasible argumentation as a compelling approach to reason with quantitative data and uncertain knowledge. In turn, this enhancement could possibly enable different applications and experiments, likely to be defeasibly modelled, to be carried out.
The remainder of this paper is organised as follows: Section 2 provides the related work on expert systems and fuzzy reasoning, including their options for handling non-monotonicity, followed by defeasible argumentation and a short description of computational trust. Section 3 presents the design of the empirical experiment proposed to allow the envisioned comparison. Section 4 describes and analyses the results and presents the respective discussion. Lastly, Section 5 concludes with a summary of the study, its limitations and findings, and recommendations for future research.

Literature and Related Work
In order to enhance the understanding of non-monotonic reasoning approaches, in particular defeasible argumentation, this section provides the reader with the main notions and properties of non-monotonicity, a brief description of the most common non-monotonic logics, and a precise description of the studied knowledge-based systems. Lastly, the section concludes with a short review of some works that have attempted to compare non-monotonic formalisms, followed by an introduction to computational trust, the application domain chosen for comparison purposes from a real-world context.

Non-monotonic Logics
Non-monotonic logics are fundamental for a better comprehension of the concept of non-monotonicity. Default logic, autoepistemic logic and circumscription have commonly been referred to as the most important ones [15,23].
Briefly, default logic [117] models default reasoning, which is performed by employing default knowledge. Default knowledge is of a form that can be retracted if new information that falsifies its preconditions becomes available. The standard example is given by 'normally, birds fly'. If Tweety is a bird, and there is no information to assume that Tweety does not fly, it is assumed that Tweety flies. Rules in this logic are called defaults and are represented by expressions of the following form: p(x) : j_1(x), ..., j_n(x) / c(x), where p(x) is the prerequisite, the j_i(x) are justifications and c(x) is the consequent of the default. Formally, in the Tweety example: bird(Tweety) : flies(Tweety) / flies(Tweety). In natural language: if Tweety is a bird, and based on the available information it is possible to assume that Tweety flies, then infer that Tweety flies.
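For illustration only, the firing condition of such a default can be sketched in a few lines of Python; the string-based encoding of literals and their negations below is a simplification of ours, not part of Reiter's formalism:

```python
# Illustrative sketch of firing a Reiter-style default p : j / c.
# A default fires when its prerequisite p is known and the negation
# of its justification j is NOT known (i.e. j is consistent).

def apply_default(known, prerequisite, justification, consequent):
    """Add the consequent if the default can fire on the current knowledge."""
    if prerequisite in known and ("not " + justification) not in known:
        known.add(consequent)
    return known

# bird(Tweety) : flies(Tweety) / flies(Tweety)
known = {"bird(Tweety)"}
apply_default(known, "bird(Tweety)", "flies(Tweety)", "flies(Tweety)")
assert "flies(Tweety)" in known

# New evidence blocks the default: with the fact that Tweety does not
# fly (e.g. Tweety is a penguin), the same default no longer fires.
known = {"bird(Tweety)", "penguin(Tweety)", "not flies(Tweety)"}
apply_default(known, "bird(Tweety)", "flies(Tweety)", "flies(Tweety)")
assert "flies(Tweety)" not in known
```

The second knowledge base illustrates the non-monotonic character of the rule: adding information removes a conclusion that was previously derivable.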
Similarly, autoepistemic logic [96] is a formalism of modal reasoning [32] focused on modelling reasoning about what is believed from known propositions. It assumes that any specific information should be known, and hence it is possible to reason with what is known. Under this logic, if Tweety is a bird and it is not believed that Tweety does not fly, then Tweety flies. Circumscription [91], in turn, tries to represent non-monotonicity through the concept of abnormality. A circumscription technique presumes that Tweety flies because there is no information to show that it has any abnormality.
In essence, proof-theoretic formalisations are often hard to evaluate with respect to their consistency and the intuitions that are supposed to be captured. For instance, Poole [109] describes the problem of not committing to implicit assumptions when using default reasoning. Suppose the standard example, 'normally, birds fly', and that by default reasoning Tweety flies: is it then reasonable to assume that Tweety is not an emu or a penguin? Poole proposes three solutions: not concluding that Tweety is a bird; making no commitment as to whether Tweety is an emu or a penguin; and concluding that Tweety is neither an emu nor a penguin. Compared to non-monotonic logics, knowledge-based systems are better suited for capturing the intuitions of a specific problem. Since rules or arguments must be predefined, only relevant non-monotonic contexts are modelled, leaving little, if any, place for confusion. The next subsections introduce the systems employed in this research article.

Expert Systems
Succinctly, expert systems are defined as systems that try to transfer a vast body of specific knowledge from a human to a computer. They attempt to emulate such a human in a given field [48] and are aimed at accomplishing tasks that require human expertise or at playing the role of an assistant [66]. Their structure is usually composed of two internal components: a knowledge base and an inference engine [48]. The former is provided by a human expert and generally translated into a set of logical rules. The latter is aimed at eliciting, firing and aggregating such rules towards a conclusive inference. Rules are used to define what to do and what to conclude in different scenarios. Usually, they follow the form depicted in Fig. 1. These rules, in conjunction with a set of facts (for instance, data) and an interpreter that decides the application of the rules, are what constitute a rule-based system. They can model a large range of problems, once the domain knowledge is represented as IF-THEN rules. It is important that the number of rules not be so large as to make the interpreter inefficient [58]. In this case, the system can exploit users' inputs and pieces of information stored in the knowledge base to reason with.
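As a minimal sketch of such an interpreter, assuming hypothetical rules about Wikipedia editors (the attribute names, thresholds and conclusions below are illustrative, not taken from the models of this study), the inference engine can be reduced to a loop that fires every IF-THEN rule whose antecedent holds over the current facts:

```python
# Minimal rule-based interpreter sketch (all names are illustrative).
# Each rule pairs a condition over the facts with a concluded fact.

rules = [
    (lambda f: f["edits"] > 100,           ("experienced", True)),
    (lambda f: f["reverted_ratio"] < 0.05, ("reliable", True)),
]

def infer(facts, rules):
    """Fire every rule whose IF-part holds and add its THEN-part."""
    conclusions = dict(facts)
    for condition, (key, value) in rules:
        if condition(conclusions):
            conclusions[key] = value
    return conclusions

result = infer({"edits": 250, "reverted_ratio": 0.02}, rules)
assert result["experienced"] is True and result["reliable"] is True
```

Real interpreters add conflict-resolution strategies and repeated match-fire cycles; this single pass mirrors the one-step reasoning adopted later in this article.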

Non-monotonicity in Expert Systems
The use of non-monotonic logics in expert systems has been studied for several decades [51]. Nonetheless, the general use of non-monotonic reasoning in industry has not been extensive [98,97,116]. A few examples of expert systems that deal with non-monotonicity are proposed through the use of inheritance with exceptions in semantic networks [98], through the use of defeasible logic [101], through the use of default reasoning [50], and through the use of probabilistic reasoning [75]. In this research article, knowledge in expert systems is represented by rules, and the respective reasoning is performed in a single step. In other words, data is imported, and all rules are fired at once. Thus, to retract a rule, the notions of 'contradictions' or 'exceptions' are employed. These are defined by domain experts and describe special cases in which a rule is no longer valid. Once a special case is triggered, a backtracking search is employed to remove the affected rules [116, chap. 9]. This might require excessive effort, depending on the amount of data to be managed and the number of reasoning steps firing the backtracking search. However, note that even though this is a simplistic procedure, it can still be effectively implemented in a single-step reasoning problem with a reasonable number of rules. That is the case with computational trust, the application chosen here for comparison purposes. Because reasoning applied to model computational trust can be performed in a single step with a reasonable number of rules, the expert systems designed in this study do not follow a usual multi-step reasoning process. Further information on the domain of application and design methodologies is detailed afterwards.
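A minimal sketch of this single-step procedure, with a hypothetical rule and attribute names of our own, might look as follows; since all rules fire at once, a rule is simply dropped whenever one of its expert-defined exceptions holds on the data:

```python
# Single-step reasoning with expert-defined exceptions (names are
# illustrative). A rule's conclusion is retracted when any of its
# exceptions is triggered by the data.

rules = {
    "trust_editor": {
        "if": lambda d: d["edits"] > 100,
        "exceptions": [lambda d: d["vandalism_flags"] > 0],
    },
}

def fire(rules, data):
    """Fire all rules at once; drop those whose exceptions hold."""
    fired = set()
    for name, rule in rules.items():
        if rule["if"](data) and not any(e(data) for e in rule["exceptions"]):
            fired.add(name)
    return fired

assert fire(rules, {"edits": 200, "vandalism_flags": 0}) == {"trust_editor"}
assert fire(rules, {"edits": 200, "vandalism_flags": 3}) == set()
```

In a multi-step system the retraction would instead require the backtracking search described above; here a single pass suffices.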

Fuzzy Reasoning
Fuzzy reasoning models are the product of knowledge-based systems that incorporate fuzzy logic and/or fuzzy sets [142] into their reasoning and knowledge representation techniques [69]. Fuzzy logic calculus coupled with the notion of fuzzy membership functions allows for an effective knowledge representation of imprecise and uncertain pieces of information. Such notions can be exploited by IF-THEN rules of similar structure to those depicted in Fig. 1. In this case, a fuzzy rule-based system arises, which is most useful when modelling systems that make use of linguistic variables as their antecedents and consequents. In turn, rules can be employed for the construction of a fuzzy control system. Usually, such a system is composed of a set of crisp inputs, a knowledge base, a fuzzification module, an inference engine and a defuzzification module [104], as depicted by the diagram in Fig. 2. This structure begins with the fuzzification module assessing the membership grades of crisp inputs associated with fuzzy sets. Subsequently, a fuzzy inference method must be applied in order to produce an inference.
Many such inference methods are available [130,133,87]. This research article employs the 'Mamdani' fuzzy inference method [87], which is often used in practice [127]. Finally, defuzzification methods [60], such as centroid or mean-max membership, must be applied to convert a fuzzy set into a crisp output in different fashions, in contrast to fuzzification methods, which convert crisp inputs into fuzzy sets.
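To make the fuzzification-inference-defuzzification pipeline of Fig. 2 concrete, the following is a hand-rolled, single-rule sketch of a Mamdani-style step with centroid defuzzification; the membership functions, universes and variable names are illustrative assumptions of ours, not those of the models evaluated later:

```python
# Single-rule Mamdani sketch: fuzzify, clip with 'min' implication,
# defuzzify by centroid. All shapes and names are made up.

def tri(x, a, b, c):
    """Triangular membership function peaking at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def mamdani(crisp_input):
    # Fuzzification: degree to which the input counts as 'high activity'
    activity_high = tri(crisp_input, 50, 100, 150)
    # Rule: IF activity IS high THEN trust IS high ('min' implication)
    xs = [i / 100 for i in range(101)]                      # trust universe [0, 1]
    clipped = [min(activity_high, tri(x, 0.5, 1.0, 1.5)) for x in xs]
    # Defuzzification: centroid of the clipped output fuzzy set
    area = sum(clipped)
    return sum(x * m for x, m in zip(xs, clipped)) / area if area else 0.0

trust = mamdani(90)            # the rule is strongly activated
assert 0.5 < trust <= 1.0      # centroid falls inside the 'high trust' region
```

A full Mamdani system would aggregate several clipped consequents with 'max' before defuzzifying; the single-rule case already exhibits each module of the diagram.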

Non-monotonicity in Fuzzy Reasoning
Some supplementary extensions of fuzzy inference systems have been suggested in order to tackle the use of non-monotonic rules. Unfortunately, these extensions are few and not well established. For example, in [27], conflicting rules have their conclusions aggregated by a possible averaging function. However, an inference from a non-monotonic fuzzy rule cannot be propagated, since the theory does not allow circularity. Another type of non-monotonicity in fuzzy systems is investigated in [53]. In this instance, non-monotonicity arises when identical consequents are inferred by distinct permutations of variables in the antecedents. The goal of this approach is to remove redundant rules while preserving the crisp values returned by the defuzzification module of a Mamdani fuzzy system. A third approach is given by Siler and Buckley [129, chap. 8], whereby possibility theory [45] is included in the fuzzy reasoning system to tackle conflicting instructions. Siler and Buckley [129] assume that 'truth values represent necessity, the extent to which the data support a proposition'. At the same time, they also treat truth values that represent possibility, the extent to which the data fail to refute a proposition. The possibility of a proposition A is denoted Pos(A), while its necessity is denoted Nec(A). Both are values in [0, 1] ⊂ R. Necessity is also assumed to represent the traditional truth values reviewed in the previous subsections, while truth values that represent possibility need to be added to the system. Moreover, it is assumed that adding supporting evidence can affect the necessity but not the possibility of a proposition, and adding contradicting evidence can never increase possibilities. In other words, Nec(A) ≤ Pos(A) for any proposition A. In this case, it is guaranteed that propositions are defeasible; for example, if Nec(a) = 1 and Pos(a) = 0 were allowed to hold, it would not be possible to refute a.
Under these circumstances, the effect on the necessity of a proposition a of a set of propositions {Q_1, ..., Q_k} contradicting a and a set of propositions {P_1, ..., P_j} supporting a is derivable as:

Nec(a) = min(max(Nec(a), Nec(P_1), ..., Nec(P_j)), ¬Nec(Q_1), ..., ¬Nec(Q_k))    (1)

where ¬Nec(Q_i) = 1 − Nec(Q_i), the union is implemented by the 'max' operator and the intersection by the 'min' operator. A few axioms used to develop conventional possibility theory are not considered in this approach, due to their incompatibility with other fuzzy logics. However, according to [129], the advantage provided is a functional theory when incorporated into fuzzy reasoning with rule-based systems. For instance, suppose a proposition a whose necessity is 0.3 and possibility is 1.0; if P (necessity 0.4) supports a and Q (necessity 0.2) refutes a, then Nec(a) = min(max(0.3, 0.4), 1 − 0.2). Thus, the new extent to which the data support a is 0.4, because of the support of P and the failed attempt of refutation from Q, due to its low necessity. Unlike other reviewed implementations of non-monotonicity, note that this approach does not restrict the type of membership functions, methods of fuzzy inference, methods of defuzzification or the propagation of inferences generated by non-monotonic rules. However, it does require that possibility values be defined. Simple approaches might be to assume possibility 1 for propositions that can be refuted by any other piece of information, and possibility 0 for propositions that cannot be refuted by any other piece of information.
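The necessity update of Equation (1) is straightforward to implement; the sketch below, with the union as 'max', the intersection as 'min' and negation as 1 − x, reproduces the worked example above (function and argument names are ours):

```python
# Sketch of the necessity update of Equation (1), following Siler and
# Buckley: union = max, intersection = min, negation = 1 - x.

def update_necessity(nec_a, supporting, contradicting):
    """New necessity of a, given the necessities of supporting
    propositions P_1..P_j and contradicting propositions Q_1..Q_k."""
    support = max([nec_a] + list(supporting))        # union with supports
    refutations = [1 - q for q in contradicting]     # negated contradictors
    return min([support] + refutations)              # intersection

# Worked example from the text: Nec(a) = 0.3, P supports with 0.4,
# Q refutes with 0.2  ->  min(max(0.3, 0.4), 1 - 0.2) = 0.4
assert update_necessity(0.3, [0.4], [0.2]) == 0.4
```

A strong contradictor (necessity close to 1) drives the result towards 0, while weak contradictors, as in the example, leave the supported value untouched.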
As for applications of non-monotonic fuzzy systems, to the best of our knowledge, no paper has adopted these reviewed approaches in real-world contexts. Instead, they have been evaluated in simulated environments or through hypothetical examples. In particular, Siler and Buckley [129] exemplify their proposal of adding possibility theory to a fuzzy reasoning system with a simplified application in the medical field for demonstration purposes. The goal was to determine the anatomical significance of regions in an echocardiogram composed of ultrasound images of a beating heart. Rules were automatically created from a classification database. Contradictions arose from noisy images inferring mutually exclusive conclusions and were resolved with the aid of Equation (1).

Defeasible Argumentation
Argumentation deals with the study of the assertion and definition of arguments, usually emerging from divergent opinions. Its use is rooted in the common tradition, from Aristotle to the present, of employing reasons to decide how to act. In AI, unlike expert systems and fuzzy reasoning, computational argumentation theory was introduced as a formalism for modelling non-monotonic reasoning. Thus, it does not require an implementation of non-monotonicity as described for the previous reasoning approaches. Non-monotonicity is naturally present through distinct techniques for the evaluation of the dialectical status of arguments. This evaluation usually determines which arguments should be ultimately accepted or rejected in an interconnected network of arguments. Nowadays, AI is regarded as one of the main areas of application of argumentation theory. In this field, argumentation aimed at developing computational models of arguments might also be referred to as defeasible argumentation [113], a paradigm that has become increasingly significant [16] and widely employed for modelling non-monotonic reasoning [36].
Computational argumentation systems are usually structured around layers specialising in the definition of the internal structure of arguments, the definition of argument interactions, the resolution of conflicts between arguments and the possible resolution strategies for reaching a justifiable conclusion [113]. However, as the boundaries of such layers might not be precisely defined, a few layered structures have been proposed [112,11] for the development of computational models of argument. Another example of a multi-layered structure, proposed in [80] for the creation of argument-based models of inference, is depicted in Fig. 3.
The present research article adopts this structure, due to the nature of the experiment and the application selected for comparison purposes (computational trust), which relies on the knowledge of a single expert in order to achieve numerical inferences. In the literature of defeasible argumentation, many works focus on each of these layers.
For instance, in Layer 1, Toulmin [132] was one of the first to introduce a conceptual model of argument with better-structured arguments. Walton [138] proposes a different approach with several argumentation schemes with which a subject can build a point of view, such as argument from consequence, appeal to expert opinion, and argument from analogy. Other structures are also possible [17], and they help to clarify possible ways in which arguments can be represented. Nonetheless, argumentation systems might still be constructed with simpler arguments, for instance when these are represented by a pair of premises and a conclusion.
In Layer 2, several works attempt to model the relationship between arguments and the management of their interactions, which means the actual arguing process. The classification proposed by Prakken [110] exemplifies three different classes: undercutting and rebutting attack, first formalised in [107], and undermining attack, introduced in [137]. An undermining attack refers to an argument being attacked on one of its premises, thus it is the only possible class of attack for deductive inferences. In contrast, the classes of undercutting and rebutting attack target respectively the inference link and the conclusion of an argument, structures which can be denied only in a defeasible argument. Another type of interaction is given by supporting relations between arguments. An argument system that employs relations of attack and support is said to be making use of the concept of bipolarity [30]. This concept is not employed in this research article, but there is a broad amount of research on the topic of bipolar argumentation, to which the interested reader is referred [134,31,8,100].
Subsequently, in Layer 3, it is possible to find works focused on characterising the success of an attack (often referred to as defeat), since defeat relations can often be influenced by the reliability of tests and expertise. Commonly, attacks take the form of a binary relation. However, in order to determine a defeat, two other trends have been observed in the literature of argumentation: the preferentiality/strength of arguments and the preferentiality/strength of attacks. Strength of arguments is recognised as a valid source of information for those deciding on a collection of acceptable arguments [47]. It can be employed in a variety of ways, such as using priorities among rules to solve their rebuttals [111], allowing only arguments of greater strength to attack arguments of lesser strength [108], or even inverting the direction of attacks [7]. Dunne et al. [47] proposed, instead, the use of the strength of attack relations. Their approach is justified by the fact that it is not only the strength of arguments that is important, but also the strength of the attack that one argument makes on another.
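As a small illustration of one such policy, the convention that only arguments of equal or greater strength can successfully attack [108] turns attacks into defeats in a single pass; the argument names and strength values below are made up for the example:

```python
# Filtering attacks into defeats by argument strength (one common
# convention: an attack succeeds only if the attacker is not weaker
# than its target). Names and numbers are illustrative.

strength = {"A": 0.9, "B": 0.6, "C": 0.6}
attacks = [("A", "B"), ("B", "A"), ("B", "C")]

defeats = [(x, y) for (x, y) in attacks if strength[x] >= strength[y]]

# A's attack on B succeeds, B's rebuttal against A fails, and B,
# being as strong as C, still defeats C.
assert defeats == [("A", "B"), ("B", "C")]
```

The resulting defeat relation, rather than the raw attack relation, is what the acceptability semantics of the next layer operates on.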
Once attacks have been evaluated, the acceptance status of arguments can be defined through acceptability semantics in Layer 4. Conflicts by themselves do not demonstrate which arguments should be ultimately accepted. To do so, it is necessary to evaluate the overall interaction of arguments across the conflicting set. Most frequently, this evaluation relies on the abstract argumentation theory proposed by Dung [46] and later extended [139,26,25]. Hence, acceptability semantics return extensions, or subsets of arguments that can be mutually acceptable, according to the specific rationale of the semantics. Therefore, a notion of scepticism is usually employed in the informal discussion behind the behaviour of semantics [14]. For example, the grounded semantics is considered more sceptical for taking fewer committed choices and always providing a single extension. By contrast, the preferred semantics is seen as a more credulous approach for being more audacious when accepting arguments, and can consequently provide more than one extension. Still, Dung's semantics and its variations are not the only class of semantics employed in abstract argumentation theory. Another well-known class of semantics is the ranking-based one [5,9,29,89,115,42]. In ranking-based semantics, the goal is not to provide an extension, but to rank arguments in order to define the most important one(s). It is a more flexible class of semantics, in the sense that arguments are not strictly rejected or accepted; instead, a graded assessment of arguments is provided, based on the topology of the argumentation framework.
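For instance, the grounded extension of Dung's abstract framework can be computed as the least fixed point of its characteristic function; the sketch below, over a made-up three-argument framework, iterates until no new argument is defended:

```python
# Grounded extension as the least fixed point of Dung's characteristic
# function. The framework (arguments and attacks) is a toy example.

def grounded(arguments, attacks):
    attackers = {a: {x for (x, y) in attacks if y == a} for a in arguments}
    accepted = set()
    while True:
        # An argument is in if every attacker is itself attacked by an
        # already-accepted argument (unattacked arguments are in trivially).
        new = {a for a in arguments
               if all(any((d, b) in attacks for d in accepted)
                      for b in attackers[a])}
        if new == accepted:
            return accepted
        accepted = new

# A attacks B, B attacks C: A is unattacked, and A defends C against B.
ext = grounded({"A", "B", "C"}, {("A", "B"), ("B", "C")})
assert ext == {"A", "C"}
```

On a mutual attack between two arguments the loop returns the empty set, reflecting the sceptical behaviour of the grounded semantics described above.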
Finally, in Layer 5, the assessment of the statements supported by acceptable arguments is performed. When performing defeasible inferences, for practical purposes, usually a single decision needs to be made or a single action performed. However, multiple acceptable arguments may be computed in the previous steps. Usually, these coincide with possible consistent points of view that can be simultaneously considered for describing the knowledge being modelled. In the case of extension-based semantics, extensions might contain multiple arguments, multiple extensions might be computed, or both. In the case of ranking-based semantics, multiple arguments might be ranked at the top, a situation that can easily occur when multiple arguments are not attacked. Despite the suggestion in [71] of ranking the arguments by the number of extensions they belong to, this is not a problem usually addressed in the literature. In the present research, therefore, the premises and conclusions of the employed arguments are linked to categorical or numerical datasets. This allows for a simplified quantification of their values and their aggregation in different fashions, namely average, sum or median. In turn, this aggregation produces a final inference (a number), which is only possible due to the use of arguments built upon associated numerical datasets.
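Under this approach, the final step reduces to a plain aggregation of the numerical conclusions attached to the accepted arguments; the argument names and values below are illustrative:

```python
# Layer 5 sketch: aggregating the numerical conclusions of accepted
# arguments into a single inference (values are illustrative).
from statistics import mean, median

accepted_conclusions = {"A1": 0.8, "A2": 0.6, "A3": 0.7}
scores = list(accepted_conclusions.values())

trust_avg = mean(scores)     # ~0.7
trust_sum = sum(scores)      # ~2.1 (may need rescaling back to [0, 1])
trust_med = median(scores)   # 0.7
```

Average and median stay within the [0, 1] trust range by construction, whereas a sum-based accrual would generally require normalisation before being read as a trust value.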
Some works make use of all the reviewed multi-layer structures (Fig. 3) in their systems [33,64,39], whereas others do not [105,55,57]. Table 1 lists several argument-based systems and their approaches for each layer in some prominent areas of argumentation. Some of these approaches may not have been previously reviewed, since they are beyond the scope of this study.

[Table 1, partially recovered: argument-based systems and their approaches for each layer. Recoverable fragments include a system using grounded and preferred semantics with the categoriser function and accrual by extension cardinality plus average, and the computational trust system of [43], built upon argument schemes with a sum aggregation function.]

Comparison of Non-monotonic Formalisms
A comprehensive comparison of different approaches to non-monotonic reasoning is a subject not sufficiently developed in the literature. In particular, to the best of the authors' knowledge, there is a very limited number of empirical works that incorporate the diversity of reasoning methodologies considered in this research article. Still, some works have proposed different comparisons among non-monotonic formalisms and are important to mention. For instance, Brewka et al. [22] provide a good overview that categorises non-monotonic logics into modal-preference logics, fixed-point logics and abductive methods. The recent work in [62] presents guidelines for the selection of non-monotonic logics, covering 17 different types of logics. Konolige [72] also studies the relation between non-monotonic logics, specifically between default and autoepistemic logic.
Delladio et al. [40] investigate the relations between normal default logic and a variant of defeasible logic programming. The former is a special case in which the justifications and consequents of default rules are the same, or in terms of default logic: p(x) : c(x)/c(x). The latter is a formalism that combines logic programming and defeasible argumentation, to allow the representation of defeasible and non-defeasible rules. The authors show an equivalence between the consequents from normal default logic and the different answers given by defeasible logic programming. Still, this work investigates a theoretical relationship, limited to certain cases of the selected formalisms.
Dutilh Novaes and Veluwenkamp [49] perform an empirical test of the accuracy of two formal non-monotonic reasoning models: preferential logic, a non-monotonic logic extended from a monotonic one; and screened belief revision, a particular version of belief revision theories [52]. The experiment attempts to demonstrate which of the two formalisms can better predict belief bias, a form of human reasoning. This examination is different from the investigation proposed in this research article, which instead assesses the inferential capacity of non-monotonic, knowledge-based reasoning approaches for quantitative inferences in real-world contexts.
More recently, Arieli et al. [10] perform a comparative study among logic-based approaches to formal argumentation, alongside a theoretical discussion of the relations between these and other non-monotonic reasoning formalisms. For instance, it describes how certain extensions of autoepistemic logics [96] and default logic [117] could be translated to an extension of assumption-based argumentation [19], another formalism designed to capture and generalise non-monotonic reasoning.
Yang et al. [140] compare first-order predicate logic, fuzzy logic and non-monotonic logic implemented through negation as failure. The methods were contrasted using a simulation approach in which experiment facts were treated as random numbers. In turn, a set of algorithms is provided for their investigated transformations and modifications. This work possesses a very similar motivation to the present research article in comparing different reasoning formalisms, even if some are monotonic. It attempts to evaluate the capacity of inference and the complexity of these distinct approaches. Yet, despite proposing an interesting mechanism for experimentation, the study does not evaluate the subject in real-world domains. Instead, the simulation approach seems to elucidate the performance of such methods only in a computationally-controlled environment, for instance when different types of transformation must be applied to numerical datasets.

Computational Trust Modelling
Computational trust modelling, a knowledge-representation and reasoning problem, has been selected in order to allow the comparison of defeasible argumentation with other similar reasoning techniques. Trust has been investigated by several disciplines from different perspectives, such as psychology, sociology, and philosophy [128]. Many definitions of trust can be found in the literature [102]. Briefly, it can be described as a prediction that a trusted entity will bring to completion the expectations of a trustier in some specific context. The first computational model of trust was proposed in [88]. Its goal was to enable artificial agents to make trust-based decisions in the domain of Distributed Artificial Intelligence. The modelling of computational trust also has several applications in digital systems, for instance: reputation management [141,93]; social search and collective intelligence [83,85]; user behaviour modelling [82]; and self-adaptive recommendations [84,44].
Several works have examined the relation between argumentation and computational trust. For instance, Matt et al. [90] review how argumentation can help agents make decisions. They also discuss how arguments can improve the assessment of the trustworthiness of certain agents by supporting predictions of these agents' future behaviours. In turn, Parsons et al. [103] present a set of argument schemes for reasoning about trust, aimed at providing a computational mechanism for establishing arguments about trustworthiness. These schemes are also accompanied by a set of critical questions that can rule out their use. Other works have also focused on defining argument-based approaches for reasoning about trust [6,131].
In this research article, the context under evaluation comes from the Wikipedia project. Collaborative, user-generated content is of essential importance on the web. Hence, sites such as Wikipedia, TripAdvisor and Flickr leverage the interest and contribution of people all over the world. The drawback lies in the disparate origin and quality of such contributions, which makes it difficult for visitors and content moderators to assess their reliability. Wikipedia itself is under continual change from different sources, ranging from domain experts and casual contributors to vandals and committed editors. Therefore, many works have investigated the problem of computing the trust of Wikipedia editors or articles. For instance, Adler and de Alfaro [2] present a content-driven reputation system for Wikipedia editors, assuming that the reputation of editors can be used as a rough guide to the trust assigned to articles edited by them. In turn, reputation is assigned according to the longevity of the text inserted, and the longevity of the text edited by each editor. In a subsequent work, Adler et al. [3] compute the trust of a word in a Wikipedia article according to the reputation of the original editor of the word, as well as the reputation of editors who edited content in the vicinity of the word. The study demonstrates that text labelled as high trust has a significantly lower chance of being edited in the future. Similarly, Zeng et al. [143] explore the revision history of an article to assess its trustworthiness through a dynamic Bayesian network. In short, other works evaluate the trust of Wikipedia's contributors through a multi-agent trust model [74] and the Wikipedia editor reputation through the stability of inserted content [68].
To conclude, let us point out that the proposed use of defeasible argumentation and other reasoning approaches in this research article is not aimed at enhancing the assessment of computational trust. Hence, the performed experiments are not compared with the aforementioned works. Nonetheless, to the best of the author's knowledge, the use of non-monotonic reasoning, instantiated by quantitative information, for the inference of trust of Wikipedia editors has only been attempted in previous works [126,119]. These employed different sets of data and/or reasoning approaches. Thus, the investigation proposed in this research article is a more comprehensive one, extending these and other previous works [122,124,125] that have compared knowledge-based, non-monotonic reasoning approaches applied in different domains of application. Therefore, the main goal of this research article is to enhance the generalisability of defeasible argumentation as an effective approach to reason with quantitative, uncertain and conflicting information in real-world contexts. The next section provides a precise description of the research problem, the formulated hypothesis and the methods applied to test it.

Design and Methodology
The research problem being addressed is how defeasible argumentation compares to similar reasoning approaches, such as expert systems and fuzzy reasoning, when used for the formalisation of non-monotonic reasoning models of inference. Moreover, this paper focuses on the case in which such models can be instantiated by quantitative data from real-world domains. Recently, there seems to be an increase in the use of defeasible argumentation as the basis of models employed in practice. Therefore, the assumption is that this approach could also be more suitable for modelling non-monotonic reasoning and producing non-monotonic reasoning models of inference in the domain of computational trust. The confirmation of this assumption would reinforce the applicability and generalisability of defeasible argumentation with quantitative data in real-world contexts. Consequently, it could also aid other scholars in adopting the proposed approach in other domains of application. To investigate this, an inductive type of research is proposed, that is, one which attempts to propose broader generalisations from specific observations. An observation comes from the recent increased use of argument-based models in fields such as health care, knowledge-representation and reasoning, and multi-agent systems. This leads to the following hypothesis:

Hypothesis. If computational trust is modelled with defeasible argumentation, then the inferential capacity of its models will be superior to that achieved by non-monotonic fuzzy reasoning and expert system models, according to a predefined set of evaluation metrics from the domain of application.
To test this hypothesis, non-monotonic reasoning models of inference are designed and built with the pre-existing theoretical knowledge of the investigated reasoning approaches, as reviewed in Section 2. The goal is to design non-monotonic reasoning models capable of assigning a trust value in the range [0, 1] ⊂ R to Wikipedia editors. A value of 1 means complete trust should be assigned to an editor, while 0 means an absence of trust assigned to the editor. In turn, such models are instantiated with real-world data, allowing a statistical comparison of the produced inferences. Fig. 4 depicts the designed experiment with the evaluation phases incorporated into the flow.

Figure 4: Design of a comparative empirical research study and its evaluation schema aimed at evaluating the inferential capacity of defeasible argumentation. Phases: design of non-monotonic reasoning models from domain knowledge; instantiation of models with quantitative, real-world datasets from the domain of computational trust; comparison of inferences.

First, knowledge bases structured around natural language terms are employed by the non-monotonic reasoning approaches for the design of inferential models. These same models are later instantiated by real-world datasets from the domain of computational trust, producing three sets of inferences, one for each reasoning approach. These sets are subsequently analysed for the comparison of the inferential capacity of the reasoning approaches. This comparison is done by assessing the values assigned to Wikipedia Barnstar editors. A Barnstar 1 represents an award used by Wikipedia to recognise valuable editors. It is a non-automatic award bestowed by one Wikipedia editor upon another. Therefore, it is not a ground truth for trust. Instead, it is used as a proxy measure to identify trustworthy editors and to allow the selection of evaluation metrics later detailed in this section.

Datasets Employed and Knowledge Bases Design
Wikipedia makes all its data available for download through HTML or XML dumps 2, including articles, articles' history and complete text data. Hundreds of different language editions are available for download. Since no natural language information is analysed, but only quantitative data related to editors, the XML dump of the Portuguese-language 3 edition and the XML dump of the Italian-language 4 edition of Wikipedia were selected for examination and downloaded on 8 January 2020. These were selected mainly due to their respective sizes, which were more appropriate to the computational resources available. Moreover, it was expected that two sets of data would be able to reinforce findings and confirm possible observed differences in the inferences produced by the designed non-monotonic reasoning models. According to the Wikimedia Foundation's Analytics 5, the Portuguese file contained 999,696 pages created (excluding pages being redirects), 1,947,023 editors, and 133 Barnstar editors, while the Italian file contained 1,576,621 articles, 2,804,142 editors, and 106 Barnstar editors, both up to January 2020. Each dumped Wikipedia page is identified by its title and has a number of associated revisions containing: i) its own ID; ii) a time stamp; iii) a contributor (editor) identified by a user name, or an IP address if anonymous; iv) an optional commentary left by the editor; v) the number of bytes of the page at the current revision; vi) an optional tag indicating whether the revision is minor or major and should be reviewed by other editors. Fig. A.14 (p. 56) depicts an example of the XML structure. This data was extracted for the definition of the features listed in Table 2 for each editor, resulting in two datasets 6. Temporal factors such as the presence, regularity and frequency factors were first proposed in [81].
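As a minimal sketch of this extraction step, the code below parses per-revision metadata from a MediaWiki-style XML fragment. The element and attribute names are a simplified assumption mirroring the fields listed above (ID, timestamp, contributor, comment, bytes, minor flag), not the exact dump schema.

```python
import xml.etree.ElementTree as ET

# Simplified, illustrative page with two revisions: one registered editor
# and one anonymous (IP-identified) editor who made a minor edit.
SAMPLE = """
<page>
  <title>Example</title>
  <revision>
    <id>101</id>
    <timestamp>2020-01-08T10:00:00Z</timestamp>
    <contributor><username>alice</username></contributor>
    <comment>fixed a typo</comment>
    <text bytes="2400"/>
  </revision>
  <revision>
    <id>102</id>
    <timestamp>2020-01-09T11:30:00Z</timestamp>
    <contributor><ip>192.0.2.7</ip></contributor>
    <minor/>
    <text bytes="2410"/>
  </revision>
</page>
"""

def extract_revisions(xml_text):
    """Return one metadata dict per <revision> element."""
    page = ET.fromstring(xml_text)
    rows = []
    for rev in page.findall("revision"):
        contributor = rev.find("contributor")
        username = contributor.findtext("username")
        rows.append({
            "id": int(rev.findtext("id")),
            "timestamp": rev.findtext("timestamp"),
            # Anonymous editors are identified only by an IP address.
            "editor": username if username is not None else contributor.findtext("ip"),
            "anonymous": username is None,
            "comment": rev.findtext("comment"),  # None when no comment was left
            "bytes": int(rev.find("text").get("bytes")),
            "minor": rev.find("minor") is not None,
        })
    return rows

rows = extract_revisions(SAMPLE)
```

From records of this shape, editor-level features such as the activity factor (number of editions) or total bytes contributed can be accumulated.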
A time window of 30 days was selected for evaluation of the frequency and regularity factors, similarly to the statistical analyses performed by the Wikimedia Foundation. The set of extracted features was employed for the construction of two knowledge bases with the author's knowledge and intuition in this domain. Appendix A (p. 54) lists the information contained in them. The set of IF-THEN rules constructed was the same for both knowledge bases. They were defined intuitively and with the aid of external sources. For instance, the numerical ranges associated with natural language terms employed to describe activity factor and bytes were defined with the aid of the Wikimedia Foundation's Analytics. According to the reports of this foundation, an editor is considered a contributor if he/she has made more than 10 editions in his/her life cycle. Hence an activity factor ≥ 10 was used to infer medium high trust, while activity factor ≥ 20 was used to infer high trust. Similarly, the last report on the mean size of articles in the Portuguese Wikipedia showed a mean of 2,388 bytes per article, while 90% of articles had more than 512 bytes. This information was used to intuitively infer high (medium high) if an editor had contributed at least 2,388 bytes (512 bytes) throughout all his/her editions. Other features were normalised in the range [0, 1] ⊂ R, in order to provide a more standard reasoning process. For instance, the feature not minor was divided by the activity factor, hence providing the percentage of editions flagged as major for all editions of the editor. Features normalised in the range [0, 1] ⊂ R were then described by four natural language terms: low, medium low, medium high and high. Table A.6 (p. 55) lists the feature transformations employed, the associated natural language terms and their respective numerical ranges. It is important to highlight that the constructed IF-THEN rules are limited by the type of data employed. 
While Wikipedia provides the text history of all its articles' editions, no natural language data is exploited in this study. A knowledge base also formed by features that take advantage of natural language data would likely contain stronger information for the inference of the editors' trust.
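The normalisation and term-mapping step described above can be sketched as follows. The quartile ranges below are illustrative assumptions; the actual term/range pairs are those of Table A.6.

```python
def normalise(value, total):
    """Normalise a raw count into [0, 1] relative to a reference total,
    e.g. the not minor count divided by the editor's activity factor."""
    if total == 0:
        return 0.0
    return min(max(value / total, 0.0), 1.0)

# Illustrative quartile ranges for the four natural language terms;
# the paper's actual ranges (Table A.6) may differ.
TERMS = [
    ("low", 0.0, 0.25),
    ("medium low", 0.25, 0.5),
    ("medium high", 0.5, 0.75),
    ("high", 0.75, 1.0),
]

def to_term(x):
    """Map a normalised feature value in [0, 1] to a natural language term."""
    for term, lo, hi in TERMS:
        if lo <= x < hi or (hi == 1.0 and x == 1.0):
            return term
```

For example, an editor with 45 major (not minor) edits out of 60 editions would obtain a normalised value of 0.75, described by the term high.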
Following this, a set of contradictions among IF-THEN rules was defined. This process was carried out in two different ways, resulting in two different knowledge bases. The first was done in an intuitive manner, trying to establish evident relationships, such as low frequency factor contradicting the use of high presence factor to infer high trust. In other words, high presence factor was interpreted as an indication of high trust, unless the same editor also had a low frequency factor. A set of premises was also considered for the definition of agents whose trust should be low, such as a vandal or a bot. For example, an editor who is anonymous, has a low number of comments, a very low number of not minor edits, a high number of pages edited, and a high number of bytes inserted was considered a bot. In other words, this set of characteristics was considered sufficient to assume that an editor was a bot. In turn, this set of premises was used to contradict several IF-THEN rules inferring high trust. This full knowledge base is reported in Appendix A, together with its resulting graphical representation. The second was built by asking the author to identify pairs of features that could possibly have conflicts according to his/her knowledge. Many more contradictions were identified in this case, since it was easier to visualise all the possible pairs of features that could have some conflicting set of beliefs. For example, the arrow from pages to activity factor was drawn as a reminder that while a high activity factor could be an indicator of high computational trust, this should not be the case if the editor has only modified a low number of pages. The idea is that a trustworthy editor would, based on the author's belief and intuition, collaborate on a high number of pages when performing a high number of edits.
In addition, IF-THEN rules that did not contradict each other, but which inferred different trust levels, were considered this time to contradict each other, resulting in a much larger number of contradictions. The rationale for this was to assume that only one trust level should be accepted, demanding more from the conflict resolution strategy of each non-monotonic reasoning approach and ideally performing fewer calculations for the aggregation of rules/arguments before a final inference. Fig. A.16 (p. 60) depicts the resulting graphical representation of this knowledge base. Fuzzy membership functions (FMF) were also designed by the author to model natural language terms such as high, low, very low, and others (Fig. A.17, p. 62). These are necessary for the implementation of fuzzy reasoning models. Theoretically, such functions can provide higher precision for the modelling of natural language terms, or 'fuzzy' concepts. Natural language terms related to the same feature, for instance, low/medium low frequency factor, were designed in such a way that some intersection was possible between the terms. However, note that defining FMF is itself a fuzzy process. Hence, two types of FMF were attempted, a linear and a Gaussian one, which are often employed by fuzzy systems. The reason for having different types was to investigate their impact on the inferential capacity of fuzzy reasoning models. Other types could have been defined and further research can be conducted in this respect.
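The two FMF families can be sketched as below. The centres and widths are illustrative assumptions, not the values of Fig. A.17; the point of the sketch is the overlap between adjacent terms.

```python
import math

def triangular(x, a, b, c):
    """Linear (triangular) FMF: membership 0 at a and c, 1 at the peak b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

def gaussian(x, mean, sigma):
    """Gaussian FMF centred on `mean` with spread `sigma`."""
    return math.exp(-((x - mean) ** 2) / (2 * sigma ** 2))

# Two overlapping terms for one feature, so a value such as x = 0.4
# belongs partially to both 'medium low' and 'medium high'.
medium_low = lambda x: triangular(x, 0.0, 0.33, 0.66)
medium_high = lambda x: triangular(x, 0.33, 0.66, 1.0)
```

The Gaussian variant smooths the transition between terms, which is one way the choice of FMF type can affect the inferences produced.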
In conclusion, note that these knowledge bases were built by a single agent, without the collaboration of other domain experts. Still, it is important to highlight that, despite such knowledge bases not being subject to a formal process of validation, for instance by being inspected by other experts, it is reasonable to assume that they are credible. This assumption comes from the fact that the author has more than 10 years of heavy internet usage, competent qualifications in computer science, and is experienced in a multitude of digital collaborative environments. A different approach would be to collaborate with a larger group of experts, producing greater and more sophisticated knowledge bases. Nonetheless, this approach could also lead to other issues, for example due to an expert's difficulty in understanding the knowledge used by computational models, an expert's capacity for verbalising knowledge, or an expert's capacity for understanding the amount of detail required [95]. In summary, the process of knowledge acquisition is a familiar problem and a frequent bottleneck of knowledge-driven approaches in general. It is not the goal of this research article to propose a solution for such issues. Rather, this research creates trustworthy, credible knowledge bases. In turn, it employs the same knowledge bases to perform the envisioned comparison among non-monotonic reasoning approaches.

Design of Non-Monotonic Reasoning Models Employing Expert Systems
This section provides a step-by-step description of the inferential process of possible expert system models when applied to modelling a non-monotonic reasoning process in the domain of computational trust. A running example is depicted in Fig. 6. This example is referred to throughout this section and is aimed at providing a complete overview of the designed expert system inferential process.

Figure 6: An illustration of a reasoning process modelled by an expert system for a single Wikipedia editor.

IF-THEN Rules
The first step of a rule-based expert system is to model a knowledge base, usually gathered from an expert, with rules of the form "IF (antecedent) THEN (consequent)". In this research article, the antecedent is a set of premises associated with several quantitative features that are believed by the expert to influence the consequent being inferred (computational trust). The consequent might have different levels and is assumed to be derived from the premises by a domain expert. Therefore, no single deductive system is applied: the same premise/s might be used by different domain experts, leading to different conclusions. Each level of a premise in the antecedent, as well as each level of the consequent, is mapped to a numerical range, also by the domain expert. In this way, features associated with a certain level, such as low and high, can be evaluated as true against continuous input values. To formalise the generic case, IF-THEN rules are precisely defined. This definition follows the logical structure in which antecedents can have multiple premises joined with AND/OR boolean operators, while the consequent is a single statement [48, chap. 3].
Definition 1 (Generic IF-THEN rule). A generic IF-THEN rule is defined, without loss of generalisability for OR and AND operators, as:

IF (i_1 in [l_1, u_1]) AND/OR ... AND/OR (i_n in [l_n, u_n]) THEN consequent level with range [l_c, u_c]

where i_n ∈ R is the input value of the feature n with numerical range [l_n ∈ R, u_n ∈ R]; the range [l_c ∈ R, u_c ∈ R] is the numerical range for the consequent level being inferred; and AND and OR are boolean logical operators.

Inference Engine and Non-monotonic Extension
The inference engine starts with the activation of IF-THEN rules by input data. This data is used to evaluate antecedents of rules, activating a subset whose evaluation returns true. This evaluation is based on the numerical ranges provided by the knowledge base designer. For instance, in rule C4, high comments means that a comments input value between [0.75, 1] ⊂ R provided by a user will activate the rule. A rule can also be contradicted by other rules that intend to bring forward and support contradictory information. Formally, these can also be seen as meta-rules [48, chap. 3], or rules that describe how other rules should be used, as in the contradictions in the running example. These are activated in the same fashion as IF-THEN rules. The main difference comes from the fact that the consequent of these meta-rules, or contradictions, might impact other meta-rules or other IF-THEN rules, while the consequent of IF-THEN rules is being employed only for the inference of computational trust. If both an IF-THEN rule and at least one contradiction challenging the rule have been activated, then the inference engine discards the rule. This mechanism will eventually form a set of surviving rules. The running example depicts some input values, two activated rules and one surviving rule.
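The activation-and-retraction mechanism just described can be sketched as follows. The rule names echo the running example, but the features, ranges and contradiction are illustrative assumptions.

```python
# Minimal sketch of the non-monotonic step: activate rules and contradictions
# against the input, then discard any activated rule that an activated
# contradiction challenges; the remainder is the set of surviving rules.
def in_range(value, lo, hi):
    return lo <= value <= hi

def surviving_rules(rules, contradictions, inputs):
    active = {name for name, (feat, lo, hi, _) in rules.items()
              if in_range(inputs[feat], lo, hi)}
    for feat, lo, hi, target in contradictions:
        if in_range(inputs[feat], lo, hi):
            active.discard(target)   # the challenged rule is retracted
    return active

# rule: (feature, activation range lower/upper, consequent level)
rules = {
    "C4": ("comments", 0.75, 1.0, "high trust"),
    "A1": ("activity", 0.75, 1.0, "high trust"),
}
# contradiction (meta-rule): if frequency is low, discard rule A1
contradictions = [("frequency", 0.0, 0.25, "A1")]

result = surviving_rules(rules, contradictions,
                         {"comments": 0.8, "activity": 0.9, "frequency": 0.1})
```

Here both C4 and A1 are activated, but the low frequency input activates the contradiction against A1, leaving C4 as the only surviving rule.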
In this research article, the rules in the set of surviving rules will always infer the same consequent, but most likely at different levels. Since the goal is to aggregate them and to extract a unique scalar that is most representative of the consequents being inferred, an aggregation strategy is needed. In this situation, a usual expert system would have a typical set of choices for the selection of rules [92,48], for example, assigning a priority to each rule, returning multiple outcomes or choosing the first rule activated. However, none of these strategies is applicable in this research study. The constructed knowledge bases do not provide explicit preferences among rules, an order of activation or the possibility of computing more than one output. Because of this, IF-THEN rules must be quantified and aggregated to infer a single scalar. In the present study, this value is defined according to the numerical range of the consequent of the rule, the numerical range of its premises and the input values provided for the rule activation. In the basic scenario of an IF-THEN rule with only one premise, it is quantified as the minimum (respectively maximum) value of the numerical range of its consequent if its premise is activated with its minimum (respectively maximum) value. For instance, consider rule C4 rewritten with illustrative numerical ranges: IF comments in [0.75, 1] THEN trust in [0.75, 1]. In this case, if the input value for comments is 0.75, then the value of C4 will be 0.75. Analogously, if the input value for comments is 1, then the value of C4 will be 1. Activation values greater than 0.75 and less than 1 are evaluated according to a linear relationship. This is defined by a function f as proposed in [119]:

Definition 2 (Generic rule value). The value of a generic IF-THEN rule r is given by the function f(r) = l_c + (u_c - l_c) · v, where v ∈ [0, 1] is the activation of the antecedent, obtained by normalising each premise as (i_n - l_n)/(u_n - l_n) and replacing each AND (respectively OR) operator with the min (respectively max) operator.

The value of a rule will always lie within the numerical range [l_c, u_c] of its consequent.
Moreover, the boundaries in this range will define the type of relationship between premises and consequent: • If l c < u c , then Definition 2 will model a linear relationship. The higher the value of the premise/s, the higher the value of the conclusion.
• If l c > u c , then Definition 2 will model a contrary linear relationship. The higher the value of the premise/s, the lower the value of the conclusion.
• If l c = u c , then Definition 2 will model a constant function whose every input results in the same output (u c ). This might be useful to model consequents with categorical levels.
Briefly, Definition 2 provides an evaluation formula for rules that employ the logical operators AND/OR, replacing them with the min and max operators. Different operators could have been employed if required by the domain of application or the human reasoner. Moreover, if there is no reason to select one operator over another, they could also be a parameter when designing expert system models. However, bearing in mind the rules contained in the knowledge bases employed in this study, the adoption of other operators would likely not have a significant impact on the results. The antecedents of these rules are often formed by a single premise, thus, their evaluation would follow a simple linear relationship regardless of the aggregation strategy adopted for multiple premises.
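A sketch of Definition 2 follows: a linear interpolation onto the consequent range, with AND mapped to min and OR to max over the normalised premise activations (the conventional mapping).

```python
# Sketch of Definition 2: the value of a surviving rule is a linear map of
# its antecedent's activation onto the consequent range [l_c, u_c].
def premise_activation(i, l, u):
    """Normalised position of input i inside the premise range [l, u]."""
    return (i - l) / (u - l)

def rule_value(premises, l_c, u_c, op=min):
    """premises: list of (input, l_n, u_n) triples.
    op: min for AND, max for OR (per Definition 2)."""
    activation = op(premise_activation(i, l, u) for (i, l, u) in premises)
    return l_c + (u_c - l_c) * activation
```

With the illustrative C4 ranges, an input of 0.75 for comments yields a rule value of 0.75 and an input of 1 yields 1; setting l_c > u_c produces the contrary linear relationship, and l_c = u_c the constant one, as in the three cases above.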
Finally, four heuristics defined in previous works [123] are employed for the aggregation of the values assigned to surviving IF-THEN rules. The strategies are defined in order to extract different points of view from the remaining rules and to accommodate the use of rule weights when weights are provided. Weights among rules are provided in the employed knowledge bases. They are the result of a pairwise comparison procedure between the 9 employed features (Table A.6, p. 55) performed by the knowledge base designer. Hence, they will be numbers in the range [0, 8] ⊂ N, with 0 if a feature is considered less important than any other feature for the inference of computational trust, and 8 if it is considered more important than any other feature. The weight of a feature also represents the weight of the rule employing this feature. The aim is to investigate the impact of adding this extra information on the inferential capacity of the expert system models. Thus, the heuristics are:

h1: definition of the sets of surviving rules grouped by their consequent level; extraction of the largest set; average of the values of the rules in the largest set. In case two or more largest sets exist, the above process is repeated for each, and their average is returned. The idea is to give importance to the largest set of surviving rules supporting the same consequent level.
h2: same as h1 but applying the weighted average instead of the average. The goal here is to allow the possibility of defining weights for specific rules.

h3: average value of all surviving IF-THEN rules. This is to give equal importance to all surviving IF-THEN rules, regardless of which level of the consequent they were supporting.

h4: same as h3 but applying the weighted average instead of the average. Again, the goal is to allow the use of weights attributed to specific rules.
The running example depicts an illustrative output for the four heuristics in the last step of the inference engine.
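The four heuristics can be sketched as follows, with each surviving rule represented as an illustrative (consequent level, value, weight) triple.

```python
from collections import defaultdict

def h1(rules):
    """Average of the largest set of rules sharing a consequent level
    (averaging across ties between equally large sets)."""
    by_level = defaultdict(list)
    for level, value, _ in rules:
        by_level[level].append(value)
    largest = max(len(v) for v in by_level.values())
    tied = [v for v in by_level.values() if len(v) == largest]
    return sum(sum(v) / len(v) for v in tied) / len(tied)

def h2(rules):
    """Same as h1 but using the weighted average within each set."""
    by_level = defaultdict(list)
    for level, value, w in rules:
        by_level[level].append((value, w))
    largest = max(len(v) for v in by_level.values())
    tied = [v for v in by_level.values() if len(v) == largest]
    wavg = lambda pairs: sum(v * w for v, w in pairs) / sum(w for _, w in pairs)
    return sum(wavg(p) for p in tied) / len(tied)

def h3(rules):
    """Plain average over all surviving rules."""
    return sum(v for _, v, _ in rules) / len(rules)

def h4(rules):
    """Weighted average over all surviving rules."""
    return sum(v * w for _, v, w in rules) / sum(w for _, _, w in rules)

# Illustrative surviving rules: two support 'high', one 'medium high'.
rules = [("high", 0.9, 4), ("high", 0.8, 2), ("medium high", 0.6, 8)]
```

On this example, h1 averages only the two 'high' rules, while h3 and h4 pull the scalar towards 'medium high' because all rules (h3) or the heavier weight (h4) are taken into account.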

Design of Non-monotonic Reasoning Models Employing Fuzzy Reasoning
Fuzzy reasoning provides a robust representation of linguistic information by using fuzzy membership functions. In this research article, the structure of a Mamdani fuzzy control system and the use of possibility theory, as defined in [129, chap. 8] and reviewed in Section 2.3, are employed for the definition of fuzzy reasoning models of inference. As with expert systems, a running example of a single inference is depicted in Fig. 7 and referred to throughout this subsection.

Fuzzification Module
The first step, the fuzzification module, starts with the definition of fuzzy IF-THEN rules and fuzzy contradictions. Their structure is the same as that presented for expert systems, but they are computed in a different fashion. Each linguistic term associated with a feature level or consequent level, such as high or low, is then described by an FMF, which is also provided by the knowledge base designer. In the running example, some of the FMF employed are depicted in the fuzzification module.

Inference Engine and a Non-monotonic Extension
Once the fuzzification step has been completed and the knowledge base of the expert translated into fuzzy IF-THEN rules and fuzzy contradictions, the next step is to evaluate the initial truth values of the fuzzy IF-THEN rules. To do so, each membership grade in the antecedent of these rules needs to be evaluated according to the input data. For instance, consider rule U2 in the running example. If pages = 17, then the membership grade of the linguistic term medium high is 0.75, according to the FMF for pages. In this case, the initial truth value of U2, before solving contradictions, is also 0.75. U2 is a simplified example, but more than one feature can be contained in each rule's antecedent. Hence, a t-conorm and a t-norm are necessary to implement the notions of union and intersection. In the literature of fuzzy logic, t-norms (fuzzy intersection) and t-conorms (fuzzy union) [70] are functions F : [0, 1]^2 → [0, 1] satisfying the axioms of commutativity, associativity, monotonicity, and a boundary condition. Most commonly, the Zadeh, the Product and the Lukasiewicz operators are employed, and these are the ones selected for investigation in this research study, as listed in Table 3. Other operators can be seen in [70].

Table 3: T-norms and t-conorms usually employed by fuzzy systems. x, y ∈ [0, 1].

Fuzzy operator | T-Norm | T-Conorm
Zadeh | min(x, y) | max(x, y)
Product | x · y | x + y - x · y
Lukasiewicz | max(0, x + y - 1) | min(1, x + y)

Following the calculation of the fuzzy IF-THEN rules' initial truth values, possibility theory is adopted for the resolution of contradictions. According to the approach proposed in [129], truth values are used to represent possibility (Pos) and necessity (Nec) as defined in Section 2.3.1. In this study, necessity is represented by the membership grade of a proposition and possibility is always 1 for all propositions. This means that all propositions (or rules) are open to being retracted. Since there is no addition of supporting information, but only attempts to contradict or refute information, it is possible to employ Equation (1) (p. 7) to deal with the contradictions in the knowledge bases of this study. For instance, the new necessity of fuzzy rule U2, if it is contradicted only by the fuzzy contradiction CC14, is given by:

Nec(U2) = min(Nec(medium high pages), 1 - Nec(anonymous))

where Nec(medium high pages) is the membership grade of the linguistic variable medium high pages. Three situations might arise in this case:

• If Nec(anonymous) = 0, then CC14 has no impact on the necessity of U2.
• If 0 < Nec(anonymous) < 1, then U2 can only maintain the same necessity or have it decreased to a value greater than 0, indicating a partial refutation.

• If Nec(anonymous) = 1, then Nec(U2) = 0, indicating a full refutation of U2.
The new necessity of the fuzzy rule U2 represents the truth value of medium high trust in this rule. However, it is important to highlight that the approach developed in [129] has been inspired by a multi-step forward-chaining reasoning system. In this research study, reasoning is done in a single step, in the sense that data is imported, and all rules are fired at once. Nonetheless, it is possible to define a precedence order of fuzzy contradictions. More precisely, it is possible to define a tree structure in which the consequent of a fuzzy contradiction is the antecedent of the next fuzzy contradiction. In this way, Equation (1) can be applied from the root or roots to the leaves. In case of cyclic contradictions, or contradictions whose consequents impact each other's premises, they are solved simultaneously. For that, the truth value of all fuzzy rules is stored before solving any cyclic fuzzy contradictions. In turn, the final truth value of fuzzy rules is calculated according to these stored values.
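A sketch of this retraction step follows, assuming a flat list of attacks. For simplicity it resolves every contradiction against a frozen snapshot of the pre-resolution truth values, which is the mechanism the text reserves for cyclic contradictions; the root-to-leaf tree ordering is not modelled here.

```python
# Possibility-theoretic retraction: a contradiction with necessity Nec(c)
# caps the necessity of the rule it attacks, following
#   Nec(rule) = min(Nec(rule), 1 - Nec(c)).
def resolve(necessities, attacks):
    """necessities: {rule: Nec in [0, 1]}; attacks: list of (attacker, target)."""
    snapshot = dict(necessities)          # freeze values before resolution
    updated = dict(necessities)
    for attacker, target in attacks:
        updated[target] = min(updated[target], 1.0 - snapshot[attacker])
    return updated

# Illustrative values for the running example: CC14 partially refutes U2.
nec = {"U2": 0.75, "CC14": 0.4}
out = resolve(nec, [("CC14", "U2")])
```

With these illustrative values, the necessity of U2 drops from 0.75 to min(0.75, 1 - 0.4) = 0.6, a partial refutation.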
Having the fuzzy contradictions solved by the proposed mechanism, rule weights (if defined) can be applied to the current truth values of fuzzy IF-THEN rules. In this study, the approach proposed in [65] is selected. In this case, rule weights are normalised in the range [0, 1] ⊂ R and multiplied by the current truth value of each rule. In this example, it is also assumed that the weight of a feature represents the weight of the rule that contains this feature, as implemented by expert system models.
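The antecedent evaluation (Table 3 operators) and the weighting step can be sketched together as below. The operator definitions are the standard ones; the example grades and weight are illustrative assumptions.

```python
# The three operator pairs investigated: each t-norm implements fuzzy AND
# and each t-conorm fuzzy OR on truth values in [0, 1].
zadeh_tnorm = min
zadeh_tconorm = max
product_tnorm = lambda x, y: x * y
product_tconorm = lambda x, y: x + y - x * y
lukasiewicz_tnorm = lambda x, y: max(0.0, x + y - 1.0)
lukasiewicz_tconorm = lambda x, y: min(1.0, x + y)

def weighted_truth(truth, weight, max_weight=8):
    """Normalise a pairwise-comparison weight into [0, 1] and multiply it
    by the rule's current truth value, following the approach of [65]."""
    return truth * (weight / max_weight)

# A two-premise antecedent (AND) evaluated with the Zadeh t-norm:
grades = (0.75, 0.4)
truth = zadeh_tnorm(*grades)        # 0.4, the weaker membership grade
weighted = weighted_truth(truth, 4) # weight 4 of 8 halves the truth value
```

Swapping the operator pair changes only the antecedent evaluation; the weighting step is identical across the three fuzzy model variants.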
Eventually, a disjunctive approach is employed for computing the truth values of the consequent levels. Hence, each consequent level is given by the maximum of the truth values of the fuzzy IF-THEN rules that infer the same consequent level. If a conjunctive approach were selected (using the minimum value instead), the set of rules would have to be jointly satisfied, representing a stricter proposal. Here, the disjunctive approach is selected for being a more flexible proposal that guarantees that at least one rule is satisfied. The membership grade of updated fuzzy IF-THEN rules will then propagate to their consequents, producing a set of truncated membership functions associated with their consequents. The inference engine in the running example depicts the truth values of fuzzy rules, their updated values after solving contradictions, and, finally, after applying rule weights. This is followed by the disjunctive aggregation of trust levels and the definition of the respective graphical representation.
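The disjunctive aggregation step amounts to a per-level maximum, sketched below with illustrative truth values.

```python
# Disjunctive aggregation: the truth value of each consequent level is the
# maximum over the (weighted) truth values of the rules inferring that level.
def aggregate_levels(rule_truths):
    """rule_truths: list of (consequent_level, truth_value) pairs."""
    levels = {}
    for level, t in rule_truths:
        levels[level] = max(levels.get(level, 0.0), t)
    return levels

levels = aggregate_levels([("high", 0.6), ("high", 0.75), ("medium high", 0.3)])
```

Each resulting level truth value then truncates the corresponding consequent FMF before defuzzification; a conjunctive variant would simply replace max with min.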

Defuzzification Module
The output of the inference engine is a graphical representation of the aggregation of these truncated membership functions. Several methods can be used for calculating a single defuzzified scalar from this graphical representation [60]. Two are commonly employed and selected here: mean of max and centroid. The first returns the average of all elements (consequent levels) with maximal membership grade. The second returns the coordinates (x, y) of the centre of gravity of the graphical representation. The defuzzified scalar is then represented by the x coordinate of the centroid, as per the defuzzification module in the running example.
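Both methods can be sketched over a discretised aggregate membership curve; the sample curve below is illustrative, not taken from the running example.

```python
def mean_of_max(xs, mus):
    """Average of the x positions attaining the maximal membership grade."""
    peak = max(mus)
    winners = [x for x, m in zip(xs, mus) if m == peak]
    return sum(winners) / len(winners)

def centroid(xs, mus):
    """x coordinate of the centre of gravity of the aggregate curve."""
    total = sum(mus)
    if total == 0:
        return 0.0
    return sum(x * m for x, m in zip(xs, mus)) / total

# Illustrative discretised aggregate membership curve over [0, 1].
xs  = [0.0, 0.25, 0.5, 0.75, 1.0]
mus = [0.1, 0.1, 0.4, 0.6, 0.6]
```

On this curve the two methods disagree: mean of max looks only at the plateau of maximal grade, while the centroid is pulled down by the low-trust mass, which is precisely why the choice of defuzzification method is treated as a model parameter.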

Design of Non-Monotonic Reasoning Models Employing Defeasible Argumentation
The definition of argument-based models follows the five-layer modelling approach proposed in [80] and depicted in Fig. 3 (p. 8). It starts with the definition of the internal structure of arguments, continues with the definition of their conflicts and the computation of their acceptance status, and ends with the aggregation of accepted arguments. A running example is depicted in Fig. 8 and referred to throughout this subsection.

Layer 1: Definition of the Internal Structure of Arguments
Most commonly, an argument is composed of one or more premises and a conclusion derivable by applying an inference rule. Hence, the first step of an argumentation process is to define its forecast arguments:

Forecast argument: premises → conclusion

This structure includes a set of premises (believed to influence the conclusion being inferred) and a conclusion derivable by applying the inference rule →. This is an uncertain implication, used to represent a defeasible argument. As with the rules of expert systems, premises and conclusions are strictly bounded in numerical ranges associated with natural language terms (for instance, low and high). Forecast arguments are activated if their premises evaluate as true, according to the input data provided. The boolean logical operators AND and OR can be applied to combine multiple premises, similarly to the rules employed by expert system models.
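
Under the assumption that premises are numerical ranges combined with AND/OR, as described above, a forecast argument could be sketched as follows; the argument name, features, and ranges are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ForecastArgument:
    """premises -> conclusion: a defeasible (uncertain) implication."""
    name: str
    premises: list   # each premise: (feature, lower bound, upper bound)
    operator: str    # "AND" or "OR" over the premises
    conclusion: str  # e.g. a trust level such as "high"

    def activated(self, data):
        """True when the premises hold for the given input data."""
        tests = [lo <= data[feat] <= hi for feat, lo, hi in self.premises]
        return all(tests) if self.operator == "AND" else any(tests)

# Hypothetical argument: feature names and ranges are illustrative only.
af1 = ForecastArgument("AF1", [("edits", 100, 10_000), ("reverts", 0, 5)],
                       "AND", "trust high")
```

For instance, `af1.activated({"edits": 250, "reverts": 2})` holds, while an editor with nine reverts would leave the argument inactive.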

Layer 2: Definition of the Interactions between Arguments
In order to evaluate inconsistencies, the notion of mitigating argument [90] is introduced. These are arguments that attack forecast arguments or other mitigating arguments. Both forecast and mitigating arguments are special defeasible rules, as defined in [110]: informally, if their premises hold, then presumably (defeasibly) their conclusions also hold. Different types of attacks, and consequently of mitigating arguments, exist in the literature, as reviewed in Section 2.4 (p. 7). In this research article, an undermining attack is represented by a forecast argument and an inference ⇒ to a negated argument B (forecast or mitigating):

Undermining attack: forecast argument ⇒ ¬B

Note that these are undermining attacks because the conflict arises from the premises of each forecast argument. For instance, in the running example, B3 attacks AF3 by stating that the activity factor should not be high, since bytes is low. An undercutting attack would be defined if, for some reason, the inference performed by an argument was being contested; an example is the contradiction "OnlyAge" described in Table A.9. Here, an undercutting attack is represented by a set of premises and an inference to a negated argument B (forecast or mitigating):

Undercutting attack: premises ⇒ ¬B
Lastly, a rebuttal attack would be created if, for some reason, the conclusion of an argument was believed to be false. For instance, some domain expert could define an attack targeted at C4 by saying that there is evidence to infer another level of trust instead of high. Note that, in the context of computational trust, different consequents (trust levels) might or might not coexist according to the expert's reasoning; hence, not all arguments with different conclusions should necessarily lead to rebuttal attacks. Moreover, since all arguments in the constructed knowledge bases are defeasible (not strict), rebuttals are mutual (both arguments attack each other). Rebuttal attacks (⇔) occur in this research article when two forecast arguments support mutually exclusive conclusions according to a domain expert, and are represented as:

Rebuttal attack: forecast argument ⇔ forecast argument

Let us point out that different types of attacks can enhance the explainability of argument-based models and aid the process of creating knowledge bases. However, they do not impact the computation of the acceptability status of arguments nor the final numerical scalar produced by such models in the next layers. This computation is performed via abstract argumentation theory, as proposed in [46]: all attacks are seen as a binary relation, as described in Section 2.4.
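
Since all attack types are ultimately evaluated as a single binary relation, a minimal sketch of this abstraction might look as follows. B3 attacking AF3 mirrors the running example; the remaining argument names are hypothetical.

```python
# All three attack types collapse, for the next layers, to a binary
# relation attacker -> target (abstract argumentation, as in [46]).
attacks = {
    ("B3", "AF3"),                   # undermining: B3 contests a premise of AF3
    ("U1", "AF2"),                   # undercutting: U1 contests AF2's inference
    ("AF4", "AF5"), ("AF5", "AF4"),  # rebuttal: mutual, since both are defeasible
}

# Index attackers by target, a convenient shape for acceptability computations.
attackers_of = {}
for src, tgt in attacks:
    attackers_of.setdefault(tgt, set()).add(src)
```

Once flattened this way, the distinction between attack types no longer influences the computation, exactly as stated above; it remains useful only for explainability.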

Layer 3: Evaluation of the Conflicts of Arguments
At this stage, forecast and mitigating arguments can be seen as an argumentation framework, which can be elicited with data. Arguments will then be activated or discarded, based on whether their premises evaluate as true or false.
Attacks between activated arguments are then evaluated before being activated as well. These usually take the form of a binary relation, in which a successful (activated) attack occurs whenever both its source (the attacking argument) and its target (the argument being attacked) are activated. This study also makes use of the notion of strength of arguments, as presented in [108]. In this case, an attack is considered successful only if the strength of its source is equal to, or greater than, the strength of its target. In the running example, feature weights are employed for defining the strength of arguments. As with the definition of rule weights in expert systems and fuzzy reasoning, the weight of a feature also represents the strength of the argument employing that feature. The running example depicts two sub-AFs built from activated arguments and attacks, with and without strength of arguments. These sub-AFs are argumentation frameworks contained in the original ones created in Layer 2, but considering only activated arguments and attacks.
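
A minimal sketch of this layer, under the binary-relation reading of attacks and the strength condition described above (argument names, activations, and strengths are hypothetical):

```python
def sub_af(arguments, attacks, activated, strength=None):
    """Restrict an AF to its activated arguments.

    Without strengths, an attack succeeds when both source and target are
    activated; with strengths, the source must also be at least as strong
    as its target.
    """
    args = {a for a in arguments if activated[a]}
    atts = set()
    for src, tgt in attacks:
        if src in args and tgt in args:
            if strength is None or strength[src] >= strength[tgt]:
                atts.add((src, tgt))
    return args, atts

# Binary relation only: B3's attack on AF3 succeeds.
args, atts = sub_af(["AF3", "B3"], {("B3", "AF3")}, {"AF3": True, "B3": True})
# With strengths, B3 (0.3) is weaker than AF3 (0.7): the attack fails.
args_s, atts_s = sub_af(["AF3", "B3"], {("B3", "AF3")}, {"AF3": True, "B3": True},
                        strength={"AF3": 0.7, "B3": 0.3})
```

This mirrors how the same activated arguments can yield two different sub-AFs, with and without strengths, as in the running example.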

Layer 4: Definition of the Acceptance Status of Arguments
Given a sub-AF, all its arguments are considered abstract as proposed in [46]. In turn, acceptability semantics [46,139,26,20,18] are applied to compute the acceptance status of each argument, or its acceptability. As previously defined, acceptability semantics evaluate the overall interaction of arguments across the AF (or sub-AF in this argumentation process), in order to select the arguments that should ultimately be accepted. In the running example, two well-known extension-based semantics, preferred and grounded [46], and one ranking-based semantics, categoriser [18], are illustrated. For the sake of simplicity, their formal definitions are presented in Appendix D.
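
As one concrete instance of an acceptability semantics, the grounded extension can be computed as the least fixed point of the characteristic function. This is a generic sketch of Dung's grounded semantics [46], not the article's implementation; the three-argument framework is illustrative.

```python
def grounded_extension(args, attacks):
    """Least fixed point of the characteristic function F (Dung, [46])."""
    attackers = {a: {s for s, t in attacks if t == a} for a in args}

    def defended(a, inside):
        # a is defended if every attacker of a is itself attacked by inside
        return all(any((d, b) in attacks for d in inside) for b in attackers[a])

    ext = set()
    while True:
        new = {a for a in args if defended(a, ext)}
        if new == ext:
            return ext
        ext = new

# Chain A -> B -> C: A is unattacked, and A defends C against B.
ext = grounded_extension({"A", "B", "C"}, {("A", "B"), ("B", "C")})
```

Here the iteration first accepts the unattacked A, then C (defended by A), and stops: the grounded extension is the most sceptical, uniquely defined outcome.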

Layer 5: Accrual of Acceptable Arguments
Eventually, in the last step of the reasoning process, a final inference must be produced. In the case of extension-based semantics, if multiple extensions are computed, one might be preferred over the others. In this study, the cardinality of an extension (number of accepted arguments) is used as a mechanism for the quantification of its credibility. Intuitively, a larger extension of arguments, which by definition is also conflict-free, might be seen as more credible than smaller extensions. If the computed extensions all have the same cardinality, they are all brought forward in the reasoning process. After the selection of the larger extension/s or best-ranked argument/s, a single scalar is produced through the accrual of the values assigned to its/their forecast arguments (arguments that infer some trust level). Mitigating arguments have already completed their role by contributing to the resolution of conflicting information (previous layer) and thus are not considered in this layer. The values of forecast arguments follow from the same formula described in Definition 2. An assumption is made here that forecast arguments have a similar structure to the IF-THEN rules defined for expert systems: premises are associated with numerical ranges and concatenated by the boolean operators AND and OR, and the conclusion has a numerical range, as do the consequents of the IF-THEN rules. Having their values assigned, the accrual of forecast arguments can be made in different ways, for instance considering measures of central tendency. Here, the average is used for models that employ a binary relation of attacks, while the weighted average is used for models that employ the notion of strength of arguments. Note that, in the case of two or more preferred extensions with the same number of accepted forecast arguments, the outcome of the preferred semantics is the mean of all its extensions.
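
The accrual described above might be sketched as follows, assuming extensions are given as sets of argument names and that forecast arguments carry the values of Definition 2; all names and values are illustrative.

```python
def accrue(extensions, is_forecast, value, weight=None):
    """Accrual of acceptable arguments into a single trust scalar.

    Keeps only the extension(s) of maximal cardinality, averages the values
    of their forecast arguments (weighted if strengths are used), and takes
    the mean over tied extensions. Returns None when no forecast argument
    survives (an NA case).
    """
    best = max(len(e) for e in extensions)
    outputs = []
    for ext in extensions:
        if len(ext) != best:
            continue
        forecast = [a for a in ext if is_forecast[a]]
        if not forecast:
            continue
        if weight is None:
            outputs.append(sum(value[a] for a in forecast) / len(forecast))
        else:
            outputs.append(sum(value[a] * weight[a] for a in forecast)
                           / sum(weight[a] for a in forecast))
    return sum(outputs) / len(outputs) if outputs else None

# The larger extension wins; M1 is mitigating and contributes no value.
score = accrue([["AF1", "AF2", "M1"], ["AF4"]],
               is_forecast={"AF1": True, "AF2": True, "M1": False, "AF4": True},
               value={"AF1": 0.8, "AF2": 0.6, "AF4": 0.2})
```

In the sample call, the three-argument extension is selected and the two forecast values are averaged; a tie in cardinality would instead average across extensions.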

Summary of Models and Comparative Metrics
The list of models built with the different reasoning approaches can be seen in Appendix B. Overall, 68 models were implemented, with different configuration parameters selected for evaluation, as described in the previous sections. To compare the inferences produced by them, three evaluation metrics for computational trust are employed: rank of Barnstars, spread, and percentage of NAs (not assigned). Table 4 lists the calculation method of each metric.

Table 4: Calculation of evaluation metrics employed to assess the trust inferences performed by non-monotonic reasoning models in the Wikipedia Project.

Rank of Barnstars: Remove editors with no trust value assigned from the dataset. Sort all other editors by their trust values in descending order; non-Barnstars tied with Barnstars are ranked above them. Sum the ranks of the Barnstar editors and normalise the result into the range [0, 100] ⊂ R: 0 means all Barnstars with an assigned trust value are ranked above any non-Barnstar, while 100 means they are all ranked below any non-Barnstar.

Spread: Standard deviation of the trust values assigned to Barnstars.

Percentage of NAs: Percentage of editors that had no assigned trust value.
The rationale behind the Rank of Barnstars metric is that, when sorting editors in descending order by their assigned trust values, the rankings produced by the best models should place Barnstar editors at the highest positions. Non-Barnstar editors may also be highly trustworthy; nonetheless, Barnstar editors should, presumably, still be ranked highest. Moreover, since trust is not a binary concept, it is expected that the distribution of the trust values assigned by these same models to Barnstar editors has a positive, continuous spread, measured by the standard deviation of the values assigned to Barnstar editors. Finally, models capable of producing a final inference in more cases are considered better; thus, a higher percentage of NAs, or cases without an assigned inference, is deemed a disadvantage. Note that no metric was defined for computing an overall difference between the trust values assigned to Barnstar and to non-Barnstar editors, due to the lack of knowledge about the non-Barnstar editors.
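
A possible implementation of the Rank of Barnstars metric, following the calculation in Table 4, is sketched below. The normalisation to [0, 100] via the best and worst possible rank sums is an assumption based on the stated endpoints, and the editor data are fabricated for illustration.

```python
def rank_of_barnstars(trust, barnstar):
    """Normalised sum of Barnstar ranks in [0, 100] (0 = all Barnstars on top).

    trust: editor -> scalar or None (None = NA, removed before ranking);
    barnstar: editor -> bool.
    """
    scored = [(e, t) for e, t in trust.items() if t is not None]
    # Descending by trust; on ties, non-Barnstars (False -> 0) come first,
    # i.e. are ranked above, as stated in Table 4.
    scored.sort(key=lambda et: (-et[1], barnstar[et[0]]))
    ranks = [i + 1 for i, (e, _) in enumerate(scored) if barnstar[e]]
    n, b = len(scored), len(ranks)
    s_min = b * (b + 1) / 2                  # all Barnstars ranked on top
    s_max = sum(range(n - b + 1, n + 1))     # all Barnstars at the bottom
    return 100 * (sum(ranks) - s_min) / (s_max - s_min)

# Perfect rank: the only Barnstar sits above every non-Barnstar.
perfect = rank_of_barnstars({"e1": 0.9, "e2": 0.5}, {"e1": True, "e2": False})
```

The pessimistic tie-breaking means a Barnstar tied with a non-Barnstar is never credited with the higher position.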

Results and Discussion
This section presents the results of the instantiation of the designed non-monotonic reasoning models using two sets of data, each built from a Wikipedia XML dump: first from the Italian-language and then from the Portuguese-language edition. Defeasible argumentation and expert system models were implemented through an online framework using entirely custom code in PHP and JavaScript [120]. Differently, fuzzy reasoning models were implemented using the C++ programming language. The presentation of results is structured around each evaluation metric, a summary, and a final discussion.

Rank of Barnstars
This metric is targeted at evaluating whether the investigated non-monotonic reasoning models are capable of ranking the Barnstar editors at the highest positions. Figs. 9 and 10 depict the normalised sum of Barnstar ranks, in the range [0, 100] ⊂ R, for the models built with knowledge base 1 (KB1) and knowledge base 2 (KB2) respectively. Each figure first depicts the results when models are instantiated with the Italian-language dataset and then with the Portuguese-language dataset. Instances without an assigned trust scalar were removed from this analysis. A baseline was computed as the average of the normalised features reported by each editor, also resulting in inferences in the range [0, 1] ⊂ R. This baseline was defined only to indicate whether the non-monotonic reasoning process performed by the implemented models was effective in comparison to a non-deductive and simplified inference.
From both figures, it is possible to observe that the computed ranks were effective, ranging from 0 (perfect rank) to 14.05 across all models and input datasets. This suggests that the non-monotonic reasoning approaches were all capable of capturing, to some degree, the notions of the ill-defined construct of trust, even when presenting worse values than the features' average. Moreover, the use of different datasets does not seem to have a significant impact on the produced rank of Barnstars, highlighting the stability of the models when using different sets of data. However, some differences can still be noted among the models built with different knowledge bases. Different results were expected, especially due to the contrasting topologies of these two knowledge bases; Figs. A.15 and A.16 (p. 58 and p. 60) depict their graphical representations. In particular, while certain parameters appear to be effective for models built with one knowledge base, the same parameters can be less effective when the other knowledge base is applied instead. For instance:

1. Expert system models provided ranks between 0.28 and 4.51 when built with KB1 and perfect ranks when built with KB2. The variance when employing KB1 was observed due to the filtering of surviving rules proposed by heuristics h1 and h2, which seems to diminish the quality of the ranks produced (E1, E2 × E3, E4). Noticeably, this does not seem to affect the models built with the same heuristics but using KB2;

2. Fuzzy reasoning models presented higher variance when instantiated with KB1 (ranks between 0.25 and 14.05) than with KB2 (ranks between 1.27 and 2.47). The use of linear or Gaussian FMF neither contributed to this variance nor had a significant impact on the quality of the ranks produced. Instead, when built with KB1, fuzzy models using the centroid defuzzification approach seem to be much more effective than those using the mean of max defuzzification approach. When built with KB2, no particular parameter seems to provide significantly better ranks, possibly due to the difficulty of these models in dealing with the higher number of contradictions contained in this knowledge base;

3. Argument-based models demonstrated great stability when instantiated with KB1 (ranks between 0.28 and 0.5), but higher variance when instantiated with KB2 (ranks between 0 and 13.35). This shows that the acceptability semantics and the use (or not) of strength of arguments did not impact results when using KB1. Contrarily, for the argument-based models built with KB2, the use of the grounded and preferred (⊗) semantics resulted in better ranks. Lastly, models built with KB2 and no strength of arguments also performed better than their counterparts, suggesting that the strengths of arguments defined by the author did not improve the rank of Barnstar editors.
In summary, the first knowledge base presents a simple topology and seems to work more effectively with argument-based models, a subset of expert system models, and a subset of fuzzy reasoning models. Contrarily, the second knowledge base is built with many more attacks, resulting in a more complex topology. Despite such complexity, certain models did achieve a perfect rank of Barnstar editors. However, as evaluated in the next subsection, a higher topological complexity might also hamper the capacity of these same models to produce inferences. Thus, the next subsection evaluates the percentage of NAs, that is, the percentage of instances without a trust value assigned by each model. It further investigates a probable trade-off between these two knowledge bases: while one is more simplified and allows inferences to be produced for all instances, the other is more complex and precise, but prevents models from reaching a conclusion in a number of cases.

Percentage of NAs
The capacity to assign trust values under conflicting information is assumed to be a favourable property and is thus evaluated through the percentage of NAs. It is known that certain acceptability semantics employed by argument-based models, such as the categoriser, will always return a non-empty extension. In addition, the preferred semantics is less likely to return an empty extension than the grounded semantics; in fact, due to the topology of the knowledge bases employed in this research article, the preferred semantics always returns a non-empty extension. However, the models implemented with other semantics and other reasoning approaches might not reach a final inference in certain cases. Hence, it is important to evaluate the extent to which this can impact the quality of the designed reasoning models. Fig. 11 depicts the percentage of NAs for models instantiated with KB2; due to the simplified topology of KB1, no NAs were reported for any reasoning model built with it. Similarly to the previous evaluation metric, the instantiation of models with different datasets did not result in different trends. As for expert system models, it seems clear that their simplistic conflict resolution strategy might work well when applied to knowledge bases of simplified topology (as per Fig. 10). However, once a higher number of conflicts is present, several instances might not receive an inference. When instantiated with KB2, 51.3% (Italian-language dataset) and 50.43% (Portuguese-language dataset) of NAs were reported for all expert system models, limiting the applicability of the reasoning approach with KB2.
With respect to fuzzy reasoning models, similar percentages of NAs were observed, indicating an equivalent capability of resolving the conflicts provided in KB2. Note, thus, that the use of possibility theory for the resolution of large amounts of conflicts does not appear to be impacted by the other configuration parameters of fuzzy reasoning models in this knowledge base. Finally, in relation to argument-based models, it is possible to observe that A12 and A9, both built with the grounded semantics, presented very different results. While A12 makes use of strength of arguments and presented 0% of NAs, A9 makes no use of strength of arguments and presented 51.3% (Italian-language dataset) and 50.43% (Portuguese-language dataset) of NAs. Therefore, the use of strength of arguments can mitigate the issue of empty extensions when employing the grounded semantics. However, the quality of the inferences is not maintained when using such strengths: A9 had a perfect rank, while A12 had a rank value of 12.13, as depicted in Fig. 10. This is an indication that the strengths defined by the author helped to solve the excessive amount of conflicts in KB2, but did not seem to enhance the rank of Barnstar editors. This is likely due to the way rule weights and strengths of arguments were defined, feature by feature. Weights/strengths could instead have been defined for each rule and argument directly, in a more time-consuming manner requiring more domain knowledge, but likely better capable of quantifying their importance. The only argument-based models that could achieve a strong rank of Barnstars while not reporting NAs were A7 (preferred semantics) and A8 (categoriser semantics), both built with no strength of arguments.
This reinforces the suitability of the preferred semantics and the categoriser semantics (despite the latter's worse rank of Barnstar editors) for the inference of computational trust.

Spread
Another metric selected for evaluating the quality of the inferences produced by the non-monotonic reasoning models was the spread of the trust scalars assigned to Barnstar editors, measured through the standard deviation (σ) of these scalars. As previously mentioned, trust is not a binary concept: if we assume that Barnstar editors are trustworthy, we should also expect them to have different trust levels. Figs. 12 and 13 depict the results for the models built with KB1 and KB2. The baseline instrument is depicted again; it produces trust scalars through the average of the normalised values of the features reported by each editor, and is employed to show whether the implemented non-monotonic reasoning processes could outperform, to some degree, a non-deductive and simplified inference.
As depicted in Fig. 12, most models built with KB1 achieved a σ between 0.07 and 0.18, close to or higher than the result reported by the features' average. Hence, the capacity of most designed models to differentiate trust levels is similar and can be considered appealing. Low σ values were expected due to the required difference between the trust scalars assigned to Barnstar and non-Barnstar editors. In other words, the higher the σ, the higher the chance of overlap between the trust values assigned to Barnstar and non-Barnstar editors and, consequently, the worse the rank of Barnstar editors.
In addition, contrary to the previously analysed evaluation metrics, the models' σ varies greatly when instantiated with the Italian-language and the Portuguese-language datasets. This result indicates that no single reasoning approach is better suited to design inference models able to achieve higher values of σ. An exception was noted for fuzzy reasoning models built with KB1 and employing the mean of max defuzzification approach. These reported the lowest σ (between 0 and 0.07), regardless of the dataset employed and below the features' average. One possibility is that the lower number of inputs required by the mean of max approach (only the maximum membership grades) resulted in inferences of less variability when compared to the centroid defuzzification approach, which considers the whole aggregated fuzzy set given by the models' fuzzification module.

Figure 12: Spread (calculated with the standard deviation) of computational trust scalars inferred for Barnstar users with models built with knowledge base 1 (Appendix A). Inferior symbols represent: the centroid and mean of max defuzzification approaches; heuristics h1 to h4; the grounded, preferred (⊗), and categoriser (‡) semantics; and the use (respectively no use) of rule weights/argument strengths.

As for the results depicted in Fig. 13, note that most models built with KB2 achieved even lower σ (between 0.01 and 0.08), below the features' average. Therefore, their capacity to differentiate trust levels is similar, but lower when compared to the counterpart models built with KB1. No single expert system or fuzzy reasoning model reported a high σ when built with KB2. This is an indication that expert system and fuzzy reasoning models are likely not capable of inferring robust trust scalars when built with knowledge bases containing a large amount of conflicting information; in particular, they are presumably not able to follow the assumption of trust not being a binary concept.

Figure 13: Spread (calculated with the standard deviation) of computational trust scalars inferred for Barnstar users with models built with knowledge base 2 (Appendix A). Inferior symbols represent: the centroid and mean of max defuzzification approaches; heuristics h1 to h4; the grounded, preferred (⊗), and categoriser (‡) semantics; and the use (respectively no use) of rule weights/argument strengths.

A few exceptions were observed for argument-based models. In particular, models A7 and A8 achieved σ of 0.19 and 0.15 (Italian-language dataset), and of 0.22 and 0.08 (Portuguese-language dataset). These seem to be the only models built with KB2 able to achieve a favourable spread while maintaining 0% of NAs and a robust rank of Barnstar editors (between 0.66 and 2.36 for A7, and between 5.3 and 7.14 for A8). This balance illustrates the strong capacity for modelling non-monotonic reasoning of argument-based models defined with the preferred and categoriser semantics when built with KB2, a capacity not observed in any other investigated reasoning approach.

Summary and Discussion
The inferential capacity of the employed non-monotonic reasoning approaches is examined in terms of their ability to produce models whose inferences could be considered valid in the domain of computational trust. Reasons for the inferences of a model being considered invalid include a spread (σ) lower than the baseline (features' average) or a very high percentage of NAs. In terms of rank of Barnstars, all models were considered effective. Table 5 summarises the results.

Table 5: Status of the reasoning approaches with respect to the successful modelling of computational trust by the models implemented with them. In case of mixed results, the number of successful models over all those built with the reasoning approach is reported. Reasons for failing are detailed in parentheses.

                   Expert Systems                   Fuzzy Reasoning   Defeasible Argumentation
Italian / KB1      All                              14/24 (low σ)     All
Italian / KB2      None (high % of NAs and low σ)   None (low σ)      2/6 (high % of NAs and low σ)
Portuguese / KB2   None (high % of NAs)             None (low σ)      2/6 (high % of NAs and low σ)

As can be observed, expert system models presented appealing results when built with KB1. This demonstrates that the reasoning approach can be used effectively when instantiated with knowledge bases with a lower number of conflicts, supporting the vast use of expert systems in an ample range of domains in the literature. It is thus important to highlight strengths such as a clear reasoning process, the capacity to keep the language of the domain, and the capacity to add and retract rules. By contrast, when built with KB2, all expert system models were considered unsuccessful in modelling reasoning applied to computational trust, mostly due to the high percentage of NAs regardless of the dataset employed. These limitations were anticipated, in line with the performed literature review: the lack of a built-in non-monotonicity layer, or of options for implementing one, reinforces the disadvantage of expert systems in dealing with large amounts of uncertain, vague, and contradictory information.
Fuzzy reasoning models presented the most divergent reasoning process when compared to expert system and defeasible argumentation models. While expert systems and defeasible argumentation share similarities, such as the quantification of rules/arguments and their aggregation through measures of central tendency, fuzzy reasoning adopts the notions of fuzzy sets and FMF, thus providing a disparate inferential process. This difference offers advantages such as higher precision in the modelling of natural language terms and the capacity to handle fuzzy concepts. Certainly, such advantages are of great importance when non-monotonic reasoning is being performed. However, this higher precision comes with a number of disadvantages, including the definition of membership functions, which differs from the way in which humans reason. Moreover, in order to manipulate these functions and define an inferential process with them, a number of configuration parameters must be defined: for example, a fuzzy logic operator, a fuzzy inference method, and a defuzzification approach are all needed. Some mathematical reasoning is required to select the most appropriate parameters, limiting the applicability of the reasoning approach by domain experts who are not familiar with fuzzy parameters and their interpretations. In addition, the available options for implementing non-monotonicity are not well developed. In this research article, possibility theory was selected for being, intuitively, the only approach that allowed the retraction of rules in a fuzzy sense: partial retractions were allowed, according to the truth values of the propositions being evaluated. However, this approach was limited by its order of application, which is not commutative.
The requirement of all these configurations, and the variability in the inferences produced, seem to place fuzzy reasoning in between expert systems and defeasible argumentation, in terms of inferential capacity when performing non-monotonic reasoning.
Lastly, defeasible argumentation presented the most robust results, being able to produce successful models regardless of the knowledge base and dataset employed. Similarly to expert systems, all models were considered successful when built with KB1, in addition to producing better ranks of Barnstars. When built with KB2, argument-based models were deemed successful when employing the preferred and categoriser semantics without strength of arguments. This shows that some knowledge for the selection of configuration parameters is also required when domain experts decide to use defeasible argumentation. However, it is argued here that the amount of knowledge required is less than that demanded by fuzzy reasoning. Moreover, the notion of scepticism behind the behaviour of semantics has been widely discussed in the literature of defeasible argumentation, making its adoption by experts in other fields more accessible. Finally, when built with KB2, only argument-based models were able to produce inferences in all cases while reporting an appealing spread and rank of Barnstars.
To sum up, defeasible argumentation proved to be the most balanced reasoning approach, with models capable of maintaining strong results despite the complexity of the knowledge base. Nonetheless, it is important to mention other limitations noted when employing it. For instance, as with the other reasoning approaches, knowledge acquisition was a bottleneck when developing inferential models. This limitation could be observed, for example, in the accrual of arguments performed in the last stage of the reasoning process. The accruals based on the cardinality of extensions and on measures of central tendency, such as the average and weighted average of accepted arguments, are simplistic approaches. They can be seen as a surrogate for conflicts not known, or not established, between arguments in the knowledge base. Ideally, if the goal is to reach a single conclusion, another iteration of knowledge acquisition should be performed, instead of choosing extensions of higher cardinality (when necessary) and averaging accepted arguments with distinct consequents. This would likely improve the percentage of NAs and increase the number of models deemed successful. Still, defeasible argumentation was seemingly the most suitable approach to model non-monotonic reasoning applied to the domain of computational trust, allowing the creation of models that were consistently among the top-performing ones. Hence, in conclusion, the acceptance status of the hypothesis formulated in Section 3 is described below.
-Hypothesis: If computational trust is modelled with defeasible argumentation, then the inferential capacity of its models will be superior to that achieved by the selected baselines, non-monotonic fuzzy reasoning and expert system models, according to a predefined set of evaluation metrics from the domain of application.
-Acceptance status: accepted. Reasoning models built with defeasible argumentation were superior in maintaining a balance between the evaluation metrics, while being able to provide inferences for all instances in the datasets. Expert system and fuzzy reasoning models achieved appropriate inferences with the less complex knowledge base, but did not guarantee the production of inferences in all cases, nor appealing models regardless of the knowledge base.

Conclusion and Future Work
This research article focused on reviewing possible implementations of non-monotonic reasoning by knowledge-based approaches and on conducting an empirical comparison between them. Non-monotonic reasoning allows the retraction of previous conclusions in light of new information, making it a compelling approach to model reasoning applied to domains of uncertain information. The implementation of non-monotonic reasoning required the use of credible domain knowledge bases, which in this study were designed by the author. Their process of creation was exemplified, resulting in two variations: one simplified, with fewer conflicts; and another with more conflicts and higher topological complexity. Both contained sets of rules, contradictions, natural language terms, and fuzzy membership functions, as reported in Appendix A. These knowledge bases were then exploited for the design of inference models built upon three knowledge-based reasoning approaches able to perform non-monotonic reasoning: expert systems, fuzzy reasoning, and defeasible argumentation. Eventually, these same models were instantiated with real-world, quantitative datasets extracted from two Wikipedia dumps of different language editions. Two dumps were selected to reinforce findings and extend the generalisability of the study in terms of the datasets employed. After being instantiated, the designed models were used to infer a trust scalar in the range [0, 1] ⊂ R for each editor of the encyclopaedia, where 1 means complete trust and 0 means a complete absence of trust.
Still, the proposed comparison did not aim at enhancing the assessment of computational trust. Instead, it attempted to situate defeasible argumentation between two other well-known reasoning techniques: expert systems and fuzzy reasoning. In particular, the inferential capacity of reasoning models built with such approaches was examined. Three evaluation metrics were selected for this analysis: 1) the sum of the ranks, by assigned scalars, of recognised trustworthy individuals; 2) the spread of the scalars assigned to these same trustworthy editors; and 3) the percentage of instances without an assignment.
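To make the three metrics concrete, the following is a minimal, illustrative Python sketch of how they could be computed; the function name, data shapes, and the toy editor names are assumptions, not part of the original study's implementation.

```python
# Hypothetical sketch of the three evaluation metrics described above.
# `scores` maps editor id -> inferred trust scalar (None if unsolved);
# `trusted` is the set of editors recognised as trustworthy.
def evaluate(scores: dict, trusted: set):
    solved = {e: s for e, s in scores.items() if s is not None}
    # 1) Sum of ranks: rank solved editors by descending trust scalar
    #    (rank 1 = highest trust) and sum the ranks of trusted editors.
    ranked = sorted(solved, key=solved.get, reverse=True)
    rank_of = {e: i + 1 for i, e in enumerate(ranked)}
    rank_sum = sum(rank_of[e] for e in trusted if e in rank_of)
    # 2) Spread: range of scalars assigned to trusted editors.
    t_scores = [solved[e] for e in trusted if e in solved]
    spread = max(t_scores) - min(t_scores) if t_scores else 0.0
    # 3) Percentage of instances left without an assignment.
    unsolved_pct = 100.0 * (len(scores) - len(solved)) / len(scores)
    return rank_sum, spread, unsolved_pct

print(evaluate({"a": 0.9, "b": 0.4, "c": None, "d": 0.7}, {"a", "d"}))
```

Lower rank sums and unsolved percentages, and an adequate spread, would indicate a better-performing model under this sketch.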
This research included a number of limitations. These are important and can inform future work extending this research to further domains of application and to additional non-monotonic reasoning approaches. In terms of design, the domain-specific metrics selected for the evaluation of reasoning models increased the complexity of the proposed comparison due to the lack of ground truths. The scope was also bounded to an empirical comparison between knowledge-based, non-monotonic reasoning approaches using quantitative data in a real-world context; in other words, the method employed to perform the envisioned comparison (an empirical study) was restricted. The employed models of inference were designed upon different building blocks existing in the literature of the reasoning approaches on which they were grounded. This was necessary in order to achieve working models in the domain of computational trust with the available data and knowledge bases. Therefore, the understanding of the differences between such reasoning approaches in theoretical scenarios, or in scenarios unlikely to occur when modelling reasoning for computational trust, is also limited. In addition, the knowledge bases employed by the non-monotonic reasoning models were limited by their process of formation. This process was highly dependent on domain experts and on the time required for the manual acquisition of information and the formation/validation of structured knowledge. Moreover, each knowledge base was the result of information extracted from a single human reasoner rather than from multiple reasoners, since recruiting multiple reasoners would have required additional financial resources not available in this research work.
Despite such limitations, findings indicated that all the employed reasoning approaches allowed for the construction of models capable of assigning satisfactory trust scalars in certain scenarios. With the knowledge base containing fewer conflicts, all expert system and defeasible argumentation models were deemed successful regardless of the dataset selected. Contrarily, fuzzy reasoning models led to mixed results, due to their higher number of configuration parameters for model design. These parameters provided greater flexibility, but their complexity and abundance likely limit the applicability and use of fuzzy reasoning by domain experts. When faced with the knowledge base of higher topological complexity, models built with expert systems reported a significant number of unsolved instances, suggesting lower applicability in this kind of scenario regardless of the dataset chosen. Fuzzy reasoning models reported fewer unsolved instances, but the spread of scalars assigned to trustworthy editors was too low for any model to be considered effective, again regardless of the dataset chosen. In contrast, some argument-based models were able to solve all the instances while demonstrating a better capacity for inferring robust trust scalars, that is, scalars exhibiting an effective balance among the selected evaluation metrics.
These results could lead to a possible interpretation of defeasible argumentation being better suited to capture the underlying reasoning of the knowledge bases employed in this study. Still, some differences could be observed between argument-based models employing different configuration parameters. For instance, models built with a more credulous rationale (preferred semantics) achieved better inferences. The reason for such performance may be the high uncertainty of the domain of computational trust. This higher uncertainty could have originated from the fact that no other experts were consulted for validation or collaboration during the creation of the knowledge bases, or from the fact that computational trust is not a well-defined construct. Therefore, the knowledge bases could likely be further improved in order to be used by sceptical argument-based models.
In conclusion, the originality of this research lies in the extensive comparison performed among defeasible argumentation and two other approaches capable of performing non-monotonic reasoning in the domain of computational trust. Previous works [119] have employed defeasible argumentation and performed similar comparisons in other domains, such as mental workload modelling [122,86] and mortality occurrence modelling [124,125]. In this research article, different sets of data and/or models of inference were employed, extending the use of defeasible argumentation in real-world, quantitative contexts. Hence, a broad study has been performed, empirically enhancing the generalisability of defeasible argumentation as a possible approach to reason with quantitative data and conflicting/uncertain knowledge. In addition, a review of the investigated reasoning approaches was carried out, including their options for adding a non-monotonic layer. The practical use of such approaches, coupled with a modular design that facilitates similar experiments, was exemplified, and their respective implementations were made public. Moreover, the addition and use of a non-monotonic layer in the inferential processes of models built with expert systems and fuzzy reasoning was also exemplified. Such use is seldom reported in the field of non-monotonic reasoning, and it might be a useful aid to scholars familiar with these reasoning approaches who are also interested in performing non-monotonic reasoning activities. Overall, this study attempts to serve as a starting point for other scholars, who could replicate and/or improve the proposed approach in other domains of application. Consequently, this could contribute to the long-term goal of demonstrating the applicability and generalisability of defeasible argumentation with quantitative data in real-world contexts.
Different avenues can be pursued for future work. For example, a comparison of broader scope can be performed by adopting different structures and configurations of reasoning models. Another improvement to make the applicability of defeasible argumentation more generalisable could come from the analysis of other argumentation systems, such as fuzzy argumentation [41,67], possibilistic abstract dialectical frameworks [61], probabilistic argumentation [76], or bipolar argumentation [30,28]. Hybrid reasoning techniques, such as neuro-fuzzy systems [99], genetic fuzzy systems [38] and fuzzy argumentation [41], are also recommended. An investigation of different knowledge bases, both in the domain of computational trust and in new ones, will result in new findings; for example, in other contexts of digital collaborative environments, such as blogs, forums and social networks. Another interesting technique for the construction of knowledge bases could be the use of multiple reasoners for knowledge acquisition. Contradictions are generally hard to formalise, and multiple reasoners might argue among themselves, leading to the creation of more conflicting rules/arguments. A less time-consuming method to produce new knowledge bases could also be attempted through the development of human-in-the-loop solutions, partially automating the construction of arguments and attacks. For instance, several works have proposed different techniques for rule extraction from machine learning models [12,13,34,136]. Finally, a higher explanatory capacity might lead to higher levels of adoption and deployment of non-monotonic reasoning. Explainability is a multifaceted concept [135,1,114,77,94]; thus, an in-depth investigation of this aspect is also suggested.

Appendix A. Knowledge Bases
In this section, two knowledge bases are defined for the inference of computational trust. Their features were extracted from the files provided by Wikipedia dumps (Fig. A.14, p. 56). Nine quantitative features were selected and are detailed next. Rules and contradictions were defined by the author, who is qualified in computer science and has appropriate experience with a multitude of digital collaborative environments. Both knowledge bases consist of:
• A set of features employed for the modelling and assessment of computational trust (Table A.6).
• A set of natural language terms associated with numerical ranges used for reasoning with such features, for instance low and high (Table A.6).
• A set of inferential rules in the form:
-IF B feature A THEN C trust
where B is a level of feature A and C is a trust level, for instance "IF high bytes THEN high trust". Boolean operators AND/OR might also be used to add other premises.
• A set of contradictions or meta-rules in the form:
-IF B feature A THEN not Rule B
-IF Rule A THEN not Rule B
• A graphical representation of rules and contradictions.
• A set of fuzzy membership functions associated with the natural language terms.
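The elements listed above can be sketched as plain data structures. The following Python snippet is illustrative only: the feature names, numerical ranges, and rule labels are assumptions for demonstration, not the actual entries of Knowledge Bases 1 and 2.

```python
# Minimal sketch of a knowledge base with terms, rules, and contradictions.
# All names, ranges, and rule ids below are hypothetical.
kb = {
    "terms": {  # natural language terms -> numerical ranges per feature
        "bytes": {"low": (0, 1000), "high": (1000, float("inf"))},
    },
    "rules": [  # IF <level> <feature> THEN <trust level>
        {"id": "P1", "if": [("high", "bytes")], "then": "high trust"},
        {"id": "P2", "if": [("low", "bytes")], "then": "low trust"},
    ],
    "contradictions": [  # meta-rules undercutting other rules
        {"if": [("low", "frequency factor")], "not": "P1"},
    ],
}

def level_of(kb, feature, value):
    """Map a raw feature value to its natural language term."""
    for term, (lo, hi) in kb["terms"][feature].items():
        if lo <= value < hi:
            return term

print(level_of(kb, "bytes", 250))  # -> low (under the assumed ranges)
```

A structure of this kind can be consumed interchangeably by an expert system, a fuzzy inference engine, or an argumentation framework, which is what enables the modular comparison described in the article.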

Features and Natural Language Terms of Knowledge Bases 1 and 2
Table A.6: List of features employed by the author for reasoning and inference of trust (as described in Table 2), followed by the transformations applied to each of them, the possible values found in the dataset, and the natural language terms with their associated numerical ranges. Weights were also defined by the author through a pairwise comparison process.

OnlyAge.a IF low frequency factor AND low regularity factor AND low activity factor THEN not P4

Contradictions Employed by Knowledge Bases 1 and 2, and Graphical Representation
OnlyAge.b IF low frequency factor AND low regularity factor AND low activity factor THEN not P3
OnlyAge.c IF low frequency factor AND low regularity factor AND low activity factor THEN not P2

(e) Triangular functions for comments, frequency factor, regularity factor, presence factor, and computational trust. (f) Gaussian functions for comments, frequency factor, regularity factor, presence factor, and computational trust.
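The triangular and Gaussian shapes named in the panel captions above can be sketched as follows. This is a generic illustration with assumed parameters (a, b, c for the triangular shape; mean and sigma for the Gaussian), not the actual membership functions of the knowledge bases.

```python
import math

def triangular(x, a, b, c):
    """Triangular membership: 0 at a and c, peak of 1 at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def gaussian(x, mean, sigma):
    """Gaussian membership centred on `mean` with width `sigma`."""
    return math.exp(-((x - mean) ** 2) / (2 * sigma ** 2))

# Example: degree of membership of a normalised feature value of 0.5
print(triangular(0.5, 0.0, 0.5, 1.0))  # -> 1.0 (at the peak)
print(gaussian(0.5, 0.5, 0.1))         # -> 1.0 (at the mean)
```

In a fuzzy reasoning model, each natural language term (e.g. "low comments") would be bound to one such function over the feature's numerical range.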

Appendix D. Definitions of Acceptability Semantics
In this research article, it is important to define the most common Dung semantics [46], such as grounded, preferred, and stable, as well as other important notions such as reinstatement and conflict-freeness.
A⁺ indicates the set of arguments attacked by A, while A⁻ indicates the set of arguments attacking A. Similarly, Args⁺ indicates the set of arguments attacked by Args, while Args⁻ indicates the set of arguments attacking Args.

Definition 4 (Conflict-free [26]). Let ⟨Ar, att⟩ be an AF and Args ⊆ Ar. Args is conflict-free iff Args ∩ Args⁺ = ∅.
• Args is a complete extension if Args = F(Args).
• in(Lab) is a grounded extension if Lab is a complete labelling such that undec(Lab) is maximal (equivalently, in(Lab) is minimal, or out(Lab) is minimal).
• in(Lab) is a preferred extension if Lab is a complete labelling such that in(Lab) is maximal (equivalently, out(Lab) is maximal).
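As an illustration of these notions, the following hedged Python sketch computes the grounded extension of an abstract argumentation framework by iterating the characteristic function F to its least fixed point. The framework representation and argument names are illustrative assumptions.

```python
# Illustrative sketch: grounded extension via the characteristic function F,
# where F(S) = {a : every attacker of a is attacked by some member of S}.
def grounded(args, attacks):
    """`args` is a set of argument names; `attacks` a set of (attacker, target)."""
    attackers = {a: {x for x, y in attacks if y == a} for a in args}

    def defended(s, a):
        # a is defended by s iff each of its attackers is attacked from s
        return all(any((d, b) in attacks for d in s) for b in attackers[a])

    s = set()
    while True:
        nxt = {a for a in args if defended(s, a)}
        if nxt == s:
            return s  # least fixed point of F
        s = nxt

# Tweety example: t = "Tweety flies", p = "Tweety is a penguin" attacks t.
print(grounded({"t", "p"}, {("p", "t")}))  # -> {'p'}
```

On a mutual-attack cycle the grounded extension is empty, reflecting its sceptical stance, whereas preferred semantics would credulously accept either side.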
The categoriser ranking-based semantics [18] is also employed in this research article. It assigns a value to each argument based on its attackers. To do so, a categoriser function and a categoriser semantics are defined as follows:

Definition 8 (Categoriser function [18]). Let ⟨Args, att⟩ be an argumentation framework. Then Cat : Args → (0, 1] is the categoriser function defined as:

Cat(a) = 1, if a⁻ = ∅
Cat(a) = 1 / (1 + Σ_{b ∈ a⁻} Cat(b)), otherwise

Definition 9 (Categoriser semantics [18]). Given an argumentation framework ⟨Args, att⟩ and a categoriser function Cat : Args → (0, 1], the ranking-based categoriser semantics associates a ranking ⪰ on Args such that, ∀a, b ∈ Args, a ⪰ b iff Cat(a) ≥ Cat(b).
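Definition 8 can be sketched numerically. The following illustrative Python snippet computes categoriser values by naive fixed-point iteration; the iteration count and framework representation are assumptions, not part of the original definition.

```python
# Sketch of the categoriser function: Cat(a) = 1 if a has no attackers,
# otherwise 1 / (1 + sum of its attackers' values), via fixed-point iteration.
def categoriser(args, attacks, iters=100):
    attackers = {a: [x for x, y in attacks if y == a] for a in args}
    cat = {a: 1.0 for a in args}  # initial guess
    for _ in range(iters):
        cat = {a: 1.0 if not attackers[a]
               else 1.0 / (1.0 + sum(cat[b] for b in attackers[a]))
               for a in args}
    return cat

# An unattacked argument keeps value 1; an argument it attacks gets 1/2.
vals = categoriser({"a", "b"}, {("a", "b")})
print(vals["a"], vals["b"])  # -> 1.0 0.5
```

The resulting values induce the ranking of Definition 9: arguments with fewer (and weaker) attackers rank higher.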