Examining the Modelling Capabilities of Defeasible Argumentation and non-Monotonic Fuzzy Reasoning

Knowledge-representation and reasoning methods have been extensively researched within Artificial Intelligence. Among these, argumentation has emerged as an ideal paradigm for inference under uncertainty with conflicting knowledge. Its value has been predominantly demonstrated via analyses of the topological structure of graphs of arguments and its formal properties. However, limited research exists on the examination and comparison of its inferential capacity in real-world modelling tasks and against other knowledge-representation and non-monotonic reasoning methods. This study focuses on a novel comparison between defeasible argumentation and non-monotonic fuzzy reasoning when applied to the representation of the ill-defined construct of human mental workload and its assessment. Different argument-based and non-monotonic fuzzy reasoning models have been designed considering knowledge-bases of incremental complexity containing uncertain and conflicting information provided by a human reasoner. Findings showed that their inferences have moderate convergent and face validity when compared, respectively, to those of an existing baseline instrument for mental workload assessment and to a perception of mental workload self-reported by human participants. This confirmed that these models reasonably represent the construct under consideration. Furthermore, argument-based models had, on average, a lower mean squared error against the self-reported perception of mental workload than fuzzy-reasoning models and the baseline instrument. The contribution of this research is to provide scholars interested in formalisms for knowledge-representation and non-monotonic reasoning with a novel approach for empirically comparing their inferential capacity. © 2020 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).


Introduction
Several approaches in the field of Artificial Intelligence (AI) have been proposed and investigated for modelling reasoning under uncertainty [1]. These include probability calculus [2] and its variations such as Possibility Theory [3] and Imprecise Probabilities [4], Dempster-Shafer Theory [5], Argumentation Theory [6] and Fuzzy reasoning [7]. Many scholars have used these approaches for modelling reasoning activities across many application domains. However, most of these are based upon a monotonic consequence relation: adding a formula to a theory never produces a reduction of its set of consequences. In other words, monotonicity supports the fact that learning a new piece of knowledge cannot reduce the set of what is known. Because of this limitation, they have been deemed not suitable for many knowledge representation problems that are non-monotonic in nature. Here, a conclusion or claim, derived from the application of some knowledge, can be retracted in the light of new information. In this study, the inferential capacity of defeasible argumentation, a formalism implementing non-monotonic reasoning, is compared against that of non-monotonic fuzzy reasoning. To achieve this goal, a knowledge representation problem has been chosen: the representation and assessment of the ill-defined construct of mental workload (MWL), via non-monotonic reasoning. Defining and modelling mental workload is an open problem within the disciplines of Human Factors, Psychology and Neuroscience. It is ill-defined because no clear and widely accepted definition exists. Mental workload can be intuitively described as the amount of necessary cognitive work invested in a certain task for a period of time. Nevertheless, this is an oversimplified definition: other factors such as stress, time pressure and mental effort can all influence mental workload, whose levels can change over the execution of a task. Additionally, different scholars and reasoners from different disciplines might define mental workload according to their theoretical background, their knowledge and the availability of different theories.
This justifies the subjectivity and uncertainty surrounding the problem of mental workload definition and modelling. However, a reasonable assumption made in this research study is that mental workload is a complex construct built upon a network of pieces of evidence, beliefs and intuitions, and that understanding the interactions among these is essential for defining and assessing it. These assumptions are also the key components of a defeasible concept: a concept built upon a set of reasons that can be defeated by additional reasons. Finally, since mental workload is an ill-defined construct, no ground truth exists to validate different models of mental workload itself. However, over the past 50 years, designers and scholars have proposed a set of validation criteria that can be used to validate models of mental workload [22]. Among them, convergent and face validity have been chosen in this study. The former refers to the extent to which two measures of mental workload that should be theoretically related are in fact related, while the latter assesses whether these measures appear effective in terms of their stated aims, that is, measuring mental workload.
The specific research question under investigation is: to what extent can defeasible argumentation allow the construction of models of mental workload with a higher convergent and face validity when compared to those constructed via non-monotonic fuzzy reasoning?
The remainder of the paper is organised as follows. Related work on argumentation theory and fuzzy reasoning is presented in Section 2, followed by a short description of the construct of mental workload, aimed at providing readers with the notions needed to understand the modelling problem under investigation. In Section 3, the design of a comparative experiment and the methodologies for the development of argument-based and fuzzy-reasoning models are detailed. Section 4 presents the results, followed by a discussion. Section 5 concludes this research by highlighting its contribution to knowledge and proposing future avenues of development.

Related work
Reasoning and explanation under incomplete and uncertain knowledge have been investigated for several decades in AI. On one hand, classical propositional logic has been shown to be inadequate for dealing with real-world reasoning activities, which often involve inconsistent and conflicting information [23]. This is because of its monotonicity property, based upon a notion of consequence relation by which adding a formula to a theory never produces a reduction of its set of consequences. In a nutshell, monotonicity is grounded on the fact that learning a new piece of knowledge cannot reduce the set of what is known.
In detail, if a conclusion p follows from a set of premises A (denoted as A ⊢ p), in standard monotonic reasoning it also holds that A, B ⊢ p: if any additional set of premises B is added to A, the conclusion p remains valid. However, this property does not allow the retraction of what is known, which is what often happens in human-like reasoning, which is generally non-monotonic. Reasoning is non-monotonic when a conclusion, supported by one or more premises built over some information, can be retracted in the light of new information [24,25].
In other words, in non-monotonic reasoning, even if the premises are still true, the conclusion might not hold, whereas in monotonic reasoning, if the premises are true the conclusion necessarily follows. The property of non-monotonicity relies on the idea that a claim can be derived from partially specified premises but, in the case of an exception arising, it can be withdrawn [26,27]. Many non-monotonic reasoning formalisms have been investigated in AI over the last few decades [1,8,11,16,28-31], with applications in many domains, with some examples in [10,32-40]. A type of reasoning that accounts for the property of non-monotonicity is defeasible reasoning, routinely used by humans. Here, the main feature is represented by default knowledge, employable in a reasoning process even if the preconditions for its application are only partially known. Default knowledge is represented by using defaults, which are specific inference rules. These are expressions of the form: p(x) : j_1(x), ..., j_n(x) → c(x), where p(x) is the prerequisite of the default, j_1(x), ..., j_n(x) are the justifications and c(x) is the consequent. If p(x) is known and each j_i(x) is consistent with what is known, then c(x) can be defeasibly deduced [23]. In other words, if it is believed that the prerequisite is true, and each of the n justifications can be assumed since they are consistent with current beliefs, then this leads to believing the truth of the conclusion. The truth of the preconditions is not explicitly verified: they are assumed to hold defeasibly. In other words, they are true in the absence of explicit information to the contrary [8]. As soon as new information becomes available and the falsity of such preconditions can be deduced, the conclusions derived from the application of the default knowledge can be retracted [1,10,12].
Intuitively, this means that adding new premises may lead to removing conclusions (non-monotonicity), rather than only adding new ones (monotonicity).
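A default rule of the form described above can be illustrated with a minimal sketch. The classic "birds typically fly" example, the string-based encoding of negation and the function below are our own simplifications, not a formalism from the paper:

```python
# A minimal sketch (our own simplification, with string-encoded negation)
# of applying a default rule p(x) : j_1(x), ..., j_n(x) -> c(x).

def apply_default(prerequisite, justifications, consequent, beliefs):
    """Defeasibly derive the consequent when the prerequisite is believed
    and no justification is explicitly contradicted by current beliefs."""
    if prerequisite not in beliefs:
        return None
    if any("not " + j in beliefs for j in justifications):
        return None  # an exception arose: the conclusion is withdrawn
    return consequent

# Classic example: birds typically fly.
beliefs = {"bird(tweety)"}
print(apply_default("bird(tweety)", ["flies(tweety)"], "flies(tweety)", beliefs))
# -> flies(tweety)   (derived defeasibly, preconditions assumed to hold)

beliefs.add("not flies(tweety)")   # new information: Tweety is a penguin
print(apply_default("bird(tweety)", ["flies(tweety)"], "flies(tweety)", beliefs))
# -> None            (the earlier conclusion is retracted: non-monotonicity)
```

Adding a belief removed a conclusion, which is exactly the behaviour a monotonic consequence relation forbids.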
One of the practical implementations of defeasible reasoning is offered by argumentation theory (AT). Classical argumentation, from its roots within philosophy and psychology, deals with the study of how arguments or assertions are defined, discussed and resolved in case of divergent opinions. In AI, argumentation refers to the body of literature that focuses on techniques for constructing computational models of arguments. Such models have become increasingly important for operationalising non-monotonic reasoning [1,15]. Examples of application areas include knowledge representation [39], decision-making in health-care [34], practical reasoning, decision support, and dialogue and negotiation [15,17,41]. Argumentation deals with the interactions between possibly conflicting arguments, arising when different parties argue for and against some conclusions or when different pieces of evidence are available [42]. Arguments can be seen as 'tentative proofs for propositions' [43] in a logical language whose axioms represent premises in the domain under consideration. In general reasoning problems, the premises are usually not consistent because they may lead to incompatible conclusions. Argumentation systems are usually formed by several components and layers. A good summary of these components and their roles can be found in [6] and is summarised in the structure of Fig. 1. The first and second layers deal with the internal structure of arguments and its components, such as premises, types of rules, their conclusions and their connections [44], as well as the definition of the conflicts among them. The subsequent layers deal with the dialogical structure of arguments and focus instead on conflict resolution. They typically regard arguments as monolithic entities, whose internal structure is abstracted away as far as the conflict resolution process is concerned.
Roughly speaking, the first two layers concern the production and construction of arguments and their conflicts: they are aimed at formalising a knowledge-base. The subsequent dialogical layers concern the resolution of potential inconsistencies and the extraction of a rational point of view that can subsequently be used, for instance, to support decision-making.
The structure described above is somewhat aligned to the structure of [45], where 4 layers were proposed: logical, dialectical, procedural and heuristic. The logical layer deals with argument construction, similar to layer one (Fig. 1). The second layer introduces the notions of attack, rebuttal, defeat and counter-argument and deals with conflicting information, introducing the notion of reinstatement, respectively similar to layers 2 and 4 of Fig. 1. It also introduces the notion of argument comparison, aimed at determining whether an attack is successful, as in layer 3 of the structure of [6]. The third layer regulates how a dispute among arguments can be conducted, allowing the introduction of new information and supporting the formation of new arguments. This has no correspondence in the structure proposed in this paper, as static knowledge can only be represented in the first 2 layers using the notions of argument and attack. The fourth layer provides rational ways for resolving the conflicts that emerged in the previous layer and is connected to the notion of rhetoric, influencing the preceding layers. These strategic ways, or heuristics, influence the previous layers by suggesting which premises to use, which arguments to construct, which arguments to attack and which claims to make or deny. This has no correspondence with the structure proposed in this study (Fig. 1). The main differences (summarised in Fig. 2) are that the structure proposed in this paper (left) is linear, mainly intended for single-agent argument-based systems, sometimes referred to as autonomous agent reasoning [46], and tailored to knowledge-representation problems, belief revision and decision-making under uncertainty. The structure proposed in [45] (right), on the other hand, is non-linear and mainly intended for multi-agent systems where different parties engage in a dispute and might form new arguments during the dialectical exchange of information.
The former is more static and adds a final fifth layer devoted to the accrual of arguments, under the assumption that, in some particular domains, a final rational point always has to be extracted. The latter is more dynamic and is tailored to those situations in which rhetoric is required and reasoning is conducted by multiple parties, such as in legal reasoning and in dialogues facilitating multi-agent interaction [45,46].
The first layer of the structure used in this research (Fig. 1) focuses on constructing arguments. An example of an internal, monological structure has been proposed by Toulmin [47]. Toulmin's model is useful for highlighting the elements that might form a natural argument, and it provides a useful basis for knowledge representation. Fig. 2 compares a single-agent argument-based reasoning scheme for knowledge representation [6] with a multi-agent reasoning scheme for legal reasoning [45]. Another well-known approach has been proposed by Reed and Walton to model the notion of arguments as product [48,49]. It is based upon the notion of argumentation schemes and is useful for identifying and evaluating a variety of argumentation structures in everyday discourse [44]. A recent attempt at reconstructing the internal structure of arguments in natural language has been proposed in [50]. This combines the linguistic representation framework of Constructive Adpositional Grammars (CxAdGrams) [51] with the argument classification framework of the Periodic Table of Arguments (PTA) [52]. In detail, the second layer is built upon the notion of conflict, also referred to as attack or defeat, sometimes with slightly different meanings; these are key notions in argumentation. Several kinds of conflicts have emerged in the literature but three core classes exist [53]:
• undermining attack - an argument can be attacked on one of its premises by another argument whose conclusion negates that premise. Example: 'soda consumption is low according to X so X has a low risk of cholesterol' can be undermined by 'the blood pressure emerging from a test is high so X has a high consumption of sodas'.
• rebutting attack - it occurs when an argument negates the conclusion of another argument. Example: 'X consumes a minimal amount of soda so X has a low risk of heart attack' can be rebutted by 'X is an obese person, the strongest risk factor for heart attack is obesity, so the risk of heart attack is high'.
• undercutting attack - it occurs when an argument uses a defeasible inference rule and is attacked by arguing that there is a special case that does not allow the application of the rule itself [54]. Example: 'aspirin treatment minimises the risk of heart attack so X has a low risk of heart attack' can be undercut by 'paper Z demonstrated that aspirin failed several times in minimising the threat of heart attack so it is not always an effective method to reduce heart attacks'.
Conflict between arguments, although a key notion in argumentation, does not by itself embody any approach for evaluating an attack; this is instead handled in the third layer. Generally speaking, an attack often has the form of a binary relation between two arguments. Some authors distinguish this relation into a weak form, attacking another argument without being weaker (defeat), and a strong form, attacking another argument while being stronger (strict defeat) [53]. The construction of defeat relations is often influenced by the domain of application, and such relations are usually defeasible, that is, attackable. For example, consider those domains where observations are important: defeat relations might be influenced by the reliability of tests and the expertise of the observers. To establish whether an attack can be considered a successful defeat, a trend in argumentation considers the notion of strength of arguments. Here, the assumption is that arguments have unequal strengths, which should be accounted for in the computation of sets of arguments and counterarguments [55]. Several researchers have adopted the notion of preferentiality among arguments [56]. In these studies, the assumption is that the information necessary to evaluate the success of an attack between two arguments is often pre-specified, and implemented as an ordering of values or a given partial preference. However, according to [56], the information concerning preferentiality might also be contradictory, and the preferences may change according to the context and to different observers, who can assign different strengths to different arguments by employing different criteria. Therefore, the notion of meta-level argument has emerged: a simple node in a graph where preferentiality is abstractly defined, which can be implemented by creating a new attack relation from a preference argument.
Meta-level arguments have the goal of making a reasoning process more intuitive, as they require no commitment regarding the definition of the preferences of arguments. As opposed to the preferentiality approach, another branch of argumentation is devoted to attaching weights to attack relations rather than to arguments [55,57]. For example, the traditional binary relation of attack has been extended in [58] through the notion of fuzzy relations borrowed from Fuzzy Logic. This approach allows an expert to represent the degree of truth of an attack from one argument to another. Similarly, in [59], probabilities can be assigned to arguments and attack relations. Here, probabilities refer to the likelihood of their existence and aim at capturing the inherent uncertainties in an argumentative reasoning activity.
Defeat relations, as defined in the third layer, focus on the relative strength of two conflicting arguments. However, they do not yet express which arguments can be regarded as justifiable. The final state of each argument depends on its interaction with the others, thus a notion of dialectical status is required in a fourth layer. Often, the outcome of an argumentation system is achieved by splitting the set of arguments in two: those that support a certain decision/action/claim and those that do not. Sometimes an additional class can contain those arguments that leave the dispute in an undecided status. In practical, real-world argumentation, multiple actions, decisions or claims can be considered, thus the number of classes can increase. A seminal approach for assessing the dialectical status of arguments has been proposed by Dung [16], in line with other more practical and concrete works on argumentation [54,60]. Dung's abstract argumentation frameworks (AFs) have proven very useful for comparing different argumentative systems by translating them into his abstract format [60]. Here, given a set of abstract arguments, whose internal structure is abstracted away, and a set of defeat relations, a decision has to be taken to determine which arguments can ultimately be accepted. The core idea is that looking only at the defeaters of an argument is not enough to decide its dialectical status: it is also important to assess whether the defeaters are themselves defeated. Argumentation frameworks are strictly connected to the notion of semantics: specific criteria for the assessment of the dialectical status of arguments. The idea is that, given an argumentation framework, a semantics specifies zero or more sets of acceptable arguments, called argument-based extensions, intuitively corresponding to different points of view.
Examples include the popular grounded and preferred semantics proposed by Dung [16], respectively a sceptical and a credulous criterion for argument acceptability. Other semantics can be found in [14,61-68]. For practical purposes, as often happens with humans, a unique rational inference has to be made: a unique decision has to be taken or a unique rational claim is necessary. Thus another, fifth layer is sometimes added to the previous structure, as depicted in Fig. 1, aimed at accruing the remaining arguments and producing a unique inference employable for practical purposes, such as informing decision-making or explaining a rational outcome. Here, accrual occurs at the level of the consequents of arguments and not at the level of an argument's claim. In other words, as stated in [69], in the former case the accrual refers to a combination of arguments and their claims, leading to accruing reasons or arguments, while in the latter case it refers to the decision or belief which is the subject of a reason (argument), namely its claim (conclusion). Examples of approaches for accruing arguments at the consequent level include averaging the consequents of the arguments in an extension, in the case of quantitatively assessable claims, or considering the claims of the top x% of arguments in case they were ranked in the previous layer with a ranking-based semantics. The reader is referred to [69-72] for other approaches to the accrual of arguments.
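The grounded semantics mentioned above can be sketched concretely. The following is our own minimal illustration, not the authors' implementation: it computes the grounded extension of an abstract argumentation framework by iterating Dung's characteristic function from the empty set until a fixed point is reached.

```python
# A sketch (ours, not the authors' implementation) of Dung's grounded
# semantics: iterate the characteristic function from the empty set until
# a fixed point is reached.

def grounded_extension(arguments, attacks):
    """arguments: set of labels; attacks: set of (attacker, attacked) pairs."""
    attackers_of = {a: {x for (x, y) in attacks if y == a} for a in arguments}

    def acceptable(a, s):
        # a is acceptable w.r.t. s if every attacker of a is attacked by s
        return all(any((d, b) in attacks for d in s) for b in attackers_of[a])

    s = set()
    while True:
        new_s = {a for a in arguments if acceptable(a, s)}
        if new_s == s:
            return s
        s = new_s

# Reinstatement: c defeats b, b defeats a, so c reinstates a.
print(sorted(grounded_extension({"a", "b", "c"}, {("b", "a"), ("c", "b")})))
# -> ['a', 'c']
```

Note how the defeated defeater b is excluded while a is reinstated, matching the intuition that it is not enough to look at an argument's defeaters without asking whether those defeaters are themselves defeated.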
Despite the abundance of theoretical work, one of the main issues surrounding argumentation, its theories and approaches, is the lack of research devoted to examining its impact on the quality of the inferences produced by reasoning models built upon it. Similarly, research situating and comparing argument-based systems among and against other non-monotonic reasoning approaches is negligible. However, as previously mentioned, many other approaches for implementing non-monotonic reasoning and enabling knowledge representation under partial, conflicting and uncertain information exist. One of these is fuzzy reasoning, a type of reasoning grounded in the well-known fuzzy logic. Fuzzy reasoning, as originally proposed in the seminal article by Zadeh [73], is a generalisation of standard logic in which a concept can possess a degree of truth, rather than being only completely true or completely false.
For this reason, fuzzy logic aims at modelling vague concepts with varying degrees of truth. In practical terms, these concepts are implemented via fuzzy sets and membership functions. The former are particular sets whose elements have degrees of membership, supporting notions similar to those of classical set theory such as inclusion, union and intersection. The latter are functions that model this membership, that is, they assign a grade of membership to an element in the interval [0, 1] ⊂ ℝ. These formal representations can support the construction of ''IF-THEN'' rules using natural language terms, as in ''IF the temperature is very high THEN significantly increase the speed of the fan''. Temperature and speed are the concepts, modelled as fuzzy sets, while very high and significantly are modelled by membership functions. Note that other membership functions might be designed, for instance very low for temperature, or moderately for speed. In other words, Temperature is a fuzzy variable, while very high is a value that such a variable can take, which is modelled by a membership function. A set of fuzzy rules can be employed for reasoning and, along with the above notions, they have enabled the development of fuzzy control systems [74,75]. A traditional fuzzy control system is usually composed of a fuzzification module, an inference engine and a defuzzification module, sometimes referred to as a defuzzifier [76,77]. The fuzzification module takes the input variables, identified from the knowledge-base of an expert in the underlying application domain. The universe of information is split into a number of fuzzy subsets, and each is assigned a linguistic label. Subsequently, a membership function for each fuzzy subset is created by an expert or a human reasoner, aimed at modelling the uncertainties associated with each input variable.
These form the fuzzy linguistic variables that, jointly with the rule-base of the expert/reasoner, are translated into fuzzy rules aimed at establishing relationships between fuzzy inputs and outputs. Each rule can be built upon a number of fuzzy linguistic terms, joined by specific operators. These might include the more traditional t-norms and t-conorms (intersection and union), or others such as those used in intuitionistic fuzzy sets [78,79]. The rules are taken by the inference engine, which actually performs the reasoning. Usually, these rules are all evaluated in parallel and combined with another operator (usually the fuzzy OR operator) to obtain one fuzzy output distribution. This is then taken by the defuzzification module, which applies a strategy (for example the centre of mass or the mean of maxima approach) to produce a single crisp output.
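The pipeline just described (fuzzify, evaluate rules, aggregate with the fuzzy OR, defuzzify by centre of mass) can be sketched end to end. The two-rule base, the triangular membership functions and the crisp inputs below are illustrative assumptions of ours, not the paper's models:

```python
# A hedged sketch (our own toy example, not the paper's models) of a
# Mamdani-style pipeline: fuzzify -> evaluate rules -> aggregate -> defuzzify.

def tri(x, a, b, c):
    """Triangular membership function rising on [a, b] and falling on [b, c]."""
    return max(min((x - a) / (b - a), (c - x) / (c - b)), 0.0)

universe = [i / 10.0 for i in range(1001)]      # output universe [0, 100]

# Illustrative rule base over the workload domain:
#   Rule A: IF mental demand is low THEN workload is underload
#   Rule B: IF effort is high       THEN workload is overload
mental_demand, effort = 20.0, 70.0              # crisp inputs on [0, 100]

w_a = tri(mental_demand, 0, 10, 50)             # fuzzification: firing strengths
w_b = tri(effort, 50, 90, 100)

# Mamdani implication clips each consequent set at its rule's firing
# strength; the clipped sets are aggregated with the fuzzy OR (max).
aggregated = [max(min(w_a, tri(y, 0, 10, 40)),
                  min(w_b, tri(y, 60, 90, 100))) for y in universe]

# Centre-of-mass defuzzification produces a single crisp workload score.
crisp = sum(m * y for m, y in zip(aggregated, universe)) / sum(aggregated)
print(round(crisp, 1))
```

The two rules are evaluated in parallel, their clipped consequents are merged into one fuzzy output distribution, and the defuzzifier collapses it to a single crisp score, exactly the three-module structure described above.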
One of the limitations of this reasoning approach is that it does not incorporate any explicit notion of conflict among fuzzy rules. In other words, conflicting fuzzy rules might give rise to different outcomes when they are aggregated; the behaviour of a fuzzy rule system therefore admits that some degrees of truth may decrease in the presence of new information, which could be interpreted as a form of non-monotonicity. However, besides this interpretation, no explicit mechanism exists in standard fuzzy reasoning systems to handle conflicts among rules. Some scholars have proposed extensions of standard fuzzy reasoning systems that incorporate a non-monotonic layer for dealing with conflicting information. Unfortunately, these are sporadic and not backed up by empirical research. For example, in [30], conflicting rules have their conclusions aggregated by an averaging function, while in [31] a rule-base compression method is proposed for the reduction of non-monotonic rules. This method identifies all redundant rules after their fuzzification and removes them while preserving the defuzzified output of the fuzzy system. Another approach has been proposed in [80, Chapter 8]; it is built upon Possibility Theory [3], which is employed within the inference engine to handle conflicting information. Possibility and necessity are measures of uncertainty and a special form of imprecise probability [81]. They can be linked to the notion of truth and its degree: possibility indicates the extent to which data fail to refute the truth of a proposition, while necessity indicates the extent to which data support it (both real values in [0, 1] ⊂ ℝ). The possibility of a proposition can also be seen as an upper bound of the related necessity (Pos ≥ Nec).
Not a lot of work exists in the literature comparing different non-monotonic reasoning approaches and their inferential capacity. The authors in [82] attempted to compare Normal Default Theory against Defeasible Logic Programming. Similarly, [83] performed a comparison of first-order predicate logic, fuzzy logic and non-monotonic logic as part of a knowledge-representation problem, whereas [84] discussed and compared preferential and explanatory non-monotonic reasoning. However, the research studies above mainly focused on formal comparisons of properties of different non-monotonic approaches for inference, and not on empirical evidence gathered from experimental studies involving humans. In contrast, a work similar to the research approach followed in this paper has been performed by [85]. It focused on an empirical comparison, involving human data, between non-monotonic preferential logic and screened belief revision, a particular version of belief revision theories. Most of the work on non-monotonic reasoning is strictly related to various knowledge-representation problems, and this is the reason why the ill-defined concept of human mental workload has been selected for experimental purposes.
Mental workload (MWL) is a construct coming from psychology and mainly applied within ergonomics and education [86], with novel applications in medicine and human-computer interaction. It can be intuitively described as the amount of cognitive activity exerted to accomplish a specific task in a finite period of time [87]. However, this definition is very simplistic and many factors influence mental workload. Despite 50 years of research, it is unfortunately still an ill-defined construct, with many discipline-specific definitions and many application-dependent models [40]. Similarly, many formalisms exist to represent and assess mental workload [22,39,88,89]. Along with the many definitions and the various ad hoc, context-specific models of mental workload, a number of criteria for evaluating these have been proposed [90], including reliability, validity, sensitivity and diagnosticity. Two of them have been selected in this study, as indicated in the research question of Section 1: convergent and face validity, both defined in Table 1. The following section is devoted to the description of the construction of various models of mental workload, employing different declarative knowledge-bases provided by different human reasoners, formalised by using the two reasoning methods described so far, namely defeasible argumentation and non-monotonic fuzzy reasoning, and elicited with data acquired from human participants in an educational context.

Design and methodology
A primary research study has been designed and performed. It is aimed at comparing the inferential capacity of defeasible argumentation and non-monotonic fuzzy reasoning for the problem of mental workload modelling. In particular, a well-known self-reporting subjective mental workload assessment instrument has been chosen as the baseline: the NASA Task Load Index [88]. This is a combination of six factors believed to influence mental workload: mental, temporal and physical demand, stress, effort and performance.
Each factor is quantified with a subjective judgement coupled with a weight w computed via a paired-comparison procedure, leading to 15 possible comparisons. The questionnaire designed for the quantification of each factor, and the pairwise comparison, can be found in [88] and in Tables 7 and 8 of the Appendix. Eventually, the final mental workload score is the weighted average of the subjective rating d_i associated with each factor:

MWL = (Σ_{i=1}^{6} w_i × d_i) / 15

where each weight w_i counts the number of times factor i was preferred across the 15 pairwise comparisons, so the weights sum to 15. As it is intuitive to grasp, this model is a weighted average, with no notion of non-monotonicity and no consideration of the relationships among the factors. The information associated with these factors, along with that related to the pairwise comparisons (preferentiality), has been used to construct three knowledge-bases of different topology and complexity with a human reasoner. Note that no automatic procedure for inducing rules from data has been used, only the expertise of human reasoners. In detail, a defeasible argument-based model and a non-monotonic fuzzy reasoning-based model were constructed for each knowledge-base. Finally, their inferential capacity was evaluated according to two criteria usually employed for the evaluation of models of mental workload: face and convergent validity (as detailed in Table 1). The overall research design is summarised in Fig. 4.
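The baseline weighted-average aggregation described above can be sketched as follows. The ratings and weights are made-up example values for illustration, not data from the study:

```python
# Sketch of the baseline NASA Task Load Index aggregation described above.
# Ratings and weights are made-up example values, not data from the study.

ratings = {   # subjective rating d_i per factor, scaled to [0, 100]
    "mental demand": 70, "physical demand": 20, "temporal demand": 55,
    "performance": 40, "effort": 65, "stress": 30,
}
# w_i = number of times a factor was preferred across the 15 pairwise
# comparisons, so the weights always sum to 15.
weights = {
    "mental demand": 5, "physical demand": 0, "temporal demand": 3,
    "performance": 2, "effort": 4, "stress": 1,
}
assert sum(weights.values()) == 15

mwl = sum(ratings[f] * weights[f] for f in ratings) / 15
print(mwl)  # -> 59.0
```

As the text notes, this is a plain weighted average: a factor judged irrelevant in every pairwise comparison (here physical demand, with w = 0) simply drops out, and no interaction or conflict between factors is modelled.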

Non-monotonic fuzzy reasoning models
The non-monotonic fuzzy reasoning models are built according to the traditional Mamdani control system of Fig. 3, extended by employing the notions of possibility and necessity to support non-monotonicity.

Fuzzification module
Each encoded knowledge-base can be represented by rules of the form ''IF-THEN''. The antecedent is a set of premises associated to a number of mental workload features, while the consequent is associated to a possible mental workload level (underload, fitting lower load, fitting upper load, overload). Examples of rules are:
• Rule 1: IF low mental demand THEN underload
• Rule 2: IF low effort THEN fitting lower load
Each consequent of a rule can be represented as a fuzzy term and described by a fuzzy membership function (FMF) (examples in Fig. 17). According to the domain expert's knowledge, these were designed as follows: the universe of mental workload is represented with an interval [0, 100] ∈ ℜ with 4 membership functions.
Fuzzy membership functions were also defined for all linguistic variables associated to the antecedents of rules such as low for mental demand and low for effort (examples in Fig. 16).
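A minimal sketch of such a membership function is given below. The trapezoidal shape is a common choice for linguistic terms; the specific breakpoints used here for 'low' effort are hypothetical, since the actual ranges were set by the human reasoner.

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal fuzzy membership function: 0 outside [a, d],
    1 on the plateau [b, c], linear on the two slopes."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)

# Hypothetical definition of the linguistic term 'low' for effort
# on the scaled universe [0, 100]:
def low_effort(x):
    return trapezoid(x, -1, 0, 20, 40)

low_effort(10)  # fully 'low'
low_effort(30)  # partially 'low'
low_effort(60)  # not 'low' at all
```

The membership grade returned by such a function is what the inference engine below treats as the initial necessity of a premise.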
Membership functions for both the antecedents and consequents of a rule have been defined by a human reasoner with experience with the construct of human mental workload, and not automatically induced from data. Since each feature of the NASA-Task Load Index instrument is rated by human participants using a 20-point scale (Table 7 in the Appendix), the input for these membership functions was scaled to the interval [0, 100] ∈ ℜ in order to match the membership functions defined for the consequents (workload levels). This was purely a practical decision in order to deal with similar scales.

Inference engine
Once the knowledge-bases are fully translated into fuzzy inference rules, the next step is to evaluate their initial truth values. Each membership grade on the antecedent of these rules needs to be evaluated individually according to input data. If more than one antecedent is contained in a rule, fuzzy logic operators are necessary to aggregate them via the notions of union and intersection. Three well-known operators are selected for this: Zadeh, Product and Lukasiewicz. Table 2 lists the t-norms and t-conorms (AND, OR) respectively for each selected operator. Subsequently to the calculation of the initial truth values of the rules, it is necessary to solve any contradiction among them. For example, an exception may indicate that if effort is high, then any rule whose antecedent contains 'low mental demand' is refuted, and the truth value associated to its consequent should be evaluated again and updated. An example is:
• Exception 1: high effort refutes Rule 1
Exceptions represent the non-monotonic nature of the information in the knowledge-bases. Here, the underlying reason that gave rise to an exception is abstracted away; that is, the undercutting, rebutting or undermining nature of the refutation is not formally modelled. A way of enabling non-monotonicity in fuzzy reasoning is the use of Possibility Theory, as implemented in [80]. Much of the literature on fuzzy mathematics is concerned with possibility, a measure of the extent to which the data fail to refute a conclusion. However, in the real world, we are primarily concerned with necessity, a measure of the extent to which the data support a conclusion. Possibility (Pos) and necessity (Nec) are values bounded in the interval [0, 1] ∈ R. The reasoning process for establishing necessary conclusions is not the same as the process for establishing possible conclusions. For example, on the one hand, when possible truth values are initialised in the absence of any data, a value of 1 is used.
On the other hand, when necessary truth values are initialised in the absence of data, a value of 0 is used [80]. According to these considerations, in this study necessity is considered as the membership grade of a proposition, while possibility is always set to 1 for all propositions, due to the lack of any data. Under this circumstance, the effect on the necessity of a proposition A of a set of propositions Q which refutes A is given by:

Nec(A) = min(Nec(A), ¬Nec(Q))    (1)

with ¬Nec(Q) = 1 − Nec(Q). In Eq. (1), the addition of supporting evidence can only affect the necessity but not the possibility of a proposition. In this research study, there is no addition of supporting information but only attempts to refute information. Thus, the above equation is sufficient to model the contradictions in the knowledge-bases. An example is given below. Let us consider Rule 1, assuming it is refuted only by Exception 1; then the truth value of its consequent is given by:

Truth value of Rule 1 = min(Nec(low mental demand), 1 − Nec(high effort))

It is worth highlighting that the theory developed in [80] was for a multi-step forward-chaining reasoning system. This means that rules were activated in a chain, one by another, defining a precedence order of rules. However, in the current research study, the activation of rules is done in a single step, in the sense that data is imported and all rules are activated (or not) at once. Despite this constraint, it is still possible to define a precedence order of refutations in this case too. In detail, a tree structure in which the consequent of a refutation is the antecedent of the next refutation can be constructed. In this way, Eq. (1) can be applied from the root or roots to the leaves. This approach is sufficient for knowledge-bases that do not contain loops (cyclic exceptions). However, the knowledge-bases employed here do contain loops of rules or rebutting information.
For instance, let us consider two ''IF-THEN'' rules (Rules 3 and 4) refuted respectively by Exceptions 2 and 3, which form a loop. In this case it is not clear whether Exception 2 or 3 should be solved first. Given that there is no information within the knowledge-bases to decide whether a mental workload feature (premise of a rule) or an exception is more important than another, the proposal here is to activate them simultaneously. In detail, the original truth values of the rules are computed before updating them due to exceptions (refutations), applying Eq. (1) to all rules at once. For instance, the truth value of Rule 4, whose antecedent involves high frustration, is computed as min(Nec(high frustration), 1 − Nec(Q)), where Q is the premise of the exception refuting it. The above mechanism handles conflictual information by updating the degree of truth of each rule. In turn, this degree of truth also represents the degree of truth of the rule's consequent. Once each consequent (mental workload level) has an associated degree of truth, it is necessary to aggregate them before proceeding to the defuzzification step. A disjunctive approach is employed for performing this aggregation. This approach groups the consequent levels inferred by each ''IF-THEN'' rule using the 'max' operator. In this case, at least one rule needs to be satisfied, leading to a more flexible proposal. For instance, the truth value of underload in a context where only Rule 1 and Rule 3 infer underload is max(Truth value of Rule 1, Truth value of Rule 3). A conjunctive approach, employing the 'min' operator, is also possible. However, the set of rules would need to be jointly satisfied, representing a stricter proposal. Since exceptions are already defined in the knowledge-base, this paper only investigates the disjunctive case. The inference engine stops when each consequent level has a truth value assigned.
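The inference-engine steps above (operator tables, simultaneous refutation, disjunctive aggregation) can be sketched as follows. The t-norm/t-conorm definitions are the standard ones for the three operators named in Table 2; all membership grades in the example are hypothetical.

```python
from collections import defaultdict

# Standard definitions of the three selected operators (cf. Table 2).
T_NORMS = {                                   # AND (intersection)
    "zadeh": lambda a, b: min(a, b),
    "product": lambda a, b: a * b,
    "lukasiewicz": lambda a, b: max(0.0, a + b - 1.0),
}
T_CONORMS = {                                 # OR (union)
    "zadeh": lambda a, b: max(a, b),
    "product": lambda a, b: a + b - a * b,
    "lukasiewicz": lambda a, b: min(1.0, a + b),
}

def refute(nec_rule, nec_refuters):
    """Simultaneous refutation following Eq. (1):
    Nec(A) <- min(Nec(A), 1 - Nec(Q)) for every refuting proposition Q."""
    for nec_q in nec_refuters:
        nec_rule = min(nec_rule, 1.0 - nec_q)
    return nec_rule

def aggregate_disjunctive(inferences):
    """Group the truth values inferred for each consequent level with 'max'."""
    levels = defaultdict(float)
    for level, truth in inferences:
        levels[level] = max(levels[level], truth)
    return dict(levels)

# Rule 1 (IF low mental demand THEN underload) refuted by Exception 1
# (high effort):
truth_rule_1 = refute(0.8, [0.3])             # min(0.8, 1 - 0.3) = 0.7
truth_rule_3 = 0.4                            # another rule inferring underload
agg = aggregate_disjunctive([("underload", truth_rule_1),
                             ("underload", truth_rule_3)])
```

The dictionary `agg` then holds one truth value per mental workload level, ready for defuzzification.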

Defuzzification module
The output of the inference engine can be graphically seen as the aggregation of the consequents (mental workload levels) of the fuzzy inference rules, as exemplified in Fig. 6. Several methods can be used for calculating a single defuzzified scalar; two are selected here: mean of max and centroid. The first returns the average of all the consequents (here mental workload levels) of the fuzzy rules with maximal membership grade. The second returns the coordinates (x, y) of the centre of gravity of the geometric shape formed by the aggregation of the membership functions associated to each consequent (MWL level); the defuzzified scalar is then the x coordinate of the centroid. We assume this scalar represents a rational inference of mental workload. Following the non-monotonic fuzzy reasoning process described so far, a set of models is constructed with different fuzzy logic operators and defuzzification methods (as listed in Table 3). A graphical summary of the inferential process followed by these models is depicted in Fig. 5 (see Table 4).
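The two defuzzification methods can be sketched over a sampled universe as follows; the aggregated output shape below is a toy example, not one of the paper's actual model outputs.

```python
def centroid(xs, mu):
    """x-coordinate of the centre of gravity of the aggregated output,
    computed over a sampled universe xs with membership grades mu."""
    num = sum(x * m for x, m in zip(xs, mu))
    den = sum(mu)
    return num / den if den else 0.0

def mean_of_max(xs, mu):
    """Average of the points of the universe with maximal membership grade."""
    top = max(mu)
    maxima = [x for x, m in zip(xs, mu) if m == top]
    return sum(maxima) / len(maxima)

xs = list(range(0, 101))                          # sampled universe [0, 100]
mu = [0.5 if 40 <= x <= 60 else 0.0 for x in xs]  # toy aggregated output
c = centroid(xs, mu)       # 50.0 for this symmetric shape
m = mean_of_max(xs, mu)    # 50.0 as well
```

For asymmetric aggregated shapes the two methods diverge, which is why both are compared in the study.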

Argument-based models
The definition of argument-based models follows the five-layer modelling approach depicted in Figs. 1 and 7. Arguments are rule-based and not of any other form such as natural language or logical propositions. They are of two types: defeasible and defeater (Table 19) rules. The former contain premises that only create presumptions in favour of their conclusion, contrarily to strict rules, which logically entail their conclusion [91]. The latter are rules that cannot be used to draw any conclusion and are essentially used to prevent some conclusion. These definitions of rules are in line with defeasible logics [92]. In detail, defeasible arguments are rules that can be defeated by contrary knowledge, while defeater arguments are used to defeat some defeasible rule by producing knowledge to the contrary [92].

Table 4
Summary of the non-monotonic fuzzy reasoning models constructed via a human reasoner, grouped by knowledge-base, with references to their graphical representation and details of the membership functions adopted and their activation ranges, both for the antecedents (MfA) and consequents (MfC) of fuzzy rules.

[Excerpt of Table 7 (Appendix), NASA-TLX questionnaire items. Effort: How hard did you have to work (mentally and physically) to accomplish your level of performance? Performance: How successful do you think you were in accomplishing the goals of the task set by the experimenter (or yourself)? How satisfied were you with your performance in accomplishing these goals? Frustration: How insecure, discouraged, irritated, stressed and annoyed versus secure, gratified, content, relaxed and complacent did you feel during the task?]

The two types of arguments are formalised as follows.

Forecast argument: premises → conclusion
A forecast argument is composed of a set of premises and a conclusion derivable by applying an inference rule →. It is a defeasible rule that can be defeated by contrary knowledge [92].
To keep the terminology consistent with previous sections, we also refer to this as an ''IF-THEN'' rule. An example of a single-premise forecast argument is ''ARG 1: IF low mental demand THEN underload''. An example of a multiple-premise forecast argument is: ''HPF1: IF (high OR medium upper) effort AND (medium lower OR medium OR low) performance AND (low OR medium lower) frustration AND (low OR medium lower) mental demand THEN underload''. The linguistic terms of the attributes used in the premises of an argument and in its conclusion are strictly bounded in well-defined ranges (for example, High corresponds to [70, 100]) established by the human reasoner. Formally, a forecast argument has the form:

(i_1 ∈ [l_1, u_1]) AND/OR ... AND/OR (i_n ∈ [l_n, u_n]) → c ∈ [l_c, u_c]

where i_n ∈ R is the activation value of premise n, which is bounded in the numerical range [l_n, u_n] ∈ R (with l_n < u_n); [l_c, u_c] ∈ R is the numerical range of the consequent; AND, OR are logical operators.
The inference → is formalised as a mapping from the ranges of the premises to that of the conclusion, as proposed in [93]. Formally, in the simple single-premise case, which can be extended to an arbitrary number of premises, the value associated to the conclusion c is:

c = l_c + ((i_1 − l_1) / (u_1 − l_1)) · (u_c − l_c)    (2)

• if l_c < u_c then Eq. (2) models a linear relationship (the higher the value of the premises, the higher the value of the conclusion);
• if l_c > u_c then Eq. (2) models an inverse linear relationship (the higher the value of the premises, the lower the value of the conclusion);
• if l_c = u_c then Eq. (2) models a constant function whose inputs result in the same output (u_c). This is useful to model consequents with categorical levels.
Briefly, the above mapping provides a formula for rules that employ the logical operators AND/OR, replacing them with the min and max operators respectively [94]. Examples of forecast arguments can be found in Tables 11 and 20.
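The three cases described for the premise-to-conclusion mapping can be sketched as a linear rescaling from the premise range to the conclusion range. The numeric ranges below are illustrative only.

```python
def map_premise(i1, l1, u1, lc, uc):
    """Linearly rescale the activation value i1 from the premise range
    [l1, u1] to the conclusion range [lc, uc]. lc > uc yields an inverse
    relationship and lc == uc a constant output."""
    return lc + (i1 - l1) / (u1 - l1) * (uc - lc)

# AND/OR over multiple premises can be handled with min/max of activations.
direct   = map_premise(85, 70, 100, 0, 32)    # linear relationship
inverse  = map_premise(85, 70, 100, 32, 0)    # inverse linear relationship
constant = map_premise(85, 70, 100, 50, 50)   # constant (categorical level)
```

With the premise halfway through [70, 100], the direct mapping lands halfway through [0, 32] and the constant case always returns 50, matching the three bullet points above.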

Layer 2 -Definition of the conflicts of arguments.
In order to model inconsistencies, the notion of mitigating argument [42] is introduced. This is a defeater rule that cannot be used to draw any conclusion but only to prevent some other conclusion [92]. This type of argument is formed by a set of premises and an undercutting inference ⇒ to another argument, either forecast or mitigating:

Mitigating argument: premises ⇒ ¬argument

The inference ⇒ is an undercutting attack, and a mitigating argument can be seen as an exception which has the effect of undermining the validity of another argument. Mitigating arguments are used to express the uncertainties of a reasoner concerning the validity of forecast arguments or other mitigating arguments. An example of a mitigating argument that can be constructed from Exception 1 (Section 3.1) is: ''UA1: high effort ⇒ ¬ARG1''. Undercutting attacks do not allow partial refutation, so their target is always fully discarded. This means that no notion of strength of arguments nor strength of attacks is considered. Another type of attack relation is the rebutting attack, as mentioned in Section 2. This occurs when a forecast argument negates the conclusion of another forecast argument. A rebuttal occurs when the conclusions of two forecast arguments are believed to be mutually exclusive according to a reasoner. The two arguments attack each other, resulting in a symmetric attack:

Rebutting attack: forecast argument ⇔ forecast argument

An example of a rebutting attack is: ''R1: MD1 ⇔ FR2'' (Table 12 and Fig. 18 in Appendix).
The third type of attack is the undermining attack which occurs when a forecast argument challenges the premises of another forecast argument. In other words, the conclusion of a forecast argument attacks the premises of another forecast argument. This allows a reasoner to model a situation in which the premises used to construct a particular forecast argument are no longer applicable given the inference of another forecast argument.
Undermining attack: forecast argument ⇒ ¬forecast argument

An example of an undermining attack is: ''U1: R1 undercuts FR2''. More on the benefits of having undermining attacks in addition to rebutting and undercutting attacks is discussed in detail in [91]. The union of the designed arguments and attacks can now be seen as an argumentation framework (AF). Formally, an argumentation framework is a pair ⟨Args, Attacks⟩, where Args is a set of arguments (either forecast or mitigating) and Attacks ⊆ Args × Args is the set of attacks (either undercutting, rebutting or undermining). Examples of argumentation frameworks are those depicted in Figs. 18, 19 and 22, which also represent the three knowledge-bases used in this research.

Fig. 11. Convergent validity of the non-monotonic fuzzy reasoning models (grey) and the argument-based models (white) against the Nasa Task Load Index, measured by the Spearman correlation coefficient, grouped by knowledge-base and ordered from lower to higher.

Fig. 12. Face validity of the non-monotonic fuzzy reasoning models (grey) and the argument-based models (white) against the subjective perception of mental workload, ordered from lower to higher.

Fig. 13. Face validity of the non-monotonic fuzzy reasoning models (grey) and the argument-based models (white) against the subjective perception of mental workload, grouped by knowledge-base and ordered from lower to higher.

Fig. 14. Mean squared errors of the inferences of the non-monotonic fuzzy reasoning models (grey) and the argument-based models (white) against the subjective perception of mental workload, ordered from lower to higher.

Layer 3 -Evaluation of the conflicts of arguments
At this stage an AF can be elicited with data. Forecast and mitigating arguments can be activated or discarded, based on whether their premises evaluate to true or false. Attacks between activated arguments are considered valid, while all the other designed attacks are discarded. Contrarily to the fuzzy reasoning systems designed in Section 3.1, there is no partial refutation, so a successful attack always fully refutes its target. In other words, no notion of strength of arguments or strength of attacks is considered here. The set of activated forecast and mitigating arguments, as well as the set of valid attacks, forms a sub-argumentation framework (subAF). A subAF is the same as or a subset of the original AF, and it contains that portion of a knowledge-base (rules and attacks) that has been activated with data and evaluates to true:

subAF = ⟨Args′, Attacks′⟩, with Args′ ⊆ Args and Attacks′ = Attacks ∩ (Args′ × Args′)

Fig. 15. Mean squared errors of the inferences of the non-monotonic fuzzy reasoning models (grey) and the argument-based models (white) against the subjective perception of mental workload, grouped by knowledge-base and ordered from lower to higher.
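The restriction of an AF to its activated portion can be sketched as follows; the argument labels are hypothetical.

```python
def sub_framework(args, attacks, activated):
    """Restrict an AF to the arguments whose premises evaluated to true:
    an attack remains valid only if both its endpoints are activated."""
    sub_args = set(args) & set(activated)
    sub_attacks = {(a, b) for (a, b) in attacks
                   if a in sub_args and b in sub_args}
    return sub_args, sub_attacks

# Hypothetical AF with three arguments, where 'UA1' attacks 'ARG1':
args = {"ARG1", "ARG2", "UA1"}
attacks = {("UA1", "ARG1")}
# Data activates ARG1 and ARG2 but not UA1, so UA1's attack is discarded:
sub_args, sub_attacks = sub_framework(args, attacks, {"ARG1", "ARG2"})
```

The resulting pair is the subAF passed to the acceptability semantics of layer 4.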

Layer 4 -Definition of the dialectical status of arguments
Given a subAF, acceptability semantics are applied in order to accept or reject its arguments. Here, a subAF is equivalent to the notion of Abstract Argumentation Framework (AAF) proposed by Dung [16], because the internal structure of arguments is not considered; that is, it is abstracted away for the acceptance of arguments. In detail, arguments are considered in a dialogical structure, an interconnected graph of nodes (the arguments), and semantics are applied for evaluating which of them are defeated. An argument A is defeated by B if there is a valid attack from B to A [16]. Not only that, but it is also necessary to evaluate whether the defeaters are themselves defeated. A set of non-defeated arguments is called an extension (a conflict-free set of arguments).
Formally, an AAF is a pair ⟨Arg, attacks⟩, where Arg is a finite set of arguments, abstractly evaluated, and attacks ⊆ Arg × Arg is a binary relation over the set Arg. Given sets of arguments X, Y ⊆ Arg, X attacks Y if and only if there exist x ∈ X and y ∈ Y such that (x, y) ∈ attacks. A set X ⊆ Arg of arguments is:
• admissible iff X does not attack itself and X attacks every set of arguments Y such that Y attacks X;
• complete iff X is admissible and X contains all arguments it defends, where X defends x if and only if X attacks all attackers of x;
• grounded iff X is minimally complete (w.r.t. ⊆);
• preferred iff X is maximally admissible (w.r.t. ⊆).
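A minimal sketch of the grounded semantics is given below, computing the grounded extension as the least fixed point of the characteristic function F(S) = {a : S defends a}, iterated from the empty set. Arguments are plain labels; the example framework is invented.

```python
def grounded_extension(args, attacks):
    """Grounded extension of an AAF (args, attacks) via fixed-point
    iteration of the characteristic function, starting from the empty set."""
    attackers = {a: {x for (x, y) in attacks if y == a} for a in args}
    ext = set()
    while True:
        # an argument is acceptable w.r.t. ext if ext attacks all its attackers
        new = {a for a in args
               if all(any((d, b) in attacks for d in ext)
                      for b in attackers[a])}
        if new == ext:
            return ext
        ext = new

# A attacks B, B attacks C: A is unattacked and defends C against B.
grounded_extension({"A", "B", "C"}, {("A", "B"), ("B", "C")})
# -> {"A", "C"}
```

Because the iteration is monotone and starts from the empty set, it converges to the unique grounded extension; for a mutual attack between two arguments it correctly returns the empty set.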
Extensions are in turn used in the 5th layer of the diagram of Fig. 1, to produce a final inference.
Layer 5 - Accrual of acceptable arguments

Eventually, in the last step of the reasoning process, a final inference sometimes has to be produced for practical purposes. For example, in case multiple extensions are computed, one extension might be preferred over the others. In this study, one assessment of mental workload has to be produced for practical purposes; thus a mechanism for accruing the arguments in multiple extensions is needed. Intuitively, we argue that a larger conflict-free extension of arguments might be seen as carrying more consistent information towards the description of the underlying phenomenon being modelled (mental workload) than smaller extensions. Aware that other justifications are possible, we believe that the cardinality of an extension (the number of conflict-free accepted arguments within it) can be used as a practical mechanism for the selection of the most suitable one. In case more extensions exist with the same highest cardinality, the proposal is to take all of them and their arguments into consideration, since a clear consistent inferential point of view (extension) has not emerged. After the selection of the most suitable extension(s), a single scalar is produced through the accrual of the consequents of its/their arguments. This is a simple average of the consequents of the accepted forecast arguments within an extension (those that support a mental workload level, as computed in layer 1). Mitigating arguments have already played their role by contributing to the resolution of conflictual information (layer 4); they do not support any mental workload level and thus are not considered here. Formally, the overall inference of mental workload, brought forward by an extension (or multiple extensions), is computed by aggregating the scalars of its forecast arguments:

MWL = (1 / (m · n)) · Σ_{z=1..m} Σ_{i=1..n} c(arg_{i,z})

with c the value of the consequent of forecast argument arg_i in extension z, as computed in Eq. (2), n the number of arguments in extension z and m the number of extensions with the highest cardinality (m, n ≥ 1). Table 5 summarises the design of the argument-based models and their configuration following the 5-layer structure described above (Fig. 7).
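The accrual step can be sketched as follows: select the extension(s) of highest cardinality, then average the Eq. (2) scalars of their forecast arguments. The argument labels and scalar values below are hypothetical.

```python
def accrue(extensions, consequent):
    """Final mental workload scalar: average the consequent values of the
    forecast arguments belonging to the extension(s) of highest cardinality.
    extensions: list of sets of argument labels
    consequent: dict argument label -> scalar computed in layer 1."""
    top = max(len(e) for e in extensions)
    best = [e for e in extensions if len(e) == top]   # ties are all kept
    values = [consequent[a] for ext in best for a in ext]
    return sum(values) / len(values)

# Two extensions; the first has the highest cardinality, so only its
# forecast arguments are accrued:
accrue([{"f1", "f2"}, {"f3"}],
       {"f1": 20.0, "f2": 60.0, "f3": 90.0})   # -> 40.0
```

When several extensions tie on the highest cardinality they all contribute, which matches the proposal above of not committing to a single inferential point of view.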

Participant and procedures
A number of third-level classes have been delivered to students at the (anonymous institution). Each student was provided with a study information sheet and a consent form, approved by the ethical committee of the university. After each class, students had to fill in the questionnaire associated to the NASA-TLX instrument (Tables 7-8, Appendix). Students were from 23 different countries (age 21-74, mean 30.9, std 7.67). Four different topics of the module 'research methods', delivered in the school of computing at the above institution, were delivered in different semesters during the academic terms 2015-2018. Three different delivery methods of instructional material were used by the lecturer:
1. traditional direct instruction, using slides projected onto a whiteboard. The lecturer introduced and explained the contents as in a traditional one-way class, without interaction with students;
2. multimedia video of the same content as in 1, projected onto a whiteboard. In this case the delivery of the videos was significantly shorter than the delivery of the instructional material with method 1;
3. constructivist collaborative activity performed after delivery method 2. The lecturer randomly formed groups of students (max 4), provided them with the handouts of the class and a set of trigger questions on the instructional material previously delivered, to allow reiteration of the information and to enhance learning.
These delivery methods are assumed to impose different experiences of mental workload on students. Overall, 231 students participated in the study (summary statistics are described in Table 6). After completing the questionnaire associated to the multidimensional NASA-TLX instrument for mental workload assessment, participants were required to answer a further question providing an indication of their experienced mental workload using a uni-dimensional self-reporting scale (Fig. 8).
The assumption behind self-reporting measures of mental workload is that only the person executing a task can provide a precise account of their own experienced workload. This question was also designed to support the computation of the face validity of each model, as defined in Table 1. Each argument-based and fuzzy reasoning model was elicited with the data associated to each student, and a mental workload inference was produced for each of them. The following section is aimed at presenting and interpreting these inferences.

Results
The answers to each item of the NASA-TLX questionnaire (Tables 7 and 8) provided by each student were used to elicit each of the designed non-monotonic fuzzy reasoning models (eighteen models in Table 3) and argument-based models (six models in Table 5). Fig. 9 depicts the distribution of the NASA-TLX scores for the entire population (avg: 46.82; sd: 13.05) and the distribution of the uni-dimensional self-reported mental workload answers for the entire population of students (avg: 53.76; sd: 14.83). Both approximately follow a Gaussian distribution. Fig. 23 depicts the scatterplots of the inferences produced by the non-monotonic fuzzy reasoning and argument-based models against the uni-dimensional self-reported mental workload scores as reported by students.

Convergent validity
Figs. 10 and 11 depict the Spearman correlation coefficients (p < 0.05) of the inferences produced by the designed models (mental workload scores) against the NASA-TLX scores, respectively ordered overall and grouped by knowledge-base. The Spearman test was used because the assumptions behind the Pearson correlation test were not met. Moderate to high correlation coefficients were generally observed (ρ within 0.3-0.8). This indicates that the assumed theoretical relationship between the NASA-TLX measure, known in the literature to fairly model the ill-defined construct of mental workload, and the designed models does in fact exist. Additionally, a deeper analysis of these correlations reveals that argument-based models (white bars of Fig. 11) generally lead to inferences closer to the baseline (NASA-TLX) when compared to those of the non-monotonic fuzzy reasoning models. In detail, this difference is more accentuated (Fig. 11, bottom) when considering the third knowledge-base (Fig. 22). A reasonable interpretation is that this knowledge-base contains a larger amount of conflictual information when compared to the other knowledge-bases. This suggests that argument-based models seem to handle non-monotonicity of knowledge and information on average significantly better than fuzzy reasoning models. With respect to model parameters, the inferences produced by fuzzy reasoning models do not seem to be significantly affected by the different fuzzy operators used (Zadeh, Product, Lukasiewicz) nor by the defuzzification method (centroid or mean of max) when analysed by considering the different knowledge-bases (Fig. 11). A similar situation exists for the argument-based models, whose inferences seem not to be affected by the semantics used (grounded, preferred). However, this was expected, because when preferred semantics produces only one extension of arguments, this coincides with the grounded extension produced by grounded semantics.
Face validity

Figs. 12 and 13 depict the Spearman correlation coefficients (p < 0.05) of the inferences produced by the designed models against the perception of mental workload reported by students, respectively ranked from lower to higher and grouped by knowledge-base. Similarly to the analysis of convergent validity, the Spearman test was used because the assumptions behind the Pearson correlation test were not met. Moderate correlation coefficients were generally observed (ρ within 0.3-0.5).
This indicates that the fuzzy non-monotonic reasoning models and the argument-based models appear moderately effective in following the perception of mental workload subjectively reported by students. Moreover, the correlations produced by the argument-based models (white bars of Fig. 12) are nearly the same as those of the mental workload scores of the baseline model (the NASA-TLX), considered the gold-standard measure for subjective mental workload, against the subjective perception of mental workload. This suggests that their face validity is very good and in line with state-of-the-art models [22]. Additionally, the results of Fig. 13 clearly show that the correlations of the inferences produced by the argument-based models against the perception of subjective mental workload reported by students, when grouped by knowledge-base, are always higher than the correlations produced by the fuzzy non-monotonic reasoning models, and this is again more accentuated with knowledge-base 3.
A deeper analysis has been performed by investigating the error (distance) between the inferences produced by the designed models and the subjective perception of mental workload reported by students (as per the scale in Fig. 8). Figs. 14 and 15 depict the mean squared errors (MSEs) of the inferences of each designed model, respectively overall and grouped by knowledge-base. As can be observed, the argument-based models not only have almost always a lower MSE than the fuzzy non-monotonic reasoning models, but this time they were also often better than the baseline instrument (NASA-TLX). Additionally, among the non-monotonic fuzzy reasoning models, the inferences of those employing the centroid as defuzzification method appear to be better than the ones employing the mean of max. This is also confirmed by the scatterplots of Fig. 23, which show how the fuzzy models employing the mean of max defuzzification method produced inferences clustered around different points with low spread (last three rows of scatterplots), when compared to those inferences produced by models applying the centroid method (third and fourth rows of scatterplots). In relation to the inferences by argument-based models, Fig. 15 confirms again that when knowledge-bases are characterised by higher uncertainty, due to a higher number of interacting and conflicting pieces of knowledge (knowledge-base 3, bottom of figure), they behave significantly better than fuzzy reasoning models. This can also be observed in the scatterplots of Fig. 23: when it comes to knowledge-base 3 (third column of scatterplots), the inferences of the argument-based models are more spread than those produced by the non-monotonic fuzzy reasoning models. In summary, argument-based models showed a lower mean squared error when compared to the self-reported mental workload over the non-monotonic fuzzy reasoning models, in addition to a slight improvement also against the baseline model (NASA-TLX).

Table 23
Undercutting attacks originating from designed mitigating arguments for knowledge-base 3.

Label  Attack
U1     M1a undercuts fMD5
U2     M1b undercuts fTD5
U3     M1c undercuts fMD1
U4     M1d undercuts fTD1
U5     M1e undercuts fP1
U6     M2a undercuts fP1
U7     M2b undercuts fEF1
U8     M2c undercuts fEF2
U9     M2d undercuts fTD1
U10    M2e undercuts fMD1
U11    M3 undercuts fF2
U12    M4a undercuts fMD1
U13    M4b undercuts fTD1
Eventually, when results were compared across knowledge-bases, it was evident that when a higher degree of conflictuality of information was present (as in knowledge-base 3), defeasible argumentation seemed to be a better modelling tool for handling non-monotonicity when compared to fuzzy reasoning. In other words, defeasible argumentation allowed the construction of models of mental workload with a higher convergent and face validity than those constructed via non-monotonic fuzzy reasoning, especially when higher uncertainty and conflictuality of information characterise the knowledge-bases.

Conclusion and future work
This study presented a comparison between non-monotonic fuzzy reasoning and non-monotonic defeasible argumentation using three knowledge-bases of different complexity coded from an expert human reasoner in the domain of mental workload. A primary research study has been conducted, including the construction of computational models using these two non-monotonic reasoning approaches to represent the construct of mental workload and to allow its assessment. Such models were constructed from the same information used by a well-known subjective model of mental workload, namely the Nasa Task Load Index, which was also used as a baseline. The elicitation of these models was made possible by using data gathered in third-level classrooms from students who attended different topics delivered in different ways. The inference produced by each of these models was a single scalar representing an assessment of mental workload, used for comparison purposes. The metrics selected for the evaluation of the inferential capacity of these designed models were convergent and face validity. The former indicates whether the inferences of the models are coherent with the assessments produced by the selected baseline model (NASA-TLX). The latter is a form of logical validity and indicates whether these models are actually measuring what they are supposed to measure, namely mental workload. Findings indicated that both the models built with non-monotonic fuzzy reasoning and those built with defeasible argumentation had a good convergent validity with the baseline, confirming that they too model mental workload as a construct. However, the argument-based models had a superior face validity over the non-monotonic fuzzy reasoning models across the three different knowledge-bases.
A deeper analysis of the inferences of the constructed models indicated that, when a knowledge-base is characterised by a higher degree of uncertainty and conflictuality, defeasible argumentation is more suitable for handling non-monotonicity than fuzzy reasoning. The first contribution of this research lies in the proposal and application of a comparative research design aimed at evaluating the impact of different quantitative non-monotonic reasoning approaches on knowledge-representation problems under uncertainty, following and extending previous work in the field [21]. The second contribution is the execution of an experiment that is not purely based upon an analysis of the topological properties of graphs of arguments or the formal characteristics of different reasoning techniques, but rather on an empirical quantification of their inferential capacity, by constructing models with human reasoners and by employing data collected in a real-world domain of application. Scholars in the field of logic are provided with a replicable approach for empirically comparing different formalisms for non-monotonic reasoning. Future work will focus on the replication of this research study by considering further knowledge-bases of different complexities and degrees of conflictuality, and by extending the comparison of defeasible argumentation to other reasoning approaches such as non-monotonic expert systems [95]. It will also concentrate on the creation of different models of arguments by employing different semantics for the computation of their dialectical status, such as ranking-based semantics [65], and on their application to other real-world knowledge-representation and reasoning problems under uncertainty. Eventually, the comparison can also be extended to other argument-based systems and to hybrid systems such as fuzzy argumentation, together with an in-depth evaluation of their explainability.