An empirical evaluation of the inferential capacity of defeasible argumentation, non-monotonic fuzzy reasoning and expert systems

Several non-monotonic formalisms exist in the field of Artificial Intelligence for reasoning under uncertainty. Many of these are deductive and knowledge-driven, and also employ procedural and semi-declarative techniques for inferential purposes. Nonetheless, limited work exists on the comparison of distinct techniques and, in particular, on the examination of their inferential capacity. Thus, this paper focuses on a comparison of three knowledge-driven approaches employed for non-monotonic reasoning, namely expert systems, fuzzy reasoning and defeasible argumentation. A knowledge-representation and reasoning problem has been selected: modelling and assessing mental workload. This is an ill-defined construct, and its formalisation can be seen as a reasoning activity under uncertainty. An experimental work was performed by exploiting three deductive knowledge bases produced with the aid of experts in the field. These were coded into models by employing the selected techniques and were subsequently elicited with data gathered from humans. The inferences produced by these models were in turn analysed according to common metrics of evaluation in the field of mental workload, specifically validity and sensitivity. Findings suggest that the variance of the inferences of expert systems and fuzzy reasoning models was higher, highlighting poor stability. Contrarily, that of argument-based models was lower, showing a superior stability of their inferences across knowledge bases and under different system configurations. The originality of this research lies in the quantification of the impact of defeasible argumentation. It contributes to the field of logic and non-monotonic reasoning by situating defeasible argumentation among similar approaches of non-monotonic reasoning under uncertainty through a novel empirical comparison.


Introduction
Uncertainty associated with incomplete, imprecise or unreliable knowledge is inevitable in daily reasoning and in many real-world contexts. Within Artificial Intelligence (AI), many approaches have been proposed for the development of inferential models capable of addressing such uncertainty. Among them, non-monotonic reasoning emerged from the area of logical AI as an alternative to deductive inferences in logical systems, which were perceived as inadequate for decision making in realistic situations (Bochman, 2007). Hence, reasoning is non-monotonic, or defeasible, when a conclusion can be withdrawn in the light of new information (Reiter, 1988; McCarthy, 1980; Kowalski & Sadri, 1991; Longo, 2015; Brewka, 1991). A number of approaches for dealing with quantitative reasoning under uncertainty exist (Parsons & Hunter, 1998), including computational argumentation (also referred to as defeasible argumentation) (Prakken & Vreeswijk, 2001), fuzzy reasoning (Zadeh et al., 1965) and expert systems (Durkin & Durkin, 1998). These approaches have led to the development of non-monotonic reasoning models based upon knowledge bases often provided by human experts. Intuitively, since these models are developed with a human-in-the-loop intervention, their reasoning processes and their inferences have an intrinsically higher degree of interpretability and transparency when compared to data-driven approaches for inference. Moreover, they assist in the creation of models that can be verified, replicated and expanded, thus enhancing the trustworthiness of domain experts towards automated inferences and the understanding of the application under investigation. Nonetheless, these approaches have unique features that differentiate them. For instance, previous studies (Rizzo et al., 2018b,a) suggest that defeasible argumentation offers more powerful conflict resolution strategies; fuzzy reasoning is suitable for a robust representation of linguistic information through the application of fuzzy membership functions; and expert systems focus on imitating the problem-solving ability of an expert. These approaches have all been extensively used in practical domains such as medicine, the pharmaceutical industry and engineering (Longo, 2016; Glasspool et al., 2006; Mardani et al., 2015; Liao, 2005). However, scholars have predominantly focused on their individual application for non-monotonic reasoning, and have barely attempted to empirically investigate their differences in terms of inferential capacity.
The aim of this study is to empirically evaluate the inferential capacity of defeasible argumentation models when compared to models produced by other well-established reasoning approaches, in this case non-monotonic fuzzy reasoning and expert systems. This evaluation can clarify the predictive accuracy of the investigated reasoning models, allowing defeasible argumentation to be better situated among similar reasoning approaches and enabling different applications and experiments to be carried out. To achieve this goal, the problem of representing the construct of Mental Workload (MWL) has been chosen.
MWL is an ill-defined construct with no clear and widely accepted definition.
In a nutshell, it can be seen as the amount of mental activity devoted to a certain task over time (Cain, 2007). A number of knowledge bases, developed by experts in MWL, were employed as the basis of the modelling and assessment done by the selected approaches. The resulting models are used to infer mental workload scalars employed for achieving the envisioned comparison. In particular, the inferential capacity is compared and quantified in terms of the validity and sensitivity (O'Donnell & Eggemeier, 1986) of the produced inferences.

Figure 1: Streamlined design of the study using three non-monotonic reasoning approaches for mental workload modelling, compared according to their inferential capacity.
The remainder of this paper continues with Section 2 providing the related work on non-monotonic reasoning, knowledge-based techniques for dealing with non-monotonic problems and a precise description of the construct of MWL.
Section 3 presents the design of the empirical experiment aimed at answering the above research question and the tasks performed by participants of the study in order to collect information for the inference of MWL. The results, the analysis and the discussion of this experiment are provided in Section 4. Finally, Section 5 concludes the study and provides recommendations for future research.

Literature and related work
Inconsistent and conflicting pieces of information are often involved in real-world argumentative activities. To handle these, classical propositional logic has been shown to be inadequate due to its monotonicity property (Reiter, 1980).
In monotonic reasoning, a knowledge base of reasons supporting certain conclusions, usually provided by domain experts, may only grow monotonically with new reasons, not allowing the retraction of previous conclusions. Therefore, defeasible reasoning has emerged as a potential solution to this problem, since it is aimed at formalising non-monotonic reasoning activities (Dung, 1995; Rahwan & Simari, 2009; Chesñevar et al., 2000). This section introduces some of the main non-monotonic formalisms and a few works that have attempted to make a comparison among them. Subsequently, knowledge-based approaches, in particular expert systems, non-monotonic fuzzy reasoning and defeasible argumentation, are explained in depth. The theories in which these approaches are grounded are used as the building blocks for the development of the non-monotonic reasoning models of inference employed in the context of human mental workload.
To the best of the authors' knowledge, there is a lack of comparisons among knowledge-based systems adopted for quantitative reasoning under uncertainty.
Hence, the main goal is to provide the reader with the intuitions and the required knowledge for comparing defeasible argumentation with similar reasoning approaches.

Non-monotonic reasoning

In non-monotonic reasoning, conclusions can be retracted in the light of new reasons. In other words, non-monotonic reasoning relies on the idea that a claim can be defeasibly derived from partially specified premises, but in the case of an exception arising the claim can be withdrawn (Kowalski & Sadri, 1991). Many non-monotonic reasoning formalisms exist in Artificial Intelligence (Brewka, 1991), for instance inheritance networks with exceptions (Horty et al., 1990) or semantic networks using Dempster's rule (Ginsberg, 1984). Other examples include non-monotonic logics like circumscription (McCarthy, 1980), autoepistemic logic (Moore, 1985) and default logic (Reiter, 1980). Brewka et al. (1997) provide a comprehensive overview of non-monotonic logics categorised into modal preference logics, fixed point logics and abductive methods. The recent work of Hlobil (2018) presents a guideline for the selection of non-monotonic logics based on the principles they reject, such as the Deduction-Detachment Theorem and Cumulative Transitivity (Czelakowski, 1985; Gabbay & Guenthner, 1984), resulting in 17 different types of logics. A few works have proposed the extension of rule-based approaches, such as expert systems and fuzzy reasoning systems, to incorporate a non-monotonic layer (El-Azhary et al., 2002; Nute et al., 1990; Siler & Buckley, 2005; Castro et al., 1998; Morgenstern & Singh, 1997). An alternative approach for performing non-monotonic reasoning is given by argumentation systems, as proposed in early studies (Birnbaum et al., 1980; Lin & Shoham, 1989) and other thorough surveys (Atkinson et al., 2017; Chesñevar et al., 2000). These systems formalise non-monotonic reasoning through the construction of arguments that can support or oppose certain conclusions.

Nonetheless, only a few works have proposed a comparison among these formalisms. For instance, Delladio et al. (2006) investigate the relations between a normal default logic and a variant of defeasible logic programming. Dutilh Novaes & Veluwenkamp (2017) make an empirical test of the accuracy of two formal non-monotonic reasoning models: preferential logic and screened belief revision. Yang et al. (2004) compare first order predicate logic, fuzzy logic and non-monotonic logic implemented through negation as failure. Despite highlighting interesting connections among these formalisms, the focus of these studies is usually theoretical or limited by a narrow scope. In this study, three knowledge-based systems are investigated: expert systems, non-monotonic fuzzy reasoning and defeasible argumentation. Knowledge-based systems are better suited for capturing the intuitions of a specific problem when compared to non-monotonic logics or other proof-theoretic formalisms. Since rules or arguments have to be predefined, only relevant non-monotonic contexts are modelled, leaving little, if any, room for confusion. The next subsections provide readers with further specific information on these.

Expert systems

First developed by the AI community in the 1960s, expert systems are computer programs created to emulate a human expert in a given field (Durkin & Durkin, 1998). In a nutshell, they try to transfer a vast body of specific knowledge from a human to a computer. In turn, the computer can make inferences and reach a justifiable conclusion. With respect to expert system methodologies, some examples include rule-based systems, knowledge-based systems and fuzzy expert systems (Liao, 2005). Respectively, rule-based systems are based on rules typically of the form "IF (antecedent) THEN (consequent)"; knowledge-based systems are human-centred, focusing on the users, their needs and requirements; and fuzzy expert systems employ fuzzy logic for dealing with uncertainty and linguistic terms. Nonetheless, regardless of the methodology, expert systems are usually built upon two internal components: a knowledge base and an inference engine (Durkin & Durkin, 1998). The former is provided by a human expert and generally translated into a set of logical rules. The latter is aimed at eliciting, firing and aggregating such rules towards a conclusive inference. Moreover, engines might employ common strategies for producing inferences, such as backward-chaining and forward-chaining inferencing. In both cases, reasoning is exploited in a multi-step process in order to prove some goal or hypothesis. For instance, in a backward-chaining inference process, rules that contain a goal in their consequent part are collected and fired if their premises (the antecedents) evaluate true. In turn, such premises might be supported by other rules, causing the system to define sub-goals and to work in a recursive fashion. Mirroring that behaviour, a forward-chaining inference process starts by firing rules whose premises match the information initially available. In turn, fired rules might trigger the firing of new rules, leading to a continuation of the process until the goal is reached or no other rule can be fired. If multiple rules are fired, both forward-chaining and backward-chaining engines might employ some conflict resolution strategy. Common methods include choosing the first rule located, deciding a priority for each rule or firing all possible lines of reasoning.
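To make the chaining mechanism concrete, the following is a minimal sketch, in Python, of a forward-chaining engine with a simple first-rule conflict resolution strategy. The rules and facts are hypothetical illustrations and are not part of the knowledge bases employed in this study:

```python
# Minimal forward-chaining sketch: a rule fires when all its premises are known
# facts, and each firing may enable further rules until the goal is derived.
from typing import NamedTuple

class Rule(NamedTuple):
    premises: frozenset   # facts required in working memory
    conclusion: str       # fact added when the rule fires

def forward_chain(rules, facts, goal):
    facts = set(facts)
    fired = True
    while fired and goal not in facts:
        fired = False
        for rule in rules:
            # fire the first applicable rule (a simple conflict resolution strategy)
            if rule.premises <= facts and rule.conclusion not in facts:
                facts.add(rule.conclusion)
                fired = True
                break
    return goal in facts

# Hypothetical mental-workload rules, for illustration only
rules = [
    Rule(frozenset({"low mental demand", "low effort"}), "low MWL"),
    Rule(frozenset({"low MWL", "good performance"}), "underload"),
]
print(forward_chain(rules, {"low mental demand", "low effort", "good performance"}, "underload"))  # True
```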

Other types of expert systems can also be found in the literature, such as frame-based expert systems or probabilistic expert systems (Durkin & Durkin, 1998; Spiegelhalter et al., 1993).
Concerning areas of application, expert systems have been prominently used in fields like medicine and robotics (Nohria, 2015; Singholi & Agarwal, 2018).
For instance, medicine presents strong motivators for the development of medical expert systems, such as the lack of specialists and of health facilities.
Most often these domains also require interpretable systems: medical professionals need the possibility to understand the reasoning behind a machine and the causes that led it to make a decision. Therefore, in the medical area, diagnosis and treatment of diseases are the main goals, with expert systems built for the treatment of influenza, risk of hypertension, memory loss, liver disorders and others (Nohria, 2015). In turn, robotics presents systems developed for fault detection and fault tolerance, path and trajectory planning, vision control, mobile robot control, obstacle detection for industrial robots and so on. The integration of expert systems and robotics is a step forward in factory automation, still actively researched by the AI community (Singholi & Agarwal, 2018). A wide range of other applications can be found in the expert system literature. Liao (Liao, 2005) provides a review of a decade of literature, with a considerable amount of specific applications organised by system methodology, such as: teaching, agriculture, financial analysis, knowledge management, climate forecasting, decision making, urban design, psychiatric treatment, sensor control, waste water treatment and others. In addition, due to their precondition of encoding human knowledge bases, expert systems have naturally made use of different approaches for knowledge representation, as presented in Hvam et al. (Hvam et al., 2008). These might include graphical notations, logic, scientific formulas and rules. In more specific cases, Mitra and Basu (Mitra & Basu, 1997) implement an expert system which contains distinct knowledge representation schemes for designing microprocessor-based systems, while Hatzilygeroudis and Prentzas (Hatzilygeroudis & Prentzas, 2004) propose the integration of symbolic rules, neural networks and cases for the enhancement of knowledge representation and reasoning in expert systems.
Ultimately, non-monotonic techniques have been employed in expert systems in different ways (Gabbay, 1985) and adopted in industry with some difficulty (Morgenstern, 1998). A few examples include non-monotonic techniques modelled through inheritance methods (Morgenstern & Singh, 1997), defeasible logic (Nute et al., 1990) and default reasoning (El-Azhary et al., 2002). Here, the notions of "contradictions" or "exceptions" are employed. These are defined by domain experts, and describe special cases in which a rule is no longer valid and has to be retracted from the reasoning process.

Non-monotonic fuzzy reasoning
Fuzzy set theory, as proposed by Zadeh (Zadeh et al., 1965), uses the notion of a membership function, a special function that assigns to each object or linguistic term a grade of membership in the range [0, 1] ⊂ R. Fuzzy sets are formed by fuzzy objects and include notions similar to those of classical set theory, such as inclusion, union and intersection. A fuzzy control system, or fuzzy expert system, is a control system based on fuzzy reasoning. It is usually formed by a set of inputs defined as fuzzy sets, a rule set and a defuzzification module (Passino et al., 1998). This process is characterised as Mamdani fuzzy inference (Mamdani, 1974) (Fig. 2) and is the approach employed in this study. Moreover, two other types of fuzzy inference methods are commonly found in the literature. The first, Takagi-Sugeno fuzzy inference (Takagi & Sugeno, 1993), presents the same fuzzification process; however, the output membership functions are always linear or constant, producing in either case a single number. On the one hand, there is no defuzzification process; on the other hand, it is necessary to define weighting mechanisms or parameters for the linear output functions in order to compute a final crisp value. The second, Tsukamoto fuzzy inference (Tsukamoto, 1979), also differs from the other types only in its output membership functions. In this case, the consequent of each rule is a crisp value defined by a monotonic membership function and the real input of the associated rule. Intuitively, it is a combination of the Mamdani and the Takagi-Sugeno fuzzy inference methods.
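As an illustration of the Mamdani process, the following is a minimal, self-contained sketch of a single fuzzification-inference-defuzzification cycle. The universes of discourse, fuzzy sets and rules are hypothetical and purely illustrative:

```python
import numpy as np

def trimf(x, a, b, c):
    """Triangular membership function over universe x (assumes a < b < c)."""
    return np.clip(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0, 1.0)

# Universes of discourse (assumed scales, for illustration only)
x_effort = np.linspace(0, 100, 1001)
x_mwl = np.linspace(0, 100, 1001)

# Hypothetical fuzzy sets for the input and output linguistic terms
effort_low  = trimf(x_effort, -1, 0, 40)
effort_high = trimf(x_effort, 60, 100, 101)
mwl_under   = trimf(x_mwl, -1, 0, 40)
mwl_over    = trimf(x_mwl, 60, 100, 101)

def mamdani(effort_value):
    # Fuzzification: degree to which the crisp input belongs to each fuzzy set
    mu_low  = np.interp(effort_value, x_effort, effort_low)
    mu_high = np.interp(effort_value, x_effort, effort_high)
    # Implication (min) per rule, then aggregation (max) across the two rules:
    #   IF effort is low  THEN MWL is underload
    #   IF effort is high THEN MWL is overload
    aggregated = np.maximum(np.minimum(mu_low, mwl_under),
                            np.minimum(mu_high, mwl_over))
    # Defuzzification via centroid (assumes at least one rule fired)
    return np.sum(x_mwl * aggregated) / np.sum(aggregated)

print(round(mamdani(20.0), 1))  # a crisp MWL scalar in [0, 100]
```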
Figure 2: General structure of a Mamdani fuzzy inference process (Cordón, 2011).

Since the original development of fuzzy set theory by Zadeh (Zadeh et al., 1965), the range of its applications has been vast. Examples of application domains include pattern recognition, decision making, signal processing, control engineering, medicine, finance and many others. Precup and Hellendoorn (Precup & Hellendoorn, 2011) present an extensive survey of industrial applications of fuzzy control. In particular, numerous applications of Mamdani fuzzy control systems have been reported in the fields of robotics, the automotive industry and the process industry. Due to concerns about the accuracy of such applications, learning techniques have also been incorporated into fuzzy control systems in order to deal with the interpretability-accuracy trade-off (Cordón, 2011), leading to the fields of neuro-fuzzy systems (Nauck et al., 1997) and genetic fuzzy systems (Cordón et al., 2004). Learning techniques might cover structural changes ranging from parameter optimisation to the learning of the rule set. Other works have also suggested additional extensions of fuzzy inference systems in order to support non-monotonicity of rules. Unfortunately, these extensions are not well established. For example, in (Castro et al., 1998) conflicting rules have their conclusions aggregated by an averaging function, while in (Gegov et al., 2014) a rule-base compression method is proposed for the reduction of non-monotonic rules. A third approach can be seen in (Siler & Buckley, 2005), whereby Possibility Theory (Dubois & Prade, 1998) is included in the fuzzy reasoning system to tackle conflicting instructions. In Possibility Theory, contrary to traditional fuzzy systems, propositions have two truth values: possibility and necessity. The first indicates the extent to which data fails to refute its truth, while the second indicates the extent to which data supports its truth. This theory is adopted in this study for the development of a non-monotonic fuzzy reasoning system (detailed in Section 3.2).

Defeasible argumentation
Argumentation, with origins grounded in philosophy, deals with the study of assertions and the definition of arguments, usually emerging from divergent opinions.

In the field of Artificial Intelligence, argumentation, also referred to as defeasible argumentation (Bryant & Krause, 2008), is aimed at developing computational models of arguments. Such models have become increasingly significant within AI (Bench-Capon & Dunne, 2007), making defeasible argumentation widely employed for modelling non-monotonic reasoning (Chesñevar et al., 2000). Many studies have also described its potential for practical applications, such as dialogue and negotiation (Bench-Capon & Dunne, 2007; Black & Hunter, 2009; Kraus et al., 1998; Amgoud et al., 2000), knowledge representation (Longo, 2015) and decision making in health-care (Glasspool et al., 2006; Patkar et al., 2006). Some of the appealing properties of argument-based models include the lack of statistics or probability for inference and the capability to deal with partial and inconsistent pieces of evidence, thus being closer to the way humans reason under uncertainty and leading to a higher explanatory capacity (Longo, 2016).
This can be exemplified by its attempted use for the development of argumentation-based approaches to explainable AI (Zeng et al., 2018). Moreover, their conflict resolution strategy is strengthened by the large body of literature on acceptability semantics (Dung, 1995; Amgoud et al., 2017; Baroni et al., 2011; Baroni & Giacomin, 2009; Dondio, 2018). Acceptability semantics provide solid mechanisms for the selection of acceptable arguments within a set of conflicting arguments. This set is usually represented by a graph in which arguments are depicted as nodes and attacks (conflicts) between arguments are depicted as arrows. The set of acceptable arguments is usually referred to as an extension. Acceptability semantics can provide a unique extension or multiple extensions for the same set of conflicting arguments. For instance, Dung's well-known grounded semantics (Dung, 1995) always returns a single extension, while Dung's preferred semantics might return a single one or multiple ones (detailed in Section 3.3.4).
Several approaches also exist for quantitative argumentation, or argumentation that deals with numerically measurable arguments, such as Bipolar Argumentation, Probabilistic Argumentation, Multi-valued Argumentation and Weighted Argumentation (Rahwan & Simari, 2009). Despite this number of approaches, computational argumentation systems are usually structured around layers specialised in the definition of the internal structure of arguments, the definition of argument interactions, the resolution of conflicts between arguments and the possible resolution strategies for reaching a justifiable conclusion (Prakken & Vreeswijk, 2001), sometimes complemented by a strategic or heuristic layer, which defines how a dispute should be conducted within the bounds of the procedural layers. Differently, Atkinson et al. (2017) consider five main layers as the basic building blocks of an argumentation model: a structural layer, a relational layer, a dialogical layer, an assessment layer and a rhetorical layer. Another example of a multi-layered structure can be found in (Longo, 2016) and is depicted in Fig. 3.

This research study adopts this structure due to the nature of the application selected for evaluation: the modelling and assessment of human mental workload. In this case, each knowledge base employed is the result of the reasoning of a single agent and does not require a rhetorical layer. The objective is to reason with arguments neutrally built from domain experts so as to achieve a numerical inference representing the mental workload imposed by a specific task. Each layer in this structure is supported by theoretical works in the field of defeasible argumentation. For example, in Layer 1, Toulmin (Toulmin, 1958) provides one of the first conceptual models of arguments, aimed at contributing a more articulated structure for arguments. Another example is given by Walton (Walton, 2013), who identifies and evaluates a variety of argumentation structures in everyday discourse, such as argument from consequence, appeal to expert opinion, argument from analogy and argument by example. Other models of argument are also described in (Bentahar et al., 2010). In Layer 2 the focus is on the relationship between arguments and the management of their conflicts. Prakken (Prakken, 2010) proposes a conflict classification with three different classes: undermining attack, when an argument is attacked on one of its premises; rebutting attack, when an argument negates the conclusion of another argument; and undercutting attack, when an argument is attacked on one of its defeasible inference rules. Moving to Layer 3, the focus is on the ability to characterise the success of an attack. Commonly, attacks have the form of a binary relation, in which all attacks are successful provided they have a defined target (argument being attacked) and source (attacking argument).
However, other approaches are presented in the literature, such as: strength of arguments, preferentiality and strength of attack relations (Dunne et al., 2011; Modgil, 2009; Martınez et al., 2008). The first accounts for the inequality of the strength of arguments in a decision-making process. Preferentiality assumes that the information necessary to decide whether an attack between two arguments is successful is pre-specified. The last approach, strength of attack relations, tries to associate weights with attack relations instead of arguments. Given an evaluation of attacks, acceptability semantics, placed in Layer 4, can be employed for the definition of the acceptability status of arguments. Dung's semantics (Dung, 1995) and its variations (Caminada, 2007; Caminada et al., 2012) are the most well known. Other types include SCC-recursive semantics (Baroni et al., 2005), focused on solving cyclic attack relations of odd length, and ranking-based semantics (Bonzon et al., 2016), which rank arguments from the most acceptable to the weakest one(s). Finally, the selection and accrual of acceptable arguments towards a final inference take place in Layer 5. Some works tackle all these five layers (Chang et al., 2009; Hunter & Williams, 2010; Craven et al., 2012) while others do not (Patkar et al., 2006; Glasspool et al., 2006; Grando et al., 2013). This structure has also been reproduced in past studies (Rizzo & Longo, 2017; Rizzo et al., 2018a; Longo, 2015).

Mental workload
To tackle the research question, a precise knowledge representation and reasoning problem has been selected: mental workload (MWL) modelling. Note that this problem is not the focus of this research study, but only an application that allows the proposed comparison among the non-monotonic reasoning approaches to be performed. Thus, only a brief introduction to its concept, methods of measurement and evaluation metrics is provided here. The interested reader can refer to the citations throughout this section for further information.

Although no single definition has been developed so far (Young et al., 2015; Hart, 2006), MWL can be intuitively described as the total cognitive cost needed to accomplish a specific task over time (Cain, 2007). According to Cain (2007), the main reason for measuring MWL is to quantify the mental cost of performing a certain task in order to predict operator and system performance. It is mainly used in the areas of psychology and ergonomics, with applications in the aviation and automobile industries (Paxion et al., 2014) and in interface and web design (Tracy & Albers, 2006).
Since no correct measure of MWL exists, different methods have been proposed for measuring it (Eggemeier, 1988). Among them, the NASA Task Load Index (NASA-TLX) (Hart & Staveland, 1988) has been largely employed in the last decades (Rizzo et al., 2016; Longo, 2014, 2015) and is adopted in this research study for comparison purposes. It is a combination of six factors believed to influence mental workload: temporal demand, physical demand, mental demand, frustration, effort and performance (Hart & Staveland, 1988). Each factor d is quantified with a subjective judgement coupled with a weight w computed via a pairwise comparison procedure. The set of questionnaires employed for the measurement of each factor can be seen in Table A.11. The final MWL scalar is the weighted average of the six factors d_i and the weights w_i provided by the operator:

    MWL_NASA-TLX = ( Σ_{i=1..6} d_i · w_i ) / ( Σ_{i=1..6} w_i )    (1)

The pairwise comparison procedure is made through a set of questions, for example "which contributed more to the MWL: mental demand or effort?", "performance or frustration?", giving a total of 15 preferences (hence the weights sum to 15). The number of times each feature is chosen defines its weight. A few modified versions of the NASA-TLX have also been proposed. Among them, the most common is referred to as the Raw TLX (RTLX) (Hart, 2006). It removes the pairwise comparison procedure of the NASA-TLX and instead averages the features:

    MWL_RTLX = (1/6) Σ_{i=1..6} d_i    (2)

According to (Hart, 2006), comparisons between the NASA-TLX and the RTLX seem inconclusive, with each being more or less sensitive than the other to changes in task difficulty.
Another MWL assessment technique is the Workload Profile (WP), which is based on the Multiple Resource Theory (MRT) (Wickens, 1991). Contrarily to the NASA-TLX, it is built upon 8 dimensions: solving and deciding, selection of response, task and space, verbal material, visual resources, auditory resources, manual response and speech response (Table A.17, questions 6-13). The user is required to rate each dimension in the range 0 to 1. The final scalar is then given by their sum:

    MWL_WP = Σ_{i=1..8} d_i    (3)
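As an illustration, the three self-reporting indices of equations (1)-(3) could be computed as follows. The function names, ratings and weights are invented for the sake of the example:

```python
def nasa_tlx(ratings, weights):
    """Eq. (1): weighted average of the six factor ratings (weights sum to 15)."""
    return sum(d * w for d, w in zip(ratings, weights)) / sum(weights)

def raw_tlx(ratings):
    """Eq. (2): unweighted average of the six factor ratings."""
    return sum(ratings) / len(ratings)

def workload_profile(ratings):
    """Eq. (3): sum of the eight dimension ratings, each in [0, 1]."""
    return sum(ratings)

# Illustrative values only
tlx_ratings = [55, 30, 70, 25, 60, 40]   # six NASA-TLX factors
tlx_weights = [4, 1, 5, 0, 3, 2]         # pairwise preferences, sum = 15
wp_ratings  = [0.5, 0.2, 0.7, 0.4, 0.6, 0.3, 0.1, 0.2]

print(nasa_tlx(tlx_ratings, tlx_weights), raw_tlx(tlx_ratings), workload_profile(wp_ratings))
```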
Several criteria have been proposed for the selection and development of inferential models of MWL (O'Donnell & Eggemeier, 1986), such as diagnosticity, reliability, sensitivity and validity, among others. Since the goal of this research study is to evaluate the ability of non-monotonic reasoning techniques to represent and assess MWL, the focus is on three different forms of validity and on sensitivity:

• convergent validity: it demonstrates the extent to which different MWL techniques correlate with each other (Tsang & Velazquez, 1996).
• concurrent validity: it determines to what extent a technique can explain measures of objective performance, such as task execution time (Rubio et al., 2004).
• face validity: it determines the extent to which a technique is relevant to the persons answering the questions, or whether the workload reported seems valid to the participants of the experiment (Spielberger et al., 2010).
• sensitivity: it determines the capability of a technique to discriminate significant variations in MWL and changes in resource demand or task difficulty (O'Donnell & Eggemeier, 1986).

Validity and its particular sub-forms have normally been assessed through the analysis of correlation coefficients (Rubio et al., 2004) between produced MWL scalars, while sensitivity has been formally evaluated by analysis of variance coupled with post hoc analysis (Rubio et al., 2004; Longo, 2015).
In summary, MWL is a complex construct built over a network of pieces of evidence; accounting for and understanding the relationships among these pieces of evidence, as well as resolving the inconsistencies arising from their interaction, is essential in modelling MWL (Longo, 2014). In formal logics, these activities are the key components of a defeasible argumentative process, where a set of interactive pieces of evidence, called arguments, can be defeated by additional arguments (Longo, 2014). To the best of our knowledge, Longo (2012) was the first to attempt to model MWL as a non-monotonic concept. Thus, in spite of MWL not being the focus of this research, it is important to highlight that no other authors have followed this modelling approach. Previous works have investigated the use of expert systems for MWL modelling (Rizzo et al., 2016) and the comparison of defeasible argumentation and non-monotonic fuzzy reasoning (Rizzo & Longo, 2019). Nonetheless, these are not comprehensive studies, employing small sets of data and limited sets of inference models. Here, a thorough investigation has been proposed, extending preceding studies and fine-tuning the designed inference models. In particular, this research is secondary in terms of the data employed. It employs information from studies proposed in (Longo, 2018b; Longo & Orru, 2019; Longo, 2018a, 2017; Longo & Dondio, 2015) for the evaluation of the MWL imposed on participants who performed two types of tasks: information-seeking web-based tasks and attendance of third-level classes delivered at the Technological University Dublin (a detailed description of these tasks is given in Section 3.4). The answers provided by these participants led to the creation of three different datasets evaluated simultaneously in this study. Specifically, they were used to elicit the non-monotonic reasoning models introduced in the next section.

Design and methodology
In order to answer the research question, a primary quantitative study was designed, as depicted in Fig. 4. Empirical evidence was employed with two objectives in mind:

1. To investigate the capacity of non-monotonic reasoning models to assess the construct of MWL.
2. To investigate the quality of inferences produced by non-monotonic reasoning models.
The hypothesis for objective 1 is that non-monotonic reasoning models will demonstrate high convergent validity with baseline instruments, thus being able to assess MWL. The hypothesis for objective 2 is that defeasible argumentation models will demonstrate higher sensitivity, higher concurrent validity and higher face validity than fuzzy reasoning and expert system models, thus showing that defeasible argumentation has a better inferential capacity than the other non-monotonic reasoning approaches. Table 1 lists the hypotheses and methods associated with each objective of this research study.

Table 1: Objectives, methods and hypotheses of this research study.

Objective 1: Evaluation of the capacity to assess the construct of MWL.
Method: Evaluation of convergent validity.
Hypothesis 1: Non-monotonic reasoning models will demonstrate moderate to high convergent validity with baseline instruments.

Objective 2: Investigation of the quality of produced inferences.
Method: Evaluation of face validity, concurrent validity and sensitivity.
Hypothesis 2: Defeasible argumentation models will demonstrate higher sensitivity, higher concurrent validity and higher face validity than fuzzy reasoning and expert system models.

Three knowledge bases (Appendix A), designed by two interviewed experts,
were employed for the construction of models capable of inferring a mental workload scalar (a value in the range [0, 100] ⊂ R). Each knowledge base was built with rules constructed by only considering the information gathered with well-known self-reporting mental workload instruments. Each rule was subsequently elicited with the data associated to its premises. The construction of the datasets, the knowledge bases and the description of the performed tasks designed to assess MWL are detailed in the following subsections. As summarised in Fig. 4, non-monotonic reasoning models are firstly built upon an expert knowledge base and a reasoning approach. Secondly, these models are instantiated with the data associated to the selected knowledge base and the respective inferences are produced (MWL scalars). This process is repeated for each knowledge base.
Finally, the inferences produced using all knowledge bases are compared against each other to test the research hypotheses.

Expert systems
Focused on imitating the problem-solving ability of a human expert, expert systems are among the most well-known reasoning approaches in the literature.
A step-by-step description of their inferential process is provided along with a running example (Fig. 5) for the problem chosen in this paper: mental workload modelling and assessment. This example is referred to throughout this section and is aimed at providing a complete overview of the expert system procedure for inferring a MWL scalar from real-world data.

IF-THEN rules and contradictions
The first step of an expert system is to model a knowledge base, usually provided by domain experts and translated into a set of IF-THEN rules. Examples of hypothetical IF-THEN rules are:

− Rule 1: IF mental demand is low THEN MWL is underload
− Rule 2: IF effort is low THEN MWL is moderate

These rules might be contradicted by other rules which intend to bring forward and support contradictory information. An example of a hypothetical contradiction is:

− Contradiction 1: IF effort is high THEN Rule 1 is invalid

The set of IF-THEN rules and the set of contradictions is now ready to be elicited. In detail, the second step of the expert system is to define the inference engine aimed at firing rules and solving contradictions among them.

Inference engine
The inference engine starts with the activation of IF-THEN rules and contradictions with real-world data. This means that input data will be used to evaluate the antecedents of rules and contradictions, firing the subset whose evaluation returns true. If both an IF-THEN rule and at least one contradiction challenging the rule have been activated, then the inference engine discards the rule. This mechanism will eventually form a set of surviving rules. Fig. 5.A, 5.B and 5.C respectively depict the input values in the running example, the set of activated rules and the set of surviving rules. Note that these rules and arguments come from a real knowledge base that can be seen in Appendix A.
They may not be the same as the hypothetical rules and contradictions, such as Rule 2 and Contradiction 1. Experts can have different opinions, and the fact that a set of premises infers a conclusion in one knowledge base does not mean it has to infer the same conclusion in another knowledge base.
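A minimal sketch of this activation and discard mechanism could look as follows. The predicates and thresholds are hypothetical and do not come from the knowledge bases in Appendix A:

```python
# Rules and contradictions are evaluated against input data; any activated
# rule challenged by an activated contradiction is discarded.

def surviving_rules(rules, contradictions, data):
    activated = [r for r in rules if r["premise"](data)]
    active_contra = [c for c in contradictions if c["premise"](data)]
    challenged = {c["target"] for c in active_contra}
    return [r for r in activated if r["name"] not in challenged]

rules = [
    {"name": "Rule 1", "premise": lambda d: d["mental_demand"] <= 33, "mwl": "underload"},
    {"name": "Rule 2", "premise": lambda d: d["effort"] <= 33, "mwl": "moderate"},
]
contradictions = [{"target": "Rule 1", "premise": lambda d: d["effort"] > 66}]

print([r["name"] for r in surviving_rules(rules, contradictions,
                                          {"mental_demand": 20, "effort": 10})])
# ['Rule 1', 'Rule 2'] - the contradiction is not activated since effort is low
```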

Rules quantification and aggregation
The rules in the set of surviving rules might have distinct consequents. For example, in this research study, there might be rules inferring different MWL levels. Since the goal is to aggregate them and extract a unique scalar, most representative of the imposed mental workload, an aggregation strategy is needed.
In this situation, a usual expert system would have a typical set of choices for the selection of rules, for example: deciding a priority for each rule, returning multiple outcomes or choosing the first rule activated. However, none of these strategies is applicable in this research study. The knowledge bases do not make explicit preferences among rules, an order of activation, or the possibility to compute more than one output. Because of that, rules have to be quantified and aggregated to infer a MWL scalar in the range [0, 100] ⊂ R.
In the quantification step, a value has to be attributed to each surviving IF-THEN rule. In this study, this value is defined according to the numerical range of the consequent of the rule, the numerical range of its premises and the input values provided for the rule activation. In the basic scenario of an IF-THEN rule with only one premise, it will be quantified as the minimum (resp. maximum) value of the numerical range of its consequent if its premise is activated with its minimum (resp. maximum) value. For instance, consider Rule 2 rewritten with hypothetical numerical ranges:

− Rule 2: IF effort is low ([0, 33]) THEN MWL is moderate ([33, 66])

In this case, if the input value for effort is 0, then the value of Rule 2 will be 33. Analogously, if the input value for effort is 33, the value of Rule 2 will be 66. Activation values between 0 and 33 are evaluated according to a linear relationship. To formalise the generic case, IF-THEN rules are precisely defined, followed by the definition of the function f that returns their value:

Definition 1 (Generic IF-THEN rule). A generic IF-THEN rule is defined, without loss of generality, as:

    IF x_1 is l_1 (AND | OR x_2 is l_2 , ...) THEN MWL is L

where each x_i is an input feature; each l_i is a feature level bounded by a numerical range; L is the numerical range for the MWL level being inferred; and AND and OR are boolean logical operators.

Definition 2 (Generic rule value). The value of a generic IF-THEN rule r, whose premise is activated by an input value v within the numerical range [p_min, p_max] of its feature level, and whose consequent has numerical range [c_min, c_max], is given by the function:

    f(r) = c_min + ((v − p_min) / (p_max − p_min)) · (c_max − c_min)

For rules with multiple premises, the normalised activations of the premises are combined before being mapped onto the consequent range. Note that the value of a rule always lies within the numerical range of its consequent. In addition, a weight can be attributed to each rule using the pairwise comparison procedure of the NASA-TLX, allowing the investigation of the impact of adding this extra information on the inferential capacity of the expert system models. In the pairwise comparison procedure, the number of times a feature has been chosen over another is its respective weight, which in turn will also represent the weight of the IF-THEN rules whose antecedents contain such a feature. Observe that instead of general rule weights, rules will have different weights on a case-by-case basis.
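A sketch of the single-premise case of f, consistent with the Rule 2 example above:

```python
# The input's relative position within the premise range is mapped linearly
# onto the consequent (MWL) range.

def rule_value(v, premise_range, consequent_range):
    p_min, p_max = premise_range
    c_min, c_max = consequent_range
    activation = (v - p_min) / (p_max - p_min)   # normalised position in [0, 1]
    return c_min + activation * (c_max - c_min)

# Rule 2: IF effort is low ([0, 33]) THEN MWL is moderate ([33, 66])
print(rule_value(0, (0, 33), (33, 66)))    # 33.0
print(rule_value(33, (0, 33), (33, 66)))   # 66.0
```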
Given the quantified rules and their weights, four heuristics are employed to aggregate them into a final MWL scalar:

h1: definition of the sets of surviving rules grouped by their MWL level; extraction of the largest set; average of the values of the rules in the largest set. In case two or more largest sets exist, the above process is repeated for each of them and their average is returned. The idea is to give importance to the largest set of surviving rules supporting the same MWL level.

h2: same as h1, but applying the weighted average instead of the average. The goal here is to add the information from the pairwise comparison procedure provided by the NASA-TLX questionnaire.

h3: average value of all surviving IF-THEN rules. This gives equal importance to all surviving IF-THEN rules, regardless of which level of MWL they support.

h4: same as h3, but applying the weighted average instead of the average. Again, the goal is to employ the information of the pairwise comparison procedure of the NASA-TLX.
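The four heuristics could be sketched as follows, representing each surviving rule as a (MWL level, value, weight) triple. All values below are illustrative:

```python
from collections import defaultdict

def h1(rules):
    groups = defaultdict(list)
    for level, value, _ in rules:
        groups[level].append(value)
    largest = max(len(g) for g in groups.values())
    # average each largest group, then average across ties
    means = [sum(g) / len(g) for g in groups.values() if len(g) == largest]
    return sum(means) / len(means)

def h2(rules):
    groups = defaultdict(list)
    for level, value, weight in rules:
        groups[level].append((value, weight))
    largest = max(len(g) for g in groups.values())
    means = [sum(v * w for v, w in g) / sum(w for _, w in g)
             for g in groups.values() if len(g) == largest]
    return sum(means) / len(means)

def h3(rules):
    return sum(v for _, v, _ in rules) / len(rules)

def h4(rules):
    return sum(v * w for _, v, w in rules) / sum(w for _, _, w in rules)

surviving = [("moderate", 50.0, 3), ("moderate", 60.0, 2), ("underload", 20.0, 4)]
print(h1(surviving), h2(surviving), h3(surviving), h4(surviving))
# 55.0, 54.0, ~43.3, ~38.9
```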

Non-monotonic fuzzy reasoning
For comparison purposes, fuzzy reasoning is the second reasoning approach selected in this research study. It provides a robust representation of linguistic information by using fuzzy membership functions. In addition, it considers Possibility Theory (Dubois & Prade, 1998) in the reasoning process to tackle non-monotonicity. Similarly to expert systems, a running example of a single inference with real-world data is depicted in Fig. 6 and referred to throughout this subsection.

In the fuzzification step, the knowledge base of the expert is first translated into a set of fuzzy IF-THEN rules and fuzzy contradictions. Afterwards, each linguistic term associated with a feature level or MWL level, such as low or underload, is described by a fuzzy membership function (FMF), which is also provided by the knowledge base designers. Appendix A.4 depicts the three options provided, using linear, trapezoidal and Gaussian shapes. In the running example, the membership functions for MWL levels and feature levels can be seen in Fig. 6.C and Fig. 6.D respectively.

Inference engine
Once the fuzzification step has been completed and the knowledge base of the expert translated into fuzzy rules and fuzzy contradictions, the next step is to solve such contradictions. Possibility Theory is used here as a possible approach, as implemented in (Siler & Buckley, 2005). Under this theory, the truth value of a proposition A refuted by a set of propositions Q which contradicts A is derivable as:

    tv(A) = ¬Nec(Q)    (4)

where ¬Nec(Q) = 1 − Nec(Q). Since there is no addition of supporting information, but only attempts to contradict or refute information, equation (4) is sufficient to update the truth value of each fuzzy rule. A truth value of 1 indicates that Fuzzy Rule 1 is not refuted, 0 indicates that it is fully refuted, and values between 1 and 0 indicate that it is partially refuted. The truth value of Fuzzy Rule 1 represents the truth value of underload in this particular rule.
It is important to highlight that the approach developed in (Siler & Buckley, 2005) was inspired by a multi-step forward-chaining reasoning system. In this research study, reasoning is done in a single step, in the sense that data is imported and all rules are fired at once. However, it is possible to define a precedence order of fuzzy contradictions. More exactly, it is possible to define a tree structure in which the consequent of a fuzzy contradiction is the antecedent of the next fuzzy contradiction. In this way, equation (4) can be applied following the precedence order defined by such a structure. Given that there is no information in the knowledge bases (employed in this study, as per Appendix A) to decide whether a fuzzy rule or a fuzzy contradiction is more important than another, here they are solved simultaneously.
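A minimal sketch of this refutation mechanism, assuming the reconstruction of equation (4) above; combining the refuted truth value with the rule's own firing degree via the minimum is an additional assumption:

```python
def refuted_truth(nec_q):
    """Eq. (4): truth value of a rule attacked by contradictions Q is 1 - Nec(Q)."""
    return 1.0 - nec_q

def updated_rule_truth(firing_degree, nec_q):
    # Assumption: the rule cannot be more true than its firing degree,
    # nor more true than what its refuters allow.
    return min(firing_degree, refuted_truth(nec_q))

print(updated_rule_truth(0.8, 0.0))   # 0.8 - no refutation
print(updated_rule_truth(0.8, 1.0))   # 0.0 - fully refuted
print(updated_rule_truth(0.8, 0.5))   # 0.5 - partially refuted
```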

Defuzzification module
The output of the inference engine is a graphic representation of the aggregation of the consequents (MWL levels) of the updated fuzzy IF-THEN rules (Fig. 6.I). Several methods can be used for calculating a single defuzzified scalar (Hellendoorn & Thomas, 1993). Two are selected here: mean of max and centroid. The first returns the average of all elements (MWL levels) with maximal membership grade. The second returns the coordinates (x, y) of the centre of gravity of the geometric shape formed by the aggregation of the membership functions associated with each consequent (MWL level). The defuzzified scalar is then given by the x coordinate of the centroid (as per Fig. 6.J).
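Both defuzzification methods can be sketched as follows, given the aggregated membership values over the MWL universe. The aggregated shape below is illustrative:

```python
import numpy as np

def mean_of_max(x, mu):
    """Average of all MWL values with maximal membership grade."""
    return float(np.mean(x[mu == mu.max()]))

def centroid(x, mu):
    """x coordinate of the centre of gravity of the aggregated shape."""
    return float(np.sum(x * mu) / np.sum(mu))

x = np.linspace(0, 100, 1001)
mu = np.clip(1 - np.abs(x - 30) / 30, 0, 1)   # illustrative aggregated shape
print(mean_of_max(x, mu), centroid(x, mu))     # ~30.0 for both (symmetric shape)
```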

Defeasible argumentation
The definition of argument-based models follows the 5-layer modelling approach proposed in (Longo, 2016) and depicted in Fig. 3 (Section 2.4). It starts with the definition of the internal structure of arguments, followed by the definition of conflicts among arguments, the definition of the acceptance status of each argument and the aggregation of the accepted arguments. A running example is depicted in Fig. 7 and referred to throughout this subsection.

Layer 1 - Definition of the internal structure of arguments
Most commonly, an argument is composed of one or more premises that provide reasons for, or support, a conclusion. Thus, the first step of an argumentation process usually focuses on the construction of forecast arguments, defined as:

690
F orecast argument : premises → conclusion This structure includes a set of premises (believed to influence the conclusion being inferred) and a conclusion derivable by applying the inference rule →. It is an uncertain implication which is used to represent a defeasible argument. In order to solve the application in hand (MWL), similarly to the rules 695 of expert systems, premises and conclusions are strictly bounded in numerical ranges associated to natural language terms (for instance low and underload).
An example of a hypothetical forecast argument is given below (it matches Rule 1 of Section 3.1.1): In the running example, the selected knowledge base and input values (Fig.   7.* and 7.A) are the same employed in the expert systems and the non-monotonic fuzzy reasoning system (as per Fig. 5 and Fig. 6 respectively). The forecast arguments that are activated from these can be seen in Fig. 7

Layer 2 - Definition of the conflicts of arguments
In order to evaluate inconsistencies, the notion of a mitigating argument (Matt et al., 2010) is introduced. This is formed by a set of premises and an undercutting inference ⇒ to an argument B (forecast or mitigating):

    Mitigating argument: premises ⇒ ¬B

Both forecast and mitigating arguments are special defeasible rules, as defined in (Prakken, 2010). Informally, if their premises hold then presumably (defeasibly) their conclusions also hold. Different types of mitigating arguments exist in the literature, such as rebuttal and undermining (Prakken, 2010). In this research, the notion of undercutting attack is employed for the construction of mitigating arguments, thus enabling the resolution of conflicts. An undercutting attack defines an exception, where some inference carried out in the attacked argument is no longer allowed. Contradictions, such as in Section 3.1.1, represent the information necessary for the construction of undercutting attacks. For example, the corresponding hypothetical mitigating argument that can be constructed from Contradiction 1 (Section 3.1.1) via an undercutting attack is:

− UA1: high effort ⇒ ¬ARG 1

All forecast arguments and undercutting attacks form an argumentation framework (AF) (as in Fig. 7.*). Fig. 7.C lists the activated undercutting attacks for the input values (Fig. 7.A). In this example, undercutting attacks originate from the contradiction "C3: FR1 and MD4 cannot coexist", listed in Table A.15. It was defined by a domain expert and manually translated into two undercutting attacks.

Layer 3 - Evaluation of the conflicts of arguments
At this stage, the created AF can be elicited with data. Forecast and mitigating arguments can be activated or discarded, based on whether their premises evaluate true or false. Consequently, attacks between activated arguments will be evaluated before being activated as well. As mentioned in Section 2.4, attacks usually have the form of a binary relation, in which a successful (activated) attack occurs whenever both its source (attacking argument) and its target (argument being attacked) are activated. Another approach, adopted in this study, is the use of the strength of arguments. In this case, similarly to the definition of rule weights in expert systems and fuzzy reasoning, the strength of each argument is extracted from the pairwise comparison procedure of the NASA-TLX. The number of times a feature has been chosen in the pairwise comparison procedure represents the feature strength, which in turn also represents the strength of the arguments employing such a feature. Consequently, an attack is considered successful only if the strength of its source is equal to or greater than the strength of its target.
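A sketch of this strength-based evaluation of attacks; the argument names and strengths below are hypothetical pairwise-comparison counts:

```python
# An attack succeeds only when the strength of its source is greater than or
# equal to that of its target.

def successful_attacks(attacks, strength):
    return [(src, tgt) for src, tgt in attacks if strength[src] >= strength[tgt]]

strength = {"ARG1": 2, "UA1": 4, "ARG2": 5}
attacks = [("UA1", "ARG1"), ("ARG1", "ARG2")]
print(successful_attacks(attacks, strength))   # [('UA1', 'ARG1')]
```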
From the activated forecast/mitigating arguments and successful attacks, a sub-argumentation framework (sub-AF) emerges, as in Fig. 7.D. This is equivalent to the Abstract Argumentation framework proposed in Dung (1995).

Layer 4 - Definition of the acceptance status of arguments

Given a sub-AF, acceptability semantics (Baroni et al., 2011; Dung, 1995) are applied to compute the acceptance status of each argument, that is, its acceptability. An argument A is defeated by B if there is a valid attack from B to A (Dung, 1995). Not only that, but it is also necessary to evaluate whether the defeaters are defeated themselves. Hence, acceptability semantics are aimed at evaluating which arguments are ultimately defeated. A set of non-defeated arguments is called an extension, or a subset of arguments that can be mutually acceptable according to some rationale. Extensions are in turn used in the 5th layer of the reasoning structure of Fig. 3 to produce a final inference. The internal structure of arguments is not considered in this layer, which is why the definition of a sub-AF here is equivalent to the notion of an abstract argumentation framework (AAF) as proposed by Dung (Dung, 1995). An AAF is a pair ⟨Arg, attacks⟩ where Arg is a finite set of abstract arguments and attacks ⊆ Arg × Arg is a binary relation over Arg. Given sets X, Y ⊆ Arg, X attacks Y if and only if there exists x ∈ X and y ∈ Y such that (x, y) ∈ attacks. A set X ⊆ Arg of arguments is:

admissible iff X does not attack itself and X attacks every set of arguments Y such that Y attacks X;

complete iff X is admissible and X contains all arguments it defends, where X defends x if and only if X attacks all attackers of x;

grounded iff X is minimally complete (with respect to ⊆);

preferred iff X is maximally admissible (with respect to ⊆).

These represent a few argument-based semantics among the others that have been proposed in the literature (Baroni et al., 2011). However, here the focus is on the grounded and preferred semantics. Fig. 7.E, 7.F and 7.G depict the different extensions obtained when employing the grounded and preferred semantics in the running example.
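As an illustration of the grounded semantics, the following sketch computes the grounded extension of a small, hypothetical framework as the least fixed point of the characteristic function, iterated from the empty set:

```python
def grounded_extension(args, attacks):
    """args: set of argument names; attacks: set of (attacker, target) pairs."""
    attackers = {a: {s for s, t in attacks if t == a} for a in args}
    extension = set()
    while True:
        # an argument is defended by the extension if every one of its
        # attackers is itself attacked by some member of the extension
        defended = {a for a in args
                    if all(any((e, b) in attacks for e in extension)
                           for b in attackers[a])}
        if defended == extension:
            return extension
        extension = defended

args = {"A", "B", "C"}
attacks = {("A", "B"), ("B", "C")}
print(sorted(grounded_extension(args, attacks)))   # ['A', 'C']
```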

Layer 5 - Accrual of acceptable arguments
Finally, in the last step of the reasoning process, a final inference has to be produced. In case multiple extensions are computed, one extension might be favoured over the others. In this study, the cardinality of an extension (number of accepted arguments) is used as the mechanism for selecting the favoured one.
Intuitively, a larger extension of arguments might be seen as more relevant than a smaller one, since it accrues a larger amount of mutually acceptable evidence towards the final inference.
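A sketch of this accrual step. How the final scalar is derived from the accepted arguments is assumed here to mirror the rule quantification of Section 3.1 (averaging the values of the accepted arguments); all names and values are illustrative:

```python
def accrue(extensions, argument_values):
    favoured = max(extensions, key=len)        # cardinality-based selection
    values = [argument_values[a] for a in favoured]
    return sum(values) / len(values)           # assumed aggregation: average

extensions = [{"ARG1", "ARG3"}, {"ARG2"}]
argument_values = {"ARG1": 25.0, "ARG2": 70.0, "ARG3": 35.0}
print(accrue(extensions, argument_values))     # 30.0
```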

Participants and procedures
Three distinct experiments were performed with human subjects. In the first and second, a number of third-level classes were delivered to students at the Technological University Dublin, School of Computer Science, Dublin, Ireland.
In the third, nine information-seeking web-based tasks of varying difficulty and demand were performed by volunteer participants over three popular websites: Google, Wikipedia and Youtube. Subjects were briefed about the study and were requested to sign a consent form that included data protection and treatment. The privacy and anonymity of participants were in all respects protected by the authors. After each task, a self-reporting questionnaire aimed at assessing mental workload was given to subjects. These can be seen in Tables A.11, A.13 and A.17 in the Appendices. Besides completing the questionnaires, in some scenarios participants were required to fill in another scale providing an indication of their experienced mental workload (Fig. 8). This question was designed for triangulation purposes, under the assumption that only the person executing a task can precisely self-assess their own experienced mental workload (Moustafa et al., 2017). Table 3 summarises the three experiments, the questionnaires employed and the number of participants. It also mentions the mental workload assessment instrument employed as a baseline for comparison purposes.
Third-level classes

In the first and second experiments, classes were delivered following three instructional designs across the academic terms 2015-2018:

1. Traditional direct instruction, using slides projected onto a white board;
2. Multimedia video of content: transformation of the content of the slides of 1 into a multimedia video projected onto a white board;
3. Constructivist collaborative activity added to 2.

Information seeking web-based tasks
In the third experiment, nine information-seeking web-based tasks of varying difficulty and demand (Table B.19 in the Appendix) were performed by participants over three websites: Google, Wikipedia and Youtube. These websites were selected due to their popularity and the assumption that participants were familiar with their interfaces. In this way, situations of MWL underload were expected to happen; if non-popular websites had been chosen, the chances of spotting MWL underload would have been reduced. In addition, the original interface of each website was slightly manipulated in order to impose different MWL demands on participants interacting with them, leading to 9 tasks on the original websites and 9 tasks on the modified websites (18 in total). 46 volunteers performed all the tasks in a random order on different days, over 2 or 3 sessions of approximately 45-70 minutes each. Afterwards, the questions of Table A.17 were answered using a paper-based scale in the range [0..100] ⊂ N, partitioned into 3 regions delimited at 33 and 66. 405 valid instances were generated. Despite not being necessary in this study, the reader can obtain more information on the construction of this dataset in (Longo, 2018a, 2017; Longo & Dondio, 2015).

Summary of models and comparative metrics
Tables C.20, C.21 and C.22 in the Appendix list the models built using the selected reasoning approaches and knowledge bases, and in which experiment they were employed. Convergent and face validity have been assessed through the analysis of correlation coefficients and mean squared errors against baseline instruments, while concurrent validity has been assessed through the analysis of correlation coefficients between the inferences produced by the designed models and an objective performance measure, in this case task completion time. Finally, sensitivity has been formally assessed by analysing the variance of the distributions generated by the inferences of the designed non-monotonic reasoning models, followed by a post hoc analysis. Before presenting the results and the discussion of the study, Table 6 summarises the experiments by reasoning models and statistical tests applied.
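These analyses can be sketched with standard statistical tooling; the arrays below are illustrative, and the sketch assumes SciPy's spearmanr and kruskal functions:

```python
# Spearman correlation for convergent/concurrent validity, mean squared error
# for face validity, and the Kruskal-Wallis H test for sensitivity.
import numpy as np
from scipy.stats import spearmanr, kruskal

inferred = np.array([40.0, 55.0, 62.0, 30.0, 75.0])   # model inferences
baseline = np.array([42.0, 50.0, 60.0, 35.0, 70.0])   # baseline instrument

rho, p = spearmanr(inferred, baseline)                 # validity via correlation
mse = float(np.mean((inferred - baseline) ** 2))       # face validity via MSE
h, p_kw = kruskal([40, 42, 38], [60, 65, 63], [75, 70, 78])  # sensitivity across tasks

print(rho, mse, h)
```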

Results and discussion
Collected data was used to elicit the models listed in Tables C.20, C.21 and C.22 (Appendix C). The evaluation metrics of Table 5 are analysed in the following sections.

Table 6: Summary of experiments by reasoning models and statistical tests applied (experimental settings as in Table 3). The full list and details of all the designed models can be seen in Appendix C. Additional details on statistical tests can be seen in Table 5.

Convergent validity
This property is aimed at determining whether, and to what extent, two MWL inference models are correlated. It is the metric employed to achieve objective 1 (Section 3) and test its research hypothesis. From Fig. 9 (correlations against the RAW TLX) it is possible to observe that the models designed for experiment Ea could all achieve a medium to high correlation coefficient with the baseline instrument. This indicates that acceptable MWL inference models can be designed with less information than the original NASA-TLX instrument. In experiment Eb, a worse performance can be observed for the fuzzy models employing the mean of max defuzzification approach (labelled with an inferior •), be it among models of linear fuzzy membership functions or Gaussian fuzzy membership functions. This is a strong indication that the mean of max is not a suitable parameter within a model to assess MWL in experiment Eb, regardless of the fuzzy operator or shape of the fuzzy membership function employed. As for experiment Ec, it is possible to notice a better performance for expert system models employing heuristic h4 instead of h3 (E8), no significant difference between defeasible argumentation models, and worse performance in general for fuzzy models employing the mean of max defuzzification approach (labelled with an inferior •). However, the impact of the FMF shape is not analogous to that of previous findings; in fact, it is not possible to observe a significant difference in their correlation coefficients except for models FL24 and FC24.
In summary, it is worth highlighting some common findings and differences related to the convergent validity of models across reasoning approaches. For instance, the expert system and defeasible argumentation reasoning approaches appear to be more robust for modelling the construct of MWL across the different experimental settings.
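To make the computation behind these comparisons concrete, the following sketch illustrates how a convergent validity coefficient can be obtained between the inferences of a designed model and a baseline instrument. Spearman's rank correlation and the synthetic data are illustrative assumptions; the paper reports its actual coefficients in the figures referenced above.

import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
baseline = rng.uniform(0, 100, size=405)           # e.g. RAW TLX scores (synthetic)
inferred = baseline + rng.normal(0, 15, size=405)  # hypothetical model inferences

rho, p = spearmanr(inferred, baseline)
print(f"rho = {rho:.2f}, p = {p:.3g}")  # medium/high rho indicates convergent validity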

Face validity
This property is aimed at determining the extent to which a measure of MWL appears effective. It is one of the metrics employed to achieve objective 2 (Section 3) and test its research hypotheses. It was analysed according to the mean squared error (MSE) between produced inferences and self-reported MWL values (Fig. 8). Figs. 12 and 13 depict the MSE achieved by the designed models. As for experiment Ea, a significant difference has been found between models employing the pairwise comparison information of the NASA-TLX and those not employing it: among fuzzy models with linear FMFs, an average decrease in MSE can be observed.
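A minimal sketch of the face validity computation, assuming inferences and self-reports share the [0..100] scale (the values below are synthetic):

import numpy as np

def mse(inferred, self_reported) -> float:
    """Mean squared error between model inferences and self-reported MWL."""
    return float(np.mean((np.asarray(inferred) - np.asarray(self_reported)) ** 2))

print(mse([40.0, 55.0, 70.0], [35.0, 60.0, 75.0]))  # 25.0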

Concurrent validity
Aimed at determining the extent to which a model correlates with an objective performance measure, in this case task completion time, concurrent validity was also assessed through an analysis of correlation coefficients between the designed models and baseline instruments in experiment Ec.

This suggests that the investigated reasoning approaches, when set up with certain parameters, are as good as the baseline models. The exceptions presenting a lower correlation coefficient are the fuzzy models of Gaussian FMFs employing the mean of max defuzzification approach (FC20, FC22 and FC24) and the expert system E7 employing heuristic h1. This trend is very similar to the one depicted for convergent validity in Fig. 11, suggesting that these combinations of parameters (Gaussian FMFs plus mean of max for fuzzy models, and heuristic h1 for expert system models) do not help to create robust models of MWL. It is also worth noting that fuzzy models FL20 and FL22 could achieve a favourable correlation coefficient with task completion time, despite having low convergent validity. This suggests that models with low convergent validity might also produce acceptable inferences.
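Analogously to convergent validity, concurrent validity reduces to a correlation against the objective measure. A sketch under the same illustrative assumptions (synthetic data, Spearman's coefficient):

import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
completion_time = rng.uniform(30, 300, size=405)                  # seconds (synthetic)
model_inference = 0.2 * completion_time + rng.normal(0, 10, 405)  # hypothetical inferences
baseline_score = 0.2 * completion_time + rng.normal(0, 12, 405)   # hypothetical baseline

rho_model, _ = spearmanr(model_inference, completion_time)
rho_base, _ = spearmanr(baseline_score, completion_time)
print(f"model: {rho_model:.2f}  baseline: {rho_base:.2f}")  # comparable values -> on par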

Sensitivity
In line with other studies (Rubio et al., 2004; Longo, 2015), sensitivity was assessed by performing an analysis of variance over the MWL distributions generated by the designed models. The null hypothesis of equal distributions of MWL scalars across tasks was rejected. That means that there exist models that lead to significantly different inferences when used to evaluate the MWL imposed by the web-based tasks. However, the Kruskal-Wallis H test does not tell exactly which pairs of tasks executed by participants are different from each other. Consequently, a post hoc analysis was performed and the Games-Howell test was chosen because of the unequal variances of the distributions under analysis. Fig. 15 depicts how many pairs of tasks each model was capable of differentiating at two significance levels (p < 0.05 and p < 0.01). As can be observed, similarly to convergent and concurrent validity, defeasible argumentation models and expert system E8 outperformed the other models. When compared to the baseline instruments, results for these models are in between the NASA-TLX and the WP for both significance levels. Despite the high sensitivity of defeasible argumentation models, it is possible to observe a slight difference between them, with a better performance achieved by model A8, whose argumentation semantics is the preferred semantics. Among fuzzy models, it is worth noting that the best performance is given by FL20 and FL22. This strengthens the results of concurrent validity, suggesting that models of low convergent validity might produce satisfactory inferences. Another interesting observation comes from model FC19: in spite of presenting convergent and concurrent validity similar to its linear counterpart (FL19), its sensitivity was superior, being close to or better than the WP, while FL19 was always distant from the baseline instruments. This shows that Gaussian FMFs can provide more sensitive models when employed with certain fuzzy operators and defuzzification approaches (in this case Zadeh and centroid, respectively). Other fuzzy models demonstrated poor sensitivity, underperforming the baseline models. In detail, as expected from the convergent and face validity analyses of experiment Ec, fuzzy models of Gaussian FMFs employing the mean of max defuzzification approach led to the worst performance, not being able to statistically differentiate between any pair of tasks.
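The sensitivity procedure described above can be sketched as follows; the Kruskal-Wallis H test is taken from scipy and the Games-Howell post hoc test from the third-party pingouin package, with simulated per-task distributions standing in for the real model inferences:

import numpy as np
import pandas as pd
import pingouin as pg
from scipy.stats import kruskal

rng = np.random.default_rng(2)
# 9 tasks x 45 participants, with unequal means and variances (synthetic)
df = pd.DataFrame({
    "task": np.repeat([f"T{i}" for i in range(1, 10)], 45),
    "mwl": np.concatenate([rng.normal(30 + 5 * i, 8 + i, 45) for i in range(9)]),
})

H, p = kruskal(*[g["mwl"].to_numpy() for _, g in df.groupby("task")])
print(f"H = {H:.1f}, p = {p:.3g}")  # p < 0.05: not all task distributions are equal

# Games-Howell handles unequal variances, as in the analysis above
posthoc = pg.pairwise_gameshowell(data=df, dv="mwl", between="task")
print((posthoc["pval"] < 0.05).sum(), "of", len(posthoc), "task pairs differ")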

Internal configurations of models and interpretations
Quantifications of the validity and sensitivity of the developed models suggest that, in general, the investigated reasoning approaches can be successfully employed for mental workload modelling and assessment. Nonetheless, the analysis across different experiments and evaluation metrics seems to indicate a contrasting performance when particular parameters of distinct reasoning techniques are employed. Table 7 summarises these internal configurations and their observed impact. The most negative impacts seemed to be caused by the application of the mean of max defuzzification approach by fuzzy models and by the heuristics for the refinement of surviving rules employed by expert system models (h1/h2). These led to the development of models that, on average, underperformed in all evaluation metrics (validity, sensitivity) and, in the case of fuzzy models, also tended to have a much higher standard deviation when compared to their counterparts: the centroid approach and the heuristics h3/h4. The explanation for such discrepancy might lie in the role of the mean of max defuzzification approach and the role of the heuristics h1/h2 in their respective models. Note that, despite being employed by distinct reasoning techniques, these roles might in fact be related. While the mean of max defuzzification approach selects only the rules whose conclusion(s) have the highest degree of truth, the refinement of surviving rules by heuristics h1/h2 discards rules not inferring the MWL level supported by the greatest number of surviving rules. Thus, both can be seen as apparently unsuccessful attempts to resolve conflicts among rules by selecting the subset believed to be suitable for inferring a final MWL scalar. Mixed findings were instead observed with regard to the impact of the shape of FMFs on MWL modelling and assessment.
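The contrast between the two defuzzification approaches can be illustrated numerically. In the toy aggregated output below (synthetic, with two fuzzy regions of unequal membership), the mean of max collapses the inference onto the strongest region only, while the centroid weighs the whole output, consistently with the interpretation given above:

import numpy as np

x = np.linspace(0, 100, 1001)  # universe of discourse for the MWL scalar
# Aggregated membership with a weak peak at 30 and a strong peak at 70
mu = np.maximum(0.6 * np.exp(-((x - 30) / 12) ** 2),
                np.exp(-((x - 70) / 12) ** 2))

centroid = float(np.sum(x * mu) / np.sum(mu))
mean_of_max = float(x[np.isclose(mu, mu.max())].mean())

print(f"centroid = {centroid:.1f}, mean of max = {mean_of_max:.1f}")
# The centroid (~55) reflects both regions; the mean of max (70) discards the weaker one.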

Discussion
The overall medium to high degree of convergent validity of the investigated models indicated that their inferences can be considered valid, as per the alternate hypothesis of objective 1 (Section 3). As a consequence, the findings can be meaningfully compared across reasoning approaches: these are summarised in Table 8 for convergent validity and in Table 9 for the other evaluation metrics. Based on these, the acceptance statuses of Hypotheses 1 and 2 (Section 3) are listed in Table 10.

Table 9: Status of defeasible argumentation (DA) compared to fuzzy reasoning (FR) and expert systems (ES) according to sensitivity, face validity and concurrent validity across the 3 experimental settings (Ea, Eb, Ec). Comparison symbols are used to represent equal (=), better (<) and considerably better (≪) results on average for models built upon defeasible argumentation. A (−) means not applicable. The reasoning approach employed by the best-performing model is listed in the last row.

Table 10: Acceptance statuses of Hypotheses 1 and 2 (Section 3).

Hypothesis 1: Non-monotonic reasoning models will demonstrate moderate to high convergent validity with baseline instruments.
Acceptance status: Accepted for defeasible argumentation and expert systems; partially accepted for fuzzy reasoning, with some models presenting low convergent validity.

Hypothesis 2: Defeasible argumentation models will demonstrate higher sensitivity, higher concurrent validity and higher face validity than fuzzy reasoning and expert system models.
Acceptance status: Partially accepted. On average, sensitivity and validity are consistently better for defeasible argumentation. At the level of individual models, defeasible argumentation achieves better results overall, but expert systems and fuzzy reasoning can produce results of equivalent face and concurrent validity in certain experiments.

Conclusion and future work
This study presented an extensive comparison of non-monotonic rule-based reasoning techniques for the practical problem of mental workload modelling.
These techniques are promising not only because they can approximate the inferential capacity of a knowledge representation and reasoning application, but also because they offer a flexible approach for translating different knowledge bases and beliefs of domain experts into computational rules. Furthermore, they support the creation of models that can be falsified, replicated and extended, thus enhancing the understanding of the construct of mental workload itself and possibly of other applications of interest. Such advantages, for instance, are not provided by data-driven techniques, even those able to produce interpretable solutions. Hence, if they are to be used in other domains of application and by other domain experts, it is necessary to perform a meticulous examination, not performed before, of one of their crucial aspects, namely their inferential capacity. In particular, the inferential capacity of expert systems, non-monotonic fuzzy reasoning and defeasible argumentation models was examined. A set of models, for each reasoning approach, was created following the structures employed in the literature. For instance, expert systems adopted the two common internal components: a knowledge base and an inference engine (Durkin & Durkin, 1998). Fuzzy reasoning models followed the structure of a typical Mamdani fuzzy inference process (Mamdani, 1974). Defeasible argumentation models were constructed based on a 5-layer schema upon which argumentation systems are typically built (Longo, 2016). Nonetheless, the implementation of the non-monotonicity property was not straightforward for expert systems and fuzzy reasoning. The former required different heuristics for aggregating rules and inferring MWL as a numerical index: usual conflict resolution strategies of expert systems could not be employed due to the nature of the domain, which required all the reasoning to be made in a single step. The latter, fuzzy reasoning, had non-monotonicity implemented by using Possibility Theory, with truth values, named possibility and necessity, associated to each piece of information. Possibility allowed fuzzy reasoning models to determine the extent to which data fail to refute the truth of a piece of information, while necessity represented the usual truth values of fuzzy logic. Besides such adaptations, an investigation of configuration parameters was also performed for each reasoning technique for tuning purposes.
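To ground the argumentation side of this description, the sketch below implements the acceptability-evaluation step of the 5-layer schema in abstract form, computing the grounded extension by iterating the characteristic function. This is a simplification for illustration only: the study's models instantiate the full schema and also employ other acceptability semantics, such as the preferred semantics.

from typing import Dict, Set

def grounded_extension(attacks: Dict[str, Set[str]]) -> Set[str]:
    """attacks[a] = set of arguments attacked by a. Starting from the empty
    set, repeatedly collect the arguments whose every attacker is itself
    attacked by the current set, until a fixed point is reached."""
    args = set(attacks)
    attackers = {a: {b for b in args if a in attacks[b]} for a in args}
    accepted: Set[str] = set()
    while True:
        defended = {a for a in args
                    if all(any(b in attacks[c] for c in accepted)
                           for b in attackers[a])}
        if defended == accepted:
            return accepted
        accepted = defended

# A attacks B, B attacks C: A is unattacked and reinstates C
print(sorted(grounded_extension({"A": {"B"}, "B": {"C"}, "C": set()})))  # ['A', 'C']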

Findings indicated how models, or a subset of models, built upon the three reasoning techniques had a good convergent validity with the three selected baseline models of mental workload: the NASA Task Load Index (Hart & Staveland, 1988), its RAW extension (Hart, 2006) and the Workload Profile (WP). Future work should explore further configurations of the selected reasoning techniques, for instance the argumentation semantics designed in (Fan & Toni, 2015) for giving explanations to arguments. Lastly, the application of hybrid reasoning techniques, such as neuro-fuzzy systems (Nauck et al., 1997), genetic fuzzy systems (Cordón et al., 2004) and fuzzy argumentation (Dondio, 2017), is recommended.
Their investigation might lead to possible alternative solutions capable of presenting strong inferential and explanatory capacity for non-monotonic reasoning problems.

Features employed in this knowledge base are the same ones listed in Table A.17. Natural language terms and associated numerical ranges are the same ones listed in Table A.12. The remaining information for modelling and assessing mental workload with this knowledge base is described in the following tables and figures.

Table A.17: Features employed for the modelling and assessment of mental workload and their associated questions (Longo, 2014).

Mental demand: How much mental and perceptual activity was required (e.g., thinking, deciding, calculating, remembering, looking, searching, etc.)? Was the task easy (low mental demand) or complex (high mental demand)?
Temporal demand: How much time pressure did you feel due to the rate or pace at which the tasks or task elements occurred? Was the pace slow and leisurely (low temporal demand) or rapid and frantic (high temporal demand)?
Effort: How much conscious mental effort or concentration was required? Was the task almost automatic (low effort) or did it require total attention (high effort)?
Performance: How successful do you think you were in accomplishing the goal of the task? How satisfied were you with your performance in accomplishing the goal?
Frustration: How secure, gratified, content, relaxed and complacent (low psychological stress) versus insecure, discouraged, irritated, stressed and annoyed (high psychological stress) did you feel during the task?
Solving and deciding: How much attention was required for activities like remembering, problem-solving, decision-making and perceiving (e.g., detecting, recognizing and identifying objects)?
Selection of response: How much attention was required for selecting the proper response channel and its execution? (manual: keyboard/mouse, or speech: voice)
Task and space: How much attention was required for spatial processing (spatially paying attention around you)?
Verbal material: How much attention was required for verbal material (e.g., reading or processing linguistic material or listening to verbal conversations)?
Visual resources: How much attention was required for executing the task based on the information visually received (through the eyes)?
Auditory resources: How much attention was required for executing the task based on the information auditorily received (through the ears)?
Manual response: How much attention was required for manually responding to the task (e.g., keyboard/mouse usage)?
Speech response: How much attention was required for producing the speech response (e.g., engaging in a conversation or talk or answering questions)?
Context bias: How often did interruptions of the task occur? Were distractions (mobile, questions, noise, etc.) not important (low context bias) or did they influence your task (high context bias)?
Past knowledge: How much experience do you have in performing the task or similar tasks on the same website?
Skill: Did your skills have no influence (low) or did they help to execute the task (high)?
Motivation: Were you motivated to complete the task?
Parallelism: Did you perform just this task (low parallelism) or were you doing other parallel tasks (high parallelism) (e.g., multiple tabs/windows/programs)?
Arousal: Were you aroused during the task? Were you sleepy, tired (low arousal) or fully awake and activated (high arousal)?

Label: Internal structure
AD1a: IF low arousal and low task difficulty THEN not PF4
AD1b: IF low arousal and low task difficulty THEN not PF3
AD1c: IF low arousal and low task difficulty THEN not PF2
AD2a: IF low arousal and high task difficulty THEN not PF4
AD2b: IF low arousal and high task difficulty THEN not PF3
AD2c: IF low arousal and high task difficulty THEN not PF2
AD3a: IF medium lower arousal and low task difficulty THEN not PF1
AD3b: IF medium lower arousal and low task difficulty THEN not PF4
AD4a: IF medium lower arousal and high task difficulty THEN not PF1
AD4b: IF medium lower arousal and high task difficulty THEN not PF3
AD4c: IF medium lower arousal and high task difficulty THEN not PF4
AD4d: IF medium upper arousal and high task difficulty THEN not PF1
AD4e: IF medium upper arousal and high task difficulty THEN not PF3
AD4f: IF medium upper arousal and high task difficulty THEN not PF4
AD5a: IF medium upper arousal and low task difficulty THEN not PF1
AD5b: IF medium upper arousal and low task difficulty THEN not PF2
AD5c: IF medium upper arousal and low task difficulty THEN not PF3
AD5d: IF high arousal and low task difficulty THEN not PF1
AD5e: IF high arousal and low task difficulty THEN not PF2
AD5f: IF high arousal and low task difficulty THEN not PF3
AD6a: IF high arousal and high task difficulty THEN not PF2
AD6b: IF …

Appendix B. List of information seeking web-based tasks

Table B.19: List of experimental web-based tasks employed for the measurement of imposed mental workload. Each website had two interfaces: the original one and one slightly modified, generating two tasks for each description. These tasks were first designed and employed in (Longo, 2014).

Appendix C. List of models built using each reasoning approach