"That's (not) the output I expected!" On the role of end user expectations in creating explanations of AI systems



Introduction
Modern lives are increasingly shaped by data-driven decisions, often made by systems that use Artificial Intelligence (AI) and Machine Learning (ML) algorithms. These systems have the potential to augment human well-being in many ways [1]; however, they often operate autonomously as "black boxes" that were not designed for transparent interaction with end users [2][3][4][5][6]. In complex cases, users thus have difficulties understanding the behaviour of such systems [7][8][9], which can result in mistrust and misuse (see, for example, the algorithm aversion problem [10]). To mitigate some of these challenges, authors across relevant disciplines have suggested focusing on transparency and interpretability [9,11,4,5,12-14,1]. Consequently, explainable AI (XAI), which is seen as a way to achieve such transparency and interpretability [15], has recently seen an increase in interest from both the Human Computer Interaction (HCI) and the AI/ML communities [16].

In terms of content, although many explanation types have been proposed [25], appropriate design remains an open question. In general, this touches not only on what content is relevant, but also on how much information is necessary given specific tasks and users, and even on when (or why) explanations are required at all [26,27]. Further, while positive effects of providing explanations have repeatedly been demonstrated [5,[28][29][30][31], there is an associated cost; it is thus important to consider their cost-benefit ratio and to understand in which situations explanations are most likely to be needed and helpful [32][33][34][35][36].
In this context, it is relevant that system requirements (and thus associated aspects such as hardware costs and development time) are, at least in part, dictated by the type of explanation the system is meant to generate. For example, some explanations only need to convey what features of a particular input led an AI-system to make a particular decision, and the generating system thus only needs information about the AI system itself. This type of explanation, describing only which features contributed to the system's output, is termed factual. "This is Wilson Warbler because this bird has a yellow belly and breast with short pointy bill" is thus a factual explanation of a bird classification [37, p. 4] because it highlights the features deemed relevant for the prediction. Such explanations are used, amongst others, by Local Interpretable Model-agnostic Explanations (LIME) [5], Shapley Additive Explanations (SHAP) [38], and saliency methods for highlighting relevant features in images [39].
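To make the idea concrete, the following is a minimal sketch of generating such a factual, keyword-based explanation with LIME for a text classifier. The two-class setup and pipeline are illustrative assumptions, not the exact configuration of any system discussed here.

```python
# Minimal sketch: factual keyword explanations with LIME for text classification.
# The dataset choice and pipeline are illustrative assumptions.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from lime.lime_text import LimeTextExplainer

train = fetch_20newsgroups(subset="train", categories=["sci.med", "rec.autos"])
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train.data, train.target)

explainer = LimeTextExplainer(class_names=train.target_names)
explanation = explainer.explain_instance(
    train.data[0], model.predict_proba, num_features=5)
# Each (word, weight) pair is a feature that contributed to the prediction,
# i.e. the raw material of a factual explanation.
print(explanation.as_list())
```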
Explanations can also specifically address events that users might consider unexpected or abnormal [40,41]. In other words, when users expect to see one output but observe another, such an explanation would address why the observed, rather than the expected, output was produced. Doing so requires a counterfactual element, hence this type of explanation is also referred to as counterfactual. As such, "if the car had detected the pedestrian earlier and braked, the passenger would not have been injured" [42, p. 6277] and "[y]ou were denied a loan because your annual income was £30,000. If your income had been £45,000, you would have been offered a loan" [43, p. 884] are both examples of this type. AI systems producing counterfactual explanations have only recently begun to appear in the literature [43][44][45][46][47].
A typical approach to generating counterfactual explanations is to identify the features that, if minimally changed, would alter the output of the model [48]. However, as we will expand upon in the following section, research in the social sciences suggests that humans often use counterfactual explanations to address the expected outcome rather than the observed one [41]. Identifying such an expected outcome (and thus the appropriate counterfactual) from a possibly very large set of candidates is a more general, and more challenging, problem since it requires an understanding of the user in addition to information about the AI-system itself. Creating such counterfactual explanations is likely to be difficult in practice since there are no straightforward ways for a machine to infer user expectations. It also remains unclear in which situations such explanations (as opposed to types that are easier to produce, e.g. factual explanations) are even required; while there is a rich literature in the social sciences discussing the dynamics of human production and understanding of explanations in various contexts, the degree to which these insights apply to machines is much less explored (but see [41] for a recent XAI-focused discussion).
To address this, we conduct an empirical study in which users are given examples of both factual and counterfactual explanations in the context of an AI-assisted decision-making task involving text classification. The examples are designed such that the system's output either matches user expectations or does not. When it does, the explanation provided is either factual (explaining what keywords led to the text being classified the way it was) or counterfactual (explaining why the classification was chosen over a random incorrect alternative). When, on the other hand, the output is thought to be unexpected from the end user's perspective, we provide either a factual explanation as before, or one of two types of counterfactual explanations (explaining why the given output was chosen either over what the user most likely expected, or over another unexpected category).
Our core hypothesis is that explanations should align with end user expectations. Thus, the ideal system would provide factual explanations when the output is in line with user expectations and counterfactual explanations when it is not. If the hypothesis is supported, it would demonstrate that the system generating explanations does need a way to assess user expectations (a challenging task, as noted before), both to decide what type of explanation to use on a case-by-case basis and to create suitable content for counterfactual explanations.
The remainder of this paper is structured as follows. The next section reviews related work and discusses the motivations for our study in more detail. We then present the details of our experiment, in particular, how we operationalise and measure aspects such as end user expectations. This is followed by both a quantitative and qualitative analysis and a discussion of our results in the context of the wider literature.

Types of interpretable and explanatory methods
While the realisation that there is a need to make AI and ML systems understandable and transparent to end users goes back, at least, to the era of expert systems [49,50], a renewed interest in this topic has appeared more recently with the increased availability of AI and ML systems. Consequently, there are now many different approaches to interpretable ML (see, for example, [51][52][53][54] for recent reviews). Here, we briefly summarise the state of the art in terms of the different strategies by which issues of interpretability and explainability can be tackled.
Lipton [55] and Silva et al. [56] distinguish between models that address transparency (how the model works) and post-hoc explanations (what else the model can tell) [55,56]. The former comprises interpretable models that facilitate some sense of understanding of the mechanism by which the model works [55]. This can be achieved at various levels: at the level of the entire model (e.g., simulatability), at the level of individual components (e.g., parameters and decomposability), or at the level of the training algorithm (algorithmic transparency) [55]. Post-hoc explanations, meanwhile, confer useful information to ML users or practitioners without addressing details of the model's inner workings [55,56]. This is achieved, for instance, through visualisations of models, explanations by example, natural language explanations, or factual explanations [5,38,57,39]. One of the advantages of post-hoc explanations is that interpretations are provided after the fact without sacrificing predictive performance [55].
It is also common to distinguish between local (or instance-level [58]) and global explanations, dividing work on interpretable ML models into three categories [59]. The first category is roughly equivalent to Lipton's [55] interpretable models. This kind of work learns predictive models that are, somehow, understandable by humans and thus inherently interpretable (for example, [60,61]). The second and third categories comprise approaches that explain more complex models, such as deep neural networks and random forests. In practice, such complex models are often preferred since they often outperform their more interpretable counterparts [5,60]. To nonetheless explain their workings, one strategy is to build local explanations for individual predictions of the black box. Another is to provide a global explanation that describes the black box as a whole, normally employing an interpretable model [59]. Thus, consistency is either maintained locally, in the sense that a given explanation that is true for a data point also applies to its neighbours (see, e.g., [5,38]), or globally, in the sense that an explanation applies to most data points in a class (see, e.g., [62,63]).
In spite of this diversity in approaches and recent attempts to categorise them (see, for instance, taxonomies, ontologies and toolkits in [64][65][66][25]), there remains a lack of clear guidelines regarding what type of explanation or explanatory method should be used depending on, for example, the context, task, or type of end user. There are also no clear guidelines on what content is appropriate, how much information is necessary, and when explanations are required (if at all) [26,27,64].

Contrastive and counterfactual explanations
Work from the social sciences demonstrates that human explanations are often contrastive [41,67]: they do not merely address a particular prediction that was made but rather why it was made instead of another one [68]. For example, when humans expect a particular event Q but observe another event P, the question in their minds is "Why P rather than Q?" Here, P and Q are called "fact" and "foil" respectively [68]. More specifically, Q represents something that was expected but did not occur [41]. It is thus the counterfactual that the explanation is expected to address. To complicate matters, when humans articulate such a question, they often leave the foil implicit [41], asking simply "Why P?" It is therefore rarely explicit what exactly was expected instead of P.
In XAI, the terms "contrastive" and "counterfactual" have often been used interchangeably. In this paper, we use the term counterfactual and adopt the terminology from Lipton [68] and Miller [41], using the above definitions of P and Q as fact and foil, respectively. A counterfactual explanation is thus one that addresses the foil.
The main advantage of such counterfactual explanations (addressing why P was chosen over Q) is that they are intuitive and less cognitively demanding for both questioner and explainer [41]: there is no need to reason about (or even know) all causes of P; it is sufficient to address only those relative to the foil Q. However, since humans often leave the foil implicit, it is up to the explainer to infer the intended one. For XAI, this is a significant challenge because the set of candidate foils in any given question is possibly infinite. To give an example from Miller [41, p. 16], the question "Why did Elizabeth open the door?" has many possible foils, including, e.g., "Why did Elizabeth open the door, rather than leave it closed?" and "Why did Elizabeth open the door rather than the window?" At the same time, using counterfactual explanations sidesteps the substantial challenge of having to explain the internal workings of complex ML systems [43]; they can provide information to the user that is both easily digestible and useful for understanding the reasons underlying a decision. This also enables users, if needed, to challenge these decisions and thereby influence the future behaviour of the system such that performance increases. Creating such explanations is, however, a challenging task in itself [69,70,45,71,72], and significant effort is put into devising methods for generating appropriate counterfactuals [44,73,74,43,47,48,46].
One way to generate counterfactual explanations is to identify features that, if minimally changed, would alter the output of the model [48]. To do so, one can frame the search as an optimisation problem, seeking the closest hypothetical point that would be classified differently from the point currently in question [48], although defining an appropriate distance metric is far from trivial (see [43,75,76] for examples of this approach). By contrast, a model-agnostic approach, termed SEDC, for generating counterfactuals is presented in [58]. SEDC uses a heuristic best-first search to find evidence that counterfactually explains predictions of any classification model for behavioural and text data. Compared with other approaches such as LIME-Counterfactual [5] and SHAP-Counterfactual [38], SEDC is typically found to be more efficient, although LIME-C and SHAP-C have low and stable computation times [44].
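As an illustration of the minimal-change idea, the sketch below performs a greedy word-removal search over a text, in the spirit of SEDC's best-first search but much simplified. The classifier interface (a scikit-learn-style pipeline over raw text) and all names are assumptions, not the published algorithm.

```python
# Simplified sketch of a SEDC-style search: find a small set of words whose
# removal flips the classifier away from its current prediction.
# `model` is assumed to be a scikit-learn-style pipeline over raw text.

def minimal_counterfactual(text, model, max_removals=5):
    words = text.split()
    original_class = model.predict([text])[0]
    class_idx = list(model.classes_).index(original_class)
    removed = []
    for _ in range(max_removals):
        if model.predict([" ".join(words)])[0] != original_class:
            return removed  # these removals counterfactually explain the output
        # Greedy step: remove the word whose absence most lowers the
        # probability of the originally predicted class.
        candidates = []
        for i in range(len(words)):
            reduced = " ".join(words[:i] + words[i + 1:])
            p = model.predict_proba([reduced])[0][class_idx]
            candidates.append((p, i))
        _, best_i = min(candidates)
        removed.append(words.pop(best_i))
    return None  # no small counterfactual found within the budget
```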
While different approaches for generating counterfactuals exist and have been shown to be useful in various specific domains (see, for example, [77,78]), there remain many open challenges. Here, we address two of them by building on the previously discussed insights from the social sciences [41], deciding (1) in which situations a counterfactual explanation is appropriate and (2) if it is appropriate, which counterfactual to use in the explanation. We hypothesise that explanations need to align with the expectations of the user and thus that (1) counterfactuals should only be used if the system output is different from what the user expected and (2) when used, counterfactuals should address the expected output.
These questions highlight potentially significant challenges for work in XAI: if counterfactual explanations are indeed needed when the system's output does not match what users expected, and the appropriate foil is defined by this expectation, then producing suitable explanations will be, as discussed above, a very difficult task. For that reason, it is important to understand under which conditions this type of explanation is needed (as well as under which conditions it is not).

Aim and hypotheses
To summarise the previous section, there are good reasons to expect that humans interact with intelligent systems and computers in ways that directly resemble how they interact with other humans [33,79,80]. Further, in human-human interactions, explanations are often needed when an event is unexpected, and they need to explain the unexpected fact in relation to an implicit expected foil [41]. Overall, we expect that the type of explanation given by a system should be congruent with the output that users expected, leading to the following two hypotheses for this study:

H1: Factual explanations are appropriate for correct predictions because the system output is in line with the expected output.

H2: Counterfactual explanations that contain the expected foil are appropriate when the system prediction is incorrect, because this situation mirrors the use of counterfactual explanations in human-human interactions.
To test these hypotheses, we showed participants small pieces of text along with their class as predicted by an AI-based text classification system and an explanation of the system's output. We chose this scenario because determining the class of simple pieces of text is something that humans also generally do well. It is therefore reasonable to expect that the participant's expectation of the system's output corresponds to the true class of the text, and, consequently, that the latter can be used to determine the various conditions in the experimental design.
Further, given the current lack of guidelines for what type of explanation or explanatory method should be provided in a given context [26,27], we also address a few additional questions using the data obtained during the experiment; namely: (1) how factual and counterfactual explanations affect users' mental models, (2) how the types of explanations used in this study are perceived in terms of content and completeness, and (3) whether there is a need for an interactive explanation system (which also connects to discussions regarding the dynamics of explanations, and whether they are better understood as a product or a process [41]).

Experimental design
Conditions The factors and levels of the experiment were as follows. The AI-system produced a prediction that was either correct or incorrect (two levels). The Explanation-system, on the other hand, produced three types of explanations: factual (F), counterfactual with the correct (and thus expected) output included (CC), and counterfactual including an incorrect (unexpected) output (CI) (three levels). There were thus two independent variables (type of prediction and type of explanation) and a 2×3 experimental layout. Since we used a between-subject design, this resulted in six subject groups or conditions; see Table 1.
An important assumption in this design is that participants will generally be able to infer, by themselves, the true class of the text to be classified. If this is the case, then we can reasonably expect that a correct classification will be the expected output, and an incorrect one will not be. If this assumption were violated, the number of data points in each condition could potentially be heavily unbalanced, to the point where some aspects might not be tested at all. However, it is possible to test whether participants' expectations in our study were consistent with this assumption, which we address as the first point of our results section below.

Measures Table 2 summarises the main measures used in this study. They primarily address (1) how satisfied participants were with the given explanations and (2) the degree to which they understood local and global AI-system behaviour and were able to form mental models. We used two strategies to assess explanation satisfaction. First, we included the question "How satisfying did you find the explanation in terms of understanding why the system made its classification?" in the main task after the participants read the text, prediction, and explanation on each screen. This was answered on a 5-point Likert scale ranging from "Not at all satisfying" to "Highly satisfying". Second, we designed a questionnaire based on the review and scales presented by Hoffman et al. [81, e.g., Appendix C, pp. 39-40] that contained closed and open-ended questions regarding the explanations provided, assessing satisfaction, completeness, and how useful the explanations were in supporting the understanding of the AI-system's behaviour.
A second questionnaire focused on assessing participants' understanding of the behaviour of the AI-system. Following, once again, the summary presented by Hoffman et al. [81], we included questions that assess the understanding of the local (particular prediction) and global behaviour of the AI-system, and the learned models for each class. Therefore, we asked participants whether they could predict the outcomes of the system (as done recently by, e.g., [82] and listed by Hoffman et al. [81, p. 11] as a method to elicit mental models). They were also asked to rate the AI-system's performance for each class, showing whether they had discovered how well the AI-system classified each category (see section 3.3 for more details regarding the AI-system's hidden behaviour).
All questionnaire items were either five-point Likert-type scale questions or open-ended questions, with the exception of rating the AI-system's performance for the various classes, where the participants had to choose between "Poor", "Fair", "No opinion", "Good", and "Excellent". Details of the questions can be found in Table 2 and Fig. B.12.

Pilot Before running our experimental study, we carried out a pilot study with 20 participants that led to some changes in the formulation of both the main task and the questionnaires. The pilot study was also used to determine how many texts should be classified within a reasonable amount of time; we decided that 18 texts would be appropriate for an average completion time of 25 min.

Dataset, apparatus and stimuli
Dataset We chose the 20 Newsgroups text dataset [83], a collection of approximately 20,000 newsgroup documents, partitioned across 20 different newsgroups. This collection has become a popular corpus for ML experiments on text classification [23,84]. In an effort to keep the classification task simple, yet elaborate enough to allow the construction of all counterfactuals, we selected three existing classes from the collection:

1. politics (talk.politics in the dataset, including texts from the talk.politics.misc, talk.politics.guns, and talk.politics.mideast newsgroups),
2. science (sci in the dataset, including texts from sci.crypt, sci.electronics, sci.med, and sci.space), and
3. leisure (rec in the dataset, including texts from rec.motorcycles, rec.sport.baseball, rec.autos, and rec.sport.hockey).
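This grouping can be reproduced with scikit-learn's built-in loader; the sketch below shows one way to assemble the three classes. The preprocessing choices (e.g., stripping headers) are assumptions, not necessarily those used in the study.

```python
# Sketch: assembling the three coarse classes from the 20 Newsgroups corpus.
from sklearn.datasets import fetch_20newsgroups

CLASS_GROUPS = {
    "politics": ["talk.politics.misc", "talk.politics.guns", "talk.politics.mideast"],
    "science": ["sci.crypt", "sci.electronics", "sci.med", "sci.space"],
    "leisure": ["rec.motorcycles", "rec.sport.baseball", "rec.autos", "rec.sport.hockey"],
}

texts, labels = [], []
for label, newsgroups in CLASS_GROUPS.items():
    bunch = fetch_20newsgroups(subset="all", categories=newsgroups,
                               remove=("headers", "footers", "quotes"))
    texts.extend(bunch.data)
    labels.extend([label] * len(bunch.data))
```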
The AI- and Explanation-systems Participants were told that an AI-system would classify random texts and emails found online into the three classes mentioned above, and that an Explanation-system would generate an explanation regarding why the AI-system made these classifications. Fig. 1 was shown to the participants to illustrate the relation between both systems.

Table 2
Measures and example questions used for each of the measures. QI refers to the Explanation-system questionnaire and QII to the AI-system questionnaire; see Fig. B.12.

Measure / Measured in

Completeness, overall satisfaction and content of explanations
- "The explanations provided of how the AI-system works have sufficient detail." (1-5) [QI]
- "The explanations provided regarding how the AI-system classifies the text seem complete."

Perceived need for an interactive explanation-system
- "I would have liked to have an interactive explanation system that would answer my questions." (1-5) [QI]
- "If you would have liked to have an interactive explanation system, what would you like that system to be like?" (Open-ended question) [QII]

The online behavioural science platform Gorilla [85], which includes the Gorilla Experiment Builder platform, was used to implement and conduct the experiment.
We selected texts from the 20 Newsgroups dataset and the three classes according to the following two criteria: (1) the texts did not include any personal or controversial information, and (2) they consisted of one or two paragraphs that could fit the design of the webpage (thereby discarding texts that were too long or too short). To classify the selected texts and generate the keywords used in the explanations, we used an adapted version of the code (available on GitHub) by Ribeiro et al. [5] that uses a multinomial Naïve Bayes classifier (MultinomialNB) from the scikit-learn library [86]. Using the prediction probabilities of this classifier, we then also discarded texts that were either too easy or too challenging to classify.
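A minimal sketch of this classification and filtering step follows, assuming the texts and labels assembled above. The probability thresholds are illustrative, since the exact cut-offs are not reported here.

```python
# Sketch: train the Naive Bayes classifier and filter out texts that are
# too easy or too hard to classify, judged by the top-class probability.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(texts, labels)  # texts, labels as assembled from the dataset

probabilities = model.predict_proba(texts).max(axis=1)
# Illustrative thresholds: clearly above chance (1/3) but not near-certain.
selected = [t for t, p in zip(texts, probabilities) if 0.5 < p < 0.95]
```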
Fig. 1. An AI-system classifies text into one of three classes: politics, science or leisure. This classification can be correct or incorrect. The Explanation-system provides either a factual explanation based on the most important words for the predicted class, or a counterfactual explanation based on words missing from the text that would, if present, have led to another class being selected. This other class (the foil) could be either the correct class of the text, or another incorrect class.
Finally, our experimental design required participants also to be given clearly incorrect predictions from the AI-system. Thus, we modified the classifier such that the class politics would always be predicted correctly, the class science would randomly be predicted correctly 50% of the time, and the class leisure would always be predicted incorrectly. We balanced classes across the counterfactuals, ensuring that no class was over-represented. Participants were not told about this aspect of the system; we therefore refer to it as the "hidden" global behaviour of the system. This is in contrast with the local behaviour of the system in terms of each particular classification and associated explanation that participants did see. We again used the class membership probabilities given by the classifier to assign the incorrect class. The purpose of this hidden global behaviour was to test whether the participants built mental models of the system based on the local explanations.
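A sketch of how such a hidden global behaviour could be implemented as a wrapper around the classifier is given below; the function and variable names are ours, not from the study.

```python
# Sketch: wrap the classifier so that politics is always predicted correctly,
# science is correct 50% of the time, and leisure is always incorrect.
# When forcing an error, fall back on the class probabilities, as described.
import random

def hidden_prediction(model, text, true_class):
    if true_class == "politics":
        return true_class
    if true_class == "science" and random.random() < 0.5:
        return true_class
    # Assign the most probable class other than the true one.
    probs = model.predict_proba([text])[0]
    ranked = sorted(zip(model.classes_, probs), key=lambda pair: -pair[1])
    return next(cls for cls, _ in ranked if cls != true_class)
```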
Stimuli: explanations Two types of explanations, factual and counterfactual, were generated for the selected texts. The foil used in the counterfactual explanations for incorrect predictions could be either the correct text class or another incorrect one. We used LIME [5] and an adapted version of its implementation for the multiclass case provided by the authors on GitHub to help us generate such explanations. Based on the words highlighted by LIME (keywords) and the three models built for each of the classes, we manually built readable and narrative explanations for non-data scientists. For consistency, all explanations followed the same sentence structure, substituting only the keywords from each text. Thus, we built nine different explanations (since each incorrect prediction has two possible outcomes) for each text. An illustration of the process is shown in Fig. 2, and a full example including a particular text and all possible explanations is provided in Appendix C. The following are example explanations generated for one of the texts:

Factual explanation, correct prediction: The AI-system classifies the text as politics because words as economy, Americans and percent were found.
Factual explanation, incorrect prediction: [Predicted class science] The AI-system classifies the text as science because words as survey and harvard were found in the text.
Counterfactual explanation, correct prediction: [predicted class politics, foil science] The AI-system classifies the text as politics instead of science because words such as experiment or investigation were not found (even though the words survey and harvard were).
Counterfactual explanation with the correct class, incorrect pred.: [predicted class science/leisure, foil politics] The AI-system classifies the text as science/leisure instead of politics because words such as financial or growth were not found (even though the words American and percent were).
Counterfactual explanation with the incorrect class, incorrect pred.: [predicted class science, foil leisure] The AI-system classifies the text as science instead of leisure because words such as travel or vacation were not found (even though the words Japan and worldwide were).
In total, we used 20 texts from each class, and of those, based on the completion time determined in the pilot study, we selected 18 in total for our experimental study (six for each class). For all of them, we generated all the possible explanations given correct and incorrect predictions. For each incorrect prediction, there were two possible outcomes from the AI-system, and in turn, the counterfactual explanation could include the correct class or not. For example, if the correct class of a text was science, the incorrect outcomes were politics or leisure. A counterfactual explanation of a correct prediction had to include one of the incorrect classes as the foil, while a counterfactual explanation of an incorrect prediction could include a foil that was either the correct class (science in this example) or the remaining incorrect class (leisure or politics, respectively).
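Since all explanations shared a fixed sentence structure, they can be expressed as simple templates over the LIME keywords. The sketch below mirrors the example explanations given above; the helper names and keyword arguments are our own.

```python
# Sketch: the fixed sentence templates for the explanations, filled with
# keywords extracted via LIME. Wording mirrors the examples in the text.

def factual(predicted_class, keywords):
    return (f"The AI-system classifies the text as {predicted_class} "
            f"because words as {', '.join(keywords)} were found in the text.")

def counterfactual(predicted_class, foil, missing_words, present_words):
    return (f"The AI-system classifies the text as {predicted_class} instead of "
            f"{foil} because words such as {' or '.join(missing_words)} were not "
            f"found (even though the words {' and '.join(present_words)} were).")

print(factual("politics", ["economy", "Americans", "percent"]))
print(counterfactual("science", "politics",
                     ["financial", "growth"], ["American", "percent"]))
```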

Procedure
Fig. 3. Experimental design, procedure and participant's view of the experiment. After the consent, instructions and demographics form, each participant is randomly assigned to one of the six groups (between-subjects). Each group sees a different set of explanations (as shown in Table 1). Illustrative examples of the explanations are given in section 3.3 and at the bottom of the figure. Each participant goes through 18 (randomly presented) texts to classify and, finally, answers two questionnaires.

Fig. 4. Snippets of the on-line experiment interface: text to classify, prediction by the AI-system, and explanation. The participants needed to answer two questions: (1) "How satisfying did you find the explanation in terms of understanding why the system made its classification?" (1-5) and (2) "Do you agree with the classification made by the AI-system?" (Yes/No). If the answer was No, an additional question (as shown at the bottom) was displayed; the participant would then enter which class he/she thought was the correct one (politics, science, leisure).
The experiment had three stages: (1) introduction, (2) main classification task, and (3) evaluation questionnaires (see the overall procedure in Fig. 3). The first stage consisted of a consent and basic demographics (gender, age and education) form, followed by the instructions for carrying out the task (see a screenshot of the instructions in Appendix A.11). The overall aim of the task presented to the participants was to help improve and evaluate an AI-system that classified random texts and emails found online. The main task, stage 2, consisted of 18 texts to classify (6 per class), one text per screen (Fig. 4). All participants received the same texts in randomised order. After each text, the system predicted one of the three possible classes (politics, science and leisure). Each prediction was followed by an explanation (generated by the Explanation-system) as to why the AI-system made that classification. The type of explanation participants received depended on their experimental group, as described in Table 1.
On each screen of the main task, participants had to answer two questions, one regarding how satisfied they were with the explanation provided ("How satisfying did you find the explanation in terms of understanding why the system made its classification?") and one on whether or not they agreed with the classification made by the AI-system. If the participant did not agree, they had to specify which category they thought was the correct one. The instructions indicated that providing the correct class would help the system become better: "you will help the AI-system to learn from you and be better in the future". All questions had to be answered. The interface during the main classification task is shown in Fig. 4.
Finally, two questionnaires were given to the participants, one focusing on the Explanation-system and one on the AI-system (see Appendix Fig. B.12 and the previous description of measures in Table 2).

Participants
200 participants were recruited through the online platform Prolific, of whom 181 were retained for the final analysis. The selection criteria for participating in the study were that participants were fluent in English and aged between 18 and 65. Participants were paid £3.50 for completing the test; the payment was made through Prolific. The age distribution of the participants was as follows: 60 participants were between 18-24 years, 46 between 25-30, 46 between 31-40, 20 between 41-50, 6 between 51-60, and 3 between 61-65. The mean age was 31.15 (σ = 10.1). 91 of the participants were male (50.28%) and 90 were female (49.72%).
The highest academic qualification reported by the participants was "high school" for 63, "bachelor's degree" for 70, "master's degree" for 38, and "other" for 10. Each participant was randomly assigned (balanced) to one of the six conditions or groups; after removing and cleaning the data, the number of valid participants per group was as reported in Table 3.

Table 4
Measured variables and corresponding questions for an overall appraisal of the explanations given by the system, as well as the figures in which the corresponding visualisation of the responses can be found. Note that both Likert scale and open-ended questions were used.
Measure / Questionnaire item(s) / Figure

Satisfaction
- "How satisfying did you find the explanation in terms of understanding why the system made its decision?" (Fig. 5a)
- "The explanations provided of how the AI-system classifies text are satisfying." (Fig. 7a)

Completeness
- "The explanations provided regarding how the AI-system classifies the text seem complete." (Fig. 7b)
- "Would you have liked for the explanations to contain additional information? If so, what type of information and when, i.e., in which situations?" (Open-ended question)

Sufficient detail
- "The explanations provided of how the AI-system works have sufficient detail." (Fig. 7c)

Understanding
- "From the explanations provided, I understand how the AI-system works." (Fig. 7d)

Collected data
Table 3 summarises the final number of participants per condition and the corresponding number of data points collected (18 per participant). First, we verified that participants were able to determine the true class of each text and that, therefore, it is valid to use the correct output as a proxy for user-expected output, as assumed in our experimental design. To do so, we checked how often participants agreed with a correct AI-system output and disagreed with an incorrect one (which we refer to as behaviour that is consistent with the assumption underlying the experimental design). We found that users were consistent about 90% of the time (see Table 3), which indicates that the assumption was reasonable. We excluded the inconsistent responses from most of the analysis below because, for these responses, whether the system output was correct cannot be used as a proxy for user expectations. While one might consider "reassigning" inconsistent data points to other conditions, this would distribute responses from individual participants over several groups, which should be avoided. We do, however, describe these responses qualitatively in the next section.
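The consistency check can be stated compactly: a response is consistent when agreement coincides with a correct output and disagreement with an incorrect one. A sketch follows, with an assumed data layout:

```python
# Sketch: fraction of responses consistent with the design assumption.
# Each response is an assumed (system_was_correct, user_agreed) boolean pair.

def consistency_rate(responses):
    consistent = sum(1 for was_correct, agreed in responses
                     if agreed == was_correct)
    return consistent / len(responses)

# In the study, this rate was about 0.9 across conditions (see Table 3).
```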

Satisfaction with the given explanations
To address our main hypotheses, we first analysed participant responses to questions related to their satisfaction with the explanations given by the system (see Table 4). In particular, we considered satisfaction with both individual explanations and the overall impression (how satisfying explanations were overall, whether they contained sufficient detail, how complete they were perceived to be, and to what degree they helped in understanding the behaviour of the AI-system).
To test for the statistical significance of differences between the explanation types in the Likert scale responses, we used ordinal regression models; specifically, cumulative link models (CLMs [90,91]) provided by the R package ordinal, and tested for significance using Type II Analysis of Deviance (ANODE) tests [90]. Although Likert scale data are sometimes analysed using parametric tests, the appropriateness of such tests remains a matter of debate, and to err on the side of caution it is recommended not to apply them to ordinal data [92]. Although a Mann-Whitney U test would be a non-parametric alternative, it has recently been argued that ordinal logistic regression, e.g. CLMs, is preferable [90].
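The analysis itself was run in R with the ordinal package; for readers working in Python, a roughly equivalent cumulative-link (proportional-odds) fit is sketched below using statsmodels' OrderedModel. The file and column names are assumptions about the data layout, not the study's actual files.

```python
# Sketch: a cumulative-link (proportional-odds) model in Python, standing in
# for the R ordinal package's clm(). File and column names are hypothetical.
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

df = pd.read_csv("satisfaction_ratings.csv")  # hypothetical data export
df["rating"] = pd.Categorical(df["rating"], categories=[1, 2, 3, 4, 5],
                              ordered=True)
predictors = pd.get_dummies(df["explanation_type"],
                            drop_first=True).astype(float)

model = OrderedModel(df["rating"], predictors, distr="logit")
result = model.fit(method="bfgs", disp=False)
print(result.summary())  # coefficients and thresholds of the fitted CLM
```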
The core assumption behind CLMs is the so-called proportional odds assumption. In the analyses below, we checked this assumption for each test. If the assumption was violated, we modified the model to allow different scales for the variables in question (which is the preferred way to approach this violation and is directly supported by the ordinal package [91]) and note this when results are reported.

Fig. 5. Distribution of satisfaction ratings per group for all data points, as well as only instances in which users agreed with the AI-system's output and those where users disagreed with the AI-system's output. Numbers below the x-axis indicate the total number of points in each group (as per Table 3).

Table 5
Summary statistics for the data presented in Fig. 5.

Satisfaction ratings for individual explanations
Fig. 5 presents a visual representation of the Likert scale satisfaction ratings of the individual responses provided during the classification task for each given explanation (i.e., the answers to the question "How satisfying did you find the explanation in terms of understanding why the system made its decision?"). Table 5 contains the corresponding descriptive statistics. An interesting observation is that the rating was strongly dependent on whether or not participants agreed with the system output, as revealed by splitting the responses accordingly (Fig. 5, bottom-row plots).
In this context, it is also possible to consider the inconsistent data points that are excluded from the remaining analysis (Fig. 6). The observation above remains true even when user behaviour is inconsistent, as illustrated in the bottom-row plots of Fig. 6, showing that users' expectations were likely an important factor in their satisfaction. Although visual inspection of the two figures suggests that actual system performance was also relevant (for example, satisfaction appeared to be higher when the system was actually correct; see the bottom-row left plots in Figs. 5 and 6), the heavily unbalanced number of data points in each group prevents strong statements to this effect.
Statistically, we found a significant effect only for the explanation type when the system output was correct (Table 6): in this case, factual explanations scored higher. Our first hypothesis (H1) is therefore supported, but H2 is not. This is also illustrated by the separation into correct and incorrect output shown in Fig. 5, in which higher satisfaction can be observed for the groups that received factual explanations for correct outputs, while none of the explanation types for incorrect outputs appeared to modulate satisfaction. Overall, this indicates that none of the explanation designs used in the present study, including, surprisingly, counterfactual explanations, were optimal when end users expected the AI-system to produce an output different from what was actually produced.

Overall satisfaction, completeness and content of explanations
Since the overall impression of explanations can differ from individual responses, we next turned to responses regarding overall aspects of the explanation system. Fig. 7 summarises the responses regarding overall satisfaction with the explanations, whether they had sufficient detail, appeared complete, and helped in understanding the AI-system. The corresponding descriptive statistics are given in Table 7. For all questions, we again found a significant effect only for the explanation type when the system output was in line with the true class of the text, with, as before, participants assigning higher scores to these measures for factual explanations, in line with H1, while H2 receives no support (see Table 8).

We also analysed whether participants needed additional information in the explanations through a content analysis of the answers to the open-ended question on additional information. Specifically, the comments were first analysed to identify categories and then labelled manually by one coder. The process of extracting the codes was iterative, going from an initial value of 11 to a final value of 5. For groups [PC:EF/PI:EF], [PC:EF/PI:ECC], [PC:EF/PI:ECI] and [PC:EC/PI:ECC], around a third of the participants (11-13) in each group commented that they did not need additional information. For groups [PC:EC/PI:EF] and [PC:EC/PI:ECI], however, only 6 and 4 participants respectively indicated no need for additional information. This is interesting insofar as the groups covered by both our hypotheses (H1: [PC:EF/PI:EF] and [PC:EF/PI:ECI]; H2: [PC:EF/PI:ECC] and [PC:EC/PI:ECC]) showed the least need for additional information. In terms of what additional information participants would have liked, we found that it clustered into the following five categories (see evidence for each category in Table 9):

Context. This suggestion was voiced by participants in all groups and refers to the fact that our explanations only contained single words, while participants would have liked to see sentences and connections between words, colloquialisms, slang, or the various meanings of the words in relation to the context of the text.

Table 7
Summary statistics for the data presented in Fig. 7.

Factual explanations. All groups that received a counterfactual explanation when the system output did not align with the true class (i.e., all groups except for [PC:EF/PI:EF] and [PC:EC/PI:EF]) had participants who asked for factual explanations, i.e., why the predicted class was chosen instead of the alternative one included in the counterfactual explanation. This supports the previous result that counterfactual explanations, at least by themselves, appear to be insufficient.
Metrics. Some participants asked for quantified information, such as a matching percentage for each class.
Global model, process and behaviour. More information about the classification process at a higher abstraction level, and about why the system selected some words and not others, was requested.
More examples. Some participants would simply have liked to see more words in the explanations, for example, more keywords that contributed positively and negatively to the final prediction.
Other additional suggestions related to the constraints of the study itself. For example, some participants would have liked more refined categories and subcategories for classification. For instance, some felt that the "science" category was too broad ("I think the categories Science, Leisure and Politics could have been broken down further, for example, Science could also be Medicine or Physics, that would have made the AI more accurate in my opinion"; [PC:EF/PI:ECC]) or that there simply could be more categories ("There are certain situations in which more categories would be beneficial. A category for sport, religion, medicine, etc. would likely be beneficial if properly implemented" [PC:EC/PI:ECC]). It is worth bearing in mind that the whole newsgroups dataset contains 6 topics and 20 subcategories, of which 3 topics were used in this study.

Table 9
Example evidence for the main classes of requests for additional information voiced by participants, as well as the groups this evidence is sampled from.

Context
- "The AI seemed to pounce on certain words and not look at the context" (PC:EF/PI:EF)
- "I would've preferred the explanations to contain more contextual information rather than assumptions made with the absence/presence of certain words" (PC:EC/PI:ECC)
- "Some description of whether or not the AI recognises context in anyway. The impression given by the explanations is that the AI relies on simple word matching, which is plainly insufficient" (PC:EC/PI:ECI)

Factual explanations
- "The explanations told when it didn't classify the text as an option because certain words weren't used. I would've liked to have seen more about how it considered which words that WERE used" (PC:EC/PI:ECC)
- "I would have liked additional information on why the incorrect choices were made by the AI, and not the reasons why the correct answer was not chosen" (PC:EF/PI:ECC)
- "I would have liked to see why the AI system classified texts based on which were included (as opposed to the model of which words were omitted)" (PC:EC/PI:ECI)

Metrics
- "I think that the AI should have looked for more keywords to gauge the subject of each email. Perhaps adding a percentage meter showing how much the subject matches its findings would be helpful" (PC:EF/PI:ECI)
- "Information on why certain classifications were excluded, if the system is based on 'scoring' significant words, information on scores would be useful" (PC:EF/PI:ECI)

Global model, process and behaviour
- "Anything that helped me understand better what is the algorithm behind the classification system employed by the AI" (PC:EF/PI:ECC)
- "I would like to know how it selected the relevant words and why it ignores other words" (PC:EC/PI:EF)
- "The amount of words that the AI has associated for each category" (PC:EF/PI:ECC)

More examples
- "More words that it picked up to decide the category in which the text fits into" (PC:EC/PI:EF)
- "I would have liked the explanations to include more words that they screened and more words that they ruled out as being relevant to the category" (PC:EF/PI:ECI)

Perception of the AI-system
One of the goals of using XAI solutions can be to support the creation of an understanding of the inner workings of an AI-system, for example, such that users are able to build an accurate mental model of the system. Here, we analysed the effect that different types of explanations had on this. Table 10 lists the measures and corresponding questions that were used to this effect.

Perceived understanding of AI-system's inner workings
We first looked at whether participants felt they could understand the system locally, globally, both locally and globally, and in terms of its limitations and the mistakes it made (see the visualisations of the responses in Fig. 8 and the full questions in Table 10).

Table 10
Measured variables and corresponding questions investigating participants' perception and understanding of the functionality of the AI-system based on the explanations provided. Note that the questions regarding the perceived performance of the AI-system were open-ended questions.

Measure / Questionnaire item / Figure

Understanding of the AI system
- "The explanations provided help me to understand a particular prediction made by the AI-system." (Fig. 8a)
- "The explanations provided help me to understand the global behaviour of the AI-system." (Fig. 8b)
- "The explanations provided help me to understand a particular prediction made by the AI-system but also the global behaviour of the AI-system." (Fig. 8c)
- "The explanations provided help me to understand the limitations and mistakes of the AI-system." (Fig. 8d)

Predictability of the AI system
- "I know what will happen the next time I use the AI-system because I understand how it behaves." (Fig. 9a)
- "The outputs of the AI-system are very predictable." (Fig. 9b)

Performance of the AI system
- "Do you think the AI-system performed well considering the classifications you have seen?" (N/A)
- "From your point of view, does the AI system need improvement?" (N/A)
- "Do you think that the AI-system classifies the different types of text equally?" (N/A)

Fig. 8. Answers to the AI-system questionnaire measuring whether users believe that explanations help to understand the behaviour of the text classifier (a) locally, (b) globally, (c) locally and globally, as well as whether they help to understand (d) the classifier's limitations and causes of its mistakes. See Table 10 for the complete phrasing of the questions used.
We found effects of the explanation type, for both correct and incorrect outputs with respect to the ground truth, only for local understanding (Table 12), with higher scores given to factual explanations for correct output and to counterfactual explanations with the correct class for incorrect output (see the summary statistics in Table 11). This is in line with both H1 and H2. We also found that the explanation type for incorrect output had a significant effect on the perceived ability to understand the system's limitations and mistakes (Table 12), again with higher scores given to counterfactual explanations with the correct class. Lastly, it is worth noting that the best overall ratings (in terms of most positive and least negative) among [PC:EF/PI:EF]-[PC:EF/PI:ECI] and [PC:EC/PI:EF]-[PC:EC/PI:ECI] are observed for [PC:EF/PI:ECC] and [PC:EC/PI:ECC] respectively, i.e., for the groups that received counterfactual explanations with the correct class when the system output was not in line with the true class of the text. This provides partial qualitative support for H2, although no strong statements can be made given that, statistically speaking, no significant effects were observed.

Ability to predict AI-system behaviour and performance
Prediction tasks are quick windows into users' mental models of the system [81]. We measured both the participants' ability to predict the AI-system's outputs and whether they considered the system to be predictable (Fig. 9 and the corresponding summary statistics in Table 14). We only found a significant effect of the explanation type for correct output on participants' own ability to predict (Table 13), which is partial support for H1; no further statistically significant differences were found (see Table 15).

Perceived performance of the AI-system
Lastly, we examined whether participants picked up on the hidden model of the system (see the Methods section; the politics, science and leisure classes had accuracies of 100%, 50% and 0%, respectively) through open-ended questions about perceived performance (see Table 10). There was a great variety of answers to these questions. Many participants highlighted that the AI-system needed improvement and suggested a wide range of amendments. Some were in line with previous suggestions for additional content for the explanations (e.g., consider context, use refined categories and display a larger number of relevant words).
Overall, we found that between 34% (for [PC:EC/PI:ECC]) and 50% (for [PC:EF/PI:ECC]) of the participants recognised the hidden model, and an additional 14-20% partially recognised it (for example, picking up that the leisure category demonstrated worse performance). Interestingly, participants attempted to rationalise this, speculating, for example, that this category has a broader vocabulary than, e.g., science, which might have more specialised keywords. However, there were no notable differences between the groups. As such, while these open-ended questions yielded insights into participants' ability to infer hidden models, there was no strong support for either hypothesis.

Perceived need for an interactive explanation-system
As a final point of interest, we considered the possibility that participants might have preferred an interactive explanation system rather than the type used here. Likert scale responses to "I would have liked to have an interactive explanation system that would answer my questions" are shown in Fig. 10, with the corresponding summary statistics in Table 16. Although there were variations between the groups, around one third of the participants would have liked an interactive explanation system, while the rest showed no clear preference ([PC:EC/PI:ECC] was the group most inclined towards interactivity).

Table 17
Example evidence for interactive explanation system designs suggested by participants.
Along the same lines, answers to the open-ended question "If you would have liked to have an interactive explanation system, what would you like that system to be like?" showed that around one third (between 9 and 12) of the participants from each group explicitly indicated that they did not need or want an interactive system (min: [PC:EF/PI:ECC] and [PC:EC/PI:ECI] with 9 participants; max: [PC:EF/PI:ECI] with 12). Participants did, however, provide interesting suggestions for future systems, which we summarise here (with example evidence listed in Table 17):

Human-in-the-loop. Several participants expressed a wish to input information to the system, so that the system would learn from them.
Questions. One of the most common requests regarded the possibility of talking to the AI-system and asking questions, as one would with chatbots and current personal assistant devices such as Siri or Alexa.
Show the classification process. Several participants would like to have insight into the classification process itself.
Interaction with the global model. Several participants would like to build intuitions about how the model worked. For instance, they would like to see the words associated with each category and select them to see their influence on the classification outputs.
Interaction with the text. Under this category, we have placed all the proposals suggesting that the most important words should be highlighted for selection and further exploration, with the possibility of hovering over them to see how the AI-system would classify them and to show the confidence/accuracy of the prediction.
Other suggestions included, for example, a personalised system that matches expectations ("as personalized as possible to my expectations and needs" [PC:EF/PI:ECC]) or the use of voice and sonification ("something with voice-over and maybe like an e-book, so we could relate the speech of the person to the text, since tone of voice matters quite a bit when speaking and/or reading" [PC:EF/PI:EF]).

Main result
Our results first showed that factual explanations given when the system's output corresponded to the true class of the text received statistically significantly higher scores than counterfactual explanations in nearly all aspects we considered (specifically satisfaction, completeness, detail and system understanding), in line with our first hypothesis.
However, we found no strong evidence to support the second hypothesis, namely that counterfactual explanations that included the expected output were most appropriate when the system output did not match end user expectations. We note that the groups that received this kind of explanation, [PC:EF/PI:ECC] and [PC:EC/PI:ECC], tended to score higher than the others regarding explanation completeness and system understanding, with [PC:EF/PI:ECC] nearly always providing the best ratings, but there was no evidence to support any claims of statistical significance. We further found evidence that participants in [PC:EF/PI:ECC] may have built the most accurate mental model of the AI-system.
Overall, this indicates that counterfactual explanations (with the correct class) most likely capture part of what users look for in an explanation when the system output does not match their expectations, but are unlikely to be sufficient by themselves. This aspect will require further investigation. In particular, we noticed in the answers to our open-ended questions that several participants who received counterfactual explanations for incorrect predictions suggested that they would have liked to receive factual explanations. However, data from the groups that did receive these ([PC:EF/PI:EF] and [PC:EC/PI:EF]) indicated that this was also not sufficient. In future work, it may be interesting to investigate whether a combination of factual and counterfactual explanations when the AI-system's outcomes do not align with end user expectations might show better results, and to assess whether this hybrid type of explanation is a suitable alternative to the approaches considered here.
Our results therefore suggest that it is necessary for explanation-generating systems to somehow infer the output that users expected of an AI-system. The clear role for this information, as a way to decide when a factual explanation is the most appropriate, is demonstrated by the support for our first hypothesis. This is important since, as discussed at the beginning, it implies that explanation systems might therefore need a model of the end user to estimate these expectations. Interestingly, this also aligns with very early research on the design of Intelligent Tutoring Systems (1970s-1980s), as reviewed in [93, pp. 33-34], where the explanation system required both "subject knowledge" and "teaching knowledge", and thus a method/model for how to interact with the learner.
While we did not find support for our second hypothesis, the data collected showed that the most appropriate explanations to give when the system output does not match what users expected likely need to contain more than just a foil addressing the expected output. As such, this provides empirical support for recent theoretical arguments in favour of hybrid approaches to counterfactual explanations [18]. This is also relevant because counterfactuals are thought to be a critical component in human-human interaction [41], and it is commonly thought that such results translate to interactions with machines [33,79,80].

Expectations, satisfaction and model accuracy
Research in XAI, or more generally in AI-system acceptance, rarely explores users' expectations of system outputs [94]. Our results demonstrated that they are indeed a relevant factor in producing satisfying explanations, since our users explicitly reported higher satisfaction if the output of the system was what they expected, even if the AI-system output and the users' expectation were both wrong. Such a result cannot be fully explained by simply arguing that model accuracy has a strong effect on explanation satisfaction. There are two relevant studies in this regard, albeit focusing on trust. The first [95] studied the influence of model accuracy and explanation fidelity on trust in AI, demonstrating that the system's accuracy level was most decisive for user trust: the higher the accuracy, the higher the user's trust. Further, stated accuracy was in principle found to affect people's trust in the model [96]. However, this trust was significantly affected by observed accuracy (i.e., after a chance to observe the model's accuracy in practice) irrespective of its stated accuracy. Thus, both studies [95,96] demonstrated an impact of perceived model accuracy on trust. This is in line with our results and highlights the importance of perceived, rather than actual, model accuracy in how users experience AI-systems.
Another interesting result was that participants in group [PC:EF/PI:ECC] gave the highest ratings and scores for all the measures, including when rating the content of explanations and their understanding of the local and global behaviour of the AI-system. Such aspects (e.g., the content, level of detail and completeness of explanations from AI-systems, and whether to address local or global behaviour, or both) have also been discussed in the literature. One study, for example, found that completeness of explanations was more important than soundness in forming mental models [26]. Another demonstrated that overly detailed explanations enhance trust but also create over-reliance, while short or absent explanations prevent over-reliance but decrease trust [97]. Relatedly, a real-world task of predicting housing prices showed that longer explanations make it harder for humans to accurately simulate the model's predictions [34]. Lastly, an analysis of the impact that four types of explanations have on people's fairness judgements of ML-systems found that there is no one-size-fits-all solution for effective explanations [24]. Instead, they depend on the kinds of fairness issues and user profiles, leading to the suggestion that hybrid explanations, i.e., allowing both an overview of the model (global) and scrutiny of individual cases (local), may be necessary for accurate fairness judgements.

Other considerations for the design of explanations and explanation systems
In this section, we briefly touch upon other considerations for the design of explanations and explanatory systems, based on the answers provided to the open-ended questions (see examples in Tables 9 and 17) that coincide with similar results from empirical studies within XAI. These relate to (1) the content of explanations and (2) desirable features of an interactive explanation system.

Content of explanations
Context. One of our participants' suggestions for improving the content of the explanations was to show context, in line with other findings that highlight the importance of context for building and presenting explanations from AI-systems. For example, the recommendation to present the most relevant words together with their contexts in a text classification task was also given in [98], while another study [99] found a positive effect of context in the sense that local and broader contextual explanations helped users form a stable mental model of the agent's behaviour while playing a game.
Confidence. Another suggestion from our participants, adding accuracy and confidence values along with explanations, has also been investigated recently [100,101]. The first study [100] found that confidence scores can help calibrate people's trust in the AI model. However, trust calibration alone is not sufficient to improve AI-assisted decision-making, which may also depend on whether humans can bring in enough unique knowledge to complement the AI's errors and limitations. Further, Van der Waa et al. [101] present a method for conveying the system's confidence in the advice that it provides (it does so by showing how likely it is that the given advice turns out to be correct, based on past experiences). Two user experiments showed that these explanations improved users' understanding of the algorithm and that participants especially preferred confidence values that were explained by referring to past experiences.
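As an illustration of the general idea (and not of the specific method in [101]), confidence grounded in past experiences could be as simple as the success rate among the most similar past cases; all names below, as well as the Euclidean similarity measure, are assumptions made for the sketch.

import numpy as np

def experience_based_confidence(query_features: np.ndarray,
                                past_features: np.ndarray,
                                past_advice_correct: np.ndarray,
                                k: int = 20) -> float:
    # Confidence = fraction of the k most similar past cases in which
    # the system's advice turned out to be correct.
    distances = np.linalg.norm(past_features - query_features, axis=1)
    nearest = np.argsort(distances)[:k]
    return float(past_advice_correct[nearest].mean())

# The value can then be verbalised with reference to past experiences,
# e.g. "In 17 of the 20 most similar past cases, this advice was correct."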
Global model behaviour. Participants also indicated that they would have liked to see information to help them understand the global model for each class and the overall behaviour or classification algorithm of the AI-system. This connects to current discussions on whether local or global (or both) explanations should be provided, or whether interpretable models per se should be used. There is, however, a significant body of work in XAI that uses only local explanations that are meant to reveal the global structure of the model, e.g. [102]. Using explanations to reveal global model behaviour is indeed challenging [98], and very few studies present global explanations or try to build/improve humans' mental models of how the AI-system works (but see [103] for an exception).

Desirable features of an interactive explanation system
In the final questionnaire, we included aspects related to having an interactive explanation system. Even though participants likely envisioned different interactive systems, the purpose of these questions was not to gather information about any particular design feature for such a system but to explore the willingness to interact with the explanations.
One interesting aspect that was mentioned concerned the ability to ask questions and get answers along the way, or to see the classification process by selecting parts of the text and receiving answers on the fly (a minimal sketch of this idea follows below). Information on-demand and interactive disclosure of information certainly has a long history in HCI research and has also started to be studied in an XAI context (e.g. [104,105]). This relates as well to an ongoing discussion regarding whether explanations should be considered an outcome, a process, or even several processes. For instance, [41, p. 10] differentiates between two: a cognitive process, i.e., the process of determining an explanation for a given event, and a social process, i.e., the process of transferring knowledge between explainer and explainee. The latter generally takes the form of an interaction in which the goal is that the explainee has enough information to understand the causes of the event. It also relates to the very division of interpretability methods themselves [56], one referring to transparency, asking how the model works, and the other relating to post-hoc explanations, inquiring what the model can tell [55]. Thus, interpretability can relate both to the system output and to the system architecture and processes themselves.
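As a toy illustration of such on-demand disclosure, and assuming word-level contributions from a local explainer such as LIME are available, a user interface could answer a click or hover on a word as follows. Everything here, including the weights dictionary, is hypothetical.

def word_contribution(word: str, weights: dict) -> str:
    # weights: hypothetical {word: signed contribution} map produced by
    # a local explainer for the current prediction.
    w = weights.get(word.lower())
    if w is None:
        return f"'{word}' was not considered for this prediction."
    direction = "towards" if w > 0 else "against"
    return f"'{word}' weighed {direction} the predicted class ({w:+.2f})."

# e.g. word_contribution("hockey", {"hockey": 0.31})
#   -> "'hockey' weighed towards the predicted class (+0.31)."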
After all, the variety of answers suggesting additional information from the Explanation-system, and regarding whether or not it would be beneficial to interact with it, also shows the broadness of the area of explainable AI: users would like to receive explanations and look into the black box for very different reasons, at various levels and stages, depending on what they know and would like to know, and so on. An interesting discussion in this regard [106] summarises the results of a qualitative study aimed at understanding how various stakeholders characterise the problem of XAI. In general, it is unlikely that one single strategy for producing explanations will suit the needs of all users [24,64,72]. It follows that it is also hard to design an interactive system that can be fine-tuned to the needs of all possible end users. Our results suggest that at least a partial solution might again lie in systems that are able to infer user expectations of the AI-system behaviour and adapt their explanations accordingly.

Limitations and final considerations
Our study does have a number of limitations. First, the task given to the participants was formulated in the introduction of the study as "We need your help improving and evaluating an AI-system that classifies text and emails found online [...] Like this, you will help the AI-system learn from you and be better in the future". This task thus invited participants to try to understand the system so that it could be improved; it is conceivable that other types of tasks would have led to other results.
Second, some of our design choices were rather simple. For example, we defined the global model of the AI-system as a collection of representative keywords for each of the classes (these keywords are used to build the counterfactuals). Also, the AI-system's behaviour was designed such that, of the three text classes, one was always predicted correctly, one was always predicted incorrectly, and the last one was predicted correctly 50% of the time. While this was acceptable given the concrete aims of our study, real-world AI-systems can rarely be captured by such a simple model. Indeed, a global model of a system or its behaviour can, in general, be hard to define and present to the end user. Building explainable global models is challenging [107] and, as seen earlier, many solutions in the XAI area tackle the problem of building overall interpretable models (interpretable ML). Another simplification that was necessary for our study is that the AI-system's task was such that participants could easily anticipate the correct class of the text. This is also rarely the case in reality; more typically, there would be uncertainty associated with the outcomes of an AI-system. However, our design mimics the case of trained expert users of a given AI-system who do have expectations of the system's output, not because the task is trivial, but because they have extensive experience with the system and have developed their own sophisticated model of it, something that is difficult to achieve in a study like ours. As such, although the task is simple, our findings are likely to generalise to a more realistic system and its trained users.
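For concreteness, this scripted behaviour can be expressed in a few lines. Note that the assignment of classes to accuracy levels below is an assumption made for the sketch; the paper fixes only the 100%/50%/0% schedule itself.

import random

# Assumed mapping of class to scripted accuracy (illustrative only).
ACCURACY = {"politics": 1.0, "science": 0.5, "leisure": 0.0}
CLASSES = list(ACCURACY)

def scripted_prediction(true_class: str, rng: random.Random) -> str:
    # Return the true class with its scripted probability, otherwise
    # a deliberately wrong class chosen uniformly from the others.
    if rng.random() < ACCURACY[true_class]:
        return true_class
    return rng.choice([c for c in CLASSES if c != true_class])

# e.g. scripted_prediction("science", random.Random(0)) is correct
# roughly half of the time over repeated calls.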
One could argue that some of the limitations discussed in this section, such as the overall task given to the participants, the simplified AI model, or the fact that only three classes were considered for classification, could compromise the generalisability of the results. It is, indeed, relatively easy to find fault in any given experiment because all factors cannot usually be completely controlled and, if they are, external validity (generalisability) and ecological validity (the degree to which the experimental situation reflects the type of environment in which the results will be applied) can be affected [108].
In our study, we focused on the relation between end user expectations and the appropriate type of explanation. The focus of our experimental design was therefore on ensuring that these expectations were clear and unambiguous, while the other aspects were kept simple. Nonetheless, the chosen task (classification), domain (newsgroups data) and users (novices) are common choices when evaluating XAI (see, e.g., [23,109]). The setting captures a typical AI-assisted decision support application, in which classification predictions are presented to users who are not domain experts, and there is time to analyse the predictions and explanations provided by the system. It of course remains likely that appropriate types of explanations are domain-dependent. However, the fact that, for example, counterfactual explanations did not appear as sufficient as expected in this simple scenario suggests, at a minimum, that there is a need to investigate the underlying causes further in this domain. It also points to a need to study the appropriateness of such explanations in other domains that differ significantly from the text classification we have used here. In particular, if they are found to work well in another domain (for example, there have been suggestions that they might be suitable in the medical domain, see [110,77]), there is a need to understand the contributing factors further.
Lastly, participants were recruited online through a well-known platform for research studies, Prolific. Even if online platforms such as Prolific can successfully be used for perceptual evaluations of this kind, as recently shown in [111], there is always some degree of uncertainty associated with the quality of the answers; to help overcome this challenge, we eliminated, for example, the responses of those who had spent very little time answering the questionnaires.

A note on methodology
Before we conclude, it is also worth reflecting on the method used in this paper. Specifically, we have used an experimental design with six conditions and a relatively large number of participants to investigate two main hypotheses from several angles. In other words, our experimental design was broad in scope, relying on one major experiment to investigate the appropriateness of different types of explanations in different contexts, their impact on mental model formation, and more open-ended end-user preferences such as whether or not explanations should be interactive. This approach is not unconventional in the literature studying human interaction with computer systems, and similar work can, for example, be found both in the XAI domain [34,112] and in related fields such as human-robot interaction [113][114][115][116]. We are, however, indebted to one anonymous reviewer for highlighting the alternative to this approach, namely to conduct much smaller studies targeting specific effects. This has, for example, the advantage that results are clearer to interpret and more easily validated.
As one example of an aspect that could have been a separate study, we can consider the exploration of the mental models that users build of the system. This was arguably not needed to test either of our hypotheses and could thus have been omitted from the design. However, if one were to design an experiment to study only this aspect, participants would still have to go through the same, or at least a similar, process as we have used here. This illustrates one of the key trade-offs: from the perspective of required participant time, it seems advantageous to study multiple aspects that can be investigated with the same task, provided these aspects do not interact with each other. Here, for example, manipulating how often the system is correct with its predictions per category does not change the fact that humans have expectations regarding a particular output, so it does not interact with the main hypotheses.
This gives a pragmatic reason why, at least in user studies with computer systems, intelligent technologies, robots, or similar, relatively large designs are regularly used. The downside, however, is that this results in a larger, more complex overall experiment, and the interesting question is whether the trade-off can be justified. It goes beyond the scope of this paper to discuss possible experimental designs and the intricacies of the method in detail; we have only noted one pitfall to be aware of (whether multiple studied aspects interact with each other). Nonetheless, as our anonymous reviewer also pointed out, there remains a need to engage more with the advantages and disadvantages of different experimental designs, including tracking developments in fields that have expertise in studying human behaviour, such as experimental or social psychology. This conversation is also taking place in human-robot interaction research [117] and appears to be becoming increasingly relevant as technical fields turn to user studies of the novel systems they develop.

Conclusions
Despite the rapid advances of AI/ML in multiple application areas, considerably less progress has been made in understanding how users interact with ML/AI systems [118]. Conventional evaluations of humans interacting with AI-systems are carried out using traditional methods and metrics either from the ML community (algorithm-centred evaluation) or from the HCI community (human-centred evaluation) [119]. Several authors have recently pointed out an overall lack of user evaluations that add a user-centred focus to the field of XAI [120,121,17].
In this paper, we provided such an evaluation to investigate whether systems that generate explanations of AI-systems need to take user expectations of system output into account in order to produce adequate explanations. We found that factual explanations appear appropriate when the system output was what end users expected. We further found that counterfactual explanations, given when the system output was not what end users expected, may be part of what these users require from an explanation, but do not appear to be sufficient by themselves. This points to a need to investigate other, hybrid types of explanation (see also [18]). Our results connecting the accuracy of mental models of an AI-system with the explanations provided suggest that aspects such as context or details of the global system model may be useful to explore further in this respect.
These results further suggest that explanation systems may need to infer user expectations of the AI-system's output, possibly through the use of models of the user, so as to be able to provide the correct type of explanation. How this is to be achieved will also require further study.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Table 4
Questionnaire items grouped by the construct they measure. Items were rated on a 1-5 scale unless marked as open-ended; [QI], [QII] and [Main classification task] indicate where each item appeared.

"The explanations provided of how the AI-system classifies text are satisfying." (1-5) [QI]
"Would you have liked for the explanations to contain additional information? If so, what type of information and when/which situations?" (open-ended question) [QI]

Perceived understanding of inner workings of AI-system based on local explanations
"The explanations provided help me to understand a particular prediction made by the AI-system." (1-5) [QII]
"How satisfying did you find the explanation in terms of understanding why the system made its decision?" (1-5) [Main classification task]

Perceived understanding of the AI-system global behaviour
"The explanations provided help me to understand the global behaviour of the AI-system." (1-5) [QII]
"The explanations provided help me to understand a particular prediction made by the AI-system but also the global behaviour of the AI-system." (1-5) [QII]
"The explanations provided help me to understand the limitations and mistakes of the AI-system." (1-5) [QI]

Perceived capability of predicting the AI-system behaviour
"I know what will happen the next time I use the AI-system because I understand how it behaves." (1-5) [QII]
"Do you think that the AI-system classifies the different types of text equally?" (open-ended question) [QII]

Actual capability of predicting AI-system behaviour and performance
"The performance of the AI-system classifying text about politics/science/leisure was:" (Poor, Fair, No opinion, Good, Excellent) [QII]

Fig. 2. Process of generating explanations. A selection of texts from the three classes (politics, science and leisure) from the 20 newsgroups dataset was run through LIME [5] to find the most relevant words that contributed positively to the prediction of a class. Some of the words highlighted by LIME were collected to build a global model (bag of words) for each class (right-hand side of the figure). We built the factual explanations based on the highlighted words from a particular text. The counterfactual explanations are built taking words from the global models (highlighted in pink in the example) and the individual highlighted words from each text. (For interpretation of the colours in the figure(s), the reader is referred to the web version of this article.)

Group sizes were [PC:EF/PI:EF] = 31, [PC:EF/PI:ECC] = 28, [PC:EF/PI:ECI] = 31, [PC:EC/PI:EF] = 30, [PC:EC/PI:ECC] = 32 and [PC:EC/PI:ECI] = 29, for a total of 181 participants. Participants took part in the experiment for 21:09 min on average (σ = 11.35).
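The pipeline described in the Fig. 2 caption can be sketched using LIME's public text API. Everything outside that API (the classifier function, the global keyword model, the message templates) is assumed for illustration and is not our exact implementation.

from lime.lime_text import LimeTextExplainer

def positive_words(text, classifier_fn, class_names, predicted_idx,
                   num_features=6):
    # Words that contributed positively to the predicted class,
    # as identified by LIME; classifier_fn maps a list of texts to
    # an array of class probabilities.
    explainer = LimeTextExplainer(class_names=class_names)
    exp = explainer.explain_instance(text, classifier_fn,
                                     labels=[predicted_idx],
                                     num_features=num_features)
    return [w for w, weight in exp.as_list(label=predicted_idx) if weight > 0]

def factual(words, label):
    return f"Classified as {label} because the text contains: {', '.join(words)}."

def counterfactual(words, label, foil, global_model):
    # Contrast the text's own keywords with keywords from the foil
    # class's global bag-of-words model (global_model: {class: [words]}).
    foil_words = [w for w in global_model[foil] if w not in words][:4]
    return (f"Classified as {label} rather than {foil}: the text contains "
            f"{', '.join(words)}, but not {', '.join(foil_words)}.")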

Fig. 6. Distribution of satisfaction ratings per group for all instances in which users showed inconsistent behaviour: disagreeing with a correct output or agreeing with an incorrect output. The figure is the complement to Fig. 5. Numbers below the x-axis indicate the total number of points in each group.

Fig. 7. Answers to the Explanation-system questionnaire and classification task. See Table 4 for the complete question phrasing.

Fig. 10. Answers to the question "I would have liked to have an interactive explanation system that would answer my questions." No statistically significant differences were found (see Table 15).

"
The system could ask whether a preliminary classification based on certain words was accurate and the user could indicate whether it had misinterpreted certain words."(PC:EF/PI:ECI) "If the AI categorized a subject, you have the ability to select certain words in the email to help explain why it fits a different category."(PC:EC/PI:ECC) Questions "It should be like chat service and answer basic questions." (PC:EC/PI:EF) "I could ask the AI pre-determined questions that would give me deeper insight as to why the AI chose this and this option instead of that."(PC:EC/PI:ECC) "I want to be able to debate the machine."(PC:EC/PI:ECI) Show classification process "It should be a step by step process, maybe even showing the intermediate steps, what was the idea of the AI at a certain point and why it changes."(PC:EC/PI:EF) Interaction with the global model "Yes, the system should have a definition for each three of the words and show particular keywords that fall under each category."(PC:EC/PI:EF) "Probably something like having a list of words that were or were not used in the text.Then you could click on a word and it would explain whether or not the word was considered and how that affected the classification."(PC:EC/PI:ECC) Interaction with the text "Maybe when you highlight certain key words in text it tells you how it classifies them."(PC:EF/PI:ECC) "Perhaps a system with highlighted keywords that, when clicked, provide more details on why this was considered."(PC:EC/PI:EF) "Hovering over words would tell you what category the AI thought the word should fit into."(PC:EC/PI:EF) "If you hover over a sentence I would like to see how it interprets the category it belongs to." (PC:EC/PI:ECC)

Table 1
Experimental conditions and groups: type of explanation if the prediction by the AI-system is correct, by type of explanation if the prediction is incorrect (2x3 experiment). Between-group experimental design with six groups. PC = Prediction is Correct, PI = Prediction is Incorrect, EF = Explanation Factual, EC = Explanation Counterfactual, ECC = Explanation Counterfactual with the expected (Correct) class and ECI = Explanation Counterfactual with a non-expected (Incorrect) class. For example, [PC:EF/PI:ECC]: if the Prediction is Correct (PC), this group sees an Explanation that is Factual (EF); if the Prediction is Incorrect (PI), this group sees an Explanation that is a Counterfactual with the Correct class (ECC).

Table 3
Number of participants and resulting data points (n × 18 questions) per condition, as well as the number and proportion of "consistent" data points, for which the participant agreed with a correct classification or disagreed with an incorrect one (see section 3.2).

Table 6
Summary table (ANODE Type II tests) for effects of explanation type on the satisfaction ratings for individual explanations, based on a CLM model. All variables were adjusted for scale effects due to assumption violations.
a Multiple modes exist; the smallest value is shown.

Table 8
Summary table (ANODE Type II tests) for effects of explanation type on the ratings presented in Fig. 7. Statistically significant effects are indicated in bold. Test assumptions are satisfied in all cases.

Table 11
Summary statistics for the data presented in Fig. 8.
a Multiple modes exist; the smallest value is shown.

Table 12
Summary table (ANODE Type II tests) for effects of explanation type on the ratings presented in Fig. 8. Statistically significant effects are indicated in bold. Variables that violated test assumptions, and were thus adjusted for scale effects in their respective CLMs, are indicated with an asterisk.
Fig. 9. Answers to the AI-system questionnaire measuring predictability.

Table 13
Summary table (ANODE Type II tests) for effects of explanation type on the ratings presented in Fig. 9. Statistically significant effects are indicated in bold. Test assumptions are satisfied in all cases.

Table 14
Summary statistics for the data presented in Fig. 9.
a Multiple modes exist; the smallest value is shown.

Table 15
Summary table (ANODE Type II tests) for effects of explanation type on preference for an interactive system (see Fig. 10). No test assumptions were violated.

Table 16
Summary statistics for the data presented in Fig. 10.
a Multiple modes exist; the smallest value is shown.