A privacy-preserving dialogue system based on argumentation

Dialogue systems are an increasingly popular class of AI-based solutions that support timely, interactive communication with users in many domains. Since users may disclose sensitive data when interacting with such systems, ensuring that the systems comply with the relevant laws, regulations, and ethical principles should be a primary concern. In this context, we discuss the main open points regarding these aspects and propose an approach grounded in a computational argumentation framework. Our approach ensures that user data are managed according to the data minimization, purpose limitation, and integrity principles. Moreover, it can provide motivations for the system's responses, offering transparency and explainability. We illustrate the architecture using a COVID-19 vaccine information system as a case study, discuss its theoretical properties, and evaluate it empirically.


Introduction
Dialogue systems are among the most popular forms of commercial AI products. In the 2019 Gartner CIO Survey, CIOs identified chatbots as the main AI-based application used in their enterprises, 1 with a global market valued in the billions of USD. 2 In fact, chatbots are one example of how pervasive AI technologies are becoming, both in addressing global challenges and in day-to-day routine. Public administrations too are adopting chatbots for key tasks such as helping citizens request services 3 and providing updates and information, for example in relation to COVID-19 (Amiri and Karahanna, 2022; Miner et al., 2020).
However, the expansion of intelligent technologies has been met by growing concerns about possible misuse, motivating a need to develop AI systems that are trustworthy. On the one hand, governments are under pressure to gain or preserve an edge in intelligent technologies, which make intensive use of large amounts of data. On the other hand, there is an increasing awareness of the fundamental need for data protection regulations. To make matters more complicated, different jurisdictions have different data protection regulations. Indeed, threats to the privacy of individuals are real. For example, a recent uproar was caused by Singapore's admission that data from its COVID-19 contact tracing programme could also be accessed by the police, reversing earlier privacy assurances. 4 All this motivates the need for trustworthy AI.
According to several studies, including the Ethics Guidelines produced by the High-Level Expert Group on AI, 5 trustworthy AI systems must not only be robust, but also respectful of all applicable laws and regulations, as well as of ethical principles and values. Among the tenets of trustworthy AI are privacy and data governance, transparency, and auditability. For dialogue systems in particular, a study by Saglam et al. (2021) shows that a major reason for distrust in them is users' loss of agency over the data they provide to the system.
Concerning privacy and data governance, in the context of chatbots we believe that legitimised access to data should be ensured by an architectural design that takes data access into account from the very beginning, preserving the user's agency over their data. This is especially true of applications that require interaction among different legal entities. For example, consider a government chatbot giving citizens information about COVID-19 vaccines. Let us say that such a chatbot relies on a transnational or regional agency that contributes medical expertise on the subject. To provide relevant information, the system needs to elicit personal, possibly sensitive user data. It is thus extremely important that data processing is limited to the specific purpose that matches the user's need, and that only the necessary user data is stored and transmitted. Moreover, the user should be able to investigate and understand the reasons behind the AI system's recommendations, without having to be technically savvy. Finally, we think that auditability is especially important when an AI chatbot has to deal with personal data and offer advice: in particular, the models and algorithms used to produce such advice should be transparent and verifiable.
We thus propose a dialogue system architecture inspired by the principles and values of trustworthy AI, which explicitly addresses the above points in the following way:
• user interaction is carried out in natural language, not only to provide information to the user, but also to answer user queries about the reasons leading to the system output (explainability);
• the system selects answers based on a transparent reasoning module, built on top of a computational argumentation framework with a rigorous, verifiable semantics (transparency, auditability);
• user data are treated in accordance with the data minimization, purpose limitation, and storage limitation principles.
To this end, the natural language interface and the reasoning module are decoupled so as to ensure that no personal data is passed from one module to the other (privacy and data governance).
The present paper demonstrates the feasibility of the approach and shows its workings via an illustration consisting of an AI chatbot aimed at giving advice on COVID-19 vaccines. Our goal here is to move one step towards bridging the gap between fundamental, perhaps abstract, ethical principles and practical AI applications.
In our previous works (Fazzinga et al., 2021a; 2021b), we presented a very general and preliminary idea of our framework with no formalization, and provided a very preliminary evaluation. Here we present a detailed and complete overview of our architecture, describing each part in detail, present the algorithm and the formal properties of our strategy, and include a broader and more robust evaluation.
The paper is structured as follows: in Section 2 we present the legal background and the related works. In Section 3 we propose our approach, while in Section 4 we discuss an illustration on COVID-19 vaccines. Section 5 evaluates the approach and Section 6 concludes.

Background
Our proposal of a privacy-preserving AI dialogue system builds on recent advances in language technologies (dialogue systems), knowledge representation (argumentation as a framework for non-monotonic reasoning and explainability), and on the latest European regulations in terms of data protection. In this section, we provide the necessary background and comment on the most significant differences of our proposal from relevant work.

Data protection in the E.U.
In the European Union, every processing of personal data, in any context, is subject to a set of rules and principles that impose obligations on those who process data and attribute rights to those to whom the personal data refers. The E.U.'s General Data Protection Regulation 6 (GDPR) has general applicability and not only defends the fundamental right to data protection, but also aims to protect all the fundamental rights and freedoms that are implicated by the processing of personal data (Hildebrandt, 2020).
Especially relevant for the purposes of this article is GDPR's Article 5. In particular, Article 5(1) states the principles that must guide the processing of personal data. These include purpose limitation, data minimization, storage limitation, and integrity and confidentiality. The first three impose that the collection and processing of personal data be limited to what is strictly necessary to fulfill a specified purpose, and that the data not be kept or processed further; the last aims to guarantee the appropriate security of personal data. Article 5(2) instead covers accountability and specifies that the data controller, the legal body which determines the purposes and means of the processing of personal data, must be able to demonstrate compliance with all these principles.

Document redaction and sanitization
The anonymization of unstructured textual data is still an open and challenging task (Batet and Sánchez, 2018; Lison et al., 2021). Various approaches have been proposed. Redaction is the processing of a textual document with the aim of removing personal, sensitive information (Szarvas et al., 2007). Sanitization instead replaces such information with more general and impersonal variations (Chakaravarthy et al., 2008). Unfortunately, current redaction and sanitization technology is still far from guaranteeing zero risk to the user (Li et al., 2017). Common solutions only focus on predefined categories of entities. While they can certainly serve as useful privacy-enhancing techniques, they do not qualify as full anonymization as defined by regulations like the GDPR, because they ignore elements that may play a role in re-identifying the individual (Lison et al., 2021; Pilán et al., 2022). Moreover, the most successful approaches are limited to specific domains, and often rely on large, hard-to-obtain datasets (Hassan et al., 2019; Iwendi et al., 2020; Nguyen and Cavallari, 2020; Sánchez et al., 2014).
Our approach distinguishes itself from the above because it neither shares nor stores user information, but replaces it with a set of general, "sanitized" information elements that are pertinent to the use case.

Dialogue systems
Although terminology varies widely, there is a generally accepted distinction between conversation-oriented and task-oriented dialogue systems. Conversational agents aim to support open-domain dialogues. They are commonly called chatbots. Task-oriented dialogue systems instead aim to assist the user in completing well-defined tasks in a given domain (Chen et al., 2017; Deriu et al., 2021). Such tasks may consist in performing a specific action or eliciting user information, or, as in our case, providing information to the user.
The dialogue response typically involves four stages: understanding the question, managing the dialogue, optionally performing an action, and generating the answer. While these steps can be managed separately with pipeline architectures, much of the recent literature concerns end-to-end neural approaches. Conventional pipeline architectures have problems propagating the user's feedback across all the components, while the advantage of modularity is thwarted by the interdependence of such modules, hampering the adaptation of a pipeline system to a new domain or a new task (Wen et al., 2017; Zhao and Eskénazi, 2016). The success of deep learning architectures at many NLP tasks, including natural language understanding and generation (Galassi et al., 2021; Young et al., 2018), has motivated researchers to extensively investigate their application in this area (Luo et al., 2019; Mohamad Suhaili et al., 2021; Rajendran et al., 2018). The downside of such approaches is that they are usually costly to implement for a new task because they require large training datasets. Moreover, data-driven dialogue systems are subject to bias (Barikeri et al., 2021; Dinan et al., 2020; Liu et al., 2020), privacy violations, adversarial examples, and several other ethical issues and safety concerns. Henderson et al. (2018) provide a broad and thorough discussion.
Given our focus on user protection and our aim to develop a general, data-independent approach, we have opted for an architecture that does not involve training, and is modular. Our approach uses techniques common in Information Retrieval-based dialogue systems, where the user's sentences are treated as queries and answers are retrieved from a knowledge base made of dialogues. For example, Charras et al. (2016) compare the use of cosine similarity between TF-IDF representations of sentences with that of specifically trained doc2vec embeddings (Le and Mikolov, 2014). Likewise, we measure the similarity between sentence embeddings to match the user sentences with the ones provided by the knowledge base. However, we rely on pre-trained models and on sanitized text produced by domain experts.
To the best of our knowledge, we are the first to introduce a dialogue system architecture whose design goal is to guarantee user data protection. In fact, previous works dealing with sensitive, health-related data (Brixey et al., 2017; Xu et al., 2019) do not specifically address user and data protection.
Regarding the use of argumentation in the context of dialogue systems, previous works mainly focused on persuasion. Rosenfeld and Kraus (2016) rely on reinforcement learning, while Rach et al. (2018) envision the dialogue as a game and the answers as moves along a previously defined scheme. In both cases, the agents' inputs and outputs are limited to the sentences present in the knowledge base. Chalaguine and Hunter (2020) propose a persuasive dialogue system to convince students of the importance of university fees. The approach is based on a knowledge graph where each possible user sentence is encoded in a node and all its answers are linked to it by an edge. When the user writes a sentence, the system searches for an argument in the graph that matches the user sentence, and chooses an answer to give to the user among the nodes linked to it. Although this system is surely related to the one we propose, it also differs significantly in that the choice of the answer is made "locally", by only looking at the possible answers to the last question posed by the user, while ignoring what was said earlier. The lack of a history of the conversation, with the ensuing impossibility of retrieving multiple pieces of information within a single exchange, strongly limits the application of such a system in real-world scenarios, where dialogues naturally take into account things said at different times. Another limitation of these approaches is their reliance on lexical, rather than semantic, similarity.

Dialogue systems for COVID-19 pandemic
Chalaguine and Hunter (2021) developed a chatbot to persuade users to get vaccinated against COVID-19, using the same techniques employed in their previous work. Dos Santos Júnior et al. (2021) focused on natural language understanding and studied the use of embeddings and clustering algorithms to automatically annotate a dataset of COVID-related conversations with intention labels. The VIRA system (Gretz et al., 2022) is trained to recognize the intent of the user among a set of candidates, and replies with one of the corresponding answers. Altay et al. (2021) studied the positive impact of the use of chatbots, Judson et al. (2020) highlighted the users' suspicions, while Schubel et al. (2021) reported differences in their use by different population subgroups.
None of these works addresses the handling of user sensitive data and the related privacy issues. Moreover, besides some focus on persuasive dialogue systems, to the best of our knowledge no work has been done on systems for providing reliable information to the users.
More detailed information regarding the use of chatbots in the COVID-19 public health response can be found in the survey by Amiri and Karahanna (2022).

Argumentation frameworks
Abstract Argumentation (AA) (Dung, 1995) is a branch of Artificial Intelligence (AI) that has gained significant attention in recent years due to its capability of modelling debates, dialogues, and, in general, situations where conflicts and diversity of opinions arise. One important point that leads to the usage of AA as a reasoning mechanism at the core of several dialogue-based applications in AI is its natural aptitude to provide "explanations" (Modgil et al., 2013). Indeed, in recent years, the capability of providing motivations for system/agent behaviours has become crucial in AI, and AA is taking on an increasingly central role (Chesñevar et al., 2020; Cyras et al., 2021). In fact, modelling a dispute/dialogue as an AA framework not only offers the possibility of locating the arguments that represent a good/bad point in a rebuttal, but has the further advantage of possibly providing a "witness" of the reason why a certain argument is a good/bad point. From a technical standpoint, disputes in AA are modelled as graphs, where the arguments, i.e. the sentences claimed by the agents participating in the dispute, are the nodes, and the conflicts/contradictions between the sentences, named attacks, are the edges of the graph. As an example, consider the following scenario. Andrea says argument a: "Milan is a very livable city". Matt says argument b: "Milan is one of the most polluted cities in the world, so it is absolutely not livable". Alice says argument c: "Several parameters are used to establish whether a city is livable, thus you can't say that Milan is not livable". This scenario can be modelled as the AA graph 〈A, D〉, where A consists of the arguments a, b, c and D consists of the edges (a, b), (b, a), (c, b).
A lot of work has been devoted to reasoning over argumentation graphs (Baroni and Giacomin, 2009; Charwat et al., 2015; Fazzinga et al., 2019), and several ways of identifying "robust" arguments or sets of arguments, called semantics, have been proposed (Dung, 1995; Dung et al., 2007). A popular one is the admissible semantics. It stipulates that a set S of arguments is an admissible extension (that is, it conforms to the admissible semantics) if and only if (i) S is conflict-free, i.e. there is no attack between arguments in S, and (ii) S defends every argument in it, i.e., S attacks every argument (outside S) attacking arguments in S. Condition (ii) reveals that the admissible semantics is based on the fundamental concept of acceptance: for S to be an admissible extension, every argument a of S must be acceptable w.r.t. it, meaning that S must counterattack every attack from outside towards a. Continuing the above example, both S1 = {c} and S2 = {a, c} are admissible extensions, while S3 = {a, b} and S4 = {b, c} are not, as they are not conflict-free, and neither is S5 = {b}, as b is not acceptable w.r.t. S5: in fact, S5 does not defend b against the attack from c.
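The admissible semantics lends itself to a direct brute-force implementation. The following sketch (our illustration, not part of the original formalization; names are ours) enumerates the admissible extensions of a small abstract argumentation graph by testing conflict-freeness and acceptability, using the Milan example above:

```python
from itertools import combinations

def is_conflict_free(S, attacks):
    # (i) no attack between any two arguments of S
    return not any((x, y) in attacks for x in S for y in S)

def is_acceptable(a, S, attacks, args):
    # a is acceptable w.r.t. S: S counterattacks every attacker of a
    attackers = [b for b in args if (b, a) in attacks]
    return all(any((c, b) in attacks for c in S) for b in attackers)

def admissible_extensions(args, attacks):
    exts = []
    for k in range(len(args) + 1):
        for S in combinations(sorted(args), k):
            if is_conflict_free(S, attacks) and \
               all(is_acceptable(a, S, attacks, args) for a in S):
                exts.append(set(S))
    return exts

# The Milan example: a and b attack each other, c attacks b
args = {"a", "b", "c"}
attacks = {("a", "b"), ("b", "a"), ("c", "b")}
print(admissible_extensions(args, attacks))
```

Running it on the Milan graph yields the empty set (trivially admissible), {a}, {c}, and {a, c}: S1 and S2 from the example are confirmed admissible, while S3, S4, and S5 are ruled out.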

Architecture and methods
To illustrate our proposal, let us consider a government intending to provide a personalized information service to its citizens through a dialogue system. The interaction between the user and the system would be similar in many different scenarios, while the back-end retrieval of information would depend on the specific case. It is also reasonable to imagine scenarios where the knowledge base used by the service to retrieve the specific information is not maintained or owned by the entity itself, but by a third party, such as a transnational agency contributing specialized medical expertise. In such cases, where the interaction with the knowledge base is handled by a third party, the provider of the dialogue system must guarantee to the users the protection of their personal information. The third party's access to it must be limited, both in terms of content and time, to what is strictly necessary for computing the pertinent answer. It is, therefore, the responsibility of the service provider to ask the user for the relevant information, to analyze and process it, to guarantee that any information that is irrelevant (but potentially sensitive) is removed, and to ensure that the set of maintained data as a whole guarantees anonymity.

B. Fazzinga et al.
Consequently, we propose a modular architecture for dialogue systems made up of the following components:
• a Knowledge Base (KB), built by experts, containing all the possible relevant cases, answers, and relationships between them;
• a Language module that processes the user's input, including sensitive information, and maps it to the corresponding KB cases;
• an Argumentation module for reasoning over such KB cases and computing answers.
The dialogue process is shown in Fig. 1 and can be summarized as follows:
1. The user interacts with the Language module, providing the personal information needed to satisfy their request.
2. The Language module compares the user information with the KB to understand which cases are relevant.
3. The Language module establishes a connection with the Argumentation module and provides it with an anonymous, sanitized, and generalized version of the user's information.
4. The Argumentation module elaborates the information and computes an answer.
5. The Argumentation module sends the answer to the Language module, which forwards it to the user, optionally processing it further. Such an answer may be the information required by the user or a request for further personal information that is needed to provide a proper answer.
6. If more information is required, the process goes back to step 2.
7. Once the user has received the final answer, they can ask for a detailed explanation of the answer, which will be provided by the Argumentation module.
8. As soon as the user decides to terminate the interaction, the connection between the modules is closed, and all the information related to the user is deleted.
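To make the flow concrete, here is a minimal sketch of the session loop (our illustration, not the actual implementation; the callables stand in for the Language and Argumentation modules, whose interfaces are hypothetical):

```python
def dialogue_session(read_input, match_to_kb, compute_reply, show):
    """One anonymous session. Only sanitized KB node ids ever reach
    compute_reply (the Argumentation module); raw user text stays on
    the Language-module side (read_input / match_to_kb)."""
    S = set()           # status nodes activated so far
    transcript = []
    while True:
        user_text = read_input()
        if user_text is None:        # user terminates the interaction
            break
        S |= match_to_kb(user_text)  # steps 2-3: sanitize, map to KB
        reply = compute_reply(frozenset(S))  # steps 4-5
        transcript.append(reply)
        show(reply)  # an answer or a request for further information
    S.clear()        # step 8: user-related information is deleted
    return transcript
```

Wiring it with toy callables that map, say, "celiac" and "asthma" keywords to status nodes reproduces the loop of steps 1 to 8 without any personal text crossing the module boundary.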
It is important to highlight that the exchange of personal and sensitive data occurs only between the user and the Language module. The Argumentation module has access only to a general and broad representation that is strictly necessary to provide the answer. Moreover, any information that is deemed irrelevant by the Language module never reaches the next module. These two properties reflect the principles of data minimization and purpose limitation, respectively.
This architecture also fits a client-server scenario where the Language module is hosted on the client side and the Argumentation module on the server side. The client can be implemented as an application to be installed on the personal device of the user, while the sessions of interaction with the server are completely anonymous: the personal information of the user never leaves their device unless it is relevant for the answer.
We remark that our main focus is on the processing of the user's personal data and how it is used to produce the final answer. Other aspects, such as the management of the dialogue, will have to be addressed when implementing the system, but they are beyond the scope of this work, since they do not pose a challenge to users' privacy and can therefore be realized with any of the techniques already available in the literature.
In the rest of this section, we will describe each component in detail and provide a formal definition of the communication process.

Knowledge base
We encode our background knowledge into an argument graph made of status nodes and reply nodes. The former encode facts that correspond to the possible user sentences. Each status node is linked to one or more reply arguments it endorses. Status nodes may also attack other status or reply nodes, typically because the facts they represent are incompatible with one another. Indeed, in our argumentation graph, we assume that all the attacks between status arguments are mutual.

Example 1. Consider the argumentation graph 〈A, R, D, E〉 whose endorsement relation E includes the pairs (a2, r2), (a3, r1), and (a4, r2), where dashed lines denote the endorsement relation and continuous lines denote the attack relation. Among the several sets attacking or endorsing replies, we have that: set S1 = {a1, a2} attacks both r1 and r2, while set S2 = {a1, a3} attacks r2, and set S3 = {a3, a4} endorses both r1 and r2.
Additionally, each argument in A is annotated with a set of natural language sentences that represent some possible ways a user might express the fact it encodes. These different representations of facts could be produced by domain experts, or crowd-sourced as proposed by Chalaguine and Hunter (2020) and then validated by domain experts. Each argument in R is annotated with one or multiple natural language sentences that express the answer it encodes.
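As an illustration of how such a KB could be encoded (a sketch under our own naming assumptions, not the authors' actual data model), the structural constraints above translate into a small data class:

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgeBase:
    status: set     # status-node ids (facts the user may state)
    replies: set    # reply-node ids (candidate answers)
    attacks: set    # pairs (x, y): node x attacks node y
    endorses: set   # pairs (s, r): status node s endorses reply r
    sentences: dict = field(default_factory=dict)  # node id -> NL paraphrases

    def check(self):
        # endorsement only links status nodes to reply nodes
        assert all(s in self.status and r in self.replies
                   for s, r in self.endorses)
        # attacks between status nodes are assumed to be mutual
        assert all((y, x) in self.attacks for x, y in self.attacks
                   if x in self.status and y in self.status)

kb = KnowledgeBase(
    status={"a1", "a2"},
    replies={"r1"},
    attacks={("a1", "a2"), ("a2", "a1"), ("a2", "r1")},
    endorses={("a1", "r1")},
    sentences={"a1": ["I am celiac", "I suffer from celiac disease"]},
)
kb.check()  # validates the structural constraints above
```

The `check` method enforces the two invariants stated in the text: endorsement pairs are of the form 〈status, reply〉, and attacks between status arguments are mutual.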

Language module
The Language module has a double purpose: to map the user's information to the proper KB nodes, and to filter out sensitive information.
Similarly to Chalaguine and Hunter (2020), we compare the information provided by the user with the natural language sentences in the KB, in particular those associated with status nodes. Once we have determined which of these KB sentences are similar to the input sentences, we obtain a set S of associated status nodes. Such a set represents all the information communicated by the user that is relevant for the task at hand, and therefore the information that the Argumentation module needs to compute the answer.

Our endorsement relation is reminiscent of the support relation of Bipolar Argumentation Frameworks (BAFs), but differs from it. BAFs have a unique set of arguments A and their support relation is a subset of A × A, whereas our endorsement relation is a subset of A × R, so it only involves pairs 〈status, reply〉. Furthermore, the support relation of BAFs also affects the extensions (in fact, several variants of the admissible semantics have been defined depending on the type of support considered), whereas our endorsement relation only affects the choice of the replies to be given to the user: whether a node is accepted w.r.t. a set depends only on the attack relation, as in classical AFs.
This representation is completely anonymous and general, since it includes neither the original inputs of the user nor information regarding the single matched sentences. It can effectively be considered a sanitized version of the user's input. Also, any additional irrelevant, but potentially sensitive, information that the user has provided will turn out to be dissimilar from the KB sentences. Therefore, S is the minimal information, in terms of quantity and format, among that provided by the user, that is necessary to compute the answer.
There are many possible strategies that can be used to compute the match, with no consequences on the rest of the architecture as long as they are accurate. Chalaguine and Hunter (2020, 2021) represent sentences using TF-IDF vectors and compare them using cosine similarity (Kenter and de Rijke, 2015), selecting only the most similar sentences in their KB. We instead propose to use sentence embeddings, which allow representing a sentence as a real-valued vector, mapping it into a semantic vector space. As opposed to TF-IDF, this technique allows mapping sentences with similar semantic content into nearby vectors, even if they are very different from a lexical point of view. Instead of cosine similarity, we propose to use Bray-Curtis similarity (Bray and Curtis, 1957), since it has led to satisfactory results in the context of sentence similarity (Galassi et al., 2020). Finally, since the user's sentence may contain information related to multiple nodes of the KB, we do not select only the most similar node, but instead use a threshold hyper-parameter to discriminate between similar and dissimilar sentences. Among the many possible sentence embeddings, we have decided to use Sentence-BERT (Reimers and Gurevych, 2019), which is based on deep network models trained on a vast amount of data and designed specifically for the task of sentence similarity. This choice is motivated by the key role such models have played in the advancement of Natural Language Processing in recent years. The choice of whether to train an embedding model from scratch, use a pre-trained one, or fine-tune a pre-trained one depends on the specific domain of application and the data that may be available for training. The downside of this approach is that it requires hardware resources capable of loading the neural model, which may be unfeasible in some contexts. As an alternative, GloVe embeddings (Pennington et al., 2014) usually perform worse, but do not involve the use of neural models and may therefore be applicable in the general case.
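As a concrete sketch of the matching step, the fragment below uses a toy bag-of-words embedding as a stand-in for Sentence-BERT (which would be used in practice), together with Bray-Curtis similarity and a threshold; all names and the threshold value are illustrative:

```python
import numpy as np

def embed(sentence, vocab):
    # Toy bag-of-words embedding; in practice a Sentence-BERT model
    # would map the sentence into a semantic vector space instead.
    tokens = sentence.lower().split()
    return np.array([tokens.count(w) for w in vocab], dtype=float)

def bray_curtis_similarity(u, v):
    # one minus the Bray-Curtis dissimilarity: 1 - sum|u-v| / sum|u+v|
    denom = np.abs(u + v).sum()
    return 1.0 - np.abs(u - v).sum() / denom if denom else 0.0

def matched_status_nodes(user_sentence, kb_sentences, vocab, threshold=0.5):
    # Select every status node having at least one annotated sentence
    # whose similarity to the user sentence reaches the threshold.
    u = embed(user_sentence, vocab)
    return {node for node, sents in kb_sentences.items()
            if any(bray_curtis_similarity(u, embed(s, vocab)) >= threshold
                   for s in sents)}

vocab = ["i", "am", "celiac", "have", "asthma", "bronchial"]
kb = {"a_celiac": ["i am celiac"], "a_asthma": ["i have bronchial asthma"]}
print(matched_status_nodes("i am celiac and i have asthma", kb, vocab))
```

Note that the single user sentence activates both status nodes here, something a pick-the-single-best-match strategy would miss.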
We have proposed to compute matches by similarity between sentence embeddings, but it is important to remark that our general architecture would also be compatible with other methods. A possible alternative would be to use techniques that directly compute the similarity of two sentences. This could be implemented either using specific algorithms such as the Damerau-Levenshtein dissimilarity (Damerau, 1964), or neural networks such as Poly-Encoders (Humeau et al., 2020). However, this alternative would have a heavy computational footprint, since it would require processing every pair of sentences at run-time. By contrast, the approaches based on sentence embeddings are very fast, since all the KB embeddings can be pre-computed and the comparison between numerical vectors is rather inexpensive. Finally, it is important to be aware that the ability of sentence embeddings to model some concepts may still be imperfect. For example, BERT (Devlin et al., 2019) may better capture negations (Lin et al., 2020; Zhu et al., 2018), whereas GloVe may better understand punctuation (Karami et al., 2021). In some cases, it may be necessary to partially or completely rely on other techniques so as to obtain better matches, e.g., additional pre-processing steps, stance detection, rule-based models, or the use of ontologies.

Argumentation module
The Argumentation module is in charge of computing the replies to the user's queries/sentences. To appreciate our approach, it is important to understand the limitations of choosing a reply based only on the last user sentence, as done by Chalaguine and Hunter (2020). Let us consider the case where a user interacts with a dialogue system in the context of the COVID-19 vaccines, to understand whether they can get safely vaccinated. In this case, there can be conditions, not revealed or excluded by the content of the user's last sentence, that make some candidate replies invalid, so that further information is required: this information can be already available (as it is contained in what the user has previously declared) or needs to be collected, by gathering the user's answers to specific new questions.
For instance, suppose that the user says: "I am celiac, can I get vaccinated?". In the case of celiac people who suffer from no other disease, the answer to this question is R = "Yes, you can get vaccinated at any vaccine site. No special monitoring time is needed". 8 In fact, there is no known specific side effect for people suffering from celiac disease, so there is no need for celiac people to get vaccinated in a hospital or to undergo a specific monitoring time. However, suppose that the user also suffers from bronchial asthma. In this case, the answer that correctly follows the AIFA guidelines is "Yes, but you must get vaccinated at the hospital".

Table 1
Preliminary experimental results of the embedding models and the threshold criterion on the sentence matching task. For each model, we report only 3 fixed thresholds: the ones that have reached the best precision, recall, and F1.
The point is that choosing how to reply to the user by only looking at the current user sentence may be unsafe: the dialogue system should further investigate the health conditions of the user in order to exclude any pathology that makes a candidate reply inappropriate. In the example above, before giving the reply R to the user, the dialogue system should ask the user whether they suffer from bronchial asthma and/or from the other (few) diseases that make R inappropriate. Furthermore, the dialogue system should keep track of everything the user said (differently from Chalaguine and Hunter), because the reply to any further user question must be given by taking into account all the relevant information provided by the user; otherwise, there is the risk of giving a reply that could mislead the user.

Reply strategy
We are now ready to present our strategy for providing users with replies and the algorithm encoding it. Each dialogue session relies on dynamically acquired knowledge, expressed as a set of status arguments S, that encode user information. Basically, S contains the status nodes of the KB activated so far, that is, those corresponding to the information the user has communicated to the system since the start of the dialogue session.
Differently from other proposals, at each turn our system does not simply select a reply endorsed by S. Instead, the aim of the dialogue strategy is to provide the user with a reply that is both endorsed and defended by S. In other words, the system works to provide only robust replies, possibly delaying replies that need further fact-checking.
In fact, our system distinguishes between consistent and potentially consistent replies. The former can be given to the user right away since, as formally stated in Section 3.3.2, it cannot be proven wrong in the future. 9 The latter, albeit consistent with the currently known facts, may still be defeated by future user input, and therefore should be delayed until a successful elicitation process is completed. The formal definitions, reported below, are based on the KB and on a representation of the state of the dialogue consisting of a set S ⊆ A, which contains all the arguments activated during the conversation so far and is assumed to be conflict-free. Furthermore, both definitions are based on the concept of acceptable arguments, recalled in Section 2.

Definition 3.2. (Consistent reply)
Given an argumentation graph G = 〈A, R, D, E〉 and a conflict-free set S ⊆ A, a reply r ∈ R is consistent w.r.t. S iff S endorses r and r is acceptable w.r.t. S.

Definition 3.3. (Potentially consistent reply)
Given an argumentation graph G = 〈A, R, D, E〉 and a conflict-free set S ⊆ A, a reply r ∈ R is potentially consistent w.r.t. S iff S endorses r, S does not attack r, and r is not acceptable w.r.t. S.
As will become clearer in the next section, the strategy first looks for consistent replies to give in response to the user input. If no consistent reply exists, it then looks for potentially consistent replies and tries to turn one of them into a consistent reply. The following example shows consistent and potentially consistent replies.

Example 2. Consider an argumentation graph G = 〈A, R, D, E〉 with A = {a 1 , a 2 , a 3 , a 4 }, R = {r 1 , r 2 }, D = {(a 1 , a 2 ), (a 2 , a 1 ), (a 2 , r 1 ), (a 1 , r 2 )}, E = {(a 1 , r 1 ), (a 2 , r 2 ), (a 3 , r 1 ), (a 4 , r 2 )}, and S = {a 3 }. Because r 1 is attacked by a 2 and is not defended by S, r 1 is not a consistent reply. However, r 1 is a potentially consistent reply, as it is endorsed by S and not attacked by S. If instead S = {a 1 , a 3 }, then r 1 is a consistent reply.
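Definitions 3.2 and 3.3 translate directly into a few set operations. The sketch below checks them against Example 2; since part of the example's graph is not fully legible in the source, the attack and endorsement relations used here are a reconstruction from the example's narrative and should be read as an assumption:

```python
# Assumed encoding of Example 2 (relations reconstructed from the narrative).
A = {"a1", "a2", "a3", "a4"}
R = {"r1", "r2"}
D = {("a1", "a2"), ("a2", "a1"), ("a2", "r1"), ("a1", "r2")}  # attacks
E = {("a1", "r1"), ("a2", "r2"), ("a3", "r1"), ("a4", "r2")}  # endorsements

def endorses(S, r):
    return any((a, r) in E for a in S)

def attacks(S, x):
    return any((a, x) in D for a in S)

def acceptable(r, S):
    # every attacker of r must be counter-attacked by S
    return all(attacks(S, a) for a in A if (a, r) in D)

def consistent(r, S):              # Definition 3.2
    return endorses(S, r) and acceptable(r, S)

def potentially_consistent(r, S):  # Definition 3.3
    return endorses(S, r) and not attacks(S, r) and not acceptable(r, S)
```

With S = {a3}, r1 comes out potentially consistent but not consistent; adding a1 to S makes it consistent.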
In addition to providing replies, the system is also able to provide explanations for the given replies. An explanation of a reply r consists of two parts. The first one contains the arguments leading to r, i.e., those belonging to a set S that endorses r. The second one encodes the why nots, explaining why the system did not give other replies.

Example 3. Continuing the previous example, an explanation for r 1 in the case that S = {a 1 , a 3 , a 4 } is given by 〈{a 1 , a 3 }, {〈r 2 , {a 1 }〉}〉, meaning that r 1 has been given to the user since it is endorsed by both a 1 and a 3 , and that r 2 could not have been given to the user (although it is endorsed by S) since it is attacked by a 1 .

Fig. 7. Matches computed by the models using the 0.65 threshold value on sentences from S1 to S19. The colored cells indicate the matches computed by the two models.

8 Taken from the FAQ section of the AIFA web site: https://www.aifa.gov.it/en/vaccini-covid-19/.
9 The implicit assumption here is that the user does not enter conflicting information, and that the language model correctly interprets the user input. Clearly, if this is not the case, the system's output becomes unreliable, but that would not depend on the underlying reasoning framework. The definition of fallback strategies able to handle such exceptions would be an important extension to the system.
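The two-part explanations of Definition 3.4 (the endorsers of r, plus the why nots) admit a similarly direct sketch. The graph below is again a reconstruction of Example 2's encoding, and is therefore an assumption; the code reproduces Example 3:

```python
# Assumed encoding of the running example (reconstructed from the narrative).
A = {"a1", "a2", "a3", "a4"}
R = {"r1", "r2"}
D = {("a1", "a2"), ("a2", "a1"), ("a2", "r1"), ("a1", "r2")}  # attacks
E = {("a1", "r1"), ("a2", "r2"), ("a3", "r1"), ("a4", "r2")}  # endorsements

def explanation(r, S):
    """Sketch of Definition 3.4: the arguments in S endorsing r, plus, for
    every other reply endorsed by S, the arguments in S attacking it."""
    end = {a for a in S if (a, r) in E}
    not_given = {
        (r2, frozenset(b for b in S if (b, r2) in D))
        for r2 in R
        if r2 != r and any((a, r2) in E for a in S)
    }
    return end, not_given
```

For S = {a1, a3, a4}, `explanation("r1", ...)` yields the pair of Example 3: r1 is endorsed by a1 and a3, and r2 was withheld because a1 attacks it.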
The behaviour of our dialogue system is specified by Algorithm 1. Initially, the system starts the conversation with the user (line). This includes understanding the user question and the context of reasoning. In this work we do not focus on how this method is implemented, but on how to collect the relevant information and how to provide the correct reply. At line, the first user sentence is acquired and stored in variable U. Line initializes the set S that will be used to store the arguments activated during the conversation, the set RY that will be used to store the replies given to the user, and the variable r that will be used to store the current reply to be given to the user.
The while loop (line) handles the conversation with the user until they terminate the chat with a closing sentence. If U is not an explanation request (), function computeMatches is invoked (line); its task is performed by the language module and consists of matching the relevant information given by the user with the status arguments of the KB. The output of computeMatches is a set N of status arguments, which are first added to S (line) and then given as input to function retrieveReplies in order to retrieve the reply arguments that are endorsed by S, and in particular by N, which contains the last activated nodes. The output of retrieveReplies is a pair 〈Cons, PCons〉, where Cons is a set of consistent replies, according to Definition 3.2, while PCons contains the potentially consistent replies, i.e., reply arguments endorsed by S but not yet acceptable in S, as per Definition 3.3. This basically means that an argument a ∈ PCons could turn out to be acceptable in S by adding to S some new argument making S defend a, and this is done by collecting more information from the user.
Then the operations aimed at finding a reply for the user start. If Cons is not empty, a reply is arbitrarily selected among those in Cons (line), stored in RY, and returned to the user (). 10 If both Cons and PCons are empty (line), a consistent reply cannot be found and the conversation is terminated. Otherwise, if PCons is not empty, Algorithm 1 starts the elicitation strategy, aimed at making some reply in PCons consistent. Specifically, function elicit is invoked (line); it receives as input the sets S, PCons and RY and returns a boolean indicating the outcome of the elicitation process. If its value is TRUE, a consistent reply has been found and given to the user by the function, and sets S and RY have been correctly updated, so the while loop continues the conversation by acquiring a new user sentence (line); otherwise, no reply in PCons turned out to be consistent and no new consistent reply has been found, so the conversation must be terminated (line). More detail on function elicit is given shortly. If U is an explanation request (line), meaning that the user is asking for an explanation of the last reply r the system gave them, the proper explanation, according to Definition 3.4, is retrieved at line and given to the user (line).
Function elicit works as follows. Every potentially consistent reply r belonging to PCons is examined by the for loop at line; then the arguments not belonging to S that attack the attackers of r are retrieved. At the end of this inner for loop, if N_new is empty or a subset of S, r is still not consistent, so the only operation the algorithm can do is give a reply belonging to RY and return TRUE (line). If N_new contains new arguments that make r a consistent reply (line), r is given to the user, r is added to RY, and the function returns TRUE. If r is still not consistent, instead, new candidate replies are retrieved (line) and then (i) if Cons is not empty, one reply is selected and given to the user (line), r is added to RY and TRUE is returned; otherwise (ii) PCons is updated () and the main for loop continues with another iteration.
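A minimal sketch of the elicitation step may help. It is a deliberate simplification that only elicits defence nodes and re-checks consistency, omitting the handling of RY and of newly retrieved candidate replies; `ask_user` is a hypothetical callback standing in for the actual dialogue turn:

```python
def elicit_sketch(S, pcons, A, R, D, E, ask_user):
    """Simplified sketch of function elicit: for each potentially consistent
    reply r, the defence nodes (counter-attackers of r's undefended
    attackers) are elicited from the user via ask_user; the real function
    also manages already-given replies (RY) and new candidate replies."""
    def endorses(T, r):
        return any((a, r) in E for a in T)

    def acceptable(r, T):
        # every attacker of r must be counter-attacked by T
        return all(any((b, a) in D for b in T) for a in A if (a, r) in D)

    for r in pcons:
        undefended = {a for a in A
                      if (a, r) in D and not any((b, a) in D for b in S)}
        defence = {b for a in undefended for b in A if (b, a) in D} - S
        S |= {b for b in defence if ask_user(b)}   # ask only about new nodes
        if endorses(S, r) and acceptable(r, S):
            return r, S                            # consistent reply found
    return None, S
```

On a small graph modeled after the case study in Section 4 (node labels assumed), eliciting "no bronchial asthma" and "no previous anaphylaxis" turns the pending reply into a consistent one.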
As regards the worst-case complexity of our algorithms, it is easy to see that each iteration of the while loop of Algorithm 1 can be executed in time O(H × max(K, H)), where H is the number of status nodes and K is the number of reply nodes. In fact, both functions computeMatches and retrieveReplies are O(H), while selectCandidateReply is O(K), retrieveExplanation is O(H × K), and elicit is O(H × max(K, H)). In particular, the complexity of elicit is determined by the complexity of the main loop (O(K)) multiplied by that of the most expensive operation, which can be the for loop at line (O(H)) or one of selectCandidateReply and retrieveReplies. It should be noted that, in practice, all the operations require much less time than the theoretical upper bound, thanks to the use of indexes and suitable data structures. Section 4 provides an example of how Algorithm 1 works.

Properties
Our approach enjoys some interesting properties. The first one concerns consistent replies. The fact that a reply r is consistent means that S counterattacks every attack towards r; thus, as the algorithm proceeds and S grows, no status argument added to S can make r inconsistent, as long as S remains conflict-free, i.e., as long as the user does not make conflicting statements.

Proposition 1. Given an argumentation graph 〈A, R, D, E〉 and a set S ⊆ A, a consistent reply r w.r.t. S is a consistent reply for any conflict-free set S′ ⊇ S.
Proof. We prove the statement by contradiction. Suppose that S′ ⊇ S is conflict-free and r is not a consistent reply for S′. This means that (i) S′ does not endorse r, or (ii) some N ∈ S′ \ S attacks r. As regards (i), since S is a subset of S′, the fact that S′ does not endorse r contradicts the hypothesis that r is a consistent reply w.r.t. S. As regards (ii), since r is acceptable w.r.t. S, S attacks all the arguments attacking r, meaning that S also attacks N, contradicting the fact that S′ is conflict-free. □

Example 4. Consider the argumentation graph G 1 = 〈A, R, D, E〉 depicted in Fig. 3, where A = {a 1 , a 2 , a 3 , a 4 , a 5 }, R = {r 1 , r 2 , r 3 , r 4 }, D = {(a 1 , a 2 ), (a 2 , a 1 ), (a 1 , r 2 ), (a 3 , r 2 ), (a 3 , a 4 ), (a 4 , a 3 )}, and E = {(a 1 , r 1 ), (a 2 , r 2 ), (a 3 , r 3 ), (a 4 , r 4 ), (a 5 , r 4 )}. It is easy to see that S = {a 2 } has no consistent reply, while S′ = {a 2 , a 4 } has two consistent replies, r 2 and r 4 , and that no conflict-free superset of S′ exists making r 2 not a consistent reply w.r.t. it.

We now introduce the concepts of inconsistent set and well-formed argumentation graph, which will be used to state the existence of potentially consistent and/or consistent replies.
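Proposition 1 and Example 4 can be verified mechanically by enumerating every conflict-free superset of S′ in G1, which is feasible here because |A| = 5:

```python
from itertools import combinations

# Example 4's graph G1, as given in the text.
A = {"a1", "a2", "a3", "a4", "a5"}
R = {"r1", "r2", "r3", "r4"}
D = {("a1", "a2"), ("a2", "a1"), ("a1", "r2"),
     ("a3", "r2"), ("a3", "a4"), ("a4", "a3")}
E = {("a1", "r1"), ("a2", "r2"), ("a3", "r3"),
     ("a4", "r4"), ("a5", "r4")}

def conflict_free(S):
    return not any((a, b) in D for a in S for b in S)

def consistent(r, S):
    endorsed = any((a, r) in E for a in S)
    defended = all(any((b, a) in D for b in S) for a in A if (a, r) in D)
    return endorsed and defended

def consistent_in_all_supersets(r, S):
    """Check Proposition 1 on r by enumerating every conflict-free superset of S."""
    rest = A - S
    supersets = (S | set(c) for k in range(len(rest) + 1)
                 for c in combinations(rest, k))
    return all(consistent(r, S2) for S2 in supersets if conflict_free(S2))
```

The enumeration confirms the claims of Example 4: {a2} admits no consistent reply, {a2, a4} admits exactly r2 and r4, and r2 stays consistent in every conflict-free superset.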

Definition 3.5. (Inconsistent set)
Given an argumentation graph 〈A, R, D, E〉, a set K ⊆ A is an inconsistent set of arguments if and only if it is conflict-free and every r ∈ R that is endorsed by K is also attacked by K. Basically, an inconsistent set is a set that admits no potentially consistent replies, and thus no consistent replies at all. If no inconsistent set exists in the argumentation graph G, G is said to be well-formed.

Definition 3.6. (Well-formed Argumentation Graph)
An argumentation graph 〈A, R, D, E〉 is well-formed if and only if there does not exist any inconsistent set K ⊆ A.
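Definitions 3.5 and 3.6 suggest a brute-force well-formedness check, exponential in |A| and therefore only a diagnostic for small knowledge bases. We read K as non-empty (otherwise the empty set would vacuously be inconsistent), and the toy graphs in the test are hypothetical:

```python
from itertools import combinations

def is_inconsistent_set(K, A, R, D, E):
    """Definition 3.5: K is conflict-free and every reply it endorses is
    also attacked by it (we read K as non-empty)."""
    if not K or any((a, b) in D for a in K for b in K):
        return False
    endorsed = [r for r in R if any((a, r) in E for a in K)]
    return all(any((a, r) in D for a in K) for r in endorsed)

def is_well_formed(A, R, D, E):
    """Definition 3.6, by exhaustive enumeration (exponential in |A|)."""
    return not any(is_inconsistent_set(set(K), A, R, D, E)
                   for k in range(1, len(A) + 1)
                   for K in combinations(A, k))
```

A status argument that attacks the only reply it endorses is the smallest example of an inconsistent set.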
The following property concerns the replies provided by our algorithm.
Proposition 2. If the input argumentation graph G = 〈A, R, D, E〉 is well-formed and such that for every a ∈ A there exists r ∈ R with (a, r) ∈ E, then the output of function retrieveReplies satisfies 〈Cons, PCons〉 ≠ 〈∅, ∅〉 at each invocation.
Proof. Reasoning by contradiction, assume that 〈Cons, PCons〉 = 〈∅, ∅〉 for some S. The fact that both Cons and PCons are empty means that every r ∈ R is either not endorsed by S or attacked by S; otherwise, at least one of Cons and PCons would be non-empty. The case that no r ∈ R is endorsed by S contradicts the hypothesis that for every a ∈ A there exists r ∈ R with (a, r) ∈ E, so it must be the case that every r endorsed by S is also attacked, which contradicts the hypothesis that G is well-formed. □

The property above means that, when the graph is well-formed, at least one of the sets output by the function is non-empty: at each iteration there is at least one potentially consistent reply, or at least one consistent reply, or both.
The following property regards the termination of Algorithm 1.
Proof. We prove the statement by examining the alternative scenarios that can occur and showing that termination is reached in every scenario. The function starts with the main for loop at line, which selects a reply r belonging to PCons and invokes selectDefenceNodes. At this point, the two following alternative cases can occur. Case a) N_new is empty (as a possible consequence of N* = ∅ or because replyAcquireMatch returns an empty set) or is a non-empty subset of S: in this case, if an already given reply exists, it will be selected (line) and given to the user, terminating the function, or another iteration starts. Case b) N_new is not empty and not a subset of S: in this case, if r is now a consistent reply, the function provides r to the user (line) and terminates at line. Otherwise, if r is still not consistent, the function can terminate at line, or continue with another iteration. Even if another iteration starts, termination is guaranteed since one of the following two cases will finally occur. Case i) all the nodes of the graph have been added to S: then N* at line is empty, which can cause the function to return TRUE as in case a) above, or prevent new nodes from being added to PCons at line, which in turn means that, after examining all the replies in PCons, the for loop ends and the function returns FALSE at line. Case ii) the user sentences are matched to nodes already in S (line): it is easy to see that this case results in returning TRUE at line or in returning FALSE at line, once all the replies in PCons have been examined by the for loop. □

Note that, if the user does not contradict themselves, S is admissible at every iteration of the algorithm. In fact, since the attacks are mutual, every node added to S counterattacks by itself the attacks it receives, thus every added node is acceptable. This guarantees that the replies given to the user are endorsed and defended by an admissible set of status nodes, enforcing the reasonableness of the replies.

Case study
In this section, we provide an example of how our algorithm works through a case study in the context of COVID-19 vaccines.
Here the aim is to create a dialogue system able to accurately answer user questions about vaccination modalities and related topics. A concrete scenario may involve a government agency providing the dialogue service to its citizens while relying on a third-party scientific institution (e.g., a research center) for the argumentation service and the knowledge base, or one where citizens use a mobile-phone app to retrieve information provided by a research center.
In Fig. 5, we show an excerpt of the argumentation graph encoding the knowledge base, 11 in particular the part related to the modalities of getting vaccinated. In this graph, the yellow rectangles represent the status nodes, the blue ovals represent the possible replies, the green dotted arrows encode the endorsement relations (thus pointing to the possible replies to a given user sentence), and the red ones denote the attack relations, thus encoding the replies that the system must not give when the user's sentences match the nodes attacking them.
It is worth noticing that the graph contains both the positive and the negative version of each status argument. This is a key modeling feature in the context at hand, as it enables the system to properly capture and encode all the information provided by the user about their health conditions. Inside each status node we represent the associated natural language sentences. 12 Let us consider this example: the user sentence acquired at the first iteration of the while loop is "Hi, I am Morgan and I suffer from latex allergy, can I get vaccinated?" (line). The language module processes the user sentence and compares it against all the sentences provided by the knowledge base, resulting in a single positive match with the sentence 'I have latex allergy' associated with node N 11 ; then N = {N 11 } at line. At this point, function retrieveReplies returns 〈∅, {R 2 }〉, as R 2 is the only reply endorsed by N. This reply is not consistent, because it is attacked by both N 8 and N 15 ; it is, however, potentially consistent. Thus, function elicit is invoked, with S = {N 11 }, PCons = {R 2 } and RY = ∅. Function elicit invokes function selectDefenceNodes, which returns {N 7 , N 16 }: to make R 2 consistent, S must be augmented with both N 7 and N 16 . This means that the user must state that they do not suffer from bronchial asthma and that they had no previous anaphylaxis. Then the inner for loop is executed, at the end of which N_new = {N 7 , N 16 } (line), supposing that the user does not suffer from the mentioned diseases. Since R 2 is now a consistent reply w.r.t. S = {N 7 , N 11 , N 16 }, it is given to the user at line and the function terminates.
Alternatively, suppose that the user writes that they do suffer from bronchial asthma. In that case, we would have S = {N 11 , N 8 , N 16 }, hence R 2 would not be a consistent reply. Accordingly, function retrieveReplies is invoked at line of function elicit, with N_new = {N 8 , N 16 }, and returns 〈{R 3 }, ∅〉. In this case, R 3 is given to the user at line and the function terminates.

Fig. 12. Matches computed by MPNet-PM using the mean+std threshold value on the test set.
Besides COVID-19 vaccine information, our architecture can accommodate many other scenarios where privacy matters. In particular, it would be most useful in any context where (i) the desired information is publicly available but may be difficult to obtain or navigate, and (ii) providing the correct answer requires knowing the user's sensitive information. The motivation for (i) is that a user could reconstruct the argumentation graph of the KB through multiple queries; therefore, our proposal is not suited for scenarios where the reasoning process or the knowledge base must remain hidden. One possible application domain is access to legal information, for example in the context of immigration (Queudot et al., 2020). Fig. 6 shows an example of a KB that can be used to address the question of whether an immigrant is required to leave the UK as soon as they stop having a job there.

Evaluation and discussion
Since the Argumentation module is a symbolic module for knowledge representation and reasoning, its evaluation was based on formal properties such as consistency, well-formedness, and termination (Section 3.3.2). To assess the effectiveness of our Language module, instead, we ran an experiment on a use case concerning COVID-19 vaccines. Following the KB illustrated in Fig. 5, we built a small-sized dataset of sentences that are representative of its arguments (i.e., nodes N1 to N16 of the graph). We are especially interested in evaluating our method on sentences with a similar syntactic structure but different meaning (e.g., a sentence and its negation). Initially, we consider only 6 argumentative nodes in a preliminary experiment aimed at finding the best models and their optimal hyper-parameters. Then, we test our choices on the remaining 10 argumentative nodes. For each node, our KB contains between 3 and 7 natural language sentences that can be used to express that concept (see Table 2 and Appendix A).
To obtain a quantitative evaluation, we frame the task as binary classification of every pair of distinct sentences in our KB. A pair is considered a positive instance if the two sentences are associated with the same node, and a negative instance otherwise. For each combination of models and thresholds, we measure precision, recall, and F1 score of the positive class. Precision is especially important: false positives can be seen as cases where the system "misunderstands" the input of the user, and therefore precision can be seen as a measure of correctness. Recall, instead, can be seen as a measure of the ability of the system not to "miss" information contributed by the user. For our system, poor recall is a less serious problem than poor precision, for two reasons. First, it is sufficient to match a single sentence associated with an argumentative node to activate it. Second, the argumentative reasoning module proactively asks the user for missing bits of information that would influence the final result. In our perspective, the priority must be to guarantee the correctness of the final answer, even if this means that the system will, in some cases, ask for information that the user has already submitted. For this reason, we use precision as the main metric of comparison.
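The pairwise evaluation protocol can be sketched as follows; the toy KB and the first-word similarity used in the test are hypothetical stand-ins for the real sentences and embedding models:

```python
def pair_metrics(kb, similarity, threshold):
    """kb maps a node id to its list of associated sentences; every pair of
    distinct sentences is classified as matching (similarity >= threshold)
    and compared with the gold label (same node or not)."""
    sents = [(n, s) for n, group in kb.items() for s in group]
    tp = fp = fn = 0
    for i in range(len(sents)):
        for j in range(i + 1, len(sents)):
            (n1, s1), (n2, s2) = sents[i], sents[j]
            pred = similarity(s1, s2) >= threshold
            gold = n1 == n2
            tp += pred and gold
            fp += pred and not gold
            fn += gold and not pred
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

Any scoring function in [0, 1] can be plugged in as `similarity`, so the same harness serves TF-IDF cosine similarity and sentence-embedding models alike.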

Selection of embedding models and threshold values
We consider the TF-IDF representation used by Charras et al. (2016) and Chalaguine and Hunter (2020), along with the following Sentence-BERT models: 13

• stsb-mpnet (MPNet-S): based on MPNet (Song et al., 2020) and pre-trained for semantic similarity on the STSbenchmark (Cer et al., 2017).
In this experiment, we investigate a wide range of values for the threshold, which is the only hyper-parameter of our method. We consider two values based on the distribution of the similarity scores: one is the average of the similarities (mean), and the other is the sum of the average similarity and the standard deviation (mean+std). Additionally, we consider a set of 13 fixed values ranging from 0.20 to 0.80. Results are shown in Table 1.
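The threshold candidates can be generated as follows; note that the uniform 0.05 step for the 13 fixed values is our assumption, since the text only states the range:

```python
from statistics import mean, stdev

def data_driven_thresholds(scores):
    """The two distribution-based criteria: the average similarity (mean)
    and the average plus one standard deviation (mean+std)."""
    return {"mean": mean(scores), "mean+std": mean(scores) + stdev(scores)}

# 13 fixed values in [0.20, 0.80]; the uniform 0.05 step is our assumption.
FIXED_THRESHOLDS = [round(0.20 + 0.05 * i, 2) for i in range(13)]
```

`data_driven_thresholds` would be computed once over all pairwise similarity scores of the KB sentences.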
Our results clearly show that the MPNet-S and MPNet-P models are the best ones, with the former achieving perfect precision along with high recall and F1. In particular, they both achieve an almost perfect result (only one false positive, no false negatives) using the mean+std threshold. The MPNet-PM model performs only slightly worse than the monolingual version, providing encouraging results in the perspective of future multilingual applications. The TF-IDF baseline performs worse than all the sentence embedding models. Fig. 7 shows an example of matching using sentences from S1 to S19, which are the ones related to the argumentative nodes "Has celiac disease", "Has not celiac disease", "Is immunosuppressed", and "Is not immunosuppressed". The matches are computed by the MPNet-S and MPNet-P models using a threshold value of 0.65. The former achieves perfect precision but not perfect recall, and indeed we can see that it misses some matches, such as S8 and S10. The latter reaches perfect recall but not perfect precision, which indicates the presence of false positives, e.g., the pair S1 and S8.
To better understand how effectively the sentences have been modelled by the different embedding methods, we use Principal Component Analysis to project the embeddings into a 2-dimensional space. Fig. 8 clearly shows that TF-IDF is the least effective at separating the sentences according to their nodes, while the MPNet methods are the most effective. It can also be seen that non-MPNet sentence embedding methods are not very effective at separating nodes that express opposite concepts (e.g., celiac vs non-celiac), while MPNet methods effectively separate sentences that express a negation (in green, blue, and cyan) from the others.
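The projection behind Figs. 8 and 9 can be sketched without external ML libraries by computing PCA via an SVD of the centered embedding matrix; the final L2 normalization step is our reading of the figures' "normalization":

```python
import numpy as np

def project_2d(embeddings):
    """PCA (via SVD on centered data) down to 2 dimensions, followed by
    L2 normalization of each projected point."""
    X = np.asarray(embeddings, dtype=float)
    X = X - X.mean(axis=0)                      # center the data
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    P = X @ Vt[:2].T                            # top-2 principal components
    norms = np.linalg.norm(P, axis=1, keepdims=True)
    return P / np.where(norms == 0.0, 1.0, norms)  # avoid division by zero
```

The resulting 2-D points can then be scattered and colored by node id to reproduce the qualitative analysis.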

Test on the rest of the dataset
We test the best models and the associated best thresholds on the remaining 10 nodes and the related 38 natural language sentences. In particular, we use the three best fixed-value thresholds and the mean+std value previously obtained for each model. We do not compute a new mean+std value because our purpose is to validate the hyper-parameters selected in the previous step. The results are shown in Table 3, while Fig. 9 provides a graphical representation of the dataset sentences as encoded by the models.
The MPNet-S model achieves the best overall F1 score, while MPNet-PM is the one with the best recall among the cases where 100% precision is obtained. Both models obtain their best F1 score with the lowest of the considered thresholds. Also, all three models obtain the lowest F1 with the highest threshold.
B. Fazzinga et al.

For the mean+std threshold, all the models have comparable F1 scores; MPNet-P is the least precise model, while the other models have perfect or almost perfect precision scores. The specific matches obtained with each model are represented in Figs. 10, 11, and 12. By looking at them, we can see that the multilingual model completely fails to recognize two sentences, "I've never had bronchial asthma" and "I have never had an allergic reaction with latex". Regarding the MPNet-S model, its only false positive is a match between the sentences "I went into anaphylactic shock before" and "I've never gone into anaphylactic shock before". This would prompt the argumentative model to ask for additional information on these topics, which is surely not ideal, but not harmful either.
The experimental results are satisfactory and confirm the quality of our method. The fact that the multilingual model performs comparably to the monolingual models is encouraging in the perspective of future development of multilingual chatbots that can be used not only by native speakers but also by tourists, migrants, or refugees. Some of the observed false positives might be particularly troubling in a real application, since they mean that the system has matched a sentence both with its real meaning and with its negation, e.g., the sentence "I have a latex allergy" with "I do not have a latex allergy". The argumentative reasoning module is able to easily detect such conflicts, and in future work we plan to include conflict resolution modules and procedures. A careful user experience design may also be able to mitigate the issue, for instance by interactively displaying relevant pieces of information as they are understood by the system.
We have never encountered cases where a sentence is misunderstood as something completely different (e.g., "I am celiac" as "I do not have any drug allergy"), nor cases where a sentence is misunderstood only as its opposite. The latter cases are potentially harmful when the user has not provided other information regarding that aspect, since the argumentative model would not be able to detect the conflict. Such a problem could be addressed inside the language module, for example by establishing that a node is considered matched only if the user's input is similar to at least K associated sentences, with K being a new hyper-parameter. Whether to enact such a strategy, and the appropriate value for K, would depend on the specific use case and the number of sentences available in the KB for each node.
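The proposed K-sentence rule is straightforward to sketch; `node_matched` is a hypothetical illustration of the idea rather than part of the implemented system:

```python
def node_matched(user_input, node_sentences, similarity, threshold, k=2):
    """Hypothetical stricter matching rule: a node is matched only if the
    user's input is similar to at least k of its associated sentences
    (k = 1 recovers the current single-match behaviour)."""
    hits = sum(similarity(user_input, s) >= threshold for s in node_sentences)
    return hits >= k
```

With k > 1, a single spurious similarity score above the threshold is no longer enough to activate a node.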

Conclusion
Dialogue systems are an increasingly popular class of AI systems that are nowadays used by many companies and institutions to provide services and information. The actual effectiveness of these systems is closely related to the trust that they inspire in users. Usually, dialogue systems are designed to hide what happens behind the curtains, with the purpose of appearing as "human" as possible and gaining the user's trust. Such a trend has not yet led to the desired outcome: despite many efforts, users are still suspicious of dialogue systems, trusting them less than websites (Ischen et al., 2020). Also, while the design of the user experience is largely addressed in the literature (Rhim et al., 2022), surprisingly little technical work has been done on aspects such as the soundness of the answers, the transparency of the computational process, and the management of users' sensitive information.
The present work addresses important research questions: How to ensure data protection while providing personalized services to the individual? How to implement explainability in dialogue systems? How to implement a trustworthy dialogue system? It does so not in the abstract, but by presenting a concrete, modular architecture and by evaluating its prototypical implementation on a simple but realistic case study. By doing so, it aims to bridge the gap between general ethical principles and practical realizations. Compared to mainstream dialogue system architectures, our design is unusual, since it focuses primarily on data protection and transparency. It also addresses explainability, thanks to an Argumentation module that can justify the system's responses with KB facts that encode all the (sanitized) relevant information provided by the user. In this respect, a noteworthy feature of the Argumentation module is its ability to support reasoning over the conflicts between arguments, which lead it to support or discard some responses. For example, our system can provide the user with justifications like 'Since you suffer from bronchial asthma, you cannot get vaccinated at the vaccine site'. We believe that justifying why a response cannot be given, based on elements previously entered by the user, is a good way to make the user understand the response and trust the system.
The COVID-19 vaccination case study was intended to illustrate how our proposal can fit a real-world scenario, showing how users can interact with the system to retrieve information and obtain explanations about it. However, the architecture is of general applicability. Besides the case study, we assessed our system formally, with regard to the Argumentation module, and empirically, with regard to the Language module. Our empirical results, although obtained on a small-sized case study, indicate that the concept is not only feasible but also, possibly, quite effective.

Fig. 1 .
Fig. 1. System architecture and example of interaction with the user. Sentences are represented as rectangles and indicated with S, while circles are used for status and reply nodes (indicated with N and R, respectively). We represent a case where nodes and sentences refer to two concepts, A and B, and the user sentence regards B. The information provided by the user is represented in green and by diagonal stripes. It is easy to see that such information does not reach the argumentation module. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Definition 3.1. (Argumentation graph) An argumentation graph is a tuple G = 〈A, R, D, E〉, where A and R are the arguments of the graph, called status arguments and reply arguments, respectively; D ⊆ A × (A ∪ R) encodes the attack relation and is such that for each (a, b) ∈ D with a, b ∈ A it holds that also (b, a) ∈ D; and E ⊆ A × R encodes the endorsement relation.

Fig. 6 .
Fig. 6.An excerpt of an argumentation graph encoding knowledge about immigration in the UK.

Definition 3.4. (Explanation) Given an argumentation graph G = 〈A, R, D, E〉, a set S ⊆ A and a reply r ∈ R, an explanation for r is a pair 〈End, NotGiven〉, where End contains the arguments a ∈ S s.t. (a, r) ∈ E, and NotGiven is a set of pairs 〈r′, N′〉, where r′ ≠ r, r′ is endorsed by S, and N′ ⊆ S contains the arguments b attacking r′.

Fig. 8 .
Fig. 8. Visualization of the encoded sentences after PCA projection and normalization. Nodes from N1 to N6 are represented through different colors, in order: red, green, blue, orange, cyan, and black. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 9 .
Fig. 9. Visualization of the encoded sentences after PCA projection and normalization. Nodes from N7 to N16 are represented through different colors, in order: red, green, blue, orange, cyan, black, violet, yellow, brown, and pink. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 10 .
Fig. 10.Matches computed by MPNet-S using the mean+std threshold value on the test set.

Fig. 11 .
Fig. 11.Matches computed by MPNet-P using the mean+std threshold value on the test set.

Table 2
Nodes used in our case study and examples of sentences associated with them.

Table 3
Experimental results of the selected embedding models and the threshold criterion on the sentence matching task.