Revealing the Roles of Part-of-Speech Taggers in Alzheimer Disease Detection: Scientific Discovery Using One-Intervention Causal Explanation

Background
Recently, rich computational methods that use deep learning or machine learning have been developed using linguistic biomarkers for the diagnosis of early-stage Alzheimer disease (AD). Moreover, some qualitative and quantitative studies have indicated that certain part-of-speech (PoS) features or tags could be good indicators of AD. However, there has not been a systematic attempt to discover the underlying relationships between PoS features and AD. Moreover, there has not been any attempt to quantify the relative importance of PoS features in detecting AD.

Objective
Our goal was to disclose the underlying relationship between PoS features and AD, understand whether PoS features are useful in AD diagnosis, and explore which PoS features play a vital role in the diagnosis.

Methods
The DementiaBank, containing 1049 transcripts from 208 patients with AD and 243 transcripts from 104 older control individuals, was used. A total of 27 PoS features were extracted from each record. Then, the relationship between AD and each of the PoS features was explored. A transformer-based deep learning model for AD prediction using PoS features was trained. Then, a global explainable artificial intelligence method was proposed and used to discover which PoS features were the most important in AD diagnosis using the transformer-based predictor. A global (model-level) feature importance measure was derived as a summary from the local (example-level) feature importance metric, which was obtained using the proposed causally aware counterfactual explanation method. The unique feature of this method is that it considers causal relations among PoS features and can, hence, preclude counterfactuals that are improbable and result in more reliable explanations.

Results
The deep learning–based AD predictor achieved an accuracy of 92.2% and an F1-score of 0.955 when distinguishing patients with AD from healthy controls. The proposed explanation method identified 12 PoS features as being important for distinguishing patients with AD from healthy controls. Of these 12 features, 3 (25%) have been identified by other researchers in previous works in psychology and natural language processing. The remaining 75% (9/12) of PoS features have not been previously identified. We believe that this is an interesting finding that can be used in creating tests that might aid in the diagnosis of AD. Note that although our method is focused on PoS features, it should be possible to extend it to more types of features, perhaps even those derived from other biomarkers, such as syntactic features.

Conclusions
The high classification accuracy of the proposed deep learner indicates that PoS features are strong clues in AD diagnosis. There are 12 PoS features that are strongly tied to AD, and because language is a noninvasive and potentially cheap method for detecting AD, this work shows some promising directions in this field.


Background
Alzheimer disease (AD) is a serious disease and the most common form of dementia worldwide. In the United States, more than 5 million individuals are living with AD and AD-related dementia, which cost the nation US $244 billion in 2019. The National Academy of Sciences, National Plan to Address Alzheimer's Disease, and Affordable Care Act through the Medicare Annual Wellness identify earlier detection of AD-related dementia as a core aim for improving brain health for millions of Americans.
Traditionally, brief cognitive screening tests and biological marker methods (usually neuroimaging [1][2][3][4] or cerebrospinal fluid examination [5]) have been used for identification. However, these approaches tend to be invasive and expensive and can trigger patient compliance problems. Alternatively, spoken language is a rich and inexpensive source of information in the detection of cognitive status, even at the early stage.
Robinson et al [6] showed that patients with AD are more likely to have a reduction in vocabulary size and difficulty in correctly using verbs and nouns. Croisile et al [7] showed that patients with AD give a shorter speech, more implausible details, and syntactically simplified descriptions.
Recently, machine learning– or deep learning–based automated early-stage AD detection using linguistic features has been proposed and has demonstrated outstanding diagnosis accuracy. Eyigoz et al [8] demonstrated that a patient's language performance in naturalistic probes can expose subtle early linguistic signs of progression to AD long before a clinical diagnosis of the impairment. Khodabakhsh et al [9] studied the diagnosis of AD using speech features extracted from a spontaneous conversation and obtained 90% AD detection accuracy. Machine learning– or deep learning–based methods allow for the use of latent features that go beyond handcrafted features and represent more sophisticated concepts. For example, word or sentence embeddings map words or sentences from a vocabulary to a vector of real numbers. Good embeddings encode similar concepts as adjacent vectors. Studies that used word embeddings for AD diagnosis include the studies by Karlekar et al [10], Wang et al [11], Palo and Parde [12], and Mahajan and Baths [13]. In addition to word embeddings, the study by Karlekar et al [10] used part-of-speech (PoS) features; the study by Wang et al [11] used PoS features and sentence embeddings; the study by Palo and Parde [12] used targeted psycholinguistic, sentiment, and demographic features; and the study by Mahajan and Baths [13] used recurrent neural networks to capture the temporal dynamics in speech recordings for improving the diagnosis accuracy.
However, most previous studies were performance oriented and constructed more complex models with an increasing number of features and modalities. Although these models achieve better diagnostic accuracy, they usually sacrifice transparency in the diagnosis-making process. This is because most of these complex models are deep learning based, which makes them inherently opaque, and not all the features are human interpretable. This is especially true if their influence on the prediction is not well understood. This opaqueness and lack of understanding of the contributions of individual features to the prediction has resulted in reluctance among the clinical community to use these methods in practice [14].
Explainable artificial intelligence (XAI) refers to methods that can reduce the opaqueness of deep learning models. XAI methods can be classified according to various criteria. One of the taxonomies is based on the format of explanation. Local or example-based explanation explains an individual prediction, whereas the global explanation explains the model behavior (eg, feature importance).
Beyond explaining the model's internal mechanism, recent studies have used XAI methods for scientific discovery. XAI-based scientific discovery enables the discovery of insightful scientific concepts from model explanations obtained through XAI methods. Ginsburg et al [15] proposed Feature Importance in Nonlinear Embeddings for the analysis of cancer patterns in breast cancer tissue slides. Feature Importance in Nonlinear Embeddings automatically determines the important features that revealed previously unknown scientific attributes. Li et al [16] showed that concepts similar to Kepler laws of planetary motion and the Newton law of universal gravitation can be obtained through XAI methods.

Objectives
Our goal was to disclose the underlying relationship between PoS features and AD. Ours is the first study to explore the predictive power of PoS features for AD diagnosis by using a well-performing transformer-based [17] model, which is trained to use PoS features for AD diagnosis. If a feature does not impact the decision of this predictor, then it stands to reason that this feature does not have much predictive power. Note that although PoS features were used in previous works for AD diagnosis, and impressive accuracies were achieved, they were usually combined with other features as inputs; hence, the effect of PoS features alone is unclear. In our study, we found that using only PoS features can still yield a high AD diagnosis performance with 92.2% accuracy. Hence, we believed that it would be interesting to discover which PoS features play vital roles in this prediction.
To understand the importance of any given feature for a particular problem, it is important to study the effect this feature has globally on all samples. To achieve this goal, we used an example-based explanation called counterfactual explanation (CFE) [18] on our predictor. Example-based explanation gives explanations for individual data samples. Then, we analyzed the statistical summary of the CFEs of a group of data samples to show the global effect of each input feature.
Conventionally, CFE aims to answer "Why" questions such as "Why the model's decision is Y" or "What would have happened to Y, had I not done X?" The first step in obtaining CFE is to search for counterfactual examples, which are defined as the examples obtained by applying minimal changes to the features of the original example and having the predefined outputs. Then, CFEs can be extracted by comparing the differences between the original example and its counterfactual examples. For example, if the model's prediction is changed from a patient with AD to healthy control as we manually increase the appearance of nouns by the minimal unit (eg, 1) in a data sample, then the CFE would indicate that the number of nouns used is an important factor for classifying the sample as data collected from a patient with AD.
However, when generating counterfactual examples, the conventional CFEs assume that features are independent of each other. This can result in counterfactual examples that are not feasible in the real world. For example, an infeasible CFE can suggest that the number of nouns decreased whereas the number of adjectives increased, which is anticausal because adjectives are usually used to modify nouns; hence, the number of adjectives is expected to increase or remain unchanged as the number of nouns increases.
It is clear that conclusions drawn from potentially infeasible counterfactuals cannot be reliable. Hence, it is important to develop a causally aware counterfactual intervention method for our purposes. We argue that the key to making the generated counterfactual examples feasible is ensuring that the generation process of counterfactual examples obeys causal rules. That is, as counterfactual examples are generated by making changes to some features, the causal consequences of these changes (eg, an increase in the number of nouns causes an increase in the number of adjectives) have to be considered.
To generate feasible counterfactual examples, we propose using a causal model that contains a directed graph that models the random variables by nodes and their causal relation by directed edges. Each edge in the causal model also encodes the causal function f: P→C, where C represents any variable that is modeled in the causal model, and P represents the variables that cause variable C. Then, one can generate counterfactual examples of the original example by performing interventions in the causal model. Performing interventions is the process in which some variables within a sample are changed to fixed values, and the rest of the variables are generated according to the causal functions (eg, f). A counterfactual sample can be regarded as a CFE if it can yield the predefined output.
To understand the importance of a single feature, we intervened on only one feature at a time for counterfactual generation. Hence, we named our proposed method one-intervention causal explanation (OICE). We then used the one-intervention causal examples to explain the importance of each feature by asking, "What would have happened to the output, had I intervened on feature A?" Moreover, using one intervention allowed us to systematically study the impact of the different features. Each feature (and its descendants) that is impacted by the parent feature in this one-intervention approach could be further analyzed using the structural causal model (SCM). Finally, we defined 3 metrics for quantifying the importance of the features in the decisions.

Related Work: CFE Method
CFEs are a widely used method for generating explanations of a model's decision and aim to answer "How the world would have to be different for a desirable outcome to occur" [18]. By studying these counterfactual instances, that is, by examining the difference between the original scenarios and the hypothesis or a possible suggestion about how the desired outcome can be obtained by changing some of the features, one can explain why a model arrives at a specific outcome. Generally, CFEs are generated by finding the minimal changes required to change the classification of an instance to the desired class. Wachter et al [18] formulated a general form for finding the CFE $x^{CF}$:

$$x^{CF} = \arg\min_{x'} \max_{\lambda} \; \lambda \left(f_w(x') - y'\right)^2 + d(x, x')$$

where $x$ is the query instance, $x'$ is a candidate counterfactual, $f_w$ is the classifier, $y'$ is the desired output, and $d(\cdot,\cdot)$ is a distance function. In practice, maximization over $\lambda$ is done by iteratively solving for $x'$ and increasing $\lambda$ until a sufficiently close solution is found.
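The iterative scheme described above (solve for x', then increase λ until the prediction is close enough to y') can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the 1-feature logistic classifier `f_w`, the integer candidate grid, the unweighted absolute-difference distance, the tolerance, and the doubling schedule for λ are hypothetical and are not taken from Wachter et al [18] or from this study.

```python
import math

def f_w(x):
    """Toy classifier (hypothetical): probability of the positive class for one feature."""
    return 1.0 / (1.0 + math.exp(-(x - 5.0)))

def wachter_counterfactual(x, y_target, candidates, tol=0.2,
                           lam=0.1, growth=2.0, max_rounds=50):
    """Minimize lam * (f_w(x') - y')^2 + d(x, x') over a candidate grid,
    increasing lam until the prediction is within `tol` of the target."""
    for _ in range(max_rounds):
        # Inner step: best x' for the current lam (d is the absolute difference).
        best = min(candidates,
                   key=lambda xp: lam * (f_w(xp) - y_target) ** 2 + abs(x - xp))
        if abs(f_w(best) - y_target) <= tol:
            return best, lam
        lam *= growth  # outer step: put more weight on reaching the target
    return None, lam

# Query instance x = 2 is predicted negative; ask for the nearest positive instance.
cf, final_lam = wachter_counterfactual(x=2, y_target=1.0, candidates=range(0, 11))
print(cf, final_lam)
```

With a small λ the distance term dominates and the search returns the query itself; only once λ is large enough does the closest candidate that reaches the target become the minimizer.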
The quality of CFEs is measured in terms of actionability, feasibility, diversity, and sparsity. The meaning of each metric is stated as follows:
• Actionability: this refers to the extent to which a suggested alternative scenario or action is practical and feasible to implement. In contrast, a CFE that changes any immutable features (eg, gender: male → female) is unactionable.
• Feasibility: features that are changed by a CFE should be within a reasonable range or population. An infeasible CFE could be changing the number of credit cards from 5 to -1.
• Diversity: this refers to the ability to generate diverse CFEs.
• Sparsity: this refers to the number of features that are changed in CFEs. Fewer changes or high sparsity is favorable because humans can only extract limited information.
Most existing approaches in the literature on CFEs are dedicated to improving the aforementioned metrics. Recent studies [19,20] considered the distribution of data and generated counterfactual instances from the relatively high-density region of the input space. This method improves feasibility by avoiding unlikely or unrealistic counterfactual instances under the data distribution. Ustun et al [21] improved actionability and feasibility by allowing the counterfactual instances that optimize a user-specified cost function and prevent counterfactuals from changing immutable variables such as age, sex, and gender. Russell [22] proposed a mixed-integer programming formulation to handle mixed data types and offered CFEs for linear classifiers that respect the original data structure. This formulation is guaranteed to find coherent solutions by only searching within the "mixed-polytope" structure defined by a suitable choice of linear constraints.
The study most similar to ours is that of Karimi et al [23], which shifted the paradigm from the nearest CFEs to minimal interventions. Specifically, in the study by Karimi et al [23], counterfactual examples were generated by the predefined SCM and a set of possible interventions to achieve the desired outcomes. The optimal intervention set is the one that induces the minimum cost, where the cost is measured by a predefined cost function on the intervention sets. In addition, they proved the necessity of considering all intervariable causal dependencies and demonstrated efficiency on some toy data sets. We used a more complex SCM known as Causal Generative Neural Network (CGNN) [24] to capture the intervariable causal dependencies and generate CFEs using the intervention. In addition, we statistically analyzed the derived explanations to inspect the global behavior of the model.

Overview
For scientific discovery purposes, our method incorporated 3 phases: knowledge learning, knowledge extraction, and knowledge verification. As shown in Figure 1, in the knowledge learning phase, we used a transformer-based classifier to learn the underlying association between PoS features and AD. In the knowledge extraction phase, we used our proposed XAI method, OICE, to extract the learned mechanism. In particular, OICE quantitatively indicated the importance of PoS features used by the model in AD classification, and the extracted knowledge (ie, feature importance) was verified with the findings of previous studies in phase 3. A model that is verified to have high consistency with previous findings is more plausible and, hence, is more likely to provide reliable insights into the underlying mechanism among PoS features and AD.
In the following sections, first, we introduce the data set and, subsequently, the structure of the transformer-based classifier. Then, we introduce the proposed model explanation method, OICE. Finally, we describe the details of implementing the introduced methods.

Data Set Description
The DementiaBank [25] is a database of multimedia interactions for the study of communication in patients with dementia. This data set comprises the transcripts of individuals (individuals with dementia and control individuals) who were given four tasks: (1) cookie theft description, in which the participants in both the control group and dementia group were given a picture of a child attempting to steal a cookie and asked to describe what they saw; (2) word fluency, in which the fluency of the participants in the dementia group was measured; (3) recall, in which the participants in the dementia group were tested on their memory recall; and (4) sentence construction, in which the participants in the dementia group were tested on sentence construction. In total, the corpus contains 1049 transcripts from 208 patients with AD and 243 transcripts from 104 older control individuals, amounting to a total of 1292 transcripts. Two examples from the DementiaBank data set are presented in Table 1. In this study, we used all the transcripts described earlier.
The transcripts were tokenized into single-word tokens, and each token was assigned a PoS tag using the Natural Language Toolkit [26]. For each transcript, we generated a PoS feature vector with the counts of 27 PoS tags. The names and the meanings of the 27 PoS features are presented in Table 2.

Table 1. Two examples from the DementiaBank data set. In our experiment, we analyzed the part-of-speech features that were extracted from the speech records.

Label: Healthy control
Speech record: Okay, well the mother is drying the dishes, the sink is overflowing, um the little girl's reaching for a cookie, and her brother's taking cookies out of the cookie jar, and the stool is going to f knock him on the floor laughs, he's going to fall on the floor because the stool's not uh what, with gravity, whatever, uh the uh curtains are blowing I think, that's all I can see

Label: Patient with Alzheimer disease
Speech record: I would like to have a lead pencil, the tree is blossoming, I hope my child doesn't hafta go to the hospital, I hope my child doesn't hafta go to the hospital, I shouldn't say that because we have a daughter who's pregnant, and I do want her to go to the hospital, okay then, this winter has been a very cold one, the doctor said I, I sat in the chair by a the doctor, brief, I'm not, I forgot to try make them brief, the bureau drawer stands open
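The step from tagged tokens to a PoS count vector can be sketched as follows. This is only an illustration under stated assumptions: the tag list `TAGS` is a hypothetical 12-tag subset (the tags named in this paper) rather than the full 27-tag set of Table 2, and the tagged tokens are written by hand here; in practice they would come from a tagger such as the one in the Natural Language Toolkit.

```python
from collections import Counter

# Hypothetical subset of Penn Treebank tags; the study uses 27 tags (Table 2).
TAGS = ["DT", "JJ", "NN", "NNP", "PDT", "RB", "RP", "TO", "VB", "VBG", "VBZ", "WP"]

def pos_count_vector(tagged_tokens, tags=TAGS):
    """Return the number of occurrences of each tag in `tags`, in order.
    Tags outside the list are ignored."""
    counts = Counter(tag for _, tag in tagged_tokens)
    return [counts[tag] for tag in tags]

# Hand-tagged fragment of the healthy control transcript in Table 1.
example = [("the", "DT"), ("mother", "NN"), ("is", "VBZ"),
           ("drying", "VBG"), ("the", "DT"), ("dishes", "NNS")]

print(pos_count_vector(example))  # [2, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0]
```

Each transcript then becomes one fixed-length integer vector, which is the input format assumed by the later sketches in this paper.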

Ethical Considerations
We used the DementiaBank data set, which is archived by TalkBank. TalkBank is subject to its own Code of Ethics (detailed in the Code of Ethics page of the TalkBank website [27]), which supplements but does not replace the generally accepted professional codes of the American Psychological Association Code of Ethics and the American Anthropological Association Code of Ethics.

Transformer-Based AD Classification Model
Recently, we proposed a transformer-based [11] classifier to exploit PoS features, as shown in Figure 2. In our architecture, we used the multihead attention (MHA) module and the encoder structure of the transformer to process these features. Our motivation stems from the success of this architecture in creating state-of-the-art language embeddings, as demonstrated in the study by Wang et al [11]. This architecture comprises a self-attention module that captures the intrafeature relationships and an attention layer followed by a 1-D Convolutional Neural Network layer. The MHA module is the same as that proposed in the study by Wang et al [11] for the popular transformer architecture. If $R = \{r_1, r_2, \ldots, r_n\}$ is the set of records, then $r_i$ is the ith record in the data set. We computed PoS features for each record. Let $P = \{p_1, p_2, \ldots, p_n\}$ be the set of PoS feature vectors and $p_i$ be the ith vector in the PoS matrix. We used 6 MHA layers on $P$ to capture the relationship between the PoS features. The MHA transformed $P$ into another matrix of n-dimensional vectors $A = \{a_1, a_2, \ldots, a_n\}$. The MHA module was followed by a 1-layer Convolutional Neural Network and a softmax layer to obtain the final classification.

Overview
To derive an explanation, OICE first calculates the CFEs for each sample. Each CFE can simply be seen as a vote for the importance of the features of each sample. Then, OICE groups these CFEs to summarize the global explanation about feature importance. In this subsection, we first outline the preliminary information on SCM, which is an essential element for obtaining CFEs. We then describe how we learn an SCM from the data. Next, we discuss how we formulated OICE and how OICE generates individual CFEs using the pretrained SCM. Then, we introduce the metrics that we propose to measure feature importance (global explanation) according to a group of CFEs. We denote an intervention in SCM by a do-operator do (•).

The Concept of SCM
Intervening on a set of variables to fix them to the value $a$ can then be described as $do(\{X_i = a\}_{i \in I})$, where $I$ is the set of indices of the endogenous variables to be intervened upon. An intervention changes the causal relations and causal mechanisms defined in the original SCM: the endogenous variables indexed by $I$ are obtained through $do(X_i = a)$ rather than $X_i = F_i(PA_i, U_i)$. Therefore, by performing the intervention, the original SCM $M$ is changed to a postintervention SCM $M_I$.
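The effect of the do-operator on an SCM can be illustrated with a small sketch. The 2-variable graph (NN → JJ), the linear mechanism, and the noise values below are hypothetical and chosen only to show how an intervened variable ignores its own mechanism while its descendants are recomputed; they are not the mechanisms learned in this study.

```python
def run_scm(mechanisms, order, noise, do=None):
    """Evaluate X_i = F_i(PA_i, U_i) in topological order.
    Variables listed in `do` are set to the fixed value instead (do(X_i = a))."""
    do = do or {}
    values = {}
    for name in order:
        if name in do:
            values[name] = do[name]  # mechanism replaced by a constant
        else:
            parents, f = mechanisms[name]
            values[name] = f([values[p] for p in parents], noise[name])
    return values

# Hypothetical toy SCM: the noun count drives the adjective count.
mechanisms = {
    "NN": ([], lambda pa, u: u),
    "JJ": (["NN"], lambda pa, u: 0.5 * pa[0] + u),
}
order = ["NN", "JJ"]
noise = {"NN": 10.0, "JJ": 1.0}

observed = run_scm(mechanisms, order, noise)
intervened = run_scm(mechanisms, order, noise, do={"NN": 4.0})
print(observed)    # {'NN': 10.0, 'JJ': 6.0}
print(intervened)  # {'NN': 4.0, 'JJ': 3.0}
```

Intervening on NN changes JJ through the causal mechanism, whereas intervening on JJ would leave NN untouched because NN is not a descendant of JJ.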

SCM via Generative Network
We used the CGNN proposed in the study by Goudet et al [24] to represent SCM because it does not limit the types of causal mechanisms (eg, linear or nonlinear). Given a causal graph, a CGNN can be trained to learn the causal mechanisms underlying the causal graph by reducing the maximum mean discrepancy [28] between the ground-truth data and the generated data. CGNN generates each endogenous variable through $X_i = F_i^{\theta_i}(PA_i, U_i)$, where $F_i^{\theta_i}$ is a generative neural network parameterized by $\theta_i$. For simplicity, we use $F_i$ to represent $F_i^{\theta_i}$ in the rest of this paper. $U_i$ are random samples drawn from a Gaussian distribution. Figure 3 illustrates an example of SCM construction using CGNN.
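The maximum mean discrepancy used as the training criterion can be estimated from two finite samples. The sketch below computes the biased (V-statistic) estimate of the squared discrepancy under an assumed single Gaussian kernel; the bandwidth value and the toy 1-dimensional samples are hypothetical, and the kernel configuration actually used for training is not specified here.

```python
import math

def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of MMD^2: E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""
    def mean_kernel(a, b):
        total = sum(gaussian_kernel(p, q, bandwidth) for p in a for q in b)
        return total / (len(a) * len(b))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)

real = [[0.0], [1.0], [2.0]]       # stand-in for ground-truth samples
generated = [[3.0], [4.0], [5.0]]  # stand-in for generator output

print(mmd_squared(real, real))       # 0.0: identical samples
print(mmd_squared(real, generated))  # positive: the samples differ
```

A generator whose samples match the data distribution drives this quantity toward 0, which is why it can serve as a loss for fitting the causal mechanisms.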
The weights of causal mechanisms (ie, $\theta_i$) are updated to minimize the maximum mean discrepancy between the ground-truth samples and the samples generated by the CGNN. In our experiment, we discovered the causal relations in the DementiaBank data set using the PC algorithm [29]. The PC algorithm is a constraint-based causal discovery method, under the assumption of causal sufficiency (ie, no latent confounders). We discovered causal relations among PoS features from the DementiaBank data set rather than using generic PoS causal rules, as the former would better capture the causal relations among PoS features in the dementia group.

Figure 3. Example of a structural causal model. Left: causal graph. Right: causal mechanisms. X, U, and F stand for the endogenous variables, the exogenous variables, and the causal mechanisms, respectively. As for the Causal Generative Neural Network, each causal mechanism is implemented with a generative neural network.

Explanation by Minimal Intervention
We now introduce some notations and discuss the formulation of OICE. We formulated the problem of OICE as searching for the optimal intervention $I^*$ that results in a counterfactual example $x^{CF}$, which would flip the outcome from $y$ to $y'$. One intervention was implemented by fixing $\lVert I \rVert_0$ to be 1. In the formulation (Equation 2), $h$ denotes the predictive model. In most cases, the model $h$ is a probabilistic model; we then select the optimal solutions $I^*$ as those that result in counterfactual examples that can achieve a particular degree of certainty to be $y'$ (eg, $h(G(U_F, I; F))$ is 80% certain to be $y'$). Thus, multiple optimal solutions were obtained, which contain different intervened features. Note that the same intervened feature may have different intervention values. Consequently, we further distilled our optimal solution set by keeping, for each subset of solutions with the same intervened feature, only the solution that causes the minimum distance weighted by the median absolute deviation [18].

Metrics for Measuring Importance
So far, we have introduced how we obtained explanations for individual instances using OICE. We then made inference of the model's global behavior (ie, importance of features) by statistically analyzing the explanations derived from a batch of samples. In this section, we introduce some metrics to measure the impact of intervening a feature to cause a flip in the outcome. The impact of features can be further associated with their importance for a machine learning model in making a decision.
Let $S = \{S^{(1)}, S^{(2)}, \ldots, S^{(n)}\}$ represent a set of $n$ samples that belong to class $y$ (ie, $h(S^{(i)}) = y$ for $i = 1, 2, \ldots, n$). In our case, the problem is a binary classification problem, and the classes are "control" or "Alzheimer's." Let $C_k^{(i)}$ denote the CFE of the ith sample obtained by intervening on the feature $k$ and, hence, $h(C_k^{(i)}) \neq y$. To measure the impact on the flip in the outcome caused by intervening on the feature $k$, we introduce our first metric, impact score (IS). $IS_k$ can be interpreted as the proportion of counterfactual samples for which the feature $k$ must be intervened to flip the outcome and is defined as follows:

$$IS_k = \frac{|I_k|}{n}$$

where $I_k = \{i : h(C_k^{(i)}) \neq y,\ i = 1, 2, \ldots, n\}$ is a set that contains the indices of samples in $S$ that have a CFE obtained by intervening on the feature $k$. The IS describes the overall impact and does not consider the cost of the intervention (ie, how much a feature has been increased or decreased). Accordingly, we introduce another metric, weighted IS (wIS), to measure the impact made by changing the unit value of a feature. This measure trades off the impact with the cost of impact (CI). wIS can be used to draw comparisons among the features. Features with higher wIS values have more importance in flipping the outcome. To define wIS, we first introduce the parameter CI to measure the average absolute change that must be made to achieve the impact (ie, IS). Using subscript $j$ to index the jth feature of a sample $S^{(i)}$ or $C_k^{(i)}$, the CI for the feature $k$ can be defined as follows:

$$CI_k = \frac{1}{|I_k|} \sum_{i \in I_k} \frac{\left| C_{k,k}^{(i)} - S_k^{(i)} \right|}{R_k}$$

where $R_k$ is the range of feature $k$. Next, we define the wIS as follows:

$$wIS_k = \frac{IS_k}{CI_k} \tag{5}$$

Note that the wIS defined in Equation 5 does not consider the trends of change in a feature (ie, increasing or decreasing). To address this, we separated $wIS_k$ into $wIS_k^{+}$ and $wIS_k^{-}$ to represent the wIS for increasing and decreasing the value of the feature $k$, respectively.
They are calculated using the following rules: (1) if all the trends of change (ie, $\mathrm{sign}(C_{k,j}^{(i)} - S_j^{(i)})$) are the same, …

It is important to understand how much each changed feature contributes to flipping the outcome. Consequently, we introduce another metric called pure IS (PIS) to quantify the importance of every changed feature within the CFEs obtained by the same intervention.
Hence, the PIS for a feature is calculated by subtracting the impact (on flipping the outcome) caused by its child nodes from the IS of this feature. As the wIS represents the change in IS per unit change in the value of the feature, the impact of each child node $m$ can be quantified as the average of the changes in the values of $m$ multiplied by the wIS of $m$. The impact caused by the feature $m$ when $m$ is causally affected by the feature $k$ is defined as follows:

$$PIS_k^m = wIS_m \times \left(\text{average change in the value of } m \text{ over the CFEs obtained by intervening on } k\right)$$

The PIS for the intervened feature $k$, $PIS_k^k$, is defined as follows:

$$PIS_k^k = IS_k - \sum_{m \in CH_k} PIS_k^m$$

where $CH_k$ is the set of indices of the child nodes of the feature $k$. The value of $PIS_k^m$ is then normalized over $IS_k$ to represent the percentage of effort for flipping the outcome.
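The first three metrics can be computed directly from a batch of samples and their CFEs. The sketch below is one reading of the prose definitions in this section and should be treated as an assumption: IS is taken as the fraction of the n samples that have a CFE for feature k, CI as the mean absolute change of feature k divided by its range, and wIS as IS divided by CI. The function name, the data layout (a CFE is `None` when no flip was found), and the toy numbers are hypothetical.

```python
def importance_metrics(samples, cfes, k, feature_range):
    """Return (IS_k, CI_k, wIS_k) for feature index k.

    samples: list of feature vectors S^(i), all predicted as class y.
    cfes:    list of counterfactual vectors C_k^(i), or None where intervening
             on feature k could not flip the outcome.
    feature_range: R_k, the range of feature k.
    """
    n = len(samples)
    flipped = [i for i, cfe in enumerate(cfes) if cfe is not None]  # index set I_k
    is_k = len(flipped) / n
    if not flipped:
        return is_k, None, None
    mean_abs_change = sum(abs(cfes[i][k] - samples[i][k]) for i in flipped) / len(flipped)
    ci_k = mean_abs_change / feature_range
    wis_k = is_k / ci_k if ci_k > 0 else None
    return is_k, ci_k, wis_k

# Toy batch of 4 samples with 2 features; feature 0 is the intervened feature.
samples = [[10, 3], [8, 2], [12, 4], [9, 3]]
cfes = [[6, 2], None, [9, 3], [7, 2]]

is_k, ci_k, wis_k = importance_metrics(samples, cfes, k=0, feature_range=20)
print(is_k, ci_k, wis_k)
```

In this toy batch, 3 of 4 samples can be flipped by an average change of 3 units on a range of 20, so a feature that flips many samples with small changes receives a high wIS.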

Model Settings
In our experiments, we used 6 layers for the MHA module. We used stochastic gradient descent with momentum (SGD + Momentum) as the optimizer for training. Because DementiaBank is an unbalanced data set, we added a class weight correction by increasing the penalty for misclassifying the less frequent class during model training to reduce the effect of data bias. The class weight correction ratio used in this study is 7:3. We randomly split the original data into 81% training set, 9% validation set, and 10% testing set over multiple seeds. Our proposed model achieved a high accuracy of 92.2%.
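A seeded 81%/9%/10% split can be sketched as follows. This is a hypothetical illustration of the proportions stated above, not the splitting code used in the study: the function name, the rounding of the subset sizes, and the use of transcript indices are assumptions.

```python
import random

def split_indices(n, seed, test_frac=0.10, val_frac=0.09):
    """Shuffle indices 0..n-1 with a fixed seed and cut them into
    training, validation, and testing subsets."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    n_test = round(n * test_frac)
    n_val = round(n * val_frac)
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test

# 1292 transcripts, repeated over several seeds as in a multi-seed evaluation.
for seed in (0, 1, 2):
    train, val, test = split_indices(1292, seed)
    print(seed, len(train), len(val), len(test))
```

Fixing the seed makes each split reproducible, and repeating over several seeds gives an estimate of how sensitive the reported accuracy is to the particular partition.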

PoS Feature Causal Relation Discovery
As mentioned earlier, we used the PC algorithm [29] to discover the intrafeature dependencies. The causal graphs returned by the PC algorithm contained undirected edges. Hence, we further revised the returned graph by orienting the undirected edges. The edges were oriented according to our knowledge of the linguistic features. For example, we made the causal direction noun (NN) → adjective (JJ) because NN causes the use of JJ. The full causal graph for the 27 linguistic features used in our experiment is illustrated in Figure 4.
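The orientation step described above can be sketched as a small rule-based pass over the undirected edges. The function, the edge lists, and the rule set below are hypothetical; only the NN → JJ direction is taken from the text, and edges without a matching rule are returned for manual review rather than guessed.

```python
def orient_edges(directed, undirected, rules):
    """Orient undirected edges using domain rules.

    directed:   set of (cause, effect) pairs already oriented by causal discovery.
    undirected: list of (a, b) pairs with unknown direction.
    rules:      set of (cause, effect) pairs taken from linguistic knowledge.
    """
    oriented = set(directed)
    unresolved = []
    for a, b in undirected:
        if (a, b) in rules:
            oriented.add((a, b))
        elif (b, a) in rules:
            oriented.add((b, a))
        else:
            unresolved.append((a, b))  # no rule applies; needs a manual decision
    return oriented, unresolved

# Hypothetical output of a discovery run: one oriented edge, two undirected ones.
directed = {("VBG", "DT")}
undirected = [("JJ", "NN"), ("RB", "WP")]
rules = {("NN", "JJ")}  # nouns cause the use of adjectives

oriented, unresolved = orient_edges(directed, undirected, rules)
print(sorted(oriented))  # [('NN', 'JJ'), ('VBG', 'DT')]
print(unresolved)        # [('RB', 'WP')]
```

Keeping the rules explicit makes the linguistic assumptions behind the final causal graph easy to audit.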

Problem Solver
Solving the $\ell_0$ norm constraint in Equation 2 is a nontrivial task. However, the PoS features used by the proposed classifier are all integers and within narrow ranges. This makes it possible to solve the problem by exhaustively enumerating all candidate solutions and then selecting the optimal ones. In addition, we set the certainty parameter to 80%; this implies that all solutions $I$ that satisfy $\lVert h(G(U_F, I; F)) - y' \rVert^2 < \alpha$, where $\alpha$=.04, are considered optimal. The value of $\alpha$ is chosen to reflect 80% certainty: a squared deviation below .04 corresponds to a predicted probability within .2 of $y'$.
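The exhaustive one-intervention search can be sketched on a toy problem. All components below are stand-ins chosen for illustration: the 2-feature SCM (NN → JJ with a linear mechanism), the logistic classifier `predict`, the integer value ranges, and the unweighted absolute-difference distance (the study weights the distance by the median absolute deviation) are hypothetical. The acceptance test assumes the squared-deviation reading of the threshold, ie, (h(x^CF) - y')^2 < α with α = .04.

```python
import math

def generate(sample, do):
    """Toy SCM: JJ = 0.5 * NN + u_JJ, with u_JJ recovered from the original sample."""
    u_jj = sample["JJ"] - 0.5 * sample["NN"]
    nn = do.get("NN", sample["NN"])
    jj = do["JJ"] if "JJ" in do else 0.5 * nn + u_jj
    return {"NN": nn, "JJ": jj}

def predict(x):
    """Toy classifier (hypothetical): probability that the sample is from a patient with AD."""
    logit = 0.5 * (15 - x["NN"]) + 0.2 * (8 - x["JJ"])
    return 1.0 / (1.0 + math.exp(-logit))

def oice_search(sample, value_ranges, y_target, alpha=0.04):
    """Try every single-feature intervention do(k = v) and keep, per feature,
    the accepted counterfactual with the smallest distance to the sample."""
    best = {}
    for k, values in value_ranges.items():
        for v in values:
            if v == sample[k]:
                continue
            cf = generate(sample, {k: v})
            if (predict(cf) - y_target) ** 2 < alpha:  # at least 80% certain
                dist = sum(abs(cf[f] - sample[f]) for f in sample)
                if k not in best or dist < best[k][1]:
                    best[k] = (cf, dist)
    return best

control = {"NN": 20, "JJ": 12}  # predicted as control by the toy classifier
ranges = {"NN": range(0, 41), "JJ": range(0, 21)}
print(oice_search(control, ranges, y_target=1.0))
```

In this toy setting only an intervention on NN can flip the outcome, because reducing NN also reduces JJ through the causal mechanism, whereas no admissible value of JJ alone is sufficient.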

Predictive Power of PoS Features
All PoS features described in Table 2

Knowledge Extracted From Model Explanation
In this section, we continue to reveal the important PoS features that direct the model's decision. We analyzed the counterfactual examples from a statistical perspective and analyzed the important features derived from this analysis. We studied the CFEs for a control sample (ie, an individual without AD). The important features were derived by analyzing which feature plays a vital role in misclassifying a control sample as a patient with AD. In this paper, we report the results of 210 of the 243 controls. These 210 control samples were classified correctly by the classifier. The optimal CFE for all the 210 results could be achieved by intervening only one feature. Other samples were excluded because of misclassification.
We report both the IS and wIS for all PoS features in Tables 3 and 4. Further information on the features listed in Table 2 can be found in the Google Sites PoS tutorial [30] and the study by Toutanova et al [31].
We then analyzed the important features to answer the following question: how exactly does intervening a feature cause the outcome to flip? To answer the above question, we considered the children features of the intervened feature given by SCM. In Figure 6, we illustrate how changes in each feature contribute to flipping the outcome and show 4 representative features as examples. First, we consider features to be "cooperative" (Figures 6A-6D) if both the intervened feature and its descendant features contribute to flipping the outcome. Second, we define the feature as "dominant" (Figures 6E-6H) if the intervened feature significantly contributes to flipping the outcome, while its descendant features make either no or an opposing contribution. Third, we classify the intervened feature as "idling" (Figures 6I and 6J) if it only slightly contributes to flipping the outcome, while the child features make a substantial contribution. Finally, we introduce the term "inverse" (Figures 6K and 6L) to describe a feature that moves the original instances away from the decision boundary upon intervention but causes other features to substantially push the original instances toward the decision boundary.
To complete the explanation that we promised at the beginning of this section, we use CI to quantitatively describe the average minimal changes that must be made to flip the outcome. In Table 5, we report CI and the direction of change (an up arrow means that an increase in the value is required, whereas a down arrow means that a decrease is required). For example, reducing the use of NN by 16.88% of the total range of the NN feature will make the classifier flip its final decision. Now, we combine the results from Tables 4 and 5 to offer explanations for all important features. For clarity, in the following explanation, the words "increase," "decrease," and "change" do not denote the actions that modify the values of features; instead, these 3 words represent the pattern of how much the divergence of a feature from its real value can affect the decision of the model. We use "contribution" or "contribute" to denote the positive effort (measured by PIS) or process to flip the outcome. As opposed to "flip the outcome," we use the term "consolidate the outcome" to denote that changing a feature causes the outcome to move further away from the decision boundary.
• VBG: decreasing the value of VBG by 20.7% causes the values of both determiner (DT) and verb in third-person singular present form (VBZ) to decrease. The decrements of VBG, DT, and VBZ contribute to flipping the outcome.
• PDT: increasing PDT by 20.2% causes VBG, DT, and RP to decrease or remain unchanged. VBG and DT contribute substantially to flipping the outcome, whereas PDT makes only a partial contribution.
• NNP: increasing NNP by 29.5% will cause DT to decrease. Increasing NNP contributes substantially to flipping the outcome, whereas the resulting decrements in DT make a partial contribution.
• VB: increasing VB by at least 69.3% will cause RB and WP to change or remain unchanged and cause "to" (TO) to increase. The changes in VB, RB, and TO contribute substantially to flipping the outcome. The changes in WP make small contributions.
• JJ: increasing JJ by at least 16.5% will cause NNP and interjection (UH) to increase or remain unchanged and cause RB to change or remain unchanged. Even though the changes in NNP and RB consolidate the outcome, increasing JJ can substantially contribute to flipping the outcome. In addition, the change in UH makes a negligible contribution compared with the increment in JJ.
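The percentages above are expressed relative to each feature's total range, as in the NN example. A minimal sketch of that normalization follows, assuming that CI is the signed change between the original and counterfactual values divided by the feature's observed range and averaged over samples; the exact definition used in this study may differ, and the function names are hypothetical.

```python
from typing import Iterable, Tuple


def change_intensity(original: float, counterfactual: float,
                     feature_min: float, feature_max: float) -> float:
    """Signed change as a percentage of the feature's total range.

    A negative value corresponds to a down arrow (decrease required),
    a positive value to an up arrow (increase required).
    """
    span = feature_max - feature_min
    if span <= 0:
        raise ValueError("feature range must be positive")
    return 100.0 * (counterfactual - original) / span


def mean_change_intensity(pairs: Iterable[Tuple[float, float]],
                          feature_min: float, feature_max: float) -> float:
    """Average the signed change intensity over (original, counterfactual) pairs."""
    values = [change_intensity(o, c, feature_min, feature_max) for o, c in pairs]
    if not values:
        raise ValueError("at least one pair is required")
    return sum(values) / len(values)


# Toy numbers only: a feature ranging from 0 to 50 that drops from 40 to 30.
example_ci = change_intensity(40, 30, 0, 50)
example_mean = mean_change_intensity([(40, 30), (20, 15)], 0, 50)
```

Normalizing by the range makes the reported percentages comparable across features whose raw counts differ widely in scale.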

Principal Findings
First, the high performance of the AD diagnosis model on PoS features indicates that PoS features carry rich clues about the speech or language impairments that occur in patients with AD. Then, by explaining the model using our proposed OICE XAI method, we reveal several important linguistic biomarkers for early-stage AD detection. Some of the findings are consistent with previous findings in psychology and natural language processing.
• RB is highly relevant to semantic impairment: the study by Varley [32] claims that RB serves a deictic purpose, which is more common in people with aphasia who have a semantic impairment. Furthermore, in the study by Fraser et al [33], RB was shown to have a higher correlation with a diagnosis of AD.
Our one-intervention method shows that increasing the use of RB in the speech of a healthy control causes that speech to be classified as that of a patient rather than a control. Hence, our experiments align with previous findings that the increased use of RBs is an indicator of AD.
• Increased PRP use is an important sign of semantic dementia: the study by Almor et al [34] shows that patients with semantic dementia produced more PRPs than controls. This result is in line with our conclusion that increasing the number of PRPs in a control's speech causes it to be classified as a speech sample from a patient with dementia.
• NN naming deficits indicate cognitive deficits: patients with AD show graceful degradation in the use of living and nonliving NNs [35]. We see the same decline in NN use when shifting from a control sample to a dementia sample.
The consistency between the findings of this study and those of previous studies implies that the model possibly learns useful clues from PoS features. This consistency lends some support to the view that the remaining important features, which have not been studied previously, can offer new insights. To sum up, 3 of the 12 important features (25%; RB, PRP, and NN) found by our method are consistent with previous findings. We also found 9 other important features that have not been reported yet, namely IN, RP, VBG, PDT, NNP, JJ, VBD, VB, and WP. Our work also seems to suggest that the most important feature may be IN, that is, the use of prepositions. Further clinical studies may be necessary to verify this insight.

Limitations and Further Study
For the scope of work considered here, we do not see any limitations; however, we do believe that there is good scope for further study in this area. More modalities can be used in designing an AD predictor. These modalities could include brain imagery and other traditional biomarkers. The OICE method can then be applied to all the features used to detect AD, leading to a much more nuanced understanding of the causal relations of these biomarkers. This could then lead to clinical trials that test these findings. A subset of noninvasive biomarkers may then emerge as important in predicting AD, which might, in turn, lead to easier-to-implement screens for the disease.

Conclusions
In this study, we propose a novel CFE method called OICE to analyze the dominant linguistic features, specifically PoS features, that can be used for AD detection. We propose 3 metrics to evaluate the contributions of these features to the final decision of the model. We collected explanations from a high-accuracy AD detection model and analyzed them using the metrics we defined. Some of the features identified as important for AD detection by our method, namely RBs, pronouns, and NNs, are consistent with previous works in psychology and natural language processing. We also found other important features that have not yet been reported. Finally, by leveraging the SCM, we further explained how these important features affect the decision-making process.

Data Availability
The DementiaBank data set [25] used in this study is password protected and restricted to members of the DementiaBank consortium group. Access to this data set can be granted after joining the DementiaBank consortium group as a member. For details about accessing the data set, please refer to the study by Boller and Becker [25].