On Sample Based Explanation Methods for NLP: Faithfulness, Efficiency and Semantic Evaluation

Recent advances in natural language processing rely on state-of-the-art models and datasets of extensive scale, which challenges the application of sample-based explanation methods in aspects such as explanation interpretability, efficiency, and faithfulness. In this work, for the first time, we improve the interpretability of explanations by allowing arbitrary text sequences to serve as the explanation unit. On top of this, we implement a Hessian-free method with a model faithfulness guarantee. Finally, to compare our method with others, we propose a semantic-based evaluation metric that aligns better with humans' judgment of explanations than the widely adopted diagnostic or re-training measures. The empirical results on multiple real datasets demonstrate the proposed method's superior performance over popular explanation techniques such as Influence Functions or TracIn under semantic evaluation.


Introduction
As complex NLP models such as the Transformer family (Vaswani et al., 2017; Devlin et al., 2019) become an indispensable tool in many applications, there is growing interest in explaining the working mechanisms of these "black-box" models. Among the many existing techniques for explaining machine learning models, Influence Functions (Hampel, 1974; Koh and Liang, 2017), which use training instances as explanations of a model's behavior, have gained popularity in NLP very recently. Different from methods such as input erasure (Li et al., 2016), saliency maps, or attention matrices (Serrano and Smith, 2019; Jain and Wallace, 2019; Wiegreffe and Pinter, 2019) that only look at how a specific input or input sequence impacts the model decision, explaining with training instances can cast light on the knowledge a model has encoded about a problem, by answering questions like 'what knowledge did the model capture from which training instances so that it makes decisions in such a manner at test time?'. Very recently, the method has been applied to explain BERT-based (Devlin et al., 2019) text classification (Han et al., 2020; Meng et al., 2020b) and natural language inference (Han et al., 2020) models, as well as to aid text generation for data augmentation (Yang et al., 2020a) using GPT-2 (Radford et al., 2019). Although useful, Influence Functions may not be entirely bullet-proof for NLP applications.
First, following the original formulation (Koh and Liang, 2017), the majority of existing works use entire training instances as explanations. However, for the long natural language texts that are common in many high-impact application domains (e.g., healthcare, finance, or security), it may be difficult, if not impossible, to comprehend an entire instance as an explanation. For example, a model's decision may depend only on a specific part of a long training instance.
Second, for modern NLP models and large-scale datasets, the application of Influence Functions can lead to prohibitive computing costs due to the inverse-Hessian-matrix approximation. Although Hessian-free influence scores such as TracIn (Pruthi et al., 2020b) were introduced very recently, they may not be faithful to the model in question and can produce spurious explanations because sub-optimal checkpoints are involved.
Last, the evaluation of explanation methods, in particular training-instance-based ones, remains an open question. Previous evaluation is either based on an over-simplified assumption about the agreement of labels between training and test instances (Hanawa et al., 2020; Han et al., 2020) or based on indirect or manual inspection (Hooker et al., 2019; Meng et al., 2020b; Han et al., 2020; Pruthi et al., 2020a). A method that automatically measures semantic relations at scale and correlates highly with human judgment is still missing from the evaluation toolset.
To address the above problems, we propose a framework for explaining model behavior that includes both a set of new methods and a new metric measuring the semantic relation between a test instance and its explanations. The new methods allow arbitrary text spans as the explanation unit and are Hessian-free while remaining faithful to the final model. Our contributions are: 1. a new explanation framework that can use arbitrary explanation units and is Hessian-free and faithful at the same time; and 2. a new metric to measure the semantic relatedness between a test instance and its explanations for BERT-based deep models.

Preliminaries
Suppose a model parameterized by θ is trained on a classification dataset D = {D_train, D_test} by empirical risk minimization over D_train. Let z = (x, y) ∈ D_train and z' = (x', y') ∈ D_test denote a training and a test instance respectively, where x is a token sequence and y is a scalar label. The goal of training-instance-based explanation is to provide, for a given test instance z', an ordered list of training instances as explanation. Two notable methods to calculate the influence score are IF and TracIn. IF (Koh and Liang, 2017) assumes the influence of z can be measured by perturbing the loss function L with a fraction of the loss on z, and obtains

IF(z, z') = −∇_θ L(z'; θ)^T H^{-1} ∇_θ L(z; θ), (1)

where H is the Hessian matrix calculated on the entire training dataset, a potential computation bottleneck for a large dataset D and a complex model with high-dimensional θ.
TracIn (Pruthi et al., 2020b) instead assumes the influence of a training instance z is the sum of its contributions to the overall loss throughout the entire training history, which conveniently leads to

TracIn(z, z') = Σ_i η_i ∇_θ L(z; θ_i)^T ∇_θ L(z'; θ_i), (2)

where i iterates over the checkpoints saved at different training steps and η_i is a weight for each checkpoint. TracIn does not involve the Hessian matrix and is more efficient to compute. We can summarize the key differences between the two methods according to the following desiderata of an explanation method:

Interpretability Both methods use the entire training instance as an explanation. Explanations with a finer-grained unit, e.g., phrases, may be easier to interpret in many applications where the texts are lengthy.

Efficiency For each z', TracIn requires O(CG), where C is the number of models and G is the time spent on gradient calculation, whereas IF needs O(N^2 G), where N is the number of training instances and N >> C in general.

Faithfulness IF is faithful to θ since all of its calculation is based on the single final model, yet TracIn may be less faithful to θ since it obtains gradients from a set of checkpoints.
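To make Eq. 2 concrete, here is a minimal PyTorch sketch of the TracIn score; it assumes model(x) returns logits and that checkpoints with their per-checkpoint weights are already available. The helper names are illustrative, not the authors' released code.

```python
import torch

def flat_grad(model, loss_fn, x, y):
    """Flattened gradient of the loss on one example w.r.t. the trainable parameters."""
    params = [p for p in model.parameters() if p.requires_grad]
    loss = loss_fn(model(x), y)
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

def tracin_score(checkpoints, etas, loss_fn, z_train, z_test):
    """Eq. 2: sum over checkpoints of eta_i * grad L(z; theta_i) . grad L(z'; theta_i)."""
    (x, y), (x_p, y_p) = z_train, z_test
    score = 0.0
    for model, eta in zip(checkpoints, etas):
        score += eta * torch.dot(flat_grad(model, loss_fn, x, y),
                                 flat_grad(model, loss_fn, x_p, y_p)).item()
    return score
```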

Proposed Method
To improve on the above desiderata, a new method should be able to: 1) use any appropriate granularity of span(s) as the explanation unit; and 2) avoid the need for the Hessian while maintaining faithfulness. We discuss the solutions for both in Sections 3.1 and 3.2, combine them into one formulation in Section 3.3, and follow with critical implementation details.

Improved Interpretability with Spans
To achieve 1), we start with influence functions (Koh and Liang, 2017) and consider an arbitrary span of a training sequence x to be evaluated for its qualification as an explanation. Our core idea is to see how the model loss on a test instance z' changes with the training span's importance: the more important a training span is to z', the greater this influence score should be. We derive it in the following three steps. First, we define the training span from token i to token j as x_ij, the sequence with x_ij masked as x_{-ij} = [x_0, ..., x_{i-1}, [MASK], ..., [MASK], x_{j+1}, ...], and its corresponding training data as z_{-ij}. We use the logit difference (Li et al., 2020) as the importance score, based on the empirical-risk-estimated parameter θ obtained from D_train:

imp(x_ij | z, θ) = logit_y(x; θ) − logit_y(x_{-ij}; θ), (3)

where every term on the right-hand side (RHS) is the logit output for the model prediction y from model θ, right before applying the SoftMax function. This equation tells us how important a training span is. It is equivalent to the loss difference when the cross-entropy loss L(z; θ) = −Σ_{y_i} I(y = y_i) logit_{y_i}(x; θ) is applied.
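A minimal sketch of the span-importance score in Eq. 3, assuming a Huggingface-style sequence classifier and a tokenizer with a mask token; the model/tokenizer objects and the use of the predicted class for the logit are illustrative assumptions.

```python
import torch

@torch.no_grad()
def span_importance(model, tokenizer, text, i, j, device="cpu"):
    """imp(x_ij | z, theta): logit drop at the predicted class after masking tokens i..j."""
    enc = tokenizer(text, return_tensors="pt").to(device)
    logits = model(**enc).logits[0]                  # logits of the full sequence x
    y_hat = logits.argmax().item()                   # model prediction y

    masked = enc["input_ids"].clone()
    masked[0, i:j + 1] = tokenizer.mask_token_id     # x_{-ij}: span replaced by [MASK]
    logits_masked = model(input_ids=masked,
                          attention_mask=enc["attention_mask"]).logits[0]
    return (logits[y_hat] - logits_masked[y_hat]).item()
```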
Then, we measure x_ij's influence on the model θ by adding a fraction of imp(x_ij | z, θ), scaled by a small value ε, to the overall loss and obtain

θ_{ε, x_ij|z} := argmin_θ E_{z_i ∈ D_train}[L(z_i; θ)] + ε (L(z_{-ij}; θ) − L(z; θ)).

Applying the classical result in (Cook and Weisberg, 1982; Koh and Liang, 2017), the influence of up-weighting the importance of x_ij on θ is

dθ_{ε, x_ij|z}/dε |_{ε=0} = −H^{-1} (∇_θ L(z_{-ij}; θ) − ∇_θ L(z; θ)).

Finally, applying the above equation and the chain rule, we obtain the influence of x_ij on z' as

IF+(x_ij, z') = −∇_θ L(z'; θ)^T H^{-1} (∇_θ L(z_{-ij}; θ) − ∇_θ L(z; θ)).

IF+ measures the influence of a training span on an entire test sequence. Similarly, we can measure the influence of a training span on a test span x'_kl by applying Eq. 3 on the test side, obtaining

IF++(x_ij, x'_kl) = −(∇_θ L(z'_{-kl}; θ) − ∇_θ L(z'; θ))^T H^{-1} (∇_θ L(z_{-ij}; θ) − ∇_θ L(z; θ)).

The complete derivation can be found in the Appendix.
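The sketch below illustrates the resulting IF+ computation under the formulation above; flat_grad is a per-example gradient helper (as in the earlier TracIn sketch) and ihvp stands in for the stochastic inverse-Hessian-vector-product estimate described under Additional Details. The function names and calling conventions are assumptions for illustration.

```python
import torch

def if_plus(flat_grad, ihvp, model, loss_fn, z, z_masked, z_test):
    """IF+(x_ij, z') = -grad L(z'; theta)^T H^{-1} (grad L(z_-ij; theta) - grad L(z; theta)).

    ihvp(v) returns an estimate of H^{-1} v.  For IF++, replace the plain test gradient
    below with the analogous test-span gradient difference."""
    (x, y), (x_m, _), (x_t, y_t) = z, z_masked, z_test
    span_grad = flat_grad(model, loss_fn, x_m, y) - flat_grad(model, loss_fn, x, y)
    test_grad = flat_grad(model, loss_fn, x_t, y_t)
    return -torch.dot(test_grad, ihvp(span_grad)).item()
```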
On the choice of Spans Theoretically, IF+ and IF++ can be applied to any text classification problem and dataset with an appropriate choice of the span. If no information about valid spans is available, shallow parsing or sentence-splitting tools can be used to break an entire text sequence into chunks, and each chunk can be used as a span candidate. In this situation, the algorithm works in two steps: 1) use the masking method (Li et al., 2020) to determine the important test spans; and 2) for each test span, apply IF++ to find training instances/spans as explanations (see the sketch at the end of this subsection).
Usually, we can choose the top-K test spans, and in some cases we can even choose K = 1. In this work, we consider the latter case without loss of generality; we adopt two aspect-based sentiment analysis datasets in which a deterministic span can be conveniently identified in each text sequence, and we frame the span-selection task as a reading comprehension task (Rajpurkar et al., 2016). We discuss the details in Section 5. Note that the discussion can be trivially generalized to the case where K > 1 using a Bayesian treatment of imp(·), which can be explored in future work.
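A sketch of the two-step pipeline described above, with the span scorer and the influence scorer passed in as callables (e.g., the span-importance and IF++ sketches); all names are illustrative.

```python
def explain(test_text, span_candidates, train_data, score_span, score_influence, top_k=1):
    """Step 1: rank candidate test spans by masking importance.
    Step 2: for each selected test span, rank training instances/spans by influence."""
    ranked_spans = sorted(span_candidates,
                          key=lambda s: score_span(test_text, s[0], s[1]), reverse=True)
    explanations = {}
    for span in ranked_spans[:top_k]:
        scores = [(z, score_influence(z, span)) for z in train_data]
        explanations[span] = sorted(scores, key=lambda t: t[1], reverse=True)
    return explanations
```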

Faithful & Hessian-free Explanations
To achieve 2), we start with TracIn (Pruthi et al., 2020b), described in Eq. 2, which is Hessian-free by design. TracIn defines the contribution of a training instance as the sum of its contributions to the loss throughout the entire training life cycle, which eliminates the need for the Hessian. However, this assumption is drastically different from IF's, where the contribution of z is obtained solely from the final model θ. By nature, IF is a faithful method whose explanations are faithful to θ, while TracIn in its vanilla form is arguably not a faithful method.
Proposed treatment Based on the assumption that the influence of z on θ is the sum of the influences of all variants close to θ, we define a set of "faithful" variants {θ + δ_i} satisfying the constraint ‖δ_i‖ ≤ δ. The smaller δ is, the more faithful the explanation method is. In contrast, the δ for TracIn can be arbitrarily large without a faithfulness guarantee, as some checkpoints can be far from the final θ. Thus, we construct a δ-faithful explanation method that mirrors TracIn:

TracInF(z, z') = Σ_i η_i ∇_θ L(z; θ + δ_i)^T ∇_θ L(z'; θ + δ_i).

The difference between TracIn and TracInF is that the checkpoints used in TracIn are correlated in time, whereas all variants of TracInF are conditionally independent. Finding a proper δ_i can be tricky: if ill-chosen, δ_i may move θ so far that it hurts gradient estimation. In practice, we estimate δ_i = η_i g(z_i | θ) from a single-step gradient descent g(z_i | θ) on model θ with some training instance z_i, scaled by an i-specific weighting parameter η_i, which in the simplest case is uniform over i. Usually η_i should be small enough that θ + δ_i stays close to θ. In this paper we set η to the model learning rate as a proof of concept.
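A minimal sketch of constructing a δ-faithful variant by one-step gradient descent from the final model; the deep-copy strategy and learning rate are illustrative choices. TracInF then reuses the TracIn sum of Eq. 2 over such variants instead of saved checkpoints, so the earlier tracin_score sketch applies unchanged.

```python
import copy
import torch

def make_variant(model, loss_fn, batch, eta=1e-4):
    """theta + delta_i: one small gradient-descent step from the final model on a sampled
    mini-batch, so delta_i is proportional to eta * g(z_i | theta) and stays small."""
    variant = copy.deepcopy(model)
    x, y = batch
    loss = loss_fn(variant(x), y)
    loss.backward()
    with torch.no_grad():
        for p in variant.parameters():
            if p.grad is not None:
                p -= eta * p.grad
    variant.zero_grad()
    return variant
```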
Is TracInF faithful? First, any θ + δ_i is close to θ. Under the assumption of Lipschitz continuity, there exists a constant C such that |L(z'; θ + δ_i) − L(z'; θ)| ≤ C ‖δ_i‖ = C η_i ‖g(z_i | θ)‖. A proper η_i can be chosen so that the right-hand side (RHS) is sufficiently small to bound the loss within a small range. Thus the gradient of the loss, and in turn the TracInF score, stays δ-faithful to θ for a sufficiently small δ, which TracIn cannot guarantee.

The Combined Method
By combining the insights from Sections 3.1 and 3.2, we obtain a final form named TracIn++:

TracIn++(x_ij, x'_kl) = Σ_i η_i (∇_θ L(z_{-ij}; θ + δ_i) − ∇_θ L(z; θ + δ_i))^T (∇_θ L(z'_{-kl}; θ + δ_i) − ∇_θ L(z'; θ + δ_i)).

This ultimate form mirrors the IF++ method, and it satisfies all of our desiderata for an improved explainability method. Similarly, TracIn+, which mirrors IF+, replaces the test-side span gradient difference with the gradient of the full test loss.
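A sketch of the combined score under the form assumed above: the training-span and test-span gradient differences are paired and summed over the δ-faithful variants. flat_grad is the per-example gradient helper from the earlier sketch; the exact form is our reading of the text, not the authors' code.

```python
import torch

def tracin_pp(variants, etas, flat_grad, loss_fn, z, z_masked, z_test, z_test_masked):
    """Assumed TracIn++: dot-products of span-level gradient differences over delta-faithful variants."""
    (x, y), (x_m, _) = z, z_masked
    (xt, yt), (xt_m, _) = z_test, z_test_masked
    score = 0.0
    for model, eta in zip(variants, etas):
        g_train = flat_grad(model, loss_fn, x_m, y) - flat_grad(model, loss_fn, x, y)
        g_test = flat_grad(model, loss_fn, xt_m, yt) - flat_grad(model, loss_fn, xt, yt)
        score += eta * torch.dot(g_train, g_test).item()
    return score
```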

Additional Details
Since the RHS of the IF, IF+ and IF++ equations all involve the inverse of the Hessian matrix, here we discuss the computational challenge. Following (Koh and Liang, 2017), we adopt the vector-Hessian-inverse-product (VHP) with stochastic estimation (Baydin et al., 2016). The series of stochastic updates, one for each training instance, is performed by the vhp() function in the torch.autograd.functional package, and the updates stop at convergence. Unfortunately, we found that naively applying this approach leads to VHP explosion due to the large parameter size. To be specific, in our case the parameters are the last two layers of RoBERTa-large (Liu et al., 2019) plus the output head, a total of 12M parameters per gradient vector. To stabilize the process, we take three approaches: 1) applying gradient clipping (set to 100) to avoid accumulating extreme gradient values; 2) adopting early termination when the norm of the VHP stabilizes (usually < 1000 training instances, i.e., the depth); and 3) slowly decaying the accumulated VHP with a factor of 0.99 (i.e., the damp) and updating it with a new vhp() estimate using a small learning rate (i.e., the scale) of 0.004. Please refer to our code for more details. Once obtained, the VHP is first cached and then retrieved to perform the dot-product with the last term. The complexity for each test instance is O(dt), where d is the depth of estimation and t is the time spent on each vhp() operation. The time complexity of the different IF methods varies only by a constant factor of two.
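One plausible reading of this recipe as code is sketched below: a stochastic recursion in which hvp(batch, vec) stands in for a single torch.autograd.functional.vhp call on one training instance, with clipping, damping/scaling, and early termination as described. The constants mirror the values quoted above, but the exact update rule is an assumption.

```python
import torch

def inverse_hvp(v, train_batches, hvp, damp=0.99, scale=0.004, clip=100.0, tol=1e-3):
    """Stochastic estimate of H^{-1} v with the three stabilizers described in the text."""
    est = v.clone()
    prev_norm = est.norm()
    for batch in train_batches:                      # depth is capped by early termination
        hv = hvp(batch, est).clamp_(-clip, clip)     # 1) clip extreme values
        est = v + damp * est - scale * hv            # 3) decay old estimate, small-step update
        if (est.norm() - prev_norm).abs() < tol:     # 2) stop once the norm stabilizes
            break
        prev_norm = est.norm()
    return est
```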
For each of TracIn, TracIn+ and TracIn++, we need to create multiple model variants. For TracIn, we save three checkpoints from the most recent training epochs; for TracIn+ or TracIn++, we start from the same checkpoint, randomly sample a mini-batch 3 times, and perform one-step training (learning rate 1E-4) on each sample to obtain three variants. We do not over-tune these hyper-parameters for replicability reasons.

Evaluation Metrics
This section introduces our semantic evaluation method, followed by a description of two other popular metrics for comparison.

Semantic Agreement (Sag)
Intuitively, a rational explanation method should rank explanations that are semantically related to the given test instance higher than less relevant ones. Our idea is to first define the semantic representation of a training span x_ij of z and then measure its similarity to that of a test span x'_kl of z'. Since our method uses the BERT family as the base model, we obtain the embedding of a training span as the difference between the embeddings of x and its span-masked version x_{-ij}:

emb(x_ij) = emb(x) − emb(x_{-ij}), (4)

where emb is the embedding of the sentence-start token, such as "[CLS]" in BERT (Devlin et al., 2019), at the last embedding layer. To obtain the embedding of an entire sequence, we simply use emb(x), i.e., Eq. 4 without the last term.
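A minimal sketch of the span embedding of Eq. 4 for a Huggingface-style model, using the start-token vector of the last hidden layer; the model and tokenizer objects are illustrative assumptions.

```python
import torch

@torch.no_grad()
def span_embedding(model, tokenizer, text, i, j, device="cpu"):
    """emb(x_ij) = emb(x) - emb(x_-ij), using the start-token embedding at the last layer."""
    def cls_embedding(input_ids, attention_mask):
        out = model(input_ids=input_ids, attention_mask=attention_mask,
                    output_hidden_states=True)
        return out.hidden_states[-1][0, 0]           # "[CLS]"/"<s>" vector, last layer

    enc = tokenizer(text, return_tensors="pt").to(device)
    full = cls_embedding(enc["input_ids"], enc["attention_mask"])

    masked = enc["input_ids"].clone()
    masked[0, i:j + 1] = tokenizer.mask_token_id     # span-masked version x_-ij
    return full - cls_embedding(masked, enc["attention_mask"])
```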
Thus, all spans are embedded in the same semantic space, and geometric quantities such as the cosine or dot-product can measure the similarity of embeddings. We define the semantic agreement (Sag) over the top-K ranked training spans; intuitively, the metric measures the degree to which the top-K training spans align with a test span on semantics.
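Since the Sag equation itself is not reproduced here, the sketch below assumes Sag averages the cosine similarity between the top-K ranked training-span embeddings and the test-span embedding; this matches the description but is our reading rather than the exact formula.

```python
import torch.nn.functional as F

def sag_at_k(ranked_train_span_embs, test_span_emb, k=10):
    """Assumed form of Sag: mean cosine similarity between top-K training spans and the test span."""
    top = ranked_train_span_embs[:k]                 # already ordered by explanation score
    sims = [F.cosine_similarity(e, test_span_emb, dim=0).item() for e in top]
    return sum(sims) / len(sims)
```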

Other metrics
Label Agreement (Lag) Label agreement (Hanawa et al., 2020) assumes that the label of an explanation z should agree with that of the test case z'. Accordingly, we retrieve the top-K training instances from the ordered explanation list and calculate the label agreement (Lag) as

Lag(z') = (1/K) Σ_{k=1}^{K} I(y_(k) = y'),

where I(·) is an indicator function and y_(k) is the label of the k-th ranked explanation. Lag measures the degree to which the top-ranked z agree with z' on the class label, e.g., whether the sentiment of the test instance z' and the explanation z agree.
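A small sketch of Lag over an explanation list already ordered by score; the label-list representation is an illustrative assumption.

```python
def lag_at_k(ranked_train_labels, test_label, k=10):
    """Lag: fraction of the top-K retrieved training instances whose label matches y'."""
    top = ranked_train_labels[:k]
    return sum(int(y == test_label) for y in top) / len(top)
```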
Re-training Accuracy Loss (Ral) Ral measures the loss of test accuracy after removing the top-K most influential explanations identified by an explanation method (Hanawa et al., 2020; Hooker et al., 2019; Han et al., 2020). The assumption is that the higher the loss, the better the explanation method. Formally,

Ral = Acc(D_test; θ) − Acc(D_test; θ'),

where θ' is the model re-trained on D_train \ {z}_1^K. Note that the re-training uses the same hyper-parameter settings as the original training (Section 6.1). To obtain {z}_1^K, we combine the explanation lists for all test instances (by score addition) and then remove the top-K from this combined list.
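A sketch of the Ral procedure; train_fn and accuracy_fn are hypothetical stand-ins for the re-training and evaluation routines, and the combined score dictionary is assumed to be precomputed by summing explanation scores over all test instances.

```python
def ral(train_set, test_set, combined_scores, remove_frac, train_fn, accuracy_fn):
    """Ral: test-accuracy drop after dropping the globally top-ranked explanations and re-training."""
    k = int(remove_frac * len(train_set))
    drop = set(sorted(range(len(train_set)),
                      key=lambda i: combined_scores[i], reverse=True)[:k])
    reduced = [z for i, z in enumerate(train_set) if i not in drop]
    return accuracy_fn(train_fn(train_set), test_set) - accuracy_fn(train_fn(reduced), test_set)
```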

Data
Our criteria for dataset selection are two-fold: 1. the dataset should have relatively high classification accuracy so that the trained model behaves rationally; and 2. the dataset should allow for easy identification of critical/useful text spans so that span-based explanation methods can be compared. We chose two aspect-based sentiment analysis (ABSA) datasets: one is ATSA, a subset of MAMS (Jiang et al., 2019) for product reviews, where aspects are terms in the text; the other is Sentihood (Saeidi et al., 2016), a dataset of location reviews. In both datasets we can identify the relevant span of an aspect term semi-automatically and train models with high classification accuracy (see Section 6.1 for details). Data statistics and instances are in Tables 1 and 2. We annotate spans semi-automatically rather than fully by hand for two reasons: the explanation methods have a chance to rank wrongly annotated spans lower (their importance score imp() in Eq. 3 can be lower, and in turn so can their influence scores), and full manual annotation is labor-intensive.

Model Training Details
We train two separate models, one for MAMS and one for Sentihood. The model's input is the concatenation of the aspect term and the entire text, and the output is a sentiment label. The two models share similar settings: they both use ROBERTA-LARGE (Liu et al., 2019) from Huggingface (Wolf et al., 2019), fed into the BertForSequenceClassification class for initialization. We fine-tune the parameters of the last two layers and the output head, using a batch size of 200 for ATSA and 100 for Sentihood and a maximum of 100 epochs. We use the AdamW optimizer (Loshchilov and Hutter, 2019) with weight decay 0.01 and learning rate 1E-4. Both models are written in PyTorch and trained on a single Tesla V100 GPU; each model took less than 2 hours to train. The models are selected on dev-set performance, and both trained models are state-of-the-art at the time of writing: 88.3% on MAMS and 97.6% on Sentihood.
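A rough illustration of this setup (not the authors' released code): freeze everything except the last two encoder layers and the classification head of a RoBERTa-large classifier. The number of labels and the layer-name filters are assumptions made for the sketch.

```python
import torch
from transformers import RobertaForSequenceClassification, RobertaTokenizer

model = RobertaForSequenceClassification.from_pretrained("roberta-large", num_labels=3)
tokenizer = RobertaTokenizer.from_pretrained("roberta-large")

# Fine-tune only the last two encoder layers and the classification head.
for name, p in model.named_parameters():
    p.requires_grad = any(tag in name
                          for tag in ("encoder.layer.22", "encoder.layer.23", "classifier"))

optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad),
                              lr=1e-4, weight_decay=0.01)

# Input: concatenation of the aspect term and the full review text.
enc = tokenizer("service", "been here a few times and food has always been good ...",
                return_tensors="pt", truncation=True)
```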

Comparing Explanation Methods
We compare the six explanation methods on two datasets and three evaluation metrics in Table 3, from which we can draw the following conclusions: 1) The TracIn family outperforms the IF family according to the Sag and Lag metrics. We see that both metrics are robust against the choice of K. It is worth noting that the TracIn-family methods are not only more efficient than the IF family but also more effective at extracting explanations, as measured by Sag and Lag.
2) Span-based methods (with +) outperform vanilla methods (without +). This is good news because an explanation can be much easier to comprehend when essential spans in the text are highlighted, and IF++ and TracIn++ show that such highlighting is justified by their superiority under Sag and Lag.
3) Sag and Lag show a consistent trend of TracIn++ and IF++ being superior to the rest of the methods, while the Ral results are inconclusive. This resonates with the findings of Hooker et al. (2019), who also observed randomness after removing examples under different explanation methods, and suggests that the re-training method may not be a reliable metric due to the randomness and intricate details involved in the re-training process.
4) The fact that Sag and Lag assess TracIn+ differently shows that Lag may be an over-simplistic measure: it assumes that the label y can represent the entire semantics of x, which may be problematic. Sag, in contrast, looks into x for semantics and can better reflect and align with human judgments.
The Impact of K on Metrics One critical parameter for the evaluation metrics is the choice of K for Sag and Lag (we do not discuss K for Ral due to its randomness). Here we use 200 MAMS test instances as subjects to study the influence of K, as shown in Figure 1. We find that as K increases, all methods except IF and TracInF decrease on Sag and Lag. A decrease is favorable because it means the explanation method puts useful training instances before less useful ones; in contrast, an increase suggests the explanation method fails to rank useful ones on top. This again confirms that span-based explanations can take into account the useful information in x and reduce the impact of the noisy information involved in IF and TracInF.

Comparing Faithfulness
How faithful is our proposed TracIn++ to θ? To answer this question, we first define the notion of a strictly faithful explanation and then test an explanation method's faithfulness against it. Note that none of the discussed methods is strictly faithful, since IF++ uses an approximated inverse Hessian and TracIn++ is a δ away from being strictly faithful. To obtain a ground truth, we modify TracIn++ to use the single checkpoint θ as the "ultimately faithful" explanation method. Then, we obtain an explanation list for each test instance and compute its Spearman correlation with the list obtained from the ground truth. The higher the correlation, the more faithful the method.
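A tiny sketch of this faithfulness check, assuming each method produces a vector of influence scores over the training set for a given test instance; spearmanr comes from SciPy.

```python
from scipy.stats import spearmanr

def faithfulness_correlation(method_scores, ground_truth_scores):
    """Spearman correlation between a method's explanation scores and the single-checkpoint
    'strictly faithful' reference scores for the same test instance; higher = more faithful."""
    rho, _ = spearmanr(method_scores, ground_truth_scores)
    return rho
```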
In Table 4 we observe that TracIn++ has a similar mean to IF++ but a much lower variance, showing its stability over IF++. This aligns with the finding of Basu et al. (2021) that a model "ensemble" around θ may be a better choice than "checkpoint averaging" for model explanations. Further exploration may be needed, since there are many variables in this comparison.

The Discussion of Explanation Faithfulness in NLP The issue of faithfulness of explanations was primarily discussed in the explanation generation context (Camburu et al., 2018), where there is no guarantee that a generated explanation is faithful to a model's inner workings (Jacovi and Goldberg, 2020). In this work, we discuss faithfulness in the sample-based explanation framework, where faithfulness to the model either can be guaranteed only in theory but not in practice (Koh and Liang, 2017) or cannot be guaranteed at all (Pruthi et al., 2020b).

A Case Study

Table 5 demonstrates the differences among the explanation methods. In action, TracIn++ shows both the test span and the explanation span to a user; TracIn+ shows only the training span, and TracIn does not show spans. Interestingly, we can observe that the top-1 explanation found by TracIn++ is more semantically related to the test case than those of the other methods in this example, a common pattern among the test cases.
Sample-based explanation methods for NLP Han et al. (2020) applied IF to sentiment analysis and natural language inference and also studied its utility for detecting data artefacts (Gururangan et al., 2019). Yang et al. (2020b) used Influence Functions to filter generated texts. The work closest to ours is (Meng et al., 2020a), where a single word is used as the explanation unit. Their formulation uses gradient-based methods for single words, while ours can be applied to any granularity of text unit using text masking.

Explanation of NLP Models by Input Erasure
Input erasure has been a popular technique for measuring input impact in NLP models, either by replacing the input with a zero vector (Li et al., 2016) or by marginalizing over all possible candidate tokens (Kim et al., 2020), which arguably deals with the out-of-distribution issue introduced by using zero as the input mask. Similar to (Kim et al., 2020; Li et al., 2020; Jacovi and Goldberg, 2021), we also use the "[MASK]" token, with the difference that we allow masking a portion of the input sequence of arbitrary length.

Evaluations of Sample-based Methods
A benchmark for evaluating sample-based explanation methods has not been agreed upon. For diagnostic purposes, Koh and Liang (2017) proposed a self-explanation method that uses training instances to explain themselves; Hanawa et al. (2020) proposed label and instance consistency as a means of model sanity checking. In the non-diagnostic setting, sample removal and re-training (Han et al., 2020; Hooker et al., 2019) assumes that removing useful training instances causes a significant accuracy loss; the input enhancement method assumes that useful explanations can also improve a model's decision making on the input side (Hao, 2020); and manual inspections (Han et al., 2020; Meng et al., 2020a) have also been used to examine whether the meanings of explanations align with that of the test instance. In this paper, we automate this semantic examination using embedding similarities.

Future Work
TracIn++ opens some new questions: 1) How can we generalize TracIn++ to cases where test spans are unknown? 2) Can we understand the connection between IF and TracIn, which may spark new discoveries on sample-based explanation methods? 3) How can we apply TracIn++ to understand sequence generation models?
Table 1:
Data statistics. Note that we regard each training instance as aspect-specific, i.e., the concatenation of the aspect term and the text x is the model input.

Automatic Span Annotation As shown by the colored text in Table 2, we extract the spans for each aspect term to serve as explanation units for IF+, IF++, TracIn+ and TracIn++.

Table 2:
Dataset instances. In the text, each aspect has a supporting span, which we annotate semi-automatically. We choose a subset where test instances

Table 3:
Performance of different explanation methods (IF, IF+, IF++, TracInF, TracIn+, TracIn++) on 200 test cases from each dataset. For Sag and Lag we set K ∈ {10, 100}; for Ral we consider removing the top 20% or 50% of the ordered training-instance list. Computation time for the IF family is about 20 minutes per test instance with recursion depth 1000 (the minimal value to guarantee convergence) on a Tesla V100 GPU. The time for the TracIn family depends only on gradient calculation, which is trivial compared to the IF family.

Table 4:
Comparison of correlation with the ground truth. The experiment is run 5 times each; "Control" differs from TracIn++ only in the models used: "Control" uses three checkpoints from the latest epochs, whereas TracIn++ uses three δ-faithful model variants.

Table 5:
Showcasing top-1 explanations. Aspect terms are in blue, and the spans are in bold font. TracInF does not highlight either the training or the test span; TracIn+ highlights the training span; TracIn++ highlights both training and test spans. TracIn++ and IF++ can help users understand which span of z influenced which span of z', which TracInF and IF do not provide.

Test Case (+): been here a few times and food has always been good but service really suffers when it gets crowded.
TracIn++ (+): expected there to be more options for tapas the food was mediocre but the service was pretty good.
TracIn+ (+): decor is simple yet functional and although the staff are not the most attentive in the world, ...
TracInF (0): this place is the tourist fav of chinese food in the city, the service was fast, but the taste of the food is average, too much starch ...
IF++ (+): ... the host was rude to us as we walked in, we stayed because the decor is charming and we wanted french food.
IF+ (+): the scene a dark refurbished dining car hosts plenty of hipsters in carefully selected thrift-store clothing.
IF (+): an unpretentious sexy atmosphere lends itself to the above average wine-list and a menu that can stand-up to any other restaurant ...