Discover, Explain, Improve: An Automatic Slice Detection Benchmark for Natural Language Processing

Abstract Pretrained natural language processing (NLP) models achieve high overall performance, but they still make systematic errors. Instead of manual error analysis, research on slice detection models (SDMs), which automatically identify underperforming groups of datapoints, has attracted increasing attention in computer vision, both for understanding model behaviors and for providing insights for future model training and design. However, little research on SDMs, or quantitative evaluation of their effectiveness, has been conducted on NLP tasks. Our paper fills this gap by proposing a benchmark named "Discover, Explain, Improve (DEIM)" for classification NLP tasks, along with a new SDM, Edisa. Edisa discovers coherent and underperforming groups of datapoints; DEIM then unites them under human-understandable concepts and provides comprehensive evaluation tasks and corresponding quantitative metrics. The evaluation in DEIM shows that Edisa can accurately select error-prone datapoints with informative semantic features that summarize error patterns. Detecting difficult datapoints directly boosts model performance without tuning any original model parameters, showing that discovered slices are actionable for users.


Introduction
While deep learning models (Kenton and Toutanova, 2019; Liu et al., 2019; Clark et al., 2020, inter alia) achieve high overall performance on many tasks, they often display systematic errors (Kayser-Bril, 2020; Stuart-Ulin, 2018; Hamilton, 2018) correlated with biases, challenging datapoints, and data collection issues. Investigating these errors and their associated features is crucial for understanding models' strengths and weaknesses. Although manual error analysis is typically employed for identifying biases and erroneous behaviors, its efficiency and quality are limited. Consequently, automatic slice detection models (SDMs) aim to streamline the analysis process by identifying systematic errors in any trained machine learning model (Eyuboglu et al., 2022; Ribeiro et al., 2020, 2016; Wu et al., 2021), based on the observation that representations of error instances may share features and thus be similar to each other. (Code and benchmark are available at https://github.com/Wenyueh/DEIM.)
In SDM terminology, a slice refers to a set of datapoints sharing a specific attribute. An error slice is a slice characterized by low accuracy (Eyuboglu et al., 2022). Identifying these error slices serves three primary purposes: (1) locating error-prone datapoints to enable direct prediction adjustments, (2) gaining insights into model behavior to foster better comprehension and interpretation, and (3) guiding additional model training through strategies such as slice-specific modeling, data augmentation, and active learning. Therefore, an effective slice detection model should (1) accurately locate error-prone datapoints, (2) offer coherent error slices that help yield intelligible error-correlated features, and (3) enhance model performance when complemented with suitable tools.
In this study, we introduce a comprehensive benchmark that assesses SDMs with three modules, namely Discover, Explain, and Improve, each of which corresponds to one of the points discussed above. We also propose a new SDM, Edisa, to serve as a baseline on the benchmark. The usage of Edisa and the evaluation pipeline of the DEIM benchmark are depicted in Figure 1.

Figure 1: An SDM (Edisa) takes a trained model M and a labeled dataset as inputs. It is tuned on inputs, predictions, and labels. The tuned SDM generates multiple low-performing slicing functions such as ψ_1 and ψ_2. Applying the SDM to any unlabeled dataset groups the datapoints by slicing function. The Discover module collects error-prone datapoints; the Explain module assigns features to each error slice; and the Improve module enhances model performance, here by simply flipping the predictions of identified error-prone datapoints.

Here we briefly introduce the three modules. Discover: This module utilizes a tuned SDM to detect error-prone datapoints in any unlabeled dataset for a specific trained NLP model, denoted M. The evaluation of the discovery capability is straightforward: it verifies whether the located error-prone datapoints are indeed mispredicted by M. In the example in Figure 1, one sentence classified to ψ_1 and three sentences classified to ψ_2 are deemed error-prone. Explain: This module employs linguistic tools to articulate why a model fails on a given error slice, consolidating the reasons into human-comprehensible concepts. For each identified error slice, it discerns linguistic features that occur substantially more often within the slice. These features potentially elucidate why the model inaccurately predicts these datapoints. In Figure 1, the sentences in ψ_1 all contain a gerund (a verbal ending in -ing that functions as a noun) as the sentence subject, indicating that this is likely the reason these datapoints are mispredicted. To assess the cohesiveness of the discovered error slices, we evaluate measures such as the homogeneity and completeness of each slice with respect to its error-correlated features. Improve: This module showcases how model improvement is realized from discovered error slices using three techniques: selective prediction (Varshney et al., 2022b,c), flipping, and active learning. For instance, as shown in Figure 1, we invert the prediction for each identified error-prone datapoint. These three model improvement methods also serve as external evaluations of SDMs. To verify the usefulness of the discovered error slices, we examine whether the model's performance improves after implementing these techniques.
In the three-module benchmark DEIM, each module concentrates on one specific application of an SDM: (1) detection of error-prone datapoints, (2) interpretability of error slices, and (3) improvement of model performance. Each module provides evaluation tasks with the necessary tools and quantitative metrics. Experimental results on Edisa indicate that it can effectively identify error-prone datapoints in unlabeled datasets and precisely detect error-correlated features, which contribute directly to enhanced model performance.
The paper is organized as follows: Section 2 discusses recent work on slice detection models; Section 3 introduces the model structure of Edisa; Section 4 presents the details of the DEIM benchmark and all relevant tools; Section 5 presents experiment results and relevant ablation studies; Section 6 concludes the paper.

Related Work
Explainable model predictions are crucial in various research areas. Discovering error-correlated features in datapoints both increases model performance and delivers insights for future model design. In computer vision, learned input representations have been used to identify semantically meaningful slices where prediction errors are made (Eyuboglu et al., 2022; d'Eon et al., 2022; Yeh et al., 2020; Sohoni et al., 2020; Kim et al., 2019; Singla et al., 2021). Eyuboglu et al. (2022) recently proposed the state-of-the-art automatic error detection method DOMINO. In NLP, task-specific automatic error analysis has been conducted on tasks such as document-level information extraction (Das et al., 2022), coreference resolution (Kummerfeld and Klein, 2013), and machine translation (Popović and Ney, 2011). There is also extensive research evaluating whether models make errors on certain types of noised datapoints (Belinkov and Bisk, 2017; Rychalska et al., 2019) or adversarial datapoints (Ribeiro et al., 2018; Iyyer et al., 2018). Another line of work, including Swayamdipta et al. (2020), Xu et al. (2020), and Varshney et al. (2022a), focuses on evaluating the model-independent difficulty level of datapoints. Recently, Rajani et al. (2022) introduced an interactive visualization tool for underperforming slices using token-level features.
However, to the best of our knowledge, there has not been a comprehensive evaluation benchmark that covers all aspects of SDMs in NLP. Therefore, in this project, we contribute to the research area by designing a benchmark, DEIM, for all classification tasks: it provides (1) a task-independent comprehensive linguistic feature benchmark for potential explanations, (2) quantitative experiments for both error slice quality and the efficacy of error-prone datapoint detection in unlabeled datasets, and (3) corresponding metrics that facilitate future development. We also propose a new SDM, Edisa, which performs fairly well and serves as the SDM baseline for the DEIM benchmark in NLP. Its simple structure and promising results show good prospects for this line of research.

Edisa Model
Edisa is a new model that we propose for slice detection in NLP. This section describes the model structure, training objective, and inference procedure of Edisa. Subsequently, we compare this model with the current state-of-the-art SDM, DOMINO (Eyuboglu et al., 2022), to underscore why such a model structure is necessary.
In Edisa, we posit the existence of a set of k interpretable slices, each distinguished by one or more crucial features that differentiate the slice from other datapoints. Edisa specifically focuses on error-correlated features, that is, features co-occurring with incorrect predictions. Thus, for the same task and dataset, the set of features and the k slices may vary across different NLP models. The objective of an SDM is to identify these k slices for a trained model M in an unsupervised manner. Ideally, the discovery of these k slices requires a sufficiently large dataset where both input information and model prediction information are accessible. We mimic this setting by providing a labeled validation dataset, aiming to identify the k slices within it.
Formally, Edisa can be seen as a function g that takes a trained NLP model M and a labeled dataset D and generates k slicing functions {ψ_i}_{i=1}^{k}:

g(M, D) = {ψ_1, ψ_2, ..., ψ_k}

Edisa's Model Structure
Edisa is an error-distance-aware multivariate Gaussian mixture model that jointly models the datapoint representation, the error-distance, and the model prediction (e.g., confidence scores in classification tasks).
The observations of one datapoint from a model M comprise three components {Z, E, Y}, where Z is an embedding representation, Y is the vector of predicted probabilities or confidence scores from the model, and the error-distance E is the distance between Y and the one-hot encoding Ȳ of the gold label. For each datapoint, Z encodes the task-relevant semantic information; E encodes both label information and confidence information, representing whether the prediction is wrong, to what extent it deviates from the gold label, and how much change is still required to make a correct prediction; Y encodes the confidence score, which is added to the model to control the relative weights of label information and confidence information. We perform PCA on the representations to filter out redundant information before applying the SDM. Figure 2 illustrates the model structure.
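As a concrete sketch, the error-distance of a single datapoint can be computed from the gold label and the predicted probability vector; the Euclidean metric used here is an assumption, since the text does not specify which distance is used:

```python
import numpy as np

def error_distance(gold_label: int, probs: np.ndarray) -> float:
    """Distance between the predicted probability vector and the
    one-hot encoding of the gold label (Euclidean metric assumed)."""
    one_hot = np.zeros_like(probs)
    one_hot[gold_label] = 1.0
    return float(np.linalg.norm(one_hot - probs))
```

A confident correct prediction yields a small E, while a confident error yields a large E, so E indeed encodes both whether the prediction is wrong and how far it deviates from the gold label.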
The generative story of the Edisa model is as follows. To generate all the observations of one datapoint, a slice S_j is first drawn:

S_j ∼ P(S; θ)

For each datapoint d, the joint likelihood of slice S_j and the observations of d is a weighted product of the likelihoods from all distributions in the model, with weights γ, λ_E, λ_Y on the Gaussians:

P(S_j, d) = P(S_j; θ) · P(Z_d | S_j)^γ · P(E_d | S_j)^{λ_E} · P(Y_d | S_j)^{λ_Y}

Given the joint likelihoods, the conditional probability of slice assignment P(S_j | d) for a datapoint can be computed as:

P(S_j | d) = P(S_j, d) / Σ_{j'=1}^{k} P(S_{j'}, d)

Semantic information in the embedding, the error-distance, and the model predictions together determine the slice distribution. Thus datapoints that share similar semantic features, the same gold label, and similar model predictions are encouraged to be clustered into one slice. Given the joint likelihoods, each slicing function ψ_j is defined such that for all d ∈ D, ψ_j(d) = 1 if and only if:

j = argmax_{j'} P(S_{j'} | d)
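This slice-assignment step can be sketched as follows, using one-dimensional Gaussians per component for brevity; the parameter layout (per-slice prior plus (mean, covariance) pairs) is illustrative, not the paper's implementation:

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def slice_posteriors(z, e, y, params, gamma, lam_e, lam_y):
    """Compute P(S_j | d) for one datapoint d = (z, e, y).

    params: per-slice dicts with prior 'pi' and (mean, cov) pairs for
    'z', 'e', 'y'. The weights gamma, lam_e, lam_y exponentiate the
    three Gaussian likelihoods, matching the weighted joint product.
    """
    log_joint = []
    for p in params:
        lj = np.log(p["pi"])
        lj += gamma * mvn.logpdf(z, *p["z"])
        lj += lam_e * mvn.logpdf(e, *p["e"])
        lj += lam_y * mvn.logpdf(y, *p["y"])
        log_joint.append(lj)
    log_joint = np.array(log_joint)
    log_joint -= log_joint.max()          # numerical stability
    post = np.exp(log_joint)
    return post / post.sum()              # normalize over slices
```

The slicing function then assigns d to the slice with the largest posterior.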

Train
The model parameters are estimated with Expectation-Maximization by maximizing the sum of the log-likelihoods of all datapoints d ∈ D over the slices S_j, j ∈ {1, ..., k}:

θ* = argmax_θ Σ_{d ∈ D} log Σ_{j=1}^{k} P(S_j, d; θ)

The assignment likelihoods and the model parameters are estimated iteratively. Edisa is tuned using the embeddings, error-distances, and confidence scores from the validation dataset of a task after M has been trained on the training dataset.
A slice S_j is defined as an error slice, denoted S_j^e, if the accuracy of {d ∈ D | d ∈ S_j} is below some threshold δ ∈ ℝ. We call the slicing functions corresponding to error slices error slicing functions, denoting by ψ_j^e the function corresponding to S_j^e.
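The error-slice criterion amounts to a simple filter over per-slice accuracies; a sketch, where the slice assignments, correctness flags, and threshold δ = 0.5 are illustrative:

```python
from collections import defaultdict

def error_slices(assignments, correct, delta=0.5):
    """Return the indices of slices whose accuracy falls below delta.

    assignments: slice index of each datapoint
    correct:     1 if the model predicted the datapoint correctly, else 0
    """
    totals, hits = defaultdict(int), defaultdict(int)
    for s, c in zip(assignments, correct):
        totals[s] += 1
        hits[s] += c
    return sorted(s for s in totals if hits[s] / totals[s] < delta)
```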

Inference
For inference, we apply the tuned Edisa to test datasets T where the gold labels are unknown to the model. Since gold label information is not available, the error-distance needs to be marginalized over potential label values. Thus the joint likelihood of a test datapoint t ∈ T and slice S_j is computed as below, where E'_t ranges over all possible values of E:

P(S_j, t) = Σ_{E'_t} P(S_j; θ) · P(Z_t | S_j)^γ · P(E'_t | S_j)^{λ_E} · P(Y_t | S_j)^{λ_Y}

Then for each datapoint t, ψ_j(t) = 1 if and only if j = argmax_{j'} P(S_{j'} | t). An unlabeled datapoint t is determined to be error-prone if ψ_j^e(t) = 1 for some j ∈ {1, ..., k}.
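Since the prediction Y_t is known, the possible values of E'_t are finite: one per candidate gold label. Enumerating them can be sketched as follows (again assuming the Euclidean distance):

```python
import numpy as np

def candidate_error_distances(probs: np.ndarray) -> np.ndarray:
    """Possible error-distance values E' for an unlabeled datapoint:
    one per candidate gold label (Euclidean distance assumed)."""
    n = len(probs)
    return np.array([np.linalg.norm(np.eye(n)[c] - probs) for c in range(n)])
```

The marginal joint likelihood then sums the weighted joint over these candidate values.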

Comparison with DOMINO
While theoretically nuanced, the difference between Edisa and DOMINO has notable empirical effects.
In Edisa, all the distributions Z|S_j, E|S_j, Y|S_j are continuous and thus modeled by Gaussian distributions. This is enabled by converting the discrete gold label Ȳ into a continuous E, which still preserves the label information. In DOMINO, by contrast, only the distribution Z|S_j is modeled as Gaussian, while both Ȳ|S_j and Y|S_j are treated as categorical distributions, because Ȳ is a discrete variable and Y is treated in the same manner as Ȳ. Consequently, Edisa comprises an array of Gaussian distributions, whereas DOMINO combines Gaussian and categorical distributions. This subtlety results in different levels of empirical difficulty during hyperparameter search: a model consisting only of Gaussian distributions allows a much larger range of effective hyperparameters that achieve good performance across all three facets of the evaluation benchmark, especially in the Discover and Improve parts. Thus, empirically, Edisa is much easier to tune and obtains better performance. More detailed experimental results and comparative analysis are discussed in Section 5.

DEIM Benchmark
The DEIM benchmark evaluates the performance of a tuned SDM. This section elaborates on the three modules of the benchmark: (1) the process of error-prone datapoint detection (Discover), (2) the manner in which explanations are delivered (Explain), (3) the approach to model improvement (Improve), and the evaluation metrics for each module.

Discover: Error-prone Datapoints Detection
In the Discover module, the objective is to ascertain whether an SDM, after recognizing the error patterns present in the validation dataset, can accurately identify datapoints that are challenging for M. To this end, we deploy a tuned SDM on unlabeled datasets, anticipating it to correctly pinpoint error-prone datapoints. The details of this process are elaborated in the preceding Inference subsection. To evaluate its efficacy, we simply determine whether the selected datapoints are indeed mispredicted by M.

Explain: Slice Feature Detection
In the Explain module, the objective is to make errors more interpretable as well as actionable. Towards this end, we find features that significantly correlate with an error slice and offer them as explanations. Such features can be surface string features such as specific tokens, linguistic features such as part-of-speech, and pragmatic indicators. Note that the Explain module seeks to interpret errors, which necessitates knowing which datapoints are indeed mispredicted. As such, this process is conducted on the validation dataset on which Edisa is tuned. Table 1 displays some easily interpretable instances of systematic errors in the CoLA dataset, a dataset for the grammaticality judgment task. Sentences 1-3 are incorrectly predicted due to inappropriate preposition usage; the grammatically correct version would be "It is Kim on whom Sandy relies." Similarly, sentences 4-6 are mispredicted due to incorrect usage of superlatives; the correct version would be "That's the kindest answer that I ever heard." To elucidate possible explanations for systematic errors, we have constructed a feature benchmark consisting of 38 unique features, denoted F, each associated with a corresponding feature function, denoted f. This benchmark facilitates the intrinsic evaluation of the slices pinpointed by an SDM. Table 2 presents all features in the benchmark, grouped into three types: surface string features, syntactic features, and pragmatic features. Surface string features can be detected from surface strings alone, such as sentence length, word frequency in the corpus, and whether the sentence contains foreign words. Syntactic features require a dependency parser or a constituency parser to detect, such as negation, reflexives, and aspect. Pragmatic features include the age, gender, and nationality of people mentioned in the sentence, etc., detected by models trained on the corresponding task. Table 3 uses examples from the CoLA dataset to illustrate some syntactic features.
Each feature F corresponds to a feature function f: if F is binary, such as negation or echo question, then f is a characteristic function such that f(sentence) = 1 indicates that the sentence contains the feature; if F is non-binary, such as multiple-preposition or long-distance dependency, then f(sentence) = d ∈ ℝ indicates that the sentence exhibits the feature to degree d.
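The two kinds of feature functions can be sketched as follows; the token-level heuristics are hypothetical stand-ins for the benchmark's actual parser-based detectors:

```python
def has_negation(sentence: str) -> int:
    """Characteristic function for a binary feature (negation)."""
    negators = {"not", "no", "never", "nobody", "nothing"}
    return int(any(tok.lower().strip(".,!?") in negators
                   for tok in sentence.split()))

def preposition_degree(sentence: str) -> int:
    """Degree function for a non-binary feature (number of prepositions)."""
    preps = {"in", "on", "at", "by", "for", "with", "of", "to", "from"}
    return sum(tok.lower().strip(".,!?") in preps for tok in sentence.split())
```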
To evaluate whether an SDM is able to group datapoints sharing the same error-correlated features, we design two feature discovery tasks: Synthetic Feature Detection and Real Dataset Feature Detection.
The first task evaluates feature discovery capability using synthetic datasets, each containing one gold error-correlated feature. A synthetic dataset with feature F is generated by mixing a set of wrongly predicted datapoints exhibiting F, D_target = {d ∈ D | M(d) ≠ label(d) and f(d) = 1} (assuming f is a characteristic function here), with an equal number of randomly selected datapoints from the original dataset. We then fit an SDM on the synthetic dataset to see how many target datapoints in D_target are grouped into error slices, from which we compute recall, precision, and F1.
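The construction can be sketched as follows; the fixed seed and the equal-size mixing follow the description above, while everything else is illustrative:

```python
import random

def build_synthetic_dataset(dataset, mispredicted, f, seed=0):
    """Mix wrongly predicted datapoints exhibiting feature f with an
    equal number of randomly drawn datapoints from the full dataset.

    dataset:      list of datapoints
    mispredicted: set of datapoints the model got wrong
    f:            characteristic feature function (returns 0/1)
    """
    rng = random.Random(seed)
    target = [d for d in dataset if d in mispredicted and f(d) == 1]
    filler = rng.sample(dataset, len(target))
    return target + filler, target
```

Recall, precision, and F1 are then computed over how many of the returned target datapoints land in the SDM's error slices.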
The second task detects features in real datasets, which also characterizes how an SDM can be used for general model analysis. For each datapoint, we apply all feature functions to find the set of features it exhibits. Then, for each error slice, we use significance testing to analyze which features are distributed significantly differently between in-slice and out-of-slice data. For each feature's in-slice and out-of-slice distributions, if the p-value from a permutation test is smaller than 0.05 and the mean of the in-slice distribution is larger than that of the out-of-slice distribution (as the occurrence of these features usually complicates sentence structure), the feature is deemed strongly correlated with erroneous predictions.
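This test can be sketched with SciPy's two-sample permutation test on the difference of means; the one-sided alternative mirrors the requirement that the in-slice mean be larger, and the resample count is illustrative:

```python
import numpy as np
from scipy.stats import permutation_test

def feature_is_significant(in_slice, out_slice, alpha=0.05, n_resamples=2000):
    """A feature correlates with a slice's errors if the in-slice mean is
    significantly larger than the out-of-slice mean (one-sided test)."""
    res = permutation_test(
        (np.asarray(in_slice), np.asarray(out_slice)),
        lambda x, y: np.mean(x) - np.mean(y),
        permutation_type="independent",
        alternative="greater",
        n_resamples=n_resamples,
    )
    return bool(res.pvalue < alpha and np.mean(in_slice) > np.mean(out_slice))
```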
Both tasks aim at finding the error-correlated features.These interpretable features describe these datapoints and provide insights into the behaviors of current models.

Improve: Downstream Tasks
The final module of the benchmark assesses an SDM's capacity to enhance model performance. Three automated improvement methods are used in this module: selective prediction, flipping, and active learning. When these techniques are deployed with a tuned SDM, they can boost model performance if the SDM identifies a sufficient amount of informative error patterns and error-prone datapoints. Consequently, these methods serve a dual function: demonstrating the feasibility of automated improvement using an SDM and evaluating the SDM.

Selective Prediction
The selective prediction task aims to point out which datapoints are error-prone in a given unlabeled dataset T and to reject them from being evaluated. An SDM predicts a datapoint t to be error-prone if t ∈ E, where E = {t ∈ T | ψ_j^e(t) = 1 for some j ≤ k}, with ψ_j^e being an error slicing function. It orders the datapoints by the error probability P(e = 1 | t) of t, defined as:

P(e = 1 | t) = Σ_{S_j ∈ S_*^e} P(S_j | t)

where S_*^e is the set of all error slices. The datapoints are then withheld from evaluation one by one in this order, and we track the change in efficacy of the remaining datapoints; the more the efficacy increases, the better the task is fulfilled.
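The ordering step can be sketched as follows, assuming per-datapoint slice posteriors are available from the SDM (the data layout is illustrative):

```python
def rank_by_error_probability(datapoints, slice_posteriors, error_slice_ids):
    """Order datapoints from most to least error-prone, where the error
    probability of t sums the posteriors of all error slices:
        P(e=1 | t) = sum over error slices j of P(S_j | t)

    slice_posteriors: dict mapping t -> list of P(S_j | t) over all slices
    """
    def p_error(t):
        post = slice_posteriors[t]
        return sum(post[j] for j in error_slice_ids)
    return sorted(datapoints, key=p_error, reverse=True)
```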

Flipping
Flipping is a task to directly improve model performance by flipping the predictions of error-prone datapoints in an unlabeled dataset. If the dataset is binary, flipping changes the prediction from 1 to 0 or from 0 to 1; if the dataset is multi-class, we select a label to flip the predicted label to.
For multi-class datasets, for each error-prone datapoint t, we select the new label as follows: if the confidence score of t is below some threshold and ψ_j^e(t) = 1 for some j, we find the majority gold label l of S_j^e in the validation dataset and flip t's prediction to l; if the confidence of t is above the threshold, the predicted label remains the same. The confidence threshold is selected using 10% of the validation dataset used to tune the SDM.
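The rule can be sketched as a small helper; the threshold and the majority-label table are assumed to come from the validation dataset as described above:

```python
def flip_prediction(pred, confidence, error_slice, majority_gold, threshold):
    """Multi-class flipping rule: if an error-prone datapoint's confidence
    is below the (validation-tuned) threshold, flip its prediction to the
    majority gold label of its error slice; otherwise keep it.

    error_slice:   index of the error slice t belongs to, or None
    majority_gold: dict mapping error-slice index -> majority gold label
    """
    if error_slice is not None and confidence < threshold:
        return majority_gold[error_slice]
    return pred
```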
For the confidence baseline, the label is flipped to the next most confident label in the corresponding error slice.
In flipping, the predicted error-prone datapoints are likewise processed one by one, ordered by P(e = 1 | t) as in the selective prediction task.

Experiment Result
This section presents experiment results for all three modules on Edisa. It illustrates how the benchmark should be used and demonstrates that Edisa is able to cluster error datapoints with similar features and to detect error-prone datapoints accurately.
We apply DEIM to a variety of datasets from the GLUE benchmark (Wang et al., 2019) and the Kaggle Jigsaw dataset: CoLA, QNLI, QQP, SST-2, MNLI, SST-5, Jigsaw-gender, Jigsaw-racial, and Jigsaw-religion. Since GLUE test dataset labels are not publicly available, we split the original training dataset into training and validation sets and treat the original validation dataset as the test dataset. For each dataset, we train three models based on three widely used pretrained models, BERT-large, RoBERTa-large, and ELECTRA-large-discriminator, with the following hyperparameters: {batch size = 32, learning rate = 1e-4, warm-up proportion = 0.1, epochs = 10, gradient clip = 1.0, dropout rate = 0.1}. All models are trained on one A5000 GPU. To evaluate the performance of DEIM, we apply Edisa to each of the trained models and evaluate the results.

Discover: Experiment Result
In the Discover module, an SDM's performance with respect to any model M is evaluated by its efficacy in identifying error-prone datapoints, that is, by determining whether these points are indeed mispredicted by M. We compare the performance of Edisa with DOMINO, the current state-of-the-art slice detection model, as well as confidence thresholding and random sampling.
The hyperparameters for Edisa in the Discover module are {γ = 0.15, λ_E = 0.1, λ_Y = 1, PCA dimension = 128, number of slices = 128} for all datasets and all models (BERT, RoBERTa, and ELECTRA). For DOMINO, we manually tune its hyperparameters for the best performance. Both sets of hyperparameters are tuned only on held-out sets of CoLA using the BERT models.
Table 4 reports the test dataset results: (1) the number of error-prone datapoints found and (2) efficacy, defined as the proportion of predicted error-prone datapoints that are indeed mispredicted by M:

efficacy = |{t ∈ E_SDM | M(t) ≠ label(t)}| / |E_SDM|

where E_SDM is the set of error-prone datapoints predicted by the SDM. As shown in Table 4, the efficacies of Edisa are almost always much higher than those of the other baselines and higher than 50.00%, indicating that it is effective in discovering datapoints that will be mispredicted by M.
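Efficacy, so defined, is simply the precision of the flagged set with respect to the model's true mispredictions; a sketch:

```python
def efficacy(flagged, mispredicted):
    """Fraction of SDM-flagged datapoints that the model truly mispredicts."""
    flagged = set(flagged)
    return len(flagged & set(mispredicted)) / len(flagged)
```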
Among the three model types, BERT-large, RoBERTa-large, and ELECTRA-large-discriminator, Edisa performs much better on the former two. This may be because ELECTRA-large-discriminator already performs very well on all nine datasets, so Edisa does not witness enough mispredicted datapoints in the validation datasets at tuning time to generalize to test datapoints during inference.

Model Structure Ablation
We study the model structure based on efficacy on CoLA and QNLI in Table 5. We compare models with (1) only the Y edge (Edisa-Y), (2) both the E and Y edges (Edisa-E,Y), and (3) all three edges (Edisa).
First, we notice that Edisa-Y detects error-prone datapoints more accurately than the confidence baseline, indicating that selecting error slices within a certain range of confidence scores validated on the validation dataset is better than directly choosing the datapoints with the lowest confidence scores across the whole dataset, as efficacy is not always directly related to confidence score. Edisa-E,Y calibrates the confidence scores and performs more accurately still. Edisa, which additionally leverages representation information, selects error-prone datapoints most accurately, indicating that semantic information provides additional clues about the difficulty of datapoints for a given model, which contributes to error detection. (We do not compute the confidence and random sampling baselines based on the number of error-prone datapoints discovered by DOMINO because, on all GLUE datasets, DOMINO's efficacy is lower than the confidence baseline's.)

Table 5: Ablation study on model structure.

Validation Size Ablation
We investigate how the size of the validation dataset on which Edisa is tuned impacts the efficacy of the model. Ideally, the larger the validation set, the more error patterns it potentially covers, and the better the result. Figure 3 uses the CoLA and QNLI datasets as examples to present the correlation between validation dataset size and the model's efficacy. The x-axis is the ratio of the validation dataset size to the test dataset size, and the y-axis is the efficacy. As a reference, the test dataset for CoLA contains 1043 datapoints, while that of QNLI contains 5463 datapoints. From the figures, we can draw the following conclusions: (1) In general, the larger the validation dataset used to tune Edisa, the higher the model's efficacy.
(2) If the validation dataset is smaller than the test dataset, Edisa's performance drops considerably, especially for CoLA, which has a small test dataset. Based on this result, to ensure adequate coverage of error patterns, we recommend that the validation dataset be at least twice the size of the test dataset.

Hyperparameter Sensitivity
We explore different settings of Edisa and test the effects of the following hyperparameters: the weights (λ_E and γ, with λ_Y fixed) and the PCA dimension. We conduct experiments on the BERT-based models. From Table 6 we observe the following. (1) For λ_E, efficacy is high for values smaller than 0.5 but decreases for large values. With large λ_E, Edisa overfits the validation dataset because of a discrepancy between the tuning and testing schemes: when fitting an SDM on the validation dataset, the model leverages all the information in the input representations, error-distances, and model predictions, while at test time it does not have access to the ground-truth information. Thus, focusing on error-distance information when tuning on the validation set misleads the model and hurts performance on the test dataset. (2) γ impacts efficacy negatively at both small and large values. Large values are harmful because semantic representations do not have a straightforward relationship with prediction results for a given model M; focusing mainly on semantic feature information while neglecting label and prediction information encourages a flatter accuracy distribution over slices, making it more difficult to find high-quality error slices on the validation set that transfer to the test dataset. Small values may hurt performance because they render the input representation information effectively noise to Edisa. For all three γ values, PCA dimensions 64 and 256 work well. Embeddings without PCA dimension reduction perform much worse: on the CoLA dataset, Edisa discovers almost no error-prone datapoints; on QNLI, the model achieves nontrivial efficacy but is still worse than with the other PCA dimensions. Thus, in general, we recommend removing redundant information and noise via PCA dimension reduction.

PCA dimensions
Furthermore, we notice that the models using small dimensions (32) tend to work better under relatively large γ values than small γ values; the model using large dimensions (256) performs better with small γ values than with large γ values.Thus the PCA dimension should be chosen inversely to γ.

Explain: Experiment Result
Synthetic Dataset Feature Discovery and Real Dataset Feature Discovery evaluate how reliably an SDM can find feature explanations for errors. In these tasks, DEIM explains slices discovered in the validation dataset instead of the test dataset, because explaining why a model fails on some datapoints requires access to the gold labels. Therefore, a different set of hyperparameters is required: {γ = 0.15, λ_E = 1, λ_Y = 0.1}. Both experiments demonstrate that Edisa performs better than DOMINO. Table 8 presents the cross-feature average precision, recall, and F1 for each dataset. In general, Edisa performs better than DOMINO except on SST-2, where the average F1 of DOMINO is 0.02 higher than that of Edisa. Edisa achieves better recall in all cases and better precision in some cases. The last two rows of the table show the cross-dataset average precision, recall, and F1; Edisa performs better than DOMINO on all metrics, especially recall.
Hyperparameter Ablation: We study the effect of hyperparameters on the feature detection tasks using the CoLA dataset, which focuses on grammaticality. Results on the effects of γ and λ_Y with fixed λ_E are presented in Table 9. We notice that large γ improves precision but decreases recall, while large λ_Y has the reverse effect. γ = 1 and λ_Y = 1 both yield low recall, because the former fails to detect the feature Comparison and the latter fails to detect the feature NP_sub.

Real Dataset Feature Detection: In this task, we detect linguistic features on the GLUE datasets and pragmatic features on the Jigsaw datasets. We compare with DOMINO, with Edisa using only semantic embedding information (Edisa-Z), and with Edisa using only error-distance information (Edisa-E).
Each slice has one or more significant features, as each datapoint may exhibit one or more error-correlated features. For each slice S, if F is significant in S, S should be as homogeneous as possible with regard to F, as we do not want to put datapoints with different features in one slice; S should also be as complete as possible for F, as we want all error-prone datapoints exhibiting F to be clustered in one slice. In addition, we want to find as many error-correlated features as possible. Thus we propose four evaluation metrics: feature-prop, the proportion of features in the benchmark that are detected to be significant for some slice; average homogeneity (Homo), the average homogeneity for each F over slices featuring F; average completeness (Comp), the average completeness for each F over slices featuring F; and the average weighted (ave. weighted) V-score. We find that Edisa performs the best on all metrics. Edisa-Z also performs well on homogeneity but poorly on completeness, which may be because the model tends to cluster all sentences with similar semantic information together. Edisa-E performs the worst on all metrics.
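Homogeneity, completeness, and the V-score here are the standard entropy-based clustering metrics, applied with feature labels as the ground truth and slice assignments as the clustering; a self-contained sketch:

```python
from collections import Counter
from math import log

def _entropy(labels):
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def _cond_entropy(a, b):
    """Conditional entropy H(a | b)."""
    n = len(a)
    h = 0.0
    for bv, nb in Counter(b).items():
        sub = [av for av, bv2 in zip(a, b) if bv2 == bv]
        h += (nb / n) * _entropy(sub)
    return h

def homogeneity(features, slices):
    hf = _entropy(features)
    return 1.0 if hf == 0 else 1.0 - _cond_entropy(features, slices) / hf

def completeness(features, slices):
    hs = _entropy(slices)
    return 1.0 if hs == 0 else 1.0 - _cond_entropy(slices, features) / hs

def v_score(features, slices):
    h, c = homogeneity(features, slices), completeness(features, slices)
    return 0.0 if h + c == 0 else 2 * h * c / (h + c)
```

A slice mixing several features has low homogeneity; a feature scattered over many slices has low completeness; the V-score is their harmonic mean.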

Improve: Experiment Result
In this section, we use the selective prediction, flipping, and active learning tasks to evaluate SDM performance externally. For all three tasks, we compare Edisa with the confidence baseline, as it is the second most accurate baseline at finding error-prone datapoints, as shown in Table 4.

Selective Prediction Result
We evaluate selective prediction performance with two metrics: proportion and improvement. Proportion is the fraction of steps at which an SDM outperforms the baseline model in accuracy. When this metric equals 50.00%, the SDM performs better than the baseline only half of the time and is thus no better than the baseline; when it exceeds 50.00%, the SDM is more effective than the baseline most of the time. Improvement is the final efficacy improvement over the original efficacy. C-proportion and C-improvement are the same metrics computed with confidence as the baseline model.
For the confidence baseline, we reorder the datapoints by confidence score from low to high and reject the top |E_SDM| datapoints.
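The confidence baseline and the proportion metric described above can be sketched as follows. The function names and the step schedule here are illustrative assumptions, not the paper's exact protocol:

```python
def selective_prediction_curve(confidences, correct, reject_counts):
    """Efficacy (accuracy on the retained datapoints) after rejecting
    the n lowest-confidence datapoints, for each step n."""
    order = sorted(range(len(confidences)), key=lambda i: confidences[i])
    curve = []
    for n in reject_counts:
        kept = order[n:]  # drop the n least confident datapoints
        curve.append(sum(correct[i] for i in kept) / len(kept))
    return curve

def proportion_metric(sdm_curve, baseline_curve):
    """Percentage of steps at which the SDM beats the baseline."""
    wins = sum(s > b for s, b in zip(sdm_curve, baseline_curve))
    return 100.0 * wins / len(sdm_curve)
```

An SDM would supply its own rejection order in place of the confidence sort; the same curve and proportion computation then apply unchanged.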
Table 11 reports the results for Edisa across the three models on all datasets: the average C-proportion is 77.48 (higher than 50.00), the C-improvement is 1.36, and the improvement is 3.12, all demonstrating the advantage of Edisa.
Figure 4 visualizes the results on four diverse datasets with BERT models: CoLA, QNLI, SST-5, and Jigsaw-religion, where CoLA and QNLI come from the GLUE benchmark, SST-5 is a multi-class dataset, and Jigsaw-religion comes from the Jigsaw dataset. In each plot, the x-axis is the number of rejected datapoints and the y-axis is the efficacy of the remaining dataset. The plots show the stepwise change in efficacy for Edisa versus the confidence baseline: Edisa performs better at almost every step, showing that it consistently picks error-prone datapoints more accurately.

Flipping Result
The flipping task uses the same metrics as the selective prediction task. Note that SST-5 and MNLI are multi-class datasets: for SST-5, the validated confidence threshold to flip an error-prone datapoint is 0.35 for BERT, 0.37 for RoBERTa, and 0.5 for ELECTRA; for MNLI, it is 0.7 for BERT, 0.35 for RoBERTa, and 0.5 for ELECTRA. Based on Table 12, the average C-proportion is 79.83 (above 50.00), the average C-improvement is 2.63, and the average improvement is 1.84, showing that Edisa is able to improve the model directly.
Figure 5 contains four graphs of flipping on RoBERTa models: Edisa performs better at almost all steps and indeed improves model performance on the original test dataset. In contrast, the confidence and DOMINO baselines are not efficacious enough at selecting error-prone datapoints to improve model performance: their efficacy either holds almost constant or decreases.
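A minimal sketch of the flipping rule follows. For binary tasks a flip simply swaps the predicted class; for multi-class tasks we assume, hypothetically, that a flipped datapoint is reassigned to the runner-up class, since the paper specifies only the validated confidence thresholds:

```python
def flip_predictions(probs, flagged, threshold):
    """Replace the argmax prediction of each SDM-flagged datapoint whose
    top probability falls below `threshold` with the runner-up class."""
    preds = []
    for p, is_flagged in zip(probs, flagged):
        ranked = sorted(range(len(p)), key=lambda c: p[c], reverse=True)
        if is_flagged and p[ranked[0]] < threshold:
            preds.append(ranked[1])  # flip: second-most-likely class
        else:
            preds.append(ranked[0])  # keep the original prediction
    return preds
```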

Active Learning Result
In the active learning simulation, we adopt three other simulations as baselines: DOMINO, confidence learning, and random learning. Confidence learning selects a fixed number of low-confidence extra datapoints to train on per step; random learning randomly selects a fixed number of extra datapoints per step.
We demonstrate performance on this task with the QNLI BERT model in Figure 6. The x-axis is the number of datapoints used for training and the y-axis is the NLP model's accuracy. We use 1% of the original training dataset as the seed. For confidence learning and random learning, we select 500 more datapoints per step; for active learning with Edisa or DOMINO, the SDM decides how many extra datapoints to train on. All active learning processes are run 10 times with different random seeds, using up to 16k datapoints (about 30 learning steps), at which point active learning and confidence learning converge. The y-axis shows the average accuracy over the 10 runs. The figure shows that active learning with Edisa performs noticeably better; Edisa and confidence learning converge to similar accuracy after learning on 16k datapoints. A paired Student's t-test with p-value < 0.05 confirms that the Edisa process's accuracies at the steps from 3k to 10k datapoints are significantly higher than the baselines' accuracies.
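The significance check can be sketched with a hand-rolled paired t statistic using only the standard library. The one-sided critical value for df = 9 is read from a t table, and the one-sided setup is our assumption about the test configuration:

```python
import math

def paired_t_statistic(xs, ys):
    """t statistic for paired samples, e.g. per-run accuracies of Edisa
    vs. a baseline at one active-learning step."""
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

# One-sided 0.05 critical value for df = 9 (10 paired runs).
T_CRIT_DF9 = 1.833
```

A computed statistic above the critical value rejects the null hypothesis that the two processes reach the same accuracy at that step.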

Conclusion
In this paper, we take the first step toward a comprehensive slice detection framework, DEIM, for NLP, with principled evaluation tasks, linguistic tools, and metrics. It discovers error-prone datapoints, clusters the datapoints in an error slice under an interpretable concept, and directly improves model performance on unlabeled datasets. It shows that discovering error slices provides not only insights into model behavior but also actionable and automatic model improvement methods. Experiments show that Edisa is more efficacious than current baselines. We hope this benchmark can facilitate further research on SDMs.

Limitation and Future Work
This study presents an all-encompassing benchmark designed to evaluate slice detection models from three distinct perspectives. As a pioneering effort in benchmarking SDMs, it has certain limitations, which provide avenues for improvement in subsequent research: 1. The Edisa model currently works only for encoder-only models and is not directly applicable to encoder-decoder models such as T5 or the prevalent decoder-only models such as the GPT series. Future work should extend slice detection models such as Edisa to more model architectures.
2. The Edisa model currently focuses on classification datasets. Future work should consider extending it to tasks such as regression and text generation.
3. The Edisa model assigns each datapoint to one slice, and the DEIM benchmark assumes that a single feature can represent each slice. This simplification may not suffice for the intricacies of the large language models prevalent in today's NLP landscape. Future work should consider refining this approach in both the SDM and the evaluation by (1) attributing each datapoint to multiple slices, (2) denoting each slice with several features, or a combination of both.
We hope forthcoming research builds on the Edisa model and the benchmark, thereby deepening our understanding of model performance.

Figure 2 :
Figure 2: Model Structure

Active learning is an interactive learning algorithm that proactively selects examples to be labeled next from a pool of unlabeled data. Error-prone datapoints are also points with potential bias, and training on them should promote time and data efficiency. Thus, if an SDM can accurately select enough error-prone datapoints, simulating active learning at training time will help the model learn faster. The active learning simulation in DEIM is implemented as follows:
Step 1: divide the whole training dataset into a small training seed set and an extra training data pool from which more training datapoints can be selected.
Step 2: fit an SDM on the validation dataset and select error-prone datapoints from the extra training data pool without using label information, to replicate a real-time scenario.
Step 3: create a new training dataset combining the original training data and the selected datapoints, and remove the selected datapoints from the extra training data pool.
Step 4: retrain the model on the new training dataset.
Repeat Steps 2-4 until the model converges on the validation dataset.
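Under the assumption that the trainer and the SDM are exposed as plain callables (hypothetical signatures, not DEIM's actual API), the steps above can be sketched as:

```python
def active_learning_simulation(train_fn, sdm_select, seed, pool, max_rounds=30):
    """Sketch of the DEIM active-learning loop.

    train_fn(train_data) -> (model, val_accuracy)   # hypothetical trainer
    sdm_select(model, pool) -> error-prone points   # label-free selection
    """
    train = list(seed)                    # Step 1: small training seed set
    model, best = train_fn(train)
    for _ in range(max_rounds):
        picked = sdm_select(model, pool)  # Step 2: SDM picks from the pool
        if not picked:
            break
        train += picked                   # Step 3: grow the training set
        pool = [d for d in pool if d not in picked]
        model, acc = train_fn(train)      # Step 4: retrain on the new set
        if acc <= best:                   # stop once validation converges
            break
        best = acc
    return model, best
```

Letting the SDM decide how many datapoints to pick per round is what distinguishes this loop from the fixed-budget confidence and random baselines.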

Figure 3 :
Figure 3: Efficacy with different sizes of the validation dataset

7 For each feature F and each slice S featuring F, the Homo score is |S_F| / k, where |S_F| = |{d ∈ S | M(d) ≠ label(d) and f(d) = 1}| and k is the slice size; the Comp score is |S_F| / |{d ∈ D | M(d) ≠ label(d) and f(d) = 1}|; the V-measure is 2 · Homo · Comp / (Homo + Comp).

Figure 4 :
Figure 4: Graphs for the selective prediction task using the confidence baseline and the Edisa model: CoLA, QNLI, SST-5, Jigsaw-religion. The x-axis is the number of rejected datapoints; the y-axis is the model efficacy.

Figure 5 :
Figure 5: Graphs for the flipping task using the confidence baseline and the Edisa model: CoLA, QNLI, SST-5, Jigsaw-gender. The x-axis is the number of flipped datapoints; the y-axis is the model efficacy.

Table 2 :
Linguistic feature benchmark

Table 3 :
Syntactic feature examples

Table 4 :
Efficacy of predicted error-prone datapoints. The efficacy for the confidence baseline is computed on the |E_Edisa| lowest-confidence datapoints; the efficacy for the random baseline is computed on |E_Edisa| randomly sampled datapoints.

Table 6 :
Ablation study on the value of λ E and γ

Table 7 :
Ablation study on PCA dimensions. We test PCA dimension = 32, 64, 128, 256, and 1024 (without PCA dimension reduction) under different weights of the embedding; the results are presented in Table 7.

Table 8 :
Synthetic feature detection results. The synthetic features include tree_depth and long-distance, among others, for GLUE datasets and {female, male, Asian, Black, White, Latino, Atheist, Buddhist, Christian, Hindu, Jewish, Muslim} for Jigsaw datasets. We compare Edisa results with DOMINO results.


Table 9 :
Ablation study on synthetic feature detection

Table 10 :
Real datasets feature detection results

Table 11 :
Selective Prediction Result