Simple Framework for Interpretable Fine-Grained Text Classification

  • Conference paper
  • In: Artificial Intelligence. ECAI 2023 International Workshops (ECAI 2023)

Abstract

Fine-grained text classification with many similar labels is a challenge in practical applications. Interpreting predictions in this context is particularly difficult. To address this, we propose a simple framework that disentangles feature importance into more fine-grained links. We demonstrate our framework on the task of intent recognition, which is widely used in real-life applications where trustworthiness is important, for state-of-the-art Transformer language models using their attention mechanism. Our human and semi-automated evaluations show that our approach better explains fine-grained input-label relations than the popular feature importance estimation methods LIME and Integrated Gradients, and that our approach allows faithful interpretations through simple rules, especially when model confidence is high.

Notes

  1. Intent recognition remains important in high-responsibility applications despite generative LM-based conversational tools like ChatGPT, which suffer from issues such as hallucination, unpredictability, difficulty of control, and privacy concerns [2, 28, 39].

  2. A concurrent work has used cross-attention of text-to-image stable diffusion models to interpret which parts of images correspond to which words [35]. This fits our general framework, but our work differs in that we apply our framework to identify text-label relations for practical fine-grained classification tasks.

  3. https://github.com/alexa/dialoglue.

  4. BERT-fixed uses mean-pooling of the token encodings as the sentence embedding, unlike BERT-tuned, which instead uses the encoding of the special token “[CLS]”.

References

  1. Bastings, J., Filippova, K.: The elephant in the interpretability room: why use attention as explanation when we have saliency methods? In: Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pp. 149–155. Association for Computational Linguistics, November 2020. https://doi.org/10.18653/v1/2020.blackboxnlp-1.14

  2. Brown, T.B., et al.: Language models are few-shot learners. In: Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS 2020, Curran Associates Inc., Red Hook, NY, USA (2020). https://dl.acm.org/doi/abs/10.5555/3495724.3495883

  3. Casanueva, I., Temčinas, T., Gerz, D., Henderson, M., Vulić, I.: Efficient intent detection with dual sentence encoders. In: Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI, pp. 38–45. Association for Computational Linguistics, July 2020. https://doi.org/10.18653/V1/2020.NLP4CONVAI-1.5

  4. Clark, K., Khandelwal, U., Levy, O., Manning, C.D.: What does BERT look at? An analysis of BERT’s attention. In: Proceedings of the Second BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pp. 276–286. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/W19-4828

  5. Danilevsky, M., Qian, K., Aharonov, R., Katsis, Y., Kawas, B., Sen, P.: A survey of the state of explainable AI for natural language processing. In: Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pp. 447–459. Association for Computational Linguistics, Suzhou, China, December 2020. https://aclanthology.org/2020.aacl-main.46

  6. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Association for Computational Linguistics, Minneapolis, Minnesota, June 2019. https://doi.org/10.18653/v1/N19-1423

  7. He, P., Liu, X., Gao, J., Chen, W.: DeBERTa: decoding-enhanced BERT with disentangled attention. In: International Conference on Learning Representations (2021). https://openreview.net/forum?id=XPZIaotutsD

  8. Jacovi, A., Goldberg, Y.: Towards faithfully interpretable NLP systems: how should we define and evaluate faithfulness? In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4198–4205. Association for Computational Linguistics, Online, July 2020. https://doi.org/10.18653/v1/2020.acl-main.386

  9. Jacovi, A., Swayamdipta, S., Ravfogel, S., Elazar, Y., Choi, Y., Goldberg, Y.: Contrastive explanations for model interpretability. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 1597–1611. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, November 2021. https://doi.org/10.18653/v1/2021.emnlp-main.120

  10. Jain, S., Wallace, B.C.: Attention is not explanation. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 3543–3556. Association for Computational Linguistics, Minneapolis, Minnesota, June 2019. https://doi.org/10.18653/v1/N19-1357

  11. Kingma, D.P., Ba, J.L.: Adam: a method for stochastic optimization. In: International Conference on Learning Representations (2015). https://arxiv.org/abs/1412.6980v9

  12. Krippendorff, K.: Computing Krippendorff’s alpha-reliability (2011)

  13. Kumar, S., Talukdar, P.: NILE: natural language inference with faithful natural language explanations. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 8730–8742. Association for Computational Linguistics, Online, July 2020. https://doi.org/10.18653/v1/2020.acl-main.771

  14. Lamanov, D., Burnyshev, P., Artemova, K., Malykh, V., Bout, A., Piontkovskaya, I.: Template-based approach to zero-shot intent recognition. In: Proceedings of the 15th International Conference on Natural Language Generation, pp. 15–28. Association for Computational Linguistics, Waterville, Maine, USA and Virtual Meeting, July 2022. https://aclanthology.org/2022.inlg-main.2

  15. Larson, S., et al.: An evaluation dataset for intent classification and out-of-scope prediction. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 1311–1316. Association for Computational Linguistics, Hong Kong, China, November 2019. https://doi.org/10.18653/v1/D19-1131

  16. Li, Z., et al.: A unified understanding of deep NLP models for text classification. IEEE Trans. Visual Comput. Graphics 28(12), 4980–4994 (2022). https://doi.org/10.1109/TVCG.2022.3184186

  17. Liu, H., Yin, Q., Wang, W.Y.: Towards explainable NLP: a generative explanation framework for text classification. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5570–5581. Association for Computational Linguistics, Florence, Italy, July 2019. https://doi.org/10.18653/v1/P19-1560

  18. Liu, X., Eshghi, A., Swietojanski, P., Rieser, V.: Benchmarking natural language understanding services for building conversational agents. In: Marchi, E., Siniscalchi, S.M., Cumani, S., Salerno, V.M., Li, H. (eds.) Increasing Naturalness and Flexibility in Spoken Dialogue Interaction. LNEE, vol. 714, pp. 165–183. Springer, Singapore (2021). https://doi.org/10.1007/978-981-15-9323-9_15

  19. Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach. arXiv (2019). https://arxiv.org/abs/1907.11692

  20. Lundberg, S.M., Lee, S.I.: A unified approach to interpreting model predictions. In: Guyon, I., et al. (eds.) Advances in Neural Information Processing Systems, vol. 30. Curran Associates, Inc. (2017). https://dl.acm.org/doi/10.5555/3295222.3295230

  21. Marasovic, A., Beltagy, I., Downey, D., Peters, M.: Few-shot self-rationalization with natural language prompts. In: Findings of the Association for Computational Linguistics: NAACL 2022, pp. 410–424. Association for Computational Linguistics, Seattle, United States, July 2022. https://doi.org/10.18653/v1/2022.findings-naacl.31

  22. Mehri, S., Eskenazi, M.: DialoGLUE: a natural language understanding benchmark for task-oriented dialogue. arXiv (2020). https://arxiv.org/abs/2009.13570

  23. Nguyen, D.: Comparing automatic and human evaluation of local explanations for text classification. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1069–1078. Association for Computational Linguistics, New Orleans, Louisiana, June 2018. https://doi.org/10.18653/v1/N18-1097

  24. Nuruzzaman, M., Hussain, O.K.: A survey on chatbot implementation in customer service industry through deep neural networks. In: 2018 IEEE 15th International Conference on e-Business Engineering (ICEBE), pp. 54–61 (2018). https://doi.org/10.1109/ICEBE.2018.00019

  25. Rashkin, H., Smith, E.M., Li, M., Boureau, Y.L.: Towards empathetic open-domain conversation models: a new benchmark and dataset. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5370–5381. Association for Computational Linguistics, Florence, Italy, July 2019. https://doi.org/10.18653/v1/P19-1534

  26. Ribeiro, M.T., Singh, S., Guestrin, C.: “Why Should I Trust You?”: explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2016, pp. 1135–1144. Association for Computing Machinery, New York, NY, USA (2016). https://doi.org/10.1145/2939672.2939778

  27. Rogers, A., Kovaleva, O., Rumshisky, A.: A primer in BERTology: what we know about how BERT works. Trans. Assoc. Comput. Linguist. 8, 842–866 (2021). https://doi.org/10.1162/tacl_00349

  28. Roller, S., et al.: Recipes for building an open-domain chatbot. In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp. 300–325. EACL, April 2021. https://doi.org/10.18653/v1/2021.eacl-main.24

  29. Saha, S., Hase, P., Rajani, N., Bansal, M.: Are hard examples also harder to explain? A study with human and model-generated explanations. In: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 2121–2131. Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, December 2022. https://aclanthology.org/2022.emnlp-main.137

  30. Sahu, G., Rodriguez, P., Laradji, I., Atighehchian, P., Vazquez, D., Bahdanau, D.: Data augmentation for intent classification with off-the-shelf large language models. In: Proceedings of the 4th Workshop on NLP for Conversational AI, pp. 47–57. Association for Computational Linguistics, Dublin, Ireland, May 2022. https://doi.org/10.18653/v1/2022.nlp4convai-1.5

  31. Slack, D., Hilgard, A., Lakkaraju, H., Singh, S.: Counterfactual explanations can be manipulated. In: Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W. (eds.) Advances in Neural Information Processing Systems, vol. 34, pp. 62–75. Curran Associates, Inc. (2021). https://proceedings.neurips.cc/paper/2021/hash/009c434cab57de48a31f6b669e7ba266-Abstract.html

  32. Sundararajan, M., Taly, A., Yan, Q.: Axiomatic attribution for deep networks. In: Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML 2017, pp. 3319–3328 (2017). https://dl.acm.org/doi/10.5555/3305890.3306024

  33. Suresh, H., Lewis, K.M., Guttag, J., Satyanarayan, A.: Intuitively assessing ML model reliability through example-based explanations and editing model inputs. In: 27th International Conference on Intelligent User Interfaces, IUI 2022, pp. 767–781. Association for Computing Machinery, New York, NY, USA (2022). https://doi.org/10.1145/3490099.3511160

  34. Suresh, V., Ong, D.: Not all negatives are equal: label-aware contrastive loss for fine-grained text classification. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 4381–4394. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, November 2021. https://doi.org/10.18653/v1/2021.emnlp-main.359

  35. Tang, R., et al.: What the DAAM: interpreting stable diffusion using cross attention. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 5644–5659. Toronto, Canada, July 2023. https://aclanthology.org/2023.acl-long.310

  36. Tenney, I., Das, D., Pavlick, E.: BERT rediscovers the classical NLP pipeline. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4593–4601. Association for Computational Linguistics, Florence, Italy, July 2019. https://doi.org/10.18653/v1/P19-1452

  37. Theodoropoulos, P., Alexandris, C.: Fine-grained sentiment analysis of multi-domain online reviews. In: Kurosu, M. (ed.) Human-Computer Interaction. Technological Innovation, vol. 13303, pp. 264–278. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-05409-9_20

  38. Vaswani, A., et al.: Attention is all you need. In: Guyon, I., et al. (eds.) Advances in Neural Information Processing Systems, vol. 30. Curran Associates, Inc. (2017). https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf

  39. Weidinger, L., et al.: Taxonomy of risks posed by language models. In: Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, FAccT 2022, pp. 214–229. Association for Computing Machinery, New York, NY, USA (2022). https://doi.org/10.1145/3531146.3533088

  40. Wiegreffe, S., Hessel, J., Swayamdipta, S., Riedl, M., Choi, Y.: Reframing human-AI collaboration for generating free-text explanations. In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 632–658. Association for Computational Linguistics, Seattle, United States, July 2022. https://doi.org/10.18653/v1/2022.naacl-main.47

  41. Wiegreffe, S., Pinter, Y.: Attention is not not explanation. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 11–20. Association for Computational Linguistics, Hong Kong, China, November 2019. https://doi.org/10.18653/v1/D19-1002

  42. Wolf, T., et al.: Transformers: state-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45. Association for Computational Linguistics, Online, October 2020. https://doi.org/10.18653/v1/2020.emnlp-demos.6

  43. Ye, X., Durrett, G.: The unreliability of explanations in few-shot prompting for textual reasoning. In: NeurIPS (2022). https://proceedings.neurips.cc/paper_files/paper/2022/file/c402501846f9fe03e2cac015b3f0e6b1-Paper-Conference.pdf

  44. Yin, W., Hay, J., Roth, D.: Benchmarking zero-shot text classification: datasets, evaluation and entailment approach. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3914–3923. Association for Computational Linguistics, Hong Kong, China, November 2019. https://doi.org/10.18653/v1/D19-1404

  45. Zhang, X., Wang, H.: A joint model of intent determination and slot filling for spoken language understanding. In: Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, pp. 2993–2999. AAAI Press (2016). https://dl.acm.org/doi/10.5555/3060832.3061040

Acknowledgement

This work was supported by UK Research and Innovation [grant number EP/S023356/1], in the UKRI Centre for Doctoral Training in Safe and Trusted Artificial Intelligence (www.safeandtrustedai.org).

Author information

Corresponding author

Correspondence to Munkhtulga Battogtokh.

Appendices

A Model Fine-Tuning

For each of the three datasets (BANKING77, CLINC150, and HWU64), we train two models: a conventional single-sentence (1sent) model and a sentence-pair (2sent) model. The default initial model checkpoint (which we fine-tune) in the main body of this paper is BERT-base [6], distributed on Hugging Face's Transformers library [42] with an Apache 2.0 license. The datasets BANKING, CLINC, and HWU were accessed from Amazon Alexa AI's DialoGLUE benchmark repository [22] with CC-BY-4.0, CC-BY-SA 3.0, and CC-BY-SA 3.0 licenses respectively (see footnote 3).

For both formulations, we fine-tuned BERT-base on all examples in the training splits and evaluated on the test sets. The training script was run on an NVIDIA Quadro T1000 4 GB GPU, which allowed a batch size of 8 (for both 1sent and 2sent) for the BERT-base model with nearly 110 million parameters. The 1sent models took approximately 20 min to train on average across the three datasets, while the 2sent models took approximately one hour on average. For each dataset, we optimized using the Adam optimizer [11] with a learning rate of \(3 \times 10^{-5}\) for 3 epochs, with the rest of the training hyperparameters set to their defaults (in Transformers library version 4.8.2).
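
For concreteness, the following is a minimal sketch of a 1sent fine-tuning run under the hyperparameters stated above (BERT-base, batch size 8, learning rate \(3 \times 10^{-5}\), 3 epochs). It is not the authors' training script; the CSV paths and the "text"/"label" column names are hypothetical placeholders, and the label column is assumed to already contain integer class ids.

```python
# Minimal sketch (not the authors' script) of the 1sent fine-tuning setup:
# BERT-base, batch size 8, learning rate 3e-5, 3 epochs.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
data = load_dataset("csv", data_files={"train": "banking77_train.csv",   # hypothetical paths
                                       "test": "banking77_test.csv"})
num_labels = len(set(data["train"]["label"]))

# Tokenize the utterances; padding is handled dynamically by the Trainer's collator.
data = data.map(lambda batch: tokenizer(batch["text"], truncation=True),
                batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=num_labels)

args = TrainingArguments(
    output_dir="bert-1sent",
    per_device_train_batch_size=8,   # fits the 4 GB GPU mentioned above
    learning_rate=3e-5,
    num_train_epochs=3,
)

Trainer(model=model, args=args, train_dataset=data["train"],
        eval_dataset=data["test"], tokenizer=tokenizer).train()
```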

Given an example x from the test or training set, the 1sent model \(m_1\) computes a vector of prediction scores \([y_i] \in \mathbb{R}^{|C|}\), where |C| is the number of intents in the set of all candidate intents C, and outputs the index i with the highest prediction score. The 2sent model \(m_2\), on the other hand, outputs a match score given x and the natural language name \(c_i\) of an intent concatenated together with the special separator token “[SEP]” (\(m_2\) makes a binary prediction; we treat the prediction score for the positive class as this match score). During evaluation, we generate |C| inputs per x, pairing x with each intent \(c_j\), and identify the intent with the highest match score as the 2sent prediction.
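
A minimal sketch of this 2sent inference loop is shown below, assuming a fine-tuned binary sequence-pair classifier; the checkpoint path is a hypothetical placeholder, not a released artifact.

```python
# Sketch of 2sent inference: pair the utterance x with every candidate intent
# name, score each pair, and predict the intent with the highest match score
# (the positive-class probability).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "path/to/bert-2sent", num_labels=2)   # hypothetical fine-tuned checkpoint
model.eval()

def predict_2sent(x: str, intent_names: list[str]) -> str:
    # Encoding (x, c_i) as a sentence pair inserts "[SEP]" between the two parts.
    enc = tokenizer([x] * len(intent_names), intent_names,
                    padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits              # shape: (|C|, 2)
    match_scores = logits.softmax(dim=-1)[:, 1]   # positive-class score per pair
    return intent_names[int(match_scores.argmax())]
```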

During training, however, it is not necessary to pair an example x with all intents. Instead, given an example x annotated with an intent \(c_{true}\), we generate \(2 \times N + 1\) inputs, where \(N < |C|\). N of these are x paired with N random intents other than \(c_{true}\) (negative labels). Another N are N random example texts from the training split whose intent is not \(c_{true}\) (negative examples), each paired with \(c_{true}\). The remaining input is the positive pair of x and \(c_{true}\). We set N to 5 to limit training time given our resource constraints.
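
The pair-generation step can be sketched as follows; the field names ("text", "intent") and the list-of-dicts training-set format are illustrative assumptions.

```python
# Sketch of building the 2N + 1 training pairs per example (N = 5):
# 1 positive pair, N negative-label pairs, and N negative-example pairs.
import random

def make_pairs(example, train_set, intents, n=5):
    x, c_true = example["text"], example["intent"]
    pairs = [(x, c_true, 1)]                                    # positive pair
    neg_labels = random.sample([c for c in intents if c != c_true], n)
    pairs += [(x, c, 0) for c in neg_labels]                    # negative labels
    neg_texts = random.sample(
        [e["text"] for e in train_set if e["intent"] != c_true], n)
    pairs += [(t, c_true, 0) for t in neg_texts]                # negative examples
    return pairs
```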

Table 5 reports the accuracy achieved by our main sentence-pair model (BERT-2sent) along with benchmark performances (BERT-fixed and BERT-tuned) from [3] and the accuracy of a single-sentence BERT model (BERT-1sent) that we trained for reference. BERT-2sent is on par with the benchmark model BERT-fixed (see footnote 4) and approaches BERT-1sent in accuracy. Table 4 gives an overview of the datasets.

Table 4. Overview of the datasets
Table 5. Accuracy on the three datasets

B Crowdsourcing Details

Our study design followed those of previous work [29, 40]. We presented crowdworkers (on Prolific) with two alternative types of explanations and asked them to select their preference. We screened participants based on their English language proficiency and location (also as a proxy for language proficiency, following [40]) because our evaluation datasets were in English, and based on their education level (at least undergraduate) because our study involved non-trivial language understanding and familiarity with visualizations. We recruited participants from the United Kingdom, the United States, Canada, Australia, and New Zealand with a minimum approval rate of 98% and a minimum of 100 previous submissions on Prolific. All submissions were anonymous, and all participants were presented with an information sheet detailing our study and were asked for their consent at the beginning of the study.

Quality Control. Before our main head-to-head comparisons, we selected participants based on two preliminary rounds of questions. In the first preliminary round, we presented crowdworkers with 12 multiple-choice questions based on example texts from fine-grained classification datasets (intent recognition and sentiment analysis) to test their ability to understand fine-grained differences between English texts. We only selected participants who answered at least 11 of the questions correctly. In the second preliminary round, we presented participants with 12 explanations, half of them feature importance and the other half feature relation (ours), and asked them to rate how well each explanation explained the correspondence between a given text and its label. We acknowledge that subjectivity plays a significant role here: participants can disagree with us on non-trivial cases or be more or less generous than we are (while still producing similar relative ratings of the different explanations). We therefore passed participants through this round after manually checking their submissions, filtering mainly for low-quality responses (the same choice for all questions, too little time spent per question, obvious random clicking on trivial cases, etc.).

Fig. 9. User interface of our head-to-head comparison study (example)

Payment. Our two preliminary rounds paid £1 (£20/h) and £1.50 (£15/h) for 3 and 6 min respectively. Our main survey was expected to take 8–13 min, and participants were paid £3 (at least £12/h).

User Interface. Figure 9 shows the user interface of our study, which was implemented on Qualtrics.

C Further Evaluation

In this section, we report the results from our initial (less computationally expensive) evaluation setting, which we refer to as the lite evaluation setting. The results from this setting cover both the full test splits and the error sets (the subsets of misclassified examples) of the three datasets BANKING77, HWU64, and CLINC150. The key difference from the main evaluation setting is that the lite setting asks the contrastive question “Why \(c_1\) rather than \(c_2\)?” only once per example, whereas the main setting asks it for all possible values of \(c_2\). Moreover, the lite setting evaluates only a subset of the rules.

Within the lite setting, for each example in the error set, we aim to explain why the model made an error, i.e., “Why \(c_1\) rather than \(c_2\)?”, where \(c_1\) is the predicted intent and \(c_2\) is the correct (ground-truth) intent. On the full sets, our aim is the same for misclassified examples; for correctly classified examples, it is to explain “why not \(c_2\)?”, i.e., “Why \(c_1\) rather than \(c_2\)?”, where \(c_1\) is the prediction and \(c_2\) is an intent similar to \(c_1\). For this work, we pick \(c_2\) as the intent, among all intents other than \(c_1\), whose name has the highest vector similarity to the name of \(c_1\) (using the spaCy library).
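
The contrast-intent selection can be sketched as below; the specific spaCy model and the underscore-to-space normalization of intent names are assumptions rather than details from the paper.

```python
# Sketch of picking the contrast intent c_2: among all intents other than the
# predicted intent c_1, take the one whose name vector is most similar to c_1's.
import spacy

nlp = spacy.load("en_core_web_md")  # assumed: any spaCy model with word vectors

def pick_contrast_intent(c1: str, all_intents: list[str]) -> str:
    c1_doc = nlp(c1.replace("_", " "))  # assumed normalization of intent names
    candidates = [c for c in all_intents if c != c1]
    return max(candidates,
               key=lambda c: c1_doc.similarity(nlp(c.replace("_", " "))))
```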

C.1 Faithfulness

Tables 6 and 7 show the lite reproduction accuracy of our rules on the error and full sets respectively. We used the parameter \(K = 1\) in the concentration rule since reproduction rate has a strong negative and monotonic correlation with K (see Appendix C.4).

Table 6. Reproduction accuracy (error set)
Table 7. Reproduction accuracy (full set)

All rules consistently achieved higher reproduction accuracy than random guessing. Magnitude consistently outperformed concentration, averaging 62.8% on the error sets and 82.5% on the full sets. Combining magnitude and concentration (mag. + con.) achieved the highest values, with an average of 89.1% on the error sets and 97.8% on the full sets. The highest reproduction accuracy scores were 98.7% and 98.6%, on the full sets of CLINC150 and BANKING77 respectively. These high values suggest that our explanations faithfully explain model predictions.

Table 8. Spearman correlations

C.2 Correlation between Confidence and Reproduction Accuracy

Figure 10 shows that the reproduction accuracy of our best rule (mag. + con.) increased to perfect on the error sets and near perfect on the full sets as we filtered out evaluation examples with an increasing confidence threshold. This was the case despite the decreasing number of examples, i.e., a larger drop in accuracy with each failure (see Fig. 11).

As in our main results, there was a strong tendency toward a positive, monotonic correlation between reproduction accuracy and confidence threshold (see Table 8, in which most correlation values are close to +1). The error set of BANKING77 bucked this tendency with a weak correlation (value near 0), but Fig. 10a shows that reproduction accuracy increased even in this case, though non-monotonically.
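
As a concrete illustration of the statistic behind Table 8, the following computes a Spearman correlation between a confidence threshold and the reproduction accuracy measured at that threshold; the numeric values are made-up placeholders, not results from the paper.

```python
# Illustration of the Spearman check behind Table 8 (placeholder numbers only):
# correlate the confidence threshold with the reproduction accuracy at that threshold.
from scipy.stats import spearmanr

thresholds = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
reproduction_acc = [0.85, 0.86, 0.88, 0.90, 0.91, 0.93, 0.95, 0.97, 0.99, 1.00]

rho, p_value = spearmanr(thresholds, reproduction_acc)
print(f"Spearman rho = {rho:+.2f} (p = {p_value:.3g})")  # rho near +1 => strong monotonic increase
```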

C.3 Importance of Special Tokens

We also hypothesized that certain reproductions may have failed because our cross-sentence relations did not always fully capture model reasoning. Importantly, we ignored the special tokens “[CLS]” and “[SEP]”, which are known for aggregating the input and receiving high levels of attention [4], to focus only on human-readable tokens (see Fig. 4). There are three special tokens in each concatenated input that we feed into our model: “[CLS]” as the first token, and two “[SEP]” tokens (one for delimiting the input pairs and one at the end of the input). We experimented with treating the “[CLS]” and the first “[SEP]” as features of text x, while treating the last “[SEP]” as a feature of intent \(c_i\). This led to higher reproduction rates on all datasets (see Table 9).
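
Index bookkeeping for this variant can be sketched as follows; only the assignment of the special tokens to the two segments is shown, since the subsequent aggregation of attention follows the paper's rules and is not reproduced here.

```python
# Sketch of the special-token variant described above: treat "[CLS]" and the
# first "[SEP]" as features of the text x, and the final "[SEP]" as a feature
# of the intent name c_i.
def assign_token_segments(tokens: list[str]) -> dict[str, list[int]]:
    first_sep = tokens.index("[SEP]")
    last_sep = len(tokens) - 1 - tokens[::-1].index("[SEP]")
    text_idx = list(range(0, first_sep + 1))                 # "[CLS]" ... first "[SEP]"
    intent_idx = list(range(first_sep + 1, last_sep + 1))    # intent tokens ... last "[SEP]"
    return {"text": text_idx, "intent": intent_idx}

# Example input: ["[CLS]", "how", "do", "i", "track", "my", "card", "[SEP]",
#                 "card", "arrival", "[SEP]"]
```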

Fig. 10. Graphs showing the correlation between confidence threshold and reproduction accuracy

Fig. 11. Graphs showing the correlation between confidence threshold and number of examples

C.4 Correlation Between Concentration Parameter K and Reproduction Accuracy

Figure 12 shows the relation between the parameter K (see Sect. 4.3) in our concentration rule and the reproduction accuracy (under the lite evaluation setting; see Appendix C.1) of the rule.

As Table 10 shows, there is a strong negative and monotonic correlation between the parameter K and reproduction accuracy.

Fig. 12. Correlation between concentration parameter K and reproduction accuracy

Table 9. Reproduction accuracy (full set) when special tokens are treated as features of text x and intent \(c_i\) (with confidence threshold) under lite evaluation setting (see Appendix C.1)
Table 10. Spearman correlations between K and reproduction accuracy. There is a strong negative correlation between K and reproduction accuracy across all datasets.

D Experiments with DeBERTa

Model Training. Model training followed the same process for both DeBERTa and BERT except for a few details (see Appendix A). We used microsoft/deberta-base [7], with approximately 140 million parameters, distributed on Hugging Face's Transformers library [42] with an MIT license. We used learning rates of \(1 \times 10^{-2}\) or \(2 \times 10^{-2}\), selected by greedy search (due to resource constraints), and a batch size of 1 or 2 with our DeBERTa models. Training took approximately one hour for the 1sent models across the three datasets, and 4–6 h for the 2sent models. In contrast, our main evaluation loop takes less than 5 min on the full set for both 1sent and 2sent models. The DeBERTa model trained with the single-sentence formulation (DeBERTa-1sent) achieved the highest accuracy among the models we trained. However, DeBERTa trained with the sentence-pair formulation (DeBERTa-2sent) achieved scores lower than BERT-2sent, possibly due to non-optimal hyperparameter choices (especially the learning rate). Nevertheless, we proceeded with this DeBERTa-2sent model since its performance is adequate for our primary purpose of evaluating interpretability (Table 11).

Table 11. Accuracy of our fine-tuned models including DeBERTa trained with single-sentence (DeBERTa-1sent) and sentence-pair (DeBERTa-2sent) approaches

Faithfulness Results. The faithfulness results with DeBERTa-2sent were similar to those with BERT-2sent (under the lite evaluation setting; see Appendix C): the mag. + con. rule achieved high reproduction accuracy on both the error and full sets (see Tables 12 and 13). An interesting difference, however, was that magnitude was better than concentration at reproducing model predictions with BERT, whereas the opposite held with DeBERTa.

Table 12. Reproduction accuracy (error set) with DeBERTa
Table 13. Reproduction accuracy (full set) with DeBERTa

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Battogtokh, M., Luck, M., Davidescu, C., Borgo, R. (2024). Simple Framework for Interpretable Fine-Grained Text Classification. In: Nowaczyk, S., et al. Artificial Intelligence. ECAI 2023 International Workshops. ECAI 2023. Communications in Computer and Information Science, vol 1947. Springer, Cham. https://doi.org/10.1007/978-3-031-50396-2_23

  • DOI: https://doi.org/10.1007/978-3-031-50396-2_23

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-50395-5

  • Online ISBN: 978-3-031-50396-2

  • eBook Packages: Computer Science, Computer Science (R0)
