Simple Framework for Interpretable Fine-Grained Text Classification

  • Conference paper
  • In: Artificial Intelligence. ECAI 2023 International Workshops (ECAI 2023)

Abstract

Fine-grained text classification with many similar labels is a challenge in practical applications. Interpreting predictions in this context is particularly difficult. To address this, we propose a simple framework that disentangles feature importance into more fine-grained links. We demonstrate our framework on the task of intent recognition, which is widely used in real-life applications where trustworthiness is important, for state-of-the-art Transformer language models using their attention mechanism. Our human and semi-automated evaluations show that our approach better explains fine-grained input-label relations than the popular feature importance estimation methods LIME and Integrated Gradients, and that our approach allows faithful interpretations through simple rules, especially when model confidence is high.

Notes

  1. Intent recognition remains important in high-responsibility applications despite generative LM-based conversational tools like ChatGPT, which suffer from issues such as hallucination, unpredictability, difficulty of control, and privacy concerns [2, 28, 39].

  2. A concurrent work has used cross-attention of text-to-image stable diffusion models to interpret which parts of images correspond to which words [35]. This fits our general framework, but our work differs in that we apply our framework to identify text-label relations for practical fine-grained classification tasks.

  3. https://github.com/alexa/dialoglue.

  4. BERT-fixed uses mean-pooling of the token encodings as the sentence embedding, unlike BERT-tuned, which instead uses the encoding of the special token “[CLS]”.

References

  1. Bastings, J., Filippova, K.: The elephant in the interpretability room: why use attention as explanation when we have saliency methods? In: Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pp. 149–155. Association for Computational Linguistics, November 2020. https://doi.org/10.18653/v1/2020.blackboxnlp-1.14

  2. Brown, T.B., et al.: Language models are few-shot learners. In: Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS 2020, Curran Associates Inc., Red Hook, NY, USA (2020). https://dl.acm.org/doi/abs/10.5555/3495724.3495883

  3. Casanueva, I., Temčinas, T., Gerz, D., Henderson, M., Vulić, I.: Efficient intent detection with dual sentence encoders. In: Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI, pp. 38–45. Association for Computational Linguistics, July 2020. https://doi.org/10.18653/V1/2020.NLP4CONVAI-1.5

  4. Clark, K., Khandelwal, U., Levy, O., Manning, C.D.: What does BERT look at? An analysis of BERT’s attention. In: Proceedings of the Second BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pp. 276–286. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/W19-4828

  5. Danilevsky, M., Qian, K., Aharonov, R., Katsis, Y., Kawas, B., Sen, P.: A survey of the state of explainable AI for natural language processing. In: Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pp. 447–459. Association for Computational Linguistics, Suzhou, China, December 2020. https://aclanthology.org/2020.aacl-main.46

  6. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Association for Computational Linguistics, Minneapolis, Minnesota, June 2019. https://doi.org/10.18653/v1/N19-1423

  7. He, P., Liu, X., Gao, J., Chen, W.: DeBERTa: decoding-enhanced BERT with disentangled attention. In: International Conference on Learning Representations (2021). https://openreview.net/forum?id=XPZIaotutsD

  8. Jacovi, A., Goldberg, Y.: Towards faithfully interpretable NLP systems: how should we define and evaluate faithfulness? In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4198–4205. Association for Computational Linguistics, Online, July 2020. https://doi.org/10.18653/v1/2020.acl-main.386

  9. Jacovi, A., Swayamdipta, S., Ravfogel, S., Elazar, Y., Choi, Y., Goldberg, Y.: Contrastive explanations for model interpretability. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 1597–1611. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, November 2021. https://doi.org/10.18653/v1/2021.emnlp-main.120

  10. Jain, S., Wallace, B.C.: Attention is not explanation. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 3543–3556. Association for Computational Linguistics, Minneapolis, Minnesota, June 2019. https://doi.org/10.18653/v1/N19-1357

  11. Kingma, D.P., Ba, J.L.: Adam: a method for stochastic optimization. In: International Conference on Learning Representations (2015). https://arxiv.org/abs/1412.6980v9

  12. Krippendorff, K.: Computing Krippendorff’s alpha-reliability (2011)

  13. Kumar, S., Talukdar, P.: NILE: natural language inference with faithful natural language explanations. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 8730–8742. Association for Computational Linguistics, Online, July 2020. https://doi.org/10.18653/v1/2020.acl-main.771

  14. Lamanov, D., Burnyshev, P., Artemova, K., Malykh, V., Bout, A., Piontkovskaya, I.: Template-based approach to zero-shot intent recognition. In: Proceedings of the 15th International Conference on Natural Language Generation, pp. 15–28. Association for Computational Linguistics, Waterville, Maine, USA and Virtual Meeting, July 2022. https://aclanthology.org/2022.inlg-main.2

  15. Larson, S., et al.: An evaluation dataset for intent classification and out-of-scope prediction. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 1311–1316. Association for Computational Linguistics, Hong Kong, China, November 2019. https://doi.org/10.18653/v1/D19-1131

  16. Li, Z., et al.: A unified understanding of deep NLP models for text classification. IEEE Trans. Visual Comput. Graphics 28(12), 4980–4994 (2022). https://doi.org/10.1109/TVCG.2022.3184186

  17. Liu, H., Yin, Q., Wang, W.Y.: Towards explainable NLP: a generative explanation framework for text classification. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5570–5581. Association for Computational Linguistics, Florence, Italy, July 2019. https://doi.org/10.18653/v1/P19-1560

  18. Liu, X., Eshghi, A., Swietojanski, P., Rieser, V.: Benchmarking natural language understanding services for building conversational agents. In: Marchi, E., Siniscalchi, S.M., Cumani, S., Salerno, V.M., Li, H. (eds.) Increasing Naturalness and Flexibility in Spoken Dialogue Interaction. LNEE, vol. 714, pp. 165–183. Springer, Singapore (2021). https://doi.org/10.1007/978-981-15-9323-9_15

  19. Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach. arXiv (2019). https://arxiv.org/abs/1907.11692

  20. Lundberg, S.M., Lee, S.I.: A unified approach to interpreting model predictions. In: Guyon, I., et al. (eds.) Advances in Neural Information Processing Systems, vol. 30. Curran Associates, Inc. (2017). https://dl.acm.org/doi/10.5555/3295222.3295230

  21. Marasovic, A., Beltagy, I., Downey, D., Peters, M.: Few-shot self-rationalization with natural language prompts. In: Findings of the Association for Computational Linguistics: NAACL 2022, pp. 410–424. Association for Computational Linguistics, Seattle, United States, July 2022. https://doi.org/10.18653/v1/2022.findings-naacl.31

  22. Mehri, S., Eskenazi, M.: DialoGLUE: a natural language understanding benchmark for task-oriented dialogue. arXiv (2020). https://arxiv.org/abs/2009.13570

  23. Nguyen, D.: Comparing automatic and human evaluation of local explanations for text classification. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1069–1078. Association for Computational Linguistics, New Orleans, Louisiana, June 2018. https://doi.org/10.18653/v1/N18-1097

  24. Nuruzzaman, M., Hussain, O.K.: A survey on chatbot implementation in customer service industry through deep neural networks. In: 2018 IEEE 15th International Conference on e-Business Engineering (ICEBE), pp. 54–61 (2018). https://doi.org/10.1109/ICEBE.2018.00019

  25. Rashkin, H., Smith, E.M., Li, M., Boureau, Y.L.: Towards empathetic open-domain conversation models: a new benchmark and dataset. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5370–5381. Association for Computational Linguistics, Florence, Italy, July 2019. https://doi.org/10.18653/v1/P19-1534

  26. Ribeiro, M.T., Singh, S., Guestrin, C.: “Why Should I Trust You?”: explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2016, pp. 1135–1144. Association for Computing Machinery, New York, NY, USA (2016). https://doi.org/10.1145/2939672.2939778

  27. Rogers, A., Kovaleva, O., Rumshisky, A.: A primer in BERTology: what we know about how BERT works. Trans. Assoc. Comput. Linguist. 8, 842–866 (2021). https://doi.org/10.1162/tacl_00349

  28. Roller, S., et al.: Recipes for building an open-domain chatbot. In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp. 300–325. EACL, April 2021. https://doi.org/10.18653/v1/2021.eacl-main.24

  29. Saha, S., Hase, P., Rajani, N., Bansal, M.: Are hard examples also harder to explain? A study with human and model-generated explanations. In: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 2121–2131. Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, December 2022. https://aclanthology.org/2022.emnlp-main.137

  30. Sahu, G., Rodriguez, P., Laradji, I., Atighehchian, P., Vazquez, D., Bahdanau, D.: Data augmentation for intent classification with off-the-shelf large language models. In: Proceedings of the 4th Workshop on NLP for Conversational AI, pp. 47–57. Association for Computational Linguistics, Dublin, Ireland, May 2022. https://doi.org/10.18653/v1/2022.nlp4convai-1.5

  31. Slack, D., Hilgard, A., Lakkaraju, H., Singh, S.: Counterfactual explanations can be manipulated. In: Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W. (eds.) Advances in Neural Information Processing Systems, vol. 34, pp. 62–75. Curran Associates, Inc. (2021). https://proceedings.neurips.cc/paper/2021/hash/009c434cab57de48a31f6b669e7ba266-Abstract.html

  32. Sundararajan, M., Taly, A., Yan, Q.: Axiomatic attribution for deep networks. In: Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML 2017, pp. 3319–3328 (2017). https://dl.acm.org/doi/10.5555/3305890.3306024

  33. Suresh, H., Lewis, K.M., Guttag, J., Satyanarayan, A.: Intuitively assessing ML model reliability through example-based explanations and editing model inputs. In: 27th International Conference on Intelligent User Interfaces, IUI 2022, pp. 767–781. Association for Computing Machinery, New York, NY, USA (2022). https://doi.org/10.1145/3490099.3511160

  34. Suresh, V., Ong, D.: Not all negatives are equal: label-aware contrastive loss for fine-grained text classification. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 4381–4394. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, November 2021. https://doi.org/10.18653/v1/2021.emnlp-main.359

  35. Tang, R., et al.: What the DAAM: interpreting stable diffusion using cross attention. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 5644–5659. Toronto, Canada, July 2023. https://aclanthology.org/2023.acl-long.310

  36. Tenney, I., Das, D., Pavlick, E.: BERT rediscovers the classical NLP pipeline. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4593–4601. Association for Computational Linguistics, Florence, Italy, July 2019. https://doi.org/10.18653/v1/P19-1452

  37. Theodoropoulos, P., Alexandris, C.: Fine-grained sentiment analysis of multi-domain online reviews. In: Kurosu, M. (ed.) Human-Computer Interaction. Technological Innovation, vol. 13303, pp. 264–278. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-05409-9_20

  38. Vaswani, A., et al.: Attention is all you need. In: Guyon, I., et al. (eds.) Advances in Neural Information Processing Systems, vol. 30. Curran Associates, Inc. (2017). https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf

  39. Weidinger, L., et al.: Taxonomy of risks posed by language models. In: Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, FAccT 2022, pp. 214–229. Association for Computing Machinery, New York, NY, USA (2022). https://doi.org/10.1145/3531146.3533088

  40. Wiegreffe, S., Hessel, J., Swayamdipta, S., Riedl, M., Choi, Y.: Reframing human-AI collaboration for generating free-text explanations. In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 632–658. Association for Computational Linguistics, Seattle, United States, July 2022. https://doi.org/10.18653/v1/2022.naacl-main.47

  41. Wiegreffe, S., Pinter, Y.: Attention is not not explanation. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 11–20. Association for Computational Linguistics, Hong Kong, China, November 2019. https://doi.org/10.18653/v1/D19-1002

  42. Wolf, T., et al.: Transformers: state-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45. Association for Computational Linguistics, Online, October 2020. https://doi.org/10.18653/v1/2020.emnlp-demos.6

  43. Ye, X., Durrett, G.: The unreliability of explanations in few-shot prompting for textual reasoning. In: NeurIPS (2022). https://proceedings.neurips.cc/paper_files/paper/2022/file/c402501846f9fe03e2cac015b3f0e6b1-Paper-Conference.pdf

  44. Yin, W., Hay, J., Roth, D.: Benchmarking zero-shot text classification: datasets, evaluation and entailment approach. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3914–3923. Association for Computational Linguistics, Hong Kong, China, November 2019. https://doi.org/10.18653/v1/D19-1404

  45. Zhang, X., Wang, H.: A joint model of intent determination and slot filling for spoken language understanding. In: Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, pp. 2993–2999. AAAI Press (2016). https://dl.acm.org/doi/10.5555/3060832.3061040

Acknowledgement

This work was supported by UK Research and Innovation [grant number EP/S023356/1], in the UKRI Centre for Doctoral Training in Safe and Trusted Artificial Intelligence (www.safeandtrustedai.org).

Author information

Corresponding author

Correspondence to Munkhtulga Battogtokh.

Appendices

A Model Fine-Tuning

For each of the three datasets (BANKING77, CLINC150, and HWU64), we train two models: a conventional single-sentence (1sent) model and a sentence-pair (2sent) model. The default initial model checkpoint (which we fine-tune) in the main body of this paper is BERT-base [6], distributed on Hugging Face's Transformers library [42] with an Apache 2.0 license. The datasets BANKING, CLINC, and HWU were accessed from Amazon Alexa AI's DialoGLUE benchmark repository [22] with CC-BY-4.0, CC-BY-SA 3.0, and CC-BY-SA 3.0 licenses respectively (see footnote 3).

For both formulations, we fine-tuned BERT-base on all examples in the training splits and evaluated on the test sets. The training script was run on an NVIDIA Quadro T1000 4 GB GPU, which allowed a batch size of 8 (for both 1sent and 2sent) for the BERT-base model with nearly 110 million parameters. The 1sent models took approximately 20 min to train on average across the three datasets, while the 2sent models took approximately one hour on average. For each dataset, we optimized using the Adam optimizer [11] with a learning rate of \(3 \times 10^{-5}\) for 3 epochs, with the rest of the training hyperparameters set to their defaults (in Transformers library version 4.8.2).
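
For concreteness, the following is a minimal sketch of a 1sent fine-tuning run under the hyperparameters stated above (BERT-base, batch size 8, learning rate \(3 \times 10^{-5}\), 3 epochs). It is not the authors' training script; the CSV paths and the "text"/"label" column names are hypothetical placeholders, and the label column is assumed to already contain integer class ids.

```python
# Minimal sketch (not the authors' script) of the 1sent fine-tuning setup:
# BERT-base, batch size 8, learning rate 3e-5, 3 epochs.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
data = load_dataset("csv", data_files={"train": "banking77_train.csv",   # hypothetical paths
                                       "test": "banking77_test.csv"})
num_labels = len(set(data["train"]["label"]))

# Tokenize the utterances; padding is handled dynamically by the Trainer's collator.
data = data.map(lambda batch: tokenizer(batch["text"], truncation=True),
                batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=num_labels)

args = TrainingArguments(
    output_dir="bert-1sent",
    per_device_train_batch_size=8,   # fits the 4 GB GPU mentioned above
    learning_rate=3e-5,
    num_train_epochs=3,
)

Trainer(model=model, args=args, train_dataset=data["train"],
        eval_dataset=data["test"], tokenizer=tokenizer).train()
```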

Given an example x from the test or training set, the 1sent model \(m_1\) computes a vector of prediction scores \([y_i] \in \mathbb{R}^{|C|}\), where |C| is the number of intents in the set of all candidate intents C, and outputs the index i with the highest prediction score. The 2sent model \(m_2\), on the other hand, outputs a match score given x and the natural language name \(c_i\) of an intent concatenated together with the special separator token “[SEP]” (\(m_2\) makes a binary prediction; we treat the prediction score for the positive class as this match score). During evaluation, we generate |C| inputs per x, pairing x with each intent \(c_j\), and identify the intent with the highest match score as the 2sent prediction.
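
A minimal sketch of this 2sent inference loop is shown below, assuming a fine-tuned binary sequence-pair classifier; the checkpoint path is a hypothetical placeholder, not a released artifact.

```python
# Sketch of 2sent inference: pair the utterance x with every candidate intent
# name, score each pair, and predict the intent with the highest match score
# (the positive-class probability).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "path/to/bert-2sent", num_labels=2)   # hypothetical fine-tuned checkpoint
model.eval()

def predict_2sent(x: str, intent_names: list[str]) -> str:
    # Encoding (x, c_i) as a sentence pair inserts "[SEP]" between the two parts.
    enc = tokenizer([x] * len(intent_names), intent_names,
                    padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits              # shape: (|C|, 2)
    match_scores = logits.softmax(dim=-1)[:, 1]   # positive-class score per pair
    return intent_names[int(match_scores.argmax())]
```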

During training, however, it is not necessary to pair an example x with all intents. Instead, given an example x annotated with an intent \(c_{true}\), we generate \(2 \times N + 1\) inputs, where \(N < |C|\). N of these are x paired with N random intents other than \(c_{true}\) (negative labels). Another N are N random example texts from the training split whose intent is not \(c_{true}\) (negative examples), each paired with \(c_{true}\). The remaining input is the positive pair of x and \(c_{true}\). We set N to 5 to limit training time given our resource constraints.
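
The pair-generation step can be sketched as follows; the field names ("text", "intent") and the list-of-dicts training-set format are illustrative assumptions.

```python
# Sketch of building the 2N + 1 training pairs per example (N = 5):
# 1 positive pair, N negative-label pairs, and N negative-example pairs.
import random

def make_pairs(example, train_set, intents, n=5):
    x, c_true = example["text"], example["intent"]
    pairs = [(x, c_true, 1)]                                    # positive pair
    neg_labels = random.sample([c for c in intents if c != c_true], n)
    pairs += [(x, c, 0) for c in neg_labels]                    # negative labels
    neg_texts = random.sample(
        [e["text"] for e in train_set if e["intent"] != c_true], n)
    pairs += [(t, c_true, 0) for t in neg_texts]                # negative examples
    return pairs
```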

Table 5 reports the accuracy achieved by our main sentence-pair model (BERT-2sent) along with benchmark performances (BERT-fixed and BERT-tuned) from [3] and the accuracy of a single-sentence BERT model (BERT-1sent) that we trained for reference. BERT-2sent is on par with the benchmark model BERT-fixed (see footnote 4) and approaches BERT-1sent in accuracy. Table 4 gives an overview of the datasets.

Table 4. Overview of the datasets
Table 5. Accuracy on the three datasets

B Crowdsourcing Details

Our study design followed those of previous work [29, 40]. We presented crowdworkers (on Prolific) with two alternative types of explanations and asked them to select their preference. We screened participants based on their English language proficiency and location (also as a proxy for language proficiency, following [40]) because our evaluation datasets were in English, and based on their education level (at least undergraduate) because our study involved non-trivial language understanding and familiarity with visualizations. We recruited participants from the United Kingdom, the United States, Canada, Australia, and New Zealand with a minimum approval rate of 98% and a minimum of 100 previous submissions on Prolific. All submissions were anonymous, and all participants were presented with an information sheet detailing our study and were asked for their consent at the beginning of the study.

Quality Control. Before our main head-to-head comparisons, we selected participants based on two preliminary rounds of questions. In the first preliminary round, we presented crowdworkers with 12 multiple-choice questions based on example texts from fine-grained classification datasets (intent recognition and sentiment analysis) to test their ability to understand fine-grained differences between English texts. We only selected participants who answered at least 11 of the questions correctly. In the second preliminary round, we presented participants with 12 explanations, half of them feature importance and the other half feature relation (ours), and asked them to rate how well each explanation explained the correspondence between a given text and its label. We acknowledge that subjectivity plays a significant role here: participants can disagree with us on non-trivial cases or be more or less generous than we are (while still producing similar relative ratings of the different explanations). We therefore passed participants through this round after manually checking their submissions, filtering mainly for low-quality responses (the same choice for all questions, too little time spent per question, obvious random clicking on trivial cases, etc.).

Fig. 9. User interface of our head-to-head comparison study (example)

Payment. Our two preliminary rounds paid £1 (£20/h) and £1.50 (£15/h) for 3 and 6 min respectively. Our main survey was expected to take 8–13 min, and participants were paid £3 (at least £12/h).

User Interface. Figure 9 shows the user interface of our study, which was implemented on Qualtrics.

C Further Evaluation

In this section, we report the results from our initial (less computationally expensive) evaluation setting, which we refer to as the lite evaluation setting. The results from this setting cover both the full test splits and the error sets (the subsets of misclassified examples) of the three datasets BANKING77, HWU64, and CLINC150. The key difference from the main evaluation setting is that the lite setting asks the contrastive question “Why \(c_1\) rather than \(c_2\)?” only once per example, whereas the main setting asks it for all possible values of \(c_2\). Moreover, the lite setting evaluates only a subset of the rules.

Within the lite setting, for each example in the error set, we aim to explain why the model made an error, i.e., “Why \(c_1\) rather than \(c_2\)?”, where \(c_1\) is the predicted intent and \(c_2\) is the correct (ground-truth) intent. On the full sets, our aim is the same for misclassified examples; for correctly classified examples, it is to explain “why not \(c_2\)?”, i.e., “Why \(c_1\) rather than \(c_2\)?”, where \(c_1\) is the prediction and \(c_2\) is an intent similar to \(c_1\). For this work, we pick \(c_2\) as the intent, among all intents other than \(c_1\), whose name has the highest vector similarity to the name of \(c_1\) (using the spaCy library).
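
The contrast-intent selection can be sketched as below; the specific spaCy model and the underscore-to-space normalization of intent names are assumptions rather than details from the paper.

```python
# Sketch of picking the contrast intent c_2: among all intents other than the
# predicted intent c_1, take the one whose name vector is most similar to c_1's.
import spacy

nlp = spacy.load("en_core_web_md")  # assumed: any spaCy model with word vectors

def pick_contrast_intent(c1: str, all_intents: list[str]) -> str:
    c1_doc = nlp(c1.replace("_", " "))  # assumed normalization of intent names
    candidates = [c for c in all_intents if c != c1]
    return max(candidates,
               key=lambda c: c1_doc.similarity(nlp(c.replace("_", " "))))
```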

C.1 Faithfulness

Tables 6 and 7 show the lite reproduction accuracy of our rules on the error and full sets respectively. We used the parameter \(K = 1\) in the concentration rule since reproduction rate has a strong negative and monotonic correlation with K (see Appendix C.4).

Table 6. Reproduction accuracy (error set)
Table 7. Reproduction accuracy (full set)

All rules consistently achieved higher reproduction accuracy than random guessing. Magnitude consistently outperformed concentration, averaging 62.8% on the error sets and 82.5% on the full sets. Combining magnitude and concentration (mag. + con.) achieved the highest values, with an average of 89.1% on the error sets and 97.8% on the full sets. The highest reproduction accuracy scores were 98.7% and 98.6%, on the full sets of CLINC150 and BANKING77 respectively. These high values suggest that our explanations faithfully explain model predictions.

Table 8. Spearman correlations

C.2 Correlation between Confidence and Reproduction Accuracy

Figure 10 shows that the reproduction accuracy of our best rule (mag. + con.) increased to perfect on the error sets and near perfect on the full sets as we filtered out evaluation examples with an increasing confidence threshold. This was the case despite the decreasing number of examples, i.e., a larger drop in accuracy with each failure (see Fig. 11).

As in our main results, there was a strong tendency toward a positive, monotonic correlation between reproduction accuracy and confidence threshold (see Table 8, in which most correlation values are close to +1). The error set of BANKING77 bucked this tendency with a weak correlation (value near 0), but Fig. 10a shows that reproduction accuracy increased even in this case, though non-monotonically.
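
As a concrete illustration of the statistic behind Table 8, the following computes a Spearman correlation between a confidence threshold and the reproduction accuracy measured at that threshold; the numeric values are made-up placeholders, not results from the paper.

```python
# Illustration of the Spearman check behind Table 8 (placeholder numbers only):
# correlate the confidence threshold with the reproduction accuracy at that threshold.
from scipy.stats import spearmanr

thresholds = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
reproduction_acc = [0.85, 0.86, 0.88, 0.90, 0.91, 0.93, 0.95, 0.97, 0.99, 1.00]

rho, p_value = spearmanr(thresholds, reproduction_acc)
print(f"Spearman rho = {rho:+.2f} (p = {p_value:.3g})")  # rho near +1 => strong monotonic increase
```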

C.3 Importance of Special Tokens

We also hypothesized that certain reproductions may have failed because our cross-sentence relations did not always fully capture model reasoning. Importantly, we ignored the special tokens “[CLS]” and “[SEP]”, which are known for aggregating the input and receiving high levels of attention [4], to focus only on human-readable tokens (see Fig. 4). There are three special tokens in each concatenated input that we feed into our model: “[CLS]” as the first token, and two “[SEP]” tokens (one for delimiting the input pairs and one at the end of the input). We experimented with treating the “[CLS]” and the first “[SEP]” as features of text x, while treating the last “[SEP]” as a feature of intent \(c_i\). This led to higher reproduction rates on all datasets (see Table 9).
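
Index bookkeeping for this variant can be sketched as follows; only the assignment of the special tokens to the two segments is shown, since the subsequent aggregation of attention follows the paper's rules and is not reproduced here.

```python
# Sketch of the special-token variant described above: treat "[CLS]" and the
# first "[SEP]" as features of the text x, and the final "[SEP]" as a feature
# of the intent name c_i.
def assign_token_segments(tokens: list[str]) -> dict[str, list[int]]:
    first_sep = tokens.index("[SEP]")
    last_sep = len(tokens) - 1 - tokens[::-1].index("[SEP]")
    text_idx = list(range(0, first_sep + 1))                 # "[CLS]" ... first "[SEP]"
    intent_idx = list(range(first_sep + 1, last_sep + 1))    # intent tokens ... last "[SEP]"
    return {"text": text_idx, "intent": intent_idx}

# Example input: ["[CLS]", "how", "do", "i", "track", "my", "card", "[SEP]",
#                 "card", "arrival", "[SEP]"]
```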

Fig. 10. Graphs showing the correlation between confidence threshold and reproduction accuracy

Fig. 11. Graphs showing the correlation between confidence threshold and number of examples

C.4 Correlation Between Concentration Parameter K and Reproduction Accuracy

Figure 12 shows the relation between the parameter K (see Sect. 4.3) in our concentration rule and the reproduction accuracy (under the lite evaluation setting; see Appendix C.1) of the rule.

As Table 10 shows, there is a strong negative and monotonic correlation between the parameter K and reproduction accuracy.

Fig. 12. Correlation between concentration parameter K and reproduction accuracy

Table 9. Reproduction accuracy (full set) when special tokens are treated as features of text x and intent \(c_i\) (with confidence threshold) under lite evaluation setting (see Appendix C.1)
Table 10. Spearman correlations between K and reproduction accuracy. There is a strong negative correlation between K and reproduction accuracy across all datasets.

D Experiments with DeBERTa

Model Training. Model training followed the same process for both DeBERTa and BERT except for a few details (see Appendix A). We used microsoft/deberta-base [7], with approximately 140 million parameters, distributed on Hugging Face's Transformers library [42] with an MIT license. We used learning rates of \(1 \times 10^{-2}\) or \(2 \times 10^{-2}\), selected by greedy search (due to resource constraints), and a batch size of 1 or 2 with our DeBERTa models. Training took approximately one hour for the 1sent models across the three datasets, and 4–6 h for the 2sent models. In contrast, our main evaluation loop takes less than 5 min on the full set for both 1sent and 2sent models. The DeBERTa model trained with the single-sentence formulation (DeBERTa-1sent) achieved the highest accuracy among the models we trained. However, DeBERTa trained with the sentence-pair formulation (DeBERTa-2sent) achieved scores lower than BERT-2sent, possibly due to non-optimal hyperparameter choices (especially the learning rate). Nevertheless, we proceeded with this DeBERTa-2sent model since its performance is adequate for our primary purpose of evaluating interpretability (Table 11).

Table 11. Accuracy of our fine-tuned models including DeBERTa trained with single-sentence (DeBERTa-1sent) and sentence-pair (DeBERTa-2sent) approaches

Faithfulness Results. The faithfulness results with DeBERTa-2sent were similar to those with BERT-2sent (under the lite evaluation setting; see Appendix C): the mag. + con. rule achieved high reproduction accuracy on both the error and full sets (see Tables 12 and 13). An interesting difference, however, was that magnitude was better than concentration at reproducing model predictions with BERT, whereas the opposite held with DeBERTa.

Table 12. Reproduction accuracy (error set) with DeBERTa
Table 13. Reproduction accuracy (full set) with DeBERTa

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Battogtokh, M., Luck, M., Davidescu, C., Borgo, R. (2024). Simple Framework for Interpretable Fine-Grained Text Classification. In: Nowaczyk, S., et al. Artificial Intelligence. ECAI 2023 International Workshops. ECAI 2023. Communications in Computer and Information Science, vol 1947. Springer, Cham. https://doi.org/10.1007/978-3-031-50396-2_23

  • DOI: https://doi.org/10.1007/978-3-031-50396-2_23

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-50395-5

  • Online ISBN: 978-3-031-50396-2

  • eBook Packages: Computer Science, Computer Science (R0)
