
Context-Rich Evaluation of Machine Common Sense

  • Conference paper
Artificial General Intelligence (AGI 2023)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13921)


Abstract

Building machines capable of common sense reasoning is an important milestone in achieving Artificial General Intelligence (AGI). While recent advances, such as large language models, are promising, systematic and sufficiently robust evaluations of these models on common sense have been inadequate, and were largely designed for an earlier generation of models. One criticism of prior evaluation protocols is that they have been too narrow in scope, e.g., by restricting the format of questions posed to the model, not being theoretically grounded, and not taking the context of a model’s responses into account when constructing follow-up questions or asking for explanations. In this paper, we aim to address this gap by proposing a context-rich evaluation protocol designed specifically for evaluating machine common sense. Our protocol can subsume popular evaluation paradigms in machine common sense as special cases, and is suited for evaluating both discriminative and generative large language models. We demonstrate the utility of the protocol by using it to conduct a pilot evaluation of the ChatGPT system on common sense reasoning.
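
The abstract describes the protocol only at a high level. As a rough illustration of what a context-rich, multi-turn evaluation loop of this kind might look like, the following Python sketch keeps the model’s earlier answers in its context and uses the most recent answer to construct a follow-up question or a request for an explanation. All names in the sketch (EvaluationSession, Turn, ask_model, follow_up, dummy_model) are hypothetical assumptions made for illustration and are not taken from the paper; the actual protocol is defined in the paper itself.

    # Minimal, hypothetical sketch of a context-rich evaluation loop.
    # Names and structure are illustrative assumptions, not the authors' code.

    from dataclasses import dataclass, field
    from typing import Callable, List


    @dataclass
    class Turn:
        """One question/answer exchange in an evaluation session."""
        question: str
        answer: str


    @dataclass
    class EvaluationSession:
        """Accumulates the dialogue so follow-ups can be conditioned on prior answers."""
        ask_model: Callable[[str], str]           # wrapper around the model under test
        history: List[Turn] = field(default_factory=list)

        def ask(self, question: str) -> str:
            # Prepend the running context so the model sees all prior turns.
            context = "\n".join(f"Q: {t.question}\nA: {t.answer}" for t in self.history)
            prompt = f"{context}\nQ: {question}\nA:" if context else f"Q: {question}\nA:"
            answer = self.ask_model(prompt)
            self.history.append(Turn(question, answer))
            return answer

        def follow_up(self, template: str) -> str:
            # Build a follow-up (e.g., a request for an explanation) from the last answer.
            last = self.history[-1]
            return self.ask(template.format(answer=last.answer, question=last.question))


    if __name__ == "__main__":
        # Toy stand-in for a model, so the sketch runs without any external API.
        def dummy_model(prompt: str) -> str:
            return "The glass would likely break."

        session = EvaluationSession(ask_model=dummy_model)
        session.ask("What happens if a glass falls off a table onto a tile floor?")
        print(session.follow_up("You answered: '{answer}'. Why do you believe that?"))

In an actual evaluation, ask_model would wrap whichever discriminative or generative model is under test, and the follow-up templates would be chosen by the evaluator in light of the model’s previous responses rather than fixed in advance.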


Notes

  1. https://docs.google.com/document/d/1yNrjTOt0imJW5OVajTDNcAJ0PJxmM6B7Dpe3N8YFMD4/edit?usp=sharing.

  2. The log for this session may be found at https://docs.google.com/document/d/1a-CDcijT2an0XiYF-JQ0i2ZvFiUpUB-Xb4wZBUkBVkg/edit?usp=sharing.

  3. The logs for these sessions may be found at https://docs.google.com/document/d/1tLseMBfGVEhdpcm4jGNg_9Dr4ruGihFX9ncY3240k5Y/edit?usp=sharing and https://docs.google.com/document/d/1HWma7MuZkaCeqq6aVmXtBzP9pqudmF7YH1GAl9z2xoc/edit?usp=sharing, respectively.

Author information


Corresponding author

Correspondence to Mayank Kejriwal.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Kejriwal, M., Santos, H., Shen, K., Mulvehill, A.M., McGuinness, D.L. (2023). Context-Rich Evaluation of Machine Common Sense. In: Hammer, P., Alirezaie, M., Strannegård, C. (eds) Artificial General Intelligence. AGI 2023. Lecture Notes in Computer Science, vol. 13921. Springer, Cham. https://doi.org/10.1007/978-3-031-33469-6_17


  • DOI: https://doi.org/10.1007/978-3-031-33469-6_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-33468-9

  • Online ISBN: 978-3-031-33469-6

  • eBook Packages: Computer Science, Computer Science (R0)
