
Assessing linguistic generalisation in language models: a dataset for Brazilian Portuguese

  • Original Paper
  • Published in Language Resources and Evaluation

Abstract

Much recent effort has been devoted to creating large-scale language models. Nowadays, the most prominent approaches are based on deep neural networks, such as BERT. However, these models lack transparency and interpretability, and are often seen as black boxes. This affects not only their applicability in downstream tasks but also the comparability of different architectures, or even of the same model trained on different corpora or with different hyperparameters. In this paper, we propose a set of intrinsic evaluation tasks that inspect the linguistic information encoded in models developed for Brazilian Portuguese. These tasks are designed to evaluate how different language models generalise information related to grammatical structures and multiword expressions (MWEs), thus allowing an assessment of whether a model has learned different linguistic phenomena. The dataset developed for these tasks is composed of a series of sentences with a single masked word and a cue phrase that helps narrow down the context. The dataset is divided into MWEs and grammatical structures, and the latter is subdivided into six tasks: impersonal verbs, subject agreement, verb agreement, nominal agreement, passive and connectors. The MWE subset was used to test BERTimbau Large, BERTimbau Base and mBERT. For the grammatical structures, we used only BERTimbau Large, because it yielded the best results in the MWE task. In both cases, we evaluated the results considering both the best candidate and the top ten candidates. The evaluation was done automatically for MWEs and manually for grammatical structures. The results for MWEs show that BERTimbau Large surpassed the other two models in predicting the correct masked element. However, the average accuracy of the best model was only 52% when only the best candidate for each sentence was considered, rising to 66% when the top ten candidates were taken into account.
As for the grammatical tasks, the results showed better prediction, but also varied depending on the type of morphosyntactic agreement. On the one hand, cases such as connectors and impersonal verbs, which do not require any agreement in the produced candidates, had precisions of 100% and 98.78%, respectively, among the best candidates. On the other hand, tasks that require morphosyntactic agreement had results consistently below 90% overall precision, with the lowest scores reported for nominal agreement and verb agreement, both below 80% overall precision among the best candidates. We therefore identified that a critical and widely adopted resource for Brazilian Portuguese NLP presents issues concerning MWE vocabulary and morphosyntactic agreement, even if it performs well in most cases. These models are a core component of many NLP systems, and our findings demonstrate the need for further improvements to these models and the importance of widely evaluating computational representations of language.
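The two evaluation settings described in the abstract (only the best candidate vs. the top ten candidates for each masked word) amount to computing top-k accuracy over ranked candidate lists. A minimal sketch of that computation, using hypothetical candidate lists and gold answers rather than the paper's actual data:

```python
def topk_accuracy(predictions, gold, k=1):
    """Fraction of items whose gold answer appears among the top-k candidates.

    predictions: one ranked candidate list (best first) per masked sentence
    gold: the correct masked word for each sentence
    """
    hits = sum(1 for cands, g in zip(predictions, gold) if g in cands[:k])
    return hits / len(gold)

# Hypothetical ranked candidates for three masked sentences
predictions = [
    ["casa", "lar", "prédio"],   # gold answer ranked first
    ["fazer", "dar", "tomar"],   # gold answer ranked third
    ["bom", "mau", "ótimo"],     # gold answer absent
]
gold = ["casa", "tomar", "belo"]

print(topk_accuracy(predictions, gold, k=1))  # best candidate only
print(topk_accuracy(predictions, gold, k=3))  # top three candidates
```

The paper's 52% vs. 66% MWE figures correspond to the same computation with k=1 and k=10.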

Fig. 1

Notes

  1. We highlight that the model by Abdaoui et al. (2020) is distilled from mBERT, and therefore aims to retain similar representations.

  2. Github page of the dataset: https://github.com/rswilkens/Assessing-Linguistic-Generalisation-in-Language-Models.

  3. https://huggingface.co/models?filter=en. Accessed on 31/05/2022.

  4. https://huggingface.co/models?filter=pt. Accessed on 31/05/2022.

  5. On 31 May 2022, BERTimbau Base had already been downloaded 229,000 times, and BERTimbau Large, 51,300 times.

  6. We will explain each of the different grammatical tests in more detail in Sect. 5.2, as we go through each of the tests. Here in the methodology, we will focus on the general process used to create the test sets.

  7. As a general reference for the examples of grammatical tests presented in this paper, we use angle brackets (< >) and italics to indicate the seeds, and square brackets ([ ]) to indicate the words predicted by the model. We also provide a translation that is close to literal in most cases.

  8. Due to the nature of the UNITEX-PB dictionary, we generated seeds based mainly on traditional-grammar-based verb tenses, which, for Portuguese, are all simple tenses, so we do not look at compound tenses for these cases in our study.

  9. Given the size of the manual evaluation task, with more than 12,000 items, we only used the BERTimbau Large model, as it performed best in the MWE task, as will be discussed in the next section.

  10. We do not use a parser in this step for simplicity.

  11. We compared the three models using McNemar’s test, where we observed p-values < 0.0001. Analogously, we compared the models regarding the task of predicting W1 and W2 separately, where we observed p-values of 0.0001 and 0.034 respectively for the Base to Large comparison and 0 for the other comparisons.
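The pairwise model comparison in this note uses McNemar's test on paired predictions. A minimal sketch implemented directly from the test's definition; the disagreement counts below are hypothetical, not the paper's contingency tables:

```python
import math

def mcnemar(b, c):
    """McNemar's chi-square test with continuity correction.

    b: items the first model got right and the second got wrong
    c: items the first model got wrong and the second got right
    Returns (statistic, p-value) under the 1-df chi-square approximation.
    """
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of a chi-square with 1 degree of freedom
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p

# Hypothetical disagreement counts between two models on the same items
stat, p = mcnemar(b=40, c=10)
print(f"chi2 = {stat:.2f}, p = {p:.2e}")  # p well below 0.0001
```

Only the off-diagonal disagreement counts matter; items both models got right or both got wrong cancel out. For small counts, an exact binomial variant (e.g. `statsmodels.stats.contingency_tables.mcnemar`) is preferable to this chi-square approximation.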

  12. See Sect. 4 for a discussion of why we chose this model.

  13. Usually the third person singular and third person plural are used instead of the respective second person conjugations.

  14. In the English translation of the second example, the negation not was included in the seed for simplicity, but the seed in Portuguese is only the verb, and the negation não is part of the base sentence.

  15. Available in this Github repository: https://github.com/rswilkens/Assessing-Linguistic-Generalisation-in-Language-Models.

References

  • Abdaoui, A., Pradel, C., & Sigel, G. (2020). Load what you need: Smaller versions of multilingual BERT. In Proceedings of SustaiNLP: Workshop on Simple and Efficient Natural Language Processing, pp. 119–123.

  • Bacon, G., & Regier, T. (2019). Does BERT agree? Evaluating knowledge of structure dependence through agreement relations. arXiv:1908.09892.

  • Bakarov, A. (2018). A survey of word embeddings evaluation methods. arXiv:1801.09536.

  • Bolukbasi, T., Chang, K.-W., Zou, J., Saligrama, V., & Kalai, A. (2016). Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'16, Curran Associates Inc., pp. 4356–4364.

  • Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave, E., Ott, M., Zettlemoyer, L., & Stoyanov, V. (2019). Unsupervised cross-lingual representation learning at scale. arXiv:1911.02116.

  • Constant, M., Eryiğit, G., Monti, J., van der Plas, L., Ramisch, C., Rosner, M., & Todirascu, A. (2017). Multiword expression processing: A survey. Computational Linguistics, 43(4), 837–892.


  • Cordeiro, S., Villavicencio, A., Idiart, M., & Ramisch, C. (2019). Unsupervised compositionality prediction of nominal compounds. Computational Linguistics, 45(1), 1–57.


  • De Beaugrande, R.-A., & Dressler, W. U. (2011). Einführung in die textlinguistik. In Einführung in die Textlinguistik. Max Niemeyer Verlag.

  • Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805.

  • Dinan, E., Fan, A., Wu, L., Weston, J., Kiela, D., & Williams, A. (2020). Multi-dimensional gender bias classification. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 314–331.

  • Ettinger, A. (2020). What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models. Transactions of the Association for Computational Linguistics, 8, 34–48.


  • Garcia, M., Kramer Vieira, T., Scarton, C., Idiart, M., & Villavicencio, A. (2021a). Assessing the representations of idiomaticity in vector models with a noun compound dataset labeled at type and token levels. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, pp. 2730–2741.

  • Garcia, M., Kramer Vieira, T., Scarton, C., Idiart, M., & Villavicencio, A. (2021b). Probing for idiomaticity in vector space models. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. Association for Computational Linguistics, pp. 3551–3564.

  • Goldberg, Y. (2019). Assessing BERT's syntactic abilities. arXiv:1901.05287.

  • Graesser, A. C., McNamara, D. S., Louwerse, M. M., & Cai, Z. (2004). Coh-metrix: Analysis of text on cohesion and language. Behavior Research Methods, Instruments, & Computers, 36(2), 193–202.


  • Gulordava, K., Bojanowski, P., Grave, É., Linzen, T., & Baroni, M. (2018). Colorless green recurrent networks dream hierarchically. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1195–1205.

  • Halliday, M. A. K., & Hasan, R. (2014). Cohesion in English. London: Routledge.


  • Kassner, N., & Schütze, H. (2020). Negated and misprimed probes for pretrained language models: Birds can talk, but cannot fly. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, pp. 7811–7818.

  • Kilgarriff, A., Rychly, P., Smrz, P., & Tugwell, D. (2004). The sketch engine. In Proceedings of the Eleventh EURALEX International Congress.

  • Koch, I. G. V. (1988). Principais mecanismos de coesão textual em português. Cadernos de Estudos Linguísticos, 15, 73–80.


  • Koch, I. G. V. (1999). A coesão textual. Editora Contexto.


  • Kumar, V., Bhotia, T. S., Kumar, V., & Chakraborty, T. (2020). Nurse is closer to woman than surgeon? Mitigating gender-biased proximities in word embeddings. Transactions of the Association for Computational Linguistics, 8, 486–503.


  • Kurita, K., Vyas, N., Pareek, A., Black, A. W., & Tsvetkov, Y. (2019). Quantifying social biases in contextual word representations. In 1st ACL Workshop on Gender Bias for Natural Language Processing.

  • Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., & Soricut, R. (2019). ALBERT: A lite BERT for self-supervised learning of language representations. CoRR, arXiv:1909.11942.

  • Linzen, T., Dupoux, E., & Goldberg, Y. (2016). Assessing the ability of lstms to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4, 521–535.


  • Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). Roberta: A robustly optimized bert pretraining approach. arXiv:1907.11692.

  • Louwerse, M. (2002). An analytic and cognitive parameterization of coherence relations.

  • Mandera, P., Keuleers, E., & Brysbaert, M. (2017). Explaining human performance in psycholinguistic tasks with models of semantic similarity based on prediction and counting: A review and empirical validation. Journal of Memory and Language, 92, 57–78.


  • Marcus, G. (2020). The next decade in AI: four steps towards robust artificial intelligence. CoRR, arXiv:2002.06177.

  • Marvin, R., & Linzen, T. (2018). Targeted syntactic evaluation of language models. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 1192–1202.

  • Miller, G. A., Beckwith, R., Fellbaum, C., Gross, D., & Miller, K. J. (1990). Introduction to wordnet: An on-line lexical database. International Journal of Lexicography, 3(4), 235–244.


  • Mueller, A., Nicolai, G., Petrou-Zeniou, P., Talmina, N., & Linzen, T. (2020). Cross-linguistic syntactic evaluation of word prediction models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 5523–5539.

  • Muniz, M. C., Nunes, M. G. V., & Laporte, E. (2005). UNITEX-PB, a set of flexible language resources for Brazilian Portuguese. In Workshop on Technology on Information and Human Language (TIL), pp. 2059–2068.

  • Nivre, J., Agić, Ž., Ahrenberg, L., Antonsen, L., Aranzabe, M. J., Asahara, M., Ateyah, L., Attia, M., Atutxa, A., Augustinus, L., et al. (2017). Universal dependencies 2.1.

  • Oshikawa, R., Qian, J., & Wang, W. Y. (2020). A survey on natural language processing for fake news detection. In Proceedings of the 12th Language Resources and Evaluation Conference, pp. 6086–6093.

  • Pasquer, C., Savary, A., Ramisch, C., & Antoine, J.-Y. (2020). Verbal multiword expression identification: Do we need a sledgehammer to crack a nut? In Proceedings of the 28th International Conference on Computational Linguistics. International Committee on Computational Linguistics, pp. 3333–3345.

  • Perini, M. A. (2010). Gramática do português brasileiro. Parábola Ed.


  • Sag, I. A., Baldwin, T., Bond, F., Copestake, A., & Flickinger, D. (2002). Multiword expressions: A pain in the neck for NLP. In Proceedings of the Third International Conference on Computational Linguistics and Intelligent Text Processing (CICLing 2002). Springer, pp. 1–15.

  • Şahin, G. G., Vania, C., Kuznetsov, I., & Gurevych, I. (2020). LINSPECTOR: Multilingual probing tasks for word representations. Computational Linguistics, 46(2), 335–385.


  • Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv:1910.01108.

  • Sardinha, T. B. (2010). Corpus brasileiro. Informática, 708, 0–1.


  • Savoldi, B., Gaido, M., Bentivogli, L., Negri, M., & Turchi, M. (2021). Gender bias in machine translation. Transactions of the Association for Computational Linguistics, 9, 845–874.


  • Scarton, C., & Aluísio, S. M. (2010). Coh-Metrix-Port: A readability assessment tool for texts in Brazilian Portuguese. In Proceedings of the 9th International Conference on Computational Processing of the Portuguese Language, Extended Activities Proceedings, PROPOR, vol. 10.

  • Schneider, E. T. R., de Souza, J. V. A., Knafou, J., Oliveira, L. E. S. e., Copara, J., Gumiel, Y. B., Oliveira, L. F. A. d., Paraiso, E. C., Teodoro, D., & Barra, C. M. C. M. (2020). BioBERTpt: A Portuguese neural language model for clinical named entity recognition. In Proceedings of the 3rd Clinical Natural Language Processing Workshop. Association for Computational Linguistics, pp. 65–72.

  • Scholivet, M., & Ramisch, C. (2017). Identification of ambiguous multiword expressions using sequence models and lexical resources. In Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017). Association for Computational Linguistics, pp. 167–175.

  • Schrimpf, M., Blank, I., Tuckute, G., Kauf, C., Hosseini, E. A., Kanwisher, N., Tenenbaum, J., & Fedorenko, E. (2020). Artificial neural networks accurately predict language processing in the brain. BioRxiv.

  • Souza, F., Nogueira, R., & Lotufo, R. (2019). Portuguese named entity recognition using BERT-CRF. arXiv:1909.10649.

  • Souza, F., Nogueira, R., & Lotufo, R. (2020). BERTimbau: pretrained BERT models for Brazilian Portuguese. In Brazilian Conference on Intelligent Systems. Springer, pp. 403–417.

  • Su, Q., Wan, M., Liu, X., & Huang, C.-R. (2020). Motivations, methods and metrics of misinformation detection: An nlp perspective. Natural Language Processing Research, 1, 1–13.


  • Sylak-Glassman, J. (2016). The composition and use of the universal morphological feature schema (unimorph schema). Johns Hopkins University.

  • Tayyar Madabushi, H., Gow-Smith, E., Garcia, M., Scarton, C., Idiart, M., & Villavicencio, A. (2022). SemEval-2022 task 2: Multilingual idiomaticity detection and sentence embedding. In Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022). Association for Computational Linguistics, pp. 107–121.

  • Tayyar Madabushi, H., Gow-Smith, E., Scarton, C., & Villavicencio, A. (2021). AStitchInLanguageModels: Dataset and methods for the exploration of idiomaticity in pre-trained language models. In Findings of the Association for Computational Linguistics: EMNLP 2021. Association for Computational Linguistics, pp. 3464–3477.

  • Vale, O., & Baptista, J. (2015). Novo dicionário de formas flexionadas do unitex-pb: Avaliação da flexão verbal (new dictionary of inflected forms of unitex-pb: Evaluation of verbal inflection). In Proceedings of the 10th Brazilian Symposium in Information and Human Language Technology, pp. 171–180.

  • Vulić, I., Baker, S., Ponti, E. M., Petti, U., Leviant, I., Wing, K., Majewska, O., Bar, E., Malone, M., Poibeau, T., Reichart, R., & Korhonen, A. (2020). Multi-SimLex: A large-scale evaluation of multilingual and crosslingual lexical semantic similarity. Computational Linguistics, 46(4), 847–897.


  • Wagner Filho, J. A., Wilkens, R., Idiart, M., & Villavicencio, A. (2018). The brWaC corpus: A new open resource for Brazilian Portuguese. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), pp. 4339–4344.

  • Warstadt, A., Parrish, A., Liu, H., Mohananey, A., Peng, W., Wang, S.-F., & Bowman, S. R. (2020). BLiMP: The benchmark of linguistic minimal pairs for English. Transactions of the Association for Computational Linguistics, 8, 377–392.


  • Wilkens, R., Zilio, L., Cordeiro, S. R., Paula, F., Ramisch, C., Idiart, M., & Villavicencio, A. (2017). LexSubNC: A dataset of lexical substitution for nominal compounds. In IWCS 2017: 12th International Conference on Computational Semantics: Short papers.

  • Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R. R., & Le, Q. V. (2019). Xlnet: Generalized autoregressive pretraining for language understanding. Advances in Neural Information Processing Systems, 32, 1–10.



Acknowledgements

This research has partially been funded by a research convention with France Éducation International. It is also partly funded by EPSRC (project EP/T02450X/1 Modeling Idiomaticity in Human and Artificial Language Processing), by the Royal Society (project NAF/R2/202209), and by Research England, in the form of the Expanding Excellence in England (E3) programme.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Rodrigo Wilkens.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Wilkens, R., Zilio, L. & Villavicencio, A. Assessing linguistic generalisation in language models: a dataset for Brazilian Portuguese. Lang Resources & Evaluation 58, 175–201 (2024). https://doi.org/10.1007/s10579-023-09664-1

