
SsciBERT: a pre-trained language model for social science texts


Abstract

The academic literature of the social sciences records human civilization and studies human social problems. As this literature grows at scale, quickly finding existing research on relevant issues has become an urgent need for researchers. Previous studies, such as SciBERT, have shown that pre-training on domain-specific text improves performance on natural language processing tasks. However, no pre-trained language model for the social sciences has been available so far. In light of this, the present research proposes a pre-trained model based on abstracts published in Social Sciences Citation Index (SSCI) journals. The models, which are available on GitHub (https://github.com/S-T-Full-Text-Knowledge-Mining/SSCI-BERT), show excellent performance on discipline classification, abstract structure–function recognition, and named entity recognition tasks with social sciences literature.
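
Since the released checkpoints follow the standard BERT architecture, they can presumably be loaded with the Hugging Face transformers library (Wolf et al., 2019). The sketch below is illustrative only: the model identifier is a placeholder rather than a verified Hub ID (the actual released checkpoint names are listed in the GitHub repository above), and the task-specific classification heads evaluated in the paper are not shown.

```python
# Minimal sketch: load a BERT-style SsciBERT checkpoint and embed an abstract.
# NOTE: MODEL_ID is a placeholder, not a verified Hugging Face Hub ID; see the
# GitHub repository above for the actual released checkpoint names.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "path/to/ssci-bert-checkpoint"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)

abstract = (
    "This study examines the relationship between social capital "
    "and civic participation in urban communities."
)

# Tokenize and run a forward pass; BERT-style encoders cap input at 512 tokens.
inputs = tokenizer(abstract, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# The [CLS] vector (first token) is the usual sentence-level representation on
# which a downstream head (e.g., for discipline classification or abstract
# structure-function recognition) would be fine-tuned.
cls_embedding = outputs.last_hidden_state[:, 0, :]
print(cls_embedding.shape)  # torch.Size([1, 768]) for a BERT-base encoder
```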


References

  • Asada, M., Miwa, M., & Sasaki, Y. (2020). Using drug descriptions and molecular structures for drug–drug interaction extraction from literature. Bioinformatics, 37(12), 1739–1746. https://doi.org/10.1093/bioinformatics/btaa907

  • Beltagy, I., Lo, K., & Cohan, A. (2019). SciBERT: A pretrained language model for scientific text. Paper presented at the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP 2019), Hong Kong.

  • Bengio, Y., Ducharme, R., & Vincent, P. (2000). A neural probabilistic language model. Paper presented at the Neural Information Processing Systems 2000 (NIPS 2000), Denver, Colorado.

  • Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135–146. https://doi.org/10.1162/tacl_a_00051

  • Brack, A., D’Souza, J., Hoppe, A., Auer, S., & Ewerth, R. (2020). Domain-independent extraction of scientific concepts from research articles. In G. Kazai & A. R. Fuhr (Eds.), Advances in information retrieval (pp. 251–266). Springer.

  • Cattan, A., Johnson, S., Weld, D., Dagan, I., Beltagy, I., Downey, D., & Hope, T. (2021). SciCo: Hierarchical cross-document coreference for scientific concepts. Paper presented at the 3rd Conference on Automated Knowledge Base Construction (AKBC 2021), Irvine.

  • Chalkidis, I., Fergadiotis, M., Malakasiotis, P., Aletras, N., & Androutsopoulos, I. (2020). LEGAL-BERT: The muppets straight out of law school. Paper presented at the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), Online.

  • Chen, S. F., Beeferman, D., & Rosenfeld, R. (1998). Evaluation metrics for language models. Paper presented at the DARPA Broadcast News Transcription and Understanding Workshop (pp. 2–8).

  • D’Souza, J., Auer, S., & Pedersen, T. (2021, August). SemEval-2021 Task 11: NLPContributionGraph—Structuring scholarly NLP contributions for a research knowledge graph. Paper presented at the 15th International Workshop on Semantic Evaluation (SemEval-2021), Online.

  • D’Souza, J., Hoppe, A., Brack, A., Jaradeh, M. Y., Auer, S., & Ewerth, R. (2020, May). The STEM-ECR dataset: Grounding scientific entity references in STEM scholarly content to authoritative encyclopedic and lexicographic sources. Paper presented at the 12th Language Resources and Evaluation Conference (LREC 2020), Marseille.

  • Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. Paper presented at the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019), Minneapolis, Minnesota.

  • Dong, Q., Wan, X., & Cao, Y. (2021, April). ParaSCI: A large scientific paraphrase dataset for longer paraphrase generation. Paper presented at the 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2021), Online.

  • Ferreira, D., & Freitas, A. (2020, May). Natural language premise selection: Finding supporting statements for mathematical text. Paper presented at the 12th Language Resources and Evaluation Conference (LREC 2020), Marseille.

  • Friedrich, A., Adel, H., Tomazic, F., Hingerl, J., Benteau, R., Marusczyk, A., & Lange, L. (2020). The SOFC-Exp corpus and neural approaches to information extraction in the materials science domain. Paper presented at the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), Online.

  • Graetz, N. (1982). Teaching EFL students to extract structural information from abstracts. Paper presented at the International Symposium on Language for Special Purposes, Eindhoven.

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. Paper presented at the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, Nevada.

  • Hebbar, S., & Xie, Y. (2021). CovidBERT: Biomedical relation extraction for COVID-19. Paper presented at the Florida Artificial Intelligence Research Society Conference (FLAIRS 2021), North Miami Beach, Florida.

  • Huang, K.-H., Yang, M., & Peng, N. (2020). Biomedical event extraction with hierarchical knowledge graphs. Paper presented at the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), Online.

  • Kononova, O., He, T., Huo, H., Trewartha, A., Olivetti, E. A., & Ceder, G. (2021). Opportunities and challenges of text mining in materials research. iScience, 24(3), 102155. https://doi.org/10.1016/j.isci.2021.102155

  • Kotonya, N., & Toni, F. (2020). Explainable automated fact-checking for public health claims. Paper presented at the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), Online.

  • Koutsikakis, J., Chalkidis, I., Malakasiotis, P., & Androutsopoulos, I. (2020). GREEK-BERT: The Greeks visiting Sesame Street. Paper presented at the 11th Hellenic Conference on Artificial Intelligence (SETN 2020), Athens.

  • Kuniyoshi, F., Makino, K., Ozawa, J., & Miwa, M. (2020). Annotating and extracting synthesis process of all-solid-state batteries from scientific literature. Paper presented at the 12th Language Resources and Evaluation Conference (LREC 2020), Marseille.

  • Lauscher, A., Ko, B., Kuehl, B., Johnson, S., Jurgens, D., Cohan, A., & Lo, K. (2021). MultiCite: Modeling realistic citations requires moving beyond the single-sentence single-label setting. Preprint at https://arxiv.org/abs/2107.00414

  • Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H., & Kang, J. (2019). BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4), 1234–1240. https://doi.org/10.1093/bioinformatics/btz682

  • Medić, Z., & Šnajder, J. (2020). A survey of citation recommendation tasks and methods. Journal of Computing and Information Technology, 28(3), 183–205. https://doi.org/10.20532/cit.2020.1005160

  • Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. Preprint at https://arxiv.org/abs/1301.3781

  • Muraina, I. (2022). Ideal dataset splitting ratios in machine learning algorithms: General concerns for data scientists and data analysts.

  • Murty, S., Koh, P. W., & Liang, P. (2020, July). ExpBERT: Representation engineering with natural language explanations. Paper presented at the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), Online.

  • Nicholson, J. M., Mordaunt, M., Lopez, P., Uppala, A., Rosati, D., Rodrigues, N. P., Grabitz, P., & Rife, S. C. (2021). Scite: A smart citation index that displays the context of citations and classifies their intent using deep learning. Quantitative Science Studies, 2(3), 882–898. https://doi.org/10.1162/qss_a_00146

  • Park, S., & Caragea, C. (2020). Scientific keyphrase identification and classification by pre-trained language models intermediate task transfer learning. Paper presented at the 28th International Conference on Computational Linguistics (COLING 2020), Barcelona (Online).

  • Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global vectors for word representation. Paper presented at the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), Doha.

  • Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. Paper presented at the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2018), New Orleans, Louisiana.

  • Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training. Retrieved from https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf

  • Rasmy, L., Xiang, Y., Xie, Z., Tao, C., & Zhi, D. (2021). Med-BERT: Pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction. NPJ Digital Medicine, 4(1), 86. https://doi.org/10.1038/s41746-021-00455-y

  • Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. https://doi.org/10.48550/arXiv.1409.1556

  • Sollaci, L. B., & Pereira, M. G. (2004). The introduction, methods, results, and discussion (IMRAD) structure: A fifty-year survey. Journal of the Medical Library Association, 92(3), 364–367.

  • Swales, J. (1990). Genre analysis: English in academic and research settings. Cambridge University Press.

  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., …, Polosukhin, I. (2017). Attention is all you need. Paper presented at the 31st Annual Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, California.

  • van Dongen, T., Maillette de Buy Wenniger, G., & Schomaker, L. (2020, November). SChuBERT: Scholarly document chunks with BERT-encoding boost citation count prediction. Paper presented at the 1st Workshop on Scholarly Document Processing (SDP 2020), Online.

  • Viswanathan, V., Neubig, G., & Liu, P. (2021, August). CitationIE: Leveraging the citation graph for scientific information extraction. Paper presented at the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021), Online.

  • Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., …, & Rush, A. M. (2019). HuggingFace's Transformers: State-of-the-art natural language processing. Preprint at https://arxiv.org/abs/1910.03771

  • Wright, D., & Augenstein, I. (2021). CiteWorth: Cite-worthiness detection for improved scientific document understanding. Paper presented at the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021), Online.

  • Yang, Y., Uy, M. C. S., & Huang, A. (2020). FinBERT: A pretrained language model for financial communications. https://doi.org/10.48550/arXiv.2006.08097

Acknowledgements

The authors acknowledge the National Natural Science Foundation of China (Grant Numbers 71974094 and 72004169) for financial support, as well as the data annotation team of Nanjing Agricultural University and Nanjing University of Science and Technology. The authors also thank the students and researchers who helped revise and polish the paper.

Author information

Corresponding author

Correspondence to Si Shen.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest related to this work.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Shen, S., Liu, J., Lin, L. et al. SsciBERT: a pre-trained language model for social science texts. Scientometrics 128, 1241–1263 (2023). https://doi.org/10.1007/s11192-022-04602-4
