Software Entity Recognition Method Based on BERT Embedding

  • Conference paper
Machine Learning for Cyber Security (ML4CS 2020)

Abstract

The global open source software ecosystem contains rich information about software engineering. Existing methods for analyzing the text of knowledge communities in this field focus mainly on structural relationships and rule-based association mining. This paper proposes a software entity recognition method based on BERT word embeddings. First, a BiLSTM-CRF entity recognition model is built on word embeddings from the software engineering domain. Then, the word vectors in the model's input layer are improved by introducing the BERT pre-trained language model. The pre-training data are constructed from discussion content in the Stack Overflow software Q&A community, and BERT is pre-trained on these data to obtain word vector representations suited to software engineering. This improves entity recognition in the domain and addresses a weakness of traditional word embeddings, which are mostly trained on general-domain data, are not fully suited to software engineering, and cannot represent contextual semantic information well. To address the scarcity of annotated data in the software domain, the paper also extends the data through model prediction and dictionary matching, and evaluates the result experimentally. Overall, the paper applies deep learning to entity recognition in software engineering, supporting the extraction of software entities, the construction of software knowledge bases, and intelligent applications in software engineering.
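
The abstract does not say how dictionary matching is carried out, so the following is only an illustrative sketch of one common way to do it: greedy longest-match lookup of token spans in an entity lexicon, emitting BIO labels that a sequence tagger such as BiLSTM-CRF could be trained on. The function name `tag_with_dictionary`, the `LEXICON` entries, and the label names (`Lang`, `Lib`, `Platform`) are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple


def tag_with_dictionary(
    tokens: List[str], lexicon: Dict[Tuple[str, ...], str]
) -> List[str]:
    """Label tokens with BIO tags by greedy longest-match dictionary lookup."""
    # Longest entry in the lexicon bounds the span lengths we need to try.
    max_len = max((len(key) for key in lexicon), default=0)
    tags = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]  # case-insensitive matching

    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            label = lexicon.get(tuple(lowered[i:i + n]))
            if label is not None:
                tags[i] = f"B-{label}"
                for j in range(i + 1, i + n):
                    tags[j] = f"I-{label}"
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return tags


# Hypothetical lexicon: lower-cased token tuples mapped to entity types.
LEXICON = {
    ("python",): "Lang",
    ("numpy",): "Lib",
    ("stack", "overflow"): "Platform",
}

tokens = "I asked on Stack Overflow how to install NumPy for Python".split()
tags = tag_with_dictionary(tokens, LEXICON)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Sentences labeled this way would be noisy (a dictionary cannot resolve ambiguous names), which is presumably why the abstract pairs dictionary matching with model prediction rather than relying on either alone.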

Acknowledgment

This work was supported by Yunnan Key Laboratory of Smart Education, Yunnan Innovation Team of Education Informatization for Nationalities, Scientific Technology Innovation Team of Educational Big Data Application Technology in University of Yunnan Province, and Kunming Key Laboratory of Education Informatization.

Corresponding author

Correspondence to Wei Zou.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Sun, C., Tang, M., Liang, L., Zou, W. (2020). Software Entity Recognition Method Based on BERT Embedding. In: Chen, X., Yan, H., Yan, Q., Zhang, X. (eds) Machine Learning for Cyber Security. ML4CS 2020. Lecture Notes in Computer Science, vol 12488. Springer, Cham. https://doi.org/10.1007/978-3-030-62463-7_4

  • DOI: https://doi.org/10.1007/978-3-030-62463-7_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-62462-0

  • Online ISBN: 978-3-030-62463-7

  • eBook Packages: Computer Science, Computer Science (R0)
