Abstract
The global open source software ecosystem contains rich software engineering information. Existing methods for analyzing the text content of knowledge communities in this field focus mainly on structural relationships and on rule-based association and mining. This paper proposes a software entity recognition method based on BERT word embeddings. First, a BiLSTM-CRF entity recognition model is constructed by combining it with word vector embeddings for the software engineering field. Then, the word vectors in the model's input layer are improved by introducing the BERT pre-trained language model. The pre-training data are constructed from discussion content in the Stack Overflow software Q&A community and are used to pre-train the BERT model, yielding word vector representations suited to the software engineering field. This improves entity recognition in the field and addresses the problem that traditional word embeddings are mostly trained on general-domain data, which do not fully suit software engineering and cannot represent contextual semantic information well. In addition, because annotated data in the software field are scarce, this paper extends the data by model prediction and dictionary matching, and tests this experimentally. In this way, the paper uses deep learning to realize entity recognition in the software engineering field, providing support for software entity extraction, software knowledge base construction, and intelligent applications of software engineering.
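The dictionary-matching step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the matching procedure, so the greedy longest-match strategy, the BIO tag scheme, the function name `dictionary_bio_tags`, and the entries and labels in `SOFTWARE_TERMS` are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Dict, List, Tuple


def dictionary_bio_tags(
    tokens: List[str],
    dictionary: Dict[Tuple[str, ...], str],
    max_len: int = 3,
) -> List[str]:
    """Assign BIO tags by greedy longest-match lookup (illustrative only)."""
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest span first so a multi-word term beats its prefix.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in dictionary:
                label = dictionary[key]
                tags[i] = "B-" + label
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + label
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return tags


# Hypothetical dictionary: lowercased token tuples mapped to assumed labels.
SOFTWARE_TERMS = {
    ("java",): "Lang",
    ("numpy",): "Lib",
    ("visual", "studio"): "Tool",
}

tokens = "How to call NumPy from Visual Studio".split()
tags = dictionary_bio_tags(tokens, SOFTWARE_TERMS)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Sentences labeled this way could serve as additional weakly annotated training data for a sequence tagger; in practice such labels are noisy and would likely need filtering before use.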
Acknowledgment
This work was supported by Yunnan Key Laboratory of Smart Education, Yunnan Innovation Team of Education Informatization for Nationalities, Scientific Technology Innovation Team of Educational Big Data Application Technology in University of Yunnan Province, and Kunming Key Laboratory of Education Informatization.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Sun, C., Tang, M., Liang, L., Zou, W. (2020). Software Entity Recognition Method Based on BERT Embedding. In: Chen, X., Yan, H., Yan, Q., Zhang, X. (eds) Machine Learning for Cyber Security. ML4CS 2020. Lecture Notes in Computer Science, vol 12488. Springer, Cham. https://doi.org/10.1007/978-3-030-62463-7_4
DOI: https://doi.org/10.1007/978-3-030-62463-7_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-62462-0
Online ISBN: 978-3-030-62463-7
eBook Packages: Computer Science, Computer Science (R0)