A Study on Vulnerability Code Labeling Method in Open-Source C Programs

Zheng, Yaning; Wang, Dongxia; Cao, Huayang; Qian, Cheng; Kuang, Xiaohui; Zhuang, Honglin

doi:10.1007/978-3-031-39847-6_4

Yaning Zheng (ORCID: orcid.org/0009-0001-1232-4489), Dongxia Wang, Huayang Cao, Cheng Qian, Xiaohui Kuang & Honglin Zhuang

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14146)

Included in the following conference series:

International Conference on Database and Expert Systems Applications

Abstract

Existing vulnerability databases and open-source code platforms have accumulated a large amount of vulnerability information. Extracting vulnerability code samples from this information can help researchers study the causes of vulnerabilities, develop vulnerability detection technologies, and detect potential vulnerabilities. In this work, we collected 13 vulnerability code datasets covering various applications and analyzed them in seven aspects, including data sources, labeling methods, and application scenarios. We found several defects in these datasets, including duplicated data, incomplete information, and inaccurate labels. We also analyzed the extraction and labeling methods of these datasets and proposed three labeling technology frameworks: labeling based on text description, labeling based on patch analysis, and labeling based on vulnerability scanning. The proposed frameworks can be used to evaluate existing labeling methods and to guide future work on labeling vulnerability code samples, which can help build better vulnerability code datasets.
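To make the idea of labeling based on patch analysis concrete, the following is a minimal sketch of one common reading of it: lines deleted by a security patch are treated as candidate vulnerable lines in the pre-patch file. This is an illustrative assumption, not the framework proposed in the chapter. The function name `label_removed_lines`, the sample patch, and the file path `src/copy.c` are hypothetical, and the parser assumes a well-formed unified diff in which no deleted source line itself begins with `-- ` (such a line would be mistaken for a file header).

```python
import re

# Matches a unified-diff hunk header such as "@@ -10,5 +10,6 @@ ...".
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,\d+)? \+(\d+)(?:,\d+)? @@")


def label_removed_lines(diff_text):
    """Map each pre-patch file path to the line numbers the patch deletes.

    Assumption for illustration: deleted lines are candidate vulnerable
    lines. Real labeling pipelines usually need further filtering.
    """
    labels = {}
    current_file = None
    old_lineno = None  # current line number in the pre-patch file
    for line in diff_text.splitlines():
        if line.startswith("--- "):
            # Old-file header; drop the conventional "a/" prefix if present.
            path = line[4:].strip()
            current_file = path[2:] if path.startswith("a/") else path
            old_lineno = None
            continue
        if line.startswith("+++ "):
            continue  # new-file header carries no pre-patch information
        match = HUNK_HEADER.match(line)
        if match:
            old_lineno = int(match.group(1))
            continue
        if current_file is None or old_lineno is None:
            continue  # text outside any hunk (e.g. commit message)
        if line.startswith("\\"):
            continue  # "\ No newline at end of file" marker
        if line.startswith("-"):
            labels.setdefault(current_file, []).append(old_lineno)
            old_lineno += 1
        elif line.startswith("+"):
            continue  # added lines do not exist in the pre-patch file
        else:
            old_lineno += 1  # unchanged context line
    return labels


# Hypothetical patch replacing an unbounded strcpy with a bounded copy.
SAMPLE_PATCH = """\
--- a/src/copy.c
+++ b/src/copy.c
@@ -10,5 +10,6 @@ void copy_name(const char *src)
 {
     char buf[16];
-    strcpy(buf, src);
+    strncpy(buf, src, sizeof(buf) - 1);
+    buf[sizeof(buf) - 1] = 0;
     puts(buf);
 }
"""

if __name__ == "__main__":
    print(label_removed_lines(SAMPLE_PATCH))
```

For the sample patch, the only deleted line is the `strcpy` call at line 12 of the pre-patch file. A labeling rule this simple also marks lines deleted for unrelated reasons (refactoring, formatting), which is one way inaccurate labels of the kind the abstract mentions can arise.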

Author information

Authors and Affiliations

National Key Laboratory of Science and Technology on Information System Security, Beijing, China
Yaning Zheng, Dongxia Wang, Huayang Cao, Cheng Qian, Xiaohui Kuang & Honglin Zhuang

Corresponding author

Correspondence to Honglin Zhuang.

Editor information

Editors and Affiliations

University of Vienna, Vienna, Austria
Christine Strauss
University of Tsukuba, Ibaraki, Japan
Toshiyuki Amagasa
Johannes Kepler University Linz, Linz, Austria
Gabriele Kotsis
Vienna University of Technology, Vienna, Austria
A Min Tjoa
Johannes Kepler University Linz, Linz, Austria
Ismail Khalil

About this paper

Cite this paper

Zheng, Y., Wang, D., Cao, H., Qian, C., Kuang, X., Zhuang, H. (2023). A Study on Vulnerability Code Labeling Method in Open-Source C Programs. In: Strauss, C., Amagasa, T., Kotsis, G., Tjoa, A.M., Khalil, I. (eds) Database and Expert Systems Applications. DEXA 2023. Lecture Notes in Computer Science, vol 14146. Springer, Cham. https://doi.org/10.1007/978-3-031-39847-6_4

DOI: https://doi.org/10.1007/978-3-031-39847-6_4
Published: 18 August 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-39846-9
Online ISBN: 978-3-031-39847-6
eBook Packages: Computer Science, Computer Science (R0)
