Software system comparison with semantic source code embeddings

Empirical Software Engineering

Abstract

This paper presents a novel approach for comparing software systems by calculating the robust Hausdorff distance between semantic source code embeddings of individual software components, i.e., methods. The proposed approach represents each software system as a set of vectors, where every vector is a semantic source code embedding of a particular method. The code embeddings are constructed from the abstract syntax trees of the methods with the help of attention-based neural network models that capture the semantics of the methods. Previous research has shown that comparing semantic source code embeddings can reveal semantic relationships between two methods. We utilize this characteristic to estimate the semantic similarity between two software systems by computing the robust Hausdorff distance between their sets of embeddings. In the experiment, a pre-trained code2vec neural network model is used to create the source code vector representations of several open-source Java-based libraries. Several variations of the robust Hausdorff distance are evaluated. The results show that the proposed approach can effectively estimate semantic similarity, reflecting the libraries’ scopes, their evolution, and their individual parts (e.g., packages).

Acknowledgements

Conceptualization: all authors; Methodology: all authors; Formal analysis and investigation: all authors; Writing - original draft preparation: Sašo Karakatič and Tjaša Heričko; Writing - review and editing: all authors; Supervision: Sašo Karakatič.

Funding

This work was supported by the Slovenian Research Agency (Research Core Funding No. P2-0057).

Author information

Corresponding author

Correspondence to Tjaša Heričko.

Ethics declarations

Conflict of Interests

The authors declare no conflict of interest.

Additional information

Communicated by: Meiyappan Nagappan and Tim Menzies

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article belongs to the Topical Collection: Inventing the Next Generation of Software Analytics

Appendices

Appendix A

The additional software libraries included in the extended experiments were the following.

The basic source code, usage, and repository statistics of the additional software libraries used in the extended experiment are shown in Table 9.

Table 9 Descriptive statistics for the additional software libraries used in the extended experiment

The differences between the software library distances with and without test parts on the smaller set of eight software libraries are similar to those on this larger set of 22 software libraries (ΔM_distances = 0.034, ΔSD_distances = 0.019; Wilcoxon signed-rank test, W = 149, p < 0.001).
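
As an illustration of how such a paired comparison can be computed, the sketch below implements a two-sided Wilcoxon signed-rank test in plain Python. It is a simplified version under stated assumptions: zero differences are dropped, the statistic is taken as the smaller of the two rank sums (the text does not say which convention the reported W follows), and the p-value uses the normal approximation without tie or continuity correction, which is only reasonable for larger samples. The sample values are invented and are not data from the experiment.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test for paired samples x and y.

    Returns (w, p): w is the smaller of the positive and negative rank sums,
    p is the two-sided p-value from the normal approximation.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences are dropped
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences; tied values share their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)

    # Normal approximation of the null distribution of the rank sum.
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return w, p


# Invented pairwise distances for the same library pairs, with and without tests.
with_tests = [0.52, 0.61, 0.47, 0.70, 0.58, 0.66]
without_tests = [0.50, 0.56, 0.48, 0.63, 0.55, 0.60]
w, p = wilcoxon_signed_rank(with_tests, without_tests)
```

For a sample as small as this toy one, an exact test from a statistics package would be preferable to the normal approximation used here.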

Appendix B

The undirected distances between the software libraries from Table 10 are also presented in a scatter plot (Fig. 9) after multidimensional scaling. Note that json simple was not included in this visualization, as it is the most unusual of the included libraries and thus skewed the multidimensional scaling. The JSON libraries clearly cluster in the same region, as do the XML libraries and the testing libraries. The general-purpose libraries lie in the middle of the three clusters.

Table 10 Undirected Hausdorff distances between the software libraries
Table 11 Undirected Hausdorff distances between the software libraries without test parts
Fig. 9 Visualization of the distances after multidimensional scaling, without json simple
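
The projection of a pairwise distance matrix onto a two-dimensional scatter plot can be sketched with classical (Torgerson) multidimensional scaling. The text does not state which multidimensional scaling variant or tool was used, so the choice of classical MDS is an assumption, as are the function name, the power-iteration eigen-solver, and the toy distance matrix, which does not come from Table 10.

```python
import math
import random


def classical_mds(distances, dims=2, iterations=1000, seed=0):
    """Place n items in `dims` dimensions from a symmetric n x n distance matrix."""
    n = len(distances)
    # Double-centre the squared distances: B = -1/2 * J * D^2 * J.
    sq = [[d * d for d in row] for row in distances]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]

    rng = random.Random(seed)
    coords = [[0.0] * dims for _ in range(n)]
    for axis in range(dims):
        # Power iteration: converges to the eigenvector whose eigenvalue has
        # the largest magnitude in the (deflated) matrix.
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        for _ in range(iterations):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        # The Rayleigh quotient gives the signed eigenvalue belonging to v.
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        eigenvalue = sum(x * y for x, y in zip(v, bv))
        if eigenvalue <= 0.0:
            break  # no further real-valued axis; remaining coordinates stay 0
        for i in range(n):
            coords[i][axis] = math.sqrt(eigenvalue) * v[i]
        # Deflate so the next pass finds the next eigenpair.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigenvalue * v[i] * v[j]
    return coords


# Toy symmetric distance matrix for three hypothetical libraries.
toy = [[0.0, 3.0, 4.0],
       [3.0, 0.0, 5.0],
       [4.0, 5.0, 0.0]]
for name, (x, y) in zip(["lib_1", "lib_2", "lib_3"], classical_mds(toy)):
    print(f"{name}: ({x:.3f}, {y:.3f})")
```

Because Hausdorff-style distances need not be Euclidean, the centred matrix can have negative eigenvalues. This sketch simply stops adding axes in that case, whereas a library implementation may handle it differently. A single item that is far from all others stretches the dominant axis and compresses the rest of the plot, which is consistent with the decision to leave json simple out of the visualization.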

About this article

Cite this article

Karakatič, S., Miloševič, A. & Heričko, T. Software system comparison with semantic source code embeddings. Empir Software Eng 27, 70 (2022). https://doi.org/10.1007/s10664-022-10122-9

  • DOI: https://doi.org/10.1007/s10664-022-10122-9
