KG4Py: A toolkit for generating Python knowledge graph and code semantic search

In the era of big data, numerous duplicate code snippets exist on the Internet, and it is especially worthwhile to reuse them when building new software projects. In this paper, we present a toolkit (KG4Py) for generating a knowledge graph of Python files in GitHub repositories and conducting semantic search over that knowledge graph. In KG4Py, we remove all duplicate files among 317 K Python files and perform static code analyses of these files with a concrete syntax tree (CST) to build a code knowledge graph of Python functions. We integrate a pre-trained model with an unsupervised model to generate a new model, and combine this new model with the code knowledge graph to search code snippets with natural language descriptions. The experimental results show that KG4Py achieves good performance in both the construction of the code knowledge graph and the semantic search of code snippets.


Introduction
Software reusability is an important part of software engineering. Software reuse not only reduces duplicate effort in software development, but also improves the quality of project development (Wang et al., 2019). The core of software reuse is the utilisation of duplicate code snippets, and code search addresses exactly this problem. Traditional code search is mainly keyword-based and cannot mine the deep semantic information of query statements. Currently, searching for code snippets on GitHub is limited to keyword search, which relies on users being able to predict what keywords might appear in the comments around the code snippets they are looking for. This approach suffers from poor portability and interpretability, and it cannot conduct a semantic search over code snippets. For these reasons, we introduce knowledge graphs to address the challenges faced in code semantic search.
The Knowledge Graph (KG) was formally introduced by Google in 2012. It is a symbolic representation of the physical world: a graph that describes real-world entities (people, objects, concepts) and a network-like knowledge base connected by entities and relationships. Numerous knowledge graphs, such as Freebase (Kurt et al., 2008), YAGO (Manu et al., 2012), Wikidata (Denny & Markus, 2014) and OpenKG, have been built in recent years. These knowledge graphs have brought convenience to people in different fields, such as search engines, recommendation systems (Arul & Razia, 2021; Xie et al., 2021; Yu et al., 2020), intelligent question answering (Zhang et al., 2018) and decision analysis (Hao et al., 2021; Qiao et al., 2020).
Inspired by these knowledge graphs, researchers have thought about how to build knowledge graphs in software engineering. Big data of code provides data sources for knowledge graph construction, and deep learning-based methods provide assistance for automatic knowledge graph construction (Wang et al., 2020a).
With the growth of open-source software in the past few years, more and more software projects are appearing on major code hosting platforms such as GitHub, GitLab and Bitbucket. In order to leverage their source code, APIs, documents, etc., researchers have built various knowledge graphs in the software domain. Meng et al. (2017) constructed an Android application knowledge graph for automatic analysis. Wang et al. (2020b) used the Wikipedia taxonomy to build a software development knowledge graph. Inspired by these knowledge graphs, we considered how to create a knowledge graph of Python functions.
With knowledge graph search systems, we can uncover more of the hidden information behind a query. Wang et al. (2019) proposed and implemented a knowledge graph-based interface for natural language queries over projects. It took the meta-model from the knowledge base and built the inference subgraph associated with the questions, then automatically converted the natural language questions into structured query statements and returned the relevant answers. This also provided inspiration for our code semantic search.
For the storage layer, knowledge graphs typically use Neo4j, GraphDB or other graph databases to store data and use specific query languages to retrieve it. In this paper, we choose Neo4j because it supports rich semantic tag descriptions, reads and writes data quickly, offers readable query statements and represents semi-structured data easily. This makes it possible for us to build the code knowledge graph and the code semantic search.
The structure of this paper is as follows: In Section 2, we review the research related to knowledge graph construction and search. In Section 3, we present the general framework and each component of our proposed method in detail. In Section 4, we introduce the steps of the experiment and analyse the results. In Section 5, we discuss some of the limitations of our toolkit. In Section 6, we briefly summarise our work and the prospects for future work.
The main work and contributions of this study are listed below:
• We develop a lightweight static code parser to extract the information of Python files for building knowledge graphs.
• We construct an unsupervised model for parsing sentence-level semantic information and use it for code search.

Related work
In recent years, researchers have conducted research on knowledge graphs in the field of software engineering, but relatively few studies have combined programming language knowledge graphs with question-and-answer (Q&A) systems.

Knowledge graphs for programming languages
Knowledge graphs in the programming languages domain usually focus on code analysis and are applied to tasks such as code search, question answering and recommendation. Liu et al. (2019) crawled the structural and descriptive knowledge of API documents from the official Java and Android documentation to construct API knowledge graphs and applied them to the comparison between APIs. Abdelaziz et al. (2020, 2021) used machine learning methods to build the first large-scale code knowledge graph. They analysed Python code on GitHub and posts on Stack Overflow, and demonstrated one application of this knowledge graph: a code recommendation engine for programmers within an IDE. Although their model works well in some respects, it ignores semantic information between code snippets and does not work well for code semantic search tasks. In this work, we apply KG4Py to the Python programming language and use concrete syntax trees, which enable the model to better extract the code snippets.

Knowledge graphs for question-and-answer systems
Researchers typically use knowledge graphs to enhance the understanding and inference of users' queries in Q&A systems. Feng et al. (2021) crawled data from the business incubation domain to build a knowledge graph, and used a pattern matching algorithm that combined natural language and entities from the knowledge base to parse the query. Li and Zhao (2021) crawled a physical fitness dataset and matched it with templates to build a physical fitness knowledge graph question-and-answer system. Wang et al. (2020b) used the Wikipedia taxonomy to create an open-source community-based software knowledge graph, and linked entities and relationships from query statements to this knowledge graph to obtain the answers. Their models helped us further understand the steps needed to build a semantic search over code snippets.

Method
In this section, we introduce the methods of building the code knowledge graph and constructing the semantic search of the code snippets.

Building the code knowledge graph
Before building the code knowledge graph, we need to perform code analysis on Python files. We use LibCST 1 (a concrete syntax tree parser and serializer library for Python) to parse the code instead of the abstract syntax tree (AST). The AST does a great job of preserving the semantics of the original code, and the structure of the tree is relatively simple, but much of the surface form of the code is difficult to recover from it: given only the AST, it would not be possible to reprint the original source code. Like a JPEG, the abstract syntax tree is lossy; it cannot capture the comment information we leave. A concrete syntax tree (CST) retains enough information to reprint the exact input code, but it is hard to implement complex operations on. LibCST is a compromise between the two formats outlined above. Like an AST, LibCST parses the source code into nodes that represent the semantics of the code snippets. Like a CST, LibCST retains all comment information and can be reprinted exactly. The differences in code analysis between AST, CST and LibCST are shown in Figure 1. We use LibCST to do a static code analysis of the Python files and identify the "import", "class" and "function" in each file. For each function, we also identify its parameters, variables and return values. To make the search results more accurate, we use the CodeT5 model to generate a description for each function, which is similar to text summarisation (Fang et al., 2020; Lin et al., 2022). Finally, we save the results in JSON-formatted files. The pipeline we use is shown in Figure 2.
We extract the relevant entities and attributes from the processed JSON-format files and use them to build the code knowledge graph.

Semantic search in the model
Traditional search engines only retrieve answers by matching keywords, while semantic search systems retrieve answers by segmenting and understanding sentences. Before semantic search, the questions and answers in the database are embedded into a vector space. When searching, we embed the segmented and parsed question into the same vector space, and calculate the similarity between the vectors to display the answers with high similarity. Next, we introduce the selection of the semantic search model.
Researchers have begun to input individual sentences into BERT (Devlin et al., 2018) and derive fixed-size sentence embeddings. BERT has been shown to be a strong player in all major Natural Language Processing (NLP) tasks, and the semantic similarity computation task is no exception. However, BERT requires that two sentences be fed into the model at the same time so that they can interact when computing semantic similarity, which causes a large computational overhead. Therefore, we choose and fine-tune the Sentence-BERT (Reimers & Gurevych, 2019) model to perform the code semantic search, as shown in Figure 3. In simple terms, it draws on the Siamese network framework: different sentences are fed into two BERT models that share parameters (which can also be understood as the same BERT model) to obtain a representation vector for each sentence, and the resulting sentence vectors can be used for semantic similarity calculation or unsupervised clustering tasks. For the same 10,000 sentences, we only need to run the encoder 10,000 times to find the most similar sentence pairs, which takes about 5 s in total, while pairwise BERT takes about 65 h.

Encoders in the semantic search model
We use the max pooling strategy: for each dimension, we take the maximum over all the token vectors that BERT produces for the sentence, and use the resulting vector as the sentence embedding.
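As a minimal numpy sketch of this pooling step (the token vectors here are toy values, not real BERT outputs):

```python
import numpy as np

def max_pool(token_vectors: np.ndarray) -> np.ndarray:
    """Element-wise maximum over the token axis: (n_tokens, dim) -> (dim,)."""
    return token_vectors.max(axis=0)

# Toy "contextual embeddings" for a 3-token sentence, dim = 4.
tokens = np.array([
    [0.1, 0.9, -0.3,  0.0],
    [0.5, 0.2,  0.7, -1.0],
    [-0.2, 0.4, 0.1,  0.3],
])
sentence_vec = max_pool(tokens)
```

The pooled vector's length equals the model's hidden size regardless of sentence length, which is what allows fixed-size comparison between sentences.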
Cross-encoders connect questions and answers through full self-attention, so they are more accurate than Bi-encoders. However, Cross-encoders need a lot of time to compute the contextual relationship between each question and answer, while Bi-encoders save much time by encoding questions and answers separately. What's more, Cross-encoders do not scale when the number of questions and answers is large. Therefore, we use Cross-encoders only to parse query statements. The structure of Cross-encoders is shown in Figure 4. We use Bi-encoders to perform self-attention on the input and the candidate labels separately, map them into a dense vector space, and then combine them at the end to obtain the final representation. Bi-encoders are able to index the encoded candidates and compare these representations for each input, thus speeding up prediction. The time is reduced from 65 h (using Cross-encoders) to about 5 s for clustering the same 10,000 sentences.2 Therefore, we combine Bi-encoders with unsupervised methods to train on label-free question-and-answer pairs. The structure of Bi-encoders is shown in Figure 5.

Distribution of encoders
In the case of regression tasks, such as asymmetric semantic search, we compute the sentence embeddings u and v and the cosine similarity of each pair of sentences, then multiply the similarity by a trainable weight W_t. We use Mean Squared Error (MSE) loss as the objective function, i.e. the mean of (cos-sim(u, v) · W_t − score)^2 over the training pairs, where score is the gold similarity label. In asymmetric semantic search, the user provides a query such as a few keywords or a question, but wants to retrieve a long text passage that provides the answer (Do & Nguyen, 2021). So, we use unsupervised learning methods to solve asymmetric semantic search tasks, such as using natural language descriptions to search code snippets. These methods have in common that they do not require labelled training data; instead, they learn semantically meaningful sentence embeddings from the text itself. Cross-encoders are only suitable for reranking a small set of natural language descriptions. To retrieve suitable natural language descriptions from a large collection, we have to use Bi-encoders. Queries and descriptions are independently encoded as fixed-size embeddings in the same vector space, and relevant natural language descriptions can then be found by calculating the distance between vectors. Therefore, we combine Bi-encoders with unsupervised methods to train tasks in the field of label-free code search, and use Cross-encoders to receive user input and compute the cosine similarity between the question and the natural language description.
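This retrieve-then-rerank flow can be sketched with stand-in scorers. The hash-based "embedding" and token-overlap scorer below are toy substitutes for the trained Bi-encoder and Cross-encoder, chosen only so the sketch is self-contained:

```python
import numpy as np

def bi_encode(text: str, dim: int = 32) -> np.ndarray:
    """Toy Bi-encoder: a unit-norm hashed bag-of-words vector."""
    vec = np.zeros(dim)
    for tok in text.lower().split():
        vec[hash(tok) % dim] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

def cross_score(query: str, candidate: str) -> float:
    """Toy Cross-encoder: Jaccard overlap between token sets."""
    q, c = set(query.lower().split()), set(candidate.lower().split())
    return len(q & c) / max(len(q | c), 1)

def search(query: str, descriptions: list, k: int = 2) -> list:
    # Stage 1: Bi-encoder retrieval by cosine similarity (vectors are unit-norm,
    # so the dot product is the cosine similarity).
    emb = np.stack([bi_encode(d) for d in descriptions])
    sims = emb @ bi_encode(query)
    top = np.argsort(-sims)[:k]
    # Stage 2: Cross-encoder reranking of the small retrieved set only.
    return sorted((descriptions[i] for i in top),
                  key=lambda d: cross_score(query, d), reverse=True)

descriptions = ["sort a list of numbers", "open a file for reading", "reverse a string"]
results = search("how to sort numbers", descriptions, k=2)
```

The expensive pairwise scorer thus only ever sees k candidates, not the whole collection, which is the scaling argument made above.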

Data de-duplication
We take the dataset used by Mir et al. (2021a, 2021b), Jiang et al. (2021), Feng et al. (2020), Guo et al. (2020) and Lu et al. (2021), which is a JSON-formatted file containing the URLs of repositories for Python projects on GitHub ranked by stars. We use 3065 Python projects and perform static code analyses on them. Before training a Machine Learning (ML) model, it is essential to de-duplicate the code corpus (Allamanis, 2019). Therefore, we tokenise the Python files, use Term Frequency-Inverse Document Frequency (TF-IDF) to vectorise them, and perform a k-nearest neighbour search to identify candidate duplicate files.3 We spend more than 4 h detecting duplicate files and write the results to a text file. We then remove the files at the corresponding file paths for the purpose of de-duplication. All experiments are performed on an Ubuntu 16.04 server with an NVIDIA Tesla P100 GPU. The duplication information is shown in Table 1.
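The de-duplication step can be sketched with scikit-learn. The corpus below is a toy stand-in for the 317 K tokenised files:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Toy corpus of tokenised files; files 0 and 2 are exact duplicates.
files = [
    "def add a b return a plus b",
    "def read path open path return data",
    "def add a b return a plus b",
]

vectors = TfidfVectorizer().fit_transform(files)
nn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(vectors)
distances, indices = nn.kneighbors(vectors)

# A file whose nearest non-self neighbour sits at (near-)zero cosine
# distance is a duplicate candidate.
duplicates = set()
for i in range(len(files)):
    for j, d in zip(indices[i], distances[i]):
        if j != i and d < 1e-6:
            duplicates.add(tuple(sorted((i, int(j)))))
```

In practice the distance threshold would be loosened to also catch near-duplicates rather than only byte-identical files.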

Code analysis
We construct a pipeline to implement code information extraction, which is divided into two parts:
• generating structured information of the code with concrete syntax trees;
• analysing and integrating the extracted structured information.
In this pipeline, we take a user's repository folder as the smallest input unit and perform static code analyses of all Python files in the folder. We also use the CodeT5 (Wang et al., 2021) model to generate code descriptions (maximum length is set to 20) for each function and store the results in a JSON-formatted file. Table 2 shows the field definitions in each JSON-formatted file.
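A hypothetical JSON record for one function might look like the following; the concrete field names are those defined in Table 2 and may differ from this sketch:

```python
import json

# Illustrative record for one parsed function (field names are hypothetical).
record = {
    "file_path": "utils/math_ops.py",
    "function_name": "add",
    "parameters": ["a", "b"],
    "return_type": "int",
    "description": "Add two numbers and return the result.",  # CodeT5-style summary
}

serialised = json.dumps([record], indent=2)
```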

Building the knowledge graph
After defining the entities and relationships in the code knowledge graph, as shown in Table 3 and Table 4, we extract the relevant information from the JSON-formatted files and save it in CSV-formatted files. We use Cypher statements to build the knowledge graph, as shown in Figure 6, and further Cypher statements to create the relationships between entities; the number of entities is also tabulated.
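One way such Cypher statements could be generated from the JSON records is sketched below; the node labels, properties and relationship type are illustrative, not necessarily the paper's exact schema:

```python
def function_cypher(record: dict):
    """Return a parameterised Cypher statement and its parameters for one
    function record (hypothetical schema: File-[:CONTAINS]->Function)."""
    statement = (
        "MERGE (f:File {path: $path}) "
        "MERGE (fn:Function {name: $name, description: $desc}) "
        "MERGE (f)-[:CONTAINS]->(fn)"
    )
    params = {
        "path": record["file_path"],
        "name": record["function_name"],
        "desc": record["description"],
    }
    return statement, params

stmt, params = function_cypher({
    "file_path": "utils/math_ops.py",
    "function_name": "add",
    "description": "Add two numbers.",
})
```

With the official neo4j Python driver, a statement like this would be executed inside a session (e.g. `session.run(stmt, params)`); using `$`-parameters rather than string interpolation avoids quoting problems in code text.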

Steps of the search system
The knowledge graph-based search system is divided into three modules: query analysis, information retrieval and answer extraction (Feng et al., 2021). Our search model is also divided into three parts, as shown in Figure 7.
(1) The query analysis module pre-processes the query statement and vectorises it.
(2) The information retrieval module translates the results of question understanding into query statements for the graph database.
(3) The answer extraction module outputs the similarity scores based on the matching vectors and returns them.
When searching, the model conducts semantic analysis of the question asked by the user, and then matches it against the knowledge graph for recognition. If the query entities are identified, we connect the entities in the question to the corresponding entities in the KG; otherwise, we expand the knowledge graph and replenish it. The information retrieval module generates specific query statements by matching the query templates of the graph database to get the results. Finally, the answer extraction module sorts the results by their similarity and returns them.

Training and fine-tuning the search system
We use the Python dataset in CodeSearchNet (Husain et al., 2019) to rebuild a dataset for code search. We select 138 K (124 K training and 14 K test samples) code snippets with their natural language descriptions, and then use the T5 model (Thakur et al., 2021) to generate interrogative sentences for each natural language description; for each description we generate three such interrogative-description pairs. The maximum length of each query statement is 64, and the maximum length of the answer to each query statement is 100. Finally, we save the generated 276 K statement pairs in a TSV-formatted file. The T5 model is shown in Figure 8. For the Multiple Negatives Ranking Loss, it is important that a batch does not contain duplicate entries, i.e. no two equal queries and no two equal answers. We de-duplicate these generated (query, relevant_answer) pairs and feed them into the Bi-encoders in Sentence-Transformers for training. Finally, we fine-tune the parameters of the model to make it suitable for semantic search.
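The pair de-duplication required by Multiple Negatives Ranking Loss can be sketched as a single pass that rejects any pair reusing an already-seen query or answer:

```python
def deduplicate_pairs(pairs):
    """Keep only pairs whose query AND answer are both unseen, so a training
    batch never contains two equal queries or two equal answers (a requirement
    of Multiple Negatives Ranking Loss, where other in-batch answers act as
    negatives)."""
    seen_q, seen_a, out = set(), set(), []
    for query, answer in pairs:
        if query in seen_q or answer in seen_a:
            continue
        seen_q.add(query)
        seen_a.add(answer)
        out.append((query, answer))
    return out

pairs = [("q1", "a1"), ("q1", "a2"), ("q2", "a1"), ("q3", "a3")]
clean = deduplicate_pairs(pairs)
```

The cleaned pairs would then be wrapped as training examples for the Bi-encoder (in Sentence-Transformers, via its `MultipleNegativesRankingLoss`).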

Evaluation results of the search model
We use Normalised Discounted Cumulative Gain (NDCG), Mean Average Precision (MAP) and Mean Reciprocal Rank (MRR) as the evaluation metrics, which have been widely used in various search and recommendation systems (Lian & Tang, 2022). The results are shown in Table 6.

Performance on the search model
The search steps of our model are listed below:
(1) We convert the relevant function names and natural language descriptions into vectors and input them into the model.
(2) Inspired by Tang et al. (2021), we use NLTK 4 to remove stop words and noise tokens such as "???" from the query statement, convert it into vectors and input them into the model as well.
(3) We compute the cosine similarity between the query statement and the natural language descriptions of the interrogatives in the model, and return the Top-K function names and natural language descriptions that are most similar to it.
(4) We convert the Top-K function names into Cypher statements and display them in the knowledge graph.
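Steps (2) and (3) can be sketched as follows; the stop-word list is a small stand-in for NLTK's, and the vectors are toy values rather than real model embeddings:

```python
import numpy as np

STOP_WORDS = {"how", "to", "a", "the", "do", "i", "???"}  # stand-in for NLTK's list

def preprocess(query: str) -> list:
    """Lowercase, split, and drop stop words / noise tokens."""
    return [tok for tok in query.lower().split() if tok not in STOP_WORDS]

def top_k(query_vec, description_vecs, names, k=3):
    """Return the k function names whose description vectors are most
    cosine-similar to the query vector, with their scores."""
    d = description_vecs / np.linalg.norm(description_vecs, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    sims = d @ q
    order = np.argsort(-sims)[:k]
    return [(names[i], float(sims[i])) for i in order]

names = ["parse_json", "sort_list", "open_file"]
vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # toy description embeddings
results = top_k(np.array([0.0, 1.0]), vecs, names)
```

The returned Top-K names would then be substituted into Cypher MATCH statements to highlight the corresponding nodes in the knowledge graph.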
We run the search model on a machine with an Intel(R) Xeon(R) CPU @ 2.30 GHz and an NVIDIA Tesla P100 GPU separately. For each search statement, we return the average running time, three results and their cosine similarity scores. We test a number of queries and repeat each query 1000 times. Table 7 shows a part of the search results and times. Figure 9 shows the search results in the knowledge graph.

Industrial values
The main work of this paper is a semantic search over code snippets built on a knowledge graph of Python functions. During the code analysis phase, we found a large number of duplicate code snippets on GitHub, which not only waste resources but also increase the development costs of software companies. These problems can be eased by our toolkit, and its industrial value can be divided into two parts:
• For individual developers, our toolkit can be used not only to search for code, but also to deepen the understanding of code snippets through our code knowledge graph.
• For software companies, our toolkit can find similar code snippets in the enterprise code base by understanding the semantics of a function's annotations, and recommend them to developers if they exist. In this way, it not only reduces the duplication of developers' work, but also cuts down the development costs of software companies.

Threats to validity
The first threat to validity is the limitation of programming languages. Our toolkit only performs static code analysis for Python, so it only supports code search in Python. Another threat is the depth of code analysis. We want to build an enterprise code base to reduce software development costs, but we only analyse attributes of the code snippets such as the return value, return type and function parameters, which are not sufficient to compare the similarity between two functions. In the future, we will analyse the data flow and control flow of code snippets and use them to determine the similarity of code snippets instead of text similarity.
The third threat is the accuracy of search results. We only remove the stop words from the query, but do not segment it further according to the parts of speech of its words. In the future, we will split the query statement into phrases and assign weights according to part of speech to improve the accuracy of the search system.

Conclusion
In our research, we perform a static code analysis on Python-based repositories on GitHub, extract the functions from the files and build a code knowledge graph based on these functions. We construct an unsupervised model and combine it with the code knowledge graph for semantic search of code snippets. In the future, we will integrate the data flow and control flow of functions into the code knowledge graph to give users a deeper understanding of each function. For semantic search, there is still room to improve the speed of question retrieval and matching. We will study these issues in depth, in the hope of parsing query statements in a simpler way while reducing the retrieval time of the search model.