An Importance Assessment Model of Open-Source Community Java Projects Based on Domain Knowledge Graph

With the rise of open-source software, the social development paradigm occupies an indispensable position in the current software development process. This paper proposes a variant of the PageRank algorithm as the basis of an importance assessment model, which provides quantifiable importance metrics for new Java projects that are built on Java open-source projects or components. The key step of the model is to use crawlers to collect information about Java open-source projects in the GitHub open-source community and to build a domain knowledge graph from it. Each project is then measured along three dimensions: project influence, project activity, and project popularity. A modified PageRank algorithm combines these measurements into an importance score. The model is applied to 4512 Java open-source projects obtained from GitHub and achieves good results.


Introduction
In recent years, with the deep integration of software collaborative development technology and social networks, the social development paradigm has occupied an indispensable position in the current software development process [1]. As of April 2018, the Java programming language topped the TIOBE ranking with 15.777% usage, up 0.21% from April 2017. Meanwhile, most developers of Java projects rely on project construction tools [2]. According to the statistics of the DZone website, Apache Maven [3], Apache Ant [4], and Gradle [5] are currently the most widely used project development tools. Based on mechanisms such as watch, fork, and star provided by the GitHub open-source community, combined with the pom.xml, build.xml, and build.gradle configuration files of the Java project development tools, this paper constructs a domain knowledge graph of Java open-source projects [6] and uses visualization technology to display the relationships between projects more intuitively.
As an open-source community, GitHub is open to groups of all kinds. Its participants come from different industries and positions, and their interests, experience, and level of education vary, so they may take part in projects that do not match their own field. As a result, this widespread group creation behavior [7] tends to have some disadvantages. Project developers sometimes need to use an open-source community project as the base version for secondary development, and there may be multiple open-source projects with the same or similar functions. Without a basis for evaluation, a developer may select a project that has few maintainers and low reliability, which can cause many problems during iterative development. Therefore, the development of the open-source community will be affected to some degree if public creative development activities are not properly guided [8][9][10].
To address the above problems, this paper presents an importance evaluation model for open-source community Java projects based on a domain knowledge graph. The model measures Java open-source projects along three dimensions (project influence, project activity, and project popularity) and evaluates their importance with the PageRank algorithm. Specifically, the main contributions of this paper are to build a knowledge graph of open-source community Java projects and to propose a project importance assessment model. The research significance of this paper mainly includes the following two aspects: (1) Assisting the open-source community in the quantitative analysis of project importance. By quantitatively evaluating and scoring each Java open-source project and then ranking and comparing the projects, the model provides a reference for project developers; (2) Guiding project development. When project developers select one of several open-source projects as the base version for iterative development, the model can recommend high-quality projects through project importance assessment and quantitative analysis of project reliability and importance.

Background
In the open-source community, project developers release a wide variety of projects, and it is not easy to find the most popular and representative projects among them. If the various attributes of open-source projects in GitHub can be integrated to evaluate project importance, project developers can save a lot of time and obtain a scientific reference. OSSEAN [11] is an analytical search platform for global open-source software. It has crawled many open-source software resources from the collaborative development community and the knowledge sharing community of open-source software, ranked open-source software according to the group wisdom contained in the discussions and feedback of participants, and then recommended high-quality open-source software. In recent years, Google's knowledge graph technology has attracted widespread attention, and the "open knowledge graph" has also emerged. The open-source community and the open knowledge graph share a similar openness and a common goal: both aim to integrate knowledge. However, because little knowledge graph technology has been publicly disclosed, there is little related research on knowledge graphs in the software domain, and knowledge graphs for this domain remain largely lacking, even though a knowledge graph can effectively guide software development, updating, maintenance, and so on. Liu et al. [12] made a comprehensive analysis of the key technologies involved in the construction of knowledge graphs, based on the definition and technical framework of the knowledge graph. The importance assessment of nodes in complex networks has long been a focus of many scholars. Zhang et al. [13] drew on the PageRank algorithm and, combining it with the characteristics of node importance evaluation in complex networks, captured the influence of the overall link relationship on node importance in directed weighted complex networks.
Zhang [16] proposed an algorithm for quantitative evaluation of the importance of complex network nodes, which took the grey relational degree as a measure to evaluate the relevance of each node in the network with the ideal "core node". An evaluation of the importance of open-source projects can provide important clues for developers to search for high-quality projects.

Importance Assessment Model
This section first introduces the construction process of the domain knowledge graph, then introduces the impact factors of project importance assessment, and finally gives the improved PageRank importance assessment method.

Building Domain Knowledge Graph
Because the development experience of project developers differs widely and open-source projects are highly scattered, the difficulty of the importance assessment of open-source projects lies in how to make reasonable use of the relevant attribute data in GitHub, so as to provide a reference for project developers.
The main idea of the knowledge graph construction in this paper is to make full use of the activity records of project developers in the GitHub open-source community, and to integrate data such as project attributes, the correlation between projects, and the participation in project contributions into a relationship network diagram, so as to construct the domain knowledge graph. It is mainly divided into the following steps: (1) Obtain Java open-source project data. First, all Java project data for projects built with Apache Maven, Apache Ant, and Gradle in the GitHub open-source community are collected, including watch, fork, star, commits, etc. Second, the configuration files of the three project development tools (pom.xml, build.xml, and build.gradle) are analyzed to acquire the data that determine the unique identity, dependencies, and description of each project. (2) Build the relationship network diagram. With each Java open-source project as a node, the dependency relationships between projects are used to build a directed relationship network diagram, and the in-degree and out-degree of each node are determined by the dependencies between projects. Taking Fig. 1 as an example, the in-degree and out-degree of the project node PMD are 0 and 13, respectively.
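The two steps above can be sketched in Python for the Maven case. This is a minimal illustration rather than the crawler used in the paper: the function names, the choice of "groupId:artifactId" as the unique project identity, and the sample POM are assumptions made for this example, and inherited or property-based coordinates are ignored.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Standard XML namespace used by Maven pom.xml files.
POM_NS = {"m": "http://maven.apache.org/POM/4.0.0"}


def parse_pom(pom_text):
    """Return (project_id, [dependency_ids]) from the text of a pom.xml."""
    root = ET.fromstring(pom_text)

    def coord(node):
        # Assumption of this sketch: a project is identified by groupId:artifactId.
        group = node.findtext("m:groupId", default="", namespaces=POM_NS)
        artifact = node.findtext("m:artifactId", default="", namespaces=POM_NS)
        return f"{group.strip()}:{artifact.strip()}"

    deps = [coord(d) for d in root.findall("m:dependencies/m:dependency", POM_NS)]
    return coord(root), deps


def build_graph(pom_texts):
    """Directed graph: an edge u -> v means project u depends on project v."""
    edges = defaultdict(set)
    for text in pom_texts:
        project, deps = parse_pom(text)
        edges[project].update(deps)
        for dep in deps:
            edges.setdefault(dep, set())  # make sure every dependency is a node
    return edges


def degrees(edges):
    """Return (in_degree, out_degree) dictionaries for every node."""
    out_deg = {u: len(vs) for u, vs in edges.items()}
    in_deg = {u: 0 for u in edges}
    for targets in edges.values():
        for v in targets:
            in_deg[v] += 1
    return in_deg, out_deg


SAMPLE_POM = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <groupId>org.example</groupId>
  <artifactId>app</artifactId>
  <dependencies>
    <dependency><groupId>junit</groupId><artifactId>junit</artifactId></dependency>
    <dependency><groupId>org.example</groupId><artifactId>lib</artifactId></dependency>
  </dependencies>
</project>"""

in_deg, out_deg = degrees(build_graph([SAMPLE_POM]))
```

With this edge direction, a project that only consumes other projects has in-degree 0 and an out-degree equal to its number of dependencies, which matches the PMD example in Fig. 1.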

Obtain the Influential Factors of Importance
In this paper, Java open-source projects are measured along three dimensions: project influence, project activity, and project popularity. The specific acquisition method for each dimension is as follows.

Project Influence
Java open-source projects both depend on other projects and are depended on by other projects. According to the degree centrality theory [18], the more neighbors a node has, the greater its influence will be. Based on this theory, the project relational network diagram is first constructed according to the dependencies between projects. In this network, each node represents a Java open-source project and each edge represents the dependency between two projects. Then, the influence degree of the project is calculated by using the in-degree and out-degree of the project node. The specific formula is as follows:

$$DC(i) = \frac{k_i}{N - 1} \qquad (1)$$

where the degree of project node $i$, denoted as $k_i$, refers to the number of nodes directly connected to node $i$, $k_i = \sum_{j} a_{ij}$. In the adjacency matrix $A$, if two nodes are connected by an edge, the value is 1; otherwise, the value is 0. $a_{ij}$ represents the element in row $i$ and column $j$ of the adjacency matrix $A$. $N$ represents the number of nodes in the network. The denominator $N - 1$ represents the maximum possible degree of a node.
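As a worked illustration of the influence measure, the sketch below computes a node's degree divided by the maximum possible degree (the number of nodes minus one). It assumes that a dependency in either direction makes two projects neighbors, so that in-degree and out-degree both contribute; the function names and the toy network are hypothetical.

```python
def symmetric_adjacency(n, directed_edges):
    """Build an n x n 0/1 adjacency matrix, treating each dependency edge
    (u, v) as a link between u and v regardless of direction."""
    matrix = [[0] * n for _ in range(n)]
    for u, v in directed_edges:
        if u != v:  # ignore self-dependencies
            matrix[u][v] = 1
            matrix[v][u] = 1
    return matrix


def degree_centrality(adjacency):
    """Degree of each node divided by the maximum possible degree (n - 1)."""
    n = len(adjacency)
    if n < 2:
        return [0.0] * n
    return [sum(row) / (n - 1) for row in adjacency]


# Toy network: project 0 depends on projects 1 and 2.
adjacency = symmetric_adjacency(3, [(0, 1), (0, 2)])
influence = degree_centrality(adjacency)
```

In the toy network, project 0 is linked to every other node and therefore reaches the maximum value, while projects 1 and 2 each have a single neighbor.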

Project Activity
As an open-source community for social programming, GitHub provides various social mechanisms for project developers to interact, of which the commit mechanism [19] is one. Commit is an essential operation in open-source project version control. Every commit can be considered as a modification node or a targeted operation. The number of commits and the latest commit time of a project can reflect how active the project is.
This paper designed Eq. (2) and Eq. (3) to calculate project activity, where $t$ represents the time interval from the last commit time to now, and $T(t)$ represents the weight coefficient corresponding to the commit time interval. Through continuous tracking of open-source projects on GitHub, it was found most reasonable to take the above time nodes as the failure cycle. $AC(p, t)$ represents the activity of open-source project $p$, and $commit(p)$ represents the number of commits for project $p$. To make the activity measurement more accurate, the commit number is first normalized, then different weights are assigned according to the impact of the commit number and the latest update time on project activity, and finally the two are linearly combined.
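The procedure described above (normalize the commit count, map the time since the last commit to a weight, then combine the two linearly) can be sketched as follows. The time thresholds, the time weights, and the coefficients ALPHA and BETA are hypothetical placeholders chosen for illustration, not the values used in the paper, and min-max scaling is an assumed choice of normalization.

```python
# Hypothetical (upper bound in days, weight) pairs for the time since the last commit.
TIME_WEIGHTS = [(30, 1.0), (180, 0.6), (365, 0.3)]
# Hypothetical coefficients of the linear combination.
ALPHA, BETA = 0.5, 0.5


def time_weight(days_since_last_commit):
    """Piecewise-constant weight: recent commits score higher, stale ones score 0."""
    for limit, weight in TIME_WEIGHTS:
        if days_since_last_commit <= limit:
            return weight
    return 0.0


def min_max(values):
    """Scale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def activity(commit_counts, days_since_last_commit):
    """Linear combination of the normalized commit count and the time weight."""
    normalized = min_max(commit_counts)
    return [
        ALPHA * c + BETA * time_weight(d)
        for c, d in zip(normalized, days_since_last_commit)
    ]


scores = activity([0, 50, 100], [10, 200, 400])
```

In this toy input, a project with few commits but a very recent commit can score as high as a heavily committed project that has been idle for over a year, which is the trade-off that the two weights control.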

Project Popularity
The GitHub open-source community provides the watch, fork, and star mechanisms. If project developers are interested in an open-source project, they can use these mechanisms to track the project and develop it locally. This paper uses the three mechanisms to measure project popularity.
Watch: The most straightforward way to show interest in an open-source project is to follow all of its activity. With the watch mechanism, a user's notification center receives updates whenever a watched project changes, for example when other project developers submit a pull request or post issue comments. The watch number of an open-source project can reflect the level of attention that project developers pay to it.
Fork: The fork number is the number of copies of the project. Using the fork mechanism, users can copy the project to their own repository to fix bugs or continue optimizing the project. The fork number of an open-source project can reflect its popularity.
Star: Star can be colloquially translated as "like". The star mechanism gives users the ability to bookmark projects for later lookup. Users can use the "Your Stars" feature of the personal center to view all projects they have starred. The star number of an open-source project can reflect the degree of support from users.
Yang et al. [20] proposed a method to calculate the popularity of an open-source project based on watch number and fork number. To better quantify the popularity of an open-source project, this paper designed Eq. (4), which integrates the three mechanisms of watch, fork, and star.
In Eq. (4), $pp(p)$ indicates the popularity of open-source project $p$, and $watch(p)$, $fork(p)$, and $star(p)$ represent the numbers of watches, forks, and stars of project $p$, respectively. To make the measurement more accurate, the watch, fork, and star data are first normalized. Then, because the three mechanisms influence open-source projects to different degrees, each is given its own weight, and finally the weighted values are summed.
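The popularity calculation described above (normalize each count, weight it, and sum) can be sketched as follows. The three weights are hypothetical placeholders, not the values fitted in the paper, and min-max scaling is an assumed choice of normalization.

```python
# Hypothetical weights for watch, fork, and star; not the paper's values.
W_WATCH, W_FORK, W_STAR = 0.3, 0.3, 0.4


def min_max(values):
    """Scale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def popularity(watch, fork, star):
    """Weighted sum of the normalized watch, fork, and star counts per project."""
    n_watch, n_fork, n_star = min_max(watch), min_max(fork), min_max(star)
    return [
        W_WATCH * w + W_FORK * f + W_STAR * s
        for w, f, s in zip(n_watch, n_fork, n_star)
    ]


# Three toy projects with (watch, fork, star) counts given column-wise.
scores = popularity(watch=[0, 5, 10], fork=[0, 4, 4], star=[0, 50, 100])
```

Normalizing before weighting keeps the star count, which is typically much larger than the watch and fork counts, from dominating the sum purely because of its scale.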

Importance Assessment Model Construction
The construction of the importance assessment model mainly includes two steps: data preprocessing and model calculation. First, the Java open-source projects are ranked according to their dependency complexity and public participation, and then feature extraction [21] is conducted, which mainly covers the project influence, project activity, and project popularity data of each project. After that, this paper proposes an improved form of PageRank as the importance assessment method. Using the labeled ranking data set, the relationship between the feature set and the labeled values is trained, the parameters between features are optimized, and the importance assessment model is constructed.
The PageRank algorithm ranks keyword-matched search results on the basis of web link analysis [22]. Following the basic idea that PageRank uses to sort web pages, this paper improves the algorithm so that it can rank projects.
User access to open-source community projects is random, and open-source projects depend on one another. However, because the PageRank algorithm does not consider the influence of dynamic information on the ranking, the ranking is disturbed by historical data. Therefore, this paper introduces a feedback factor to solve this problem; the feedback factor is a linear combination of project influence, project activity, and project popularity. The specific formula is as Eq. (5):

$$S(p) = \omega_1 \, DC(p) + \omega_2 \, AC(p, t) + \omega_3 \, pp(p) \qquad (5)$$

where $\omega_1$, $\omega_2$, and $\omega_3$ are weight coefficients, $DC(p)$ represents the influence of project $p$, $AC(p, t)$ represents the activity of project $p$, and $pp(p)$ represents the popularity of project $p$.
Improved PageRank algorithm. By combining the feedback factor with the PageRank algorithm, the evaluation of project importance can be carried out more accurately. Based on the above ideas, this paper designs Eq. (6) to improve the PageRank algorithm.
$PR(p)$ denotes the importance level of open-source project $p$. $T_j$ $(j = 1, 2, \cdots, n)$ represents the other projects that depend on project $p$. $d$ represents the probability that users access projects at random; it lies between 0 and 1. $N$ represents the number of all project nodes. $C(T_j)$ is the number of projects that project $T_j$ depends on. $PR(T_j)/C(T_j)$ represents the PR value given to project $p$ by the dependent project $T_j$. We assume that the initial PR value for each project is 1.
In summary, Fig. 2 shows the flowchart of the importance assessment model of open-source community Java projects based on the domain knowledge graph.
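To make the idea of combining a feedback factor with PageRank concrete, the sketch below shows one possible realization. It is not the paper's Eq. (6): here the feedback factor is assumed to bias the random-jump term, so that a random visit lands on a project with probability proportional to its feedback factor (a personalized PageRank). The function names, the weights, the damping value 0.85, and the handling of projects without dependencies are all assumptions of this example. For numerical convenience the ranks start from a uniform distribution that sums to 1 rather than from an initial value of 1 per project.

```python
def feedback_factor(influence, activity, popularity, weights=(0.2, 0.3, 0.5)):
    """Linear combination of the three dimensions; the weights are hypothetical."""
    w1, w2, w3 = weights
    return w1 * influence + w2 * activity + w3 * popularity


def feedback_pagerank(depends_on, feedback, d=0.85, tol=1e-9, max_iter=200):
    """depends_on: dict mapping every project to the list of projects it depends on
    (every dependency must also appear as a key).
    feedback: dict mapping every project to a non-negative feedback factor."""
    nodes = list(depends_on)
    n = len(nodes)
    total = sum(feedback[p] for p in nodes)
    # Random-jump distribution proportional to the feedback factor
    # (uniform if every feedback factor is zero).
    jump = {p: (feedback[p] / total if total > 0 else 1.0 / n) for p in nodes}
    pr = {p: 1.0 / n for p in nodes}
    for _ in range(max_iter):
        # Rank held by projects without dependencies is redistributed via `jump`.
        dangling = sum(pr[p] for p in nodes if not depends_on[p])
        new = {p: (1 - d) * jump[p] + d * dangling * jump[p] for p in nodes}
        for u in nodes:
            targets = depends_on[u]
            if targets:
                share = d * pr[u] / len(targets)  # u passes rank to what it depends on
                for v in targets:
                    new[v] += share
        converged = sum(abs(new[p] - pr[p]) for p in nodes) < tol
        pr = new
        if converged:
            break
    return pr


# Toy network: projects a and b both depend on project c.
graph = {"a": ["c"], "b": ["c"], "c": []}
factors = {p: feedback_factor(0.1, 0.5, 0.4) for p in graph}
ranks = feedback_pagerank(graph, factors)
```

In the toy network, project c receives rank from both dependents and ends up with the highest value. Raising the feedback factor of one project raises its rank even if its dependency links are unchanged.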

Dataset
This paper uses Python [23] to obtain information about Java open-source projects in the GitHub open-source community and build a data set. During the experiment, representative experimental samples were selected by removing repeated data, and the model was evaluated by analyzing and comparing these data.
When acquiring data, we first collect from GitHub all Java projects built with the three project development tools: Apache Maven, Apache Ant, and Gradle. We then analyze the configuration files of the three tools: pom.xml, build.xml, and build.gradle. Finally, the unique identifier of each project is used to obtain its watch, fork, star, and commit data on GitHub to complete the dataset. The collected projects were then screened with a simple rule (a project is kept if at least one of its attributes is non-empty), and 4512 open-source projects were finally selected as the experimental data set.
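The deduplication and screening rule described above can be sketched as a small filter. The record layout (a dictionary with an "id" key) and the attribute names are hypothetical; the paper does not specify how the crawled records are stored or which attributes the rule checks.

```python
# Hypothetical attribute names checked by the screening rule.
ATTRIBUTES = ("watch", "fork", "star", "commits", "dependencies")


def is_non_empty(value):
    """A value counts as empty if it is missing, an empty string, or an empty list.
    A count of 0 is treated as a real (non-empty) value."""
    return value not in (None, "", [])


def screen(projects):
    """Drop records with a repeated project id, then keep only projects
    that have at least one non-empty attribute."""
    seen = set()
    kept = []
    for project in projects:
        project_id = project["id"]
        if project_id in seen:
            continue  # repeated data
        seen.add(project_id)
        if any(is_non_empty(project.get(name)) for name in ATTRIBUTES):
            kept.append(project)
    return kept


records = [
    {"id": "a", "star": 5},
    {"id": "a", "star": 5},            # duplicate of the first record
    {"id": "b"},                       # no attributes at all
    {"id": "c", "dependencies": ["a"]},
]
dataset = screen(records)
```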

Knowledge Graph
Each obtained Java open-source project is taken as a network node, the dependencies between projects as edges, and the project attributes as node attributes. The Neo4j graph database is then used to build the knowledge graph, as shown in Fig. 3.
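One way to express this mapping is to generate parameterized Cypher statements from the project records, as in the sketch below. The node label Project, the relationship type DEPENDS_ON, and the record layout are assumptions made for this example and are not taken from the paper. The sketch only builds the statements and does not connect to a database.

```python
# Create or update a project node and copy its attributes onto it.
NODE_QUERY = "MERGE (p:Project {id: $id}) SET p += $attrs"

# Link two existing project nodes with a directed dependency edge.
EDGE_QUERY = (
    "MATCH (a:Project {id: $src}), (b:Project {id: $dst}) "
    "MERGE (a)-[:DEPENDS_ON]->(b)"
)


def graph_statements(projects):
    """Return (query, parameters) pairs: all node statements first, then all edges,
    so that both endpoints exist before an edge is merged."""
    statements = []
    for project in projects:
        attrs = {k: v for k, v in project.items() if k not in ("id", "dependencies")}
        statements.append((NODE_QUERY, {"id": project["id"], "attrs": attrs}))
    for project in projects:
        for dep in project.get("dependencies", []):
            statements.append((EDGE_QUERY, {"src": project["id"], "dst": dep}))
    return statements


projects = [
    {"id": "app", "star": 12, "fork": 3, "dependencies": ["lib"]},
    {"id": "lib", "star": 40, "fork": 9},
]
statements = graph_statements(projects)
```

Each pair can be handed to a Neo4j client as a query plus its parameters. Note that the edge statement creates nothing if a dependency has no node of its own, so dependencies outside the data set would need their own node records first.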

Result Analysis
The PageRank algorithm is iterative [24], and the improved PageRank importance assessment model in this paper retains this characteristic, iterating until it reaches a steady state. We performed the first calculation with the initial PR value of each project set to 1 and then iterated until the values were stable. A comparison of the initial PR calculations is shown in Fig. 4.

Figure 4: PR comparison graph
To verify the accuracy of the assessment model, the relationship between the PR value and the feedback factor $S(p)$ is plotted in Fig. 5. The data in the graph are divided into two parts. In part 1, where $S(p) \in (0, 0.007)$, all projects in the interval have only in-degree and out-degree attributes, and all other attributes are empty; in this case, the project influence has a more significant correlation with the feedback factor $S(p)$ and the importance level (PR value). In part 2, where $S(p) \in (0.007, 0.6)$, all projects in the interval have multiple attributes; in this case, the project's importance assessment is affected by project influence, project activity, and project popularity. After multiple hyperparameter adjustments, the experimental results show that the model accuracy is higher when project popularity carries the largest of the three weights, indicating that project popularity has the most significant impact on the feedback factor $S(p)$. When $S(p) \in (0.007, 0.6)$, the TOP-1 project is "jsqlparser", whose influence degree is 0.001039192, activity is 0.510770559, and popularity is 0.044144581, all of which are high among the influence factors. Moreover, both the projects it depends on and the projects that depend on it have high PR values, which confirms the correlation between hot projects. It can be seen from Fig. 5 that the feedback factor $S(p)$ and the PR value present a non-linear positive correlation. When project A links to project B, project B is considered to have obtained the value that project A contributes to it. The size of this value depends on the importance assessment of project A itself; therefore, projects with high PR values are more likely to link to projects with high PR values.
From the above data, it can be found that the influence, activity, and popularity of a project significantly influence its importance assessment. Moreover, there are dependencies among hot projects. The importance assessment model designed in this paper can be used to measure Java open-source projects and performs well.

Conclusion
In recent years, with the rise of open-source software, software and related technologies have developed significantly. GitHub, as a social development platform, allows developers to create and browse code, and it also provides community-based software development functions. Users interact through the watch, pull request, and issue mechanisms, and they can fix bugs in, optimize, and improve open-source projects on GitHub.
In this paper, we present an importance assessment model of open-source community Java projects based on the domain knowledge graph. This model makes full use of the multiple associations among Java open-source projects and between projects and project developers; it combines attributes of multiple dimensions and improves the PageRank algorithm to evaluate the importance of open-source projects. Specifically, this article first constructs a domain knowledge graph from the dependencies between open-source projects and the attributes of the related mechanisms in the GitHub open-source community. It then integrates the three dimensions of project influence, project activity, and project popularity to measure each Java open-source project, and uses the improved PageRank algorithm to establish an open-source project importance evaluation model. Finally, it uses the test data set to conduct an importance assessment experiment for a given project. The method proposed in this paper comprehensively considers the multi-dimensional factors that affect project importance assessment and can evaluate the importance of open-source projects with high accuracy.
In the future, we will extend this work in several respects. First, the data we obtained are all real-time data, and the life cycle of open-source projects is not taken into consideration, so finished projects may be misevaluated. Second, this paper only uses the activity records of users participating in a project and does not delve into the specific contents of the project, such as technical dependence, technical parameters, and so on. The next step is to explore the social connection between the project lifecycle and the project developers, and then gradually improve the project importance assessment model.