Analyzing Promising Researchers Using Network Centralities of Co-authorship Networks from Academic Literature

We propose a new way of using the betweenness centrality measure with co-authorship networks from an academic literature database to evaluate young researchers. It is difficult to discover and evaluate promising young researchers with indexes based on the number of cited papers, such as the h-index, because such indexes lag: the impact of a published paper only becomes apparent after it has been cited by other papers. We validated the effectiveness of the measure as an index for evaluating young researchers. Our investigation of 1.92 million publications in the biological sciences shows that Research Fellows with the Japan Society for the Promotion of Science (JSPS) have higher rankings and progress more quickly than other researchers. In addition, differences between JSPS Research Fellows and other researchers were observed at earlier stages using the proposed method than with the h-index and with centralities from literature published in the past 4 years. We expect that the proposed use of the betweenness centrality measure can be applied effectively to extract promising young researchers.


Introduction
Talented researchers are indispensable to innovation and development in science and technology. However, it is difficult to discover and evaluate promising young researchers such as doctoral and postdoctoral students. Indexes based on the number of cited papers, such as the h-index, can be used to evaluate researchers and the impact of publications, yet already-published papers introduce a lag to such indexes, and their impact only becomes apparent after they have been cited by other papers. As such, these indexes are unsuitable for evaluating young researchers and students who have only conducted research for short periods and have yet to accumulate achievements. For this reason, peer review by experts is generally adopted when evaluating young researchers. However, peer reviews lack breadth and quantitativeness.
Therefore, to find and evaluate promising young researchers, especially those involved in group research, we focused on the betweenness centrality of co-authorship networks constructed from academic literature databases and on the characteristics of its time series variation. We propose a novel use of this index to compensate for the lagging nature of citation indexes [1,2]. The betweenness centrality of co-authorship networks is not an index that directly measures the impact of research results, as a citation index does. Rather, the betweenness centrality quantitatively shows the degree of contribution to the betweenness of a research community. It indicates whether a researcher is in a key position to connect research groups or researchers in a particular field. In addition, the betweenness centrality of co-authorship networks is an index that can be measured simultaneously with the publication of a paper. Therefore, an index that uses the betweenness centrality of co-authorship networks is more immediate than an index of citations, and it is quantitative, unlike peer review. Further, it allows for a broad comparison of researchers recorded in a database.
The purpose of this study is to determine whether the betweenness centrality of co-authorship networks constructed from academic literature databases is useful for evaluating promising young researchers. To this end, we examine the betweenness centrality of Research Fellows of the Japan Society for the Promotion of Science (JSPS), who were evaluated as promising by the existing JSPS peer review process, and compare the characteristics of its time series variation with those of the h-index. We show that researchers can be evaluated earlier using the betweenness centrality than using the h-index.
The paper is organized as follows. First, we survey related research. Following this, we describe how we extracted 15 years of bibliometric information in the biological sciences from an academic literature database (JSTPlus) provided by Japan Science and Technology Agency (JST). The information amounted to approximately 1.92 million items of data, which we used to analyze temporal changes to the betweenness centrality of co-author networks and the h-index. Finally, we introduce a statistical model that we developed to characterize JSPS Research Fellows and comprehensively search the database for researchers with similar characteristics.

Related Research
Representative evaluation indexes for researchers include the number of articles published in academic journals with a high impact factor based on the citation network of the written documents, and citation indexes such as the h-index. The impact factor is not an index that directly evaluates the researcher or the paper itself. Rather, it represents an evaluation of the journal based on the number of citations of articles published in each journal [3]. The h-index is defined as the maximum value of h such that the given author has published h papers that have each been cited at least h times. It is an index that indicates the impact of an individual researcher's activity [4]. These citation indexes are obtained based on the number of citations of published documents, and they are quantitative, revealing the impact of research results. They enable evaluations of talented researchers who have already accumulated research experience and produced results. However, these indexes merely follow past research results. As such, it is difficult to evaluate young researchers who have a promising future, yet still have insufficient research achievements [5,6]. A typical example of a social network other than a citation network composed of citation relationships in a document database is a co-authorship network composed of document co-authorship relationships. Here, we focus on the centrality of co-authorship networks because the measure has immediacy, whereas citation indexes such as h-index, based on a citation network, are necessarily delayed.
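The h-index definition given above can be made concrete with a short Python sketch. The function name and the sample citation counts are illustrative assumptions, not data from this study.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for position, count in enumerate(ranked, start=1):
        if count >= position:
            h = position  # the top `position` papers all have >= position citations
        else:
            break
    return h


# Hypothetical author with five papers.
print(h_index([10, 8, 5, 4, 3]))
```

A researcher with only a few recent papers necessarily has a small value here, whatever the quality of the work, which is the lag discussed above.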
In a related study of co-authorship networks, Melin clarified the characteristics of collaboration within research institutions, within countries, and internationally, based on the co-authorship of papers, arguing that the analysis of co-authorship relationships is useful for science and technology policy [7]. Barabasi et al. analyzed a co-authorship network of papers in two fields (viz., mathematics and cognitive science), and Newman studied three fields (viz., mathematics, biology, and physics). They clarified the differences in collaborative patterns among academic fields [8,9].
The relationship between the centrality of co-authorship networks and the achievements of researchers has also been discussed. For example, Shinoda constructed a co-authorship network focusing on the order of authors of articles published in Transactions of the Japanese Society for Artificial Intelligence, and showed the identification and transition of central researchers from the society based on a centrality index [10]. Abbasi et al. constructed a co-authorship network of papers published in information and library sciences, showing that researchers with a high betweenness centrality achieved good results [11]. Several methods have already been proposed to search for promising researchers using a co-authorship network. These include evaluation indexes that combine parameters such as the co-authorship of papers and the names of journals [12,13], along with classification methods using machine learning [14,15]. However, these studies do not focus on young researchers who have only just started to publish their research results, and they are not designed to find and actively evaluate promising young researchers and students.
In this paper, JSPS Research Fellows are taken to be promising young researchers. By comparing the temporal changes in the betweenness centrality of JSPS Research Fellows in co-authorship networks with the temporal changes in their h-index, we determined whether the betweenness centrality of co-authorship networks is a useful index for finding and evaluating promising young researchers.

Academic Literature Database
To construct the co-authorship networks, we used an academic literature database called JSTPlus, provided by the Japan Science and Technology Agency (JST). JSTPlus is a database of academic literature from more than 50 countries, covering all fields of science and technology. It contains over 24 million papers, and approximately 700,000 new papers are added annually.
We selected the biological sciences for our analysis, designated with the JST classification code "E". All of the information was extracted for publications with classification code "E" issued in the 15 years from 2001 to 2015. The identification numbers (IDs) of the extracted documents totaled 1,920,191. IDs are assigned to authors, institutions, and cited literature, and name aggregation processing was applied to these IDs. Table 1 shows the number of paper IDs and author IDs extracted. The "New extractions" column in Table 1 shows the number of IDs extracted for the first time that year from among the author IDs totaled each year. The publication year of the cited literature is needed to calculate the time series variation of the h-index, so the corresponding document information was extracted separately.

Extraction of JSPS Research Fellows
The JSPS Fellowship Program is a subsidy program for young researchers who are Ph.D. candidates or postdoctoral researchers. The program has been ongoing since 1985. Selection is mainly based on screenings of proposals. JSPS asks the researchers selected for the program to indicate that they have received a JSPS grant at the time of publication, for example by noting "JSPS Research Fellow" or the equivalent when stating their affiliations in papers. We extracted authors for whom the words "Japan Society for the Promotion of Science" or "JSPS" were included in the affiliated institution column of the database as researchers with a history of obtaining a JSPS Research Fellowship. As a result of searching the affiliated institution in JSTPlus, the total number of JSPS researchers extracted from 2001 to 2015 was 1655. Because this corresponds to an average of approximately 100 JSPS Research Fellows per year, we considered the sample obtained by the above-mentioned method of checking personal history to be sufficiently large. The "JSPS" column in Table 1 shows the number of IDs for which a history as a JSPS Research Fellow as of 2015 was confirmed from among the "New extractions" author IDs for the year. The numbers in the "JSPS" column of Table 1 are small around 2015 because the period as a JSPS Research Fellow generally falls in the few years after a new selection.

Betweenness Centrality of Co-authorship Networks and Its Normalization
After constructing a bipartite graph with the paper IDs and author IDs as vertices, we constructed co-authorship networks that retain only the author IDs as vertices. We generated the co-authorship networks as undirected, unweighted graphs, one for each year. Therefore, the vertex representing an author ID is absent from a given year's network unless the author published at least one paper in that year, and an edge representing a co-authorship relationship is not maintained unless at least one paper is co-authored in each year. The disconnected subgraphs were numbered Component 1, Component 2, and so on, in descending order of the number of vertices. The density of the vertices in Fig. 1 represents the magnitude of the betweenness centrality in the graph. For a graph $g$ consisting of $n$ vertices, the betweenness centrality $b_i$ of vertex $i$ indicates the degree to which $i$ is located on the shortest paths between pairs of vertices $(j, k)$ other than $i$. This is expressed as

\[ b_i = \sum_{\substack{j < k \\ j \neq i,\; k \neq i}} \frac{g_{jk}(i)}{g_{jk}}, \tag{1} \]

where $g_{jk}$ represents the number of shortest paths connecting vertex $j$ and vertex $k$, and $g_{jk}(i)$ represents the number of those paths that pass through vertex $i$. The betweenness centrality of the co-authorship networks shown in Fig. 1 is obtained as shown in Table 2.
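As a minimal sketch of the construction described above, the following Python code projects a paper-to-authors mapping onto an undirected, unweighted co-authorship graph for one year and computes the unnormalized betweenness centrality over unordered vertex pairs. It uses Brandes' algorithm, which is a standard technique chosen here for illustration and not necessarily the implementation used in this study. The paper and author IDs are hypothetical.

```python
from collections import defaultdict, deque
from itertools import combinations


def coauthor_graph(papers):
    """Project {paper_id: [author_id, ...]} onto an undirected author graph."""
    adj = defaultdict(set)
    for authors in papers.values():
        unique = set(authors)
        for a in unique:
            adj[a]  # single-author papers still create an (isolated) vertex
        for a, b in combinations(sorted(unique), 2):
            adj[a].add(b)
            adj[b].add(a)
    return dict(adj)


def betweenness(adj):
    """Unnormalized betweenness over unordered pairs (j, k), Brandes' algorithm."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        dist = {s: 0}
        sigma = defaultdict(float)
        sigma[s] = 1.0
        preds = defaultdict(list)
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies, starting from the farthest vertices.
        delta = defaultdict(float)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Every unordered pair was visited from both of its endpoints.
    return {v: total / 2.0 for v, total in score.items()}


# Hypothetical one-year data: a chain A-B-C-D plus a solo author E.
papers = {"p1": ["A", "B"], "p2": ["B", "C"], "p3": ["C", "D"], "p4": ["E"]}
b = betweenness(coauthor_graph(papers))
print(b)
```

In this toy network, B and C each lie on the shortest paths of two author pairs, whereas the end points and the isolated author lie on none, which illustrates why many authors share the value 0.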
In Table 2, $y_i$, $c_i$, and $y^*_i$, listed together with $b_i$, represent values obtained by normalizing or classifying $b_i$ with $n = 400$. There are several methods of normalizing the betweenness centrality. We opted to use normalization by ranking. Specifically, the following two methods were applied.
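With the first normalization method, $b_i$ is ranked in order of size as $n, (n-1), \cdots$ each year, and $\mathrm{rank}_1(\cdot)$ assigns the mean rank when ties occur. The normalized value $y_i$ is then

\[ y_i = \frac{\mathrm{rank}_1(b_i)}{n}. \tag{2} \]

In addition to unifying the maximum value in any year as $\max(y_i) = 1$, this method is useful for extracting a fixed percentage of top-ranked researchers, such as those whose betweenness centrality $b_i$ is in the top 1.0% (Excellent), top 2.5% (Very good), top 5.0% (Good), or below this (Fair). Let the class $c_i$ of $y_i$ be

\[ c_i = \begin{cases} \text{Excellent} & (y_i \geq 0.99) \\ \text{Very good} & (0.975 \leq y_i < 0.99) \\ \text{Good} & (0.95 \leq y_i < 0.975) \\ \text{Fair} & (y_i < 0.95). \end{cases} \tag{3} \]

First, we focus on researchers whose $b_i$ is in the top 5% ($y_i \geq 0.95$). A cumulative logit model was applied to the time series variation of the class $c_i$. Let the probability $P(y = c \mid x)$ that the response $y$ according to time series $x$ is classified as class $c$ be

\[ P(y = c \mid x) = P(y \leq c \mid x) - P(y \leq c - 1 \mid x), \qquad P(y \leq l \mid x) = \frac{1}{1 + \exp\{-(\alpha_l + \beta_l x)\}}, \quad l = 1, \ldots, m - 1, \tag{4} \]

with $P(y \leq 0 \mid x) = 0$ and $P(y \leq m \mid x) = 1$. Here, $\alpha_l$ and $\beta_l$ denote regression coefficients, and $m$ represents the number of classes; $m = 4$, as shown in Eq. (3). For the time series $x$, the year of new extraction was taken as the origin, such that $x = 1$ in the year (between 2001 and 2015) in which the author was newly extracted.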
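The first normalization and the class assignment can be sketched in Python as follows. The sketch assumes that the normalized value is the mean-tie rank divided by $n$ and that the class boundaries are 0.99, 0.975, and 0.95, as implied by the top 1.0%, 2.5%, and 5.0% cut-offs in the text. The function names and the sample values are hypothetical.

```python
def mean_rank_normalize(values):
    """Return rank/n for each value, where tied values share the mean of their ranks."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    y = [0.0] * n
    start = 0
    while start < n:
        end = start
        # Extend the run while the next value (in sorted order) is tied.
        while end + 1 < n and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = ((start + 1) + (end + 1)) / 2.0  # ranks are 1-based
        for pos in range(start, end + 1):
            y[order[pos]] = mean_rank / n
        start = end + 1
    return y


def classify(y):
    """Map a normalized value to the four classes (assumed boundaries)."""
    if y >= 0.99:
        return "Excellent"
    if y >= 0.975:
        return "Very good"
    if y >= 0.95:
        return "Good"
    return "Fair"


# Hypothetical betweenness values for five authors; three are tied at 0.
y = mean_rank_normalize([0, 0, 0, 5, 9])
print(y, [classify(v) for v in y])
```

Note that the three authors tied at 0 receive 0.4 rather than 0, which is the $\min(y_i) \neq 0$ behaviour that motivates the second normalization.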
Although the normalization method described above is suitable for the analysis of the upper class of the betweenness centrality, it is unsuitable for analyzing the average transition of the whole. Owing to the structure of the co-authorship networks, many tied values such as $b_i = 0$ occur, so the minimum value becomes $\min(y_i) \neq 0$, and the frequency distribution becomes discrete from this lowest level to the middle level.
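Therefore, with the second normalization method, when ties occur, $\mathrm{rank}_2(\cdot)$ assigns the tied values a common rank that is consecutive with the previous rank, and the normalized value $y^*_i$ is given as

\[ y^*_i = \frac{\mathrm{rank}_2(b_i) - \min_j \mathrm{rank}_2(b_j)}{\max_j \mathrm{rank}_2(b_j) - \min_j \mathrm{rank}_2(b_j)}. \tag{5} \]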
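A minimal Python sketch of the second normalization follows. It assumes that the ranking is a dense ranking (tied values share one rank and the next distinct value receives the next integer) and that the ranks are rescaled linearly to the interval from 0 to 1. This is one reading of the description above, and the function name and sample values are hypothetical.

```python
def dense_rank_normalize(values):
    """Rescale dense ranks (1, 2, 3, ... without gaps) linearly onto [0, 1]."""
    distinct = sorted(set(values))
    rank = {v: r for r, v in enumerate(distinct, start=1)}
    lowest, highest = 1, len(distinct)
    if highest == lowest:
        # All values are tied, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(rank[v] - lowest) / (highest - lowest) for v in values]


# Same hypothetical values as before: three authors tied at 0.
print(dense_rank_normalize([0, 0, 0, 5, 9]))
```

Under this assumption the large group of authors with a betweenness of 0 is mapped to exactly 0, and the remaining distinct values are spread evenly up to 1.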
When adopting this method, $\max(y^*_i) = 1$, $\min(y^*_i) = 0$, and the distribution is not discretized. The logistic growth curve is given by

\[ y^* = \frac{K}{1 + a\,e^{-r x}} \tag{6} \]

and fitted to the change of $y^*_i$ according to time series $x$, where $K$ (the upper asymptote), $a$, and $r$ are regression coefficients. The analysis was performed with R 3.6.0 (data. …).

Transition of the Upper Class of Betweenness Centrality
Figure 2 and Table 3 show the results of applying Eq. (4) to the transition of the upper class $c$ of the betweenness centrality. The horizontal axis in Fig. 2 represents the time series $x$. Based on Table 1, the data of the newly extracted 1,259,230 authors for the last 5 years were integrated, and the composition ratios of class $c$ were tabulated. The red stacked bar chart shows the composition ratios of class $c$ for the newly extracted researcher group. For the sake of comparison, the blue bars show the results of re-extracting the data for 690 researchers with a personal history (until 2015) as a JSPS Research Fellow and tabulating the composition ratios again. Data with $x > 1$ have missing values in either group, but these are excluded from the composition ratios of the chart. Figure 2 shows that the probability that the betweenness centrality falls within the top 5% of the total is low at the stage of $x = 1$, when a researcher is newly extracted from the database for the first time. However, we confirmed a tendency for it to increase over time. In particular, the composition ratio of the upper class of JSPS Research Fellows tended to increase at an accelerating rate. As a result of applying the Kolmogorov-Smirnov (KS) test, the composition ratio showed a significant upward bias ($p < 0.05$) from the stage of $x \geq 4$, compared to the newly extracted researcher group. The cumulative logit model was applied to the transition of the composition ratio of the JSPS Research Fellow group, and a significant regression coefficient was shown ($p < 0.001$).
To summarize the above, the betweenness centrality of researchers such as those selected as JSPS Research Fellows increases at an accelerated rate compared to that of general new entrants. Approximately 4-8 years after they first enter, it is highly probable that they fall within the top 5%. Specifically, it was shown that the normalized value $y$ of the betweenness centrality proposed in this paper can grow to the 0.95 level.

Transition of Representative Values of Betweenness Centrality and h-Index
To construct a betweenness centrality growth model from representative values rather than from the upper ranks of JSPS Research Fellows, and to compare its characteristics with those of the h-index, the author IDs with an h-index of 1 or more by 2015 were extracted again from the target author IDs, and the transitions of the normalized betweenness centrality $y^*$ and the h-index were obtained. The h-index was evaluated at each year of the time series. The h-index at new extraction, $x = 1$, was calculated from the paper IDs published up to $x = 1$ by summing the citations they received in the period up to $x = 1$. Likewise, the h-index for $x = 2$ was calculated from the paper IDs published up to $x = 2$ by summing the citations they received in the period up to $x = 2$.
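The year-by-year h-index described above can be sketched in Python as follows. The data layout (a list of publication years paired with the years of the citing papers) and all sample values are hypothetical assumptions made for illustration.

```python
def h_index(counts):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(counts, reverse=True)
    return sum(1 for position, c in enumerate(ranked, start=1) if c >= position)


def h_index_series(papers, first_year, n_years):
    """h-index at x = 1..n_years, using only papers and citations up to each year.

    papers: list of (publication_year, [years of citing papers]).
    """
    series = []
    for x in range(1, n_years + 1):
        cutoff = first_year + x - 1  # calendar year corresponding to x
        counts = [
            sum(1 for cited_year in cited_years if cited_year <= cutoff)
            for pub_year, cited_years in papers
            if pub_year <= cutoff
        ]
        series.append(h_index(counts))
    return series


# Hypothetical author first extracted in 2002.
papers = [
    (2002, [2003, 2004, 2004]),
    (2003, [2004, 2005]),
    (2004, [2005]),
]
print(h_index_series(papers, first_year=2002, n_years=4))
```

Because citations can only arrive after publication, the value at $x = 1$ is typically 0 and the series grows slowly in the first years.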
There were 467,003 newly extracted researchers who met the extraction conditions, of which 463 were JSPS Research Fellows. Figure 3 shows the median value of $y^*$ as a scatter diagram, and the results of fitting the logistic curve of Eq. (6) to the transition. Figure 4 shows the mean value of the h-index as a scatter diagram and the results of fitting a similar logistic curve to the transition. The blue plot represents the JSPS Research Fellow group, and the red plot represents the newly extracted researcher group. Tables 4 and 5 show the regression coefficients of the growth curves. Figure 3 shows that the betweenness centrality $y^*$ of the JSPS Research Fellow group grows at an accelerated rate over time and fits the logistic curve significantly ($p < 0.001$). The regression coefficient $K$ of this growth curve was estimated to be 0.626, as shown in Table 4, and $y^*$ was shown to approach this value after 8 years. By contrast, the $y^*$ of the newly extracted researcher group was almost 0, even 8 years after the first academic publication, and the regression coefficient $K$ could not be estimated, even when applying Eq. (6). Therefore, the other regression coefficients were estimated with $K = 1$, but the regression coefficients related to the rise of the curve were not significant ($p > 0.05$).
In contrast to this feature, Fig. 4 shows that the logistic curve fits well for the time series change of the h-index for both groups. It should be noted that there is no difference in growth between the two groups from new extraction to 4 years thereafter, and a difference of only approximately 0.5 arises even after 8 years. The h-index calculated from the number of academic publications and the number of citations is an integer value such as 0, 1, 2, ⋯ , as defined. Therefore, the difference in the h-index value between the two groups of less than 1 is not sufficient to evaluate differences in these two groups of researchers.
According to these results, the h-index has a lagging property, as explained in Sect. 1: at first, no differences appeared between the general group of newly extracted researchers and the JSPS Research Fellow group, and a difference then emerged only slowly and steadily. In contrast, the betweenness centrality is an index that is able to show differences between the two groups at an early stage.
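The logistic growth curves fitted in this section can be illustrated with a simplified Python sketch. It assumes the parametrization $K / (1 + a e^{-rx})$ and holds the asymptote $K$ fixed, which allows a linearized least-squares fit of $\ln(K/y - 1) = \ln a - r x$. This is a substitute technique chosen for brevity, not the nonlinear regression carried out in R for this study. The value $K = 0.626$ is the estimate reported in Table 4, whereas $a$, $r$, and the data points are hypothetical.

```python
import math


def logistic(x, K, a, r):
    """Logistic growth curve with upper asymptote K (assumed parametrization)."""
    return K / (1.0 + a * math.exp(-r * x))


def fit_logistic_fixed_K(xs, ys, K=1.0):
    """Estimate (a, r) with K held fixed, by ordinary least squares on
    z = ln(K / y - 1) = ln(a) - r * x. Points outside 0 < y < K are skipped."""
    pts = [(x, math.log(K / y - 1.0)) for x, y in zip(xs, ys) if 0.0 < y < K]
    n = len(pts)
    if n < 2:
        raise ValueError("need at least two points with 0 < y < K")
    mean_x = sum(x for x, _ in pts) / n
    mean_z = sum(z for _, z in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    if sxx == 0.0:
        raise ValueError("x values must not all be identical")
    sxz = sum((x - mean_x) * (z - mean_z) for x, z in pts)
    slope = sxz / sxx
    intercept = mean_z - slope * mean_x
    return math.exp(intercept), -slope


# Hypothetical noise-free medians generated from known parameters.
xs = list(range(1, 9))
ys = [logistic(x, K=0.626, a=50.0, r=1.0) for x in xs]
a_hat, r_hat = fit_logistic_fixed_K(xs, ys, K=0.626)
print(a_hat, r_hat)
```

With real, noisy medians the linearization weights small values heavily, so a nonlinear least-squares fit that also estimates $K$ would generally be preferable.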

Extraction of a Researcher Group that Shows Correlation with the JSPS Research Fellow Betweenness Centrality Growth Model
We here propose using the Pearson product-moment correlation coefficient $r$ as a simple method of widely searching the database for researchers whose betweenness centrality is increasing in the same way as that of the JSPS Research Fellow group. The procedure for the proposed method is as follows.
Step 1. Extract the document information for the specific academic field to be searched from the database, calculate the betweenness centrality of the co-authorship networks, and normalize it using Eq. (5). The betweenness centrality is calculated on a yearly basis and made into time series data.
Step 2. Build a JSPS Research Fellow model for the field. When constructing a median model, as shown by the blue line in Fig. 3, Eq. (6) is applied to approximately 8 years of time series data, to observe the accelerating increase in the betweenness centrality.
Step 3. Evaluate the correlation between the JSPS Research Fellow model and the $n$ years of data of each researcher in the newly extracted researcher group. The threshold for the correlation coefficient $r$ is the value at which the null hypothesis is rejected in a test of no correlation for a sample size $n$. That is, the test statistic $t_0$ of the test of no correlation, which follows a $t$ distribution with $n - 2$ degrees of freedom, is

\[ t_0 = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}}. \tag{7} \]

Solving this for $r$ at the critical value of $t_0$ gives the threshold value. For example, if $n = 4$ at a significance level of 5%, then researchers who show a correlation of $r > 0.900$ are extracted. If $n = 8$, then those with a correlation of $r > 0.621$ are extracted. This method can be applied even if part of the time series data is missing. Owing to the nature of Eq. (7), however, $r$ cannot be evaluated unless $n > 2$.

Table 6 shows the number of author IDs correlated with the median model of the JSPS Research Fellow group constructed by the above procedure (Fig. 3 and Table 4). Figure 5 shows, by group, the transitions of the betweenness centrality and the h-index of researchers newly extracted in 2002. The values in parentheses in Table 6 indicate the number of newly extracted researchers who were confirmed as JSPS Research Fellows. According to Table 6, when the search condition was $n \leq 4$, that is, when searching for researchers who published their first paper within 4 years, approximately 100 researchers were extracted per year, and when the search condition was $n \leq 8$, approximately 1000 researchers were extracted as promising young researchers. In addition, Fig. 5a shows that the betweenness centrality of the group revealed by the proposed method transitions in the same way as the betweenness centrality of the JSPS Research Fellow group for that year. Regarding the h-index, we confirmed that the group revealed by the proposed method and the JSPS Research Fellow group transitioned in the same way, as shown in Fig. 5b.
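Step 3 can be sketched in Python using only the standard library. The sketch inverts the test statistic $t_0 = r\sqrt{n-2}/\sqrt{1-r^2}$ to obtain the threshold $r = t/\sqrt{n - 2 + t^2}$, taking $t$ as the one-sided 5% critical value, which is the reading that reproduces the thresholds quoted in the text. The numerical quantile routine, the function names, and the model and candidate series are all illustrative assumptions.

```python
import math


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2.0) / (math.sqrt(df * math.pi) * math.gamma(df / 2.0))
    return c * (1.0 + t * t / df) ** (-(df + 1) / 2.0)


def t_cdf(t, df, steps=2000):
    """CDF for t >= 0, by Simpson's rule on [0, t] (steps must be even)."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3.0


def t_quantile(p, df):
    """Quantile for 0.5 < p < 1, found by bisection on the CDF."""
    lo, hi = 0.0, 1.0
    while t_cdf(hi, df) < p:
        hi *= 2.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def r_threshold(n, alpha=0.05):
    """Smallest r rejected by a one-sided test of no correlation (needs n > 2)."""
    t = t_quantile(1.0 - alpha, n - 2)
    return t / math.sqrt(n - 2 + t * t)


def pearson_r(xs, ys):
    """Pearson correlation, or None if either series has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        return None
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


def select_similar(model, candidates, alpha=0.05):
    """Return IDs whose series correlates with the model above the threshold.
    Missing years are given as None and are dropped pairwise."""
    selected = []
    for author_id, series in candidates.items():
        pairs = [(m, y) for m, y in zip(model, series) if y is not None]
        if len(pairs) <= 2:
            continue  # r cannot be tested unless n > 2
        r = pearson_r([m for m, _ in pairs], [y for _, y in pairs])
        if r is not None and r > r_threshold(len(pairs), alpha):
            selected.append(author_id)
    return selected


# Hypothetical 4-year model curve and two hypothetical candidates.
model = [0.0, 0.05, 0.2, 0.45]
candidates = {"A": [0.0, 0.1, 0.3, 0.5], "B": [0.3, 0.1, 0.2, 0.0]}
print(round(r_threshold(4), 3), round(r_threshold(8), 3))
print(select_similar(model, candidates))
```

Because the threshold depends on the number of available years, a researcher with missing years is tested against a stricter value of $r$ than one with a complete series.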
The number of JSPS Research Fellows included in the group revealed by the proposed method was smaller than the population, as shown in parentheses in Table 6. This is because the proposed method uses only the statistical property of the betweenness centrality, and not an estimator trained by machine learning using multiple variables for the purpose of predicting future JSPS Research Fellows with high accuracy. Those who were extracted by the proposed method were found to be Research Fellows with RIKEN and the National Institute of Advanced Industrial Science and Technology, both leading Japanese research institutions, and postdoctoral researchers from research institutions in other countries. The proposed method is thus considered to be effective as a means of widely searching for researchers who have not been accepted as JSPS Research Fellows, yet have become fellows elsewhere through a similar peer review process, or who may be evaluated by such a peer review process over the next few years.

Evaluation Timing
The results of this study suggest that by using the proposed method, researchers with the same centrality transition characteristics as JSPS Research Fellows can be extracted earlier than they would be by using the h-index, provided that approximately 4 years of co-authorship information can be obtained. This is because citation-based indexes such as the h-index lag: it takes time for citations by other researchers to produce a clear difference between excellent researchers and general researchers. By contrast, the centrality of co-authorship networks is an immediate index that is calculated concurrently with the publication of research papers. Furthermore, the betweenness centrality indicates whether a researcher is in a key position to connect research groups or researchers, and it shows the degree of contribution to the betweenness in a research community, which is important for achievements in collaborative research. Accordingly, we found that researchers extracted using the proposed method with the betweenness centrality are indeed promising, and that their h-index tends to grow in the same way as that of JSPS Research Fellows. The JSPS Fellow system aims to train and secure excellent young researchers. A Research Fellow (DC) refers to a student who is enrolled in a doctoral program, and a Research Fellow (PD) refers to one who has recently obtained a doctoral degree (within five years). At present, the selection and evaluation of these young researchers is conducted by a peer review of the prospective researcher's "qualities as a researcher", "research plans", "ability to execute research plans", etc.
We considered these "young researchers" to be prospective JSPS Research Fellows: doctoral students or those who recently acquired their Ph.D. The proposed method is able to evaluate and extract young researchers who published their first paper within 4 years. For example, researchers who published their first paper in the second year of their doctoral program and spent three more years finishing the program correspond to these young researchers. At that stage, young researchers are rarely evaluated highly by citation-based indices such as the h-index, and peer reviews are often used instead. However, it is generally difficult to guarantee a quantitative evaluation and one that offers wide coverage of the evaluation targets with a peer review process. The effort required of evaluators is moreover burdensome. By contrast, the proposed method is quantitative and offers broad coverage by using the betweenness centrality in co-authorship networks constructed from academic literature. As such, the burden on evaluators can be reduced.
Therefore, we expect that the proposed method can provide useful information to companies that are considering hiring researchers who are still doctoral candidates for future research and development, and to academic promotion organizations seeking promising young researchers.

Target Field for Evaluation
Research styles vary by field. For example, researchers tend to work alone in fields that are more theoretical, such as mathematics, and they tend to collaborate in fields that are more experimental, such as biology and physics. In interdisciplinary fields such as biotechnology and bioinformatics, it is considered effective to conduct collaborative research across multiple fields. In addition, JSPS Research Fellows must change their affiliated research institute to create new research opportunities. These factors can influence the centrality of co-authorship networks. Furthermore, in the field of medicine, for example, research and development is conducted over a long period of time, and continual, protracted collaboration among researchers is often necessary. When taking into account these research styles, the centrality of the co-authorship networks, that is, collaborative networks, can be considered an index suitable for group research in general, and for interdisciplinary and multidisciplinary research in particular.
This paper focused on biology, where group research is common. We did not analyze the effectiveness of the proposed index in other research fields. In recent years, however, collaborative research such as open innovation, industry-academy cooperation, and interdisciplinary research has become more important. Thus, it is desirable to conduct similar analyses on the relationship between the centralities of co-authorship networks in other research fields and affiliated institutions.

Conclusion
To verify whether the betweenness centrality of co-authorship networks constructed from academic literature databases is a useful index for quantitative and widely covered evaluations of promising young researchers, we analyzed the betweenness centrality of JSPS Research Fellows, who were previously evaluated as promising according to a JSPS peer review process, and compared the characteristics of its time series variation with those of the h-index.
The betweenness centrality of the group assessed by JSPS as promising young researchers was compared to the centrality of a general group of researchers. We clarified that the former has a high probability of reaching the top ranks in the years following their first academic publication, and that their centrality follows a growth curve that increases at an accelerated rate in the first few years. In addition, whereas citation-based indexes such as the h-index indicate the impact of research results in a lagging manner, the betweenness centrality of co-authorship networks is an index that can be measured simultaneously with the publication of papers. Therefore, when the above two groups of researchers were evaluated with the betweenness centrality, differences between the two groups could be observed at earlier stages than in evaluations with the h-index. The betweenness centrality of co-authorship networks was thus shown to be an index that can evaluate researchers ahead of the h-index. Furthermore, we demonstrated that if co-authorship information is obtained over approximately 4 years, we can extract from the database promising researchers whose h-index grows in the same way as that of a JSPS Research Fellow.
This paper focused on biology, where collaborative research is common. Given the increasing importance of collaboration in research activities such as open innovation and industry-academy cooperation, the betweenness centrality of co-authorship networks may be suitable for organizational research and interdisciplinary research such as bioengineering, bioinformatics or astrophysics. Consequently, it will be beneficial to conduct similar analyses in such research fields. Future developments include clarifying the causal relationship between changes to affiliated institutions and changes to the betweenness centrality of the co-authorship networks.