Co-authorship network structure analysis

The analysis of collaboration networks between scientists reveals features of academic communities that help in understanding the specifics of collaborative scientific work and in identifying notable researchers. In these networks, the nodes are authors, and two authors are linked if they have coauthored one or more papers. This article presents an analysis of a co-authorship network based on bibliometric data retrieved from a distributed economic database. We use a simple network model that does not take into account the strength of collaborative ties. The data were analyzed using statistical techniques to obtain such parameters as the number of papers per author, the number of authors per paper, the average number of coauthors per author and collaboration indices. We show that the largest component occupies nearly 90 % of the network and that the node degree distribution follows a power law. The study of typical distances between nodes and of the degree of clustering makes it possible to classify the network as a ‘small world’ network.


Background
One class of complex networks is 'collaboration networks', formed by groups of actors with common interests, such as athletes in the same sport or film actors appearing in the same film. This class also includes groups of scientists who have jointly prepared one or more scientific publications (SPs), i.e. coauthors. Typically, these SPs are the result of well-documented scientific collaboration. If more than one author shares responsibility for an SP containing the results of specific research, it is assumed that the coauthors must have worked together for some time during the research process. In a number of cases, co-authorship ensures an effective solution to the problems posed in the SP, reducing the number of misconceptions and errors and, ultimately, ensuring the effectiveness of scientific activity.
For reviews of the factors affecting the behavior, performance and motivation of scientific collaborations, see [1,2] and the monograph [3]. The study of co-authorship networks sheds light on the degree of interpenetration of ideas, knowledge and technologies and characterizes the structure of the scientific community. Different levels of cooperation can be considered: between countries, within a country, within an organization, between scientific areas, within a general subject, within a journal or conference proceedings, etc.
Citation networks and co-authorship networks represent different points of view on a set of SPs. The source of information about SPs is, as a rule, a bibliographic database (DB). In a citation network, the nodes are SPs, and directed links between nodes are established according to a binary citation relation; in a co-authorship network, the nodes are SP authors, and links are established by their co-authorship of one or more SPs. However, collaboration and citation influence each other [4][5][6]. Examples are such issues as the impact of the number of researchers coauthoring a paper on its efficiency score, or whether authors cite their coauthors more often than other cited authors.
The study of collaboration networks is carried out in two main directions. Empirical measurements provide detailed network characteristics both at the network level and at the individual node level. An example of this approach is a series of papers [7][8][9] that present the results of a study of several large databases related to a number of scientific fields. It is shown that the corresponding co-authorship networks have common properties, such as a short node-to-node distance and a large clustering coefficient [10]; the degree distribution follows a power law. Another approach is to build network models and study their properties. Collaboration networks are modeled as undirected graphs, directed graphs, and hypergraphs and various coauthor weighting schemes are applied. The study of the dynamic properties of real networks and network models makes it possible to identify the structural principles that govern the evolution of networks; dynamic properties, in turn, can explain static ones [11,12].
This paper presents a method for constructing a co-authorship network based on bibliographic information extracted from the RePEc database. A preliminary analysis of the initial data is presented. The parameters of the constructed co-authorship network are investigated and compared with the parameters of the "small world" network.
Co-authorship networks reflect professional interaction between scientists. Studying them reveals much about the structure of scientific communities and makes it possible to predict the results of their activities and to build maps of science.

Network construction
Co-authorship networks are most commonly defined as undirected weighted graphs; we use this approach here. Let $P = \{p_1, p_2, \dots, p_m\}$ be the set of papers indexed in the database and $V = \{v_1, v_2, \dots, v_n\}$ the set of authors appearing in $P$. Let $E$ be the set of pairs $e = (v_i, p_j)$ such that $v_i$ is the author or a co-author of $p_j$. The network of publications (network of authorship) $N_{pub} = (V, P, E)$ is a bipartite directed network: its nodes are divided into two disjoint independent sets $V$ and $P$. The adjacency matrix $A = (a_{ij})$ (authorship matrix) of the corresponding graph is an $n \times m$ matrix whose rows correspond to authors and columns to publications:

$$a_{ij} = \begin{cases} 1, & (v_i, p_j) \in E, \\ 0, & \text{otherwise}. \end{cases} \qquad (1)$$

Let $l(p_j)$ be the number of authors of $p_j$:

$$l(p_j) = \sum_{i=1}^{n} a_{ij}. \qquad (2)$$

We assume that each SP analyzed below has at least two authors (co-authors), i.e. $l(p_j) > 1$ for each publication $p_j$. For $v_i, v_j \in V$, we define the binary co-authorship relation $R_{ca}$:

$$v_i \, R_{ca} \, v_j \iff \exists\, p \in P : v_i \text{ and } v_j \text{ are authors of } p. \qquad (3)$$

A co-authorship network is a weighted undirected network $N_{ca} = (V, E_{ca}, w)$, where $V$ is the set of authors, $E_{ca}$ is the set of links between authors related by $R_{ca}$, and $w$ assigns a weight to each link. Several weighting schemes are commonly used to assign weights to co-authorship links. In the simplest case, which we consider in this article, $w(e = (v_i, v_j)) = 1$ regardless of how many papers $v_i$ and $v_j$ have coauthored and how each author contributed to the publication, i.e., the $N_{ca}$ network can be considered unweighted. The graph corresponding to the unweighted co-authorship network can then be represented by the (0, 1) adjacency matrix $U = (u_{ij})$:

$$u_{ij} = \begin{cases} 1, & v_i \, R_{ca} \, v_j \text{ and } i \neq j, \\ 0, & \text{otherwise}. \end{cases} \qquad (4)$$

The matrix $U$ can be obtained from the matrix $A' = A \cdot A^{T}$ by replacing all nonzero elements with 1 and setting the elements of the main diagonal to 0.
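As an illustration of how the unweighted adjacency matrix U is obtained from the authorship matrix A, the following minimal sketch computes the off-diagonal entries of the product of A with its transpose and replaces nonzero values with 1. The function name `coauthorship_matrix` and the toy data are assumptions made for this example; a dense list-of-lists representation is used only for readability and would not be practical for a network of the size considered here.

```python
def coauthorship_matrix(A):
    """Return the (0, 1) matrix U for an n x m authorship matrix A (list of rows)."""
    n = len(A)
    m = len(A[0]) if n else 0
    U = [[0] * n for _ in range(n)]  # main diagonal stays 0
    for i in range(n):
        for k in range(i + 1, n):
            # Entry (i, k) of A * A^T counts the papers shared by authors i and k.
            shared = sum(A[i][j] * A[k][j] for j in range(m))
            if shared > 0:
                U[i][k] = U[k][i] = 1  # undirected link, weight ignored
    return U


# Toy data (hypothetical): 4 authors (rows), 3 papers (columns).
A = [
    [1, 0, 1],  # v1 wrote p1 and p3
    [1, 1, 0],  # v2 wrote p1 and p2
    [0, 1, 0],  # v3 wrote p2
    [0, 0, 1],  # v4 wrote p3
]
U = coauthorship_matrix(A)
for row in U:
    print(row)
```

In this toy case v1 is linked to v2 and v4, and v2 is linked to v3; the number of shared papers does not affect the result.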

Data extraction
Recognizing individual authors from the text of SPs in a database of considerable size is a rather laborious task whose difficulty depends on how well the database is documented. The task includes not only parsing the headers of articles and extracting such components as title, author and affiliation, but also identifying authors. Two authors may have the same name, or the same author may be identified differently in different publications (by full name or by initials). This leads to ambiguity and erroneous conclusions. One way to avoid ambiguity is to exclude individuals with the same last name and first initial, see [13]. For pre-processing of raw data in order to associate author names with unique people, see [6]. In our case, metadata contained in the RePEc database [14] was used to select the authors and publications. Authors create individual 'profiles' using the Author Service provided by RePEc. An author's profile contains the IDs of the publications authored or coauthored by this author. As of the date of data extraction (2020.01.31), the number of registered authors was 70549 and the number of declared publications was 470333. Some publications are declared two or more times; we consider these publications to have more than one author. We select only publications that are declared by several authors and for which the number of declaring authors coincides with the number of authors given in the publication's metadata. As a result, 91113 publications and the 32434 authors who declared them are considered. In total, the selected authors declared 364989 publications (including those with one author). This filtered data is used as the basis for constructing the co-authorship network. The publications cover the period from 1954 to 2019. It should be noted that RePEc was created in 1997.
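The selection rule described above can be sketched as follows. This is only an illustration under assumed data structures: `profiles` (author ID mapped to the publication IDs declared in that author's profile) and `declared_author_count` (publication ID mapped to the number of authors in its metadata) are hypothetical names and do not correspond to any actual RePEc interface.

```python
from collections import defaultdict


def select_coauthored(profiles, declared_author_count):
    """Keep publications declared by several authors whose number matches the metadata.

    profiles: author ID -> set of publication IDs declared in the author's profile.
    declared_author_count: publication ID -> number of authors in the metadata.
    Returns publication ID -> set of declaring authors.
    """
    declared_by = defaultdict(set)
    for author, pubs in profiles.items():
        for pub in pubs:
            declared_by[pub].add(author)
    return {
        pub: authors
        for pub, authors in declared_by.items()
        if len(authors) > 1 and len(authors) == declared_author_count.get(pub)
    }


# Toy data (hypothetical IDs).
profiles = {"a1": {"p1", "p2"}, "a2": {"p1", "p3"}, "a3": {"p3"}}
declared_author_count = {"p1": 2, "p2": 1, "p3": 3}
selected = select_coauthored(profiles, declared_author_count)
print(selected)
```

Here only p1 is kept: p2 is declared by a single author, and p3 is declared by two authors while its metadata lists three.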

Data analysis
Let $P$ be the set of papers and $A$ the set of authors or coauthors of at least one of these papers. Let $P_a$ be the set of publications having $a$ authors, $P^+$ the set of coauthored publications and $q$ the maximum number of authors of a single paper. So

$$P = P_1 \cup P^+; \qquad P^+ = \bigcup_{j=2}^{q} P_j.$$

Table 1 presents some basic coauthorship statistics for the network studied here.
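A minimal sketch of how basic statistics of this kind could be computed is given below. It assumes a hypothetical mapping `paper_authors` from publication ID to the set of its author IDs, and it covers only the averages and the sizes of the sets of papers with a given number of authors; collaboration indices are not included.

```python
from collections import Counter, defaultdict


def coauthorship_stats(paper_authors):
    """paper_authors: publication ID -> set of author IDs."""
    papers_per_author = Counter()
    coauthors = defaultdict(set)
    for authors in paper_authors.values():
        for a in authors:
            papers_per_author[a] += 1
            coauthors[a] |= authors - {a}  # distinct coauthors of a
    n_papers = len(paper_authors)
    n_authors = len(papers_per_author)
    # Number of papers with exactly a authors, for each a that occurs.
    size_counts = Counter(len(authors) for authors in paper_authors.values())
    return {
        "papers_per_author": sum(papers_per_author.values()) / n_authors,
        "authors_per_paper": sum(len(a) for a in paper_authors.values()) / n_papers,
        "coauthors_per_author": sum(len(c) for c in coauthors.values()) / n_authors,
        "papers_by_author_count": dict(size_counts),
        "max_authors": max(size_counts),  # q
    }


# Toy data (hypothetical IDs).
papers = {"p1": {"a1", "a2"}, "p2": {"a1"}, "p3": {"a1", "a2", "a3"}}
stats = coauthorship_stats(papers)
print(stats)
```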

Co-authorship network parameters
We construct the unweighted co-authorship network $N_{ca} = (V_{ca}, E_{ca})$ according to (1)-(4) and analyze its structure. Table 2 gives a summary of the basic results.

Mean distance $L(N_{ca})$ [16]: 6.577475
Local clustering coefficient $C(N_{ca})$ [10]: 0.264428
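The two parameters above, together with the size of the largest component, could be computed along the following lines. This is a sketch under stated assumptions, not the procedure used to produce Table 2: the graph is given as a hypothetical dictionary `adj` mapping each node to the set of its neighbours, the mean distance is taken over all pairs of distinct nodes of a connected component, and nodes with fewer than two neighbours are assumed to contribute 0 to the clustering average (conventions differ on this point).

```python
from collections import deque


def components(adj):
    """Connected components of an undirected graph, largest first."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        seen.add(s)
        comp, queue = [s], deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    comp.append(v)
                    queue.append(v)
        comps.append(comp)
    return sorted(comps, key=len, reverse=True)


def mean_distance(adj, nodes):
    """Mean shortest-path length over pairs of distinct nodes of a connected set."""
    total = pairs = 0
    for s in nodes:
        dist, queue = {s: 0}, deque([s])  # breadth-first search from s
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


def mean_local_clustering(adj, nodes):
    """Average over nodes of the fraction of neighbour pairs that are linked."""
    acc = 0.0
    for u in nodes:
        nb = list(adj[u])
        k = len(nb)
        if k < 2:
            continue  # assumed convention: such nodes contribute 0
        links = sum(1 for i in range(k) for j in range(i + 1, k) if nb[j] in adj[nb[i]])
        acc += 2 * links / (k * (k - 1))
    return acc / len(nodes)


# Toy graph (hypothetical): triangle a-b-c, d attached to c, separate pair e-f.
adj = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c"}, "e": {"f"}, "f": {"e"},
}
giant = components(adj)[0]
share = len(giant) / len(adj)
L = mean_distance(adj, giant)
C = mean_local_clustering(adj, giant)
print(share, L, C)
```

Running one breadth-first search per node is quadratic in the worst case, so for a network with tens of thousands of nodes one might estimate the mean distance from a sample of source nodes instead.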

Summary
The co-authorship network of scientists in economics, $N_{ca}$, has been analyzed using bibliometric information retrieved from the distributed DB RePEc. In the collection of papers under consideration, multi-authored papers are less common than single-authored ones (around 25 %), and the prevailing case is two authors (77 % of papers).
The statistical network analysis is based on a simple network model representing the co-authorship network as an undirected unweighted graph. We find that the giant component of $N_{ca}$ fills a large portion of the network (around 90 % of the total volume). Thus, most authors at the time of the study are indirectly connected via collaboration. The edge density in the giant component is quite low. The second largest component is far smaller than the giant component. The distribution of the number of coauthors of scientists follows a power law; moreover, this distribution has a power-law form similar to that of the distribution of papers.
An important property of many real networks is their 'small-worldness'. We find that the giant component of the network constitutes a 'small world' [17], in which the average distance between nodes varies logarithmically with the number of nodes. Following [10], we define 'small-worldness' on the basis of the trade-off between a mean shortest path length similar to that of a matched random graph and a local clustering coefficient higher than that of a random graph.
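The comparison with a matched random graph can be illustrated with the commonly used approximations for a random graph with n nodes and mean degree k: a mean distance of about ln n / ln k and a clustering coefficient of about k / n. The sketch below relies on these approximations only; the function names and all numerical inputs are illustrative placeholders, not values measured for the network studied here.

```python
import math


def random_graph_baseline(n, k):
    """Approximate mean distance and clustering of a random graph (requires k > 1)."""
    return math.log(n) / math.log(k), k / n


def small_world_ratios(L, C, n, k):
    """Return (L / L_rand, C / C_rand) for a network with mean distance L and clustering C.

    A 'small world' has a first ratio close to 1 and a second ratio much larger than 1.
    """
    L_rand, C_rand = random_graph_baseline(n, k)
    return L / L_rand, C / C_rand


# Illustrative placeholder values only.
n, k = 1000, 10
L_rand, C_rand = random_graph_baseline(n, k)
ratio_L, ratio_C = small_world_ratios(L=3.5, C=0.3, n=n, k=k)
print(L_rand, C_rand, ratio_L, ratio_C)
```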
The relationship between co-authorship and citation needs further study, in particular whether coauthored papers receive more citations than single-author papers. The parameters of the weighted co-authorship network, for example the stability of groups of coauthors, are also of interest to us.