The structure of co-publications multilayer network

Using the headers of scientific papers, we have built multilayer networks of the entities involved in research, namely authors, laboratories, and institutions. We analyzed some properties of such networks built from data extracted from the HAL archives and found that the network at each layer is a small-world network with a power-law degree distribution. To simulate such co-publication networks, we propose a multilayer network generation model based on the formation of cliques at each layer and the affiliation of each new node to the higher layers. Each clique is built from new nodes and from existing nodes selected using preferential attachment. We also show that the degree distribution of the generated layers follows a power law. From simulations of our model, we show that the generated multilayer networks reproduce the studied properties of co-publication networks.

name of the researchers (authors) and their affiliations. The term institution is used here to refer to a university, a research center or a research institution. The networks are interdependent because of the affiliation relationships between the entities of the three networks. Indeed, a researcher is affiliated to at least one laboratory, and a laboratory is affiliated to at least one institution. We therefore say that these networks are hierarchical and are induced from the collaboration between researchers (see Fig. 1). To understand and simulate these systems, we have to master both how the actors interact and the rules of affiliation to organizations.
We generalize this hierarchical co-publication network as a complex system in which an actor is affiliated to an organization that can itself be affiliated to a higher-level organization, and so on. The relationships between entities at the same level are deduced from the interactions of those at the lower level: the interaction between the actors (at the lowest level) is the process that induces the relationships between the organizations at the other levels. That is, two researchers are not connected because they are members of the same laboratory but because they co-published a paper. This is the main difference between the affiliation networks used in this paper and those of Lattanzi et al. [2], who consider that social networks contain two types of entities, actors and societies, linked by the affiliation of the former to the latter.
After some definitions and the state of the art presented in the "Multilayer networks generation models" section, the method applied in this paper was first to use the headers of 70224 scientific papers from the HAL archives to build co-publication networks. We measured various structural properties of these networks, such as degree distributions, average distance and clustering coefficient. The "Co-publications multilayer networks" section is dedicated to this first part of our contribution. From this study, we derived a hierarchical network generation model that can reproduce the measured properties. The model is presented in "The hierarchical network generation model" section, with some mathematical results regarding the number of nodes, the number of edges and the exponent of the power-law degree distribution of the generated networks. Finally, in the "Simulations" section, we present simulation results of the proposed generation model and discuss the comparison between the properties of the simulated networks and those of the real-world networks (built from the HAL dataset). The "Conclusion" section concludes the paper.

Multilayer networks generation models
The last 15 years have seen the birth of a movement in science: the complex networks theory. It involved the interdisciplinary effort of some of our best scientists in the aim of exploiting the current availability of big data in order to extract the ultimate and optimal representation of the underlying complex systems and mechanisms. The main goals were (i) the extraction of unifying principles that could encompass and describe (under some generic and universal rules) the structural accommodation that is being detected ubiquitously, and (ii) the modeling of the resulting emergent dynamics to explain what is actually seen and experienced from the observation of such systems.
Network theory provides various tools for investigating the structural or functional topology of many complex systems found in nature, technology and society. Multilayer graphs have many applications in areas such as biology, transportation and social networks [3][4][5][6][7].

Definitions and notations
A network (or graph) is a pair $G = (X, E)$, where $X$ is a set of items, which we will call nodes, and $E$ is a set of connections between the nodes, called edges. A set of nodes joined by edges is only the simplest type of network; there are many ways in which networks may be more complex than this. For instance, there may be more than one type of node in a network, or more than one type of edge. Nodes or edges may also have a variety of associated properties, numerical or categorical.
Graphs composed of directed edges [8] are called directed graphs or sometimes digraphs. One can also have hyperedges, that is, edges that join more than two nodes together. Graphs containing such edges are called hypergraphs [9]. Graphs may also be naturally partitioned in various ways. An example is bipartite graphs: graphs that contain nodes of two distinct types, with edges running only between unlike types [10,11]. So-called affiliation networks, in which people are joined together by common membership of groups, take this form, the two types of nodes representing the people and the groups [2].
There is no consensual definition of multilayer graphs; several definitional approaches exist in the literature [1,12,13]. In this work, we use the definition of [1]. A multilayer network is a pair $M = (G, C)$, where $G = \{G_\alpha ; \alpha \in \{1, \ldots, M\}\}$ is a family of (directed or undirected, weighted or unweighted) graphs $G_\alpha = (X_\alpha, E_\alpha)$ (called layers of $M$) and $C = \{E_{\alpha\beta} \subseteq X_\alpha \times X_\beta ; \alpha, \beta \in \{1, \ldots, M\}, \alpha \neq \beta\}$ is the set of interconnections between nodes of different layers $G_\alpha$ and $G_\beta$ with $\alpha \neq \beta$. The elements of $C$ are called crossed layers, and the elements of each $E_\alpha$ are called intralayer connections of $M$, in contrast with the elements of each $E_{\alpha\beta}$ ($\alpha \neq \beta$), which are called interlayer connections.
This mathematical model is suited to describe phenomena in social systems, as well as many other complex systems. By using this representation, we simultaneously take into account: the links inside the different groups, the nature of the links and the relationships between elements that (possibly) belong to different layers and the specific nodes belonging to each layer involved.
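The degree of a node $i \in X$ of a multiplex network $M = (G, C)$ [21] is the vector $\mathbf{k}_i = (k_i^{[1]}, \ldots, k_i^{[M]})$, where $k_i^{[\alpha]}$ is the degree of node $i$ in layer $\alpha$. This vector-type node degree is the natural extension of the established definition of the node degree in a monolayer network.
We say that node $i$, with $i = 1, 2, \ldots, N$, is active at layer $\alpha$ if $k_i^{[\alpha]} > 0$. We can then associate to each node $i$ a node-activity vector $\mathbf{b}_i = \{b_i^{[1]}, \ldots, b_i^{[M]}\}$, where $b_i^{[\alpha]} = 1$ if node $i$ is active at layer $\alpha$ and $b_i^{[\alpha]} = 0$ otherwise. We call node-activity $B_i = \sum_{\alpha} b_i^{[\alpha]}$ of node $i$ the number of layers on which node $i$ is active.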
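As an illustration of these definitions, the following minimal sketch computes the degree vector, the node-activity vector and the node-activity of a node. It assumes each layer is stored as a set of undirected edges without self-loops; the function names and the toy data are hypothetical and are not taken from the paper.

```python
from typing import List, Set, Tuple

Edge = Tuple[str, str]


def degree_vector(layers: List[Set[Edge]], node: str) -> List[int]:
    """Degree of `node` in each layer (one entry per layer)."""
    return [sum(1 for u, v in edges if node in (u, v)) for edges in layers]


def activity_vector(layers: List[Set[Edge]], node: str) -> List[int]:
    """1 for each layer where `node` has at least one edge, 0 otherwise."""
    return [1 if k > 0 else 0 for k in degree_vector(layers, node)]


def node_activity(layers: List[Set[Edge]], node: str) -> int:
    """Number of layers on which `node` is active."""
    return sum(activity_vector(layers, node))


# Toy multiplex with three layers (illustrative data only).
layers = [
    {("a", "b"), ("b", "c")},
    {("a", "c")},
    set(),
]
print(degree_vector(layers, "a"), node_activity(layers, "a"))
```

Storing each layer as an edge set keeps the sketch short; an adjacency-list representation would be preferable for large networks.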

Properties of real-world complex networks
It has recently been shown that most real-world complex networks share some essential properties. Three of them have received much attention because of their unexpected behavior in real-world complex networks: the average distance between nodes, the clustering and the degree distribution.
Most real-world complex networks have the small-world property, i.e., a short average distance [22,23]. The small-world concept originated from the famous experiment conducted by Milgram [24]. Another property of many real-world networks is a high average clustering coefficient.
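The two quantities behind the small-world property can be computed directly from an adjacency structure. The sketch below is a plain-Python illustration, assuming an undirected connected graph stored as a dictionary of neighbor sets; nodes with fewer than two neighbors are given a clustering of 0, which is one common convention and not necessarily the one used for the measurements in this paper.

```python
from collections import deque


def average_clustering(adj):
    """Mean over all nodes of the fraction of neighbor pairs that are linked."""
    total = 0.0
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue  # contributes 0 by convention
        links = sum(1 for u in nbrs for w in nbrs if u < w and w in adj[u])
        total += 2 * links / (k * (k - 1))
    return total / len(adj)


def average_distance(adj):
    """Mean shortest-path length over all pairs of distinct reachable nodes."""
    total = pairs = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from `source`
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


# Toy graph: a triangle a-b-c with a pendant node d attached to c.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
print(average_clustering(adj), average_distance(adj))
```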
The degree distribution, which gives for each $k$ the probability $p_k$ that a randomly chosen node has degree $k$, is completely different from what was expected. Indeed, for almost all real-world complex networks, the degree distribution follows a power law: $p_k \sim k^{-\alpha}$. The exponent $\alpha$ of the power law is generally between 2 and 3. Such a distribution means that, although most nodes have a small degree, the number of nodes with degree $k$ decays only polynomially with $k$, and therefore there is a significant number of nodes with high degree. It has been shown in the literature that many co-authorship networks follow a power-law degree distribution [25][26][27].
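To make the definition of $p_k$ concrete, the following sketch computes an empirical degree distribution and a rough estimate of the power-law exponent. The estimator is a common approximate maximum-likelihood formula for discrete data; it is given here as an assumption for illustration and is not claimed to be the fitting procedure used in this paper. The parameter `k_min` (the smallest degree included in the fit) is likewise a hypothetical choice.

```python
import math
from collections import Counter


def degree_distribution(degrees):
    """Map each degree k to the fraction p_k of nodes having that degree."""
    counts = Counter(degrees)
    n = len(degrees)
    return {k: c / n for k, c in sorted(counts.items())}


def power_law_exponent(degrees, k_min=1):
    """Approximate maximum-likelihood exponent for degrees >= k_min.

    Requires at least one degree >= k_min.
    """
    tail = [k for k in degrees if k >= k_min]
    return 1.0 + len(tail) / sum(math.log(k / (k_min - 0.5)) for k in tail)


degrees = [1, 1, 2, 4]  # illustrative data only
print(degree_distribution(degrees))
print(power_law_exponent(degrees))
```

On real data, the choice of `k_min` strongly affects the estimate, so the result should be read as indicative only.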

The state of the art of generation models
There are basically two ways to propose a network generation model:
• The first considers a set of observed properties as essential and then samples objects at random among those that have these properties. Proceeding this way yields a typical object with the properties concerned [28][29][30]. It is then possible to determine whether the retained set of properties is sufficient (do the random objects produced by the model fit the real one well?) and to study the expected behavior of the object of interest. The relevance of the set of properties is generally checked using other known properties or behaviors of the object.
• The second defines a construction process inspired by the way the object is really constructed [2][31][32][33]. This construction process is generally iterated from an initial state and eventually leads to an appropriate object. The analysis then concerns the properties induced by the construction process: do they fit real-world properties?
For more details, the reader can refer to [34] for a broad overview of models of simple networks. As for monolayer networks, most models for the generation of multilayer networks can be divided into two classes:
• Growing multilayer network models, in which the number of nodes grows and there is a generalized preferential attachment rule [35][36][37]. These models explain multilayer network evolution starting from simple and fundamental rules for their dynamics.
• Multilayer network ensembles, which are ensembles of networks with N nodes in each layer satisfying a certain set of structural constraints [38][39][40]. These ensembles are able to generate multilayer networks with a fully controlled set of degree-degree correlations and of overlap.
In [35,36] a growing multiplex model has been proposed, in which the network dynamics is dictated by growth and generalized preferential attachment. Starting at time $t = 0$ from a duplex network with $n_0$ nodes (each with a replica in each of the two layers) connected by $m_0 > m$ links in each layer, the model proceeds as follows:
• Growth: At each time $t \geq 1$, a node with a replica node in each of the two layers is added to the multiplex. Each newly added replica node is connected to the other nodes of the same layer by $m$ links.
• Generalized preferential attachment: Each new link in layer $\alpha = 1, 2$ is attached to node $i$ with probability $\Pi_i^{\alpha}$ proportional to a linear combination of the degree $k_i^{[1]}$ of node $i$ in layer 1 and the degree $k_i^{[2]}$ of node $i$ in layer 2.
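The two rules above can be sketched as follows. This is a simplified illustration under stated assumptions, not the reference implementation of [35,36]: the initial duplex is taken to be a complete graph on `n0` nodes in each layer (so that `m0 > m` requires `n0` to be large enough and `m <= n0`), the coefficients of the linear combination are passed as a hypothetical `coeffs` matrix with non-negative entries, and the `m` targets of a new replica node are drawn without repetition.

```python
import random


def grow_duplex(n0, m, steps, coeffs=((1.0, 0.0), (0.0, 1.0)), seed=0):
    """Grow a two-layer multiplex; returns (edge sets, degree lists) per layer."""
    rng = random.Random(seed)
    # Initial state: complete graph on n0 nodes in both layers (assumption).
    layers = [{(i, j) for i in range(n0) for j in range(i + 1, n0)} for _ in range(2)]
    degrees = [[n0 - 1] * n0 for _ in range(2)]
    for t in range(steps):
        new = n0 + t
        old = list(range(new))
        targets = []
        for a in range(2):
            # Attachment weight: linear combination of the degrees in both layers.
            weights = [coeffs[a][0] * degrees[0][i] + coeffs[a][1] * degrees[1][i]
                       for i in old]
            chosen = set()
            while len(chosen) < m:
                chosen.add(rng.choices(old, weights=weights)[0])
            targets.append(chosen)
        # Add the new replica nodes only after both layers have chosen targets.
        for a in range(2):
            degrees[a].append(0)
            for i in targets[a]:
                layers[a].add((i, new))
                degrees[a][i] += 1
                degrees[a][new] += 1
    return layers, degrees


layers, degrees = grow_duplex(n0=4, m=2, steps=10)
print(len(layers[0]), len(layers[1]))
```

With the default `coeffs`, each layer grows by ordinary preferential attachment on its own degrees; off-diagonal coefficients couple the two layers.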
Growing multiplex network models have also been proposed in [20,42], where the multiplex network grows by the addition of an entire new layer at each step. In [20], two nodes $i$ and $j$ in the new layer are linked with a probability $p_{ij}$ that depends on a quantity called node multiplexity, $Q_{ij}$. In particular, $p_{ij}$ can be either positively or negatively correlated with $Q_{ij}$. In the first case, two nodes that are active at the same time in many layers are more likely to be connected in the new layer; in the second case, such nodes have a small probability of being connected in the new layer. In [42], instead, every node $i$ of the new layer is active with a probability $P_i$ proportional to the activity $B_i$ of the node: $P_i \propto A + B_i$, where $A$ is a parameter of the model. This model enforces a sort of "preferential attachment" of the new layers to nodes of high activity $B_i$, and a power-law distribution $P(B_i)$ of the activities of the nodes.
The simplest way to obtain a static generative model for multiplex networks is to generalize the existing methods for single-layer ones [1]. Fixing the degree sequence in each layer, one can use a configuration model to obtain a particular realization of the given set of connectivities. In [43] the authors chose to add interlayer links arbitrarily. A different approach is to keep using a configuration model, but to specify the edges between the layers by means of a joint-degree distribution [44][45][46].
A similar method is to specify the degree sequences together with a probability matrix whose element (i, j) is the fraction of interlayer links between layers i and j. The actual link placement is still achieved via uniform random choice [47]. A generalization of this approach has been proposed in [48]; the authors impose the degree correlations within and between layers by means of a set of matrices that specify the fraction of edges between nodes of given degrees in given layers.

Co-publications multilayer networks
By browsing the headers of scientific papers, we can represent and analyze three networks, namely the authors' network, the laboratories' network and the institutions' network. By header of a paper, we mean the title of the paper, the names of the authors and their affiliations (Fig. 2 is an example of a paper's header). The networks are defined as follows:
1. The network of authors. A node represents an author and, if an author i co-authored a paper with an author j, the graph contains an undirected edge between i and j. A paper co-authored by k authors generates a completely connected (sub)graph on k nodes.
2. The network of laboratories. A node represents a laboratory in which at least one author published a paper, and an edge links two laboratories if there exists at least one paper co-authored by authors of these two laboratories.
3. The network of institutions. A node represents an institution of authors who have published at least one paper, and an edge links two institutions if there exists at least one paper co-authored by authors from two laboratories, each related to one of these institutions. The term institution is used to refer to a university, a research center or a research institution.
Because collaboration is reciprocal, these networks are undirected. Depending on the goal of the study, they can be weighted by the number of publications between two entities; in this study, they are unweighted. Note that, whatever the type of node considered (author, laboratory or institution), network construction amounts to the successive creation of cliques. The three generated networks are interdependent because of the affiliation relationships between their entities. So, in addition to the above description, we add new edges that represent the affiliation of authors to laboratories and of laboratories to institutions.
We then say that these networks are multilayered (see Fig. 1) and are deduced from the collaboration between authors. Indeed, the actors involved in co-publication are the authors. In these networks, we have two types of relationship: collaboration (within a level) and affiliation (between two levels).
The studied networks can be formally represented by $M = (G, C)$, where $G = \{G_\alpha ; \alpha \in \{1, 2, 3\}\}$ is the set of layers of $M$, with $G_1$ the network of researchers, $G_2$ the network of laboratories and $G_3$ the network of institutions. Each $G_\alpha$, $\alpha \in \{1, 2, 3\}$, is a collaboration network, and $C$ is the set of affiliation edges.
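The construction of the three layers and of the affiliation edges from paper headers can be sketched as follows. This is a minimal illustration, assuming each header has already been parsed into a list of author names and that each author (respectively laboratory) has a single laboratory (respectively institution); the dictionaries `lab_of` and `inst_of` and all names in the toy data are hypothetical.

```python
from itertools import combinations


def build_layers(papers, lab_of, inst_of):
    """Return (author edges, lab edges, institution edges, affiliation edges)."""
    author_edges, lab_edges, inst_edges = set(), set(), set()
    affiliations = set()
    for paper_authors in papers:
        authors = set(paper_authors)
        labs = {lab_of[a] for a in authors}
        insts = {inst_of[lab] for lab in labs}
        # Each paper induces a clique among the entities involved at each level.
        for edges, members in ((author_edges, authors),
                               (lab_edges, labs),
                               (inst_edges, insts)):
            edges.update(combinations(sorted(members), 2))
        # Affiliation edges link consecutive levels.
        affiliations.update((a, lab_of[a]) for a in authors)
        affiliations.update((lab, inst_of[lab]) for lab in labs)
    return author_edges, lab_edges, inst_edges, affiliations


# Illustrative data only.
papers = [["alice", "bob", "carol"], ["carol", "dave"]]
lab_of = {"alice": "L1", "bob": "L1", "carol": "L2", "dave": "L3"}
inst_of = {"L1": "U1", "L2": "U1", "L3": "U2"}
g1, g2, g3, c = build_layers(papers, lab_of, inst_of)
print(len(g1), len(g2), len(g3), len(c))
```

Because edges are stored in sets, repeated collaborations between the same pair do not create duplicate edges, which matches the unweighted setting of this study.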
We have analyzed the average number of entities affiliated to an organization. Precisely, we looked at the average number of researchers affiliated to a laboratory of a given degree and the average number of laboratories affiliated to an institution of a given degree (see Fig. 3). Using a linear regression model, we approximated the relation between the average number of entities affiliated to an organization of a given degree and the degree of the organization. So the number of nodes at layer $\alpha$ affiliated to an organization at layer $\alpha + 1$ with collaboration degree $k$ can be defined as follows:

(1) $N_\alpha(k) = \rho_\alpha k + n_0^\alpha$.
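The coefficients of Eq. (1) can be obtained by ordinary least squares. The sketch below is a plain-Python illustration; the function name `fit_line` and the sample values are hypothetical and do not come from the HAL dataset.

```python
def fit_line(ks, ns):
    """Least-squares fit of ns ~ rho * ks + n0; returns (rho, n0)."""
    n = len(ks)
    mean_k = sum(ks) / n
    mean_n = sum(ns) / n
    sxx = sum((k - mean_k) ** 2 for k in ks)
    sxy = sum((k - mean_k) * (y - mean_n) for k, y in zip(ks, ns))
    rho = sxy / sxx          # slope: members gained per unit of degree
    n0 = mean_n - rho * mean_k  # intercept: members at degree 0
    return rho, n0


# Made-up points lying exactly on N(k) = 2k + 0.5.
ks = [1, 2, 3, 4]
ns = [2.5, 4.5, 6.5, 8.5]
rho, n0 = fit_line(ks, ns)
print(rho, n0)
```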
We compared the degree distributions of the three layers (Fig. 4) and found that all three follow a power law. Looking at the densities $\delta_\alpha$ and the average distances $l_\alpha$, $\alpha \in \{1, 2, 3\}$ (Table 1), we found that, for all the fields of the HAL dataset:

(2) $\delta_3 \geq \delta_2 \geq \delta_1$ and $l_3 \leq l_2 \leq l_1$.

As shown in Table 1, each of the three layers has a high clustering coefficient ($C \approx 0.83$) and a low average distance ($l \approx 7$); they are small-world networks.
From all the observations made by analyzing the hierarchical network of the HAL dataset, we designed a network generation model that reproduces the main properties of this type of network.

Collaboration and affiliation algorithms
Consider that each actor (an element involved in a collaboration) of our model is affiliated to at least one organization. An organization is also affiliated to at least one higher-level organization. For simplicity, we suppose that each actor or organization is affiliated to only one organization, and the mobility of the actors (such as that of authors between laboratories) is not considered. In this context, a node $x_\alpha = (id, aff)$ at layer $\alpha$ is represented by its identifier $id$ and the identifier $aff$ of its affiliation $y_{\alpha+1}$ at layer $\alpha + 1$. The actors belong to layer 0. The affiliations of the actors are in layer 1, the affiliations of the organizations of layer 1 are in layer 2, and so on.
To generate edges between nodes of the same layer, we propose a growth model for the multilayer collaboration network similar to that of Meleu et al. [33] (Algorithm 1). It is an iterative model that simulates, at each step, a collaboration between actors (nodes of layer 0) and creates the induced relationships in the networks. The collaboration at any layer $\alpha \geq 1$ is deduced from the affiliations of the nodes at layer $\alpha - 1$. In Algorithm 1, old nodes are selected by preferential attachment: an old actor $i$ of in-layer degree $k_i$ is selected with probability $P_i = k_i / \sum_j k_j$. To create edges in the other layers, we proceed recursively: from level 0, we select in layer 1 the affiliations of the nodes in the collaboration, then we create a clique with these nodes at level 1. We then create edges at level 2 by selecting, in layer 2, the affiliations of these layer-1 nodes, and so on. This process is given in Algorithm 2.
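The layer-0 part of this process (selection of old actors by preferential attachment, creation of new actors, clique formation) can be sketched as follows. Since Algorithm 1 is not reproduced in the text, this is only an approximation under stated assumptions: the number of actors per collaboration `eta` is fixed, `old_fraction` is a hypothetical parameter giving the share of old actors in each collaboration, and only actors with a positive degree can be selected as old nodes.

```python
import random
from itertools import combinations


def generate_layer0(steps, eta, old_fraction, seed=0):
    """Simulate `steps` collaborations; returns (degree dict, edge set)."""
    rng = random.Random(seed)
    degree = {}
    edges = set()
    for _ in range(steps):
        candidates = [a for a, k in degree.items() if k > 0]
        n_old = min(round(old_fraction * eta), len(candidates))
        members = set()
        # Old actors: chosen with probability k_i / sum_j k_j, without repetition.
        weights = [degree[a] for a in candidates]
        while len(members) < n_old:
            members.add(rng.choices(candidates, weights=weights)[0])
        # New actors complete the collaboration.
        while len(members) < eta:
            new = len(degree)
            degree[new] = 0
            members.add(new)
        # The collaboration forms a clique; existing edges are not duplicated.
        for u, v in combinations(sorted(members), 2):
            if (u, v) not in edges:
                edges.add((u, v))
                degree[u] += 1
                degree[v] += 1
    return degree, edges


degree, edges = generate_layer0(steps=50, eta=3, old_fraction=0.5)
print(len(degree), len(edges))
```

In the simulations described later, the number of actors per collaboration is drawn from a distribution extracted from the data rather than fixed; replacing `eta` by a sampled value at each step would be the natural extension.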
Let us define the affiliation vector as the set $V = \{v_0, v_1, \ldots, v_{M-1}\}$, where $v_\alpha$, $\alpha > 0$, is the probability that a new node at level $\alpha - 1$ is affiliated to an old node at level $\alpha$, and $v_0$ is the proportion of old nodes per collaboration at level 0. We can observe that, if $M = 1$, this model (represented by Algorithm 2) is the same as the model of Meleu et al. [33]. So the network at layer 0 has the properties described in [33]. When we create a node at level 0, we decide, using the affiliation vector, whether to affiliate this node to an old or a new node at level 1. In the case of affiliation to a new node, we create a node at level 1 and then decide (again using the affiliation vector) whether to affiliate this node to an old or a new node at level 2. This is done recursively for the upper layers. We propose in Algorithm 3 a process to create nodes and affiliate them to their organizations. The affiliation of nodes to organizations follows a preferential attachment.
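The recursive creation and affiliation of nodes described above can be sketched as follows. Algorithm 3 is not reproduced in the text, so this is an approximation under explicit assumptions: `affiliation_vector[level]` is the probability of affiliating a new node of the level below to an existing organization, the preferential choice is weighted by the organization's in-layer degree plus one (the `+ 1` is an assumption that lets organizations of degree 0 be selected), and the `degree` dictionaries are meant to be updated by the collaboration process, which is not shown here.

```python
import random


def create_node(level, affiliation_vector, nodes, parent, degree, rng):
    """Create a node at `level`, affiliate it upward, and return its id."""
    node = len(nodes[level])
    nodes[level].append(node)
    degree[level][node] = 0
    upper = level + 1
    if upper < len(nodes):
        olds = nodes[upper]
        if olds and rng.random() < affiliation_vector[upper]:
            # Affiliate to an existing organization, chosen preferentially.
            weights = [degree[upper][o] + 1 for o in olds]
            org = rng.choices(olds, weights=weights)[0]
        else:
            # Create a new organization, itself affiliated recursively.
            org = create_node(upper, affiliation_vector, nodes, parent, degree, rng)
        parent[level][node] = org
    return node


rng = random.Random(1)
n_layers = 3
nodes = [[] for _ in range(n_layers)]
parent = [{} for _ in range(n_layers)]
degree = [{} for _ in range(n_layers)]
affiliation_vector = [0.5, 0.8, 0.9]  # illustrative values only
for _ in range(100):
    create_node(0, affiliation_vector, nodes, parent, degree, rng)
print([len(layer) for layer in nodes])
```

In this usage example all degrees remain 0, so the choice among existing organizations is uniform; coupling the sketch with a collaboration step that updates `degree` would make the selection genuinely preferential.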
A generated multilayer network is a pair $M = (G, C)$, where $G = \{G_\alpha ; \alpha \in \{0, \ldots, M-1\}\}$ is the set of layers, each $G_\alpha$ being either the collaboration network of the actors or the collaboration network at one of the organization levels, and $C$ is the set of affiliations between actors and organizations or between organizations and sub-organizations.

Properties of the generated networks
Let M = (G, C) be a multilayer network generated by our model.

Proposition 1 The number of nodes in
where t is the number of collaborations generated.

Proposition 2 The number of edges |E α | in the network is:
where t is the number of collaborations generated.
Proof 2 While selecting and/or creating $\eta$ actors per collaboration in layer 0, it is possible that all of them are affiliated to different organizations in all the other layers $\alpha \geq 1$. Thus, the maximum number of edges created by a collaboration in each layer $\alpha \geq 1$ is then:
On the other hand, let us consider that, at each step, all the old nodes of layer 0 involved in the clique creation are affiliated to the same organization, i.e., $(1 - v_0)\eta$ actors are affiliated to the same node $x$ at layer 1 (with $v_i$ the components of the affiliation vector $V$). At this level (1), the number of new edges is equal to the number of edges that link the organization $x$ to all the new organizations added by the creation of new nodes at level 0 (i.e., by the affiliation of new actors). This number is:
We have shown that, at layer $\alpha$, $0 \leq \alpha \leq M - 1$, we have $\prod_{i=0}^{\alpha} (1 - v_i)$ new nodes. These new nodes generate $\prod_{i=0}^{\alpha+1} (1 - v_i)$ new nodes at layer $\alpha + 1$, and we have assumed that the old nodes are affiliated to the same node at layer $\alpha + 1$. Thus, the edges added at layer $\alpha + 1$ are the edges of the clique of $1 + \eta \prod_{i=0}^{\alpha} (1 - v_i)$ nodes, whose number is:
The result is deduced by considering t steps.

Theorem 1 If the average degree at layer
, the degree distribution in layer $\alpha$ of a generated multilayer network follows a power law of parameter $\gamma_\alpha$ as:
Proof 4 For $\alpha = 0$, the result is shown in Ref. [33]. For $\alpha > 0$, according to the hypothesis on the average degree, the probability that a new node at layer $\alpha$ connects to a node of layer $\alpha + 1$ is proportional to the degree of the latter node in layer $\alpha + 1$.
Since we assume Eq. 1, the probability that a node $x_{\alpha-1}$ is affiliated to an organization $y_\alpha$ of degree $k_y^\alpha$ is proportional to the degree of this organization, i.e.:
The variation of the number of nodes of degree $k$ in layer $\alpha$ is impacted by:
• the selection of nodes in layer $\alpha - 1$ affiliated to a node of layer $\alpha$ having degree $k$;
• the selection of old nodes of degree $k$ to which new nodes of layer $\alpha - 1$ are affiliated using the affiliation vector.
It follows that the number of nodes of degree $k$ at step $t$ in layer $\alpha$ that gain an edge when the algorithm creates a new collaboration is:
Using Eq. 7 we obtain
If we denote by $p_{k,t}$ the value of $p_k$ when the network has $n_t$ vertices, then the variation in $n_t p_k$ per $\eta \prod_{i=0}^{\alpha} (1 - v_i)$ vertices added (with $v_i \in V$ the components of the affiliation vector) is:
Looking for stationary solutions $p_{k,t+1} = p_{k,t} = p_k$, the variation of the number of nodes of degree $k$ at layer $\alpha$ is then:
with:
By simplifying in Eq. (9) we obtain:
where $B(a, b) = \Gamma(a)\Gamma(b)/\Gamma(a + b)$ is Legendre's beta function, which behaves asymptotically as $a^{-b}$ for large $a$ and fixed $b$, and hence $p_k \sim k^{-\gamma_\alpha}$.
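The asymptotic behavior of the beta function invoked at the end of this proof can be checked numerically. The sketch below relies only on the standard identity $B(a, b) = \Gamma(a)\Gamma(b)/\Gamma(a + b)$ and on the known limit $B(a, b) \approx \Gamma(b)\, a^{-b}$ for large $a$ and fixed $b$; the values of `a` and `b` are arbitrary illustrative choices.

```python
import math


def beta(a, b):
    """Legendre's beta function, computed through log-gamma for stability."""
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))


def beta_asymptotic(a, b):
    """Leading-order approximation Gamma(b) * a**(-b), valid for large a."""
    return math.gamma(b) * a ** (-b)


a, b = 1.0e4, 2.5
ratio = beta(a, b) / beta_asymptotic(a, b)
# Local slope of log B(a, b) versus log a; it should be close to -b.
slope = (math.log(beta(2 * a, b)) - math.log(beta(a, b))) / math.log(2)
print(ratio, slope)
```

The slope close to $-b$ on a log-log scale is what turns a degree distribution expressed through beta functions into a power law for large degrees.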
Theorem 2 If the average degree is $1 + \eta \prod_{i=0}^{\alpha} (1 - v_i)$, where the $v_i$ are the components of the affiliation vector, the degree distribution in layer $\alpha > 0$ of a generated multilayer network follows a power law of parameter $\gamma_\alpha$ as:
Similarly to the previous case, and looking for stationary solutions $p_{k,t+1} = p_{k,t} = p_k$, the variation of the number of nodes of degree $k$ at layer $\alpha$ is:

Simulations
We have implemented our model as a custom Java program. Its outputs were processed by a custom Python program based on the libraries NetworkX, Pymnet and powerlaw [49] to make most of the measurements on the generated networks. To generate the networks, we extracted parameters such as the number of collaborations to generate, the affiliation vector and the distribution of the number of actors per collaboration from the different fields of the HAL dataset.

Fig. 5 Correlation between the degree of an organization and the number of affiliated members on simulated networks

Fig. 6 Comparison of the degree-degree correlation between consecutive layers on real networks (Statistics field of the HAL dataset) and simulations

The first behavior that we wanted to observe on the simulated networks is the correlation between the in-layer degree of an organization and the number of its affiliated members. Using linear regressions, we compared this correlation between the real networks (Fig. 3) and the generated ones (Fig. 5), and the values appear to be close. Figure 6 shows the comparison of the degree-degree correlation between consecutive layers. Layer 0 is the collaboration network of researchers, layer 1 is the collaboration network of laboratories and layer 2 is the collaboration network of institutions. We can see that, on both real networks and simulations, the curves can be correctly approximated using linear regression. The positive slopes of the obtained regression lines mean that researchers with a high number of collaborations are affiliated to organizations with a strong capacity for cooperation. On the simulations, this is easily explained by the fact that, since a node is affiliated to a single organization, each time this node participates in a collaboration, it induces the participation of its organizations in that collaboration.
In Fig. 7, we have depicted the degree distributions of the different layers. We found that the simulations reproduce the power-law distributions observed on the three layers of the HAL dataset. Indeed, the exponents ($\gamma$) of the power-law degree distributions in Tables 1, 2 and 3 are close. From Table 3 we observe that the simulated layers have very high clustering coefficients ($C \approx 0.80$) and high transitivities ($T \approx 0.2$). This behavior contributes to a higher density and a higher average degree in the affiliation layers than in the current layer, so the density increases as the average distance decreases. We can conclude that the simulated layers are small-world networks.

Conclusion
In this article, we have shown that the headers of scientific papers can be used to build co-publication networks that are multilayer networks of the entities involved in research, namely authors, laboratories, and institutions. Indeed, in addition to the collaborative relationships that exist between entities of each type, there are affiliation relationships of authors to laboratories and of laboratories to institutions. We analyzed the properties of such networks built from data extracted from the HAL platform, which is a free archive of scientific publications. Following the observations made on the properties of co-publication multilayer networks, we generalized these networks to a system of actors and organizations such that an actor is affiliated to an organization and each organization is affiliated to a higher-level organization. We then said that the graphs are hierarchical and are deduced from the collaboration between the actors: the actors collaborate, and the relationships in the different layers are deduced from these collaborations and from the affiliation relationships. We proposed an algorithmic model to build graphs presenting such properties. It is an iterative model that builds a collaboration clique and the related affiliations at each step. We showed that the degree distribution in the different layers follows a power law, and the simulations carried out showed that the studied properties of the generated layers are close to those of the real-world network built from the HAL dataset.
In the future, we plan to explore the structure and dynamics of communities in such co-publication networks. Indeed, the high clustering coefficient and high transitivity in these graphs suggest the existence of many communities. Before doing so, we will verify the robustness of our model by using other archives of scientific publications and perform a more accurate evaluation of the gaps between the values of the properties of the generated networks and those of real-world networks.